Studies in Classification, Data Analysis, and Knowledge Organization

Managing Editors: H.-H. Bock, Aachen · W. Gaul, Karlsruhe · M. Vichi, Rome

Editorial Board: Ph. Arabie, Newark · D. Baier, Cottbus · F. Critchley, Milton Keynes · R. Decker, Bielefeld · E. Diday, Paris · M. Greenacre, Barcelona · C. Lauro, Naples · J. Meulman, Leiden · P. Monari, Bologna · S. Nishisato, Toronto · N. Ohsumi, Tokyo · O. Opitz, Augsburg · G. Ritter, Passau · M. Schader, Mannheim · C. Weihs, Dortmund
Titles in the Series

E. Diday, Y. Lechevallier, M. Schader, P. Bertrand, and B. Burtschy (Eds.) New Approaches in Classification and Data Analysis. 1994 (out of print)
W. Gaul and D. Pfeifer (Eds.) From Data to Knowledge. 1995
H.-H. Bock and W. Polasek (Eds.) Data Analysis and Information Systems. 1996
E. Diday, Y. Lechevallier, and O. Opitz (Eds.) Ordinal and Symbolic Data Analysis. 1996
R. Klar and O. Opitz (Eds.) Classification and Knowledge Organization. 1997
C. Hayashi, N. Ohsumi, K. Yajima, Y. Tanaka, H.-H. Bock, and Y. Baba (Eds.) Data Science, Classification, and Related Methods. 1998
I. Balderjahn, R. Mathar, and M. Schader (Eds.) Classification, Data Analysis, and Data Highways. 1998
A. Rizzi, M. Vichi, and H.-H. Bock (Eds.) Advances in Data Science and Classification. 1998
M. Vichi and O. Opitz (Eds.) Classification and Data Analysis. 1999
W. Gaul and H. Locarek-Junge (Eds.) Classification in the Information Age. 1999
H.-H. Bock and E. Diday (Eds.) Analysis of Symbolic Data. 2000
H. A. L. Kiers, J.-P. Rasson, P. J. F. Groenen, and M. Schader (Eds.) Data Analysis, Classification, and Related Methods. 2000
W. Gaul, O. Opitz, and M. Schader (Eds.) Data Analysis. 2000
R. Decker and W. Gaul (Eds.) Classification and Information Processing at the Turn of the Millennium. 2000
S. Borra, R. Rocci, M. Vichi, and M. Schader (Eds.) Advances in Classification and Data Analysis. 2001
W. Gaul and G. Ritter (Eds.) Classification, Automation, and New Media. 2002
K. Jajuga, A. Sokołowski, and H.-H. Bock (Eds.) Classification, Clustering and Data Analysis. 2002
M. Schwaiger and O. Opitz (Eds.) Exploratory Data Analysis in Empirical Research. 2003
M. Schader, W. Gaul, and M. Vichi (Eds.) Between Data Science and Applied Data Analysis. 2003
H.-H. Bock, M. Chiodi, and A. Mineo (Eds.) Advances in Multivariate Data Analysis. 2004
D. Banks, L. House, F. R. McMorris, P. Arabie, and W. Gaul (Eds.) Classification, Clustering, and Data Mining Applications. 2004
D. Baier and K.-D. Wernecke (Eds.) Innovations in Classification, Data Science, and Information Systems. 2005
M. Vichi, P. Monari, S. Mignani and A. Montanari (Eds.) New Developments in Classification and Data Analysis. 2005
D. Baier, R. Decker, and L. Schmidt-Thieme (Eds.) Data Analysis and Decision Support. 2005
C. Weihs and W. Gaul (Eds.) Classification – the Ubiquitous Challenge. 2005
Myra Spiliopoulou · Rudolf Kruse · Christian Borgelt · Andreas Nürnberger · Wolfgang Gaul
Editors
From Data and Information Analysis to Knowledge Engineering
Proceedings of the 29th Annual Conference of the Gesellschaft für Klassifikation e.V., University of Magdeburg, March 9–11, 2005
With 239 Figures and 120 Tables
Professor Dr. Myra Spiliopoulou Otto-von-Guericke-Universität Magdeburg Institut für Technische und Betriebliche Informationssysteme Universitätsplatz 2 39106 Magdeburg Germany
[email protected]
Professor Dr. Wolfgang Gaul Universität Karlsruhe (TH) Institut für Entscheidungstheorie und Unternehmensforschung 76128 Karlsruhe
[email protected]
Professor Dr. Rudolf Kruse Dr. Christian Borgelt Jun.-Professor Dr. Andreas Nürnberger Otto-von-Guericke-Universität Magdeburg Institut für Wissens- und Sprachverarbeitung Universitätsplatz 2 39106 Magdeburg Germany
[email protected] [email protected] [email protected]
ISSN 1431-8814 ISBN-10 3-540-31313-3 Springer-Verlag Berlin Heidelberg New York ISBN-13 978-3-540-31313-7 Springer-Verlag Berlin Heidelberg New York Library of Congress Control Number: 2005938846 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer · Part of Springer Science+Business Media springeronline.com © Springer-Verlag Berlin · Heidelberg 2006 Printed in Germany The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Softcover-Design: Erich Kirchner, Heidelberg SPIN 11584247
43/3153 – 5 4 3 2 1 0 – Printed on acid-free paper
Preface

This volume contains revised versions of selected papers presented during the 29th Annual Conference of the German Classification Society (Gesellschaft für Klassifikation, GfKl'2005). The conference was held at the Otto-von-Guericke-University Magdeburg in March 2005. The theme of the GfKl'2005 was "From Data and Information Analysis to Knowledge Engineering" and encompassed 230 presentations in 74 sessions, including 11 plenary and semi-plenary talks. With 324 attendants from 23 countries, the 29th GfKl conference established a new participation record for the conference series. The conference again provided an attractive interdisciplinary forum for discussions and mutual exchange of knowledge. It was organized in cooperation with the Slovenian Artificial Intelligence Society (SLAIS).

The conference was accompanied by several collocated events. In addition to the Librarians Workshop and the traditional meetings of the working groups, a new important event took place for the first time — the Doctoral Workshop for PhD students. The Data Mining Competition, which started at GfKl'2004, took place for the second time; for the particularly challenging data analysis problem posed this year, 40 solutions were submitted.

The papers in this volume were selected in a second reviewing process after the conference. Each of the 131 submitted long versions of conference contributions was reviewed by two reviewers, and 92 were accepted for this volume. In addition to papers in the fundamental areas Classification, Clustering, and Data Analysis, this volume contains many papers on a wide range of topics with a strong relation to Computer Science. Examples are Text Mining (the largest track of the conference as well as of this post-conference volume), Web Mining, Fuzzy Data Analysis, IT Security, Adaptivity and Personalization, and Visualization. Application-oriented topics were addressed in several conference talks. In this volume, the corresponding papers are grouped into the clusters (1) Economics, Marketing, Banking and Finance, (2) Medicine, Bioinformatics, Biostatistics, and (3) Music Analysis. The last paper in this volume reports on the solutions of the winning data mining contestants.

The editors of these proceedings would like to thank the members of the program committee, all reviewers for their vigorous and timely reviewing, and the authors for their contributions. Special thanks go to the area chairs, who undertook the coordination of the reviewing process for their individual tracks and worked under a rigorous time schedule.

The success of the GfKl'2005 conference is due to the effort and involvement of many people. We would like to thank foremost the local organization team of Silke Reifgerste, Marko Brunzel, Dirk Dreschel, Tanja Falkowski, Folker Folkens, Henner Graubitz, Roland Müller and Rene Schult and their student support team for their hard work in the preparation of this conference and for their support during the event itself. Most cordial thanks go to the organizers of the collocated events: Werner Esswein (TU Dresden) for the organization of the Doctoral Workshop, Hans-J. Hermes (TU Chemnitz) and
Bernd Lorenz (FH München), who organized the Librarians Workshop, Christian Klein (SPSS GmbH Software) and Michael Thess (prudsys AG) for their involvement in the organization of the industrial track, and to Jens Strackeljan (Otto-von-Guericke-University Magdeburg) as well as Roland Jonscher and Sigurd Prieur (Sparkassen Rating und Risikosysteme GmbH, Berlin) for the coordination of the Data Mining Competition. The awards for the competition were sponsored by the Deutscher Sparkassen- und Giroverband.

Institutional support has been of paramount importance for the success of the GfKl'2005. Our first thanks go to the Faculty of Computer Science and the Otto-von-Guericke-Universität Magdeburg for providing rooms, facilities, support and assistance for the organization of this conference. We are particularly indebted to the University Rector Klaus Erich Pollmann for his support and involvement. We gratefully acknowledge the support of the city of Magdeburg in organizing the city reception event. In addition, we would like to thank DaimlerChrysler AG and our sponsors Deutscher Sparkassen- und Giroverband, Heins+Partner GmbH, prudsys AG, Springer Verlag GmbH and SPSS GmbH Software for their support. Finally, we would like to thank Christiane Beisel and Martina Bihn of Springer-Verlag, Heidelberg, for their support and dedication to the production of this volume.

The German Classification Society entrusted us with the organization of the GfKl'2005. We are grateful for this honor and for all institutional and personal support provided to us in all phases of the GfKl'2005, from the first planning phase until the printing of this volume.
Myra Spiliopoulou, Rudolf Kruse, Christian Borgelt, Andreas Nürnberger, Wolfgang Gaul
Magdeburg and Karlsruhe, January 2006
Organization

Chairs
Local Chair: Myra Spiliopoulou (Otto-von-Guericke-University Magdeburg, Germany)
Publication Chair: Rudolf Kruse (Otto-von-Guericke-University Magdeburg, Germany)
Publicity Chair: Andreas Nürnberger (Otto-von-Guericke-University Magdeburg, Germany)
Submission and Book Preparation: Christian Borgelt (Otto-von-Guericke-University Magdeburg, Germany)
Program Chair: Wolfgang Gaul (University of Karlsruhe, Germany)
Program Committee
Hans-Hermann Bock (RWTH Aachen, Germany)
Reinhold Decker (University of Bielefeld, Germany)
Bernard Fichet (University of Aix-Marseille II, France)
Wolfgang Gaul (University of Karlsruhe, Germany)
Rudolf Kruse (Otto-von-Guericke-University Magdeburg, Germany)
Hans-Joachim Lenz (Free University of Berlin, Germany)
Dunja Mladenić (J. Stefan Institute, Slovenia)
Otto Opitz (University of Augsburg, Germany)
Myra Spiliopoulou (Otto-von-Guericke-University Magdeburg, Germany)
Maurizio Vichi (University of Roma — "La Sapienza", Italy)
Claus Weihs (University of Dortmund, Germany)
Klaus-Dieter Wernecke (Charité Berlin, Germany)
Program Sections and Area Chairs
Clustering: Hans-Hermann Bock (RWTH Aachen, Germany)
Discrimination: Gunter Ritter (University Passau, Germany)
Multiway Classification and Data Analysis: Sabine Krolak-Schwerdt (Saarland University, Germany), Henk A.L. Kiers (University of Groningen, Netherlands)
Multimode Clustering and Dimensionality Reduction: Maurizio Vichi (University Roma — "La Sapienza", Italy)
Robust Methods in Multivariate Statistics: Andrea Cerioli (University of Parma, Italy)
Dissimilarities and Clustering Structures: Bernard Fichet (University of Aix-Marseille II, France)
PLS Path Modeling, PLS Regression and Classification: Natale C. Lauro (University "Federico II" of Napoli, Italy), V. Esposito Vinzi (University "Federico II" of Napoli, Italy)
Ranking, Multi-label Classification, Preferences: Johannes Fürnkranz (Technical University Darmstadt, Germany), Eyke Hüllermeier (Philipps-University Marburg, Germany)
Computational Advances in Data Analysis: Hans-Joachim Lenz (Free University Berlin, Germany)
Fuzzy Data Analysis: Rudolf Kruse (Otto-von-Guericke-University Magdeburg, Germany)
Visualization: Patrick J.F. Groenen (Erasmus University Rotterdam, Netherlands)
Classification and Analysis in Data Intensive Scenarios: Gunter Saake (Otto-von-Guericke-University Magdeburg, Germany)
Data Mining and Explorative Multivariate Data Analysis: Luigi D'Ambra (University "Federico II" of Napoli, Italy), Paulo Giudici (University of Pavia, Italy)
Text Mining: Andreas Nürnberger (Otto-von-Guericke-University Magdeburg, Germany), Dunja Mladenič (Jozef Stefan Institute Ljubljana, Slovenia)
Web Mining: Myra Spiliopoulou (Otto-von-Guericke-University Magdeburg, Germany)
Adaptivity and Personalization: Andreas Geyer-Schulz (University Karlsruhe, Germany), Lars Schmidt-Thieme (Albert-Ludwigs-University Freiburg, Germany)
User and Data Authentication in IT Security: Jana Dittmann (Otto-von-Guericke-University Magdeburg, Germany)
Banking and Finance: Hermann Locarek-Junge (Technical University Dresden, Germany)
Marketing: Daniel Baier (Brandenburg University of Technology Cottbus, Germany), Matthias Meyer (Ludwig-Maximilians-University München, Germany)
Economics: Otto Opitz (University Augsburg, Germany)
Mining in Business Processes: Claus Rautenstrauch (Otto-von-Guericke-University Magdeburg, Germany)
Bioinformatics and Biostatistics: Berthold Lausen (Friedrich-Alexander University Erlangen-Nuremberg, Germany)
Classification of High-dimensional Biological and Medical Data: Siegfried Kropf (Otto-von-Guericke-University Magdeburg, Germany), Johannes Bernarding (Otto-von-Guericke-University Magdeburg, Germany)
Classification with Latent Variable Models: Angela Montanari (University Bologna, Italy)
Medical and Health Sciences: Klaus-Dieter Wernecke (Charité Berlin, Germany)
Music Analysis: Claus Weihs (University Dortmund, Germany)
Industrial Applications and Solutions: Myra Spiliopoulou (Otto-von-Guericke-University Magdeburg, Germany)
Additional Reviewers (in alphabetical order)
Mark Ackermans, Sven Apel, Michael Berthold, Eva Ceulemans, Steffen Bickel, Ulf Brefeld, Christian Döring, Daniel Enache, Tanja Falkowski, María Teresa Gallegos, Michael Gertz, Hans Goebl, Gerard Govaert, Peter Grzybek, Larry Hall, Fred A. Hamprecht, Enrico Hauer, Hartmut Hecker, Christian Hennig, Andreas Hilbert, Andreas Hotho, Frank Klawonn, Juergen Kleffe, Meike Klettke, Peter Kuhbier, Andreas Lang, Berthold Lausen, Wolfgang Lehner, Wolfgang May, Iven Van Mechelen, Alexander Mehler, Paola Monari, Fabian Mörchen, Hans-Joachim Mucha, Daniel Müllensiefen, Gerhard Paaß, Marco Riani, Gunter Ritter, Fabrice Rossi, Kai-Uwe Sattler, Eike Schallehn, Ingo Schmitt, Benno Stein, Gerd Stumme, Michiel van Wezel, Adalbert Wilhelm
Contents
Plenaries and Semi-plenaries

Boosting and ℓ1-Penalty Methods for High-dimensional Data with Some Applications in Genomics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
P. Bühlmann
Striving for an Adequate Vocabulary: Next Generation 'Metadata' . . . . 13 D. Fellner and S. Havemann Scalable Swarm Based Fuzzy Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 L.O. Hall and P.M. Kanade SolEuNet: Selected Data Mining Techniques and Applications . . . . . . . . 32 N. Lavrač Inferred Causation Theory: Time for a Paradigm Shift in Marketing Science? . . . . . . . . . . . . . . . . . . . 40 J.A. Mazanec Text Mining in Action . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 D. Mladenič Identification of Real-world Objects in Multiple Databases . . . . . . . . . . . 63 M. Neiling Kernels for Predictive Graph Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 S. Wrobel, T. Gärtner, and T. Horváth
Clustering PRISMA: Improving Risk Estimation with Parallel Logistic Regression Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 B. Arnrich, A. Albert, and J. Walter Latent Class Analysis and Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . 95 J.G. Dias
An Indicator for the Number of Clusters: Using a Linear Map to Simplex Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 103 M. Weber, W. Rungsarityotin, and A. Schliep
Discriminant Analysis On the Use of Some Classification Quality Measure to Construct Mean Value Estimates Under Nonresponse . . . . . . . . . . . . . 111 W. Gamrot A Wrapper Feature Selection Method for Combined Tree-based Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 E. Gatnar Input Variable Selection in Kernel Fisher Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 N. Louw and S.J. Steel The Wavelet Packet Based Cepstral Features for Open Set Speaker Classification in Marathi . . . . . . . . . . . . . . . . . . . . . 134 H.A. Patil, P.K. Dutta, and T.K. Basu A New Effective Algorithm for Stepwise Principle Components Selection in Discriminant Analysis . . . . . . . . . . . . 142 E. Serikova and E. Zhuk A Comparison of Validation Methods for Learning Vector Quantization and for Support Vector Machines on Two Biomedical Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 D. Sommer and M. Golz Discriminant Analysis of Polythetically Described Older Palaeolithic Stone Flakes: Possibilities and Questions . . . . . . . . . . 158 T. Weber
Classification with Latent Variable Models Model-based Density Estimation by Independent Factor Analysis . . . . . 166 D.G. Calò, A. Montanari, and C. Viroli Identifying Multiple Cluster Structures Through Latent Class Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 G. Galimberti and G. Soffritti
Gene Selection in Classification Problems via Projections onto a Latent Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182 M. Pillati and C. Viroli
Multiway Classification and Data Analysis The Recovery Performance of Two–mode Clustering Methods: Monte Carlo Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 S. Krolak-Schwerdt and M. Wiedenbeck On the Comparability of Relialibility Measures: Bifurcation Analysis of Two Measures in the Case of Dichotomous Ratings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198 T. Ostermann and R. Schuster
Ranking, Multi-label Classification, Preferences On Active Learning in Multi-label Classification . . . . . . . . . . . . . . . . . . . . 206 K. Brinker From Ranking to Classification: A Statistical View . . . . . . . . . . . . . . . . . . 214 S. Clémençon, G. Lugosi, and N. Vayatis
PLS Path Modeling, PLS Regression and Classification Assessing Unidimensionality within PLS Path Modeling Framework . . . 222 K. Sahmer, M. Hanafi, and E.M. Qannari The Partial Robust M-approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230 S. Serneels, C. Croux, P. Filzmoser, and P.J. Van Espen Classification in PLS Path Models and Local Model Optimisation . . . . . 238 S. Squillacciotti
Robust Methods in Multivariate Statistics Hierarchical Clustering by Means of Model Grouping . . . . . . . . . . . . . . . . 246 C. Agostinelli and P. Pellizzari Deepest Points and Least Deep Points: Robustness and Outliers with MZE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254 C. Becker and S.P. Scholz
Robust Transformations and Outlier Detection with Autocorrelated Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262 A. Cerioli and M. Riani Robust Multivariate Methods: The Projection Pursuit Approach . . . . . . 270 P. Filzmoser, S. Serneels, C. Croux, and P.J. Van Espen Finding Persisting States for Knowledge Discovery in Time Series . . . . . 278 F. Mörchen and A. Ultsch
Data Mining and Explorative Multivariate Data Analysis Restricted Co-inertia Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286 P. Amenta and E. Ciavolino Hausman Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 294 V. Choulakian, L. D'Ambra, and B. Simonetti Nonlinear Time Series Modelling: Monitoring a Drilling Process . . . . . . . 302 A. Messaoud, C. Weihs, and F. Hering
Text Mining Word Length and Frequency Distributions in Different Text Genres . . . 310 G. Antić, E. Stadlober, P. Grzybek, and E. Kelih Bootstrapping an Unsupervised Morphemic Analysis . . . . . . . . . . . . . . . . 318 C. Benden Automatic Extension of Feature-based Semantic Lexicons via Contextual Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326 C. Biemann and R. Osswald Learning Ontologies to Improve Text Clustering and Classification . . . . 334 S. Bloehdorn, P. Cimiano, and A. Hotho Discovering Communities in Linked Data by Multi-view Clustering . . . . 342 I. Drost, S. Bickel, and T. Scheffer Crosslinguistic Computation and a Rhythm-based Classification of Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350 A. Fenk and G. Fenk-Oczlon Using String Kernels for Classification of Slovenian Web Documents . . . 358 B. Fortuna and D. Mladenič
Semantic Decomposition of Character Encodings for Linguistic Knowledge Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366 D. Gibbon, B. Hughes, and T. Trippel Applying Collaborative Filtering to Real-life Corporate Data . . . . . . . . . 374 M. Grcar, D. Mladenič, and M. Grobelnik Quantitative Text Typology: The Impact of Sentence Length . . . . . . . . . 382 E. Kelih, P. Grzybek, G. Antić, and E. Stadlober A Hybrid Machine Learning Approach for Information Extraction from Free Text . . . . . . . . . . . . . . . . . . . . . . . . 390 G. Neumann Text Classification with Active Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 398 B. Novak, D. Mladenič, and M. Grobelnik Towards Structure-sensitive Hypertext Categorization . . . . . . . . . . . . . . . 406 A. Mehler, R. Gleim, and M. Dehmer Evaluating the Performance of Text Mining Systems on Real-world Press Archives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414 G. Paaß and H. de Vries Part-of-Speech Induction by Singular Value Decomposition and Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422 R. Rapp Near Similarity Search and Plagiarism Analysis . . . . . . . . . . . . . . . . . . . . . 430 B. Stein and S.M. zu Eissen
Fuzzy Data Analysis Objective Function-based Discretization . . . . . . . . . . . . . . . . . . . . . . . . . . . 438 F. Höppner Understanding and Controlling the Membership Degrees in Fuzzy Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446 F. Klawonn Autonomous Sensor-based Landing Systems: Fusion of Vague and Incomplete Information by Application of Fuzzy Clustering Techniques 454 B. Korn Outlier Preserving Clustering for Structured Data Through Kernels . . . 462 M.-J. Lesot
Economics and Mining in Business Processes Classification-relevant Importance Measures for the West German Business Cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470 D. Enache, C. Weihs, and U. Garczarek The Classification of Local and Branch Labour Markets in the Upper Silesia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478 W. Hantke An Overview of Artificial Life Approaches for Clustering . . . . . . . . . . . . . 486 D. Kämpf and A. Ultsch Design Problems of Complex Economic Experiments . . . . . . . . . . . . . . . . 494 J. Kunze Traffic Sensitivity of Long-term Regional Growth Forecasts . . . . . . . . . . . 502 W. Polasek and H. Berrer Spiralling in BTA Deep-hole Drilling: Models of Varying Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510 N. Raabe, O. Webber, W. Theis, and C. Weihs Analysis of the Economic Development of Districts in Poland as a Basis for the Framing of Regional Policies . . . . . . . . . . . . . . . . . . . . . . 518 M. Rozkrut and D. Rozkrut
Banking and Finance The Classification of Candlestick Charts: Laying the Foundation for Further Empirical Research . . . . . . . . . . . . . . . 526 S. Etschberger, H. Fock, C. Klein, and B. Zwergel Modeling and Estimating the Credit Cycle by a Probit-AR(1)-Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534 S. Höse and K. Vogl Comparing and Selecting SVM-Kernels for Credit Scoring . . . . . . . . . . . . 542 R. Stecking and K.B. Schebesch Value at Risk Using the Principal Components Analysis on the Polish Power Exchange . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550 G. Trzpiot and A. Ganczarek
Marketing A Market Basket Analysis Conducted with a Multivariate Logit Model . . . . . . . . . . . . . . . . . . . . . . . . 558 Y. Boztuğ and L. Hildebrandt Solving and Interpreting Binary Classification Problems in Marketing with SVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 566 G. Nalbantov, J.C. Bioch, and P.J.F. Groenen Modeling the Nonlinear Relationship Between Satisfaction and Loyalty with Structural Equation Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574 M. Paulssen and A. Sommerfeld Job Choice Model to Measure Behavior in a Multi-stage Decision Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582 T. Spengler and J. Malmendier Semiparametric Stepwise Regression to Estimate Sales Promotion Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590 W.J. Steiner, C. Belitz, and S. Lang
Adaptivity and Personalization Implications of Probabilistic Data Modeling for Mining Association Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 598 M. Hahsler, K. Hornik, and T. Reutterer Copula Functions in Model Based Clustering . . . . . . . . . . . . . . . . . . . . . . . 606 K. Jajuga and D. Papla Attribute-aware Collaborative Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614 K. Tso and L. Schmidt-Thieme
User and Data Authentication in IT Security Towards a Flexible Framework for Open Source Software for Handwritten Signature Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622 R. Guest, M. Fairhurst, and C. Vielhauer Multimodal Biometric Authentication System Based on Hand Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630 N. Pavešić, T. Savič, and S. Ribarić
Labelling and Authentication for Medical Imaging Through Data Hiding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638 A. De Rosa, R. Caldelli, and A. Piva Hand-geometry Recognition Based on Contour Landmarks . . . . . . . . . . . 646 R. Veldhuis, A. Bazen, W. Booij, and A. Hendrikse A Cross-cultural Evaluation Framework for Behavioral Biometric User Authentication . . . . . . . . . . . . . . . . . . . . . . . 654 F. Wolf, T.K. Basu, P.K. Dutta, C. Vielhauer, A. Oermann, and B. Yegnanarayana
Bioinformatics and Biostatistics On External Indices for Mixtures: Validating Mixtures of Genes . . . . . . . 662 I.G. Costa and A. Schliep Tests for Multiple Change Points in Binary Markov Sequences . . . . . . . . 670 J. Krauth UnitExpressions: A Rational Normalization Scheme for DNA Microarray Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 678 A. Ultsch
Classification of High-dimensional Biological and Medical Data A Ridge Classification Method for High-dimensional Observations . . . . . 684 M. Grüning and S. Kropf Assessing the Trustworthiness of Clustering Solutions Obtained by a Function Optimization Scheme . . . . . . . . . . . . . . . . . . . . . . 692 U. Möller and D. Radke Variable Selection for Discrimination of More Than Two Classes Where Data are Sparse . . . . . . . . . . . . . . . . . . 700 G. Szepannek and C. Weihs
Medical and Health Sciences The Assessment of Second Primary Cancers (SPCs) in a Series of Splenic Marginal Zone Lymphoma (SMZL) Patients . . . . . 708 S. De Cantis and A.M. Taormina
Heart Rate Classification Using Support Vector Machines . . . . . . . . . . . . 716 M. Vogt, U. Moissl, and J. Schaab
Music Analysis Visual Mining in Music Collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724 F. Mörchen, A. Ultsch, M. Nöcker, and C. Stamm Modeling Memory for Melodies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 732 D. Müllensiefen and C. Hennig Parameter Optimization in Automatic Transcription of Music . . . . . . . . 740 C. Weihs and U. Ligges
Data Mining Competition GfKl Data Mining Competition 2005: Predicting Liquidity Crises of Companies . . . . . . . . . . . . . . . . . . . . . . . . . . 748 J. Strackeljan, R. Jonscher, S. Prieur, D. Vogel, T. Deselaers, D. Keysers, A. Mauser, I. Bezrukov, and A. Hegerath Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 759
Boosting and ℓ1-Penalty Methods for High-dimensional Data with Some Applications in Genomics

Peter Bühlmann
Seminar für Statistik, ETH Zürich, CH-8092 Zürich, Switzerland
Abstract. We consider Boosting and ℓ1-penalty (regularization) methods for prediction and model selection (feature selection) and discuss some relations among the approaches. While Boosting has been originally proposed in the machine learning community (Freund and Schapire (1996)), ℓ1-penalization has been developed in numerical analysis and statistics (Tibshirani (1996)). Both of the methods are attractive for very high-dimensional data: they are computationally feasible and statistically consistent (e.g. Bayes risk consistent) even when the number of covariates (predictor variables) p is much larger than sample size n and if the true underlying function (mechanism) is sparse: e.g. we allow for arbitrary polynomial growth p = p_n = O(n^γ) for any γ > 0. We demonstrate high-dimensional classification, regression and graphical modeling and outline examples from genomic applications.
1 Introduction
We consider methods which are computationally feasible and statistically accurate for very high-dimensional data. Examples of such data include gene expression experiments where a single expression profile yields a vector of measurements whose dimension p is in the range between 5'000 and 25'000. On the other hand, the number of experiments n is typically in the dozens. Thus, we will have to deal with the case p ≫ n: the number of variables p is much larger than sample size n. We often refer to this situation as "high-dimensional data".

We consider some unsupervised and supervised problems. In the former, the data are realizations of random variables (usually assumed to be i.i.d. or from a stationary process) X_1, ..., X_n, where X_i ∈ R^p. In the supervised context, we have additional (univariate) response variables Y_i, yielding the data (X_1, Y_1), ..., (X_n, Y_n). In the following, the jth component of x ∈ R^p will be denoted by x^(j). The main goal for supervised settings is function estimation, which includes regression and classification. For example, the target of interest is E[Y|X = x] for regression (with Y ∈ R) or P[Y = y|X = x] for classification (with Y ∈ {0, ..., C − 1}). We will also demonstrate in section 3.3 a new method for graphical modeling in unsupervised
problems: here the goal is to exploit associations among the different (random) variables.

Boosting (Freund and Schapire (1996)) and ℓ1-penalization (Tibshirani (1996)) are very useful techniques for high-dimensional data. From a computational perspective, both have complexity O(p) if p ≫ n, i.e. linear in the dimensionality. Moreover, they have reasonable statistical properties if the true underlying signal or structure is sparse.
2 Boosting
Boosting has been proposed by Freund and Schapire (1996) in the machine learning community for binary classification. Since its inception, it has attracted a lot of attention both in the machine learning and statistics literature. This is in part due to its excellent reputation as a prediction method. The gradient descent view of boosting as articulated in Breiman (1998) and Friedman et al. (2000) provides a basis for the understanding and new variants of boosting. As an implication, boosting is not only a black-box prediction tool but also an estimation method in specified classes of models, allowing for interpretation of specific model terms.

2.1 AdaBoost: An Ensemble Method
AdaBoost (Freund and Schapire (1996)) is an ensemble algorithm for binary classification with Y_i ∈ {0, 1}. It is (still) the most popular boosting algorithm which exhibits an excellent performance in numerous empirical studies. It works by specifying a base classifier ("weak learner") which is repeatedly applied to iteratively re-weighted data, yielding an ensemble of classifiers gˆ[1](·), ..., gˆ[m](·), where each gˆ[k](·) : R^p → {0, 1}. That is:

re-weighted data 1 → base procedure → gˆ[1](·)
re-weighted data 2 → base procedure → gˆ[2](·)
···
re-weighted data m → base procedure → gˆ[m](·)
A key issue of AdaBoost is the way it re-weights the original data; once we have re-weighted data, one simply applies the base procedure to it as if it were the original dataset. Finally, the AdaBoost classifier

$$\hat{C}^{[m]}_{\mathrm{AdaBoost}}(\cdot) = \Big(\mathrm{sign}\Big(\sum_{j=1}^{m} c_j\, \hat{g}^{[j]}(\cdot)\Big) + 1\Big)\Big/2 \qquad (1)$$
is constructed by a weighted majority vote among the ensemble of individual classifiers. A statistically motivated description can be found in Friedman et al. (2000).
Thus, AdaBoost involves three specifications: (1) the base procedure ("weak learner"), (2) the construction of re-weighted data, (3) the size of the ensemble m. Regarding (1), most popular are classification trees; issue (2) is defined by the AdaBoost description (cf. Friedman et al. (2000)); and the value m in (3) is a simple one-dimensional tuning parameter.
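For concreteness, a minimal sketch of how specifications (1)–(3) fit together is given below. It is not taken from this paper: the re-weighting constants follow one standard formulation of discrete AdaBoost (equivalent, after normalization, to the description in Friedman et al. (2000)), decision stumps serve as the base classifier, and numpy/scikit-learn are assumed to be available.

```python
# Hedged sketch of discrete AdaBoost with decision stumps; labels are in {0, 1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, m_stop=100):
    n = len(y)
    y_pm = 2 * y - 1                             # recode {0,1} -> {-1,+1}
    w = np.full(n, 1.0 / n)                      # observation weights
    learners, coefs = [], []
    for _ in range(m_stop):
        g = DecisionTreeClassifier(max_depth=1)  # base procedure ("weak learner")
        g.fit(X, y, sample_weight=w)             # fit on the re-weighted data
        pred = 2 * g.predict(X) - 1
        err = np.sum(w * (pred != y_pm)) / np.sum(w)
        if err <= 0.0 or err >= 0.5:             # perfect or useless learner: stop
            break
        c = 0.5 * np.log((1 - err) / err)        # plays the role of c_j in (1)
        w *= np.exp(-c * y_pm * pred)            # up-weight misclassified points
        w /= w.sum()
        learners.append(g)
        coefs.append(c)
    return learners, coefs

def adaboost_predict(learners, coefs, X):
    # weighted majority vote, cf. (1)
    score = sum(c * (2 * g.predict(X) - 1) for g, c in zip(learners, coefs))
    return ((np.sign(score) + 1) / 2).astype(int)
```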
2.2 Boosting and Functional Gradient Descent
Breiman (1998) showed that the somewhat mysterious AdaBoost algorithm can be represented as a steepest descent algorithm in function space which we call functional gradient descent (FGD). This great result opened the door to use boosting in other settings than classification. In the sequel, boosting and functional gradient descent (FGD) are used as a terminology for the same method or algorithm. The goal is to estimate a function

$$f^*(\cdot) = \operatorname*{argmin}_{f(\cdot)} E[\rho(Y, f(X))] \qquad (2)$$
where ρ(·, ·) is a real-valued loss function which is typically convex with respect to the second argument. The function class which we minimize over is not of interest for the moment and hence notationally omitted. Examples of loss functions and their minimizers are given in the following table; each case corresponds to a different boosting algorithm, as explained in section 2.2; see also Friedman et al. (2000).

range spaces          ρ(y, f)                       f*(x)                        algorithm
y ∈ R, f ∈ R          |y − f|²                      E[Y|X = x]                   L2 Boosting
y ∈ {0, 1}, f ∈ R     log₂(1 + e^{−2(2y−1)f})       (1/2) log(p(x)/(1 − p(x)))   LogitBoost
y ∈ {0, 1}, f ∈ R     exp(−(2y − 1)f)               (1/2) log(p(x)/(1 − p(x)))   AdaBoost
For the two last rows, p(x) = P[Y = 1|X = x]. Boosting pursues some sort of minimization of the empirical risk

$$n^{-1} \sum_{i=1}^{n} \rho(Y_i, f(X_i)) \qquad (3)$$
with respect to f(·). To explain this, we introduce next the notion of a base procedure, often called the "weak learner" in the machine learning community.

The Base Procedure. Based on some (pseudo-) response variables U = U_1, ..., U_n and predictor variables X = X_1, ..., X_n, the base procedure yields a function estimate gˆ(·) = gˆ(U,X)(·) : R^p → R.
Note that we focus here on function estimates with values in R, rather than classifiers with values in {0, 1} as described in section 2.1. Typically, the function estimate gˆ(x) can be thought of as an approximation of E[U|X = x]. Most popular base procedures in machine learning are regression trees (or class-probability estimates from classification trees). Among many other alternative choices, the following base procedure is often quite useful in very high-dimensional situations.

Componentwise Linear Least Squares:

$$\hat g(x) = \hat\gamma_{\hat S}\, x^{(\hat S)}, \qquad \hat\gamma_j = \frac{\sum_{i=1}^{n} U_i X_i^{(j)}}{\sum_{i=1}^{n} (X_i^{(j)})^2} \ \ (j = 1, \dots, p), \qquad \hat S = \operatorname*{argmin}_{1 \le j \le p} \sum_{i=1}^{n} (U_i - \hat\gamma_j X_i^{(j)})^2.$$

This base procedure fits a linear regression with the one predictor variable which reduces the residual sum of squares most.

The Algorithm. The generic FGD or boosting algorithm is as follows.

Generic FGD algorithm

Step 1. Initialize fˆ[0](·) ≡ 0. Set m = 0.

Step 2. Increase m by 1. Compute the negative gradient and evaluate it at f = fˆ[m−1](X_i):

$$U_i = -\frac{\partial}{\partial f}\rho(Y_i, f)\Big|_{f = \hat f^{[m-1]}(X_i)}, \quad i = 1, \dots, n.$$

Step 3. Fit the negative gradient vector U_1, ..., U_n by using the base procedure, yielding the estimated function gˆ[m](·) = gˆ(U,X)(·) : R^p → R. The function estimate gˆ[m](·) may be thought of as an approximation of the negative gradient vector (U_1, ..., U_n).

Step 4. Do a one-dimensional numerical line-search for the best step-size

$$\hat s^{[m]} = \operatorname*{argmin}_{s} \sum_{i=1}^{n} \rho\big(Y_i, \hat f^{[m-1]}(X_i) + s\, \hat g^{[m]}(X_i)\big).$$

Step 5. Up-date fˆ[m](·) = fˆ[m−1](·) + ν · sˆ[m] gˆ[m](·), where 0 < ν ≤ 1 is reducing the step-length for following the approximated negative gradient.

Step 6. Iterate Steps 2–5 until m = mstop is reached for some specified stopping iteration mstop.
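The six steps translate almost literally into code. The following is a minimal sketch under the assumption that the caller supplies the loss ρ, its derivative with respect to f, and a base procedure returning a fitted function; scipy's scalar minimizer stands in for the line search of Step 4, and all names are illustrative, not the authors' implementation.

```python
# Sketch of the generic FGD algorithm (Steps 1-6).
import numpy as np
from scipy.optimize import minimize_scalar

def fgd_boost(X, Y, loss, dloss, base_fit, m_stop=100, nu=0.1):
    """loss(y, f) and dloss(y, f) = d loss / d f; base_fit(U, X) -> callable g."""
    fitted = np.zeros(len(Y))           # Step 1: f^[0] == 0, evaluated at the data
    ensemble = []                       # pairs (step length, base learner)
    for m in range(1, m_stop + 1):
        U = -dloss(Y, fitted)           # Step 2: negative gradient at f^[m-1](X_i)
        g = base_fit(U, X)              # Step 3: fit the negative gradient vector
        gX = g(X)
        s = minimize_scalar(lambda s: np.sum(loss(Y, fitted + s * gX))).x  # Step 4
        fitted += nu * s * gX           # Step 5: up-date with shrinkage factor nu
        ensemble.append((nu * s, g))
    return ensemble                     # Step 6: f^[m_stop](x) = sum_m c_m g^[m](x)

# e.g. squared error loss rho(y, f) = |y - f|^2:
#   loss  = lambda y, f: (y - f) ** 2
#   dloss = lambda y, f: -2.0 * (y - f)
```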
The factor ν in Step 5 should be chosen "small": our proposal for a default value is ν = 0.1. The FGD algorithm does depend on ν, but its choice is not very crucial as long as it is taken to be "small". On the other hand, the stopping iteration mstop is an important tuning parameter of boosting or FGD. Data-driven choices can be done by using cross-validation schemes or internal model selection criteria (Bühlmann (2004)). By definition, the generic FGD algorithm yields a linear combination of base procedure estimates:

$$\hat f^{[m_{stop}]}(\cdot) = \nu \sum_{m=1}^{m_{stop}} \hat g^{[m]}(\cdot)$$

which can be interpreted as an estimate from an ensemble scheme, i.e. the final estimator is an average of individual estimates from the base procedure, similar to the formula for AdaBoost in (1). Thus, the boosting solution implies the following constraint for minimizing the empirical risk in (3): the estimate is a linear combination of fits from the base procedure, which induces some regularization, see also section 2.6.

2.3 Boosting with the Squared Error Loss: L2 Boosting
When using the squared error loss ρ(y, f) = |y − f|², the generic FGD algorithm above takes the simple form of refitting the base procedure to the residuals of the previous iteration, cf. Friedman (2001).

L2 Boosting

Step 1 (initialization and first estimate). Given data {(X_i, Y_i); i = 1, ..., n}, fit the base procedure fˆ[1](·) = ν gˆ(Y,X)(·). Set m = 1.

Step 2. Increase m by 1. Compute the residuals U_i = Y_i − fˆ[m−1](X_i) (i = 1, ..., n) and fit the base procedure to the current residuals. The fit is denoted by gˆ[m](·) = gˆ(U,X)(·). Up-date fˆ[m](·) = fˆ[m−1](·) + ν gˆ[m](·), where 0 < ν ≤ 1 is a pre-specified step-size parameter. (The line-search, i.e. Step 4 in the generic FGD algorithm from section 2.2, is omitted.)

Step 3 (iteration). Repeat Steps 2 and 3 until some stopping value mstop for the number of iterations is reached.
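A compact sketch of L2 Boosting with the componentwise linear least squares base procedure of section 2.2 is given below; it is an illustration only (plain numpy, illustrative names), not the implementation used for the experiments reported later in this paper.

```python
# Sketch of L2 Boosting with componentwise linear least squares (linear model fits).
import numpy as np

def l2boost(X, Y, m_stop=100, nu=0.1):
    n, p = X.shape
    denom = (X ** 2).sum(axis=0)             # sum_i (X_i^(j))^2, j = 1..p
    coef = np.zeros(p)                       # aggregated coefficients of f^[m]
    resid = Y.astype(float)                  # residuals; f^[0] == 0
    for m in range(m_stop):
        gamma = X.T @ resid / denom          # componentwise least squares estimates
        rss = ((resid[:, None] - X * gamma) ** 2).sum(axis=0)
        j = int(np.argmin(rss))              # component reducing residual SS most
        coef[j] += nu * gamma[j]             # up-date f^[m] = f^[m-1] + nu * g^[m]
        resid -= nu * gamma[j] * X[:, j]
    return coef                              # f^[m_stop](x) = x @ coef
```

Because each iteration changes only one coefficient, the returned coefficient vector typically stays sparse for moderate mstop, which is the variable selection effect discussed in section 2.6.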
With m = 2 (one boosting step) and ν = 1, L2 Boosting has already been proposed by Tukey (1977) under the name "twicing". L2 Boosting with ν = 1 and with the componentwise least squares base procedure for a fixed collection of p basis functions (instead of p predictor variables) coincides with the matching pursuit algorithm of Mallat and Zhang (1993), analyzed also in computational mathematics under the name of "weak greedy algorithm". All these methods are known under the keyword "Gauss-Southwell algorithm". Tukey's (1977) twicing seems to be the first proposal to formulate the Gauss-Southwell idea in the context of a nonparametric smoothing estimator, beyond the framework of linear models (dictionaries of basis functions).

Special emphasis is given here to L2 Boosting with the componentwise linear least squares base procedure: it is a method which does variable/feature selection and employs shrinkage of the estimated coefficients towards zero (regularization), see also section 2.6.
2.4 A Selective Review of Theoretical Results for Boosting
Asymptotic consistency results for boosting algorithms with early stopping as described in section 2.2 have been given by Jiang (2004) for AdaBoost, Zhang and Yu (2005) for general loss functions, and Bühlmann (2004) for L2 Boosting; Bühlmann and Yu (2003) have shown minimax optimality of L2 Boosting in the toy problem of one-dimensional curve estimation. There are quite a few other theoretical analyses of boosting-type methods which use an ℓ1-penalty instead of early stopping for regularization. The result in Bühlmann (2004) covers the situation of a very high-dimensional but sparse linear model

$$Y_i = \sum_{j=1}^{p} \beta_j X_i^{(j)} + \varepsilon_i, \quad i = 1, \dots, n, \qquad (4)$$
where ε_1, ..., ε_n are i.i.d. mean-zero variables. High-dimensionality means that the dimension p = p_n is allowed to grow very quickly with sample size n, i.e. p_n = O(exp(C n^{1−ξ})) for some C > 0 and 0 < ξ < 1; regarding sparseness, it is required that sup_n Σ_{j=1}^{p_n} |β_{j,n}| < ∞ (the coefficients are allowed to change with sample size n, i.e. β_j = β_{j,n}).
2.5 Predictive Performance of Boosting
Most of the first results on the predictive performance of boosting are in classification: they demonstrated that boosting trees is very often substantially better than a single classification tree (cf. Freund and Schapire (1996); Breiman (1998)). In Bühlmann and Yu (2003) it has been pointed out and emphasized that in classical situations, where p ≪ n (with p in a reasonable range between 1 and 10), boosting is not better than, and about as good as, more established flexible nonparametric methods. In high-dimensional problems, however, boosting often performs much better than more traditional methods.
Table 1. Cross-validated misclassification rates for lymph node breast cancer data. L2 Boosting (L2 Boost), forward variable selection penalized logistic regression (FPLR), 1-nearest-neighbor rule (1-NN), diagonal linear discriminant analysis (DLDA) and a support vector machine (SVM).

                     L2 Boost   FPLR     1-NN     DLDA     SVM
misclassifications   30.50%     35.25%   43.25%   36.12%   36.88%
Binary Classification of Tumor Types based on Gene Expression Data. There exists by now a vast variety of proposals for classification based on gene expression data. Boosting is one of the few methods which does not require a preliminary dimensionality reduction of the problem (often done in an ad-hoc way, selecting the best genes according to a score from a two-sample test, e.g. the best 200 genes). Therefore, boosting can be used as a method for multivariate gene selection (instead of the commonly used principle to quantify the effect of single genes only, e.g. differential expression). We consider a dataset which monitors p = 7129 gene expressions in 49 breast tumor samples using the Affymetrix technology. For each sample, a binary response variable is available, describing the status of lymph node involvement in breast cancer.¹ We use L2 Boosting despite the binary classification structure; a justification for this is given in Bühlmann (2004). We estimate the classification performance by a cross-validation scheme where we randomly divide the 49 samples into balanced training- and test-data of sizes 2n/3 and n/3, respectively, and we repeat this 50 times. We compare L2 Boosting with the componentwise linear least squares base procedure, step-size ν = 0.1 and some AIC-estimated stopping iteration (see Bühlmann (2004)) with four other classification methods: 1-nearest neighbors, diagonal linear discriminant analysis, a support vector machine with radial basis kernel (from the R-package e1071 and using its default values), and a forward selection penalized logistic regression model (using some reasonable penalty parameter and number of selected genes). For 1-nearest neighbors, diagonal linear discriminant analysis and the support vector machine, we pre-select the 200 genes which have the best Wilcoxon score in a two-sample problem (estimated from the training dataset only), which is recommended to improve the classification performance. Our L2 Boosting and the forward variable selection penalized regression are run without pre-selection of genes. The results are given in Table 1.
¹ The data are available at http://data.cgt.duke.edu/west.php
For this dataset with high misclassification rates (high classification noise), L2 Boosting is very competitive. Moreover, it is an interesting gene selection method: when applied to the whole dataset and using an AIC-estimated stopping iteration (which equals mstop = 108), the method selects 42 out of 7129 genes.
2.6 L2 Boosting and Lasso: Connections and Computational Complexities
In the setting of linear models, Efron et al. (2004) made an intriguing connection between L2 Boosting with componentwise linear least squares and the Lasso (Tibshirani (1996)) defined in formula (5), an ℓ1-penalized least squares method for linear regression. They consider a version of L2 Boosting, called forward stagewise least squares (denoted in the sequel by FSLR), and they show that for the cases where the design matrix satisfies a "positive cone condition", FSLR with infinitesimally small step-sizes produces a set of solutions which coincides with the set of Lasso solutions when varying the regularization parameter. Furthermore, Efron et al. (2004) proposed the least angle regression (LARS) algorithm as a clever computational short-cut for FSLR and Lasso.

The connection between L2 Boosting and Lasso demonstrates an interesting property of boosting. During the iterations of boosting, we get an "interesting" set of solutions {fˆ[m](·); m = 1, 2, ...} and corresponding regression coefficients {βˆ[m] ∈ R^p; m = 1, 2, ...}. Heuristically, due to the results in Efron et al. (2004), it is "similar" to the set of Lasso solutions {βˆ_λ ∈ R^p; λ ∈ R_+} when varying the penalty parameter λ, where

$$\hat\beta_\lambda = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} \Big(Y_i - \sum_{j=1}^{p} \beta_j X_i^{(j)}\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|. \qquad (5)$$
Computing the set of boosting solutions {fˆ[m](·); m = 1, 2, ...} is computationally quite cheap since every boosting step is typically simple: hence, estimating a good stopping iteration mstop via e.g. cross-validation is computationally attractive, and the computational gain can become even more impressive when using an internal model selection criterion such as AIC (Bühlmann (2004)). Of course, for the special case of linear regression, LARS (Efron et al. (2004)) is computationally even more efficient than boosting. The computational complexity of boosting in potentially high-dimensional linear models is O(np mstop), where mstop denotes the number of iterations in boosting. In the very high-dimensional context with p ≫ n, a good value for mstop is of negligible order in comparison to the dimension p. Therefore, for computing a good (or optimal) boosting estimator, and if p ≫ n, the computational complexity is O(p), i.e. linear in the dimensionality p. The LARS algorithm for computing all Lasso solutions in (5) when varying over the penalty parameter λ has computational complexity
O(np min(n, p)); for p ≫ n, this becomes O(p), which is again linear in the dimensionality p. We should point out that LARS is quite a bit faster than L2 Boosting with respect to real CPU times.
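As an illustration of these complexity statements, the snippet below computes the whole Lasso path of (5) with the LARS algorithm on simulated data with p ≫ n. It assumes scikit-learn is available, the variable names are illustrative, and the library's penalty parameterization may differ from (5) by a constant factor.

```python
# Sketch: full Lasso solution path via LARS for a sparse, high-dimensional model.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, p = 50, 1000                                    # p >> n
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]             # sparse true coefficient vector
Y = X @ beta + 0.5 * rng.standard_normal(n)

# alphas: decreasing penalty values; coefs: one coefficient vector per penalty value
alphas, active, coefs = lars_path(X, Y, method="lasso")
print("variables that ever enter the path:", len(active))
print("nonzero coefficients at the smallest penalty:", int(np.sum(coefs[:, -1] != 0)))
```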
3 Lasso and ℓ1-Penalty Methods
We focus here exclusively on linear relationships among (random) variables; this is not restrictive from an L2-point of view when assuming multivariate normality for the data generating distribution.
3.1 The Lasso for Prediction
We have already defined in (5) the Lasso estimator for the coefficients in a linear model as in (4). Consistency of the Lasso for a high-dimensional but sparse model, in a setting similar to the discussion after formula (4), has been given by Greenshtein and Ritov (2004). Together with the computational efficiency of computing all Lasso solutions with the LARS algorithm (see section 2.6), this also identifies the Lasso as a very useful method for high-dimensional linear function estimation and prediction. Some empirical comparisons between the Lasso and L2 Boosting with componentwise linear least squares are presented in Bühlmann (2004).

Binary Classification of Two Tumor Types. For the binary classification problem discussed in section 2.5, the cross-validated misclassification error when using the Lasso for a high-dimensional (p = 7129) linear model is 27.4% (tuning the penalty parameter via an internal cross-validation), which is slightly better than L2 Boosting and all other methods under consideration. The number of selected genes on the whole dataset is 23, i.e. more sparse than L2 Boosting, which selects 42 genes (see also the next section 3.2).
3.2 Convex Relaxation with the Lasso and Variable Selection
The Lasso estimator as defined in (5) can also be used for variable/feature selection in a linear model (4), as indicated for the tumor classification example above. Due to the geometry of the ℓ1-space, with the ℓ1-norm ‖β‖₁ = Σ_j |β_j|, it is well known that the solution of the convex optimization in (5) is sparse: many of the coefficient estimates βˆ_j = 0 if λ is sufficiently large. Thus, variable selection by checking whether βˆ_j is zero or not can be easily done. This selection scheme depends on the penalty parameter λ used in the optimization in (5). A natural idea would be to choose λ such that a cross-validation score is minimized. This is, however, not an entirely satisfactory choice as it will select too many variables/features; other choices of λ are described in Meinshausen and Bühlmann (2004).
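The following sketch illustrates this point, reusing the simulated (X, Y) from the previous snippet: the cross-validated choice of λ tends to keep noticeably more variables than a larger, sparser penalty. Again, scikit-learn is assumed and the names are illustrative.

```python
# Sketch: Lasso variable selection under different choices of the penalty lambda.
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

cv_fit = LassoCV(cv=5).fit(X, Y)                   # lambda minimizing a CV score
print("selected variables at the CV-optimal penalty:",
      int(np.sum(cv_fit.coef_ != 0)))

sparser = Lasso(alpha=5 * cv_fit.alpha_, max_iter=10000).fit(X, Y)
print("selected variables at a 5x larger penalty:",
      int(np.sum(sparser.coef_ != 0)))
```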
We should point out that the computational complexity for variable selection with the Lasso is O(np min(n, p)), while the more traditional way of searching over all subset models with a penalized likelihood score (e.g. BIC) requires (in the worst case) computing 2^p least squares problems. Even when using clever up- and down-dating strategies for optimization of a BIC score, the Lasso computation via the LARS algorithm is much faster, involving convex optimization only.
3.3 Gaussian Graphical Modeling with the Lasso
Graphical modeling has become a very useful tool to analyze and display conditional dependencies, i.e. associations, among random variables. We consider the case where the data are i.i.d. realizations from X = (X^(1), ..., X^(p)) ∼ N(µ, Σ). A Gaussian graphical model can then be defined as follows. The set of nodes consists of the indices {1, ..., p}, corresponding to the components of X. Moreover,

there is an undirected edge between node i and j
⇔ X^(i) is conditionally dependent of X^(j) given all other {X^(k); k ≠ i, j}
⇔ Σ^{-1}_{ij} ≠ 0.  (6)

The latter equivalence holds because of the Gaussian assumption. Furthermore, the elements of the concentration matrix Σ^{-1} can be linked to regression: Σ^{-1}_{ij}/Σ^{-1}_{ii} = −β_{i;j}, where

$$X^{(i)} = \beta_{i;j} X^{(j)} + \sum_{k \ne i,j} \beta_{i;k} X^{(k)} + \varepsilon^{(i)} \quad (i, j = 1, \dots, p;\; i \ne j), \qquad (7)$$

where ε^(i) is a mean-zero error term. Together with (6), we obtain:

there is an undirected edge between node i and j ⇔ β_{i;j} ≠ 0 or β_{j;i} ≠ 0.

Thus, we can infer the graph from variable selection in regression by doing variable selection in each of the p regression problems in (7). When using a traditional technique such as all subset selection with the BIC score, this would amount to solving (in the worst case) p·2^{p−1} least squares problems. Alternatively, we can use the Lasso, which involves convex optimizations only and is orders of magnitude faster than the all subset selection method. In particular, the Lasso method is feasible in very high dimensions with thousands of nodes or variables. For every regression problem as in (7), we compute the estimated coefficients βˆ_{i;j} (which depend on the choice of λ) and
Fig. 1. Estimated graph using the Lasso for the Arabidopsis dataset.
then define a graph estimate as follows:

version 1: there is an undirected edge between node i and j ⇔ βˆ_{i;j} ≠ 0 or βˆ_{j;i} ≠ 0,
version 2: there is an undirected edge between node i and j ⇔ βˆ_{i;j} ≠ 0 and βˆ_{j;i} ≠ 0.
Note the asymmetry in the finite-sample estimates, while for the population parameters it holds that β_{i;j} = 0 ⇔ β_{j;i} = 0.

Graph estimation with the Lasso depends on the choice of the penalty parameter λ for ℓ1-penalized regression. The same difficulty arises as in the regression context: the prediction-optimal penalty yields too large graphs. Meinshausen and Bühlmann (2004) prove a consistency result for high-dimensional Gaussian graphical modeling. Roughly speaking, even if the number of variables (nodes) p = p_n = O(n^γ) for any γ > 0, i.e. an arbitrarily fast polynomial growth of the dimension relative to sample size, but assuming that the true graph is sparse, the Lasso graph estimate equals the true graph with probability tending quickly to 1 as sample size n increases. In Meinshausen and Bühlmann (2004), the Lasso graph estimate has also been compared with forward stepwise selection strategies from the maximum likelihood framework. As a rough summary, the Lasso has better empirical performance (in terms of the ROC curve) if the problem is high-dimensional (relative to sample size n) and the true underlying graph is sparse.
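A minimal sketch of this neighbourhood-selection scheme is given below: one ℓ1-penalized regression (7) per node, followed by the "or" rule (version 1) or the "and" rule (version 2). It assumes scikit-learn; the choice of the penalty λ, which the text above identifies as the delicate point, is left to the caller, and the names are illustrative.

```python
# Sketch: Gaussian graphical model estimation via nodewise Lasso regressions.
import numpy as np
from sklearn.linear_model import Lasso

def lasso_graph(X, lam, rule="or"):
    n, p = X.shape
    B = np.zeros((p, p))                              # B[i, j] ~ beta_hat_{i;j}
    for i in range(p):
        others = [j for j in range(p) if j != i]
        fit = Lasso(alpha=lam, max_iter=10000).fit(X[:, others], X[:, i])
        B[i, others] = fit.coef_
    nonzero = B != 0
    if rule == "or":                                  # version 1
        adj = nonzero | nonzero.T
    else:                                             # version 2 ("and")
        adj = nonzero & nonzero.T
    np.fill_diagonal(adj, False)
    return adj                                        # symmetric adjacency matrix
```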
3.4 Estimating a Genetic Network
We applied the Lasso graph estimation method to n = 118 gene expression measurements for p = 39 genes from two biosynthesis pathways in the model plant Arabidopsis thaliana.² The problem is "fairly high-dimensional" in
² The data are available at http://genomebiology.com/2004/5/11/R92#IDA31O2R
terms of the ratio n/p. A first goal is to detect potential cross-connections from one pathway to the other. As seen from Figure 1, the Lasso graph estimator yields quite many edges, i.e. too many for biological interpretation. However, such an estimate can be a first starting point for a more biologically driven analysis, see Wille et al. (2004).
References

BREIMAN, L. (1998): Arcing classifiers. Ann. Statist., 26, 801–849 (with discussion).
BÜHLMANN, P. (2004): Boosting for high-dimensional linear models. To appear in the Ann. Statist.
BÜHLMANN, P. and YU, B. (2003): Boosting with the L2 loss: regression and classification. J. Amer. Statist. Assoc., 98, 324–339.
EFRON, B., HASTIE, T., JOHNSTONE, I. and TIBSHIRANI, R. (2004): Least angle regression. Ann. Statist., 32, 407–499 (with discussion).
FREUND, Y. and SCHAPIRE, R.E. (1996): Experiments with a new boosting algorithm. In: Machine Learning: Proc. Thirteenth International Conference. Morgan Kauffman, San Francisco, 148–156.
FRIEDMAN, J.H. (2001): Greedy function approximation: a gradient boosting machine. Ann. Statist., 29, 1189–1232.
FRIEDMAN, J.H., HASTIE, T. and TIBSHIRANI, R. (2000): Additive logistic regression: a statistical view of boosting. Ann. Statist., 28, 337–407 (with discussion).
GREENSHTEIN, E. and RITOV, Y. (2004): Persistency in high dimensional linear predictor-selection and the virtue of over-parametrization. Bernoulli, 10, 971–988.
JIANG, W. (2004): Process consistency for AdaBoost. Ann. Statist., 32, 13–29 (disc. pp. 85–134).
MALLAT, S. and ZHANG, Z. (1993): Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Proc., 41, 3397–3415.
MEINSHAUSEN, N. and BÜHLMANN, P. (2004): High-dimensional graphs and variable selection with the Lasso. To appear in the Ann. Statist.
TIBSHIRANI, R. (1996): Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc., Ser. B, 58, 267–288.
TUKEY, J.W. (1977): Exploratory data analysis. Addison-Wesley, Reading, MA.
WILLE, A., ZIMMERMANN, P., VRANOVÁ, E., FÜRHOLZ, A., LAULE, O., BLEULER, S., HENNIG, L., PRELIĆ, A., VON ROHR, P., THIELE, L., ZITZLER, E., GRUISSEM, W. and BÜHLMANN, P. (2004): Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology, 5(11) R92, 1–13.
ZHANG, T. and YU, B. (2005): Boosting with early stopping: convergence and consistency. Ann. Statist., 33, 1538–1579.
Striving for an Adequate Vocabulary: Next Generation 'Metadata'

Dieter Fellner and Sven Havemann
Institut für ComputerGraphik, TU Braunschweig, D-38106 Braunschweig, Germany
d.fellner |
[email protected]
Abstract. Digital Libraries (DLs) in general, and technical or cultural preservation applications in particular, offer a rich set of multimedia objects such as audio, music, images, videos, and 3D models. But instead of handling these objects consistently as regular documents, in the same way we handle text documents, most applications handle them differently. This is because ‘standard’ tasks like content categorization, indexing, content representation, or summarization have not yet been developed to a stage where DL technology could readily apply them to these types of documents. Instead, these tasks have to be done manually, which makes the activity almost prohibitively expensive. Consequently, the most pressing research challenge is the development of an adequate ‘vocabulary’ to characterize the content and structure of non-textual documents as the key to indexing, categorization, dissemination, and access. We argue that textual metadata items are insufficient for describing images, videos, 3D models, or audio adequately. A new type of generalized vocabulary is needed that permits the expression of semantic information, which is a prerequisite for retrieving generalized documents based on their content rather than on static textual annotations. The crucial question is which methods and which types of technology will best support the definition of vocabularies and ontologies for non-textual documents. We present one such method for the domain of 3D models. Our approach differentiates between the structure and the appearance of a 3D model, and we believe that this formalism can be generalized to other types of media.
1 Introduction
Today, a digital library is the obvious approach to setting up a document database for a specific field of research and for knowledge management in academia as well as in commercial companies. Systems such as HyperWave (2005) permit the creation and management of collections of generalized documents: not only text documents, but also multimedia documents such as images, animations, videos, 3D models, and audio files, to name just the most common types, and each of them comes with a variety of different formats. A modern document management system allows one to organize data and documents flexibly, to arrange and re-arrange the digital assets in collections and sub-collections, to create personalized views for each user, etc. In particular, the system permits
Acquisition → Registration → Categorization → Provision → Archival
Fig. 1. The workflow in a classical public or scientific library.
to provide each document in the database with an (extensible) list of metadata, possibly inherited from a template. Formally, these metadata items are a list of textual keywords and the respective values, also in textual form.
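As a small illustration (not taken from any particular DL system), such a key-value metadata record with template inheritance could be modelled as follows; the field names, the template, and the helper function are hypothetical.

TEMPLATE = {            # metadata fields shared by all documents of one type
    "author": "", "title": "", "year": "", "keywords": "",
}

def make_record(**fields):
    """Create a metadata record by extending a copy of the template."""
    record = dict(TEMPLATE)
    record.update(fields)          # extensible: extra keys are simply added
    return record

amphora = make_record(title="Attic amphora, 3D scan",
                      year="2004",
                      keywords="amphora; pottery; cultural heritage",
                      media_type="3D model")   # a field not in the template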
1.1 New Challenges for Public and Scientific Libraries
Looking at textual documents as only one media type among many others, it becomes clear that this type of document is handled quite specially: a full-text search engine permits the retrieval of a specific document based on its content, i.e., one or more words that appear in it. Content-based retrieval of other media types is an active area of research, and only pilot applications exist. The deficits in handling non-textual documents are especially annoying in a situation where the proportion of classical text (books etc.) is decreasing. It becomes ever easier to create a digital image, a video, or a 3D object, but our libraries are not equipped with the right tools to provide all the services for non-standard documents that are available for books or journals. The usual workflow in a library is shown in Fig. 1. The great challenge is to integrate non-standard documents seamlessly with it. Registration means attaching standard metadata (author, title, etc.), which is not much different for texts or generalized documents. Categorization usually involves the assignment of appropriate keywords by a librarian. They are entered into the keyword catalog, which contains the inverted mapping from (usually several) keywords to the publication. The keyword catalog is the basis for the provision step. Its purpose is to make the books actually accessible to the readers, which implies that they can be found in the first place. So the retrieval (and the delivery) is part of the provision step. Finally, the archiving step is also very critical for non-standard documents, because their file formats become obsolete even faster than formats for text-based documents. This problem, however, is not addressed in this paper.
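To make the categorization and provision steps concrete, here is a minimal sketch of a keyword catalog as an inverted index; the document identifiers, keywords, and function names are invented for illustration and do not describe any specific library system.

from collections import defaultdict

catalog = defaultdict(set)          # keyword -> set of document identifiers

def categorize(doc_id, keywords):
    """Categorization step: a librarian assigns keywords to a document."""
    for kw in keywords:
        catalog[kw.lower()].add(doc_id)

def provision(query_keywords):
    """Provision/retrieval step: documents matching all query keywords."""
    sets = [catalog.get(kw.lower(), set()) for kw in query_keywords]
    return set.intersection(*sets) if sets else set()

categorize("doc-001", ["Gothic architecture", "window tracery"])
categorize("doc-002", ["Gothic architecture", "amphora"])
print(provision(["gothic architecture"]))   # {'doc-001', 'doc-002'}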
1.2 Unbalanced Situation: Text and Non-Standard Documents
Note that the library workflow also applies to well-organized document management systems. In either case, the main problem is to devise methods for two steps: (i) the categorization and (ii) the retrieval of generalized documents. In particular, we have identified the following four missing features.
• Vocabulary to describe both the document structure and its content
• Indexing schemes that detect complex semantic entities
• Summarization methods that create a short ‘abstract’ of a document
• Automatic processing, as the rapidly increasing number of documents prohibits manual augmentation
To illustrate how drastic the situation is, consider a comparison between a scanned page of text and a 3D-scanned historic amphora (the arithmetic is spelled out in the sketch below). One A4 page of simple text, scanned at 300 dpi, produces roughly 9 million pixels. But of course, nobody would use the pixels to describe the content and the structure of the document. Instead, the actual text is extracted using OCR (optical character recognition). The scanned pixel image, except for possible illustrations, is usually discarded, since it is just an artifact and contains no information that is useful on its own. What is most interesting is that OCR does not work only by matching individual characters independently. To improve the recognition rate it uses a (language-dependent) dictionary as well as a catalog of common syllables, i.e., semantic information. Unfortunately, there is no such canonical method to process the amphora. Assuming a diameter of 500 and a height of 1000 millimeters, a modern laser scanner needs to measure 1.5 million points on its surface for a millimeter-spaced grid, i.e., to achieve a sampling density of (only) 25.4 dpi. To some researchers, it may be of great interest to faithfully record all traces history has left on the surface, but the most important fact about the object is that it is an amphora. The extraction of such semantic information from the scanned dataset is possible only if (i) the computer has a general description of amphorae, and (ii) it is possible to determine whether a given scanned model conforms to it. In a broader setting, the general problem can be stated as follows:
Metadata Vocabulary Challenge: To develop the proper vocabularies for a new generation of metadata capable of characterizing content and structure of multimedia documents as a key to categorization, indexing, searching, dissemination, and access.
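The figures quoted above can be checked with a few lines of arithmetic; A4 is taken as 210 x 297 mm and the amphora is approximated by a cylinder, which is only a rough bound.

import math

# Scanned A4 page at 300 dpi
dpi = 300
w_px = 210 / 25.4 * dpi      # ~2480 pixels across
h_px = 297 / 25.4 * dpi      # ~3508 pixels down
print(round(w_px * h_px / 1e6, 1), "million pixels")    # ~8.7, i.e. roughly 9 million

# Amphora: 500 mm diameter, 1000 mm height, 1 mm sampling grid
lateral_mm2 = math.pi * 500 * 1000                      # cylinder surface approximation
print(round(lateral_mm2 / 1e6, 2), "million points")    # ~1.57 million
# one point per millimetre corresponds to 25.4 points per inch, i.e. 25.4 dpi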
2 A Versatile Vocabulary for Describing 3D Documents
Every human language has a variety of terms to denote the large number of objects around us — a relation that is far from being unique or bijective. Is it then possible to establish a reasonable correspondence between measurable geometrical properties of a shape and the fuzzy, imprecise, and sometimes contradictory shape classes denoted by words such as ‘car’, ‘house’, or ‘chair’ ? Previous approaches have used different sets of geometric features, for example the mass distribution in a 3D solid, to extract a feature vector from a given 3D object (see Novotni and Klein, 2001, 2004, Keim, 1999, Vrani´c and Saupe, 2002, Hilaga et al., 2001, Osada et al., 2002, Funkhouser et al., 2003, Chen et al., 2003, Tangelder and Veltkamp, 2004, to cite only a few).
When this extraction is applied to a whole database of 3D objects from different shape classes, it becomes possible to examine the statistical correlation between feature vectors and shape classes, to detect clusters of vectors in the high-dimensional feature space. The discriminative power of a particular shape feature for a specific object class is the strength of this correlation. The achievements and the limits of feature-vector approaches have been nicely summarized for the ‘3D Knowledge’ project (US National Science Foundation, 2005). We take a fundamentally different approach. The main idea is that we encode the actual construction of classes of 3D objects. Our method does not use ‘blind’ stochastics, but it requires some understanding of the objects. The shape description is completely explicit, and it is procedural, based on an operator calculus. This means that a shape class is represented through a sequence of (parameterized) shape construction operations. They yield a desired shape instance when provided with the right parameters. One consequence is that our shape representation has to be a full programming language; it is called the Generative Modeling Language (GML, 2005). A concrete example is the generic chair shown in Fig. 2. The only input parameters are a mere five 3D-points. This makes it possible to quickly adapt the chair template to any given (scanned) chair: Although the models do not match in the strict sense (Hausdorff distance), the ‘important’ properties of the target chair can nevertheless be matched — according to the sense of importance that was coded into our template. This is exactly the kind of flexibility that is needed for the extraction of semantic information. With literally the same approach, it is also possible to describe the structure of a 3D object. The images in the bottom row show that a garden chair, a sun bed, and a sofa in fact share the same structure as a chair. The second example is the construction of a typical window from the Gothic period (Fig. 3). Whereas the chair template has demonstrated ‘flat’ pattern matching, the Gothic window illustrates the importance of hierarchical matching. The reason is the recursive structure of the Gothic architecture: The window is contained within a pointed arch. But it also contains two sub-windows that are again pointed arches. It is immediately apparent that similar shape features can appear on different scales, and on different levels of refinement. So, no single-level global feature detection method will ever be able to faithfully detect and recognize the essential style parameters of a sufficiently sophisticated shape. Our approach shares one very desirable property of any procedural method, namely extreme compactness. Since most of the construction can actually be re-used, all windows in Fig. 3 fit into one GML stream of 32 KB of uncompressed ASCII characters. It unfolds in 1 − 2 seconds to a window instance that contains approx. 7 million vertices at the highest level of refinement (Fig. 3, second row). This compactness, of course, only comes at the price of abstraction.
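For contrast with the generative approach described above, the following sketches the kind of feature-vector baseline the text refers to: a D2 shape distribution in the spirit of Osada et al. (2002). It is emphatically not GML and not the authors' method; the sampling and binning choices are ours.

import numpy as np

def d2_shape_distribution(points, n_pairs=100_000, n_bins=64, rng=None):
    """Histogram of distances between random surface point pairs (the D2
    shape distribution), normalised so that differently scaled models of
    the same shape class yield comparable feature vectors."""
    rng = np.random.default_rng(rng)
    i = rng.integers(0, len(points), size=n_pairs)
    j = rng.integers(0, len(points), size=n_pairs)
    d = np.linalg.norm(points[i] - points[j], axis=1)
    d = d / d.mean()                      # crude scale normalisation
    hist, _ = np.histogram(d, bins=n_bins, range=(0, 3), density=True)
    return hist                           # fixed-length feature vector

# Two shapes can then be compared by a distance between their histograms,
# e.g. np.abs(h1 - h2).sum(), and a whole database clustered in that space.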
Fig. 2. A parameterized generic chair model (top) is adapted to given chairs. The free parameters of the model are the five points on the right side of the chair; they are mirrored to the left side. The arrow sliders are manipulators for the five control points; they help to re-parameterize the free parameters, which is an essential property of our model representation (12 KB GML code). – Note that surprisingly different objects share the same generic structure (bottom row).
Fig. 3. Gothic window tracery is an amazingly challenging domain for parametric and procedural design. First and second row: The basic construction of a pointed arch window consists of four parts: the big arch, the circular rosette, the fillets, and the sub-arches. The decoration of each part can be varied independently (1b-e), and the same construction can be applied recursively to the sub-arches (2a-d). Bottom rows: With the appropriate modeling vocabulary, the dimensions of the window can be varied independently from the window style (3a-d). The use of subdivision surfaces permits a high surface quality with relatively few degrees of freedom (4a-b).
2.1 Future Work
It is important to note that the purpose of a semantic shape representation on this level is not (yet) to completely replace a 3D scan, but to complement it. To decipher the abstract structure always means to throw away some of the information. From the structural point of view, this may be just artifact information. For other purposes this, however, may exactly be the valuable information: just think of applications controlling the quality of an object’s surface or testing the difference between supposedly identical objects. Part of our current and future work is therefore to find ways how the generic structure of a shape class and the detailed surface of a particular shape instance can be integrated into a single representation — hopefully in a way that combines the strengths and mutually compensates for the weaknesses of both ways to represent shape. Acknowledgment The support from the German Research Foundation (DFG) under the Strategic Research Initiative Distributed Processing and Delivery of Generalized Digital Documents (V 3D 2) (Fellner, 2000, 2004) is gratefully acknowledged.
References CHEN, D.-Y., TIAN, X.-P., SHEN, Y.-T., and OUHYOUNG, M. (2003): On visual similarity based 3d model retrieval. Computer Graphics Forum, 22(3):223–232. FELLNER, D.W., editor (2000). Verteilte Verarbeitung und Vermittlung Digitaler arkung der Dokumente (V 3D 2) — Ein DFG-Schwerpunktprogramm zur Verst¨ Grundlagenforschung im Bereich digitaler Bibliotheken, volume 42, 6 of it+ti. Oldenbourg. FELLNER, D.W. (2001): Graphics content in digital libraries: Old problems, recent solutions, future demands. Journal of Universal Computer Science, 7(5):400– 409. FELLNER, D.W. (2004): Strategic initiative V 3D 2 — Distributed Processing and Delivery of Digital Documents (DFG Schwerpunktprogramm 1041 — Verteilte Vermittlung und Verarbeitung Digitaler Dokumente). German Research Foundation (DFG), 1998-2004. http://graphics.tu-bs.de/V3D2. FUNKHOUSER, T., MIN, P., KAZHDAN, M., CHEN, J., HALDERMAN, A., DOBKIN, D., and JACOBS, D. (2003): A search engine for 3d models. ACM Transactions on Graphics, 22(1):83–105. GML (2005). GML scripting language website. http://www.generative-modeling.org. HAVEMANN, S., and FELLNER, D.W. (2002): A versatile 3D model representation for cultural reconstruction. Proc. VAST 2001 Intl. Symp., pages 213–221. ACM Siggraph. HILAGA, M., SHINAGAWA, Y., KOHMURA, T., and KUNII, T.L. (2001): Topology matching for fully automatic similarity estimation of 3d shapes. Proc. of ACM SIGGRAPH 2001, Computer Graphics Proceedings, Annual Conference Series, pages 203–212.
HyperWave (2005): Document management system. http://www.hyperwave.com. KEIM, D. (1999): Efficient geometry-based similarity search of 3D spatial databases. Proc. ACM International Conference on Management of Data (SIGMOD’99), pages 419–430. ACM Press. NOVOTNI, M., and KLEIN, R. (2001). A geometric approach to 3d object comparison. Proc. International Conference on Shape Modeling and Applications, pages 167–175. IEEE CS Press. NOVOTNI, M., and KLEIN, R. (2004): Shape retrieval using 3d zernike descriptors. Computer Aided Design, 36(11):1047–1062. OSADA, R., FUNKHOUSER, T., CHAZELLE, B., and DOBKIN, D. (2002): Shape distributions. ACM Transactions on Graphics, 21(4):807–832. TANGELDER, J.W.H., and VELTKAMP, R.C. (2004): A survey of content based 3D shape retrieval methods. Proc. Shape Modeling International. US National Science Foundation (2005): 3D knowledge project. http://3dk.asu.edu/. ´ D., and SAUPE, D. (2002): Description of 3D-shape using a complex VRANIC, function on the sphere. Proc. IEEE International Conference on Multimedia and Expo (ICME’02), pages 177–180.
Scalable Swarm Based Fuzzy Clustering
Lawrence O. Hall and Parag M. Kanade
Computer Science & Engineering Dept, University of South Florida, Tampa FL 33620
{pkanade,hall}@csee.usf.edu
Abstract. Iterative fuzzy clustering algorithms are sensitive to initialization. Swarm based clustering algorithms are able to do a broader search for the best extrema. A swarm inspired clustering approach which searches in fuzzy cluster centroids space is discussed. An evaluation function based on fuzzy cluster validity was used. A swarm based clustering algorithm can be computationally intensive and a data distributed approach to clustering is shown to be effective. It is shown that the swarm based clustering results in excellent data partitions. Further, it is shown that the use of a cluster validity metric as the evaluation function enables the discovery of the number of clusters in the data in an automated way.
1 Introduction
Unsupervised clustering is an important data mining tool. It allows one to group unlabeled data objects into clusters of like objects. Fuzzy clustering has been shown to provide good partitions or clusters of data. The most venerable fuzzy clustering approach, fuzzy c-means (FCM) (Bezdek et al. 1999), is an iterative approach which is quite sensitive to initialization. That is, the quality of the resultant clusters and overall partition of the data depends on the initialization that has been chosen. There has been work on choosing initializations that are good (Kim et al. 2004). In this paper, we investigate a swarm intelligence inspired clustering approach which, by virtue of its ability to search in a global way, holds the promise of skipping local extrema that the iterative optimization approach may become trapped in. Swarm based approaches have been used to produce partitions of clusters (Ouadfel and Batouche, 2002, Labroche et al., 2002, Monmarch´e et al., 1999, Kanade and Hall, 2004, Handl et al., 2003a, 2003b, Ultsch, 2004). Our ant inspired approach to partitioning the data differs from others because it focusses on positioning cluster centroids in feature space. Different potential partitions must be evaluated through some evaluation function. In this paper, we investigate a cluster validity function called Xie-Beni (Xie and Beni, 1991, Pal and Bezdek, 1995). As the use of ants is computationally intensive when compared with iterative optimization of the FCM functional, we investigate a data distributed approach to making the approach tractable in time (Hore and Hall, 2004).
Experimental results show that using the Xie-Beni partition validity metric allows for the discovery of the number of clusters in the data (Hall and Kanade, 2005). It also leads to a good partition of the data. We also show a distributed data clustering approach that will allow for the speed up of the clustering. It results in good partitions of the Iris data. Section 2 discusses the swarm/ant based clustering approach. Section 3 discusses merging cluster centroids from partitions produced in a distributed fashion. Section 4 is experimental results and Section 5 is a discussion and conclusions.
2 Fuzzy Ants Clustering Algorithm
The ants co-ordinate to move cluster centers in feature space to search for optimal cluster centers. Initially the feature values are normalized between 0 and 1. Each ant is assigned to a particular feature of a cluster in a partition. The ants never change the feature, cluster or partition assigned to them as in Kanade and Hall (2004). After randomly moving the cluster centers for a fixed number of iterations, called an epoch, the quality of the partition is evaluated by using the Xie-Beni criterion (4). If the current partition is better than any of the previous partitions in the ant’s memory, the ant remembers its location for this partition. Otherwise the ant, with a given probability goes back to a better partition or continues from the current partition. This ensures that the ants do not remember a bad partition and erase a previously known good partition. Even if the ants change good cluster centers to unreasonable cluster centers, the ants can go back to the good cluster centers as the ants have a finite memory in which they keep the currently best known cluster centers. There are two directions for the random movement of the ant. The positive direction is when the ant is moving in the feature space from 0 to 1, and the negative direction is when the ant is moving in the feature space from 1 to 0. If during the random movement the ant reaches the end of the feature space the ant reverses its direction. After a fixed number of epochs the ants stop. Each ant has a memory of the mem (5 here) best locations for the feature of a particular cluster of a particular partition that it is moving. An ant has a chance to move I times before an evaluation is made (an epoch). It can move a random distance between Dmin and Dmax . It has a probability of resting Prest (not moving for an epoch) and a probability of continuing in the same direction as it was moving at the start of the epoch Pcontinue . At the end of an epoch in which it did not find a position better than any in memory it continues with PContinueCurrent . Otherwise there are a fixed set of probabilities for which of the best locations in memory search should be resumed from for the next epoch (Kanade and Hall, 2004). The probabilities are 0.6 that the ant chooses to go back to the best known partition, 0.2 that the ant goes back to the second best known partition, 0.1 that the ant goes to the third best known partition, 0.075 that the ant goes to the fourth best
known partition and 0.025 that the ant goes to the worst or fifth of the known partitions. A simplified sketch of this per-ant bookkeeping is given below. Since objects’ memberships in clusters are not explicitly evaluated at each step, there can be cluster centroids that are placed in feature space such that no object is closer to them than to other centroids. These are empty clusters and indicate that there are fewer true clusters than estimated, as will be shown in the following. There may also exist clusters with one, two, or very few examples assigned to them, which are likely spurious if we expect approximately equal-sized clusters with cluster sizes larger than some threshold, say thirty.
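The following is a simplified sketch of the per-ant bookkeeping described above, with one ant owning one feature of one cluster centroid. The memory-resume probabilities are the ones quoted in the text; the resting and direction-continuation probabilities are omitted, and the class layout and method names are ours, not the authors' implementation.

import random

RESUME_PROBS = [0.6, 0.2, 0.1, 0.075, 0.025]   # best ... fifth-best position in memory

class FeatureAnt:
    """One ant owns one feature of one cluster centroid of one partition."""

    def __init__(self, value, d_min=0.001, d_max=0.01, mem=5):
        self.value = value          # current position, normalized to [0, 1]
        self.direction = 1          # +1: moving from 0 towards 1, -1: the reverse
        self.d_min, self.d_max = d_min, d_max
        self.mem = mem
        self.memory = []            # up to `mem` (quality, position) pairs, best first

    def move(self):
        """One random step; reverse direction at the ends of the feature space."""
        self.value += random.uniform(self.d_min, self.d_max) * self.direction
        if not 0.0 <= self.value <= 1.0:
            self.value = min(max(self.value, 0.0), 1.0)
            self.direction *= -1

    def end_of_epoch(self, quality, p_continue_current=0.20):
        """quality = Xie-Beni value of the ant's partition (smaller is better)."""
        if not self.memory or quality < self.memory[0][0]:
            # better than anything remembered: store the current position
            self.memory = sorted(self.memory + [(quality, self.value)])[: self.mem]
        elif random.random() > p_continue_current:
            # otherwise jump back to one of the remembered positions
            weights = RESUME_PROBS[: len(self.memory)]
            k = random.choices(range(len(self.memory)), weights=weights)[0]
            self.value = self.memory[k][1]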
2.1 Fuzzy Clustering and Partition Validity Evaluation Functions
In Hathaway and Bezdek (1995) the authors proposed a reformulation of the optimization criteria used in a couple of common clustering objective functions. The original clustering function minimizes the objective function (1) used in fuzzy c-means clustering to find good clusters for a data partition.

J_m(U, \beta, X) = \sum_{i=1}^{c} \sum_{k=1}^{n} U_{ik}^{m} \, D_{ik}(x_k, \beta_i)    (1)

where U_{ik} is the membership of the k-th object in the i-th cluster; \beta_i is the i-th cluster prototype; m ≥ 1 is the degree of fuzzification; c ≥ 2 is the number of clusters; n is the number of data points; and D_{ik}(x_k, \beta_i) is the distance of x_k from the i-th cluster center \beta_i. The reformulation replaces the membership matrix U with the necessary conditions which are satisfied by U. In this work, the ants will move only cluster centers and hence we do not want the U matrix in the equation. The reformulated version of J_m is denoted as R_m. The reformulation of the fuzzy optimization function is given in (2). The function R_m depends only on the cluster prototypes and not on the U matrix, whereas J_m depends on both the cluster prototypes and the U matrix. The U matrix for the reformulated criterion can be easily computed using (3).

R_m(\beta, X) = \sum_{k=1}^{n} \left( \sum_{i=1}^{c} D_{ik}(x_k, \beta_i)^{\frac{1}{1-m}} \right)^{1-m}    (2)

U_{ik} = \frac{D_{ik}(x_k, \beta_i)^{\frac{1}{1-m}}}{\sum_{j=1}^{c} D_{jk}(x_k, \beta_j)^{\frac{1}{1-m}}}    (3)

The Xie-Beni partition validity metric can be described as (Xie and Beni, 1991):

XB(\beta, X) = \frac{R_m(\beta, X)}{n \left( \min_{i \neq j} \{ \| \beta_i - \beta_j \|^2 \} \right)}    (4)
Fig. 1. a) Correspondence matrices between individual clustered subsets of data with arrows linking the pairs b) a global correspondence matrix for cluster centers.
It is clearly tied to the FCM functional with a strong preference for keeping the smallest distance between any two cluster centroids as large as possible. The smallest XB(β, X) is considered to be the best.
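Equations (2)-(4) translate directly into code once D_ik is fixed; the sketch below uses squared Euclidean distances (the usual FCM choice) and m = 2 by default. It is a straightforward numpy transcription for illustration, not the authors' code.

import numpy as np

def distances_sq(X, B):
    """Squared Euclidean distances D_ik between points x_k and centroids beta_i."""
    return ((B[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # shape (c, n)

def R_m(X, B, m=2.0, eps=1e-12):
    """Reformulated FCM criterion, equation (2)."""
    D = distances_sq(X, B) + eps
    return ((D ** (1.0 / (1.0 - m))).sum(axis=0) ** (1.0 - m)).sum()

def memberships(X, B, m=2.0, eps=1e-12):
    """Membership matrix U recovered from the centroids, equation (3)."""
    D = distances_sq(X, B) + eps
    W = D ** (1.0 / (1.0 - m))
    return W / W.sum(axis=0, keepdims=True)

def xie_beni(X, B, m=2.0):
    """Xie-Beni partition validity index, equation (4); smaller is better."""
    sep = min(((B[i] - B[j]) ** 2).sum()
              for i in range(len(B)) for j in range(len(B)) if i != j)
    return R_m(X, B, m) / (len(X) * sep)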
3 Merging Partitions
In order to speed up the swarm based clustering process, clustering can be applied to subsets of data. The subsets can be disjoint. If the data is broken into m subsets, there will be m partitions of data. Since we are going to just work with cluster centroids, we need a set of global centroids which describe all of the subsets. One approach is to create a global set of cluster centers by averaging the m corresponding centroids from the partitions. Correspondence can be determined by beginning with a single partition and matching the nearest neighbor centroids in the second partition with those in the first. However, because subsets may not be stratified (after all we don’t know what the classes are), there may be some clusters that do not exist in some subsets. Also, an individual naturally occurring cluster may be split into two clusters in some subsets. So, the case in which the closest centroids are put into a chain may require that a special case be utilized for clusters which have already been assigned but are still closest to an unassigned cluster. In this case, we simply go to the next closest unassigned cluster and link the two clusters. Consider the case that there are 3 subsets of data and each subset is grouped into 3 clusters. Let S1, S2, and S3 be the subsets. Fig. 1(a) shows the two ”local” correspondence matrices between individual partitions and Fig. 1(b) shows the global centroid correspondence matrix. Using a nearest neighbor approach we can form a chain of clusters across m subsets. Because all subsets may not have representatives (or many representatives) of all classes in the data, we need some method of filtering out cluster centers that have been inappropriately assigned. That is, if we have
Fig. 2. Gauss-1 Dataset (Normalized)
m cluster centers from the partitions that were created on subsets of data, some k < m may not be representative and should be filtered out before a global centroid is created by averaging the feature values for each cluster in a chain. Typically, one would expect k to be zero or near zero. For the experiments reported here, we do not need to use a filtering algorithm. However, there are many possibilities, including information-theoretic approaches (Hore, 2004), using regression to search for outliers, etc. A sketch of the chaining and averaging step is given below.
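A sketch of the nearest-neighbour chaining and averaging just described follows. It covers only the simple case in which every subset yields the same number of centroids and no filtering is needed; the function name and structure are ours, not the authors' implementation.

import numpy as np

def chain_and_average(partitions):
    """partitions: list of (c, d) centroid arrays, one per data subset.
    Returns a (c, d) array of global centroids obtained by nearest-neighbour
    chaining starting from the first partition, then averaging each chain."""
    chains = [[centroid] for centroid in partitions[0]]
    for part in partitions[1:]:
        unassigned = list(range(len(part)))
        for chain in chains:
            dists = [np.linalg.norm(chain[-1] - part[i]) for i in unassigned]
            pick = unassigned.pop(int(np.argmin(dists)))
            chain.append(part[pick])     # closest still-unassigned centroid
    return np.array([np.mean(chain, axis=0) for chain in chains])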
4 Experiments
Two data sets were utilized to experimentally evaluate the ant-based clustering algorithm proposed here. The first was the well-known Iris data set. It consists of four continuous-valued features, 150 examples, and three classes (Blake and Merz, 1998). Each class consists of 50 examples. However, one of the classes is clearly linearly separable from the other two, and many partition validity metrics will prefer a partition with two classes (as the other two overlap). For this data set, a reasonable argument may be made for two or three clusters. The artificial dataset had 2 attributes, 5 classes, and 1000 examples. It was generated using a Gaussian distribution and is shown in Figure 2. The classes are slightly unequally sized (Kanade, 2004) (248, 132, 217, 192 and 211 examples, respectively).
4.1 Experimental Parameters
The parameters used in the experiments are shown in Table 1. Essentially, 30 different partitions were utilized in each epoch. As there is significant randomness in the process, each experiment was run 30 times. Each experiment was done with the known number of clusters or more. For the Iris data set, we also tried two classes, because in feature space an argument can be made for this number of classes.
Parameter               Value
Number of ants          30 partitions
Memory per ant          5
Iterations per epoch    50
Epochs                  1000
Prest                   0.01
Pcontinue               0.75
PContinueCurrent        0.20
Dmin                    0.001
Dmax                    0.01
m                       2

Table 1. Parameter Values
4.2 Results
We report the results from the Iris data set first. When we tried to cluster into three classes, a partition with 50 examples from class 1 and 100 examples from class 2/class 3 was found 10 of 30 times. In the remaining runs, a cluster with one example was found four times, and in the other experiments the cluster with class 1 had a few examples from another class. So, the results seem to clearly indicate that there are two classes. However, we wanted a repeatable method that could objectively determine how many classes existed. We used a threshold on the number of examples in a cluster. The FCM functional has a bias towards producing approximately equal-sized clusters. It is not the right functional to use for widely different sized clusters. Hence, we used a threshold which was a percentage of the number of examples each cluster would contain if all clusters were the same size. If a cluster had fewer examples than the threshold, it indicated that there was no cluster and the cluster should be merged with another. We did not, in these experiments, try to merge the clusters. The equation is

T = \frac{n}{c} \cdot P,    (5)

where n is the number of examples, c is the number of clusters searched for, and P is the percentage. Any percentage 2 or greater will lead to the conclusion that there are only 2 clusters in the Iris data when we search for 3. Results are summarized for different c in Table 2. Next, we searched for four clusters in the Iris data. A partition with 50 examples from class 1 and the other two classes perfectly mixed occurred three times. There was always one empty cluster, and the largest cluster size was 9 in the case where three clusters were found. So, any threshold above 30 percent will lead to the conclusion that there are only two clusters.
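In code, the thresholding rule (5) amounts to counting the clusters whose size (number of examples closest to the centroid) reaches T; the helper below and its example values are illustrative.

def effective_clusters(cluster_sizes, n, c, P):
    """Count clusters whose size reaches the threshold T = (n / c) * P;
    smaller clusters are treated as empty or spurious."""
    T = n / c * P
    return sum(size >= T for size in cluster_sizes)

# Iris, searching for 3 clusters with P = 0.2 (cf. Table 2):
# one cluster ends up empty, so only two clusters survive the threshold.
print(effective_clusters([50, 100, 0], n=150, c=3, P=0.2))   # -> 2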
Clusters searched    Ave. clusters found    P
3                    2                      0.2
4                    2                      0.3
5                    2                      0.9
6                    2.5                    0.9

Table 2. Number of clusters searched for and average number found for the Iris data with the minimum P over 30 trials.
With five clusters there were typically two or three empty clusters and the “perfect” partition into two clusters occurs twice. If a percentage of 90 or above is used the conclusion will be two clusters exist. This search space is significantly larger and no more epochs were utilized, so we feel the result is a strong one. We also tried six clusters where there were typically two or three empty clusters. In this case, with a percentage of 90 or above the average number of classes was 2.5. There were a number of cases in which the linearly separable class would get discovered as one cluster and the other two classes would be split into two (67/33 or 74/26 for example). Again, in this large search space this seems to be a very reasonable result. One would probably not guess double the number of actual classes. In order to evaluate whether a more complete search might result in the discovery of 2 clusters more often when we initially searched for 6, we changed the number of epochs to 4000 and the number of iterations per epoch to 25. This causes the ant to move less during epochs and have more opportunities (epochs) to find good partitions. With these parameters and a percentage of 90, just 2 clusters were found for all thirty trials. The examples in the linearly separable class were assigned, by themselves, to one cluster nine times. Finally, we report the results when searching for only 2 clusters. In this case, there were always two clusters found (for P < 0.65). In 14/30 trials a partition with the linearly separable class and the other two classes mixed was found. In the other experiments a few examples were assigned with the linearly separable class making its size between 51 and 54 resulting in reasonable partitions.
Clusters searched    Ave. clusters found    P
6                    5                      0.3
7                    5.033                  0.3
8                    5                      0.75
9                    5                      0.8

Table 3. Number of clusters searched for and average number found for the Artificial data with the minimum P over 30 trials.
For the artificial data we did experiments with 5, 6, 7, 8 and 9 clusters. Results are summarized for different c in Table 3. The ant based clustering algorithm always found five clusters when it was given five to search for. In fact, it found the exact original partition 15 times. When it was incorrect, it had some small confusion between class two and class five. A typical partition that did not match the original was: (248, 133,217, 192, 210) in which one example had switched between class 2 and class 5. This seems to be a pretty reasonable clustering result given the larger search space of the ants. When it searched for six classes, it always found five for a percentage 30 or greater. The sixth cluster typically had between 0 and two examples assigned to it. When searching for seven classes, it found five classes for a percentage of 30 or greater 29 times. One time it found six classes. In that case there was an empty cluster and then class 4 was split into two clusters. For eight classes, exactly five were found for a percentage of 0.75. Making it larger would occasionally cause 4 to be found when Cluster 5 was split exactly into 2 chunks. For nine classes, five classes were always found for a percentage of 80 up to about 90. There might be two or three empty clusters. The other non-clusters were very lightly populated with less than 15 examples closest to their centroid in the usual case. As the percentage got too high it would cause a class split into two, to occasionally be missed resulting in four clusters. For example, with P = 1, T = 111.11, class 4 is split into two clusters with 107 and 86 examples in each, respectively.
4.3 Iris with two Subsets
The Iris data was randomly broken into two stratified subsets. Ant based clustering was separately applied to each of the subsets. Thirty experiments were conducted. We chose to use four clusters and the result was, in every case, two viable clusters (with P = 0.11). Using the nearest neighbor approach the cluster centers of each of the 30 pairs were combined. The paired clusters were then averaged to provide cluster centroids for a final partition. The final cluster centroids were used to assign the Iris data to the two clusters. Using the cluster labels as a guide, we found that the average error was 3.23 examples with a standard deviation of 0.773854. This means that one cluster was, on average, 53 examples where 50 of the examples were from the Setosa class. The other cluster was the remaining 97 examples from the other two classes. Using the cluster centers obtained from distributed ant based clustering as initial cluster centers for FCM we obtained three errors for each of the 30 experiments with an average of 6.5 iterations before convergence. This indicates that the partition, without any extra optimization, produced by the ants was quite good. We also tried 5000 random initializations of FCM with two classes and always got the same pair of cluster centers resulting in three errors.
5 Summary and Discussion
A swarm based approach to clustering was used to optimize a fuzzy partition validity metric. A group of ants was assigned as a team to produce a partition of the data by positioning cluster centroids. Each ant was assigned to a particular feature of a particular cluster in a particular partition. The assignment was fixed. The ants utilized memory to keep track of the best locations they had visited. Thirty partitions were simultaneously explored. An overestimate of the number of clusters that exist in the data resulted in a best partition with “the optimal” number of clusters. The overestimate allowed the ant based algorithm the freedom to make groups of two or more clusters have approximately the same centroid, thereby reducing the total number of clusters in a partition. The ability to choose a smaller set of clusters than initially hypothesized allows for a better optimized value for the partition validity function. After minimal post-processing to remove spurious clusters the “natural” substructure of the data, in terms of clusters, was discovered. The Xie-Beni fuzzy clustering validity metric (based on the fuzzy c-means algorithm) was used to evaluate the goodness of each partition. A minor modification was made to it so that a membership matrix did not need to be computed. A threshold was applied to cluster size to eliminate very small clusters which would not be discovered utilizing the FCM functional which has a strong bias towards approximately equal size clusters. By small clusters we mean clusters of from 1 to 20 elements or less than 40% of the expected size of a class (given that we knew the approximate class size). Two data sets, the Iris data and a five cluster artificial data set, were used to evaluate the approach. For both data sets, the number of clusters in the feature space describing the data set were discovered even when guessing there were more than twice as many clusters as in the original data set. There is an open question on how to set the threshold which would indicate that a cluster is spurious (too small to be real). There is the question of what to do with spurious clusters. They could certainly be merged into the closest non-spurious cluster. Alternatively, if the threshold is too high a cluster that is split into two or more chunks could be left undiscovered as all sub-clusters could be deemed spurious. The search can be parallelized to make it significantly faster. For example, each ant can certainly move independently or clustering can be applied to subsets of data. An experiment with two subsets of the Iris data showed that, in about half the time, a partition could be created by applying ant based clustering to each subset and merging the final cluster centers. The final partitions produced by the swarm based clustering algorithm typically matched or were quite close to what would be obtained from FCM with the same number of cluster centers and matched the actual data quite well. Hence, this approach holds the promise of discovering the number of clusters
in the data as well as producing a partition of the data when a heuristic overestimate of the number of clusters can be made.
Acknowledgements
This research was partially supported by the National Institutes of Health via a bioengineering research partnership under grant number 1 R01 EB00822-01.
References BEZDEK, J.C., KELLER, J., KRISHNAPURAM, R., and PAL, N. (1999): Fuzzy Models and Algorithms for Pattern Recognition and Image Processing. Kluwer, Boston, MA. BLAKE, C.L., and MERZ, C.J. (1998): UCI repository of machine learning databases. http://www.ics.uci.edu/∼mlearn/MLRepository.html. HALL, L.O., and KANADE, P.M. (2005): Swarm based fuzzy clustering with partition validity. Proc. 14th IEEE Int. Conf. on Fuzzy Systems (FUZZ-IEEE’05). IEEE Press, Piscataway, NJ. To Appear. HANDL, J., KNOWLES, J., and DORIGO, M. (2003a): On the performance of antbased clustering. design and application of hybrid intelligent systems. Frontiers in Artificial intelligence and Applications 104, 204–213. HANDL, J., KNOWLES, J., and DORIGO, M. (2003b): Strategies for the increased robustness of ant-based clustering. Self-Organising Applications: Issues, challenges and trends, LNCS 2977, 90–104. Springer-Verlag, Berlin. HATHAWAY, R.J., and BEZDEK, J.C. (1995): Optimization of clustering criteria by reformulation. IEEE Trans. on Fuzzy Systems, 3(2):241–245. IEEE Press, Piscataway, NJ. HORE, P. (2004): Distributed clustering for scaling classic algorithms. Master’s thesis, University of South Florida, Tampa, FL. HORE, P., and HALL, L.O. (2004): Distributed clustering for scaling classic algorithms. Proc. 13th IEEE Int. Conf. on Fuzzy Systems (FUZZ-IEEE’04). IEEE Press, Piscataway, NJ. KANADE, P. (2004): Fuzzy ants as a clustering concept. Master’s thesis, University of South Florida, Tampa, FL. KANADE, P.M., and HALL, L.O. (2004): Fuzzy ants clustering with centroids. Proc. 13th IEEE Int. Conf. on Fuzzy Systems (FUZZ-IEEE’04). IEEE Press, Piscataway, NJ. KIM, D.W., LEE, K.H., and LEE, D. (2004): A novel initialization scheme for the fuzzy c-means algorithm for color clustering. Pattern Recognition Letters, 25(2):227–237. LABROCHE, N., MONMARCHE, N., and VENTURINI, G. (2002): A new clustering algorithm based on the chemical recognition system of ants. Proc. European Conf. on Artificial Intelligence, 345–349. ´ N., SLIMANE, M., and VENTURINI, G. (1999): On improving MONMARCHE, clustering in numerical databases with artificial ants. Proc. 5th European Conf. on Artificial Life (ECAL’99), LNAI 1674, 626–635. Springer-Verlag, Berlin.
OUADFEL, S., and BATOUCHE, M. (2002): Unsupervised image segmentation using a colony of cooperating ants. Biologically Motivated Computer Vision, 2nd Int. Workshop, BMCV 2002, LNCS 2525, 109–116. Springer-Verlag, Berlin. PAL, N.R., and BEZDEK, J.C. (1995): On cluster validity for the fuzzy c-means model. IEEE Trans. on Fuzzy Systems, 3(3):370–379. ULTSCH, A. (2004): Strategies for an artificial life system to cluster high dimensional data. Abstracting and Synthesizing the Principles of Living Systems, GWAL-6, 128–137. XIE, X.L., and BENI, G.A. (1991): Validity measure for fuzzy clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, 3(8):841–846.
SolEuNet: Selected Data Mining Techniques and Applications
Nada Lavrač 1,2
1 Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
2 Nova Gorica Polytechnic, Vipavska 13, 5000 Nova Gorica, Slovenia
Abstract. Data mining is concerned with the discovery of interesting patterns and models in data. In practice, data mining has become an established technology with applications in a wide range of areas, from marketing, health care, finance, and environmental planning to e-commerce and e-science. This paper presents selected data mining techniques and applications developed in the course of the SolEuNet 5FP IST project Data Mining and Decision Support for Business Competitiveness: A European Virtual Enterprise (2000–2003).
1 Introduction
This paper reports on the experience gained from a variety of applications of data mining, drawn from both successes and from failures, from the engineering of representations for practical problems, and from expert evaluations of solutions developed in the European 5FP IST project Data Mining and Decision Support: A European Virtual Enterprise (SolEuNet) (Mladeni´c et al. (2003), Mladeni´c and Lavraˇc (2003)). The aim of the project was to develop a framework, methods, and tools for the integration of data mining and decision support, as well as their application to business problems in a collaborative framework of 12 European project partners. Data mining and decision support are, each on their own, well-developed research areas, but until the start of SolEuNet there has been no systematic attempt to integrate them. The main project innovations resulted in bridging the gap between these two technologies, enabling the fusion of knowledge from experts (provided by decision support) and knowledge extracted from data (provided by data mining), and consequently enabling successful solutions of new types of problems. The objective of this paper is to give an outline of SolEuNet techniques and applications and present some lessons learned from collaborative research and development projects performed in the scope of the project. One particular data mining technique—subgroup discovery—is described in more detail, together with the results of a medical application solved by the developed subgroup discovery technique. This paper is organized as follows. Section 2 outlines selected SolEuNet techniques and applications, some of which were developed in a collaborative
setting of remote data mining teams. Section 3 presents the developed subgroup discovery methodology and the results achieved in coronary heart disease risk group detection. The paper concludes by outlining some lessons learned.
2 Selected SolEuNet Results
Selected research results, developed by different partners of the SolEuNet project and described in detail in the book by Mladenić et al. (2003), are outlined below.
• Advances in data mining technology, including the LispMiner data mining tool, the Sumatra TT transformation tool for data preprocessing, data and model visualization tools, subgroup discovery and visualization methods, ROC analysis for evaluation and visualization, and methods for combining data mining solutions.
• Advances in data mining and decision support integration technology, including their integration with information systems based on OLAP technology, the methodology for collaborative problem solving and for data mining and decision support integration, data and model description standards (PMML extensions), and unified descriptors of solved client problems.
• Other research results presented at international conferences and workshops, including the workshops organized by project partners: the ECML/PKDD Workshop on Integration Aspects of Data Mining, Decision Support and Meta-Learning (IDDM-2001, Freiburg, and IDDM-2002, Helsinki) and the ICML Workshop on Data Mining Lessons Learned (DMLL-2002, Sydney).
• The most important project results, published in the edited book Data Mining and Decision Support: Integration and Collaboration (Kluwer, 2003), which contains 22 chapters describing the main scientific results and prototype applications developed in SolEuNet.
Project partners have developed numerous prototype problem solutions, described in more detail by Mladenić and Lavrač (2003). Below is a non-exhaustive list of prototype problems solved.
• Analysis of media research data for a marketing research company.
• Brand name recognition for a direct marketing campaign.
• Customer quality evaluation and stock market prediction for a large financial house.
• Predicting the use of resources in a Czech health farm.
• Analysis of data of 20 years of UK road traffic accidents.
• Automatic ontology construction from education materials on the Web for a large publishing house.
• Analysis of Web page access to improve site usability for a statistics institute.
• Analysis of IT projects funded by the European Commission.
• Selection of ski resorts for clients of a tourist agency.
• Loan allocation for the renovation of denationalized objects for a housing fund.
• Bank selection for implementing the National Housing Schema for a housing fund.
• Assessment of diabetic foot risk.
• Selection of research projects for a municipality research fund.
• Evaluation of IT services for a government agency.
• Analysis of international building construction projects.
3 Selected Subgroup Mining Technique Applied to Coronary Heart Disease Risk Group Detection
Rule learning is an important data mining technique, used in classification rule induction, mining of association rules, subgroup discovery, and other approaches to predictive and descriptive induction. This section discusses actionable knowledge generation by means of subgroup discovery. The term actionability is described in Silberschatz and Tuzhilin (1995) as follows: “a pattern is interesting to the user if the user can do something with it to his or her advantage.” As such, actionability is a subjective measure of interestingness. In an ideal case, the induced knowledge should enable the decision maker to perform an action to his or her advantage, for instance, by appropriately selecting individuals for population screening concerning high risk for coronary heart disease (CHD). Consider one rule from this application:

CHD ← female & body mass index > 25 kg/m2 & age > 63 years

This rule is actionable, as the general practitioner can select from among his patients the overweight ones older than 63 years. This section provides arguments in favor of actionable knowledge generation through recently developed subgroup discovery approaches, where a subgroup discovery task is informally defined as follows (Wrobel 1997, Gamberger and Lavrač 2002): Given a population of individuals and a specific property of individuals that we are interested in, find population subgroups that are statistically ‘most interesting’, e.g., are as large as possible and have the most unusual distributional characteristics with respect to the property of interest. The subgroup discovery task is restricted to learning from class-labeled data, thus targeting the process of subgroup discovery to uncovering properties of a selected target population of individuals with the given property of interest. The proposed subgroup discovery methodology was applied to the problem of detecting and describing Coronary Heart Disease (CHD) patient risk groups (Gamberger and Lavrač 2002) from data collected in general
patient screening procedures that include anamnestic information gathering and physical examination, laboratory tests, and ECG tests. Expert-guided subgroup discovery was aimed at easier detection of important risk factors and risk groups in the population, which should help general practitioners to recognize and/or detect CHD even before the first symptoms actually occur. Early detection of atherosclerotic coronary heart disease (CHD) is an important and difficult medical problem. CHD risk factors include atherosclerotic attributes, living habits, hemostatic factors, blood pressure, and metabolic factors. Their screening is performed in general practice by data collection in three different stages.
A Collecting anamnestic information and physical examination results, including risk factors like age, positive family history, weight, height, cigarette smoking, alcohol consumption, blood pressure, and previous heart and vascular diseases.
B Collecting results of laboratory tests, including information about risk factors like lipid profile, glucose tolerance, and thrombogenic factors.
C Collecting ECG at rest test results, including measurements of heart rate, left ventricular hypertrophy, ST segment depression, cardiac arrhythmias and conduction disturbances.
In this application, the goal was to construct at least one relevant and interesting CHD risk group for each of the stages A, B, and C. Subgroup discovery was performed by SD, an iterative beam search rule learning algorithm (Gamberger and Lavrač 2002). The input to SD consists of a set of examples E and a set of features F constructed for the given example set. The output of the SD algorithm is a set of rules with optimal covering properties on the given example set. The SD algorithm is implemented in the on-line Data Mining Server (DMS), publicly available at http://dms.irb.hr. The following constraints formalize the SD constraint-based subgroup mining task.
Language constraints: Individual subgroup descriptions have the form of rules Class ← Cond, where Class is the property of interest (the target class CHD), and Cond is a conjunction of features (conditions based on attribute-value pairs) defined by the language describing the training examples.
Evaluation/optimization constraints: To ensure that induced subgroups are sufficiently large, each induced rule R must have high support, i.e., sup(R) ≥ MinSup, where MinSup is a user-defined threshold, and sup(R) is the relative frequency of correctly covered examples of the target class in the example set E:

sup(R) = p(Class \cdot Cond) = \frac{n(Class \cdot Cond)}{|E|} = \frac{|TP|}{|E|}
Other evaluation/optimization constraints have to ensure that the induced subgroups are highly significant (ensuring that the distribution of target class examples covered by the subgroup description will be statistically significantly different from the distribution in the training set). This could be achieved in a straightforward way by imposing a significance constraint on rules, e.g., by requiring that rule significance is above a user-defined threshold. Instead, in the SD subgroup discovery algorithm (Gamberger and Lavrač 2002) the following rule quality measure, assuring rule significance and implemented as a heuristic in rule construction, is used:

q_g(R) = \frac{|TP|}{|FP| + g}    (1)
In this equation, TP are the true positives (target class examples covered by rule R), FP are the false positives (non-target class examples covered by rule R), and g is a user-defined generalization parameter. High-quality rules will cover many target class examples and a low number of non-target examples. The number of tolerated non-target class cases, relative to the number of covered target class cases, is determined by the parameter g. It was shown in Gamberger and Lavrač (2002) that by using this optimization constraint (choosing the rule with the best q_g(R) value in the beam search for the best rule conditions), rules with a significantly different distribution of covered positives, compared to the prior distribution in the training set, are induced.
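To illustrate how a candidate rule Class ← Cond is scored against these constraints, consider the following sketch; the patient records, attribute names, and helper functions are invented, and the SD algorithm itself (a beam search over feature conjunctions) is not reproduced here.

def covers(cond, example):
    """cond is a conjunction of features, each a predicate over one example."""
    return all(feature(example) for feature in cond)

def evaluate_rule(cond, examples, labels, target="CHD", g=8.0, min_sup=0.1):
    """Support and q_g of the rule `target <- cond` on a labeled example set."""
    tp = sum(1 for e, y in zip(examples, labels) if y == target and covers(cond, e))
    fp = sum(1 for e, y in zip(examples, labels) if y != target and covers(cond, e))
    sup = tp / len(examples)          # relative frequency of covered positives
    q_g = tp / (fp + g)               # g is the generalization parameter
    return sup >= min_sup, sup, q_g

# Subgroup A2 from Table 1, written as three features over a patient record:
cond_A2 = [lambda e: e["sex"] == "female",
           lambda e: e["bmi"] > 25,
           lambda e: e["age"] > 63]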
The process of expert-guided subgroup discovery was performed as follows. For every data stage A, B and C, the SD algorithm was run for values g in the range 0.5 to 100 (values 0.5, 1, 2, 4, 6, ...), and a fixed number of selected output rules equal to 3. The rules induced in this iterative process were shown to the expert for selection and interpretation. The inspection of 15–20 rules for each data stage triggered further experiments, following the suggestions of the medical expert to limit the number of features in the rule body and avoid the generation of rules whose features would involve expensive and/or unreliable laboratory tests. In the iterative process of rule generation and selection, the expert has selected five most interesting CHD risk groups. Table 1 shows the induced subgroup descriptions. The features appearing in the conditions of rules describing the subgroups are called the principal factors. Subgroup A1 is for male patients, subgroup A2 for female patients, while subgroups B1, B2, and C1 are for both male and female patients. The subgroups are induced from different attribute subsets (A, B and C, respectively) with different g parameter values (14, 8, 10, 12 and 10, respectively). The described iterative process was successful for data at stages B and C, but it turned out that medical history data on its own (stage A data) is not informative enough for inducing subgroups, i.e., it failed to fulfil the expert’s subjective criteria of interestingness. Only after engineering the domain, by
Expert Selected Subgroups
A1: CHD ← male & positive family history & age over 46 years
A2: CHD ← female & body mass index over 25 kg/m2 & age over 63 years
B1: CHD ← total cholesterol over 6.1 mmol/L & age over 53 years & body mass index below 30 kg/m2
B2: CHD ← total cholesterol over 5.6 mmol/L & fibrinogen over 3.7 g/L & body mass index below 30 kg/m2
C1: CHD ← left ventricular hypertrophy
Table 1. Induced subgroup descriptions in the form of rules.
separating male and female patients, interesting subgroups A1 and A2 have actually been discovered. Separately for each data stage A, B and C, we have investigated which of the induced rules are the best in terms of the T P/F P tradeoff, i.e., which of them are used to define the convex hull in the ROC space. The expert-selected subgroups B1 and B2 are significant, but are not among those lying on the ROC convex hull. The reason for selecting exactly those two rules at stage B are their simplicity (consisting of three features only), their generality (covering relatively many positive cases) and the fact that the used features are, from the medical point of view, inexpensive laboratory tests. Additionally, rules B1 and B2 are interesting because of the feature body mass index below 30 kg/m2 , which is intuitively in contradiction with the expert knowledge that both increased body weight as well as increased total cholesterol values are CHD risk factors. It is known that increased body weight typically results in increased total cholesterol values while subgroups B1 and B2 actually point out the importance of increased total cholesterol when it is not caused by obesity as a relevant disease risk factor. The next step in the proposed subgroup discovery process starts from the discovered subgroups. In this step, statistical differences in distributions are computed for two populations, the target and the reference population. The target population consists of true positive cases (CHD patients included into the analyzed subgroup), whereas the reference population are all available non-target class examples (all the healthy subjects). Statistical differences in distributions for all the descriptors (attributes) between these two populations are tested using the χ2 test with 95% confidence level (p = 0.05). To enable testing of statistical significance, numerical attributes have been partitioned in up to 30 intervals so that in every interval there are at least 5 instances. Among the attributes with significantly different value distributions there are always those that form the features describing the subgroups
Supporting Factors
A1: psychosocial stress, cigarette smoking, hypertension, overweight
A2: positive family history, hypertension, slightly increased LDL cholesterol, normal but decreased HDL cholesterol
B1: increased triglycerides value
B2: positive family history
C1: positive family history, hypertension, diabetes mellitus
Table 2. Statistical characterization of induced subgroup descriptions.
Among the attributes with significantly different value distributions there are always those that form the features describing the subgroups (the principal factors), but usually there are also other attributes with statistically significantly different value distributions. These attributes are called supporting attributes, and the features formed from their values that are characteristic for the discovered subgroups are called supporting factors. Supporting factors are very important for making subgroup descriptions more complete and acceptable for medical practice. Medical experts dislike long conjunctive rules, which are difficult to interpret. On the other hand, they also dislike short rules providing insufficient supportive evidence. In this work, we found an appropriate tradeoff between rule simplicity and the amount of supportive evidence by enabling the expert to inspect all the statistically significant supporting factors, while the decision whether they indeed increase the user's confidence in the subgroup description is left to the expert. In the CHD application the expert decided whether the proposed supporting factors are meaningful, interesting and actionable, how reliable they are, and how easily they can be measured in practice. Table 2 lists the expert-selected supporting factors.
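The statistical characterization step described above can be sketched as follows, assuming scipy is available: a numerical attribute is partitioned into intervals containing at least five instances each, and its distribution in the target population (subgroup patients) is compared with the reference population (healthy subjects) using the χ2 test at p = 0.05. The attribute values below are synthetic stand-ins, not the actual CHD data.

```python
# Hedged sketch of the supporting-factor test: compare the distribution of one
# attribute between the target population (TP cases of a subgroup) and the
# reference population (non-target class), using a chi-square test at p = 0.05.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
target = rng.normal(6.5, 1.0, 80)      # hypothetical attribute values, subgroup patients
reference = rng.normal(5.2, 1.0, 200)  # hypothetical values, healthy subjects

def partition(values, max_bins=30, min_count=5):
    """Equal-frequency bin edges with at least `min_count` instances per bin."""
    n_bins = min(max_bins, len(values) // min_count)
    return np.quantile(values, np.linspace(0, 1, n_bins + 1))

edges = partition(np.concatenate([target, reference]))
counts = np.array([
    np.histogram(target, bins=edges)[0],
    np.histogram(reference, bins=edges)[0],
])
counts = counts[:, counts.sum(axis=0) > 0]   # drop empty intervals

chi2, p, dof, _ = chi2_contingency(counts)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.4f}")
if p < 0.05:
    print("significantly different distributions -> candidate supporting factor")
```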
4 Conclusions
We have identified a number of lessons learned in collaborative data mining and decision support projects. First, researchers should explore methods that generate knowledge in established domain formalisms rather than focusing entirely on those invented by the machine learning community. They should also employ standards (e.g., PMML) for model sharing, use and visualization. We also need to reach increased awareness of methods that produce good models from small data sets, whether through incorporation of domain knowledge or statistical techniques for variance reduction, and of methods that generate explanatory models to complement the existing emphasis on purely predictive ones. Finally, the field should expand its efforts on interactive environments for learning and discovery, rather than continuing its emphasis on automated methods. These recommendations do not contradict earlier lessons drawn from successful applications. Developers should still think carefully about how to formulate their problems, engineer the representations, manipulate their data
and algorithms, and interpret their results. But they do suggest that, despite some impressive successes, we still require research that will produce a broader base of computational methods for discovery and learning. These will be crucial for the next generation of applications in data mining and scientific discovery.
Acknowledgments
This paper outlines the results of joint work of partners of the SolEuNet 5FP IST project Data Mining and Decision Support for Business Competitiveness: A European Virtual Enterprise (2000–2003), which was coordinated by Dunja Mladenić and the author of this paper. The results on subgroup discovery were developed in joint work with Dragan Gamberger from the Rudjer Bošković Institute, Zagreb, Croatia. The work presented in this paper was funded by the SolEuNet project and the Slovenian Ministry of Higher Education, Science and Technology.
References
GAMBERGER, D. and LAVRAČ, N. (2002): Expert-Guided Subgroup Discovery: Methodology and Application. Journal of Artificial Intelligence Research, 17, 501–527.
LAVRAČ, N., MOTODA, H., FAWCETT, T., HOLTE, R.C., LANGLEY, P. and ADRIAANS, P. (2004): Introduction: Lessons Learned from Data Mining Applications and Collaborative Problem Solving. Machine Learning Journal, 57, 13–34.
MLADENIĆ, D., LAVRAČ, N., BOHANEC, M. and MOYLE, S. (eds.) (2003): Data Mining and Decision Support: Integration and Collaboration, Kluwer Academic Publishers.
MLADENIĆ, D. and LAVRAČ, N. (eds.) (2003): Data Mining and Decision Support for Business Competitiveness: A European Virtual Enterprise - Results of the Sol-Eu-Net Project, DZS, Ljubljana.
SILBERSCHATZ, A. and TUZHILIN, A. (1995): On subjective measures of interestingness in knowledge discovery. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, AAAI Press.
WROBEL, S. (1997): An Algorithm for Multi-relational Discovery of Subgroups. In Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery, Springer, 78–87.
Inferred Causation Theory: Time for a Paradigm Shift in Marketing Science? Josef A. Mazanec Institute for Tourism and Leisure Studies, Wirtschaftsuniversität Wien, 1090 Vienna, Austria
Abstract. Over the last two decades the analytical toolbox for examining the properties needed to claim causal relationships has been significantly extended. New approaches to the theory of causality rely on the concept of 'intervention' instead of 'association'. Under an axiomatic framework they elaborate the conditions for safe causal inference from nonexperimental data. Inferred Causation Theory (Spirtes et al., 2000; Pearl, 2000) teaches us that the same independence relationships (or covariance matrix) may have been generated by numerous different graphs representing cause-effect hypotheses. ICT combines elements of graph theory, statistics, logic, and computer science. It is not limited to parametric models in need of quantitative (ratio or interval scaled) data, but also operates much more generally on the observed conditional independence relationships among a set of qualitative (categorical) observations. Causal inference does not appear to be restricted to experimental data. This is particularly promising for research domains such as consumer behavior, where policy makers and managers are unwilling to engage in experiments on real markets. A case example highlights the potential use of Inferred Causation methodology for analyzing marketing researchers' belief systems about their scientific orientation.
1 Introduction
In 1997 McKim and Turner published a reader entitled 'Causality in Crisis?'. This question, raised for social science in general, is particularly appropriate for marketing and its equivocal relationship with the concept of causality. It becomes apparent in the debate on the causal interpretability of structural modeling results. In their review of structural equation modeling in marketing, Baumgartner and Homburg (1996) found that 93 per cent of a total of 147 articles used cross-sectional data. Even the most advanced SEMs, which try to capture unobserved consumer heterogeneity, rely on cross-sectional data (Jedidi, Jagpal and DeSarbo, 1997). Baumgartner and Homburg concluded what seems to be typical of contemporary marketing research: 'special care' should be 'exercised in causally interpreting results . . . ', a recommendation that directly leads to avoiding 'the term causal modeling altogether' (p. 141). In the same issue of the International Journal of Research in Marketing, Hulland, Chow and Lam (1996) offered another review based on 186 articles.
Without questioning their decision, these authors chose to adhere to the term 'causal models' as introduced by Bagozzi (1980). Incidentally, Richard Bagozzi did not coin this term lightheartedly. He sets out the discussion with – rarely found in marketing books – a profound account of the epistemological underpinnings of causal research. The large majority of explanatory models in marketing claim to serve managerial purposes. Therefore, their practical value does not only depend on prediction but requires a lot more, viz. predicting a system's response to interventions. Such a prediction cannot be deduced in a meaningful way unless the model is causally interpreted. Over the last 25 years a new way of thinking about causality and causal model building has emerged. Inferred Causation Theory (ICT) represents an area of overlap between logic, graph theory, statistics, and computer science, with prevailing applications in social science and economics. While largely unnoticed in marketing, it offers exciting new instruments for drawing causal conclusions also from data collected in a nonexperimental setting and even of cross-sectional origin. The two leading books on ICT take 380–520 pages to outline the basic concepts and algorithms, so this article cannot be expected to lead to an in-depth understanding. However, it draws the marketing researcher's attention to the assertion that the traditional view 'You can never draw causal inferences from cross-sectional data' may be obsolete. An empirical demonstration study will illustrate the practical application of Inferred Causation tools.
2 Causality Revisited
The history of causal reasoning in the sciences has seen controversies between widely differing positions. A famous example is Hume's view that causal laws are illusory, nothing more than a tendency of the human mind to organize observations in its struggle to make sense of them. By contrast, the Kantian interpretation gives causality the status of a synthetic a priori truth that need not be established empirically. An in-depth treatment of causality from the philosophy of science point of view is bound to discuss intricate issues such as causal necessity and causal explanation, determinism versus indeterminism, and inductive reasoning (Stegmüller, 1969). Despite all the efforts made so far, agreement about how to explicate causality in a manner that may accommodate relativity and quantum physics seems as out of reach as ever. The inconclusive findings about causality in theoretical physics and philosophy have always appeared paradoxical when confronted with the fact that a human child acquires causal knowledge fairly easily. There must be learning mechanisms at work — at least on a macroscopic level (Heylighen, 1989) — that allow for causal generalizations. ICT holds that, until recently, we have been lacking the proper language to describe the process of causal reasoning in a way amenable to computer analysis.
This language requires graphical elements in addition to algebraic symbolism, and it needs a special operator – do(.) in the notation of Pearl (2000) – to implement an intervention calculus. Manipulating the variable Xi in a causal model involving variables X1, ..., Xn, i.e. do(Xi = xi), removes the term P(xi | pai) from the factorization of the joint distribution P(x1, ..., xn). In graphical terms this is equivalent to eliminating the directed links between Xi and all the variables influencing it (viz. its parents pai). Empirical researchers in economics and management science have been trained to circumvent the causality problem. While the aim of acquiring causal knowledge is accepted by some silent consent, there are very few cases where the empirical evidence is said to support causal relationships. The traditional principles of research designs appropriate for confirming causal hypotheses are well known: repeated measurements, treatment and control groups allowing for manipulating independent variables and controlling for extraneous influences, and random assignment or at least matching of cases are accepted requirements (Kerlinger, 1986). Marketing research, however, meets with a very limited willingness of marketing practitioners to enter into controlled experimentation. If there are any disciplines legitimately asking 'How far do we get with nonexperimental data?', marketing ought to be amongst them. If a human child derives causal knowledge — of the consequences of one's own actions — without performing controlled experiments, why shouldn't a marketing analyst be able to achieve similar results? A 'normative' definition of Inferred Causation has been proposed by Pearl (2000). According to this definition, a variable C has a causal influence on variable E if and only if there exists a directed path from C to E in every minimal latent structure consistent with a given probability distribution P; a latent structure (D, O) consists of a causal structure D over the variables V and a set of observed variables O ⊆ V; a causal structure over a set of variables V is a directed acyclic graph whose nodes represent the elements of V and whose arrows denote functional relationships. Because of the recourse to Minimality, which refers to the principle of parsimony or Occam's Razor, the definition is normative. Consistency with the distribution P over O points to the existence of a parameterization of D that generates P; in sloppier terms this means consistency with the data.
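The effect of the do(.) operator can be made concrete on a toy discrete model. In the sketch below (with entirely hypothetical probability tables) X1 confounds the relationship between X2 and X3; intervening on X2 deletes the factor P(x2|x1) from the factorization, as described above, and the resulting interventional distribution differs from ordinary conditioning.

```python
# Truncated factorization for do(X2 = x2) in a small model where X1 confounds
# the X2 -> X3 relationship: X1 -> X2, X1 -> X3, X2 -> X3.
# All probability tables are hypothetical toy numbers.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}              # [x1][x2]
p_x3_given_x1_x2 = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.6, 1: 0.4},
                    (1, 0): {0: 0.5, 1: 0.5}, (1, 1): {0: 0.2, 1: 0.8}}  # [(x1, x2)][x3]

def p_do(x3, x2):
    """P(x3 | do(X2 = x2)): drop the factor P(x2|x1) from the joint factorization."""
    return sum(p_x1[x1] * p_x3_given_x1_x2[(x1, x2)][x3] for x1 in (0, 1))

def p_cond(x3, x2):
    """Ordinary conditioning P(x3 | X2 = x2), for comparison."""
    joint = sum(p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x1_x2[(x1, x2)][x3]
                for x1 in (0, 1))
    marginal = sum(p_x1[x1] * p_x2_given_x1[x1][x2] for x1 in (0, 1))
    return joint / marginal

print("P(X3=1 | do(X2=1)) =", round(p_do(1, 1), 3))    # intervention: 0.56
print("P(X3=1 | X2=1)     =", round(p_cond(1, 1), 3))  # observation:  0.68
```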
3 Inferred Causation Theory
Inferred Causation Theory (ICT) seeks to establish the conditions that must be fulfilled to deduce causal structure from statistical data. There is no requirement for experimental manipulation, and even a temporal sequence of measurements is not mandatory. To dispel unwarranted expectations: it is still impossible to claim a causal relationship when only a pair of correlated variables has been observed. However, such claims can be substantiated for more complex systems of partially interrelated variables. The elementary building blocks of ICT are directed acyclic graphs (DAGs) and conditional independence relationships.
A DAG such as
(1) X → Y → Z
reflects the independence relationship
(2) X ⊥ Z | Y.
(1) also exhibits the Markov property (generalized to 'd-separation' by Pearl, 1988), as knowing Y makes further knowledge of X irrelevant for learning something about Z: Y blocks (or d-separates) Z from X. With (2) observed, a computerized DAG reconstruction procedure (like the PC or PC* algorithms; Spirtes, Glymour and Scheines, 2000) would yield the graph in (1) plus the two in (3) and (4) (Scheines, 1997):
(3) X ← Y → Z
(4) X ← Y ← Z
Note that the graphical representation
(5) X → Y ← Z
is not consistent with the observed conditional independence relationship in (2), which is obvious here but may be hard to infer for nontrivial data structures. From the marketing research point of view, graphical models are a rich and promising model class (Edwards, 2000) quite independently of whether they are interpreted causally or not. Even without making the additional assumptions needed to infer causation (the Causal Markov and Causal Sufficiency conditions, Faithfulness), the analyst benefits from applying ICT tools. In particular, it is highly desirable to achieve evidence for the direction of an edge in a model graph that is unambiguously supported by the data. As an example consider observations on the four variables A, B, C, D that do not exhibit any other independence relationships except
(6) A ⊥ B and
(7) D ⊥ {A, B} | C
and those logically deducible from (6) and (7). With only one principle adopted — the above-mentioned Minimality — the dependency C → D can be inferred unambiguously (Pearl, 2000, p. 47). This conclusion is valid with or without assuming further latent variables. Minimality just precludes overfitting models. The following two graphs are consistent with the two independencies (6) and (7) and also fulfill the Minimality requirement:
(a)
A→C←B ↓ D
(b)
A→C←B ↓ D
Lacking further information, there is no way to differentiate between the structures (a) and (b), thereby deciding on the presence of a latent variable L in (b). However, a mediation effect of a latent variable such that C → D gets replaced by C → L ← D does not follow from the observations, as it implies D ⊥ {A, B} unconditional on C.
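The conditional independence statements used in this section can be checked mechanically. The following minimal, self-contained sketch tests d-separation via the standard ancestral moral graph criterion; it reproduces the fact that the chain (1) implies X ⊥ Z | Y while the collider (5) does not, although the collider does imply marginal independence of X and Z.

```python
# Minimal d-separation check via the ancestral moral graph criterion.
# A DAG is given as a dict: node -> set of its parents.

def d_separated(parents, xs, ys, zs):
    """True iff xs is d-separated from ys given zs in the DAG `parents`."""
    # 1. Keep only the ancestors of xs, ys and zs.
    anc, stack = set(), list(set(xs) | set(ys) | set(zs))
    while stack:
        n = stack.pop()
        if n not in anc:
            anc.add(n)
            stack.extend(parents[n])
    # 2. Moralize: marry parents of a common child, then drop edge directions.
    und = {n: set() for n in anc}
    for child in anc:
        ps = sorted(parents[child] & anc)
        for p in ps:
            und[child].add(p); und[p].add(child)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                und[p].add(q); und[q].add(p)
    # 3. Remove the conditioning set and test reachability from xs to ys.
    seen, stack = set(), [x for x in xs if x not in zs]
    while stack:
        n = stack.pop()
        if n not in seen and n not in zs:
            seen.add(n)
            stack.extend(und[n] - set(zs))
    return not (seen & set(ys))

chain    = {"X": set(), "Y": {"X"}, "Z": {"Y"}}        # (1) X -> Y -> Z
collider = {"X": set(), "Y": {"X", "Z"}, "Z": set()}   # (5) X -> Y <- Z

print(d_separated(chain, {"X"}, {"Z"}, {"Y"}))     # True:  X _|_ Z | Y
print(d_separated(collider, {"X"}, {"Z"}, {"Y"}))  # False: conditioning on the collider opens the path
print(d_separated(collider, {"X"}, {"Z"}, set()))  # True:  X _|_ Z marginally
```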
Fig. 1. Assumed causal structure (starting model).
4 An Illustrative Example: Marketing Researchers' Belief Systems About Their Scientific Orientation
The empirical demonstration study uses the results and the data of an analysis presented by Franke (2002). It explores nonexperimental cross-sectional data. The assertions encountered in the literature lead to formulating and testing a 'starting model'. This model describes the interrelationships likely to be found in marketing scholars' systems of beliefs about the philosophical and methodological foundations of the discipline. Exploring the empirical data from a harmonized German-US survey, new diagnostic tools of Inferred Causation Theory are employed to eliminate unwarranted causal paths and to search for new ones neglected so far. A discussion of the results evaluates the findings and points to directions for further research. The literature survey and the results of Franke (2002) suggest a causal mechanism working along these lines: a marketing researcher's decision on the width of his domain of study, the origin of theories to choose from, the analytical toolbox to be used and the willingness to make normative statements depends on his basic epistemological orientation and his preference for scientific discovery versus service to management practice. (One may argue that sometimes a researcher uncomfortable with sophisticated quantitative methods may tailor his orientation so as not to need them. But for the moment disregard this reverse-causality interpretation.) Figure 1 exhibits the expected causal relationships among the six attitudinal variables. Only a minor difference between the researchers' mind-sets in the German and US subgroups was detected in a classification study (Franke and Mazanec, forthcoming). Therefore, a homogeneous causal structure was assumed to underlie the master sample of 241 respondents. Six of the attitudinal variables relate to how marketing scientists may perceive the world. The three statements
expressing a rationalist view and the other three favoring a constructionist interpretation of marketing "reality" were condensed into two indices. Named REALIST and CONSTRUCTIONIST, they range over the same scale interval as the rest of the items. This leaves one with two singleton variables and four paired items. The singletons have nothing in common; the desired focus of research — suggesting a narrow or a wide empirical domain — was named FOCUS, and the perceived necessity to issue value judgments was labeled VALUE. The paired variables are not strict alternatives but exhibit their full meaning if considered in conjunction with each other. For instance, a priority for seeking theoretical explanations (EXPLANATION) very often limits the time a researcher can spend serving marketing practice (APPLICATION). The same argument of contrast applies to a scientist emphasizing a microeconomic, strongly formal style of research (FORMAL) or preferring the behavioral sciences for providing basic theories (BEHAVIORAL). Finally, not a mutually exclusive but a pragmatic choice is made by colleagues leaning more towards quantitative (QUANTITATIVE) or qualitative (QUALITATIVE) methodology. Computing scale differences for the paired variables on the disaggregate level enhances the discriminating strength of the items. It also greatly improves the multivariate normality properties of the set of variables. After these preprocessing steps the researchers' mind-sets of normative beliefs about science are made up of six attitudinal items. Fitting the parameters of the starting model results in the following preliminary estimates (standard errors in parentheses; Bengt and Linda Muthén's Mplus (Muthén and Muthén, 2001) was used):
x2 = .143 x1 + ε2   (.053)
x3 = .093 x1 − .252 x2 + ε3   (.100) (.119)
x4 = −.044 x1 − .065 x2 + ε4   (.094) (.113)
x5 = .218 x1 + .110 x2 + ε5   (.060) (.072)
x6 = .031 x1 + .178 x2 + ε6   (.061) (.073)
Four of the nine path coefficients are significant (p < .05) and exhibit the expected signs (boldface in Figure 1). In particular, a realist orientation tends to entail a stronger awareness of the explanatory purpose of the marketing discipline and strengthens a preference for quantitative methods. Emphasis on explanation versus application favors a formal and microeconomic style of research, while it disfavors a narrow focus on the phenomena under study. The model achieves a χ2 of 58.79 (p < .001) and an RMSEA of .105; the R squared values of the dependent variables are poor and range between .3 percent for VALUE and 6.8 percent for QUANTITATIVE-QUALITATIVE.
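Since the starting model is a recursive linear system, its path coefficients can — under the linearity and normality assumptions already made — be approximated equation by equation with ordinary least squares (the paper uses full ML estimation in Mplus). The sketch below runs on synthetic standardized data as a stand-in for the survey, so the printed numbers are not the estimates reported above; variable names follow Figure 1.

```python
# Hedged sketch: equation-wise OLS estimation of a recursive path model with the
# structure of the starting model (x1, x2 exogenous; x3..x6 regressed on x1 and x2).
# The data below are synthetic stand-ins, not the German-US survey data.
import numpy as np

rng = np.random.default_rng(1)
n = 241
x1 = rng.standard_normal(n)                      # REALIST-CONSTRUCTIONIST
x2 = 0.15 * x1 + rng.standard_normal(n)          # EXPLANATION-APPLICATION
X = np.column_stack([x1, x2])
endogenous = {
    "x3 (FOCUS)":      -0.25 * x2 + rng.standard_normal(n),
    "x4 (VALUE)":       rng.standard_normal(n),
    "x5 (QUANT-QUAL)":  0.20 * x1 + rng.standard_normal(n),
    "x6 (FORM-BEHAV)":  0.18 * x2 + rng.standard_normal(n),
}

# x2 = b * x1 + e2
b2, *_ = np.linalg.lstsq(x1.reshape(-1, 1), x2, rcond=None)
print(f"x2 on x1: {b2[0]:+.3f}")

# x3..x6 = b1 * x1 + b2 * x2 + e
for name, y in endogenous.items():
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(f"{name}: on x1 {coef[0]:+.3f}, on x2 {coef[1]:+.3f}")
```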
The starting model gets only partial support but may have potential for improvement. The systematic search for alternative model specifications benefits from the promising developments aimed at elaborating the conditions for inferring causal structure from nonexperimental statistical data. Marketing researchers have not yet become widely aware of these results, which represent substantial progress compared to the familiar search procedures in popular software such as LISREL or EQS. Many applications of structural equation models (SEM) to marketing problems fail to recognize that there may be many alternative model specifications that might have reproduced the observed covariance/correlation matrix equally well. Nevertheless, the authors explicitly or implicitly claim to have validated a set of causal relationships. On the other hand, it is also unjustified to accept the pessimistic view that causal inference is impossible without exploiting experimental data or at least relying on a temporal sequence in the measurements. In the meantime the conditions of causal inference from nonexperimental data have been brought to a degree of precision that allows for algorithmic treatment. The starting model in Figure 1 summarizes the prior knowledge of the analyst. It is expressed in terms of dependence relationships, which define an acyclic directed graph. As a DAG it is characterized by directed paths that do not include directed cycles. Though the exclusion of feedback loops and directed cyclic graphs (nonrecursive systems in SEM parlance) may be seen as a restriction to overcome (Glymour, 1997), it certainly does not pose a problem in this application. Approximate linearity and multivariate normality already had to be assumed for the parameter estimation presented above. Additional assumptions are needed for making judgments regarding causal inferences. The Markov condition was addressed in Section 3. It assures that each variable in the DAG is independent of all its nondescendant nodes given its parental nodes (Pearl and Verma, 1991). If the DAG is to be interpreted causally, the Markov condition has to be reformulated to incorporate Causal Sufficiency, i.e. the assumption that every common cause of two or more variables appearing in the DAG is itself included in this set of variables (Glymour, 1997). A final condition is called the Faithfulness (Spirtes, Glymour and Scheines, 2000) or Stability property (Pearl, 2000). Faithfulness (or stability in Pearl's terminology) implies that the conditional independence relationships suggested in the DAG stay invariant to changes in the parameters of the model. Put more practically, this means that the independence relations must not break down for some peculiar parameter settings. The following analysis benefits from Carnegie Mellon's Tetrad Project. The Tetrad research group develops ICT methodology and software. The Build procedure embedded in the Tetrad system (see the Tetrad project at http://www.phil.cmu.edu/tetrad/) assists in elaborating the causal pattern underlying the graph in Figure 2. The pattern represents a set of models encompassing all equivalent DAGs that may have generated the observed correlation matrix while being consistent with the analyst's background knowledge (Spirtes et al., 2002).
Fig. 2. Causal pattern.
The statistical tests involved are based on conditional independence judgments, which are equivalent to vanishing partial correlations (ρ) under multivariate normality assumptions:
ρ(xi, xj | {xk : k ≠ i, j}) = 0 ⇔ xi ⊥ xj | {xk : k ≠ i, j}
For building the pattern, Tetrad applies the PC algorithm (see Appendix B in Spirtes et al., 2002); an alternative is the Inductive Causation algorithm, which comes in two versions, for systems without (IC) and with latent variables (IC*; Pearl, 2000, pp. 50–54). Three subsequent Tetrad analyses employing assumptions of decreasing rigor are undertaken. Causal Sufficiency is assumed for the first Tetrad run. The assumption of Causal Sufficiency holds if and only if every common cause of a pair of random variables in the set S is itself a member of S (Spirtes, Glymour and Scheines, 2000). In other words, no unmeasured (latent) common causes are admitted for any pair of variables. Given the partial correlations, the following edges cannot be added to the graph in Figure 2 (at p = .05):
ρ(x3, x4) = −0.120   (p = .061)
ρ(x3, x5) = 0.108   (p = .093)
ρ(x4, x6) = 0.003   (p = .963)
ρ(x4, x5) = −0.053   (p = .411)
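The vanishing-partial-correlation judgments behind these numbers can be computed directly: for the precision matrix P (the inverse of the correlation matrix), the partial correlation of xi and xj given all remaining variables is −Pij/√(Pii Pjj), and its significance can be assessed with Fisher's z transform. A sketch on synthetic data (the survey data are not reproduced here), assuming numpy and scipy are available:

```python
# Hedged sketch of the vanishing-partial-correlation test used for building the
# causal pattern: partial correlations from the precision matrix plus Fisher's z.
# Synthetic data stand in for the 241 survey respondents.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, k = 241, 6
data = rng.standard_normal((n, k)) @ rng.standard_normal((k, k))  # correlated toy data

corr = np.corrcoef(data, rowvar=False)
precision = np.linalg.inv(corr)

def partial_corr(i, j):
    """Correlation of x_i and x_j given all other variables."""
    return -precision[i, j] / np.sqrt(precision[i, i] * precision[j, j])

def p_value(r, n, n_cond):
    """Two-sided p-value for H0: partial correlation = 0 (Fisher z)."""
    z = 0.5 * np.log((1 + r) / (1 - r))
    stat = np.sqrt(n - n_cond - 3) * abs(z)
    return 2 * (1 - norm.cdf(stat))

for i, j in [(2, 3), (2, 4), (3, 5), (3, 4)]:      # variable pairs are illustrative only
    r = partial_corr(i, j)
    print(f"rho(x{i+1}, x{j+1} | rest) = {r:+.3f}, p = {p_value(r, n, k - 2):.3f}")
```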
The causal pattern actually suggested includes the nine directed edges already posited in Figure 1 and two additional ones (dotted lines in Figure 2), i.e. x3 → x6 and x5 → x6 . Hence, the data appear to indicate that a marketing scholar’s preference for seeking formal rather than behavioral explanation is influenced by his
attitude toward using quantitative or qualitative tools and his willingness to broaden the focus of research. To examine the role of a rigorous assumption such as Causal Sufficiency, it gets relaxed in the next step of this exploratory analysis. Without assuming Causal Sufficiency the causal conclusions are generally weaker. They are condensed in a Partially Oriented Inducing Path Graph (Spirtes, Glymour and Scheines, 2000). The POIPG produced in a second Tetrad run also confirms the relationships compatible with the background knowledge presented in Figure 1 and adds two connections, i.e. x3 →? x6 and x5 →? x6, where →? means that the directed relationship might be replaced or complemented by a common cause between the two variables. Also, the POIPG results rule out that there may be a directed path from x6 to any other variable except x4. The Tetrad findings gained so far are incorporated into a revised model that introduces new directed edges for x3 → x6 and x5 → x6. The new ML estimates amount to:
x2 = .143 x1 + ε2   (.053)
x3 = .093 x1 − .252 x2 + ε3   (.100) (.119)
x4 = .044 x1 + .065 x2 + ε4   (.094) (.113)
x5 = .218 x1 + .110 x2 + ε5   (.060) (.072)
x6 = −.006 x1 + .192 x2 + .115 x3 + .133 x5 + ε6   (.061) (.072) (.038) (.063)
Now six of the eleven path coefficients exceed twice their standard errors. In addition to the starting model in Figure 1, the revised version in Figure 2 suggests a higher degree of integration of x6 into the belief system. A preference for explanation versus application, a narrower research focus and greater enthusiasm about quantitative methods encourage a formal versus a behavioral style of research. The model now achieves a χ2 of 7.66 (p = .105; as the same data have been used, the chi-square is not interpreted as a significance test and the p value just serves as a measure of fit) and an RMSEA of .061; the R squared values of the four dependent variables increase marginally but still range between .3 per cent for VALUE and 7.8 per cent for FORMAL-BEHAVIORAL. The overall model fit is far from satisfactory. In relative terms, however, the causal explanation attempt improved without doubt by using the ICT tools. Given this improvement, the diagnostics of inferred causation demonstrated their ability to provide useful indications of how to systematically rethink and revise one's theory. In any case, of course, new data are required to subject such a revised model to conclusive inferential testing. Finally, it is tempting
to let Tetrad construct its own theory without specifying any prior knowledge at all. The Causal Sufficiency assumption remains relaxed too. Therefore, the resulting path fragments arranged into a POIPG hold without claiming the absence of unmeasured variables:
x1 →? x2, x3 →? x2, x2 ↔ x6, x3 →? x6, x5 ↔ x6, x1 →? x5; x4 remains isolated.
There are several lessons to learn from this POIPG. (1) x4 is standing aloof, at least at a p < .05 level, so the willingness to make value judgments seems to be detached from the rest of the belief system. Seemingly, the fundamental scientific orientation does not matter for the question whether a marketing scholar likes or dislikes value judgments (and his/her colleagues making them). (2) x2, the EXPLANATION-APPLICATION attitude, is not exogenous. To some degree it depends on one's being a realist or a constructionist. (3) By contrast, x1 never appears as a dependent variable, hence the basic scientific orientation (REALIST-CONSTRUCTIONIST) qualifies as an ancestral node in the cause-effect chain. This also implies that one may dismiss the speculation of 'reverse causality' made in parentheses above: marketing researchers are not suspected of tailoring their fundamental scientific conviction according to their familiarity with a toolbox of 'quantitative' or 'qualitative' methods.
5 Discussion
The exploration of marketing researchers' epistemological and methodological attitudes gives limited credit to the propositions brought forward in the literature. Despite a number of significant relationships among the variables in the mind-set, the explanatory power remains weak. Obviously, there must be more factors influencing a researcher's orientation regarding the purpose, scope, and methods of the marketing discipline. It will be necessary to look for other causes outside the small assortment of the six attitudinal variables analyzed here. Few of us are born as realists or constructionists. Somehow we are pushed in either direction or freely choose to move there. The motives and causes are likely to relate to the individual researchers' career histories, their education, scientific idols and heroes, and to the academic institutions' incentive systems, their receptiveness to minority research styles and their effort to reach or maintain diversity. Leaving aside the particular findings of the demonstration example, what seems to be a fair and provisional judgment about the achievements of ICT? An adverse opinion, put in the words of Freedman (1997), may state that 'if you want to pull a rabbit
out of the hat, you have to put a rabbit into the hat'. This argument points to the assumption of the Causal Markov/Causal Sufficiency and Faithfulness conditions. A counter-argument refers to the set of DAGs produced by a causal discovery algorithm: one of these DAGs correctly represents the underlying causal process, and this result is a derivation rather than an assumption (Spirtes and Scheines, 1997). From the marketing science point of view, a fair answer must consider the state of maturity of marketing theory. The reader may honestly judge the merits of ICT by answering the following question for himself: if we are given evidence that a directed relationship X → Y is bound to appear in any DAG consistent with our data, doesn't this bring us pretty close to supporting a causal link? Compare this to the many dozens if not hundreds of SEM applications trying to corroborate one particular initial or artfully modified model. A meaningful result will not occur unless there is a mature theory ruling out most of the alternative but unproven model specifications. Do marketing scientists grow enough mature theories in their backyards?
References
BAGOZZI, R.P. (1980): Causal Models in Marketing. New York: Wiley.
BAUMGARTNER, H., and HOMBURG, C. (1996): Applications of structural equation modeling in marketing and consumer research: A review. Int. Journal of Research in Marketing, 13, 139–161.
EDWARDS, D. (2000): Introduction to Graphical Modelling. 2nd Edition. Springer, New York.
FRANKE, N. (2002): Schools of Thought in Marketing. Proc. 31st EMAC Conf. Marketing in a Changing World. Braga, University of Minho, 151.
FREEDMAN, D.A. (1997): From Association to Causation via Regression, and, Rejoinder to Spirtes and Scheines. In: McKim and Turner (1997), 113–161 and 177–182.
GLYMOUR, C. (1997): A Review of Recent Work on the Foundations of Causal Inference. In: McKim and Turner (1997), 201–248.
HEYLIGHEN, F. (1989): Causality as Distinction Conservation: A Theory of Predictability, Reversibility and Time Order. Cybernetics and Systems, 20, 361–384.
HULLAND, J., CHOW, Y.H., and LAM, S. (1996): Use of Causal Models in Marketing Research: A Review. Int. Journal of Research in Marketing, 13, 181–197.
JEDIDI, K., JAGPAL, H., and DeSARBO, W. (1997): Finite-Mixture Structural Equation Models for Response-Based Segmentation and Unobserved Heterogeneity. Marketing Science, 16, 39–59.
KERLINGER, F. (1986): Foundations of Behavioral Research. 3rd ed., Holt, Rinehart, and Winston, Fort Worth.
McKIM, V.R., and TURNER, S.P., eds. (1997): Causality in Crisis? Statistical Methods and the Search for Causal Knowledge in the Social Sciences. University of Notre Dame Press.
MUTHÉN, L.K., and MUTHÉN, B.O. (2001): Mplus User's Guide: Statistical Analysis with Latent Variables. Muthén and Muthén, Los Angeles.
PEARL, J. (1988): Probabilistic Reasoning in Intelligent Systems. San Mateo: Morgan Kaufmann.
PEARL, J. (2000): Causality: Models, Reasoning, and Inference. Cambridge University Press.
PEARL, J. and VERMA, T. (1991): A Theory of Inferred Causation. Proc. 2nd Int. Conf. on Principles of Knowledge Representation and Reasoning. Morgan Kaufmann, San Mateo, 441–452.
SCHEINES, R. (1997): An Introduction to Causal Inference. In: McKim and Turner (1997), 163–176.
SPIRTES, P., and SCHEINES, R. (1997): Reply to Freedman. In: McKim and Turner (1997), 185–199.
SPIRTES, P., SCHEINES, R., MEEK, C., RICHARDSON, T., GLYMOUR, C., HOIJTINK, H., and BOOMSMA, A. (2002): Tetrad3: Tools for Causal Modeling, User's Manual. http://www.phil.cmu.edu/tetrad/tet3/master.htm
SPIRTES, P., GLYMOUR, C., and SCHEINES, R. (2000): Causation, Prediction, and Search. 2nd ed., The MIT Press, Cambridge.
STEGMÜLLER, W. (1969): Probleme und Resultate der Wissenschaftstheorie und Analytischen Philosophie, Band I: Wissenschaftliche Erklärung und Begründung. Springer, Berlin-Heidelberg-New York.
Text Mining in Action! Dunja Mladenič, J. Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
[email protected] http://kt.ijs.si/Dunja/
Abstract. Text mining methods have been used successfully on different problems where text data is involved. Some Text mining approaches are capable of handling text relying just on statistics, such as the frequency of words or phrases, while others assume the availability of additional resources, such as natural language processing tools for the language in which the text is written, lexicons, ontologies of concepts, corpora aligned across several languages, or additional data sources such as links between the text units or other non-textual data. This paper aims at illustrating the potential of Text mining by presenting several approaches having some of the listed properties. For this purpose, we present research applications that were developed mainly inside European projects in collaboration with end-users, and research prototypes that do not necessarily involve end-users.
1 Introduction
Intensive usage and growth of the World Wide Web and the daily increasing amount of text information in electronic form have resulted in a growing need for computer-supported ways of dealing with text data. Here we adopt a view of Text mining as a fairly broad area dealing with computer-supported analysis of text. This makes the list of problems that can be addressed by text mining rather long and open. For the purpose of this paper we will concentrate on problems addressed by text mining approaches related to automatic data analysis and data mining. We can say that Text Mining is an interdisciplinary area involving the following key research fields: Machine Learning (Mitchell, 1997), (Duda et al., 2000) and Data Mining (Fayyad et al., 1996), (Hand et al., 2001), (Witten and Frank, 1999), which provide techniques for data analysis with varying knowledge representations and large amounts of data; Data Visualization (Fayyad et al., 2001), which can be especially helpful in the first steps of data analysis and for presenting the results of analysis; Statistics and statistical learning (Hastie et al., 2001), which contribute to data analysis in general; Information Retrieval (Rijsberg, 1979), providing techniques for text manipulation mechanisms; and Natural Language Processing (Manning and Schutze, 2001), providing the techniques for analyzing natural language. Some aspects of text mining involve the development of models for reasoning about new text documents based on words, phrases, linguistic and grammatical properties of the text, as well as extracting information and knowledge from large amounts of text documents.
Some research problems have received more attention than others, such as document categorization and clustering. Document categorization aims at organizing documents by classifying them into pre-defined taxonomies/categories based on their content (as described, for instance, by Sebastiani (2002)), while document clustering (e.g., Steinbach et al., 2000) aims at identifying groups of similar documents. In Text mining, as in data analysis in general, visualization of data can be very helpful, especially in the first steps of the analysis. The most popular visualizations in text mining are of large document collections, as proposed, for instance, by Kohonen et al. (2000) or by Grobelnik and Mladenic (2002b, 2004) for visualizing and browsing a large collection of news articles. For documents that are ordered in time, such as news, researchers have addressed the problems of topic identification and tracking and of visualizations showing the time line of topic development (Havre et al., 2000). Other important problems frequently addressed in Text mining include automatic document summarization (e.g., Mani and Maybury, 1999), automatic construction and updating of document hierarchies (e.g., Mladenic and Grobelnik, 2003b), semi-automatic ontology construction (e.g., Bisson et al., 2000; Maedche and Staab, 2001; Mladenic and Grobelnik, 2004), the semantic web (e.g., Berendt et al., 2003), user profiling, information extraction, question answering in natural language, and many others. With the growing usage of the Web, a very popular problem addressed by Text mining is searching through document collections, while a less known (but very important, especially due to the free sharing of resources over the Web) possibility is automatic detection of document authorship and identification of plagiarism. We should point out that this is not a complete list of problems addressed by Text mining; we would rather hope to give the reader an idea of the span of the problems that have been addressed. The next sections aim in that direction by briefly describing some of the existing applications and research prototypes (rather than providing a comprehensive list of the available systems and tools). Section 2 describes several research applications that were mostly developed in collaboration with a particular end-user. In Section 3 some research prototypes are presented that do not necessarily have end-users involved. Section 4 concludes with a brief discussion.
2 Research Prototypes and Applications
In this Section we illustrate the kinds of problems that can be addressed by different Text mining methods. We briefly describe some research applications that we have developed over the last years in collaboration with several end-users. All applications involve handling text data combined with some other data source. In the application for a publishing house of educational materials, an in-house built ontology of educational materials was made available by the end-user, as well as the text of the educational materials. In the application
involving analysis of the European research space, an internal database was provided in addition to publicly available descriptions of the funded research projects. Web access analysis for a statistical office was based on the log files of accesses to the Web site, provided by the end-user in addition to the content of the accessed html documents. In the application on Web browsing using user profiling, the internal data of a digital library and log files of the users accessing the library were provided by the end-user.
2.1 Support for Publishing House of Educational Materials
Semi-automatic approaches involving text mining can be incorporated into larger systems. One example is supporting search and ontology construction in a publishing house of educational materials (Mladenic and Grobelnik, 2003a). In discussion with the editors and managers of the publishing house, two text mining problems were defined: (1) support for search on the end-user's text databases, handling natural language specifics and offering some additional functionality required by the end-user that was not offered by general search engines; (2) support for ontology construction from in-house XML documents, taking into account the existing taxonomy and handling the natural language specifics of Slovenian. The resulting solutions were included in one of the main projects of the multimedia division of the publishing house, supporting education in the information society through the Web educational portals for Civic education, Biology, Physics and Pedagogy. The portals were sold to over 70 schools all over the country, thus targeting more than 35 thousand individual users. The publishing house expressed their strong belief that the prototype we provided improved the quality of their product and potentially also brought financial benefit for their company.
2.2 Analysis of European Research Space
Text mining methods can be used in combination with other related methods, such as Web mining and Link analysis (Chakrabarti, 2002), to address different problems involving text documents in different formats, including the html format used for Web pages, and possible connections between the documents (the structure of the document set). One application developed in this area as part of the European project on Data mining and Decision support (Mladenic and Lavrac, 2003) is for the European Commission and involves analysis of the European research space (Grobelnik and Mladenic, 2003) based on the publicly available textual descriptions of research and development projects as well as an internal database of the European Commission. This prototype does not use any language-dependent information and aims mainly at providing different views of the complex data. Different methods for data analysis were used to extract the needed data from the Web and group the projects according to their content and the organizations participating in the projects. The goal was to find various informative insights into the research project
database, which would enable better understanding of the past dynamics and provide ground for better planning of future research programs. For this prototype, four types of data analytic methods were used: text mining, link analysis, web mining, and several visualization techniques. The main emphasis was on the analysis of various aspects of research collaboration between different objects (such as institutions, countries, and research areas). This enabled the following specific problems to be addressed: the analysis of collaborations, the identification of similar organizations and project topics (based on the text of project descriptions), community identification (based on the graph of project partnership), and the identification of consortia of organizations for given topics.
2.3 Web Access Analysis for Statistical Office
Text mining methods are also often used as a part of Web mining analysis, where, in addition to Web information such as Web log files, the text of the Web pages is analysed. One application developed in this area as part of the European project on Data mining and Decision support (Mladenic and Lavrac, 2003) is Web access analysis for a statistical office (Jorge et al., 2003). The Portuguese National Statistics Office is the governmental agency that is the gatekeeper of national statistics for Portugal and has the task of monitoring inflation, cost of living, demographic trends, and other important indicators. After data cleaning, the work focused on addressing several problems including the relationship of user preferences, the clustering of users according to their preferences, the characterization of users, the recommendation of potentially interesting/related pages, the visualization of the Web site content, user profiling using collaborative methods, and building classification models to distinguish between various navigation paths.
2.4 Web Browsing Supported by User Profiling
Web browsing is an activity that is very popular among Internet users. One of the interesting problems connected to Web browsing is user profiling based on the user's browsing behavior. Text mining and data mining methods can be used to construct a profile of the user's interests. For instance, Personal WebWatcher (Mladenic, 2002) uses Text mining methods to construct a user profile based on the content of previously visited Web documents. It is used to highlight potentially interesting hyperlinks on the requested Web pages. Another way to help the user in browsing the Web is to offer some structuring over the previously visited documents. One application from this area, developed as a part of the European project on Semantic Web - SEKT, is SEKTBar. It was adapted for the needs of British Telecom to enhance access to a digital library by building an interest-focused browsing history of the user (Grcar et al., 2005). The system is incorporated into Internet Explorer and maintains a dynamic user profile in the form of an automatically constructed
topic ontology. A subset of the previously visited Web pages is associated with each topic in the ontology. By selecting a topic, the user can view the set of associated pages and choose to navigate to the page of his/her interest. The ontology is constructed by clustering the visited Web pages. The most recently visited pages are used to identify the user's current interest and map it to the ontology. The user can clearly see which topics, and their corresponding pages, are related or not to his/her current interest. In this way the user's browsing history is organized and visually represented to the user. Figure 1 shows a screen of the system resulting from one real-life interaction, when the user was visiting Wikipedia for "whale tooth", "triumph tr4" and "semantic web", in that order.
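A minimal sketch of the core mechanism just described — clustering visited pages into topics and mapping the most recent pages onto the closest topic — assuming scikit-learn is available. The page "texts" are hypothetical stand-ins for a real browsing history, and the real system builds a full topic ontology rather than the flat clustering shown here.

```python
# Sketch: build topics by clustering visited pages and map the current interest
# (most recently visited pages) onto the nearest topic. Toy texts, scikit-learn assumed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

history = [
    "whale tooth ivory marine mammal",
    "sperm whale teeth scrimshaw",
    "triumph tr4 sports car engine",
    "classic british roadster restoration",
    "semantic web rdf ontology",
    "owl reasoning linked data",
]
recent = ["semantic web ontologies and rdf data"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(history)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
for topic in range(3):
    pages = [h for h, c in zip(history, kmeans.labels_) if c == topic]
    print(f"topic {topic}: {pages}")

# highlight the topic closest to the user's current interest
sims = cosine_similarity(vectorizer.transform(recent), kmeans.cluster_centers_)
print("current interest maps to topic", sims.argmax())
```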
3 Research Prototypes
Under research prototypes we consider systems that offer some interesting functionality but are not necessarily solving a problem of interest to an end-user and have not been developed for or targeted at a specific end-user. Most of the described prototypes are publicly available for research purposes and can potentially be used as building blocks in end-user applications.
3.1 Automatic Summarization using Graph Representation
Automatic summarization provides a shorter version of a text and has been addressed by researchers on different levels and by different methods. Here we describe an approach to automatic summarization of text documents developed in (Leskovec et al., 2004) to illustrate natural language intensive Text mining. Natural language processing of the text is applied to obtain information about the semantic structure of the text and to generate its graph representation. In the next step, natural language properties as well as the structure of the document graph are used by machine learning to construct a model for selecting important parts of the text. The approach is based on exploiting the semantic structure of the text represented by a semantic graph (a graph constructed from Subject-Predicate-Object triples extracted from the sentences within the document after applying co-reference resolution). Machine learning provides a ranking of the triples that appear in a semantic graph, and the highly ranked triples are selected for the summary. An example document graph is presented in Figure 2. Notice that document graphs can also be used for visualizing the content of the document.
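A minimal sketch of the graph side of this approach: given already extracted subject–predicate–object triples (the linguistic extraction and co-reference resolution are not shown), a document graph is built and triples are ranked by a simple node-centrality heuristic. The triples are hypothetical, and the actual system learns the ranking with machine learning rather than using the degree heuristic shown here.

```python
# Sketch: build a document graph from (subject, predicate, object) triples and
# pick a summary by a simple centrality heuristic (the paper learns this ranking).
from collections import Counter

triples = [  # hypothetical, pre-extracted triples
    ("earthquake", "hit", "city"),
    ("earthquake", "destroyed", "houses"),
    ("rescuers", "searched", "houses"),
    ("government", "sent", "rescuers"),
    ("earthquake", "measured", "magnitude 6.5"),
]

# node degree in the (undirected) document graph
degree = Counter()
for s, _, o in triples:
    degree[s] += 1
    degree[o] += 1

def score(triple):
    s, _, o = triple
    return degree[s] + degree[o]

summary = sorted(triples, key=score, reverse=True)[:3]
for s, p, o in summary:
    print(f"{s} --{p}--> {o}")
```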
3.2 Automatic Lemmatization based on a Trained Model
Lemmatization is the process of finding normalized forms of words, called lemmas.
Fig. 1. Screenshot of the system's GUI, captured after the user visited several pages searching the Web. The left light-blue part is the developed toolbar offering additional functionality to the browser, while the rest is the usual interface of a Web browser. The screenshot shows the automatically generated topic ontology of the user's interests (top) and the list of keywords that corresponds to the selected topic (bottom). The user's most recent interest is highlighted in red (the brighter, the more relevant).
Fig. 2. An example summary for a news item on an earthquake in Iran.
For instance, the words computes, computing and computed are via lemmatization all mapped to the infinitive of the verb: compute. Lemmatization is an important preprocessing step for many applications dealing with text, including information retrieval, text mining, and applications of linguistics and natural language processing, especially when dealing with highly inflected natural languages. One of the prototypes from this area was developed as a part of the European project on a Superpeer Semantic Search Engine, ALVIS. It is based on using machine learning methods to build a model for lemmatization from pre-annotated data. The system is set up as a generally accessible Web service for lemmatization of Slovenian text (Plisson et al., 2005). It can be directly used as pre-processing for standard document classification and clustering, where lemmatization is crucial when dealing with texts written in a highly inflected natural language. Many languages in the world do not have stemmers and lemmatizers yet, and it is therefore important to be able to create such models automatically from language data. For instance, the Slovenian language has approx. 20 inflected words (different surface forms) per normalized word, while this number is much lower for, e.g., English (approx. 5 to 1).
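A minimal sketch of learning lemmatization from pre-annotated (word, lemma) pairs: each training pair yields a suffix-transformation rule, and at prediction time the longest matching suffix decides. The tiny English training sample is hypothetical; the actual service is trained on annotated Slovenian data with a more elaborate machine learning method.

```python
# Sketch: a tiny suffix-rule lemmatizer trained from annotated (word, lemma) pairs.
# Toy English training data; the real service is trained on annotated Slovenian text.
def common_prefix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def train(pairs):
    """Map word suffix -> (number of characters to strip, string to append)."""
    rules = {}
    for word, lemma in pairs:
        k = common_prefix_len(word, lemma)
        rules[word[k:]] = (len(word) - k, lemma[k:])
    return rules

def lemmatize(word, rules):
    # apply the rule of the longest matching suffix, if any
    for i in range(len(word)):
        suffix = word[i:]
        if suffix in rules:
            strip, append = rules[suffix]
            return word[:len(word) - strip] + append
    return word

training = [("computes", "compute"), ("computing", "compute"),
            ("computed", "compute"), ("houses", "house"), ("running", "run")]
rules = train(training)
print(lemmatize("computing", rules))  # -> compute
print(lemmatize("walks", rules))      # -> walk
```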
3.3 Providing Language Independent Document Representation
Even though most Text mining approaches handle documents written in a single natural language (in most cases English), there are many situations where the multilinguality of documents causes difficulties for text processing. Some of the most important examples are multilingual document retrieval, classification, clustering, etc., where one of the basic building blocks is calculating the similarity between documents written in different languages.
Standard similarity measures used on documents written in the same language would proclaim two documents with the same content but written in two different languages totally different. What is needed is a way of representing documents so that documents written in different languages but with similar content lie close to each other. A solution giving good results for this problem is Canonical Correlation Analysis, a technique for finding common semantic features between different views of data. To illustrate with an example from (Fortuna, 2004): let us have a document collection in English that is translated into German, providing an aligned corpus. The output of the method on this dataset is a semantic space where each dimension shares a similar English and German meaning. By mapping any English or German document into this space, language-independent representations are obtained. In this way, standard machine learning algorithms can be used on multilingual datasets. For instance, these are two pairs of aligned eigenvectors for the German and English languages automatically generated from Reuters news: (zentralbank - bank; bp - bp; milliarde - central; dollar - dollar), (verlust - loss; einkommen - income; firma - company; viertel - quarter).
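A sketch of the shared-space idea, assuming scikit-learn is available: bag-of-words matrices for a small aligned English–German "corpus" (hypothetical toy sentences) are mapped into a common space with CCA, after which documents from either language can be compared directly.

```python
# Sketch: a shared English-German semantic space via Canonical Correlation Analysis
# on an aligned corpus (toy documents; scikit-learn's CCA is assumed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cross_decomposition import CCA
from sklearn.metrics.pairwise import cosine_similarity

english = ["the central bank raised rates", "quarterly loss for the company",
           "bank reports record income", "company income fell this quarter"]
german  = ["die zentralbank erhoehte die zinsen", "quartalsverlust fuer die firma",
           "bank meldet rekord einkommen", "einkommen der firma sank dieses quartal"]

vec_en, vec_de = CountVectorizer(), CountVectorizer()
X_en = vec_en.fit_transform(english).toarray()
X_de = vec_de.fit_transform(german).toarray()

cca = CCA(n_components=2)
Z_en, Z_de = cca.fit_transform(X_en, X_de)

# aligned documents should end up close to each other in the shared space
print(cosine_similarity(Z_en, Z_de).round(2))
```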
3.4 Visualization of Document Collection
Visualization of a set of text documents is a very useful tool for finding the main topics that the documents from this set talk about. For example, given a set of descriptions of European research and development projects funded under IT, document visualization can reveal the main areas that these projects cover, such as semantic web, e-learning, security, etc. In order to visually represent text documents, they need to be represented in a more abstract way. This can be done by first extracting the main concepts from the documents using Latent Semantic Indexing and then using this information to position the documents in two dimensions. One prototype for visualization of a document collection was developed in the European project on Semantic Web (SEKT) as a part of the Text Garden software tools for Text mining (Grobelnik and Mladenic, 2005). Figure 3 gives an example on the data of European research projects.
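A sketch of this visualization pipeline, assuming scikit-learn: documents are represented with TF-IDF, the main concepts are extracted with a truncated SVD (the linear-algebra core of Latent Semantic Indexing), and the first two components give 2-D coordinates for a document map. The project descriptions are hypothetical.

```python
# Sketch: position documents in two dimensions via TF-IDF + truncated SVD (LSI-style),
# as a basis for a document-collection map. Toy project descriptions, scikit-learn assumed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

projects = [
    "semantic web ontologies for knowledge management",
    "e-learning platforms for schools",
    "network security and intrusion detection",
    "ontology learning from text for the semantic web",
    "adaptive e-learning and user modelling",
]

X = TfidfVectorizer().fit_transform(projects)
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

for text, (x, y) in zip(projects, coords):
    print(f"({x:+.2f}, {y:+.2f})  {text}")
# the 2-D coordinates can then be fed to any plotting library to draw the map
```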
4 Discussion
In this paper we have tried to illustrate the potential of Text mining by presenting several approaches applied to different kinds of data. For this purpose, we presented several research applications that were developed mainly inside European projects in collaboration with end-users, and several research prototypes that do not necessarily involve any end-users. The paper is biased towards our own work, but the hope is that this does not diminish its contribution to illustrating the capabilities of Text mining in research as well as in application development.
Fig. 3. Visualization of European research and development IST projects that started in 2004 (6FP).
Acknowledgements This work was supported by the Slovenian Research Agency and the IST Programme of the European Community under SEKT Semantically Enabled Knowledge Technologies (IST-1-506826-IP), under ALVIS Superpeer Semantic Search Engine (IST-1-002068-STP) and PASCAL Network of Excellence (IST-2002-506778).
References
BERENDT, B., HOTHO, A., MLADENIC, D., SOMEREN, M.W. van, SPILIOPOULOU, M., STUMME, G. (2003). A roadmap for web mining: from web to semantic web. In Web mining: from web to semantic web: First European Web Mining Forum - EWMF 2003 (Berendt, Hotho, Mladenic, Someren, Spiliopoulou, Stumme (eds.)), (Lecture Notes in Artificial Intelligence, Lecture Notes in Computer Science, vol. 3209). Berlin; Heidelberg; New York: Springer, 2004, pp. 1–22.
BISSON, G., NEDELLEC, C., CANAMERO, D. (2000). Designing clustering methods for ontology building: The Mo'K workbench. In Proceedings of the First
Workshop on Ontology Learning OL-2000. The 14th European Conference on Artificial Intelligence ECAI-2000.
CHAKRABARTI, S. (2002). Mining the Web: Analysis of Hypertext and Semi Structured Data, Morgan Kaufmann.
DUDA, R.O., HART, P.E., and STORK, D.G. (2000). Pattern Classification, 2nd edition, Wiley-Interscience.
FORTUNA, B. (2004). Kernel Canonical Correlation Analysis With Applications. Proceedings of the 7th International Multi-Conference Information Society IS-2004, Ljubljana: Jozef Stefan Institute, 2004.
FAYYAD, U., GRINSTEIN, G.G., and WIERSE, A. (editors) (2001). Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann.
FAYYAD, U., PIATETSKY-SHAPIRO, G., SMYTH, P., and UTHURUSAMY, R. (eds.) (1996). Advances in Knowledge Discovery and Data Mining. MIT Press, Cambridge, MA, 1996.
GRCAR, M., MLADENIC, D., GROBELNIK, M. (2005). User Profiling for Interest-focused Browsing History. Proceedings of the ESWC-2005 Workshop on End User Aspects of the Semantic Web.
GROBELNIK, M., and MLADENIC, D. (2002a). Approaching Analysis of EU IST Projects Database. In Proceedings of the IIS 2002, 13th International Conference on Information and Intelligent Systems (eds. Aurer, B. and Lovrencic, A.), Varazdin, Croatia, Faculty of Organization and Informatics; Zagreb, University of Zagreb, pp. 57–61.
GROBELNIK, M., and MLADENIC, D. (2002b). Efficient visualization of large text corpora. In Proceedings of the 7th TELRI seminar. Dubrovnik, Croatia.
GROBELNIK, M., and MLADENIC, D. (2003). Analysis of a database of research projects using text mining and link analysis. In: Data mining and decision support: integration and collaboration (Mladenic, Lavrac, Bohanec and Moyle (eds.)), (The Kluwer international series in engineering and computer science, SECS 745). Boston; Dordrecht; London: Kluwer Academic Publishers, 2003, pp. 157–166.
GROBELNIK, M., MLADENIC, D. (2004). Visualization of news articles. Informatica journal, 2004, vol. 28, no. 4.
GROBELNIK, M., MLADENIC, D. (2005). TextGarden software library http://www.textmining.net/, Release January 2005.
HAND, D.J., MANNILA, H., SMYTH, P. (2001). Principles of Data Mining (Adaptive Computation and Machine Learning), MIT Press.
HASTIE, T., TIBSHIRANI, R., and FRIEDMAN, J.H. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, Springer Verlag.
HAVRE, S., HETZLER, B., NOWELL, L. (2000). ThemeRiver: Visualizing Theme Changes over Time. Proceedings of the IEEE Symposium on Information Visualization INFOVIS-2000 (isbn: 0-7695-0804-9), 115 pages, IEEE Computer Society, Washington, DC, USA.
JORGE, A., ALVES, M.A., GROBELNIK, M., MLADENIC, D., PETRAK, J. (2003). Web site access analysis for a national statistical agency. In: Data mining and decision support: integration and collaboration (Mladenic, Lavrac, Bohanec and Moyle (eds.)), (The Kluwer international series in engineering and computer science, SECS 745). Boston; Dordrecht; London: Kluwer Academic Publishers, 2003, pp. 167–176.
62
D. Mladeniˇc
KOHONEN, T., KASKI, S., LAGUS, K., SALOJARVI, J., PAATERO, V., SAARELA, A. (2000). Organization of a Massive Document Collection, IEEE Transactions on Neural Networks, Special Issue on Neural Networks for Data Mining and Knowledge Discovery, 11:3, pp.574–585. LESKOVEC, J., GROBELNIK, M., MILIC-FRAYLING, N. (2004). Learning Substructures of Document Semantic Graphs for Document Summarization. In Workshop on Link Analysis and Group Detection (LinkKDD2004). The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. MAEDCHE, A., STAAB, S. (2001). Discovering conceptual relations from text. In Proceedings of European Conference on Artificial Intelligence ECAI-2000, pp.321–325. MANI, I., MAYBURY, M.T. (editors), (1999). Advances In Automatic Text Summarization, MIT Press. MANNING, C.D., SCHUTZE, H. (2001).Foundations of Statistical Natural Language Processing, The MIT Press, Cambridge, MA. MITCHELL, T.M. (1997). Machine Learning. The McGraw-Hill Companies, Inc. MLADENIC, D. (2002). Web browsing using machine learning on text data, In (ed. Szczepaniak, P. S.), Intelligent exploration of the web, 111, Physica-Verlag, 288-303. MLADENIC, D., GROBELNIK, M. (2003). Text and Web Mining. In: Data mining and decision support : integration and collaboration (Mladenic, Lavrac, Bohanec and Moyle (eds.)), (The Kluwer international series in engineering and computer science, SECS 745). Boston; Dordrecht; London: Kluwer Academic Publishers, 2003, pp.13–14. MLADENIC, D., GROBELNIK, M. (2003). Feature selection on hierarchy of web documents. Journal of Decision Support Systems, 35(1): 45-87. MLADENIC, D., and LAVRAC, N. (eds.), (2003). Data Mining and Decision Support for Business Competitiveness: A European Virtual Enterprise : Results of the Sol-Eu-Net Project : January 2000-March 2003, (Sol-Eu-Net, IST-199911495). 1st ed. Ljubljana: DZS, 2003. XII, 132 pages, ilustr. MLADENIC, D., GROBELNIK, M. (2004). Mapping documents onto web page ontology. In: Web mining : from web to semantic web (Berendt, B., Hotho, A., Mladenic, D., Someren, M.W. Van, Spiliopoulou, M., Stumme, G., eds.), Lecture notes in artificial inteligence, Lecture notes in computer science, vol. 3209, Berlin; Heidelberg; New York: Springer, 2004, pp.77–96. PLISSON, J., MLADENIC, D., LAVRAC, N., ERJAVEC, T. (2005). A LemmatizationWeb Service Based on Machine Learning Techniques. Proceedings of the 7th International 2nd Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, April 21-23, 2005, Poznan, Poland. RIJSBERG, C.J., van (1979), Information Retrieval, Butterworths. SEBASTIANI, F., (2002), Machine Learning for Automated Text Categorization, ACM Computing Surveys, 34:1, pp.1–47. STEINBACH, M., KARYPIS, G., and KUMAR, V. (2000). A comparison of document clustering techniques. Proc. KDD Workshop on Text Mining. (eds. Grobelnik, M., Mladenic, D. and Milic-Frayling, N.), Boston, MA, USA, 109–110. WITTEN, I.H., FRANK, E. (1999) Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann.
Identification of Real-world Objects in Multiple Databases
Mattis Neiling
Technische Universität Berlin,
[email protected]
Abstract. Object identification is an important issue for the integration of data from different sources. The identification task is complicated if no global and consistent identifier is shared by the sources. Then, object identification can only be performed through the identifying information that the objects' data itself provides. Unfortunately, real-world data is dirty, hence identification mechanisms like natural keys mostly fail — we have to take care of the variations and errors in the data. Consequently, object identification can no longer be guaranteed to be fault-free. Several methods tackle the object identification problem, e.g. Record Linkage or the Sorted Neighborhood Method. Based on a novel object identification framework, we assessed data quality and evaluated different methods on real data. One main result is that scalability is determined by the applied preselection technique and the usage of efficient data structures. As another result we can state that Decision Tree Induction achieves better correctness and is more robust than Record Linkage.
1 Introduction
Assume that information from several databases is to be merged at the entity level; then the information referring to the same real-world objects has to be identified and put together. But often no unique identifiers are available from the sources, such as the Social Security Number (SSN) for American residents or the International Standard Book Number (ISBN) for print media. In this situation one has to use the identifying information available from the sources, however reliable or correct it may be. Previous publications the author contributed to stressed the importance of a generic framework for object identification, e.g. Neiling and Jurk (2003). As a result of our research, we developed a generic object identification framework, mainly consisting of three successive steps: Conversion, Comparison, and Classification. In addition, the framework covers (1) concepts for identification, (2) its software architecture, (3) data quality characteristics, (4) a preselection technique that ensures efficiency for large databases (incorporating suitable index structures), and (5) a prescription for evaluation, sampling, and quality criteria. Based on the framework, an evaluation of different methods of object identification became attainable. We applied extensive benchmarking of several methods on different real-world databases. The framework is described in Neiling and Lenz (2004) in the context of the next
German Census, which will basically be an Administrative Record Census. In this contribution, we will not review all the details of the framework; instead, we focus on data quality analysis, preselection, and sampling. The paper is structured as follows: After a review of historical developments, we sketch the general model in section 3. After discussing data quality in section 4, we introduce preselection techniques in section 5. In section 6 we present the results of our evaluation. We conclude with a short summary and an outlook towards further investigations.
2 Historical Development
Starting in the fifties of the last century, a methodology of Record Linkage was developed in the sixties and has been continuously improved ever since. It was successfully applied to personal information, mainly for statistical purposes such as census data and patient information. The research in this area was mainly focused on the improvement of the underlying Likelihood-Ratio Test, without any consideration of alternative methods such as machine learning algorithms. Independent of that development, duplicate detection received more and more attention from database researchers in the nineties. Their investigations were performance-driven — computational efficiency was their main goal. Until the end of the last century, the two approaches to object identification can be said to have been complementary: the two communities addressed the same problem in different terms. With the beginning of the twenty-first century, both research directions began to influence one another. Eventually, a methodology could be founded which considers computational and statistical efficiency on the one hand, and the use of learning algorithms on the other. In our work, we performed an exhaustive comparison of different learning methods. Record Linkage. Inspired by the work of H. Newcombe et al. (1959), the well-known model for Record Linkage was founded by I.P. Fellegi and A.B. Sunter (1969). Since then, the methodology has been continuously enhanced, cf. the proceedings of the two workshops Kilss and Alvey (1985) and Alvey and Jamerson (1997). For instance, the estimation of the multinomial distribution could be improved by means of variants of the EM-Algorithm, cf. Meng and Rubin (1993), Winkler (1993), Liu and Rubin (1994), and Yancey (2002). Further, powerful software packages were developed, cf. Winkler (2001), Bell and Sethi (2001), and Christen et al. (2004). A general overview of the state of Record Linkage can be found in Winkler (1999) and in Gu et al. (2003). Computational feasibility was a less investigated aspect of Record Linkage; only simple Blocking methods were used. Recently, other approaches, e.g. clustering, have been applied, cf. Baxter et al. (2003). Database management systems with their powerful indexes were not investigated — Record Linkage was mostly performed on plain files.
Fig. 1. Overview of Historical Development
Duplicate Detection in Databases. Research on duplicate detection in databases started with the seminal work of Bitton and DeWitt (1983) dealing with the removal of identical rows. Wang and Madnick (1989) were the first to discuss the identification problem for multiple databases. Hernandez and Stolfo (1995) invented the Sorted Neighborhood Method, which is widely used for de-duplication. Until the end of the twentieth century, there was no use of machine learning algorithms. Recently, many researchers have applied supervised learning methods like decision tree induction successfully to object identification, e.g. Neiling and Lenz (2000), Elfeky et al. (2002), and Bilenko and Mooney (2003).
3 The General Model for Object Identification
The identification procedure was introduced by Neiling and Lenz (2000), refined by Neiling and Jurk (2003), and works as follows:
1. Conversion: The identifying information is extracted from the original data for each element (e.g. records) and standardized.
2. Pair Construction and Comparison: Pairs of elements are built (at least virtually) that fulfill given preselection predicates, cf. section 5. The pairs are compared with sophisticated functions like the Minimum-Edit-Distance, N-Gram distances etc., or simply with comparison patterns for equal/missing/non-equal values.
3. Classification: Each comparison vector in the multi-dimensional comparison space is classified as matched or non-matched (possibly equipped with a score) by a previously induced decision rule δ.
The classifier δ can be defined manually (e.g. the decision rules of the Sorted Neighborhood Method, cf. Hernandez (1996)). Alternatively, it can be learned from given example data, i.e. a set of matches and non-matches. For example, within the Record Linkage method, the likelihood ratios λ : V → R≥0 (the so-called odds) are estimated and used as a classifier for comparison vectors v ∈ V:

    λ(v) = P(v | (a, b) is matched) / P(v | (a, b) is non-matched).
Large values of λ(v) indicate matches, small values indicate non-matches, and values around 1 are inconclusive. Given predefined error levels for misclassifications, decision bounds λl ≤ λu can be derived, and pairs with λ(v) ∈ [λl, λu] are left unclassified for screening, cf. Fellegi and Sunter (1969). Similarly, if any other classifier provides a score, the error rates can be controlled. This is an important feature, since the costs caused by the misclassification of a match are typically higher than vice versa. There are many suitable classification methods in the literature, e.g. Decision Tree Induction, k-Nearest Neighbor Classification, Support Vector Machines, Neural Networks, Bayes Classifiers, etc. The interested reader may consult textbooks about Machine Learning (e.g. Michie et al. (1994), or Berthold and Hand (1999)), or existing classification and data mining software. Obviously, the scales of the comparison space have to be considered for this choice. For instance, Record Linkage has been designed for a finite set of nominal values, thus ordinally scaled values are treated as nominal, with a loss of information. Decision tree learners, on the other hand, can deal with mixed scales and are therefore well-suited. Multiple combinations of classifiers have been studied by Tejada et al. (2001).
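To make the decision rule concrete, the following minimal sketch applies a Fellegi–Sunter-style likelihood-ratio classifier to binary comparison vectors. It assumes conditionally independent comparison attributes (so the odds factorize into per-attribute m/u-probabilities); the function names, probability values, and thresholds are illustrative and not taken from the paper.

```python
# Illustrative Fellegi-Sunter-style decision rule (sketch, not the paper's code).
# Assumes binary comparison vectors and conditionally independent attributes,
# so that the likelihood ratio factorizes into per-attribute m/u probabilities.

def likelihood_ratio(v, m_prob, u_prob):
    """v: comparison vector of 0/1 agreement indicators.
    m_prob[i] = P(agreement on attribute i | pair is matched),
    u_prob[i] = P(agreement on attribute i | pair is non-matched)."""
    ratio = 1.0
    for agree, m, u in zip(v, m_prob, u_prob):
        ratio *= (m / u) if agree else ((1.0 - m) / (1.0 - u))
    return ratio

def classify(v, m_prob, u_prob, lambda_lower, lambda_upper):
    """Return 'match', 'non-match', or 'screen' (left for clerical review)."""
    lam = likelihood_ratio(v, m_prob, u_prob)
    if lam >= lambda_upper:
        return "match"
    if lam <= lambda_lower:
        return "non-match"
    return "screen"

# Example with three comparison attributes (e.g. name, birth date, street).
m = [0.95, 0.90, 0.80]   # assumed agreement probabilities among matches
u = [0.05, 0.01, 0.10]   # assumed agreement probabilities among non-matches
print(classify([1, 1, 0], m, u, lambda_lower=0.5, lambda_upper=100.0))
```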
4 Data Quality
We assessed data quality and stated semantic constraints on the data, cf. Neiling et al. (2003) and Neiling (2004). These constraints determine the quality of attributes, especially regarding their identifying power. For instance, an attribute set that is stated to be an approximate key with high confidence would be an appropriate candidate for identification. Constraints can be stated for the attributes of single relations. Let A be a table with the attributes Y1, . . . , Ym, Y ⊂ {Y1, . . . , Ym}, and a, b ∈ A. Y(a) denotes the value(s) of the attribute(s) Y for the tuple a, and a ≡ b abbreviates that the tuples a and b are matched. dist : dom(Y) × dom(Y) → R≥0 denotes a distance measure on the domain of Y, and p ∈ (0, 1]. There are two concepts for keys, which are both modified towards an approximation in order to cope with dirty data. These keys can be determined from samples of pairs. A semantic key is an attribute set that identifies real-world objects in reality; in a database, however, it could fail, therefore we weaken it by means of conditional probabilities Pr(· | ·).
• Y is a semantic key if Y(a) = Y(b) ⇐⇒ a ≡ b.
• Y is an approximate key with confidence p if both accuracy := Pr(Y(a) = Y(b) | a ≡ b) ≥ p and confidence := Pr(a ≡ b | Y(a) = Y(b)) ≥ p.
• Y is a ∆-approximate key with confidence p if both ∆-accuracy := Pr(dist(Y(a), Y(b)) ≤ ∆ | a ≡ b) ≥ p and ∆-confidence := Pr(a ≡ b | dist(Y(a), Y(b)) ≤ ∆) ≥ p.
Differentiating keys are used to separate sets of objects: whenever the values differ, the objects cannot be considered equal. Consequently, these keys are useful for preselection, cf. section 5.
• Y is a differentiating key if Y(a) ≠ Y(b) =⇒ a ≢ b.
• Y is an approximate differentiating key with confidence p if ∆-anti-confidence := Pr(a ≢ b | dist(Y(a), Y(b)) > ∆) ≥ p.
Further constraints cope with the occurrence of missing values, the selectivity of attributes, or the expected number of duplicates between two subsets of records, cf. Neiling et al. (2003).
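As a concrete illustration of how such constraints can be checked, the following sketch estimates the accuracy and confidence of a candidate key from a labeled sample of record pairs. The record layout (dictionaries) and the sample format are assumptions made for the example, not part of the framework.

```python
# Estimate accuracy and confidence of a candidate key Y from a labeled sample
# of record pairs (illustrative sketch; the record and pair layout is assumed).

def key_quality(pairs, key_attrs):
    """pairs: iterable of (record_a, record_b, is_match), records as dicts.
    Returns (accuracy, confidence) estimates for the attribute set key_attrs."""
    match_total = match_key_equal = 0      # accuracy:  Pr(Y(a)=Y(b) | matched)
    key_equal_total = key_equal_match = 0  # confidence: Pr(matched | Y(a)=Y(b))
    for a, b, is_match in pairs:
        key_equal = all(a[y] == b[y] for y in key_attrs)
        if is_match:
            match_total += 1
            match_key_equal += key_equal
        if key_equal:
            key_equal_total += 1
            key_equal_match += is_match
    accuracy = match_key_equal / match_total if match_total else 0.0
    confidence = key_equal_match / key_equal_total if key_equal_total else 0.0
    return accuracy, confidence

# Toy sample of labeled pairs.
sample = [
    ({"name": "Smith", "zip": "10115"}, {"name": "Smith", "zip": "10115"}, True),
    ({"name": "Smith", "zip": "10115"}, {"name": "Smith", "zip": "10245"}, False),
    ({"name": "Meier", "zip": "80331"}, {"name": "Meyer", "zip": "80331"}, True),
]
print(key_quality(sample, key_attrs=["name", "zip"]))
```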
5 Pair Construction/Preselection of Pairs
To be efficient for large databases, preprocessing is applied. Obviously it is unnecessary to compare all pairs — most of them can be omitted. But the question arises which pairs should be built for comparison. Different methods exist, cf. Baxter et al. (2003). Well known is the so-called Sorted Neighborhood Method, where the records are sorted w.r.t. a combined key and pairs are built for records that are at most k positions apart in the sorted order, cf. Hernandez (1996). The choice of a preselection was described as an optimization problem by Neiling and Müller (2001) and later revised by Neiling (2004). Let δ be a classifier for pairs of elements from two databases A1, A2. Within the preprocessing we avoid pairs of elements that are not likely to be matched. That is, we use a combination σ = ⋃_j (⋂_i σ_ij) of selectors σ_ij, where every σ_ij filters pairs from the cross product space A1 × A2. Then we can apply the composed classifier δ ∘ σ for object identification, reducing the number of pairs to check. The main idea behind a preselection is to employ approximate and differentiating keys efficiently. A preselection can be established on the results of the data analysis; the identified key attribute sets can be used for selectors.
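The following sketch shows one way such a preselection could be composed as a union of intersections of simple pair predicates, with the classifier applied only to preselected pairs. The predicates, attribute names, and data layout are purely illustrative assumptions.

```python
# Sketch: a preselection built as a union of intersections of simple selectors
# (blocking predicates), and a classifier applied only to preselected pairs.
# All predicates, attribute names, and the record layout are illustrative.

from itertools import product

def make_preselection(selector_groups):
    """selector_groups: list of groups, each group a list of pair predicates.
    A pair passes if for at least one group it satisfies all predicates of that
    group, i.e. sigma = union over groups of the intersection of its selectors."""
    def sigma(a, b):
        return any(all(pred(a, b) for pred in group) for group in selector_groups)
    return sigma

def identify(A1, A2, sigma, classify):
    """Composed classifier: only pairs passing sigma are compared and classified."""
    return [(a, b) for a, b in product(A1, A2) if sigma(a, b) and classify(a, b)]

# Illustrative selectors: (same zip AND same name initial) OR same birth date.
same_zip = lambda a, b: a["zip"] == b["zip"]
same_initial = lambda a, b: a["name"][:1] == b["name"][:1]
same_birth = lambda a, b: a["birth"] == b["birth"]
sigma = make_preselection([[same_zip, same_initial], [same_birth]])
```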
Each selector σ has processing costs, a selection rate estimating the percentage of pairs selected from A1 × A2,

    sel_{A1×A2}(σ) := |σ(A1 × A2)| / |A1 × A2|,    (1)

and an error rate, quantifying the portion of the matches that are not selected,

    err_{A1×A2}(σ) := 1 − |{(a, b) ∈ σ(A1 × A2) : a ≡ b}| / |{(a, b) ∈ A1 × A2 : a ≡ b}|.    (2)
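For illustration, the quantities (1) and (2) can be estimated directly on two small (or sampled) databases, given a reference telling which pairs are truly matched; the following sketch assumes such an is_match oracle and a pairwise-callable selector.

```python
# Sketch: estimating the selection rate (1) and error rate (2) of a selector
# sigma on two small or sampled databases. The pair representation and the
# is_match reference oracle are assumptions made for this example.

from itertools import product

def selection_and_error_rate(A1, A2, sigma, is_match):
    """sigma(a, b) -> bool: pair passes the preselection.
    is_match(a, b) -> bool: reference information on true matches."""
    total_pairs = selected_pairs = 0
    total_matches = selected_matches = 0
    for a, b in product(A1, A2):
        total_pairs += 1
        selected = sigma(a, b)
        selected_pairs += selected
        if is_match(a, b):
            total_matches += 1
            selected_matches += selected
    sel = selected_pairs / total_pairs if total_pairs else 0.0
    err = 1.0 - selected_matches / total_matches if total_matches else 0.0
    return sel, err
```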
Generally speaking, a good preselection combines a low error rate with a considerably low selection rate, such that most non-matched pairs are discarded by default while only a few matched pairs are lost. Typically, the lower the selection rate, the better the performance of the whole identification task, since the main cost of object identification is determined by loading and processing pairs. But obviously there is a trade-off between the error rate and the selection rate. Thus, choosing a combined selector among a set of possible combinations of selectors becomes an optimization problem, whose solution can be found with search strategies such as branch-and-bound. Starting from the estimated selection and error rates of single selectors, the respective values for their combinations can be approximated with a heuristic. For instance, the selection rate of the union of two selectors lies between the maximum and the sum of their selection rates, such that we can choose the average as a heuristic, cf. Neiling (2004), Ch. 5. Different optimization problems can be defined, e.g. to minimize the error rate under processing-time constraints, or to minimize the selection rate while bounding the error rate:

    min_{σ∈Σ} sel(σ)  s.t.  err(σ) ≤ κ,

whereby Σ contains all combinations that can be constructed from the given selectors by union and intersection, like σ1 ∪ (σ2 ∩ σ3).

Example 1. A relational selector σ poses conditions on attribute values, e.g. requiring equality (this is sometimes called blocking), requiring containment of a value in a list, or limiting the variation of cardinally scaled attributes by some ∆ > 0. Index structures, such as bitmaps or tree-based structures, are available in database management systems and can be used to achieve efficient data access.

Example 2. A metrical selector σ poses conditions on attributes in terms of a given (multidimensional) metric dist(·, ·), e.g. the Minimum-Edit-Distance for strings. A metrical selector allows (1) the selection of the k nearest neighbors of an element, or (2) the selection of all elements within a ∆-environment for ∆ > 0.
Metrical index structures can be employed, e.g. the M-tree or the MVP-tree, cf. Ciaccia et al. (1997) and Bozkaya and Ozsoyoglu (1999). Canopy clustering could be applied as an alternative to an index, whereby a simple-to-compute 'rough' metric dist′ is used for clustering (dist′ satisfies for all x, y: if dist(x, y) ≤ ξ then also dist′(x, y) ≤ ξ′), cf. McCallum et al. (2000).

Claim 1. Let σ be a selector with approximately constant selection rate, i.e. for large sets A1 × A2 and A1′ × A2′ it holds that

    sel_{A1×A2}(σ) ≈ sel_{A1′×A2′}(σ).    (3)

Then its computational complexity increases quadratically with the maximal size of the databases, written O(n²).

Claim 2. Let σ be a selector where the number of pairs to build per record is bounded by some fixed k ∈ IN, i.e. for any a ∈ A1 and large sets A2 it holds that

    |σ({a} × A2)| ≤ k.    (4)

Then its computational complexity increases linearly with the maximal size of the databases, written O(n).

The proofs of the claims can be found in Neiling (2004), Ch. 6. It follows immediately:

Proposition 1. A k-Nearest Neighbor selector has linear complexity.

Proposition 2. Let the domain of an attribute set Y be bounded.¹ Then a relational selector based on Y has quadratic complexity.

¹ Bounded means, for a continuously scaled domain, that it is bounded by an interval, while for other domains it means that the number of possible values is limited.

Nevertheless, selectors with quadratic complexity are required to guarantee small error rates for large databases. It will not be sufficient to limit the number of comparisons per record if the database size increases. Moreover, if the number of similar records exceeds such a limit, not all possible pairs will be built. For instance, if the preselection contains pairs where the last and first names are equal, there might be too many records of persons named John Smith. In practice, the suitable number of pairs to build for a record depends on its values and should not be limited in advance. Special attention is paid to the sampling procedure, since it is strongly related to the preselection.

Sampling. The correctness of an induced classifier depends on the chosen sample it was learned from. Differently from standard learning problems, we do not have a set of instances readily available. Instead, we have to create samples of pairs from a given database and assign the labels 'match'/'non-match'
to them afterwards. The label assignment should be based on a reference lookup table of matched pairs that is either constructed manually beforehand or provided together with a benchmark data set (e.g. we were given the references for the address database). We apply stratified sampling with strata for matched and non-matched pairs, respectively. Parameters for sampling are the sample size N, the (small) portion N1/N of random pairs sampled from the whole cross product space, the portion N2/N of random pairs taken from the preselection, and the portion N3/N of matched pairs that shall be contained in the sample. Obviously, N = N1 + N2 + N3 holds. If N3 = 0, the number of matched pairs is not controlled and could consequently vary (in this case it depends on the likelihood of randomly selecting matched pairs). We applied stratified sampling as follows:
1. Create one stratum S1 of random pairs of size N1 from the whole cross product space.
2. Create one stratum S2 of random pairs of size N2 out of the preselection.
3. Assign the correct labels to the pairs in S1 ∪ S2.
4. Determine the number n of matched pairs that are already contained in S1 ∪ S2, and add n further (but only non-matched) pairs out of the preselection.² Stop if the sample size N is reached.
5. Create a stratum S3 by adding max(0, N3 − n) random pairs out of the reference set of matched pairs.
To apply supervised learning, the samples have to be split into learning and test samples, again with the possibility to constrain the strata above, e.g. to require that the proportions of matches and non-matches are equal for both. Although the sampling may seem overly complicated for our purposes, there are no alternatives, as we argue in the following.
• It is absolutely necessary to consider pairs out of the preselection for sampling, since the induced classifier will be applied to exactly such pairs afterwards. If the samples were generated differently, any learned classifier would be biased. In fact, if the sample were chosen from a superset of the preselection, decision rules voting for matches in regions outside of the preselection could never be applied. On the other hand, if the sample were chosen from a subset of the preselection, the induced classifier would have to be applied to regions it had not been learned from, such that no prediction accuracy could be guaranteed.
• The supplement with a few randomly generated pairs from the whole cross product space is appropriate, since a preselection with high selectivity excludes many negative examples, while the inclusion of some of them might lead to sharper classifiers. If only pairs with similar values are filtered, a learner might be improved by the supplemented pairs. Our experience shows that a portion of about 5–10% works well.
• Controlling the portion of matched pairs is important, since the likelihood of randomly selecting a matched pair (even from the preselection) is usually very small. Thus, the portion of matched pairs would be small, which would be problematic for learners that cannot cope adequately with skewed class distributions. Typically, the portion N3/N is set to 1/2, such that the samples are well-balanced.

² We can choose pairs from the preselection only, since there is nearly no chance to get a matched pair from the cross product space at random.
The main drawback of this sampling procedure lies in its dependency on the chosen preselection. Therefore, the preselection should cover almost all matches and at the same time exclude most of the non-matches. This goal can be achieved if the identifying as well as the discriminating attributes are detected by means of data quality analysis and the preselection is chosen as the solution of an optimization problem as sketched above.
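The five-step procedure above translates almost directly into code. The following sketch is one possible rendering under simplifying assumptions (hashable record pairs, an is_match reference oracle, pair pools small enough to hold in memory); it is not the implementation used in the evaluation.

```python
# Sketch of the stratified sampling procedure (strata S1, S2, S3) described
# above. Pair pools, the reference set of matches, and the is_match oracle are
# assumed to be given; names and data layout are illustrative.

import random

def stratified_sample(cross_product_pairs, preselected_pairs, reference_matches,
                      is_match, N1, N2, N3):
    N = N1 + N2 + N3
    S1 = random.sample(cross_product_pairs, N1)   # stratum 1: whole cross product
    S2 = random.sample(preselected_pairs, N2)     # stratum 2: preselection
    sample = [(a, b, is_match(a, b)) for a, b in S1 + S2]  # step 3: assign labels
    n = sum(1 for _, _, m in sample if m)         # step 4: matches found so far
    for a, b in preselected_pairs:                # add n further non-matched pairs
        if len(sample) >= N or n <= 0:
            break
        if not is_match(a, b) and (a, b) not in S1 + S2:
            sample.append((a, b, False))
            n -= 1
    n_matches = sum(1 for _, _, m in sample if m)
    # step 5: stratum S3, top up with matched pairs from the reference set
    for a, b in random.sample(reference_matches, max(0, N3 - n_matches)):
        sample.append((a, b, True))
    return sample
```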
6 Evaluation
We selected the methods Record Linkage, Decision Tree Induction, and Association Rule-based Classification. The methods were tested on several samples of different sizes drawn from three databases: address data, apartment advertisements, and bibliographic data. Different parameters were set for the classification models, e.g. the attributes (and respective comparison functions) to be taken into consideration (ranging from 4 to 14 attributes), the measure to be applied by the Decision Tree Learner (information gain, information gain ratio, or Gini index) and the pruning strategy, the interaction model for Record Linkage, and the conflict resolution strategy for Association Rule-based Classification. We specified between 6 and 12 different classification models per method. We present results for the address data: the database consists of 250,000 records and provides information on name, address, and birth date of German customers. We assessed the correctness of the induced classifiers on test samples by means of the False Negative Rate, which indicates the portion of undetected matches, and the False Positive Rate, estimating the misclassification rate for non-matches. The scatter plot in figure 2 displays the results of the three classification models that performed best among each of the tested methods. We can state that the Decision Tree classifier outperformed the other classifiers. It can also be seen that for the larger samples the classifiers became more accurate. The exception is Association Rule-based Classification, which did not improve with increasing sample size. Decision Trees were quite robust w.r.t. their parameterization: regardless of the chosen measure and the pruning strategy, all classifiers behaved well, with slightly better results if pruning was disabled, and the best measure was information gain ratio.
Fig. 2. Correctness results of three induced classifiers.
Decision Tree Induction is capable of coping with all attributes at once, while the other methods worked well only if fewer than 6 attributes were considered. The accuracy of the other methods depends on their parameterization: the more accurately the interaction model is specified for Record Linkage, the more accurate the estimator of the multinomial distribution will be. In particular, the number of attributes used for learning had an impact on the accuracy. Record Linkage works well for correctly specified interaction models and not too many attributes. Association Rule-based Classification does not seem to be stable enough in general, but could be used to control one of the error rates efficiently. We conclude that, without human expertise, only Decision Tree Induction yields sufficient accuracy. Unfortunately, it does not allow the error rates to be controlled, i.e. the False Negative Rate to be bounded. This feature is required by many object identification applications. The other methods support it, since they provide a score for each pair. Record Linkage, for instance, allows the False Negative Rate to be reduced by lowering the bound λl for the likelihood ratio (cf. section 3). From the set of derived fine-grained association rules, classifiers can be constructed that minimize one of the two error rates.
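To illustrate how a score-providing classifier supports this kind of control, the following sketch picks a decision threshold on a labeled test sample so that the estimated False Negative Rate stays below a given bound; the scores, labels, and function name are illustrative assumptions.

```python
# Sketch: choosing a decision threshold for a score-based classifier (e.g. the
# likelihood ratio) such that the False Negative Rate on a labeled sample stays
# below a given bound. Scores, labels, and names are illustrative.

def threshold_for_fnr(scores, labels, max_fnr):
    """scores: classifier scores per pair (higher means more likely a match).
    labels: True for matched pairs, False for non-matched pairs.
    Returns the largest threshold whose FNR on the sample is <= max_fnr."""
    match_scores = sorted(s for s, is_match in zip(scores, labels) if is_match)
    n_matches = len(match_scores)
    best = float("-inf")
    for t in match_scores:
        false_negatives = sum(1 for s in match_scores if s < t)
        if false_negatives / n_matches <= max_fnr:
            best = t          # classifying score >= t as 'match' keeps FNR low
        else:
            break
    return best

# Example: lower the bound until at most 5% of the matches are missed.
scores = [0.2, 3.5, 0.8, 120.0, 45.0, 0.1]
labels = [False, True, False, True, True, False]
print(threshold_for_fnr(scores, labels, max_fnr=0.05))
```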
7 Summary and Outlook
We developed a universal framework for object identification. Attributes can be selected for the classification and for the preselection based on data quality analysis. Object identification is perceived as a specific classification problem.
Different learning methods can be applied; as an example, we compared three of them. Our evaluation showed that Decision Tree Induction is well suited for object identification. Moreover, it yielded higher accuracy and was more robust than the other methods. But it fails to control the error rates, a feature which is provided by the other investigated methods, Record Linkage and Association Rule-based Classification. The creation of benchmark databases is a main challenge for the research community. For instance, we have made the apartment advertisements database available to other researchers. This framework lays the foundation for future research; other approaches could be tested based on it. For instance, which conditions have to be fulfilled so that unsupervised learning (which does not need labelled samples at all) can be applied successfully? And how could interactive learning (e.g. the incorporation of expert suggestions and relevance feedback) or incremental learning (e.g. stepwise improvement over time) be applied?
References
ALVEY, W., and JAMERSON, B. (Eds.) (1997): Record Linkage Techniques — 1997. Int. Workshop, Arlington, Virginia.
BAXTER, R., CHRISTEN, P., and CHURCHES, T. (2003): A comparison of fast blocking methods for record linkage. Workshop on Data Cleaning, Record Linkage and Object Consolidation at the KDD 2003, Washington DC.
BELL, G.B., and SETHI, A. (2001): Matching records in a national medical patient index. Communications of the ACM 44(9), 83–88.
BERTHOLD, M., and HAND, D.J. (Eds.) (1999): Intelligent Data Analysis: An Introduction. New York: Springer.
BILENKO, M., and MOONEY, R. (2003): Adaptive duplicate detection using learnable string similarity measures. KDD Conf. 2003, Washington DC.
BITTON, D., and DeWITT, D.J. (1983): Duplicate record elimination in large data files. ACM TODS 8(2), 255–265.
BOZKAYA, T., and ÖZSOYOGLU, Z.M. (1999): Indexing large metric spaces for similarity search queries. ACM TODS 24(3), 361–404.
BREIMAN, L., FRIEDMAN, J., OLSHEN, R., and STONE, C. (1984): Classification and regression trees. Chapman & Hall.
CHRISTEN, P., CHURCHES, T., and HEGLAND, M. (2004): Febrl — a parallel open source data linkage system. PAKDD, LNCS 3056, 638–647.
CIACCIA, P., PATELLA, M., and ZEZULA, P. (1997): M-tree: An efficient access method for similarity search in metric spaces. VLDB 1997, 426–435.
ELFEKY, M.G., VERYKIOS, V.S., and ELMAGARMID, A.K. (2002): Tailor: A record linkage toolbox. ICDE 2002, San Jose.
FELLEGI, I.P., and SUNTER, A.B. (1969): A theory of record linkage. Journal of the American Statistical Association, 64, 1183–1210.
GALHARDAS, H., FLORESCU, D., SHASHA, D., SIMON, E., and SAITA, C.-A. (2001): Declarative data cleaning: Language, model and algorithms. VLDB 2001.
GU, L., BAXTER, R., VICKERS, D., and RAINSFORD, C. (2003): Record Linkage: Current practice and future directions. Technical Report 03/83, CSIRO Mathematical and Information Sciences, Canberra, Australia.
HERNANDEZ, M.A. (1996): A Generalization of Band Joins and The Merge/Purge Problem. PhD thesis, Columbia University.
HERNANDEZ, M.A., and STOLFO, S.J. (1995): The merge/purge problem for large databases. ACM SIGMOD Conf. 1995, 127–138.
JARO, M.A. (1989): Advances in record-linkage methodology as applied to matching the census of Tampa, Florida. JASA 84(406), 414–420.
LIM, E.-P., SRIVASTAVA, J., PRABHAKAR, S., and RICHARDSON, J. (1993): Entity Identification in Database Integration. ICDE 1993, 294–301.
LIU, C., and RUBIN, D.B. (1994): The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence. Biometrika 81(4), 633–648.
McCALLUM, A., NIGAM, K., and UNGAR, L.H. (2000): Efficient clustering of high-dimensional data sets with application to reference matching. KDD 2000, New York, USA, 169–178.
MENG, X.-L., and RUBIN, D.B. (1993): Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika 80(2), 267–278.
MICHIE, D., SPIEGELHALTER, D.J., and TAYLOR, C.C. (1994): Machine learning, neural and statistical classification. New York: Horwood.
NEILING, M. (2004): Identifizierung von Realwelt-Objekten in multiplen Datenbanken. Dissertation, Techn. Universität Cottbus, 2004.
NEILING, M., JURK, S., LENZ, H.-J., and NAUMANN, F. (2003): Object identification quality. Workshop on Data Quality in Cooperative Information Systems, Siena.
NEILING, M., and JURK, S. (2003): The Object Identification Framework. Workshop on Data Cleaning, Record Linkage and Object Consolidation at the KDD 2003, Washington DC.
NEILING, M., and LENZ, H.-J. (2000): Data integration by means of object identification in information systems. ECIS 2000, Vienna, Austria.
NEILING, M., and LENZ, H.-J. (2004): The German Administrative Record Census — An Object Identification Problem. Allg. Stat. Arch. 88, 259–277.
NEILING, M., and MÜLLER, R. (2001): The good into the pot, the bad into the crop. Preselection of record pairs for database integration. Workshop DBFusion 2001, Gommern, Germany.
NEWCOMBE, H.B., KENNEDY, J.M., AXFORD, S.J., and JAMES, A.P. (1959): Automatic linkage of vital records. Science 130, 954–959.
TEJADA, S., KNOBLOCK, C.A., and MINTON, S. (2001): Learning object identification rules for information integration. Information Systems 26(8).
VERYKIOS, V., ELMAGARMID, A., and HOUSTIS, E. (2000): Automating the approximate record matching process. J. Information Sciences 126, 83–98.
WANG, Y.R., and MADNICK, S.E. (1989): The inter-database instance identification problem in integrating autonomous systems. ICDE 1989, 46–55.
WINKLER, W.E. (1993): Improved decision rules in the Fellegi-Sunter model of record linkage. The Research Report Series, U.S. Bureau of the Census.
WINKLER, W.E. (1999): The state of record linkage and current research problems. Statistical research report series, U.S. Bureau of the Census, Washington D.C.
WINKLER, W.E. (2001): Record linkage software and methods for merging administrative lists. Statistical research report series, U.S. Bureau of the Census.
YANCEY, W. (2002): Improving parameter estimates for record linkage parameters. Section on Survey Research Methodology, American Stat. Association.
Kernels for Predictive Graph Mining
Stefan Wrobel¹,², Thomas Gärtner¹, and Tamás Horváth¹
¹ Fraunhofer AIS, Schloss Birlinghoven, D-53754 Sankt Augustin, Germany
² Department of Computer Science III, University of Bonn, Germany
Abstract. In many application areas, graphs are a very natural way of representing structural aspects of a domain. While most classical algorithms for data analysis cannot directly deal with graphs, there has recently been increasing interest in approaches that can learn general classification models from graph-structured data. In this paper, we summarize and review the line of work that we have been following in recent years on making a particular class of methods suitable for predictive graph mining, namely the so-called kernel methods. Firstly, we state a result on fundamental computational limits to the possible expressive power of kernel functions for graphs. Secondly, we present two alternative graph kernels, one based on walks in a graph, the other based on cycle and tree patterns. The paper concludes with an empirical evaluation on a large chemical data set.
1 Introduction
Over the past years, computers have become an integral part of most activities and processes in business, administration, science and even everyday life. This enables us to keep detailed and persistent records of what has happened. Data mining, or knowledge discovery in databases, is the interdisciplinary field concerned with computer algorithms and systems for analyzing the resulting data sets in order to discover useful knowledge. Consider a data set created by performing laboratory experiments on a large number of chemical substances, recording for each whether it is active against a particular disease or not. As we will see below, from such data it is possible to automatically induce classifiers capable of recognizing active substances with high predictive performance, thus offering the possibility of quickly screening unknown substances to reduce the amount of expensive laboratory experiments. In such applications it is quite natural to look at the domain of discourse as consisting of objects of different types that can be linked to each other in several ways: the atoms of a molecule are linked by different types of bonds, the physical parts of an artifact have spatial or functional relationships, the pages of the World Wide Web are connected by hyperlinks, the intersections in a city are linked by street segments, and so on. The natural representation for such domains are graphs consisting of vertices (the atoms, parts, pages, intersections) and edges (the bonds, functional relationships, hyperlinks, and streets). In addition, the vertices and edges of the graphs may have labels to represent properties of objects and/or links.
Perhaps surprisingly, it is very difficult for many classical data mining methods to handle graph-structured data, since these methods are limited to input in the form of a single table where each object is represented by one row having a fixed number of columns to record properties of objects. This representation allows neither objects of different types nor relationships between objects. Therefore, in recent years, research efforts have been intensified in order to develop new and extended data mining methods that are capable of directly handling graph-structured data. If, as in the above example involving the activity of substances, the goal of analysis is to discover a classifier capable of predicting the properties of previously unseen graphs (or of previously unseen vertices of a graph), the resulting data analysis task is referred to as predictive graph mining. In this paper, we summarize and review the line of work that we have recently been following on making a particular class of methods suitable for predictive graph mining, namely kernel methods (see, e.g., [18]). In other areas of data mining, kernel methods have become enormously popular due to their nice theoretical and computational properties, and their empirical performance, which often beats other methods. Kernel methods, such as the support vector machine (SVM) [20], are centrally based on the concept of kernel functions, which (intuitively and somewhat imprecisely stated) are functions that compute the "similarity" of two domain objects. In our case, the objects to be compared will be entire graphs, such as for example a chemical molecule. In other words, we wish to predict the properties of entire graphs (each object is graph-structured) and not properties of vertices within a graph (see, e.g., [12]). The remainder of the paper is based on the work originally reported in [5–7,10,11]. In Section 2, we give a brief introduction to graphs and kernel methods for graphs and discuss some aspects of computational complexity. In Section 3, we describe our first approach to computationally efficient graph kernel functions, originally published in [7], which is based on comparing the possible walks in each graph. This kernel is powerful and can be computed in polynomial time; however, the polynomial is such that, in practice, there are applications where the approach is not yet efficient enough. We have therefore developed a kernel based on representing a graph by its cycle and tree patterns, which is summarized in Section 4 (originally published in [10,11]). While of exponential worst-case complexity, in practice one can make certain well-behavedness assumptions that lead to efficient computation and promising empirical results, as detailed in Section 5.
2 Kernel Methods and Graph Kernels
In this section, we first recall some necessary notions from graph theory and kernel methods, and then discuss some basic properties of graph kernels. We start with the definition of graphs.
A labeled undirected (resp. directed) graph is a quadruple G = (V, E, Σ, ℓ), where V is a finite set of vertices, E ⊆ {X ∈ 2^V : |X| = 2} (resp. E ⊆ V × V) is a set of edges, Σ is a finite set of labels, and ℓ : V ∪ E → Σ is a function assigning a label to each vertex and edge. A graph database G is a set of disjoint graphs (either each undirected or each directed). Let G1 = (V1, E1, Σ, ℓ1) and G2 = (V2, E2, Σ, ℓ2) be undirected (resp. directed) graphs. G1 and G2 are isomorphic if there is a bijection ϕ : V1 → V2 such that (i) for every u, v ∈ V1, {u, v} ∈ E1 iff {ϕ(u), ϕ(v)} ∈ E2 (resp. (u, v) ∈ E1 iff (ϕ(u), ϕ(v)) ∈ E2), (ii) ℓ1(u) = ℓ2(ϕ(u)) for every u ∈ V1, and (iii) ℓ1({u, v}) = ℓ2({ϕ(u), ϕ(v)}) (resp. ℓ1((u, v)) = ℓ2((ϕ(u), ϕ(v)))) for every {u, v} ∈ E1 (resp. (u, v) ∈ E1).
Kernel methods (see, e.g., [18]) are a theoretically well-founded class of statistical learning algorithms that have recently also received considerable attention in the data mining community. Algorithms in this broad class (e.g., support vector machines, Gaussian processes, etc.) have proved to be powerful tools in various real-world data mining applications. Since kernel methods are not restricted to the attribute-value representation used by most data mining algorithms, many of these applications involve datasets given in some non-vectorial representation formalism such as graphs (see, e.g., [5] for a survey), higher-order logic [8], etc. In general, kernel methods are composed of two components: (i) a domain-specific function Φ embedding the underlying instance space X into a high (possibly infinite) dimensional inner product space F, and (ii) a domain-independent algorithm aimed at discovering patterns (e.g., classification, clustering, etc.) in the embedded data, where patterns are restricted to linear functions defined in terms of inner products between the points of the embedded input data. One of the attractive computational properties of kernel methods is that in many cases, patterns can be computed in time independent of the dimension of F. In such cases, the inner product of the feature vectors can be calculated by a kernel without explicitly performing or even knowing the embedding function Φ, where a kernel is a function of the form

    κ : X × X → R    (1)

satisfying κ(x, y) = ⟨Φ(x), Φ(y)⟩ for every x, y ∈ X. In this work, we deal with graph kernels. More precisely, we consider the case where X in (1) is a set of possible graphs. In the design of practically useful graph kernels, one would require them to distinguish between non-isomorphic graphs, i.e., the underlying embedding function should be injective modulo isomorphism. Such graph kernels are called complete graph kernels. For complete graph kernels, the following complexity result holds [7]:

Proposition 1. Computing any complete graph kernel is at least as hard as deciding whether two graphs are isomorphic.
Since the graph isomorphism problem is believed not to be in P, i.e. not solvable in polynomial time, in the subsequent sections we present two graph kernels that are not complete (i.e., the underlying embedding functions may map some non-isomorphic graphs to the same point) but show good predictive performance and can be computed efficiently in practice.
3 Walk-based Kernels
The central idea used in this section to develop kernels for directed graphs is to decompose each graph into different parts and use a measure of the intersection of the two part sets as a kernel function. In particular, we decompose directed graphs into multisets of label sequences corresponding to walks in the graph.1 Before we go into more technical details, we first review some basic definitions.
3.1 Directed Graphs
Let G = (V, E, Σ, ℓ) be a directed graph. A walk w in G is a sequence of vertices w = v1, v2, . . . , vn+1 such that (vi, vi+1) ∈ E for every i = 1, . . . , n. The length of the walk is equal to the number of edges in this sequence, i.e., n in the above case. In order to define our graph kernel in a compact way, we use the adjacency matrix E of G, defined by [E]_ij = 1 if (vi, vj) ∈ E and [E]_ij = 0 otherwise, for every vi, vj ∈ V. Another central concept for the definition of walk kernels is the notion of products of labeled directed graphs. Let G1 = (V1, E1, Σ, ℓ1) and G2 = (V2, E2, Σ, ℓ2) be directed graphs. Then the direct product of G1 and G2 is the directed graph G1 × G2 = (V, E, Σ, ℓ), where

    V = {(v1, v2) ∈ V1 × V2 : ℓ1(v1) = ℓ2(v2)}
    E = {((u1, u2), (v1, v2)) ∈ V × V : (u1, v1) ∈ E1, (u2, v2) ∈ E2, and ℓ1((u1, v1)) = ℓ2((u2, v2))}

and ℓ maps each vertex and edge to the common label of its components.¹
¹ An alternative view of our kernel is to consider each directed graph as a Markov chain and compare two Markov chains by means of the probability that both Markov chains generate the same sequences of observable random variables (i.e., the same sequence of labels).
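To make the construction above concrete, the following minimal sketch builds the direct product of two labeled directed graphs from dictionary-based vertex and edge labelings; this representation and the toy example are assumptions for illustration only.

```python
# Sketch of the direct product of two labeled directed graphs as defined above.
# Graphs are given as dicts of vertex labels and dicts of edge labels; this
# representation and the toy example are illustrative, not from the paper.

def direct_product(vlab1, elab1, vlab2, elab2):
    """vlabX: {vertex: label}, elabX: {(u, v): label} for graph X."""
    # product vertices: pairs of vertices carrying the same label
    vlab = {(v1, v2): vlab1[v1]
            for v1 in vlab1 for v2 in vlab2 if vlab1[v1] == vlab2[v2]}
    # product edges: pairs of equally labeled edges between product vertices
    elab = {((u1, u2), (v1, v2)): elab1[(u1, v1)]
            for (u1, v1) in elab1 for (u2, v2) in elab2
            if elab1[(u1, v1)] == elab2[(u2, v2)]
            and (u1, u2) in vlab and (v1, v2) in vlab}
    return vlab, elab

# Example: two tiny labeled directed graphs.
v1 = {0: "C", 1: "O"}; e1 = {(0, 1): "double"}
v2 = {0: "C", 1: "O", 2: "H"}; e2 = {(0, 1): "double", (0, 2): "single"}
pv, pe = direct_product(v1, e1, v2, e2)
print(sorted(pv), sorted(pe))
```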
3.2 Walk Kernels
The kernel described in this section is based on defining one feature for every possible label sequence and then counting how many walks in a graph match this label sequence. The inner product in this feature space can be computed with the following closed form. Let G1 = (V1, E1, Σ, ℓ1) and G2 = (V2, E2, Σ, ℓ2) be directed graphs and let E× and V× denote the adjacency matrix and the vertex set of the direct product G1 × G2, respectively. With a sequence of weights λ = λ0, λ1, . . . (λi ∈ R; λi ≥ 0 for all i ∈ N) the direct product kernel is defined as

    k×(G1, G2) = Σ_{i,j=1}^{|V×|} [ Σ_{n=0}^{∞} λn E×^n ]_{ij}

if the limit exists. Note that [E×^n]_{ij} is the number of walks of length n from vi = (vi,1, vi,2) to vj = (vj,1, vj,2) in the product graph G1 × G2. This is in turn equal to the number of all possible pairs of walks of length n from vi,1 to vj,1 in G1 and from vi,2 to vj,2 in G2 with the same label sequence. To compute the above kernel function one can make use of polynomial-time computable closed forms or resort to approximations by short random walks on the graphs (see Section 5.1). A variant of the above kernel function that can be used whenever the label sequences are unlikely to match exactly is the following kernel. Let G1, G2 be the graphs as defined above, let G× = G1 × G2 be their direct product, and let Go be their direct product when ignoring the labels in G1 and G2. With a sequence of weights λ = λ0, λ1, . . . (λi ∈ R; λi ≥ 0 for all i ∈ N) and a factor 0 ≤ α ≤ 1 penalizing gaps, the non-contiguous sequence kernel is defined as

    k∗(G1, G2) = Σ_{i,j=1}^{|V×|} [ Σ_{n=0}^{∞} λn ((1 − α)E× + αEo)^n ]_{ij}

if the limit exists. This kernel is very similar to the direct product kernel. The only difference is that instead of the adjacency matrix of the direct product graph, the matrix (1 − α)E× + αEo is used. The relationship can be seen by adding — parallel to each edge — a new edge labeled # with weight √α in both factor graphs.
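In practice one often truncates the infinite sum (as done in Section 5.1, where walks with at most 13 vertices are counted). The following NumPy sketch computes such a truncated direct product kernel from the adjacency matrix of the product graph; the weights and the toy matrix are illustrative assumptions.

```python
# Sketch of a truncated direct product kernel: sum over walk lengths n = 0..L
# of lambda_n times the total number of walks of length n in the product graph,
# computed from powers of its adjacency matrix. Assumes the product graph
# (given here directly as an adjacency matrix) fits in memory.

import math
import numpy as np

def truncated_walk_kernel(E_prod, weights):
    """E_prod: adjacency matrix of the direct product graph G1 x G2.
    weights: sequence lambda_0, ..., lambda_L of non-negative weights."""
    n = E_prod.shape[0]
    power = np.eye(n)                 # E_prod^0
    value = 0.0
    for lam in weights:
        value += lam * power.sum()    # sum over all entries [E_prod^k]_{ij}
        power = power @ E_prod        # next matrix power
    return value

# Example: a 3-vertex product graph with exponentially decaying weights.
E_prod = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [0, 0, 0]], dtype=float)
weights = [1.0 / math.factorial(k) for k in range(5)]
print(truncated_walk_kernel(E_prod, weights))
```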
4 The Cyclic Pattern Kernel
In practice, though of polynomial complexity, the above walk-based kernels turn out to be too slow for certain very large problems. In this section, we therefore present another graph kernel, called the cyclic pattern kernel (CPK), introduced in [11]. CPK is based on decomposing graphs into a distinguished set of trees and cycles. Using the labels of vertices and edges, these trees
and cycles are then mapped to strings called tree and cyclic patterns. For two graphs, CPK is defined as the cardinality of the intersection of their sets of tree and cyclic patterns. Although below we define CPK for undirected graphs, we note that the approach can easily be adapted to directed graphs as well. We first recall some further basic notions from graph theory.

4.1 Undirected Graphs
In this section, by graphs we always mean labeled undirected graphs. Let G = (V, E, Σ, ℓ) be a graph. A graph G′ = (V′, E′, Σ, ℓ′) is a subgraph of G if V′ ⊆ V, E′ ⊆ E, and ℓ′(x) = ℓ(x) for every x ∈ V′ ∪ E′. A walk in G is a sequence of vertices w = v1, v2, . . . , vn+1 such that {vi, vi+1} ∈ E for every i = 1, . . . , n. G is connected if there is a walk between any pair of its vertices. A connected component of G is a maximal subgraph of G that is connected. A vertex v ∈ V is an articulation vertex if its removal increases the number of connected components of G. G is biconnected if it contains no articulation vertex. A biconnected component of G is a maximal subgraph that is biconnected. Let G be a graph. Two vertices of G are adjacent if they are connected by an edge. The degree of a vertex v ∈ V is the number of vertices adjacent to v. A subgraph C of G forms a (simple) cycle if it is connected and each of its vertices has degree 2. We denote by S(G) the set of cycles of G. We note that the number of cycles can grow faster than 2^n, where n is the cardinality of the vertex set. It holds that the biconnected components of a graph G are pairwise edge-disjoint and thus form a partition of the set of G's edges. This partition, in turn, corresponds to the following equivalence relation on the set of edges: two edges are equivalent iff they belong to a common cycle. This property of biconnected components implies that an edge of a graph belongs to a cycle iff its biconnected component contains more than one edge. Edges not belonging to cycles are called bridges. The subgraph of a graph G formed by its bridges is denoted by B(G). Clearly, each bridge of a graph is a singleton biconnected component, and B(G) is a forest.

4.2 Definition of CPK
In order to define CPK, we need the following function. Let U be a set and κ∩ : 2^U × 2^U → N be the function defined by κ∩ : (S1, S2) ↦ |S1 ∩ S2| for every S1, S2 ⊆ U. From the definitions it follows that κ∩ is a kernel, called the intersection kernel.²
² We note that intersection kernels are often defined in a more general way (see, e.g., [18]).
Let Σ, Γ be alphabets, and let π be a mapping from the set of cycles and trees labeled by Σ to Γ* such that (i) π maps two graphs to the same string iff they are isomorphic and (ii) π can be computed in polynomial time. We note that such Γ and π exist and can easily be constructed (see, e.g., [11,21]). Using π, the sets of cyclic and tree patterns of G are defined by

    PS(G) = {π(C) : C ∈ S(G)}    (2)
    PT(G) = {π(T) : T is a maximal tree of B(G)},    (3)

respectively. The cyclic pattern kernel for a graph database G is then defined by

    κS(G1, G2) = κ∩(PS(G1), PS(G2)) + κ∩(PT(G1), PT(G2))    (4)

for every G1, G2 ∈ G. Since PS(G) and PT(G) are disjoint for every G and κ∩ is a kernel, κS is also a kernel.
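Once the cyclic and tree patterns of each graph are available as canonical strings, evaluating (4) reduces to two set intersections. The following sketch shows only this final step; computing the patterns themselves (the mapping π, cycle and bridge enumeration) is assumed to be done elsewhere, and the pattern strings are made up.

```python
# Sketch of evaluating the cyclic pattern kernel (4), given the canonical
# cyclic and tree pattern strings of each graph. Computing the patterns
# (the mapping pi, cycle and bridge enumeration) is assumed to happen elsewhere.

def intersection_kernel(S1, S2):
    return len(set(S1) & set(S2))

def cyclic_pattern_kernel(cyclic_patterns_1, tree_patterns_1,
                          cyclic_patterns_2, tree_patterns_2):
    return (intersection_kernel(cyclic_patterns_1, cyclic_patterns_2)
            + intersection_kernel(tree_patterns_1, tree_patterns_2))

# Toy example with made-up canonical pattern strings.
g1_cycles, g1_trees = {"cCCCCCO"}, {"tC-H", "tC-O-H"}
g2_cycles, g2_trees = {"cCCCCCO", "cCCCCCC"}, {"tC-H"}
print(cyclic_pattern_kernel(g1_cycles, g1_trees, g2_cycles, g2_trees))  # prints 2
```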
4.3 Computing CPK
Unfortunately, unless P = NP, κS is not computable in polynomial time [11]. Because of this high complexity, CPK is computed in [11] by (i) explicitly performing the embedding into the feature space for every graph, and then (ii) calculating the inner product of the obtained feature vectors. To perform the embedding for a graph G, PS(G) is computed by enumerating all elements of S(G). The reason is that while S(G) can be enumerated with linear delay [16], PS(G) cannot be enumerated in output-polynomial time (unless P = NP) [11]. Thus, the algorithm computing CPK in [11] is polynomial only in |S(G)| rather than in |PS(G)|. Since |S(G)| can be exponential in the number of vertices of G, the method in [11] is restricted to graphs with a polynomial number of cycles. To decide whether the graphs in the database satisfy this condition, one has to count their cycles, which is #P-complete in general [19]. Restricting CPK to graphs with a polynomial number of cycles is rather severe; graphs containing exponentially many cycles may have polynomially many or even a constant number of cyclic patterns. As an example, let G be the graph shown in Figure 1, where each vertex and edge of G is labeled by the same symbol. G is made up of 2n + 1 vertices and contains 2^n + n cycles, which in turn form only two different cyclic patterns. This, as well as other examples from real-world datasets, motivates us to deal with the problem of listing the cyclic patterns of a graph without enumerating the possibly exponentially large set of all its cycles. More precisely, we consider the question of whether cyclic patterns can be enumerated with polynomial delay. The following proposition states that, in contrast to cycles, this problem is most likely intractable [11]:

Proposition 2. Unless P = NP, cyclic patterns cannot be enumerated in output-polynomial time.
Fig. 1. A graph containing exponentially many cycles that form two different cyclic patterns. Each edge and vertex has the same label (not shown in the figure).
The proof of the above proposition is based on a polynomial-time reduction from the NP-complete Hamiltonian cycle problem. This and many other NP-hard computational problems become, however, polynomially solvable when restricted to graphs of bounded treewidth. Treewidth [17] is a measure of the tree-likeness of graphs. It has wide algorithmic applications because many problems that are hard on arbitrary graphs become easy for graphs of bounded treewidth. The class of bounded treewidth graphs includes many practically relevant graph classes (see, e.g., [1] for an overview). Due to space limitations, we omit the formal definition of treewidth (see, e.g., [1]). We note that graphs with small treewidth may have exponentially many cycles. For instance, one can see that the treewidth of the graph in Figure 1 is 2 for every n ≥ 1. For graphs of bounded treewidth, the following result holds [10]:

Theorem 1. Let Gi = (Vi, Ei, Σ, ℓi) be bounded treewidth graphs for i = 1, 2. Then κS(G1, G2) can be computed in time polynomial in max{|V1|, |V2|, |PS(G1)|, |PS(G2)|}.
5 Empirical Evaluation
In order to evaluate the predictive power of the walk and cyclic pattern kernels, we use the NCI-HIV dataset³ of chemical compounds that has been used frequently in the empirical evaluation of graph mining approaches (see, e.g., [2–4,14]). Each compound in this dataset is described by its chemical structure and classified into one of three categories based on its capability to inhibit the HIV virus: confirmed inactive (CI), moderately active (CM), or active (CA). The dataset contains 42689 molecules, 423 of which are active, 1081 are moderately active, and 41185 are inactive. Since more than 99% of the corresponding 42689 chemical graphs contain fewer than 1000 cycles, the cyclic and tree patterns for the whole dataset can be computed in about 10 minutes.
³ http://cactus.nci.nih.gov/ncidb/download.html
5.1 Practical Considerations for the Walk Kernel
For molecule classification the number of vertex labels is limited by the number of elements occurring in natural compounds. For that reason, it is reasonable not to use just the element of an atom as its label. Instead, we use the pair consisting of the atom's element and the multiset of all neighbors' elements as the label. In the HIV dataset this increases the number of different labels from 62 to 1391. The size of this dataset, in particular the size of the graphs in this dataset, hinders the computation of walk-based graph kernels. The largest graph contains 214 atoms (not counting hydrogen atoms). If all had the same label, the product graph would have 45796 vertices. As different elements occur in this molecule, the product graph has fewer vertices. However, it turns out that the largest product graph (without the vertex coloring step) still has 34645 vertices. The vertex coloring above changes the number of vertices with the same label, so that the product graph is reduced to 12293 vertices. For each kernel computation, either an eigendecomposition or an inversion of the adjacency matrix of a product graph has to be performed. With cubic time complexity, such operations on matrices of this size are not feasible. The only chance to compute graph kernels in this application is to approximate them. There are two choices. First we consider counting the number of walks in the product graph up to a certain depth. In our experiments it turned out that counting walks with 13 or fewer vertices is still feasible. An alternative is to explicitly construct the image of each graph in the feature space. In the original dataset 62 different labels occur and after the vertex coloring 1391 different labels occur. The size of the feature space of label sequences of length 13 is then 62^13 > 10^23 for the original dataset and 1391^13 > 10^40 with the vertex coloring. We would also have to take into account walks with fewer than 13 vertices, but at the same time not all possible walks will occur in at least one graph. The size of this feature space hinders explicit computation. We thus resorted to counting walks with 13 or fewer vertices in the product graph.
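The vertex coloring described above can be sketched as a single relabeling step in which each atom receives the pair (its element, sorted multiset of its neighbors' elements); the molecule representation below is an illustrative assumption, not the paper's data format.

```python
# Sketch of the vertex coloring described above: each atom is relabeled by the
# pair (its element, multiset of its neighbors' elements). The molecule
# representation (element dict plus undirected bond list) is illustrative.

def color_vertices(elements, bonds):
    """elements: {atom_id: element symbol}, bonds: iterable of (atom_id, atom_id).
    Returns {atom_id: refined label}."""
    neighbors = {a: [] for a in elements}
    for u, v in bonds:
        neighbors[u].append(elements[v])
        neighbors[v].append(elements[u])
    # sorted neighbor elements represent the multiset in a canonical way
    return {a: (elements[a], tuple(sorted(neighbors[a]))) for a in elements}

# Example: acetic acid without hydrogen atoms, C-C(=O)-O.
elements = {0: "C", 1: "C", 2: "O", 3: "O"}
bonds = [(0, 1), (1, 2), (1, 3)]
print(color_vertices(elements, bonds))
# {0: ('C', ('C',)), 1: ('C', ('C', 'O', 'O')), 2: ('O', ('C',)), 3: ('O', ('C',))}
```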
5.2 Experimental Methodology and Results
We compare both of our approaches to the results presented in [3] and [4]. The classification problems considered there were: (1) distinguish CA from CM, (2) distinguish CA and CM from CI, and (3) distinguish CA from CI. Additionally, we will consider (4) distinguish CA from CM and CI. For each problem, the area under the ROC curve (AUC), averaged over a 5-fold crossvalidation, is given for different misclassification cost settings. In order to choose the parameters of the walk-based graph kernel (we use the direct product kernel) we proceeded as follows. We split the smallest problem (1) into 10% for parameter tuning and 90% for evaluation. First we tried different parameters for the exponential weight (10^-3, 10^-2, 10^-1, 1, 10) in a single nearest neighbor algorithm (leading to an average AUC of 0.660, 0.660, 0.674, 0.759, 0.338) and decided to use the value 1 throughout. Next we needed to choose the complexity (regularization) parameter of the SVM. Here we tried different parameters (10^-3, 10^-2, 10^-1, leading to an average AUC of 0.694, 0.716, 0.708) and found the parameter 10^-2 to work best. Evaluating with an SVM and these parameters on the remaining 90% of the data, we achieved an average AUC of 0.820 with a standard deviation of 0.024. For cyclic pattern kernels, only the complexity constant of the support vector machine has to be chosen. Here, the heuristic implemented in SVMlight [13] is used. Also, we did not use any vertex coloring with cyclic pattern kernels. To compare our results to those achieved in previous work, we fixed these parameters and reran the experiments on the full data of all three problems. Table 1 summarises these results and the results with FSG reported in [3]. In [4] the authors of [3] describe improved results (FSG∗); there, the authors report results obtained with an optimised threshold on the frequency of patterns.⁴

⁴ In [4], additionally including a description of the three-dimensional shape of each molecule is considered. We do not compare our results to those obtained using the three-dimensional information. We are considering including three-dimensional information in our future work as well and expect similar improvements.

task  cost   walk-based kernels   cyclic pattern kernels   FSG          FSG∗
(1)    1.0   0.818(±0.024)        0.813(±0.014)            0.774 ••◦◦   0.810
(1)    2.5   0.825(±0.032)        0.827(±0.013)            0.782 •◦◦    0.792 •◦◦
(2)    1.0   0.815(±0.015)        0.775(±0.017) ••         0.742 ••◦◦   0.765 ••
(2)   35.0   0.799(±0.011)        0.801(±0.017)            0.778 ••◦    0.794
(3)    1.0   0.942(±0.015)        0.919(±0.011) •          0.868 ••◦◦   0.839 ••◦◦
(3)  100.0   0.944(±0.015)        0.929(±0.010) •          0.914 ••◦    0.908 ••◦◦
(4)    1.0   0.926(±0.015)        0.908(±0.024) •          —            —
(4)  100.0   0.928(±0.013)        0.921(±0.026)            —            —

Table 1. Area under the ROC curve for different costs and problems (•: significant loss against walk-based kernels at 10% / ••: significant loss against walk-based kernels at 1% / ◦: significant loss against cyclic pattern kernels at 10% / ◦◦: significant loss against cyclic pattern kernels at 1%)

Clearly, the graph kernels proposed here outperform FSG and FSG∗ over all problems and misclassification cost settings. To evaluate the significance of our results we proceeded as follows: As we did not know the variance of the area under the ROC curve for FSG, we assumed the same variance as obtained with graph kernels. Thus, to test the hypothesis that graph kernels significantly outperform FSG, we used a pooled sample variance equal to the variance exhibited by graph kernels. As FSG and graph kernels were applied in a 5-fold crossvalidation, the estimated standard error of the average difference is the pooled sample variance times √(2/5).
test statistic is then the average difference divided by its estimated standard error. This statistic follows a t distribution. The null hypothesis — graph kernels perform no better than FSG — can be rejected at the significance level α if the test statistic is greater than t8 (α), the corresponding percentile of the t distribution. Table 5.2 shows the detailed results of this comparison. Walk-based graph kernels perform always better or at least not significantly worse than any other kernel. Cyclic pattern kernels are sometimes outperformed by walkbased graph kernels but can be computed much more efficiently. For example, in the classification problem where we tried to distinguish active compounds from moderately active compounds and inactive compounds, five-fold crossvalidation with walk-based graph kernels finished in about eight hours, while changing to cyclic pattern kernels reduced the runtime to about twenty minutes.
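For concreteness, the significance test just described can be written down in a few lines. The following is a minimal sketch (the per-fold AUC values in the example comment are hypothetical, and the one-sided p-value mirrors the comparison with $t_8(\alpha)$ above):

import numpy as np
from scipy import stats

def pooled_auc_test(auc_kernel_folds, auc_fsg_mean, folds=5):
    """Sketch of the test described above: the per-fold AUCs of the graph
    kernel are known, only the mean AUC of FSG is reported, so the kernel's
    spread is used as the pooled spread for both methods."""
    auc_kernel_folds = np.asarray(auc_kernel_folds, dtype=float)
    diff = auc_kernel_folds.mean() - auc_fsg_mean
    s_pooled = auc_kernel_folds.std(ddof=1)        # assumed common std. dev.
    se = s_pooled * np.sqrt(2.0 / folds)           # std. error of the difference
    t_stat = diff / se
    p_value = stats.t.sf(t_stat, df=2 * (folds - 1))   # one-sided, df = 8
    return t_stat, p_value

# Example with hypothetical fold AUCs:
# t, p = pooled_auc_test([0.93, 0.95, 0.94, 0.96, 0.93], auc_fsg_mean=0.868)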
6 Conclusions and Future Work
The obvious approach to define kernels on objects that have a natural representation as a graph is to decompose each graph into a set of subgraphs and measure the intersection of two decompositions. As mentioned in Section 2, such a graph kernel cannot be computed efficiently if the decomposition is required to be unique up to isomorphism. In the literature, different approaches have been tried to overcome this problem: [9] restricts the decomposition to paths up to a given size, and [3] only considers the set of connected graphs that occur frequently as subgraphs in the graph database. The approach taken there to compute the decomposition of each graph is an iterative one [15]. In this work we presented two practically usable kernels for graphs. Although the underlying decompositions are not unique up to isomorphism, our experiments on a large chemical dataset indicate that the above complexity limitation does not hinder successful classification of molecules. In future work we plan to investigate kernels that are able to use more information than just the graph structure. For example, for chemical molecules, this might be the 3D structure and additional background knowledge about important building blocks such as rings or other structures.
Acknowledgements This work was supported in part by the DFG project (WR 40/2–1) Hybride Methoden und Systemarchitekturen für heterogene Informationsräume.
References
1. BODLAENDER, H.L. (1998): A partial k-arboretum of graphs with bounded treewidth. Theoretical Computer Science, 209(1–2):1–45.
2. BORGELT, C., and BERTHOLD, M.R. (2002): Mining molecular fragments: Finding relevant substructures of molecules. Proc. IEEE Int. Conf. on Data Mining, pp. 51–58. IEEE Computer Society.
3. DESHPANDE, M., KURAMOCHI, M., and KARYPIS, G. (2002): Automated approaches for classifying structures. Proc. 2nd ACM SIGKDD Workshop on Data Mining in Bioinformatics, pp. 11–18.
4. DESHPANDE, M., KURAMOCHI, M., and KARYPIS, G. (2003): Frequent substructure based approaches for classifying chemical compounds. Proc. 3rd IEEE Int. Conf. on Data Mining, pp. 35–42. IEEE Computer Society.
5. GÄRTNER, T. (2003): A survey of kernels for structured data. SIGKDD Explorations, 5(1):49–58.
6. GÄRTNER, T. (2005): Predictive graph mining with kernel methods. In: S. Bandyopadhyay, D. Cook, U. Maulik, and L. Holder, editors, Advanced Methods for Knowledge Discovery from Complex Data, to appear.
7. GÄRTNER, T., FLACH, P.A., and WROBEL, S. (2003): On graph kernels: Hardness results and efficient alternatives. 16th Annual Conf. on Computational Learning Theory and 7th Kernel Workshop, pp. 129–143. Springer Verlag, Berlin.
8. GÄRTNER, T., LLOYD, J., and FLACH, P. (2004): Kernels and distances for structured data. Machine Learning, 57(3):205–232.
9. GRAEPEL, T. (2002): PAC-Bayesian Pattern Classification with Kernels. PhD thesis, TU Berlin.
10. HORVÁTH, T. (2005): Cyclic pattern kernels revisited. Proc. Advances in Knowledge Discovery and Data Mining, 9th Pacific-Asia Conf., pp. 791–801. Springer Verlag, Berlin.
11. HORVÁTH, T., GÄRTNER, T., and WROBEL, S. (2004): Cyclic pattern kernels for predictive graph mining. Proc. 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 158–167. ACM Press, New York.
12. HORVÁTH, T., and TURÁN, G. (2001): Learning logic programs with structured background knowledge. Artificial Intelligence, 128(1–2):31–97.
13. JOACHIMS, T. (1999): Making large-scale SVM learning practical. In: B. Schölkopf, C.J.C. Burges, and A.J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pp. 169–184. MIT Press, Cambridge, MA.
14. KRAMER, S., DE RAEDT, L., and HELMA, C. (2001): Molecular feature mining in HIV data. Proc. 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 136–143. ACM Press, New York.
15. KURAMOCHI, M., and KARYPIS, G. (2001): Frequent subgraph discovery. Proc. IEEE Int. Conf. on Data Mining, pp. 313–320. IEEE Computer Society.
16. READ, R.C., and TARJAN, R.E. (1975): Bounds on backtrack algorithms for listing cycles, paths, and spanning trees. Networks, 5(3):237–252.
17. ROBERTSON, N., and SEYMOUR, P.D. (1986): Graph minors. II. Algorithmic aspects of tree-width. J. Algorithms, 7(3):309–322.
18. SHAWE-TAYLOR, J., and CRISTIANINI, N. (2004): Kernel Methods for Pattern Analysis. Cambridge University Press.
19. VALIANT, L.G. (1979): The complexity of enumeration and reliability problems. SIAM Journal on Computing, 8(3):410–421.
20. VAPNIK, V. (1998): Statistical Learning Theory. J. Wiley & Sons, Chichester.
21. ZAKI, M. (2002): Efficiently mining frequent trees in a forest. Proc. 8th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 71–80. ACM Press, New York.
PRISMA: Improving Risk Estimation with Parallel Logistic Regression Trees

Bert Arnrich1, Alexander Albert2, and Jörg Walter1

1 Neuroinformatics Group, Faculty of Technology, Bielefeld University, Germany
2 Clinic for Cardiothoracic Surgery, Heart Institute Lahr, Germany
Abstract. Logistic regression is a very powerful method to estimate models with binary response variables. With the previously suggested combination of tree-based approaches with local, piecewise valid logistic regression models in the nodes, interactions between the covariates are directly conveyed by the tree and can be interpreted more easily. We show that the restriction of partitioning the feature space only at the single best attribute limits the overall estimation accuracy. Here we suggest Parallel RecursIve Search at Multiple Attributes (PRISMA) and demonstrate how the method can significantly improve risk estimation models in heart surgery and successfully perform a benchmark on three UCI data sets.
1 Introduction
The logistic regression model (LogReg) estimates the probability of a binary outcome Y depending on the linear combination of k input variables Xi in a data set D. The k + 1 coefficients β0, β1, ..., βk are usually estimated by iterative likelihood maximization. In health sciences the input parameters are, e.g., absence/presence of some risk factors, medication or procedure type for a certain patient. Then the coefficients βi are easily interpretable (e.g. for a binary variable Xi the corresponding e^{βi} is equal to the odds ratio) and broadly appreciated. Despite its simplicity, the accuracy compares favorably to many binary classification methods [8]. In principle the Xi can be any non-linear transformations or combinations of variables in order to adapt the model to existing non-linearities and parameter interactions. Unfortunately, the model then quickly loses its comprehensibility (at least to many health professionals), and therefore these extensions are not very commonly applied. Another well appreciated model format is the decision tree (DT). It assigns each new case to a unique terminal node which comprises a group of cases. The combination of DT and LogReg models was suggested earlier and embraces several advantages (see e.g. [5]):
• The tree-structure can handle large parts of the overall model complexity.
• Interactions between the covariates are directly conveyed by the tree and can be interpreted more easily in qualitative terms.
• A simple form of the fitted function in each node enables the statistical properties of the method to be studied analytically.
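For reference, the model referred to here is the standard logistic regression formulation (a textbook statement, not reproduced from this paper):

P(Y = 1 \mid X_1, \dots, X_k)
  = \frac{\exp\!\big(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k\big)}
         {1 + \exp\!\big(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k\big)},
  \qquad e^{\beta_i} = \text{odds ratio associated with a binary } X_i.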
In previous work concerning tree-structured regression we can distinguish two basic strategies.
Two Phase Methods: First partition the data using a tree construction method and afterwards fit models in each node.
Adaptive Methods: Recursively partition the data, taking into account the quality of the fitted models during the tree construction process.
One example of the first strategy is Quinlan's M5 regression tree [9]. In a preliminary step M5 first builds a classification tree using the standard deviation of the Y values as node impurity function, and then a multivariate linear model is fitted at each node. In the second direction a deviance-based approach was proposed: LOTUS constructs contingency tables with the Y outcome [4]. The variable Xi with the smallest significance level (tested with the χ2 statistic) is selected to split the node. The binary split point that minimizes the sum of the deviances of the logistic regression models fitted to the two data subsets is chosen. Although it was shown that the regression tree models have significant advantages over simple regression or standard decision trees, it remains unclear whether the structure of the built hybrid tree is optimal with regard to the overall estimation performance. In contrast to previous work we propose to search for a partitioning where the node models produce the highest overall estimation accuracy.
2 Methods
Aiming at the best overall estimation accuracy, the key ideas of the proposed search for an optimal tree are:
1. parallel recursive search at multiple attributes,
2. fitting a stepwise logistic regression model in each partition,
3. testing whether the discriminative power of the sub-model in each leaf node is better than that of a parent model, using the area under the receiver operating characteristic (ROC) curve, and
4. selecting the tree structure and the corresponding node models with the highest overall estimation accuracy.
Independent of the algorithmic strategy that is used for finding optimal split points of an attribute Xi in D or a subspace of D, it is important to ensure that obviously "bad" partitions, i.e. partitions that break apart a sequence of records belonging to a single class [6], are not selected by the evaluation function. Here we used the concept of boundary points introduced in [7]. In the following we first briefly introduce the evaluation functions used for finding optimal split points, which we apply only to boundary points. Next we explain the parallel subtree construction, model fitting, pruning and final tree generation.
2.1 Split Criteria: Gain Ratio
The gain ratio criterion assesses the desirability of a partition as the ratio of its information gain to its split information [10]. The information that is gained if a set D is partitioned into two subsets D1 and D2 induced by a boundary point T of the attribute Xi is given by Gain(Xi, T; D) = Ent(D) − E(Xi, T; D), where Ent(D) denotes the class information entropy and E(Xi, T; D) is the weighted average of the resulting class entropies. The potential information generated by dividing D into m subsets is given by the split information. With this kind of normalization the gain ratio is defined as the ratio of information gain and split information. For a given attribute Xi the boundary point T with maximal gain ratio is selected if the information gain is positive.
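Spelled out for the binary split used here (following the standard definitions of [10]; the notation is ours), the split information and the resulting gain ratio are:

\mathrm{SplitInfo}(X_i, T; D) = -\sum_{m=1}^{2} \frac{|D_m|}{|D|} \log_2 \frac{|D_m|}{|D|},
\qquad
\mathrm{GainRatio}(X_i, T; D) = \frac{\mathrm{Gain}(X_i, T; D)}{\mathrm{SplitInfo}(X_i, T; D)}.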
2.2 Split Criteria: Class Information Entropy and MDLPC
In [7] a stopping criterion based on the Minimum Description Length Principle Criterion (MDLPC) for the entropy-based partitioning process was developed. A partition induced by a boundary point T of the attribute Xi in a set D with minimal class information entropy E(Xi, T; D) is accepted if the information gain is greater than a threshold (for details see [7]).
2.3 Split Criteria: χ2 Statistic
For the χ2 method a 2×2 contingency table is computed for each boundary point T in variable i ({Xi < T, Xi ≥ T} versus Y = {0, 1}). Using the χ2 distribution function, the significance of an association between the outcome and each boundary point is calculated. For a given attribute i the most significant partitioning is chosen if it is significant at the 5% level.
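As an illustration, the 2×2 test for one candidate boundary point can be written with standard library routines. This is a sketch in our own notation, not the authors' implementation, and it assumes both sides of the split contain cases of both classes:

import numpy as np
from scipy.stats import chi2_contingency

def chi2_split_pvalue(x, y, threshold):
    """Build the 2x2 table {x < T, x >= T} x {y = 0, y = 1} for a candidate
    boundary point T and return the p-value of the association."""
    left = x < threshold
    table = np.array([
        [np.sum(left & (y == 0)), np.sum(left & (y == 1))],
        [np.sum(~left & (y == 0)), np.sum(~left & (y == 1))],
    ])
    _, p_value, _, _ = chi2_contingency(table)
    return p_value

# The boundary point with the smallest p-value is chosen,
# provided it is significant at the 5% level.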
2.4 Parallel Subtree Construction using Proxy Nodes
By using optimal splits for multiple attributes at each node, parallel subtrees according to the number of different attributes are constructed. To ensure that we visit every partitioning only once, we introduced proxy nodes: if the required partitioning already exists in the tree, the new node becomes a proxy node and then refers to the corresponding node (see Fig. 1). By this mechanism many redundant computations can be saved.
2.5 LogReg Model Fitting and ROC-based Pruning
For each non-proxy node a stepwise backward logistic regression model is fitted.1 As initial predictor variables all non-constant ordinal numeric attributes Xi are used. Proxy nodes refer to the computed node models and their possible subtrees.

1 We employ the glm and the accelerated version fastbw of the statistical software package R [11] for the regression task.
Fig. 1. Parallel subtree construction using proxy nodes: besides the split on "Critical Preoperative State" (CPS), a branch for "Non-Coronary Surgery" (NCS) is also opened. The same strategy can be seen for the children of node 3, where, in addition to the CPS split, an "Age Group" (AG) division is carried out. To ensure that every partitioning is visited only once, a new node with a partitioning which already exists in the tree becomes a proxy node and refers to the corresponding node. For example the partitioning in node 9 (NCS=0 and CPS=0) is the same as in node 5 (CPS=0 and NCS=0). Therefore node 9 is a proxy of node 5 and refers to it.
In the following pruning phase the estimation accuracy of each leaf node model is compared with all of its parent models. The model with the best estimated accuracy is assigned to the leaf node, i.e. if a parent model is superior, the leaf model will be discarded and the node will refer to the parent model (e.g. see node 7 in Fig. 2). We choose the area under the ROC curve (AUC) as best suited for comparing our node models. In a low risk regime the AUC, also sometimes called "c-index", is an integral measure of the entire performance of a classification or estimation system. While model improvements result only in small value changes (AUC ∈ [0.5, 1]), the measure is very sensitive compared to, e.g., the standard error.
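The comparison logic of this pruning step can be sketched as follows. The paper's implementation fits the models with R's glm; the sketch below only illustrates the AUC-based decision, with hypothetical model objects that expose predict_proba-style risk estimates:

from sklearn.metrics import roc_auc_score

def best_model_for_leaf(leaf_data, leaf_model, parent_models):
    """Keep the leaf model only if no model fitted on a parent partition
    estimates the leaf's cases better (judged by AUC on the leaf cases).
    leaf_data is an (X, y) pair restricted to the leaf; names are ours."""
    X, y = leaf_data
    best_model = leaf_model
    best_auc = roc_auc_score(y, leaf_model.predict_proba(X)[:, 1])
    for parent in parent_models:
        auc = roc_auc_score(y, parent.predict_proba(X)[:, 1])
        if auc > best_auc:
            best_model, best_auc = parent, auc
    return best_model, best_auc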
2.6 Generating the Final Tree
Our goal is to find a single, complete and unique tree structure, where the node models produce the best overall estimation accuracy. In a first step all complete and unique trees from the series of parallel trees resulting from the previous tree construction process have to be extracted recursively, starting at the root node (see example in Fig. 2):
Fig. 2. Extracted unique trees with regression models in the nodes. In the final tree generation process all child nodes are grouped according to their attributes and a new tree for each attribute is created if more than one group exists. In the example tree in Fig. 1 the branch at NCS=0 has two different attributes: AG and CPS. Therefore two new trees (one with child attribute AG, another with CPS, in the NCS=0 branch) were generated. The models in the proxy nodes 9 and 11 refer to their corresponding models in the nodes 5 and 6. An example of the parent regression model assignment in the ROC-based pruning phase can be seen at node 7 (see left tree): the estimated accuracy of the regression model F3 in node 3 was superior in the sub-partition AG<2.5 compared with the model built only in that sub-space. Therefore the leaf model in node 7 was discarded and replaced with the parent model F3.
1. Group all child nodes according to their attributes.
2. If more than one attribute group exists, create a new tree for each attribute and discard the source tree.
3. Iterate until all trees are unique.
In the next step the estimation accuracy for each extracted tree is computed and the tree with the highest estimation performance is selected as the final tree. This can optionally be performed on a separate tree selection data set, not used in the tree construction and model fitting phases. In this contribution the final tree selection is based on the test set in a cross-validation experiment.
3 Results
With the proposed PRISMA method we (i) could improve an established risk score system in heart surgery using our research database at the Heart Institute Lahr [1], and (ii) used three real data sets from the UCI data repository [2] to compare PRISMA with stepwise logistic regression (LogReg), standard regression tree methods (Std. RT) and the two recently introduced variants of the LOTUS algorithm for building accurate and comprehensible logistic regression trees [4]. Tab. 1 shows the characteristics of the tested data sets.

Name         Size   Quantitative  Ordinal  Definition of Y = 1             Percent Y = 1
Bupa           345       6           0     Presence of liver disorder           58.0
Pima           768       8           0     Diabetes in Pima Indian women        34.9
Yeast         1484       8           0     Location site                        76.5
Heart Lahr   11714       1          17     Mortality within 30 days              2.1

Table 1. Characteristics of the data sets

3.1 Surgical Quality Assessment in Heart Surgery

Risk-adjusted performance analyses for medical procedures are increasingly important. One of the established risk score systems based on a logistic regression model for postoperative mortality in Europe is the European System for Cardiac Operative Risk Evaluation (EuroSCORE) [12]. Tab. 2 presents the 10-fold cross-validated results for building a stepwise logistic regression model ("LogReg"), the standard regression tree scheme ("Std. RT") grown on a single attribute split, and the "PRISMA" approach. The results improve in this sequence, with the best split criteria being χ2 and MDLPC.

Method    Split evaluation   #MC     #MP    #T     #N    AUC
LogReg    —                  —       —      —      1     0.777
Std. RT   MDLPC              5.7     2.7    1      5.7   0.789
Std. RT   Gain-Ratio         6.8     3.0    1      6.8   0.789
Std. RT   Chi-Square         7.0     2.6    1      7.0   0.789
PRISMA    MDLPC              19.8    10.7   4.7    5.8   0.800
PRISMA    Gain-Ratio         143.5   82.7   38.0   8.0   0.808
PRISMA    Chi-Square         94.0    51.0   24.4   7.8   0.808

Table 2. Comparison of prediction accuracy measured as area under the ROC curve (AUC) and tree sizes for the Heart Lahr data set: stepwise logistic regression ("LogReg"), standard regression tree ("Std. RT") grown on a single attribute split using three different split criteria, and PRISMA. "#MC" denotes the average number of models computed in the parallel tree, "#MP" the average number of models pruned, "#T" the average number of extracted unique trees in the final tree generation process, and "#N" the average number of nodes in the unique trees.
Fig. 3. Predicted mean deviance relative to that of simple logistic regression (denoted at 1.0) on three UCI data sets. Beside the LOTUS result (first bar), three paired bars for a standard regression tree (“Std. RT”) and PRISMA with the same splitting criteria are visualized respectively. PRISMA always outperforms the simple regression as well as LOTUS and the standard regression tree variant.
3.2 LOTUS vs. PRISMA
We compared the accuracy of the simple logistic model and of LOTUS using multiple regression models with that of PRISMA on three real data sets from the UCI data repository. Fig. 3 shows the estimated predicted mean deviance of LOTUS, the standard regression tree and PRISMA relative to that of LogReg.
4 Discussion and Conclusion
Both methods, logistic regression models and decision trees, are powerful data mining techniques. Their result formats are relatively easy to understand and therefore well established. Combining the two methods can preserve this advantage. In our contribution we showed that previous approaches fall short because of the standard recipe for generating trees: only the recursion on the single best split is pursued. While this is a straightforward method to limit the number of evaluated nodes, it falls short of the goal of most accurate regression trees. The suggested PRISMA algorithm builds several subtrees with regression models for multiple attributes in parallel. By using proxy nodes it keeps track of splits already computed in previous subtrees. One important step in the final process is the pruning phase, which is based on the ultimate goal of overall estimation accuracy, here measured by the area under the ROC curve. The ambiguous tree is replicated in a forest of unambiguous regression trees, and the tree that gives the best overall estimated performance is selected as the final one.
We showed that the introduced PRISMA algorithm outperforms not only the simple regression model and standard regression tree approaches employing several split criteria, but also the recently published LOTUS algorithm. We demonstrated this using UCI benchmark data sets and could show the improvement in building comprehensible regression trees in the medical domain, using a data set from open heart surgery.
References
1. ARNRICH, B. and WALTER, J. and ALBERT, A. and ENNKER, J. and RITTER, H. (2004): Data Mart based Research in Heart Surgery: Challenges and Benefit. In: Medinfo, 8–12
2. BLAKE, C. and MERZ, C.J. (2000): UCI Repository of Machine Learning Databases. Department of Information and Computer Science, University of California, Irvine
3. BREIMAN, L. and FRIEDMAN, J.H. and OLSHEN, R.A. and STONE, C.J. (1984): Classification and Regression Trees. Wadsworth International, Monterey, CA
4. CHAN, K.Y. and LOH, W.Y. (2004): LOTUS: An algorithm for building accurate and comprehensible logistic regression trees. Journal of Computational and Graphical Statistics, 13(4), 826–852
5. CHAUDHURI, P. and LO, W.D. and LOH, W.Y. and YANG, C.C. (1995): Generalized regression trees. Statistica Sinica, 5(2), 641–666
6. ELOMAA, T. and ROUSU, J. (1997): On the Well-Behavedness of Important Attribute Evaluation Functions. In: Scandinavian Conference on AI, 95–106
7. FAYYAD, U.M. and IRANI, K.B. (1993): Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In: Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI), 1022–1027
8. LIM, T.S. and LOH, W.Y. and SHIH, Y.S. (2000): A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classification Algorithms. Machine Learning, 40(3), 203–228
9. QUINLAN, J.R. (1992): Learning with Continuous Classes. In: 5th Australian Joint Conference on Artificial Intelligence, 343–348
10. QUINLAN, J.R. (1993): C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA
11. R DEVELOPMENT CORE TEAM (2003): R: A language and environment for statistical computing. Vienna, Austria
12. ROQUES, F. and NASHEF, S.A. and MICHEL, P. and GAUDUCHEAU, E. and DE VINCENTIIS, C. et al. (1999): Risk factors and outcome in European cardiac surgery: analysis of the EuroSCORE multinational database of 19030 patients. European Journal of Cardio-thoracic Surgery, 15, 816–823
Latent Class Analysis and Model Selection

José G. Dias

Department of Quantitative Methods, ISCTE – Instituto Superior de Ciências do Trabalho e da Empresa, Av. das Forças Armadas, Lisboa 1649–026, Portugal
Abstract. This paper discusses model selection for latent class (LC) models. A large experimental design is set that allows the comparison of the performance of different information criteria for these models, some compared for the first time. Furthermore, the level of separation of latent classes is controlled using a new procedure. The results show that AIC3 (Akaike information criterion with 3 as penalizing factor) outperforms other model selection criteria for LC models.
1 Introduction
In recent years latent class (LC) analysis has been popularized as a clustering technique (Clogg, 1995). Let y = (y_1, ..., y_n) denote a sample of size n; J represents the number of manifest or observed variables; and datum y_ij indicates the observed value for variable j in observation i, with i = 1, ..., n, j = 1, ..., J. The finite mixture model with S components for y_i = (y_{i1}, ..., y_{iJ}) is defined by $f(y_i; \varphi) = \sum_{s=1}^{S} \pi_s f_s(y_i; \theta_s)$, where the mixing proportions π_s are positive and sum to one; θ_s denotes the parameters of the conditional distribution of component s, defined by f_s(y_i; θ_s); π = (π_1, ..., π_{S−1}), θ = (θ_1, ..., θ_S), and ϕ = (π, θ). For nominal data, Y_j has L_j categories, y_ij ∈ {1, ..., L_j}. From the local independence assumption – the J manifest variables are independent given the latent variable –, $f_s(y_i; \theta_s) = \prod_{j=1}^{J} \prod_{l=1}^{L_j} \theta_{sjl}^{I(y_{ij}=l)}$, where θ_sjl is the probability that observation i belonging to component s falls in category l of variable j. Category l is associated with the binary variable defined by the indicator function I(y_ij = l) = 1, and 0 otherwise. Therefore, the LC model has density

$$f(y_i; \varphi) = \sum_{s=1}^{S} \pi_s \prod_{j=1}^{J} \prod_{l=1}^{L_j} \theta_{sjl}^{I(y_{ij}=l)},$$

which defines a mixture of conditionally independent multinomial distributions. The numbers of free parameters in the vectors π and θ are $d_\pi = S - 1$ and $d_\theta = S \sum_{j=1}^{J} (L_j - 1)$, respectively. The total number of free parameters is $d_\varphi = d_\pi + d_\theta$. The likelihood and log-likelihood functions are $L(\varphi; y) = \prod_{i=1}^{n} f(y_i; \varphi)$ and $\ell(\varphi; y) = \log L(\varphi; y)$, respectively. It is straightforward to obtain the maximum likelihood (ML) estimates of ϕ using the EM algorithm (Everitt, 1984).
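The EM iteration for this model is short enough to sketch. The following is a minimal illustration in our own notation (zero-based categories, random Dirichlet initialisation), not the MATLAB implementation used in the paper:

import numpy as np

def lc_em(y, n_categories, S, n_iter=200, seed=0):
    """EM for the latent class model: y is an (n, J) integer array with
    categories in {0, ..., L_j - 1}. Returns mixing proportions pi, the
    conditional category probabilities theta, and the log-likelihood
    evaluated at the last E-step."""
    rng = np.random.default_rng(seed)
    n, J = y.shape
    pi = np.full(S, 1.0 / S)
    theta = [[rng.dirichlet(np.ones(L)) for L in n_categories] for _ in range(S)]
    for _ in range(n_iter):
        # E-step: responsibilities p(component s | y_i)
        log_r = np.tile(np.log(pi), (n, 1))
        for s in range(S):
            for j in range(J):
                log_r[:, s] += np.log(theta[s][j][y[:, j]])
        log_norm = np.logaddexp.reduce(log_r, axis=1)
        r = np.exp(log_r - log_norm[:, None])
        # M-step: update mixing proportions and category probabilities
        pi = r.mean(axis=0)
        for s in range(S):
            for j in range(J):
                counts = np.array([r[y[:, j] == l, s].sum()
                                   for l in range(n_categories[j])])
                theta[s][j] = counts / r[:, s].sum()
    return pi, theta, log_norm.sum()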
Despite the increasingly widespread application of LC analysis, deciding on the number of latent classes to retain remains an important topic of research. From an inferential viewpoint, hypothesis testing for model selection based on the likelihood function preferably utilizes likelihood ratio tests (LRTs), which, under regularity conditions, have a simple asymptotic theory (Wilks, 1938). However, these regularity conditions are not satisfied in the LC model. For example, in testing the hypothesis of a single latent class against more than one, the mixing proportion under H0 is on the boundary of the parameter space, and consequently the LRT statistic is not asymptotically chi-squared distributed. In an attempt to overcome this limitation, information criteria have become popular as a useful approach to model selection. Some, such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), have been widely used. The performance of information criteria has been studied extensively in the finite mixture literature, mostly focused on finite mixtures of Gaussian distributions (McLachlan and Peel, 2000). Lin and Dayton (1997) and Hoijtink (2001) provided results for the LC model from a frequentist and Bayesian perspective, respectively. However, little is known about the performance of these and other criteria for the LC model. In Section 2, we review the literature on model selection criteria. In Section 3, we describe the design of the Monte Carlo study. In Section 4, we present and discuss the results. The paper concludes with a summary of main findings, implications, and suggestions for further research.
2 Information Criteria
Akaike's information criterion (AIC) is based on the principle of maximum likelihood and the negative entropy or Kullback-Leibler distance between the true density and the estimated density (Akaike, 1974). AIC chooses the S which minimizes

$$\mathrm{AIC} = -2\,\ell(\hat\varphi; y) + 2 d_\varphi,$$

where $\hat\varphi$ is the ML estimate, $\ell(\hat\varphi; y)$ is the log-likelihood value at the ML estimate and $d_\varphi$ is the number of independent parameters (Akaike, 1974). It can be a drastically negatively biased estimate of the expected Kullback-Leibler information of the fitted model (Hurvich and Tsai, 1989). Bozdogan (1993) suggested the modified AIC (AIC3) criterion using 3 instead of 2 as penalizing factor. The consistent AIC criterion (CAIC; Bozdogan, 1987) chooses the S which minimizes

$$\mathrm{CAIC} = -2\,\ell(\hat\varphi; y) + d_\varphi (\log n + 1).$$

From the theory of complexity, Bozdogan (1988) introduced the informational complexity (ICOMP) criterion. ICOMP chooses the S which minimizes

$$\mathrm{ICOMP} = -2\,\ell(\hat\varphi; y) + d_\varphi \log\!\left[ d_\varphi^{-1}\, \mathrm{tr}\, I^{-1}(\hat\varphi) \right] - \log \left| I^{-1}(\hat\varphi) \right|,$$

where $\mathrm{tr}\, I^{-1}(\hat\varphi)$ and $|I^{-1}(\hat\varphi)|$ are the trace and determinant of $I^{-1}(\hat\varphi)$, respectively, and $I(\hat\varphi)$ is the expected information matrix at the ML estimate ($\varphi = \hat\varphi$). The expected information matrix has to be estimated, usually by approximating it by the observed information matrix $I(\varphi; y)$, the negative of the Hessian of the log-likelihood function at $\hat\varphi$.

From a different theoretical background, the Bayesian information criterion (BIC), proposed by Schwarz (1978), utilizes the marginal likelihood $p(y) = \int L(\varphi; y)\, p(\varphi)\, d\varphi$, which is the weighted average of the likelihood values. Using the Laplace approximation about the posterior mode ($\tilde\varphi$, where $L(\varphi; y)\, p(\varphi)$ is maximized), it results (Tierney and Kadane, 1986)

$$\log p(y) \approx \ell(\tilde\varphi; y) + \log p(\tilde\varphi) - \frac{1}{2} \log |H(\tilde\varphi; y)| + \frac{d_\varphi}{2} \log(2\pi),$$

where $H(\tilde\varphi; y)$ is the negative of the Hessian matrix of the log-posterior function, $\log L(\varphi; y)\, p(\varphi)$, evaluated at the modal value $\varphi = \tilde\varphi$. BIC assumes a proper prior, which assigns positive probability to lower dimensional subspaces of the parameter vector. For a very diffuse (almost non-informative, and consequently ignorable) prior distribution, $H(\tilde\varphi; y)$ can be replaced by the observed information matrix $I(\tilde\varphi; y)$. Replacing the posterior mode by the MLE $\hat\varphi$, the approximation becomes

$$\log p(y) \approx \ell(\hat\varphi; y) + \log p(\hat\varphi) - \frac{1}{2} \log |I(\hat\varphi; y)| + \frac{d_\varphi}{2} \log(2\pi). \qquad (1)$$

From the asymptotic behavior of the approximation above, the Bayesian information criterion (BIC) chooses the S which minimizes

$$\mathrm{BIC} = -2\,\ell(\hat\varphi; y) + d_\varphi \log n.$$

Approximation (1) can be used itself, as suggested by McLachlan and Peel (2000). The resulting Laplace-empirical criterion (LEC) chooses the S which minimizes

$$\mathrm{LEC} = -2\,\ell(\hat\varphi; y) - 2 \log p(\hat\varphi) + \log |I(\hat\varphi; y)| - d_\varphi \log(2\pi).$$

For the prior distribution $p(\varphi)$, it is assumed that the parameters are a priori independent, $p(\varphi) = p(\pi) \prod_{s=1}^{S} \prod_{j=1}^{J} p(\theta_{sj})$. The LEC-U and LEC-J criteria are defined by the uniform and Jeffreys' priors for ϕ, respectively:

1. The uniform prior (U), corresponding to Dirichlet distributions with $\pi \sim D(1, ..., 1)$ and $\theta_{sj} \sim D(1, ..., 1)$, is given by
$$\log p(\varphi) = \log[(S-1)!] + S \sum_{j=1}^{J} \log[(L_j - 1)!];$$

2. The Jeffreys' prior (J), corresponding to Dirichlet distributions with $\pi \sim D(1/2, ..., 1/2)$ and $\theta_{sj} \sim D(1/2, ..., 1/2)$, is
$$\log p(\varphi) = S \sum_{j=1}^{J} \log \Gamma\!\left(\frac{L_j}{2}\right) - S \log \Gamma\!\left(\frac{1}{2}\right) \sum_{j=1}^{J} L_j - \frac{1}{2} \sum_{s=1}^{S} \sum_{j=1}^{J} \sum_{l=1}^{L_j} \log \theta_{sjl} + \log \Gamma\!\left(\frac{S}{2}\right) - S \log \Gamma\!\left(\frac{1}{2}\right) - \frac{1}{2} \sum_{s=1}^{S} \log \pi_s.$$
l=1
Experimental Design
In order to evaluate the performance of these criteria, a Monte Carlo (MC) study was set up. In our simulations, all estimated LC models have non-singular estimated information matrices. The number of components in the Monte Carlo study is set to two (S = 2), and models with one, two, and three components are estimated. The Monte Carlo experimental design controls the number of variables and categories, the sample size, the balance of component sizes, and the level of separation of latent classes. The number of variables (J) was set at levels 5 and 8; and the number of categories (L_j) at levels 2 and 3. From preliminary analyses with L_j = 2 and J = 5, we concluded that datasets with a nonsingular estimated information matrix for the three-component LC model with sample sizes smaller than 300 are difficult to obtain. Therefore, the factor sample size (n) assumes the levels 300, 600, 1200, and 2400. The component sizes were generated using the expression $\pi_s = a^{s-1} / \sum_{v=1}^{S} a^{v-1}$, with s = 1, ..., S and a ≥ 1. The value a = 1 yields equal proportions; for larger values of a, component sizes become more unbalanced. In our MC study, we set three levels for a: 1, 2 and 3. Despite the importance of controlling the level of separation of components of mixtures in Monte Carlo studies, the approach has mostly been based on ad hoc procedures, such as randomly generating the parameters of the first component and obtaining the other components by adding successively 0.2 and 0.4 for a low and high level of separation of components, respectively. In this paper, we apply a sampling procedure recently proposed by Dias (2004). The vector θ is generated as follows:
1. Draw θ_1j from the Dirichlet distribution with parameters (φ_1, ..., φ_{L_j}), j = 1, ..., J;
2. Draw θ_sj from the Dirichlet distribution with parameters (δθ_{1j1}, ..., δθ_{1jL_j}), j = 1, ..., J, s = 2, ..., S.
This procedure assumes that the parameters θ of the LC model are sampled from a superpopulation defined by the hyperparameters δ and (φ_1, ..., φ_{L_j}), j = 1, ..., J, and defines a hierarchical (Bayesian) structure. We set (φ_1, ..., φ_{L_j}) = (1, ..., 1), which corresponds to the uniform distribution. For s = 2, ..., S, we have E(θ_sjl) = θ_1jl and Var(θ_sjl) = θ_1jl(1 − θ_1jl)/(δ + 1). With this procedure, on average, all components are centered at the same
parameter value generated from a uniform distribution (first component). The constant δ > 0 controls the level of separation of the components. As δ increases, the separation of the components decreases as a consequence of the decreasing variance. As δ → ∞, all components tend to share the same parameters. Based on results in Dias (2004), three levels of δ give a good coverage of the level of separation of components for the LC model: 0.1 (well-separated components), 1 (moderately-separated components), and 10 (weakly-separated components). These values of δ were set in this study. This MC study sets a $2^2 \times 3^2 \times 4$ factorial design with 144 cells. The main performance measure used is the frequency with which each criterion picks the correct model. For each dataset, each criterion is classified as underfitting, fitting, or overfitting, based on the relation between S and the S estimated by those criteria. Special care needs to be taken before arriving at conclusions based on MC results. In this study, we performed 100 replications within each cell to obtain the frequency distribution of selecting the true model, resulting in a total of 14400 datasets. To avoid local optima, for each number of components (2 and 3) the EM algorithm was repeated 5 times with random starting centers, and the best solution (maximum likelihood value out of the 5 runs) and model selection results were kept. The EM algorithm ran for 1500 iterations, which was enough to ensure the convergence in all cells of the design. The programs were written in MATLAB 6.5.
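The sampling scheme of Dias (2004) described above can be sketched as follows (function and argument names are ours):

import numpy as np

def sample_lc_parameters(L, S, delta, rng=None):
    """L = (L_1, ..., L_J) numbers of categories, S = number of components,
    delta controls the separation (small delta -> well-separated components).
    Returns theta[s][j], a probability vector of length L_j."""
    if rng is None:
        rng = np.random.default_rng(0)
    theta = [[None] * len(L) for _ in range(S)]
    for j, Lj in enumerate(L):
        theta[0][j] = rng.dirichlet(np.ones(Lj))          # phi = (1, ..., 1)
        for s in range(1, S):
            theta[s][j] = rng.dirichlet(delta * theta[0][j])
    return theta

# Example: two components, J = 3 variables with 2, 2 and 3 categories,
# well-separated case (delta = 0.1):
# theta = sample_lc_parameters([2, 2, 3], S=2, delta=0.1)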
4 Results
The key feature of the results is the overall remarkable performance of AIC3 (Table 1). While many criteria often perform satisfactorily, AIC3 finds the true model 72.9% of the time. Overall, ICOMP and AIC perform well with 67.5% and 64.5%, respectively. As in other studies, our results document the tendency of AIC to overfit. ICOMP and LEC-U present the same behavior. BIC, CAIC, and LEC-J tend to choose slightly more parsimonious models than the others, which concurs with results in previous studies. BIC and CAIC underfit 39.1% and 41.3% of the time, respectively. By comparing the LEC-U and LEC-J results, we conclude that LEC is very sensitive to the prior setting.

A second objective of the study was to compare these criteria across the factors in the design. Increasing the sample size almost always improves the performance of traditional information criteria and extensions. However, these criteria showed a tendency to underestimate the true number of components when the sample size decreases. Increasing the number of variables (J) and categories (L_j) mostly reduces the underfitting and improves the performance of the information criteria. For AIC and ICOMP, increasing the number of variables (J) or categories (L_j) is associated with overfitting. In general, the more balanced the component sizes are, the better is the performance of these criteria. Moreover, increasing the balance of component sizes tends to increase overfitting and to reduce underfitting. The level of separation of components has a dramatic effect on the performance of these criteria. For example, BIC finds the correct model in 96.3% of the cases for well-separated components, but in just 3.0% for ill-separated components. This shows that BIC and CAIC can be extremely conservative for ill-separated components. AIC3 has the best success rate in every experimental condition, presenting balanced results across different levels of separation of components. For ill-separated components, AIC outperforms AIC3; however, this criterion tends to overfit. Even for well-separated components AIC presents a very high percentage of overfitting.

Factors                       AIC     AIC3    CAIC    ICOMP   BIC     LEC-U   LEC-J
Sample size (n)
 300      Underfit            23.03   35.25   52.97   26.03   48.67   19.94   58.81
          Fit                 59.25   64.39   47.03   63.28   51.33   40.95   38.75
          Overfit             17.72    0.36    0.00   10.69    0.00   39.11    2.44
 600      Underfit            17.44   30.44   42.42   25.36   40.39   23.00   44.83
          Fit                 62.53   68.92   57.58   62.61   59.61   51.58   54.09
          Overfit             20.03    0.64    0.00   12.03    0.00   25.42    1.08
 1200     Underfit            12.81   24.33   37.53   23.94   35.83   24.44   37.86
          Fit                 65.50   75.11   62.47   68.39   64.17   62.78   61.64
          Overfit             21.69    0.56    0.00    7.67    0.00   12.78    0.50
 2400     Underfit             7.39   16.33   32.42   20.08   31.53   18.75   32.47
          Fit                 70.53   83.20   67.58   75.56   68.47   74.17   67.42
          Overfit             22.08    0.47    0.00    4.36    0.00    7.08    0.11
Number of variables (J)
 5        Underfit            19.75   29.79   44.93   31.00   42.17   23.21   47.46
          Fit                 66.86   69.82   55.07   65.07   57.83   54.30   51.91
          Overfit             13.39    0.39    0.00    3.93    0.00   22.49    0.63
 8        Underfit            10.58   23.39   37.74   16.71   36.04   19.86   39.53
          Fit                 62.04   75.98   62.26   69.85   63.96   60.43   59.03
          Overfit             27.38    0.63    0.00   13.44    0.00   19.71    1.44
Number of categories (L_j)
 2        Underfit            19.39   29.38   41.11   30.40   39.58   24.93   42.83
          Fit                 66.87   69.77   58.89   65.84   60.42   56.85   56.39
          Overfit             13.74    0.85    0.00    3.76    0.00   18.22    0.78
 3        Underfit            10.94   23.81   41.56   17.31   38.63   18.14   44.15
          Fit                 62.03   76.02   58.44   69.08   61.37   57.89   54.56
          Overfit             27.03    0.17    0.00   13.61    0.00   23.97    1.29
Proportions (a)
 1        Underfit            13.56   23.65   38.40   21.04   36.10   19.46   39.92
          Fit                 63.59   75.87   61.60   70.92   63.90   58.71   58.93
          Overfit             22.85    0.48    0.00    8.04    0.00   21.83    1.15
 2        Underfit            14.85   27.06   41.04   23.67   38.96   21.60   43.00
          Fit                 64.77   72.42   58.96   67.33   61.04   57.34   55.77
          Overfit             20.38    0.52    0.00    9.00    0.00   21.06    1.23
 3        Underfit            17.08   29.06   44.56   26.85   42.25   23.54   47.56
          Fit                 65.00   70.42   55.44   64.13   57.75   56.06   51.71
          Overfit             17.92    0.52    0.00    9.02    0.00   20.40    0.73
Level of separation (δ)
 0.1      Underfit             0.44    0.77    5.19    0.88    3.71    0.56    8.27
          Fit                 74.46   98.58   94.81   92.85   96.29   75.86   90.40
          Overfit             25.10    0.65    0.00    6.27    0.00   23.58    1.33
 1        Underfit             1.85    4.17   20.94    4.23   16.56    2.50   25.27
          Fit                 75.42   95.02   79.06   86.54   83.44   74.08   73.88
          Overfit             22.73    0.81    0.00    9.23    0.00   23.42    0.85
 10       Underfit            43.21   74.84   97.87   66.46   97.04   61.54   96.93
          Fit                 43.48   25.10    2.13   22.98    2.96   22.17    2.15
          Overfit             13.31    0.06    0.00   10.56    0.00   16.29    0.92
Overall   Underfit            15.17   26.59   41.33   23.85   39.10   21.53   43.49
          Fit                 64.45   72.90   58.67   67.46   60.90   57.37   55.48
          Overfit             20.38    0.51    0.00    8.69    0.00   21.10    1.03

Table 1. Results of the Monte Carlo study
5 Conclusion
The paper compared the performance of information criteria for finite mixture models for discrete data (LC models). Because most of the information criteria are derived from asymptotics, this extensive Monte Carlo study allowed their assessment for realistic sample sizes. We have included traditional and recently proposed information criteria, some of which are compared for the first time. A large experimental design was set up, controlling sample size, number of variables, number of categories, relative component sizes, and separation of components. The level of separation of components was controlled using a recently proposed procedure. The main finding of this study is the overall good performance of the AIC3 criterion for the LC model. AIC3 has the best overall performance among all the information criteria, with an overall success rate of 72.9% and only minor overfitting (0.51%), outperforming other traditional criteria such as AIC, BIC, and CAIC. Our results are restricted to S = 2 and have to be extended to a larger number of components. However, for a larger number of components parameter estimates may be on the boundary of the parameter space, and it is likely that ICOMP cannot be computed (despite model identifiability) for small sample sizes such as n = 300. Therefore, we presented important results for this sample size. Future research could extend our findings to other finite mixture models for discrete data and more general latent structures. These results suggest that the type of approximation for the marginal likelihood needed for the derivation of the LEC and BIC has to be studied further. Indeed, despite the difficulty of the ill-separated scenario, approximations other than the Laplace may improve the performance of the information criteria, in particular for finite mixtures of discrete distributions.
References AKAIKE, H. (1974): A New Look at Statistical Model Identification, IEEE Transactions on Automatic Control, AC-19, 716–723. BOZDOGAN, H. (1987): Model Selection and Akaike’s Information Criterion (AIC): The General Theory and Its Analytical Extensions, Psychometrika, 52, 345–370. BOZDOGAN, H. (1988): ICOMP: A New Model-selection Criterion. In: H.H. Bock (Ed.): Classification and Related Methods of Data Analysis. Elsevier Science (North Holland), Amsterdam, 599–608. BOZDOGAN, H. (1993): Choosing the Number of Component Clusters in the Mixture-Model Using a New Informational Complexity Criterion of the InverseFisher Information Matrix. In: O. Opitz, B. Lausen, and R. Klar (Eds.): Information and Classification, Concepts, Methods and Applications. Springer, Berlin, 40–54. CLOGG, C.C. (1995): Latent Class Models. In: G. Arminger, C.C. Clogg, and M.E. Sobel (Eds.): Handbook of Statistical Modeling for the Social and Behavioral Sciences. Plenum, New York, 311–353. DIAS, J.G. (2004): Controlling the Level of Separation of Components in Monte Carlo Studies of Latent Class Models. In: D. Banks, L. House, F.R. McMorris, P. Arabie, and W. Gaul (Eds.): Classification, Clustering, and Data Mining Applications. Springer, Berlin, 77–84. EVERITT, B.S. (1984): A Note on Parameter Estimation for Lazarsfeld’s Latent Class Model Using the EM Algorithm, Multivariate Behavioral Research, 19, 79–89. HOIJTINK, H. (2001): Confirmatory Latent Class Analysis: Model Selection Using Bayes Factors and (Pseudo) Likelihood Ratio Statistics, Multivariate Behavioral Research, 36, 563–588. HURVICH, C.M. and TSAI, C.-L. (1989): Regression and Time Series Model Selection in Small Samples, Biometrika, 76, 297–307. LIN, T.H. and DAYTON, C.M. (1997): Model selection information criteria for nonnested latent class models, Journal of Educational and Behavioral Statistics, 22, 249–264. MCLACHLAN, G.J. and PEEL, D. (2000): Finite Mixture Models. John Wiley & Sons, New York. SCHWARZ, G. (1978): Estimating the Dimension of a Model, Annals of Statistics, 6, 461–464. TIERNEY, L. and KADANE, J. (1986): Accurate Approximations for Posterior Moments and Marginal Densities, Journal of the American Statistical Association, 81, 82–86. WILKS, S.S. (1938), The Large Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses, Annals of Mathematical Statistics, 9, 60–62.
An Indicator for the Number of Clusters: Using a Linear Map to Simplex Structure

Marcus Weber1, Wasinee Rungsarityotin2, and Alexander Schliep2

1 Zuse Institute Berlin ZIB, Takustraße 7, D-14195 Berlin, Germany
2 Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestraße 63–73, D-14195 Berlin, Germany
Abstract. The problem of clustering data can be formulated as a graph partitioning problem. In this setting, spectral methods for obtaining optimal solutions have received a lot of attention recently. We describe Perron Cluster Cluster Analysis (PCCA) and establish a connection to spectral graph partitioning. We show that in our approach a clustering can be efficiently computed by mapping the eigenvector data onto a simplex. To deal with the prevalent problem of noisy and possibly overlapping data we introduce the Min-chi indicator which helps in confirming the existence of a partition of the data and in selecting the number of clusters with quite favorable performance. Furthermore, if no hard partition exists in the data, the Min-chi can guide in selecting the number of modes in a mixture model. We close with showing results on simulated data generated by a mixture of Gaussians.
1 Introduction
In data analysis, it is a common first step to detect groups of data, or clusters, sharing important characteristics. The relevant body of literature with regard to methods as well as applications is vast (see Hastie et al. (2001) or Jain and Dubes (1988) for an introduction). There are a number of ways to obtain a mathematical model for the data and the concept of similarity between data points, so that one can define a measure of clustering quality and design algorithms for finding a clustering maximizing this measure. The simplest, classical approach is to model data points as vectors from R^n. Euclidean distance between points measures their similarity, and the average Euclidean distance of data points to the centroid of the groups they are assigned to is one natural measure for the quality of a clustering. The well-known k-means algorithm (Jain and Dubes (1988)) will find a locally optimal solution in that setting. One of the reasons why the development of clustering algorithms did not cease after k-means is the many intrinsic differences of data sets to be analyzed. Often the measure of similarity between data points might not fulfill all the properties of a mathematical distance function, or the measure of clustering quality has to be adapted, as for example the ball-shape assumption inherent in standard k-means does not often match the shape of clusters in real data.
An issue which is usually, and unfortunately, of little concern, is whether there is a partition of the data into a number of groups in the first place and how many possible groups the data support. Whenever we apply a clustering algorithm which computes a k-partition, this is an assumption we implicitly make for the data set we analyze. The problem is more complicated when k is unknown. In the statistical literature, McLachlan et al. (1988) suggested mixture models as alternatives for problem instances where clusters overlap. We address the problem of finding clusters in data sets for which we do not require the existence of a k-partition. The model we will use is a similarity graph. More specifically, we have G = (V, E), where V = {1, . . . , n} is the set of vertices corresponding to the data points. We have an edge {i, j} between two vertices iff we can quantify their similarity, which is denoted w(i, j). The set of all edges is E and the similarities can be considered as a function $w: E \to \mathbb{R}_0^+$. The problem of finding a k-partition of the data can now be formulated as the problem of partitioning V into k subsets, $V = \cup_{i=1}^{k} V_i$. Let us consider the problem of finding a 2-partition, say V = A ∪ B. This can be achieved by removing edges {i, j} from E for which i ∈ A and j ∈ B. Such a set of edges which leaves the graph disconnected is called a cut, and the weight function allows us to quantify cuts by defining their weight or cut-value, $\mathrm{cut}(A, B) := \sum_{\{i,j\} \in E,\, i \in A,\, j \in B} w(i, j)$.
A natural objective is to find a cut of minimal value. A problem with this objective function is that sizes of partitions do not matter. As a consequence, using min-cut will often compute very unbalanced partitions, effectively splitting V into one single vertex, or a small number of vertices, and one very large set of vertices. We can alleviate this problem by evaluating cuts differently. Instead of just considering partition sizes one can also consider the similarity within partitions, for which we introduce the so-called association value of a vertex set A, denoted by $a(A) = a(A, V) := \sum_{i \in A,\, j \in V} w_{ij}$. Defining the normalized cut by

$$\mathrm{Normcut}(A, B) = \frac{\mathrm{cut}(A, B)}{a(A, V)} + \frac{\mathrm{cut}(A, B)}{a(B, V)},$$

we observe that the cut value is now measured in terms of the similarity of each partition to the whole graph. Vertices which are more similar to many data points are harder to separate. As we will see, the normalized cut is well suited as an objective function for minimizing because it keeps the relative size and connectivity of clusters balanced. The min-cut problem can be solved in polynomial time for k = 2. Finding k-way cuts in arbitrary graphs for k > 2 is proven NP-hard by Dahlhaus et al. (1994). For the two other cut criteria, already the problem of finding a 2-way cut is NP-complete; for a proof, see the appendix of Shi and Malik (2000).
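The two quantities just defined translate directly into code. The following sketch (our own helper, with a boolean membership vector as the partition encoding) computes cut(A, B) and Normcut(A, B) from a symmetric similarity matrix:

import numpy as np

def normalized_cut(W, A):
    """W: symmetric similarity matrix, A: boolean membership vector of the
    first vertex set; B is its complement. Returns (cut value, Normcut)."""
    A = np.asarray(A, dtype=bool)
    B = ~A
    cut = W[np.ix_(A, B)].sum()      # sum of cross-partition similarities
    assoc_A = W[A, :].sum()          # a(A, V)
    assoc_B = W[B, :].sum()          # a(B, V)
    return cut, cut / assoc_A + cut / assoc_B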
However, we can find good approximate solutions to the 2-way normalized cut by considering a relaxation of the problem, see Kannan et al. (1999) and Shi and Malik (2000). Instead of discrete assignments to partitions, consider a continuous indicator for membership. Let $D = \mathrm{diag}(d(1), \ldots, d(n))$ with $d(i) = \sum_{j \in V,\, j \neq i} w(i, j)$. The relaxation of the 2-way normalized cut problem can be formulated as

$$(D - W)x = \lambda D x. \qquad (1)$$
For solving the 2-partition problem, we are interested in the eigenvector x2 for the second-smallest eigenvalue, compare Kannan et al. (1999) and Shi and Malik (2000). In particular, we will inspect its sign structure and use the sign of an entry x2 (i) to assign vertex i to one or the other vertex set. Similarly, for direct computation of k-partitions one can use all k eigenvectors to obtain k-dimensional indicator vectors. Previous approaches in Shi and Malik (2000) and Ng et al. (2002) relied on k-means clustering of the indicator vectors to obtain a k-partition in this space. In the next section, we will propose an indicator for the amount of overlapping in W which helps in deciding whether the recursive spectral method is applicable. Subsequently we will introduce an alternative approach to finding k-partitions even in absence of a perfect block structure. We first rephrase the problem equivalently in terms of transition matrices of Markov-chains and use perturbation analysis to arrive at the main result, a geometric interpretation of the eigenvector data as a simplex. This allows to devise an assignment of data into overlapping groups and a measure for the deviation from the simplex structure, the so-called Min-chi value. The advantages of our method are manifold: there are fewer requirements on the similarity measure, it is effective even for high-dimensional data and foremost, with our robust diagnostic we can assess whether a unique k-partition exists. The immediate application value is two-fold. On one hand, the Min-chi value indicates whether trying to partition the data into k groups is possible. On the other hand, if clusters arise from a mixture model, the indicator can be used as a guide for deciding on the number of modes in a mixture model. We close with showing results on simulated data generated by a mixture of Gaussians.
2 Clustering Method

2.1 Simplex Structure and Perturbation Analysis
One can transform equation (1) into an eigenvalue problem for a stochastic matrix:

$$(D - W)x = \lambda Dx \;\Leftrightarrow\; (I - D^{-1}W)x = \lambda x \;\Leftrightarrow\; D^{-1}Wx = \underbrace{(1 - \lambda)}_{=\,\bar\lambda}\, x.$$
In this equation $T = D^{-1}W$ is a stochastic matrix and its eigenvalues $1 \ge \bar\lambda \ge -1$ are real valued, because W is symmetric. If W has a perfect block diagonal structure with k blocks, then clustering should lead to k perfectly separated index sets $C_1, \ldots, C_k$. With W the matrix T also has perfect block diagonal structure, and due to the row sum of stochastic matrices the characteristic vectors1 $\chi_1, \ldots, \chi_k$ of the sets $C_1, \ldots, C_k$ are eigenvectors of T for the k-fold maximal eigenvalue $\bar\lambda_1 = \ldots = \bar\lambda_k = 1$. The numerical eigenvector computation in this case provides an arbitrary basis $X = [x_1, \ldots, x_k]$ of the eigenspace corresponding to the eigenvalue $\bar\lambda = 1$, i.e. with $\chi = [\chi_1, \ldots, \chi_k]$ there is a transformation matrix $A \in \mathbb{R}^{k \times k}$ with

$$\chi = XA. \qquad (2)$$
In other words: If one wants to find the clustering of a perfect block diagonal matrix T , one has to compute the transformation matrix A which transforms the eigenvector data into characteristic vectors. If T! has almost block structure it can be seen as an -perturbed stochastic matrix of the ideal case T . For ! ¯ = 1 degenerates into one Perron eigenvalue λ ¯1 = 1 T! the k-fold eigenvalue λ ! ! ¯2, . . . , λ ¯ k near with a constant eigenvector and a cluster of k − 1 eigenvalues λ 1, the so-called Perron cluster. It has been shown, that there is a transformation matrix A! such that χ−χ ! = O( 2 ) ! A, ! see Deuflhard and Weber (2005). If the result χ for χ !=X ! shall be interpretable, then the vectors χ !1 , . . . , χ !k have to be “close to” characteristic: I.e., they have to be nonnegative and provide a partition of unity. In other words: The rows of χ ! as points in Rk have to lie inside a simplex spanned by the k unit vectors. If clustering is possible, then additionally, for the reason of maximal separation of the clusters, for every almost characteristic vector χ !i there should be an entry l with χ !i (l) = 1. It has been shown, that there is always a possibility to meet three of the four conditions (i) nonnegativity, (ii) partition ! and (iv) 1-entry in every vector. If all four conditions of unity, (iii) χ ! = XA, hold, the solution χ ! is unique, see Deuflhard and Weber (2005). In this case the eigenvector data itself spans a simplex. This simplex can be found via the inner simplex algorithm, see Weber and Galliat (2002) and Deuflhard and Weber (2005). The result χ ! of this algorithm always meets the conditions (ii)(iv), but the solution may have negative components. The absolute value of the minimal entry of χ ! is called the Min-chi indicator. As the uniqueness of the clustering increases, Min-chi goes to zero. Due to perturbation analysis it has been shown, that Min-chi= O( 2 ), see Weber (2004). 1
¹ A characteristic vector χ_i of an index subset C_i satisfies χ_i(l) = 1 iff l ∈ C_i, and χ_i(l) = 0 elsewhere.
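As a concrete illustration, the following is a minimal Python sketch of the inner simplex construction and the Min-chi indicator described above. The vertex search (pick the point farthest from the centroid, then repeatedly project out the chosen directions) follows the usual PCCA presentation; it is an assumption of this sketch and may differ in detail from the published algorithm of Weber and Galliat (2002).

```python
import numpy as np

def inner_simplex(X):
    """Map eigenvector data X (n rows viewed as points in R^k) to almost
    characteristic vectors chi = X A and return (chi, min_chi).

    Idea: choose k rows of X that span a simplex; with A the inverse of
    the matrix of these rows, the chosen rows become unit vectors in chi."""
    n, k = X.shape
    Y = X - X.mean(axis=0)
    # First vertex: the row farthest from the centroid of the data.
    vertices = [int(np.argmax(np.linalg.norm(Y, axis=1)))]
    # Remaining vertices: remove the direction of the last chosen vertex
    # (Gram-Schmidt style) and take the row with the largest residual norm.
    for _ in range(1, k):
        v = Y[vertices[-1]] / np.linalg.norm(Y[vertices[-1]])
        Y = Y - np.outer(Y @ v, v)
        vertices.append(int(np.argmax(np.linalg.norm(Y, axis=1))))
    A = np.linalg.inv(X[vertices])       # transformation matrix as in (2)
    chi = X @ A                          # rows: almost characteristic vectors
    min_chi = abs(min(chi.min(), 0.0))   # Min-chi indicator
    return chi, min_chi
```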
2.2 Implementation: Min-chi in Practice
Given an n × m data matrix, we compute pairwise distances with a symmetric distance function w : R^m × R^m → R_0^+ and construct the n × n distance matrix A. We then convert the distances into a similarity matrix W = exp(−βA), where β is a scaling parameter, and define the stochastic matrix T = D^{-1}W. We can use the error measure Min-chi to determine a locally optimal choice of the number of clusters. Given the matrix T, our method determines a number of clusters k as follows (see the sketch below).

The Mode Selection Algorithm
1. Choose k_min, ..., k_max such that the optimal k could lie in this interval.
2. Iterate over k = k_min, ..., k_max and, for each trial k, calculate χ for the cluster assignment via the Inner Simplex algorithm and Min-chi as an indicator for the number of clusters.
3. Choose the maximum k for which Min-chi < Threshold as the number of clusters.

Selection of the threshold depends on the value of β (or the variance), which controls the perturbation from the perfect block structure of T. As a rule, when β is large the threshold can be small, because T is almost block-diagonal.
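A sketch of this pipeline in Python, reusing the `inner_simplex` helper from the previous sketch; the function name `mode_selection`, the use of `numpy.linalg.eig`, and the Euclidean distance are illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def mode_selection(data, beta, kmin, kmax, threshold):
    """Return the largest k in [kmin, kmax] with Min-chi < threshold."""
    A = cdist(data, data)                    # n x n Euclidean distance matrix
    W = np.exp(-beta * A)                    # similarity matrix
    T = W / W.sum(axis=1, keepdims=True)     # stochastic matrix T = D^{-1} W
    evals, evecs = np.linalg.eig(T)          # eigenvalues are real in theory
    order = np.argsort(-evals.real)
    evecs = evecs[:, order].real             # columns sorted by eigenvalue
    best_k = None
    for k in range(kmin, kmax + 1):
        chi, min_chi = inner_simplex(evecs[:, :k])
        if min_chi < threshold:
            best_k = k                       # keep the largest admissible k
    return best_k
```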
3 Results and Discussion
We compare the Min-chi indicator with the Bouldin index defined in Jain and Dubes (1988), applied to the result of the Inner Simplex algorithm described in detail by Weber and Galliat (2002) and Deuflhard and Weber (2005). Given a partition into k clusters by a clustering algorithm, one first defines the measure of within-to-between cluster spread for the ith cluster as

R_i = \max_{j \neq i} \frac{e_j + e_i}{m_{ij}},

where e_i is the average distance within the ith cluster and m_{ij} is the Euclidean distance between the means. The Bouldin index for k is

DB(k) = \frac{1}{k} \sum_{i=1}^{k} R_i.

According to the Bouldin indicator, the number of clusters is k* such that

k^* = \operatorname*{argmin}_{k_{\min} \le k \le k_{\max}} DB(k).
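A small sketch of the Bouldin index for a given partition; following the common Davies–Bouldin definition, e_i is taken here as the average distance of the cluster points to the cluster mean, which is an assumption about the exact definition used above.

```python
import numpy as np

def bouldin_index(data, labels):
    """DB(k) = (1/k) * sum_i max_{j != i} (e_i + e_j) / m_ij."""
    classes = np.unique(labels)
    means = np.array([data[labels == c].mean(axis=0) for c in classes])
    e = np.array([np.linalg.norm(data[labels == c] - means[i], axis=1).mean()
                  for i, c in enumerate(classes)])
    R = []
    for i in range(len(classes)):
        R.append(max((e[i] + e[j]) / np.linalg.norm(means[i] - means[j])
                     for j in range(len(classes)) if j != i))
    return float(np.mean(R))
```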
In the examples of Fig. 1 we draw a sample of 900 points from three spherical Gaussians with different variances and means: 180 points with mean (−1, 0), and 360 points each with means (2, 0) and (2, 3). For four different variances, 0.15, 0.3, 0.6 and 1.2, we compute the Bouldin index and the Min-chi indicator for k_min = 2 and k_max = 10.
[Fig. 1 consists of four rows of panels, one per variance (0.15, 0.3, 0.6, 1.2): the left panel of each row shows the sampled points and the right panel plots the Min-chi and Bouldin indicators against the number of clusters.]

Fig. 1. Simulated data: mixture of three spherical Gaussians with different variances. Comparison of Min-chi with the Bouldin index.
For a low variance (Fig. 1(a)) both indicators give the same result k = 3, but for increasing variance (Fig. 1(c) and Fig. 1(e)) the Bouldin indicator fails, whereas the Min-chi indicator still finds three clusters. For very high variance (Fig. 1(g)), the Bouldin index finds 9 clusters. In this experiment, the Min-chi indicator is not unique: depending on the threshold, two or three clusters are indicated. This behaviour becomes worse for increasing variance.
4 Conclusion
In this paper we have shown the relation between Perron Cluster Cluster Analysis (PCCA) and spectral clustering methods. Some changes to PCCA with regard to geometrical clustering have been proposed, e.g. the Min-chi indicator for the number k of clusters. We have shown that this indicator is valuable also for noisy data. It evaluates the deviation of the eigenvector data from a simplex structure and therefore indicates the possibility of a "fuzzy" clustering, i.e. a clustering with a certain number of almost characteristic functions. A simple linear mapping of the eigenvector data has to be performed in order to compute these almost characteristic functions. Therefore, the cluster algorithm is easy to implement and fast in practice. We have also shown that PCCA does not need assumptions as strong as those of other spectral graph partitioning methods, because it uses the full eigenvector information and not only signs or fewer than k eigenvectors.
References

DAHLHAUS, E., JOHNSON, D.S., PAPADIMITRIOU, C.H., SEYMOUR, P.D. and YANNAKAKIS, M. (1994): The complexity of multiterminal cuts. SIAM J. Comput., 23(4):864–894.
DEUFLHARD, P. and WEBER, M. (2005): Robust Perron Cluster Analysis in Conformation Dynamics. Lin. Alg. App., Special Issue on Matrices and Mathematical Biology, 398c:161–184.
HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. (2001): The Elements of Statistical Learning. Springer, Berlin.
JAIN, A.K. and DUBES, R.C. (1988): Algorithms for clustering data. Prentice Hall, Englewood Cliffs.
KANNAN, R., VEMPALA, S. and VETTA, A. (1999): On Clusterings: Good, Bad and Spectral. Proceedings of IEEE Foundations of Computer Science.
MCLACHLAN, G.J. and BASFORD, K.E. (1988): Mixture Models: Inference and Applications to Clustering. Marcel Dekker, Inc., New York, Basel.
NG, A.Y., JORDAN, M. and WEISS, Y. (2002): On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems 14.
SHI, J. and MALIK, J. (2000): Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905.
WEBER, M. (2004): Clustering by using a simplex structure. Technical Report ZR-04-03, Zuse Institute Berlin.
WEBER, M. and GALLIAT, T. (2002): Characterization of transition states in conformational dynamics using Fuzzy sets. Technical Report 02–12, Zuse Institute Berlin (ZIB).
On the Use of Some Classification Quality Measure to Construct Mean Value Estimates Under Nonresponse

Wojciech Gamrot

Department of Statistics, University of Economics, Bogucicka 14, 40-226 Katowice, Poland

Abstract. Several procedures have been developed for estimating the mean value of a population characteristic under nonresponse. Usually such estimators use available auxiliary information as a basis for the nonresponse correction. Some of them rely on classification procedures which divide the population under study into subsets of units that are similar to sample respondents or sample nonrespondents. This allows one to approximate the proportions of the respondent and nonrespondent strata in the population. Nonrespondents are then subsampled and estimates of population parameters are constructed. Such estimators are more accurate than the standard estimator for a two-phase sample when the distributions of auxiliary variables in the respondent and nonrespondent strata differ significantly. However, when these distributions are similar the improvement disappears and the classification-based estimator may be less accurate than the standard one. In this paper another mean value estimator is proposed in order to eliminate this disadvantage. It is constructed as a combination of a standard (unbiased) two-phase estimator and a classification-based estimator. The weights of this combination are functions of some classification quality measure. The proposed mean value estimator should behave like a classification-based estimator when the auxiliary characteristics seem to be useful for classification, and like a standard estimator otherwise. The results of Monte Carlo simulation experiments aimed at assessing the properties of the proposed combined estimator are presented.
1 Introduction: Two-phase Sampling
Assume that the mean value Ȳ of some characteristic Y in a finite and fixed population U of size N is to be estimated. Assume that nonresponse occurs in the survey and that the nonresponse mechanism is deterministic. This means that the population can be divided into two disjoint strata U_1 and U_2, of unknown sizes N_1 and N_2 respectively, such that population units belonging to U_1 always provide the required data if contacted, whereas units from U_2 always refuse to co-operate. Let us denote W_1 = N_1/N and W_2 = N_2/N. The survey is carried out in two phases. In the first phase a simple random sample s of size n is drawn without replacement from the population, according to the sampling design

P_1(s) = \binom{N}{n}^{-1}.   (1)
The sample s is divided into two disjoint random sets s_1 ⊂ U_1 and s_2 ⊂ U_2 with sizes 0 ≤ n_1 ≤ n and 0 ≤ n_2 ≤ n satisfying n_1 + n_2 = n. The sizes of both subsets are random, but observable, variables having a hypergeometric distribution. During a contact attempt units from the set s_1 respond and units from s_2 fail to provide answers, so the values of Y in the stratum U_2 remain unknown. Then the second phase of the survey is executed to gather data about them, and a subsample s' of size n' = cn_2 (where 0 < c < 1) is drawn without replacement from s_2, according to the conditional sampling design

P_2(s' | n_2) = \binom{n_2}{n'}^{-1}.   (2)

Subsample units are re-contacted and it is assumed that the data collection procedures applied in the second phase guarantee full response. Let us define
\bar{y}_{s_1} = \frac{1}{n_1} \sum_{i \in s_1} y_i, \qquad \bar{y}_{s'} = \frac{1}{n'} \sum_{i \in s'} y_i,   (3)
and consider the following statistic (see Wywial (2001)):

\bar{y}(\alpha) = \alpha \bar{y}_{s_1} + (1 - \alpha)\bar{y}_{s'}.   (4)
When α = n_1/n the statistic above takes the well-known form

\bar{y}_S = \frac{n_1}{n}\bar{y}_{s_1} + \frac{n_2}{n}\bar{y}_{s'},   (5)
and according to Hansen and Hurwitz (1949) it is an unbiased estimator of the population mean with the variance

V(\bar{y}_S) = \frac{N - n}{Nn}\, S^2(Y) + \frac{W_2}{n}\,\frac{1 - c}{c}\, S_2^2(Y),   (6)

where S^2(Y) is the variance of the characteristic under study in the population U and S_2^2(Y) represents its variance in the stratum U_2. In the following discussion, the estimator (5) will be called the standard estimator.
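A minimal Python sketch of the standard estimator (5); the argument names are illustrative, and `y_s1` and `y_sub` are assumed to hold the observed values for the first-phase respondents and the re-contacted subsample respectively.

```python
import numpy as np

def standard_estimator(y_s1, y_sub, n2):
    """Standard two-phase (Hansen-Hurwitz type) estimator (5):
    (n1/n) * mean over respondents + (n2/n) * mean over the subsample,
    where n1 = len(y_s1) units responded and n2 units refused."""
    n1 = len(y_s1)
    n = n1 + n2
    return (n1 / n) * np.mean(y_s1) + (n2 / n) * np.mean(y_sub)
```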
2 Bayesian Discrimination Function and the Classification Estimator
In general, the constant α in the expression (4) may be computed from the sample, and therefore it may be random. When the vector x_i = [x_{i1}, ..., x_{ik}] containing observations of k auxiliary variables X_1, ..., X_k is observed for each i-th population unit, Wywial (2001) suggests applying some classification method to set the value of the weight α as close as possible to the population respondent fraction W_1. According to this proposition the
population is divided into two subclasses (subsets) Û_1 and Û_2, using classification algorithms. The division is aimed at obtaining classes that are as close (similar) as possible to the strata U_1 and U_2 respectively. These classes constitute estimates of the actual strata, and their sizes N̂_1 and N̂_2 are treated as estimates of the unknown stratum sizes N_1 and N_2. Finally the weight in the expression (4) may be set to α = N̂_1/N. The classes Û_1 and Û_2 may (and usually will) differ from the original strata, but the resulting errors in estimating the stratum proportions may be lower than the errors occurring when these proportions are estimated on the basis of the initial sample respondent (nonrespondent) fraction, according to the standard two-phase estimation procedure. An application of several classification methods to estimate these proportions and construct mean value estimates under nonresponse is discussed by Gamrot (2003a). A comparison of the properties of such classification-based mean value estimators is given by Gamrot (2003b). In order to assign population units to the classes Û_1 and Û_2, we assume multivariate Gaussian distributions of the auxiliary characteristics in both strata and apply the well-known Bayesian discrimination function (see e.g. Duda et al. (2001)):

\hat{f}(x) = \frac{1}{2}\left[(x - \bar{x}_{s_1})' (S^2_{s_1})^{-1} (x - \bar{x}_{s_1}) - (x - \bar{x}_{s_2})' (S^2_{s_2})^{-1} (x - \bar{x}_{s_2}) + \ln\frac{|S^2_{s_1}|}{|S^2_{s_2}|} - \ln\frac{w_1}{w_2}\right],   (7)
where

\bar{x}_{s_1} = \frac{1}{n_1} \sum_{i \in s_1} x_i,   (8)

\bar{x}_{s_2} = \frac{1}{n_2} \sum_{i \in s_2} x_i,   (9)

S^2_{s_1} = \frac{1}{n_1 - 1} \sum_{i \in s_1} (x_i - \bar{x}_{s_1})(x_i - \bar{x}_{s_1})',   (10)

S^2_{s_2} = \frac{1}{n_2 - 1} \sum_{i \in s_2} (x_i - \bar{x}_{s_2})(x_i - \bar{x}_{s_2})',   (11)
and w_1 = n_1/n, w_2 = n_2/n are the initial sample respondent and nonrespondent fractions respectively. Each population unit is classified as belonging to Û_1 when f̂(x) > 0 and as belonging to Û_2 when f̂(x) < 0. It is assumed that the probability of f̂(x) being exactly equal to zero is negligible. Computing the size N̂_1 of the set Û_1 and setting α = N̂_1/N in the expression (4), we obtain another population mean value estimator, which will be referred to as the classification estimator and denoted by the symbol ȳ_C or by the letter C.
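A Python sketch of the classification estimator: a quadratic (Gaussian) discriminant score is computed for every population unit, N̂_1 is the number of units with a positive score, and α = N̂_1/N is plugged into (4). The score below uses the standard Bayes-rule sign convention, so that a positive value favours the respondent stratum; it implements the same idea as (7), but its sign convention and term grouping are an assumption of this sketch.

```python
import numpy as np

def discriminant_scores(x_s1, x_s2, x):
    """Quadratic Gaussian discriminant score for each row of x;
    positive scores favour the respondent stratum U1."""
    n1, n2 = len(x_s1), len(x_s2)
    w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)
    m1, m2 = x_s1.mean(axis=0), x_s2.mean(axis=0)
    S1, S2 = np.cov(x_s1, rowvar=False), np.cov(x_s2, rowvar=False)
    d1, d2 = x - m1, x - m2
    q1 = np.einsum('ij,jk,ik->i', d1, np.linalg.inv(S1), d1)
    q2 = np.einsum('ij,jk,ik->i', d2, np.linalg.inv(S2), d2)
    return (0.5 * (q2 - q1)
            + 0.5 * np.log(np.linalg.det(S2) / np.linalg.det(S1))
            + np.log(w1 / w2))

def classification_estimator(y_s1, y_sub, x_s1, x_s2, x_pop):
    """Classification estimator: alpha = N1_hat / N plugged into (4)."""
    alpha = np.mean(discriminant_scores(x_s1, x_s2, x_pop) > 0)
    return alpha * np.mean(y_s1) + (1 - alpha) * np.mean(y_sub)
```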
3 The Combined Estimator
The classification estimator presented above should be more accurate than the standard estimator if the distributions of the auxiliary variables in the two strata differ significantly, in such a way that the strata can be separated by the discrimination function. However, as has been shown by Gamrot (2003b), when these distributions are similar to each other the classification estimator loses its advantage and is less accurate than the standard one. If these distributions are not known exactly and the classification estimator is used, then one risks obtaining highly inaccurate estimates. In this paper an attempt is made to eliminate this disadvantage. To improve the classification estimator let us consider the following combination of both statistics:
\bar{y}_W = \beta \bar{y}_C + (1 - \beta)\bar{y}_S.   (12)

The weight β is computed from the sample according to the expression

\beta = 1 - 2R,   (13)
where R is the initial sample misclassification rate, evaluated by applying the discrimination function (7) to the sample data. For the Bayesian discrimination function the rate R should fall into the interval [0, 0.5], and consequently β will take values from the interval [0, 1]. The lower the misclassification rate, the better the quality of the classification and the higher the value of β. Consequently, the proposed estimator should adapt to the distributions of the auxiliary variables. When β is high, the classification function will probably divide the population units properly, and a greater weight is attached to the classification estimator, which should be more accurate in this case. When β is low, the discrimination function will probably fail in identifying respondents and nonrespondents, and consequently a greater weight is attached to the standard estimator. In the following simulation study this estimator will be referred to as the combined (or hybrid) estimator and denoted by the letter W.
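A sketch of the combined estimator (12)–(13); the misclassification rate R is assumed to be computed separately, e.g. by applying `discriminant_scores` from the previous sketch to the first-phase sample and comparing the predicted with the observed response behaviour.

```python
def combined_estimator(y_C, y_S, R):
    """Combined estimator (12) with weight beta = 1 - 2R from (13);
    R is the initial-sample misclassification rate in [0, 0.5]."""
    beta = 1.0 - 2.0 * R
    return beta * y_C + (1.0 - beta) * y_S
```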
4 Simulation Results
The stochastic properties of the proposed estimator ȳ_W are difficult to assess analytically. A simulation study was performed to shed some light on its accuracy and to compare it with the standard and classification estimators. Simulations were performed by repeatedly generating the values of the variable under study and three auxiliary variables for every population unit using a pseudo-random number generator for the multivariate Gaussian distribution. The parameters of the generator differed between the respondent and nonrespondent strata. Consequently, the probability distribution of the population characteristics was equivalent to a mixture of the corresponding within-stratum probability
distributions. In this way several complete populations of pseudo-random numbers were generated. Then several sample-subsample pairs were drawn from each of these populations and mean value estimates were computed for each pair. The approximate mean square error was computed for each estimator on the basis of its empirical distribution. By averaging these approximate MSEs over all populations, the final estimates of the MSE were obtained for each estimator.

The study involved two simulation experiments. Each experiment consisted of several simulations. In each simulation a total of 100 populations were generated and 100 sample-subsample pairs were drawn from each. The subsample size was always equal to 30% of the nonrespondent subset size. All variables were uncorrelated within strata and their within-stratum standard deviations were set to one. The mean values of the characteristic under study in the two strata were equal to 0 and 2 respectively. The auxiliary variable mean value vectors were also different in the two strata and equal to m_1 = [0, 0, 0] and m_2 = [d, d, d] respectively, with d being a constant fixed in advance.

The aim of the first experiment was to investigate how the mean square error of the estimators depends on the distance d between the stratum auxiliary variable distribution centers. The initial sample size was set to n = 100. A sequence of independent simulations was performed for d = 0.0, 0.4, ..., 2.4, with N_1 = 600, N_2 = 400. An identical sequence of simulations was then repeated for stratum sizes N_1 = 500, N_2 = 500. The observed relative efficiency of the estimators (the ratio of the MSE of an estimator to the MSE of the standard estimator) as a function of d is presented in Figure 1 and Figure 2. As can be seen in both charts, for high values of d the relative efficiency of both the classification estimator and the combined estimator takes values below one. This means that both estimators are more accurate in terms of MSE than the standard estimator if the distance d between the mean value vectors m_1 and m_2 is large enough. In fact, for large d the MSE of the classification estimator and that of the combined estimator are approximately the same. When the distance d decreases, the relative efficiency of both estimators grows to exceed one, which means that when d is low enough both estimators are less accurate than the standard estimator. However, if the distributions of the auxiliary variables in the respondent and nonrespondent strata are similar, then the relative efficiency of the combined estimator is significantly lower than that of the classification estimator. This means that for small d the combined estimator is much more accurate than the classification estimator.

The objective of the second experiment was to investigate how the relative efficiency of the combined estimator depends on the initial sample size n. The distance between the stratum mean value vectors was set to d = 0.8 and the stratum sizes were set to N_1 = 600, N_2 = 400. Simulations were executed independently for n = 40, 60, ..., 200. The observed relative efficiency of the estimators as a function of n is presented in Figure 3.
Fig. 1. The relative efficiency of estimators as a function of the difference d between stratum mean values, for N1 = 600, N2 = 400.
Fig. 2. The relative efficiency of estimators as a function of the difference d between stratum mean values, for N1 = 500, N2 = 500.
Fig. 3. The dependence between initial sample size n and the relative efficiency of estimators, for N1 = 600, N2 = 400, d = 0.8
The chart in Figure 3 shows that the relative efficiency of the combined estimator and of the classification estimator is lower than one (both estimators are more accurate than the standard estimator) and that it falls with increasing initial sample size n, which means that the advantage of both estimators over the standard one grows with n. The combined estimator had a lower MSE than the classification estimator for every value of n tested in this experiment.
5 Conclusions
In this paper a mean value estimator under two-phase sampling for nonresponse was proposed. The new estimator is constructed as a combination of the well-known standard estimator proposed by Hansen and Hurwitz (1949) and a classification estimator proposed by Wywial (2001). The weights of this combination depend on the sample misclassification rate. The properties of the proposed estimator were investigated by Monte Carlo simulation. The simulation results suggest that the combined estimator is at least as accurate as the classification estimator. Furthermore, when the distributions of the auxiliary characteristics in the respondent and nonrespondent strata are similar or the same, the combined estimator is much more accurate than the classification estimator. Consequently, the proposed estimator is an attractive alternative to the classification estimator.
References

DUDA, R.O., HART, P.E. and STORK, D.G. (2001): Pattern Classification. Wiley, New York.
GAMROT, W. (2003a): On Application of Some Discrimination Methods to Mean Value Estimation in the Presence of Nonresponse. In: J. Wywial (Ed.): Metoda reprezentacyjna w Badaniach Ekonomiczno-Spolecznych, Katowice, 37–50.
GAMROT, W. (2003b): A Monte Carlo Comparison of Some Two-phase Sampling Strategies Utilizing Discrimination Methods in the Presence of Nonresponse. Zeszyty Naukowe, No. 29, University of Economics, Katowice, 41–54.
HANSEN, M.H. and HURWITZ, W.N. (1949): The Problem of Nonresponse in Sample Surveys. Journal of the American Statistical Association, No. 41, 517–529.
WYWIAL, J. (2001): On Estimation of Population Mean in the Case When Nonrespondents Are Present. Prace Naukowe AE Wroclaw, 8, 906, 13–21.
A Wrapper Feature Selection Method for Combined Tree-based Classifiers

Eugeniusz Gatnar

Institute of Statistics, Katowice University of Economics, ul. Bogucicka 14, 40-226 Katowice, Poland
Abstract. The aim of feature selection is to find the subset of features that maximizes the classifier performance. Recently, we have proposed a correlation-based feature selection method for classifier ensembles based on the Hellwig heuristic (CFSH). In this paper we show that a further improvement of the ensemble accuracy can be achieved by combining the CFSH method with the wrapper approach.
1 Introduction
Feature selection is a crucial step in statistical modelling. Its principal aim is to remove irrelevant, redundant or noisy features, because they increase the computation cost. For K features there are 2^K possible feature subsets F_l (l = 1, ..., 2^K). Therefore, searching this space for a subset F* ⊂ F that contains only influential features is extremely time-consuming. There are three approaches to feature selection: the filter approach, the wrapper approach and the ranking approach. Filter methods are the most commonly used in statistics. They eliminate undesirable features prior to model building, on the basis of their statistical properties, e.g. correlation with the dependent variable (representing the class). The wrapper methods (Kohavi and John, 1997) use the classification algorithm itself to evaluate the resulting models. Unfortunately, they are computationally expensive and very slow. In order to reduce the computation cost, different search strategies have been used, e.g. best-first search, tabu search or hill-climbing. In this paper we propose to evaluate only a limited number of the top feature subsets selected with the CFSH method (Gatnar, 2005a). This paper is organised as follows: in Section 2 we give a short description of methods for combining classifiers; Sections 3 and 4 contain a discussion of feature selection methods and wrapper methods in particular. In Section 5 we introduce the Hellwig heuristic and in Section 6 we propose a combined filter-wrapper algorithm. Section 7 contains a brief description of the results of our experiments. The last section contains a short summary.
2 Combining Classifiers
Given a set of training examples T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, we form a set of subsets T_1, T_2, ..., T_M, and a classifier C is fitted to each of them, resulting in a set of base classifiers C_1, C_2, ..., C_M. These are then combined in some way to produce the ensemble C*. The ensemble approach developed in the past decade consists of two steps:

1. Select a set of independent and accurate classifiers.
2. Aggregate them to form an ensemble.

Existing methods, like Bagging, Boosting, RandomForest etc., differ in the way the base classifiers are built and in the way their outputs are combined. Generally, there are three approaches to obtaining a set of component classifiers:

• Manipulating training examples, e.g. Bagging (Breiman, 1996), Boosting (Freund and Schapire, 1997) and Arcing (Breiman, 1998).
• Manipulating input features, e.g. Random subspaces (Ho, 1998) and Random forests (Breiman, 2001).
• Manipulating output values, e.g. Error-correcting output coding (Dietterich and Bakiri, 1995).

Having a set of classifiers, they can be combined using one of the following methods:

• Averaging methods, e.g. average vote and weighted vote.
• Non-linear methods, e.g. majority vote (the component classifiers vote for the most frequent class as the predicted class), maximum vote, the Borda count method, etc.
• Stacked generalisation developed by Wolpert (1992).
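As an illustration of the aggregation step, a minimal majority-vote combiner in Python for a list of fitted base classifiers; the scikit-learn-style `predict` interface is an assumption of this sketch.

```python
import numpy as np

def majority_vote(classifiers, X):
    """Let each base classifier vote and return, for every observation,
    the most frequently predicted class."""
    votes = np.array([clf.predict(X) for clf in classifiers])   # shape (M, n)
    combined = []
    for column in votes.T:                 # all votes for one observation
        classes, counts = np.unique(column, return_counts=True)
        combined.append(classes[np.argmax(counts)])
    return np.array(combined)
```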
3 Feature Selection
It can be observed that improvement of the ensemble accuracy depends on the feature selection method, the quality of the features and the classifier. Usually, the aim of feature selection is to delete noisy or redundant features and reduce the dimensionality of the feature space. Recently, Tumer and Ghosh (1996) proved that the ensemble error depends also on the correlation between members of the ensemble. Then Breiman (2001) developed an upper bound for the classification error of the ensemble. Therefore, the feature selection should also promote diversity among the ensemble members. In general, there are three groups of feature selection methods in statistics:

• filter methods that filter undesirable features out of the data before classification,
• wrapper methods that use the classification algorithm itself to evaluate the usefulness of feature subsets,
• ranking methods that score individual features.

Filter methods are the most common methods used for feature selection in statistics. They eliminate irrelevant features before classification on the basis of their statistical properties, e.g. variance, correlation with the class, etc. The wrapper methods generate sets of features, run the classification algorithm using the features in each set, and evaluate the resulting models on a test set or using cross-validation. The RELIEF algorithm (Kira and Rendell, 1992) uses ranking for feature selection. It draws instances at random, finds their nearest neighbors, and gives higher weights to features that discriminate the instance from neighbors of different classes. Those features whose weights exceed a user-specified threshold are then selected.
4 Applying the Wrapper Method
The filter approach does not take into account the biases of the classification algorithms, i.e. it selects features independently of the model. Some features that are good for classification trees are not necessarily useful for other models, e.g. nearest neighbor. Provost and Buchanan (1995) were perhaps the first to introduce the wrapper approach, as a "search of the bias space". Singh and Provan (1995) applied the wrapper approach to feature selection for Bayesian networks. Kohavi and John (1997) proposed a stepwise wrapper algorithm that starts with an empty set of features and adds single features that improve the accuracy of the resulting classifier. They used the best-first search strategy to find the best feature subset in the search space of 2^K possible subsets. Unfortunately, this method is only useful for data sets with a relatively small number of features and for very fast classification algorithms (e.g. trees). In general, the wrapper methods are computationally expensive and very slow. The search space consists of states representing feature subsets. Stepwise selection is commonly used, i.e. adding to the subset a single feature from a state. The goal of the search is to find the state with the highest evaluation. The size of the search space is O(2^K) for K features, so it is impractical to search the whole space exhaustively. The main problem of the wrapper approach is that of the state space search, so different search techniques have been applied, e.g. best-first search, tabu search, hill-climbing, etc. Best-first search is a robust search method (Ginsberg, 1993), but it is possible that it increases the variance and reduces accuracy (Kohavi and Wolpert, 1996).
5 Hellwig Heuristic
The heuristic proposed by Hellwig (1969) takes into account both the class-feature correlations and the correlations between pairs of variables. The best subset of features is selected from among all possible subsets F_1, F_2, ..., F_L (L = 2^K) as the one that maximises the so-called "integral capacity of information":

H(F_l) = \sum_{j \in F_l} \frac{r_{cj}^2}{\sum_{i \in F_l} |r_{ij}|},   (1)
where r_{cj} is the class-feature correlation and r_{ij} is the feature-feature correlation. The measure (1) often takes high values. In order to eliminate this bias, we apply the normalisation proposed by Walesiak (1987):

H'(F_l) = H(F_l)\,\sqrt{\det(R_l)},   (2)

where R_l is the feature intercorrelation matrix of the subset F_l. The wrapper method performs a sequential search through a ranked set of feature subsets to identify the best feature subset to use with a particular algorithm. We propose to rank the feature subsets using the Hellwig heuristic.
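A small sketch of criteria (1) and (2) in Python, assuming a vector `r_cy` of class-feature correlations and a feature correlation matrix `R` as inputs; variable names are illustrative.

```python
import numpy as np

def hellwig(r_cy, R, subset):
    """Integral capacity of information H(F_l), formula (1): each feature
    j in the subset contributes r_cj^2 divided by the sum of |r_ij| over
    the subset (the i = j term contributes the 1 of Hellwig's formula)."""
    idx = list(subset)
    denom = np.abs(R[np.ix_(idx, idx)]).sum(axis=0)
    return float(np.sum(r_cy[idx] ** 2 / denom))

def hellwig_normalised(r_cy, R, subset):
    """Normalised criterion (2): H'(F_l) = H(F_l) * sqrt(det(R_l))."""
    idx = list(subset)
    return hellwig(r_cy, R, subset) * np.sqrt(np.linalg.det(R[np.ix_(idx, idx)]))
```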
6 Proposed Method
We propose to combine the filter approach (correlation-based feature selection) with the wrapper approach. The algorithm for ensemble building consists of two main steps:

1. Iterate m = 1 to M:
(a) Choose at random half of the data set features (K/2) for the training subset T_m.
(b) Select the features with class-feature correlation |r_j| > 0.5.
(c) Determine the best V subsets F_v (v = 1, ..., V) of features in T_m according to the Hellwig heuristic.
(d) Apply the wrapper to the subsets F_1, ..., F_V and find the subset F* that gives the most accurate classifier.
(e) Grow a tree using the subset F*, resulting in the classifier C_m.
2. Finally, combine the component models C_1, ..., C_M using majority voting:

C^*(x) = \operatorname*{argmax}_{y \in Y} \sum_{m=1}^{M} I(C_m(x) = y).   (3)
Figure 1 shows the proposed hybrid filter-wrapper algorithm for ensemble building.
[Fig. 1 is a flow diagram: Training subset → Feature selection (CFSH) → Feature evaluation (Tree-based model) → Final classifier.]

Fig. 1. The combined filter-wrapper method.
In order to find the best set of features for the component model Cm the steps (a)–(d) in the above algorithm have been applied. The aim of the step (a) is to ensure the diversity among the component classifiers C1 , . . . , CM . The step (b) is to decrease the search space by selecting the features highly correlated with the class. The next step is also to limit the search space by the use of the Hellwig heuristic for preliminary evaluation of the subsets of features. The ranking of feature subsets F1 , . . . , FL is the result of this evaluation. In the step (d) the top V subsets (we usually set the value of the parameter V to 10) of the F1 , . . . , FL are evaluated with the wrapper method. Finally the best feature subset F ∗ selected by the wrapper is used to build the tree-based classifier Cm .
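A condensed Python sketch of steps 1(a)–(e), using scikit-learn decision trees as base learners, cross-validated accuracy as the wrapper score, and the `hellwig_normalised` helper sketched above. The enumeration of candidate subsets up to a fixed size and the helper `top_hellwig_subsets` are illustrative simplifications of the CFSH step, not the author's implementation.

```python
import numpy as np
from itertools import combinations
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def top_hellwig_subsets(X, y, features, V=10, max_size=4):
    """Rank candidate subsets of the pre-filtered features by the
    normalised Hellwig criterion and return the best V of them."""
    r_cy = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    R = np.corrcoef(X, rowvar=False)
    candidates = [s for size in range(1, max_size + 1)
                  for s in combinations(features, size)]
    candidates.sort(key=lambda s: hellwig_normalised(r_cy, R, s), reverse=True)
    return candidates[:V]

def build_ensemble(X, y, M=100, V=10, seed=0):
    rng = np.random.default_rng(seed)
    K = X.shape[1]
    ensemble = []                                      # (feature subset, tree)
    for _ in range(M):
        feats = rng.choice(K, size=K // 2, replace=False)           # step (a)
        r = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in feats])
        feats = feats[r > 0.5]                                      # step (b)
        if len(feats) == 0:
            continue
        best, best_acc = None, -np.inf
        for subset in top_hellwig_subsets(X, y, feats, V):          # steps (c)-(d)
            cols = list(subset)
            acc = cross_val_score(DecisionTreeClassifier(), X[:, cols], y,
                                  cv=5).mean()
            if acc > best_acc:
                best, best_acc = cols, acc
        tree = DecisionTreeClassifier().fit(X[:, best], y)          # step (e)
        ensemble.append((best, tree))
    return ensemble
```

The predictions of the component trees would then be combined by majority voting as in (3), e.g. with a combiner like the `majority_vote` sketch given in Section 2.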
7 Experiments
In order to compare the prediction accuracy of ensembles built with different feature selection methods, we used benchmark datasets from the Machine Learning Repository at UCI (Blake et al., 1998). The results of the comparisons are presented in Table 1. For each dataset, an aggregated model has been built containing M = 100 component trees¹. Classification errors have been estimated on the appropriate test sets. The diversity has been evaluated using Hamann's coefficient (Gatnar, 2005b).
¹ In order to grow trees we have used the Rpart procedure written by Therneau and Atkinson (1997) for the S-PLUS and R environment.

Data set            Single tree (Rpart)   CFSH      New method   Averaged diversity
Anneal                     1.40%           1.22%       1.20%           0.15
Australian credit         14.90%          14.53%      14.10%           0.21
DNA                        6.40%           5.20%       4.51%           0.12
German credit             29.60%          27.33%      26.92%           0.28
Letter                    14.00%          10.83%       5.84%           0.14
Satellite                 13.80%          14.87%      10.32%           0.18
Segmentation               3.70%           3.37%       2.27%           0.13
Sick                       1.30%           2.51%       2.14%           0.20
Soybean                    8.00%           9.34%       6.98%           0.07

Table 1. Classification errors and diversity among ensemble members.

8 Summary

In this paper we have proposed a combined filter-wrapper feature selection method for classifier ensembles that is based on the Hellwig heuristic. The correlation-based feature selection method has guided the search done by the classification algorithm itself. Experiment results showed that the hybrid method gives more accurate aggregated models than those built with other feature selection methods.
References

AMIT, Y. and GEMAN, G. (2001): Multiple Randomized Classifiers: MRCL. Technical Report, Department of Statistics, University of Chicago, Chicago.
BAUER, E. and KOHAVI, R. (1999): An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants. Machine Learning, 36, 105–142.
BLAKE, C., KEOGH, E. and MERZ, C.J. (1998): UCI Repository of Machine Learning Databases. Department of Information and Computer Science, University of California, Irvine.
BREIMAN, L. (1996): Bagging predictors. Machine Learning, 24, 123–140.
BREIMAN, L. (1998): Arcing classifiers. Annals of Statistics, 26, 801–849.
BREIMAN, L. (1999): Using adaptive bagging to debias regressions. Technical Report 547, Department of Statistics, University of California, Berkeley.
BREIMAN, L. (2001): Random Forests. Machine Learning, 45, 5–32.
DIETTERICH, T. and BAKIRI, G. (1995): Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263–286.
FREUND, Y. and SCHAPIRE, R.E. (1997): A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119–139.
GATNAR, E. (2005a): Dimensionality of Random Subspaces. In: C. Weihs and W. Gaul (Eds.): Classification - The Ubiquitous Challenge. Springer, Heidelberg, 129–136.
GATNAR, E. (2005b): A Diversity Measure for Tree-Based Classifier Ensembles. In: D. Baier, R. Decker, and L. Schmidt-Thieme (Eds.): Data Analysis and Decision Support. Springer, Heidelberg, 30–38.
GINSBERG, M.L. (1993): Essentials of Artificial Intelligence. Morgan Kaufmann, San Francisco.
HELLWIG, Z. (1969): On the problem of optimal selection of predictors. Statistical Revue, 3–4 (in Polish).
HO, T.K. (1998): The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 832–844.
KIRA, A. and RENDELL, L. (1992): A practical approach to feature selection. In: D. Sleeman and P. Edwards (Eds.): Proceedings of the 9th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, 249–256.
KOHAVI, R. and JOHN, G.H. (1997): Wrappers for feature subset selection. Artificial Intelligence, 97, 273–324.
KOHAVI, R. and WOLPERT, D.H. (1996): Bias plus variance decomposition for zero-one loss functions. In: L. Saitta (Ed.): Proceedings of the 13th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, 275–283.
PROVOST, F. and BUCHANAN, B. (1995): Inductive Policy: The pragmatics of bias selection. Machine Learning, 20, 35–61.
SINGH, M. and PROVAN, G. (1995): A comparison of induction algorithms for selective and non-selective Bayesian classifiers. Proceedings of the 12th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, 497–505.
THERNEAU, T.M. and ATKINSON, E.J. (1997): An introduction to recursive partitioning using the RPART routines. Mayo Foundation, Rochester.
TUMER, K. and GHOSH, J. (1996): Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognition, 29, 341–348.
WALESIAK, M. (1987): Modified criterion of explanatory variable selection to the linear econometric model. Statistical Revue, 1, 37–43 (in Polish).
WOLPERT, D. (1992): Stacked generalization. Neural Networks, 5, 241–259.
Input Variable Selection in Kernel Fisher Discriminant Analysis

Nelmarie Louw and Sarel J. Steel

Department of Statistics and Actuarial Science, University of Stellenbosch, Private Bag X1, 7602 Matieland, South Africa

Abstract. Variable selection serves a dual purpose in statistical classification problems: it enables one to identify the input variables which separate the groups well, and a classification rule based on these variables frequently has a lower error rate than the rule based on all the input variables. Kernel Fisher discriminant analysis (KFDA) is a recently proposed powerful classification procedure, frequently applied in cases characterized by large numbers of input variables. The important problem of eliminating redundant input variables before implementing KFDA is addressed in this paper. A backward elimination approach is employed, and a criterion which can be used for recursive elimination of input variables is proposed. The merit of the proposal is evaluated in a simulation study and in terms of its performance when applied to two benchmark data sets.
1 Introduction
Kernel based methods are fast becoming standard tools for solving regression and classification problems in statistics. These methods originated mainly in areas such as artificial intelligence, machine learning, and computer science, where they have been widely and successfully applied. Examples of kernel methods are support vector machines, kernel Fisher discriminant analysis (KFDA), kernel principal component analysis, and kernel logistic regression (see Schölkopf and Smola, 2002, for a comprehensive discussion). Our focus in this paper is on KFDA. Although less well known than support vector machines (SVMs), the performance of KFDA in terms of error rate is comparable to that of SVMs (cf. Mika et al., 1999). Kernel methods are frequently applied in problems characterized by many input variables, for example the analysis of DNA microarray data. In such cases identifying and eliminating irrelevant variables is an essential first step in the analysis of the data. It is well known that variable selection in classical statistical procedures such as multiple linear regression and discriminant analysis not only leads to simpler models, but also frequently improves prediction or classification accuracy (cf. Miller, 2002, and McLachlan, 1992). Regarding kernel based classification methods, several procedures have been proposed for input variable selection and dimension reduction in SVMs (cf. Guyon et al., 2002, Rakotomamonjy, 2003, and Weston et al., 2003). Once again the simpler models identified through variable selection generally lead to an improvement in classification accuracy.
In this paper we consider situations where the purpose is to use available sample data to develop a KFDA classification function which can be employed to assign new entities to one of two populations. It will generally be assumed that the sample data consist of measurements on a large number of input variables, and that only a fraction of these are relevant in the sense that they separate the populations under consideration. Within this context we investigate several aspects of variable selection: we highlight the detrimental effect which the presence of irrelevant variables may have on the error rate behaviour of the KFDA classification function, thereby clearly demonstrating the need for variable selection. A criterion which may be used for stepwise elimination of irrelevant variables is therefore introduced. The extent to which this criterion succeeds in identifying the relevant variables, and the corresponding improvement in the error rate of the KFDA classification function based on a reduced number of variables, are studied through simulation and by applying the proposal to two benchmark data sets. The paper is organized as follows. Section 2 introduces the required notation and provides technical details on KFDA. Recursive feature elimination (RFE) in KFDA is discussed in Section 3. We introduce a criterion which can be used for RFE in KFDA, and describe an algorithm for implementing KFDA-RFE. A Monte Carlo simulation study that was conducted to evaluate the proposed RFE procedure is described in Section 4. In this section we also discuss the application of KFDA-RFE to two practical data sets. Concluding remarks and open problems appear in Section 5.
2 Notation and Technical Preliminaries
Consider the following generic two-group classification problem. We observe a binary response variable Y ∈ {−1, +1}, together with a (large) number of classification or input variables X_1, X_2, ..., X_p. These variables are observed for n = n_1 + n_2 sample cases, with n_1 cases from population 1 and n_2 cases from population 2. The resulting training data set is therefore {(x_i, y_i), i = 1, 2, ..., n}. Here, x_i is a p-component vector representing the values of X_1, X_2, ..., X_p for case i in the sample. Our purpose is to use the training data to determine a rule that can be used to assign a new case, with the observed values of the predictor variables in a vector x, to one of the two classes. The KFDA classifier is given by sign{b + Σ_{i=1}^{n} α_i K(x_i, x)}. Here, b and α_1, α_2, ..., α_n are quantities determined by applying the KFDA algorithm to the training data, while K(x_i, x) is a kernel function evaluated at (x_i, x). Two examples of popular kernel functions are the polynomial kernel, K(x_1, x_2) = ⟨x_1, x_2⟩^d, where d is an integer, usually 2 or 3, and the Gaussian kernel, K(x_1, x_2) = exp(−γ‖x_1 − x_2‖²), where γ is a so-called kernel hyperparameter that has to be specified by the user or estimated from the data. We restrict attention to the Gaussian kernel in the remainder of the paper. Empirical evidence suggests that γ = 1/p generally works well and we will use
this throughout the paper. Evaluating K(x_i, x_j) for i, j = 1, 2, ..., n, we are able to construct the so-called Gram matrix K, with ij-th entry K(x_i, x_j). The constants α_i are determined as follows. Let α be an n-vector with elements α_1, α_2, ..., α_n. The α-vector used in KFDA maximises the Rayleigh coefficient

R(\alpha) = \frac{\alpha' M \alpha}{\alpha' N \alpha}.   (1)

In (1), M = (m_1 − m_2)(m_1 − m_2)', with the n elements of m_1 given by (1/n_1) Σ_{j=1}^{n_1} K(x_i, x_j), i = 1, 2, ..., n, and similarly for m_2. Also, N = KK' − n_1 m_1 m_1' − n_2 m_2 m_2'. The analogy with classical linear discriminant analysis is clear: we may interpret M as the between-group scatter matrix and N as the within-group scatter matrix, in both cases taking into account that we are effectively working in the feature space induced by the kernel function. For a more detailed discussion of KFDA, see for example Mika et al. (1999) and Louw and Steel (2005). It is well known that N^{-1}(m_1 − m_2) will maximize (1). There is however one problem: the matrix N is singular and consequently we cannot find α by simply calculating N^{-1}(m_1 − m_2). Mika et al. (1999) propose and motivate the use of regularization to overcome this difficulty. In the present context regularization entails replacing N by a matrix N_λ = N + λI for some (small) positive scalar λ. This yields a solution N_λ^{-1}(m_1 − m_2), depending on λ, which can be used in the KFDA classifier. Obviously the hyperparameter λ has to be specified, and this is typically done by performing a crossvalidation search along a suitable grid of potential λ-values. The intercept b can be specified in different ways. A popular choice, which we will also use, is b = 0.5(m_2' N_λ^{-1} m_2 − m_1' N_λ^{-1} m_1) + log(n_1/n_2), which is similar to the intercept used in linear discriminant analysis.
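A Python sketch of the quantities defined above: the Gaussian Gram matrix, the vectors m_1 and m_2, the regularised solution α = N_λ^{-1}(m_1 − m_2) and the intercept b. It assumes the training cases are ordered with the n_1 cases from population 1 first; this ordering and the helper names are assumptions of the sketch.

```python
import numpy as np

def kfda_train(X1, X2, lam=1e-3, gamma=None):
    """Fit the regularised KFDA solution (alpha, b) with a Gaussian kernel."""
    X = np.vstack([X1, X2])
    n1, n2 = len(X1), len(X2)
    n = n1 + n2
    if gamma is None:
        gamma = 1.0 / X.shape[1]                 # gamma = 1/p
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)                      # Gram matrix
    m1 = K[:, :n1].mean(axis=1)                  # (1/n1) sum_j K(x_i, x_j), j in group 1
    m2 = K[:, n1:].mean(axis=1)
    N_lam = K @ K.T - n1 * np.outer(m1, m1) - n2 * np.outer(m2, m2) + lam * np.eye(n)
    alpha = np.linalg.solve(N_lam, m1 - m2)
    b = 0.5 * (m2 @ np.linalg.solve(N_lam, m2)
               - m1 @ np.linalg.solve(N_lam, m1)) + np.log(n1 / n2)
    return alpha, b, gamma

def kfda_predict(X_new, X_train, alpha, b, gamma):
    """Classify rows of X_new with sign(b + sum_i alpha_i K(x_i, x))."""
    sq = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return np.sign(b + np.exp(-gamma * sq) @ alpha)
```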
3 Recursive Feature Elimination in KFDA
Recursive feature elimination (RFE) was proposed by Guyon et al. (2002) for variable selection in an SVM context. It was also investigated by Rakotomamonjy (2003) and found to perform well on several simulated and benchmark data sets. RFE is essentially a backward stepwise elimination procedure, where the variable to be eliminated at a specific step is identified by optimizing a suitable criterion. Guyon et al. (2002) and Rakotomamonjy (2003) studied several criteria suitable for this purpose in an SVM context. In this paper we propose RFE for variable selection in KFDA. An important aspect is to define the criterion which is optimized at each step to identify the variable to be deleted. In this paper we propose the Rayleigh coefficient given in (1) as the criterion. We start with all p available input variables in the model, and perform KFDA to obtain a solution vector α. We then omit the variables in turn, and calculate the value of the Rayleigh coefficient after each omission. Upon omission of variable i, the coefficient is
R^{(i)}(\alpha^{(i)}) = \frac{\alpha^{(i)\,\prime} M^{(i)} \alpha^{(i)}}{\alpha^{(i)\,\prime} N^{(i)} \alpha^{(i)}}.   (2)

This implies that the α-vector has to be recalculated following the omission of variable i. Since this would be very computationally expensive, we make the assumption that the components of the α-vector do not change significantly upon omission of a single variable, and we only recalculate M^{(i)} and N^{(i)} (which implies recalculation of the Gram matrix after omission of variable i). This is similar to the assumption regarding the α-vector made by Guyon et al. (2002) and Rakotomamonjy (2003) when applying RFE in SVMs. Empirical evidence suggests that making this simplifying assumption does not substantially affect the results of KFDA-RFE. To determine which variable should be eliminated at each step, we therefore calculate the criterion
R^{(i)}(\alpha) = \frac{\alpha' M^{(i)} \alpha}{\alpha' N^{(i)} \alpha}.   (3)
The variable whose omission results in the maximum value of the criterion is omitted. This procedure is repeated in a recursive manner until a subset of the desired size, m, is obtained. In many applications it is unclear how many variables should be retained, i.e. what the value of m should be. With this in mind, the above procedure can also be extended to the scenario where m is assumed unknown. We simply repeat the elimination process until a single variable remains, thereby obtaining nested subsets of sizes p − 1, p − 2, ..., 1. The optimal number of variables to retain can then be estimated by, for example, minimizing crossvalidation estimates of the error rates of the p nested subsets of variables. If p is very large, the proposed procedure can easily be adapted to delete more than one variable at each step (the r variables yielding the largest values of the criterion are omitted at each step). The number of variables omitted at each stage can, for example, be large in the initial stages of the process and smaller in later stages. This is similar to a suggestion made in an SVM context by Guyon et al. (2002) and Rakotomamonjy (2003), where gene selection is considered.
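A sketch of a single backward-elimination step of KFDA-RFE using criterion (3): α is kept fixed, the Gram matrix and hence M^{(i)} and N^{(i)} are recomputed with variable i removed, and the variable whose omission maximises the criterion is dropped. Function and variable names are illustrative.

```python
import numpy as np

def rayleigh(alpha, K, n1, n2):
    """R(alpha) = (alpha' M alpha) / (alpha' N alpha) for Gram matrix K."""
    m1 = K[:, :n1].mean(axis=1)
    m2 = K[:, n1:].mean(axis=1)
    num = (alpha @ (m1 - m2)) ** 2           # alpha' (m1-m2)(m1-m2)' alpha
    N = K @ K.T - n1 * np.outer(m1, m1) - n2 * np.outer(m2, m2)
    return num / (alpha @ N @ alpha)

def rfe_step(X1, X2, active, alpha, gamma):
    """Remove from 'active' the single variable whose omission
    maximises criterion (3), keeping alpha fixed."""
    X = np.vstack([X1, X2])
    n1, n2 = len(X1), len(X2)
    best_var, best_val = None, -np.inf
    for i in active:
        keep = [j for j in active if j != i]
        sq = ((X[:, None, keep] - X[None, :, keep]) ** 2).sum(-1)
        val = rayleigh(alpha, np.exp(-gamma * sq), n1, n2)
        if val > best_val:
            best_var, best_val = i, val
    return [j for j in active if j != best_var]
```

To obtain the nested subsets, one would retrain α on the reduced variable set (e.g. with the `kfda_train` sketch above) after each elimination and record a crossvalidation error estimate for every subset size.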
4 Evaluating the Performance of KFDA-RFE
To evaluate the performance of the proposed KFDA-RFE procedure we conducted an extensive Monte Carlo simulation study and applied the method to several data sets. We report a representative selection of the results. For the simulation study, we considered data from normal as well as lognormal populations. We investigated different sample sizes, correlation structures and numbers of relevant and noise variables. Two types of differences between populations were studied, viz. differences between population means
(with identical covariance structures in both populations) and differences between the covariance matrices (with identical means in both populations). We report on four of these cases in Table 1. In case 1, the 2 relevant variables were generated from a normal distribution with all components of the mean vectors equal to 0 in both groups. A variance of 1 was used for all relevant variables in group 1, and in group 2 the variance of the relevant variables was equal to 10. This represents a case where the two populations differ w.r.t. spread. The correlation between the relevant variables was 0.5 in both groups. The 48 noise variables for both groups were generated from a normal distribution with mean 0 and variance 20, and were uncorrelated. In case 2, the 5 relevant variables were generated from a normal distribution with all components of the mean vector equal to 0 in group 1 and equal to 1 in group 2. Variances of 1 were used for all relevant variables in both groups, and the variables were uncorrelated. This represents a case where the two populations differ w.r.t. location. The 95 noise variables for both groups were generated from a normal distribution with mean 0 and variance 20, and were uncorrelated. In case 3, the 5 relevant variables were generated from a lognormal distribution with mean 0 in both groups. The variances of the relevant variables were 1 in group 1 and 20 in group 2. This again represents a case where the two populations differ w.r.t. spread. The correlation between the relevant variables was equal to 0.5. The 95 noise variables were uncorrelated and were generated from a lognormal distribution with mean 0 and variance 1. In case 4, the 2 relevant variables were generated from a lognormal distribution with mean 0 in group 1 and 1 in group 2. The variances of the relevant variables were 1 in both groups. This again represents a case where the two populations differ w.r.t. location. The relevant variables were uncorrelated in both groups. The 48 noise variables were uncorrelated and were generated from a lognormal distribution with mean 0 and variance 20.

In each case training samples of different sizes were generated from the appropriate underlying distribution, and RFE using the Rayleigh coefficient as selection criterion was performed to identify the best 2 (in Cases 1 and 4) or 5 (in Cases 2 and 3) variables. The KFDA classifier based on the selected variables was constructed and used to classify a large (n_1 = n_2 = 1000) test data set generated independently from the same underlying distribution. The KFDA classifier containing all variables, as well as the classifier containing only the relevant variables (referred to as the oracle), were also used to classify the test set. This was repeated 100 times, and the mean error rates were calculated for each of the three classifiers. These are reported in Table 1. We use the following coding for the different classifiers: N - no selection is done; R - the Rayleigh coefficient is used in RFE; O - oracle (only the relevant variables are used).

What conclusions can be drawn from these results? Firstly, it is important to take note of the detrimental effect of irrelevant variables on the accuracy of the KFDA classifier. We see this by comparing the error rates achieved by
                          Training sample size
            10      20      30      40      50      100
Case 1:  N  0.480   0.470   0.463   0.456   0.451   0.437
         R  0.380   0.251   0.253   0.230   0.228   0.209
         O  0.214   0.215   0.213   0.213   0.209   0.206
Case 2:  N  0.382   0.322   0.292   0.259   0.244   0.198
         R  0.390   0.296   0.208   0.180   0.163   0.138
         O  0.158   0.144   0.143   0.140   0.138   0.136
Case 3:  N  0.442   0.383   0.343   0.316   0.299   0.244
         R  0.445   0.341   0.231   0.179   0.158   0.142
         O  0.163   0.151   0.146   0.138   0.136   0.136
Case 4:  N  0.471   0.415   0.366   0.331   0.307   0.243
         R  0.430   0.144   0.103   0.102   0.096   0.092
         O  0.115   0.099   0.096   0.097   0.096   0.092

Table 1. Means of test error rates
the N classifier, based on all variables, to those of the O classifier, where only the relevant variables are used. It is clear that in all cases the error rates are markedly increased by the inclusion of irrelevant variables in the classifier. This clearly indicates that attempting to eliminate irrelevant variables before constructing the classification rule is a worthwhile pursuit. If we compare the post-selection (R) error rates to the N error rates, it is clear that a lower error rate is almost always achieved by the post-selection classifier. As the sample size increases, the difference between the R and N error rates increases, and the R error rates get closer to (and are sometimes equal to, at sample size 100) the O error rates. This indicates that RFE succeeds in identifying the relevant input variables.

In addition to the Monte Carlo simulation study, we also applied RFE to the heart disease data (p = 13 variables and n = 240 data cases) and the breast cancer data (p = 9 variables and n = 277 data cases), both available in the form of 100 splits into training and test sets at http://ida.first.gmd.de/~raetsch/data/benchmarks.htm. For both data sets KFDA-RFE was applied to each of the training sets to find input variable subsets of sizes p − 1, p − 2, ..., 1. The KFDA classifier based on each subset was obtained, and the test error estimated by classifying the test cases. These error rate estimates were then averaged over the 100 splits. The resulting error rates for the heart disease data appear in Figure 1, and for the breast cancer data in Figure 2. For comparison purposes the error rates reported in Rakotomamonjy (2003) for SVM-RFE using the margin as criterion are also plotted in both graphs. For the heart disease data, the average estimated test error rate using all 13 variables is 0.16 for both the KFDA and the SVM classifier. The lowest estimated test error for KFDA of 0.155 is achieved using a subset of size 10. For the SVM the lowest error rate is also 0.155, also for a subset of size 10.
[Fig. 1 plots the average test error rate against the number of variables for KFDA-RFE and SVM-RFE on the heart disease data.]

Fig. 1. Error rates for heart disease data

[Fig. 2 plots the average test error rate against the number of variables for KFDA-RFE and SVM-RFE on the breast cancer data.]

Fig. 2. Error rates for breast cancer data
For KFDA, using a subset of as few as 7 selected variables leads to a very slight increase in the error rate to 0.156. For the breast cancer data, the KFDA error rate using all 9 variables is 0.266, while it is 0.26 for the corresponding SVM. The lowest KFDA error rate of 0.259 is now obtained using only 3 variables, while for the SVM the lowest error rate is that of the full model. (Rakotomamonjy, 2003, did however achieve lower post selection error rates using other criteria). For both data sets it seems that RFE variable selection slightly improves the performance of the KFDA classifier. The main advantage in these examples is the saving in the number of variables used in the classifier.
5 Conclusions and Open Problems
The results of the simulation study and practical applications indicate that KFDA-RFE succeeds in eliminating irrelevant variables and leads to reduced error rates. As such it can be recommended to a practitioner confronted with a classification problem containing many input variables. Several open problems remain. In our analysis we used γ = 1/p in the Gaussian kernel and performed a limited crossvalidation search using all input variables to determine the value of λ. Procedures which take into account possible interaction between the number of variables and the hyperparameter values should be investigated. Also, although crossvalidation seems to be a viable option, finding a value of m (the number of variables to be retained) from the data remains a difficult and important problem.
References

GUYON, I., WESTON, J., BARNHILL, S. and VAPNIK, V. (2002): Gene selection for cancer classification using support vector machines. Machine Learning, 46, 389–422.
LOUW, N. and STEEL, S.J. (2005): A review of kernel Fisher discriminant analysis for statistical classification. The South African Statistical Journal, 39, 1–21.
MCLACHLAN, G.J. (1992): Discriminant analysis and statistical pattern recognition. Wiley, New York.
MIKA, S., RÄTSCH, G., WESTON, J., SCHÖLKOPF, B. and MÜLLER, K.-R. (1999): Fisher discriminant analysis with kernels. In: Y.-H. Hu, J. Larsen, E. Wilson and S. Douglas (Eds.): Neural Networks for Signal Processing IX. IEEE Press, New York, 41–48.
MILLER, A.J. (2002): Subset selection in regression. Chapman and Hall, London.
RAKOTOMAMONJY, A. (2003): Variable selection using SVM based criteria. Journal of Machine Learning Research, 3, 1357–1370.
RÄTSCH, G., ONODA, T. and MÜLLER, K.-R. (2001): Soft margins for AdaBoost. Machine Learning, 42, 287–320.
SCHÖLKOPF, B. and SMOLA, A.J. (2002): Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, London.
WESTON, J., ELISSEEFF, A., SCHÖLKOPF, B. and TIPPING, M. (2003): Use of the Zero Norm with Linear Models and Kernel Methods. Journal of Machine Learning Research, 3, 1439–1461.
The Wavelet Packet Based Cepstral Features for Open Set Speaker Classification in Marathi

Hemant A. Patil¹, P. K. Dutta², and T. K. Basu²

¹ Dept. of Electronics and Instrumentation Engineering, Dr. B.C. Roy Engineering College, Durgapur, West Bengal, India, hemant [email protected]
² Dept. of Electrical Engineering, IIT Kharagpur, West Bengal, India, {pkd|tkb}@ee.iitkgp.ernet.in
Abstract. In this paper, a new method of feature extraction based on a perceptually meaningful subband decomposition of the speech signal is described. Dialectal zone based speaker classification in the Marathi language has been attempted in the open set mode using a polynomial classifier. The method consists of dividing the speech signal into nonuniform subbands on an approximate Mel scale using an admissible wavelet packet filterbank, and modeling each dialectal zone with 2nd and 3rd order polynomial expansions of the feature vector.
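Since the filterbank design itself is only described later in the paper, the following is merely a generic Python sketch of wavelet-packet subband cepstral features (log subband energies followed by a DCT) for a single speech frame; it uses a uniform `pywt` wavelet packet tree rather than the authors' admissible Mel-like filterbank, and the wavelet, depth and number of coefficients are illustrative choices.

```python
import numpy as np
import pywt
from scipy.fftpack import dct

def wp_cepstral_features(frame, wavelet='db4', level=4, n_ceps=12):
    """Generic wavelet-packet cepstral coefficients for one speech frame:
    decompose into 2**level subbands, take log subband energies, apply a DCT."""
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet, maxlevel=level)
    nodes = wp.get_level(level, order='freq')     # frequency-ordered subbands
    energies = np.array([np.sum(np.square(node.data)) for node in nodes])
    log_energy = np.log(energies + 1e-12)         # avoid log(0)
    return dct(log_energy, norm='ortho')[:n_ceps]
```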
1
Introduction
The problem of speaker classification (SC) can be defined in different ways [6]. We define SC as grouping of the speakers residing in a particular dialectal zone based on their similar acoustical characteristics of speech. Such problem may be useful in forensic science applications such as in identifying a criminal’s place of origin or in anthropological study of social ethnic group. The feasibility of solution to the problem lies on the fundamental fact that the parts which principally determine voiceprint (we refer voiceprint as the model describing similar acoustical characteristics of speech from a dialectal zone) are the vocal cavities and articulators. A still greater factor in determining the voice uniqueness is the manner in which the articulators are manipulated during speech. The articulators include the lips, teeth, tongue, soft palate, and jaw muscles, and the controlled dynamic interplay of these results in intelligible speech which is not spontaneously acquired by infants. It is a studied process of the imitation of those who are successfully communicating. The desire to communicate causes the infant to accomplish intelligible speech by successive steps of trial and error [7]. So our claim is that in this process of imitation, speakers residing in a particular dialectal zone will have similar dynamic use-patterns for their articulators which will be reflected in their spectrograms. Thus, if we bring an infant from zone Z1 and bring him up in zone Z2 , then at an adult stage he will have articulators use pattern similar to that of zone Z2 but not the zone Z1 . Fig. 1 shows speech corresponding to the word, “Ganpati ”, (chosen because it has nasal-to-vowel coarticulation
Wavelet Packet Based Cepstral Features (a) Konkan zone
(b) Marathwada zone
(c) Vidharbh zone
0
Amplitude
0.2 Amplitude
Amplitude
0.5
0.1 0
−0.1 −0.5 0
5000 10000 Sample Index n
−0.2 0
0
0
2000 4000 Time
0
5000 10000 Sample Index n
1
0.5
0
0.1
−0.2 0
5000 10000 Sample Index n
Frequency
Frequency
Frequency
0.5
0.2
−0.1
1
1
135
0
2000 4000 Time
0.5
0
0
2000 4000 Time
Fig. 1. Speech signal and its spectrogram corresponding to the Marathi word, “Ganpati”, spoken by rural males of (a) Konkan, (b) Marathwada, and (c) Vidharbh zone having age 51, 35, and 34, respectively.
and hence it is highly speaker and possibly zone specific) spoken by three rural males from each of Konkan, Marathwada and Vidharbh zones. It is a very commonly used word. Subjects were asked to read the word, “Ganpati ”, with ten repetitions and third repetition was selected as the test sample. It is clear that there are distinct dialectal differences in speech spectrograms of males from different zones. SC task can be performed in closed set or open set mode depending upon whether training and testing classes are same or different. In this paper, the problem of open set speaker classification is addressed in text-independent mode on the database prepared in realistic noisy environments from four distinct dialectal zones of Maharashtra viz. Konkan, Vidharbh, Marathwada and Khandesh in an Indian language viz. Marathi.
2
Data Collection and Corpus Design
Database of 168 speakers (42 speakers from each zone with 21 speakers for training and remaining 21 for testing; recorded with different microphones) is created from the four distinct dialectal zones of Maharashtra with the help of a voice activated tape recorder (Sanyo Model M-1110C & Aiwa JS299)
136
H.A. Patil et al.
with microphone input, a close talking microphone (viz. Frontech and Intex). The data is recorded on the Sony high fidelity voice and music recording cassettes (C-90HFB). A list consisting of five questions, isolated words, digits, combination-lock phrases, read sentences and a contextual speech of considerable duration was prepared. The contextual speech consisted of description of nature or memorable events etc. of community or family life of the speaker. The data was recorded with 10 repetitions except for the contextual speech. During recording of the contextual speech, the interviewer asked some questions to speaker in order to motivate him/her to speak on his/her chosen topic. This also helps the speaker to overcome the initial nervousness and come to his/her natural mode so that the acoustic characteristics of his/her speech are tracked precisely. Once the magnetic tape was played into the computer, the speaker’s voice was played again to check for wrong editing. Silence removal and amplitude normalization was done through software. Finally, corpus is designed into training segments of 30s, 60s, 90s and 120s durations and testing segments of 1s, 3s, 5s, 7s, 10s, 12s and 15s durations.
3
SBCC (Subband Based Cepstral Coefficients)
Even though state-of-the-art feature set viz. Mel Frequency Cepstral Coefficients (MFCC) is extensively used for speaker recognition, it has got some drawbacks and hence this motivates one to investigate other feature sets [4–5], [9–10]: 1. In MFCC, the filterbank is implemented with triangular filters whose frequency response is not smooth and hence may not be suitable for noisy speech data. 2. The implementation of triangular filterbank requires critical band windowing (in frequency domain) or critical band filter banks (in time domain) which are computationally expensive as it does not involve any multirate signal processing. 3. For computing the spectrum, Discrete Fourier Transform (DFT) whose resolution is constant in time and frequency is used in MFCC. The local changes in time frequency plane will therefore not be highlighted very much in MFCC; this in turn will give less inter-zonal variability. Thus, speaker classification may not be satisfactory. 3.1
Wavelet Packet Transform
Wavelet packets (WP) were introduced by Coifmann, Meyer and Wickerhauser [2] by generalizing the link between multiresolution approximations and wavelet bases. A signal space Vj of a multiresolution approximation is decomposed in a lower resolution space Vj+1 plus a detail space Wj+1 . This is achieved by dividing the orthogonal basis {φj (t−2j n)}n∈Z of Vj into two new
Wavelet Packet Based Cepstral Features
137
orthogonal bases {φj+1 (t − 2j+1 n)}n∈Z of Vj+1 and {ψj+1 (t − 2j+1 n)}n∈Z of Wj+1 where φ(t) and ψ(t) are scaling and wavelet function, respectively. The decomposition for WP can be implemented by using a pair of Quadrature Mirror Filter (QMF) filter bank which divides the frequency band into equal halves. Due to the decomposition of the approximation space (low frequency band) as well as the detail space (high frequency band), the frequency division of speech on both lower and higher side takes place. This recursive splitting of vector spaces is represented by an admissible WP binary tree. Let each subspace in the tree be represented by its depth j and number of subspaces p below it. The two wavelet packet orthogonal bases at a parent node (j, p) are defined by [8], 2p ψj+1 (t)
=
n=+∞
h(n)ψjp (t
− 2 n) and j
n=−∞
2p+1 ψj+1 (t)
'
=
n=+∞
g(n)ψjp (t − 2j n)
n=−∞
( 2p As {ψjp (t − 2j n)}n∈Z is orthonormal, h(n) = ψj+1 (v), ψjp (v − 2j n) and ' ( 2p+1 g(n) = ψj+1 (v), ψjp (v − 2j n) . The implementation of SBCC is similar to that of MFCC [3], i.e., we pass the speech signal through the process of frame blocking, Hamming windowing, pre-emphasis and decomposing the speech into admissible wavelet packet structure. The tree which has been selected in this paper is given in [9-10]. Then finding the normalized filterbank energy (to have equal emphasis in each subband) and finally decorrelate the logfilterbank energy using DCT. # $ L k(l − 0.5) SBCC(k) = log[S(l)] cos π , k = 1, 2, . . . , Nc , L l=1
where L = number of subbands in WP tree, Nc = number of SBCC, SBCC(k) = k-th SBCC, S(l) = normalized filter bank energy, i.e.S(l) =
∞ m=l
W x(l, m)2 Nl
Nl = number of wavelet coefficients in i-th subband. For implementing Wavelet Packet Cepstral Coefficients (WPCC) implementation, wavelet transform of log-filterbank energy is taken (rather than DCT as in case of SBCC) to decorrelate the subband energies (as shown in Figure 2).
4
Polynomial Classifier
In this paper, polynomial classifier of 2nd and 3rd order approximation is used as the basis for all the experiments. Due to Weierstrass-Stone approximation
138
H.A. Patil et al.
Fig. 2. Functional Block diagram for SBCC and WPCC implementation
Fig. 3. The Modified Classifier Structure
theorem, polynomial classifiers are universal approximators to the optimal Bayes classifier [1]. The basic structure of the classifier is shown in Fig. 3. They are processed by the polynomial discriminant function. Every speaker i has wi as his/her model, and the output of a discriminant function is averaged over time resulting in a score for every [1]. The score is then given by Si =
N 1 w p(xi ) N i=1
where xi = i-th input test feature vector, w = speaker model, and p(x) = vector of polynomial basis terms of the input test feature vector. Training polynomial classifier is accomplished by obtaining the optimum speaker model for each speaker using discriminatively trained classifier with mean-squared error (MSE) criterion, i.e., for speaker’s feature vector, an output of one is desired, whereas for impostor data an output of zero is desired. For the two-class problem, let wspk be the optimum speaker model, ω class label, and y(ω) the ideal output, i.e., y(spk) = 1 and y(imp) = 0. The resulting problem using MSE is ) 2 * wspk = arg min E w p(x) − y(ω) (1) w
where E{.} means expectation over x and ω. This can be approximated using training feature set as Nimp w p(xi ) − 12 + w p(y i )2
Nspk
wspk = arg min w
i=1
i=1
(2)
Wavelet Packet Based Cepstral Features TR \ FS MFCC SBCC WPCC
30s 64.28 62.07 65.98
60s 63.09 62.75 65.47
90s 63.94 59.35 66.66
TR \ FS MFCC SBCC WPCC
120s 62.92 58.67 66.66
Table 1. Average Success rates (%) for 2nd order approximation (Open Set SC-Marathi)
30s 61.22 61.22 66.32
60s 57.99 60.54 65.13
90s 61.90 61.39 67.51
139
120s 63.09 61.39 67.85
Table 2. Average Success rates (%) for 3rd order approximation (Open Set SC-Marathi)
where xi , . . . , xNspk are speaker’s training data and y i , . . . , y Nimp is the impostor data. This training algorithm can be expressed in matrix form. Let Mspk = [p(x1 ), p(x2 ), . . . , p(xNspk )] and similar matrix for Mimp . Also let M = [Mspk Mimp ] and thus the training problem in eq. (2) is reduced to the well-known linear approximation problem in normed space as wspk = arg min ||Mw − o||2 w
where o consisting of Nspk ones followed by Nimp zeros. This problem can be solved using method of normal equations M Mwspk = M o which after rearranging gives (M spk Mspk + Mimp Mimp )w spk = Mspk 1
(3)
(3) where 1 is the vector of all ones. Now we define Rspk = M spk Mspk and define Rimp similarly, then eq. (3) can be written as (Rspk + Rimp )wspk = M spk 1.
(4)
Zspk Also define R = Rspk + Rimp and Ai = M i=1 Mspki 1, spki 1, A = −1 wZspk = R A, where wZspk is the optimum voiceprint for a dialectal zone and Zspk is the number of speakers in each dialectal zone (21 in present problem). The details of training algorithm for multi-class problem, polynomial basis determination and mapping algorithm based semi-group isomorphism property of monomials for computing unique terms in Rspk (and hence R) are given in [1].
5
Experimental Results
Feature analysis was performed using a 23.2ms duration frame with an overlap of 50%. Hamming window was applied to each frame and subsequently, each frame was pre-emphasized with the filter (1 − 0.97z −1 ). Pre-emphasis helps us to concentrate on articulator dynamics in speech frame and it is
140
H.A. Patil et al.
Id.\Act. KN MW V K
KN 0 0 0 0
MW 0 61.905 9.5238 0
V K 9.5238 90.476 38.095 0 90.476 0 4.7619 95.238
Table 3. Confusion Matrix for MFCC with 2nd Order Approximation for 4 zones
Id.\Act. KN MW V K
KN 0 0 0 0
MW V 0 0 33.333 66.667 0 100 0 0
K 100 0 0 100
Table 4. Confusion Matrix for SBCC (db6) IV with 2nd Order Approximation for 4 zones
Id.\Act. KN MW V K KN 4.7619 0 0 95.238 MW 0 90.476 9.5238 0 V 0 19.048 76.19 4.7619 K 0 0 0 100
Table 5. Confusion Matrix for WPCC (db6) with 2nd Order Approximation for 4 zones
hence useful for tracking the manner in which the speaker pronounces a word. During training phase, 12 MFCC, 12 SBCC and 12 WPCC feature vectors were extracted per frame from the training speech as per the details discussed in Section 3. SBCC were extracted with daubechies wavelets of 6 vanishing moments (db6). The results are shown as average success rates over testing speech durations viz. 1s, 3s, 5s, 7s, 10s, 12s, and 15s in Tables 1 and 2 for different training (TR) durations. Tables 3 and 5 show confusion matrices (diagonal elements indicate % correct identification in a particular dialectal zone and off-diagonal elements show the misclassifications) for Konkan (KN), Marathwada (MW), Vidharbh (V) and Khandesh (K). In Tables 3–5, ACT and IDENT represents actual dialectal zone and identified zone, respectively. Some of the observations from the results are as follows: 1. Average success rates improve slightly for 3rd order approximation as compared to 2nd order approximation (Tables 1–2). 2. For 2nd order approximation, WPCC performs better than SBCC and WPCC in majority of the cases of training speech durations whereas MFCC performs better than SBCC. 3. For 3rd order approximation, SBCC and WPCC both perform better than MFCC in majority of the cases of training speech durations whereas MFCC performs better than SBCC. 4. WPCC showed better class discrimination power as compared to MFCC and SBCC in majority of the cases of speaker classification. 5. The Konkan dialect has been misclassified as Khandesh by a large degree.
Wavelet Packet Based Cepstral Features
6
141
Summary and Conclusions
In this paper, a novel approach is made for speaker classification in Marathi in open set by exploiting wavelet based features and polynomial classier. In the present study, classifier has been used for speaker classification task based on dialectal zones in open set mode. To the authors’ knowledge this is the first study of its kind in Indian languages. Low level of success rates are probably due to 1. The use of different microphones for training and testing in realistic situations. 2. Loss of individual’s identity in an averaged characteristics of feature set. Acknowledgments The authors would like to thank the authorities of EU-India CultureTech Project for extending their support to carry out this research work. They would also like to thank the GfKl 2005 authorities.
References 1. CAMPBELL, W.M., ASSALEH. K.T., and BROUN, C.C. (2002): Speaker Recognition with Polynomial Classifiers. IEEE Trans. on Speech and Audio Processing 10(4):205–212. 2. COIFMAN, R.R., MEYER, Y., and WICKERHAUSER, M.V. (1992): Wavelet Analysis and Signal Processing. In: B. Ruskai et al. (eds.) Wavelets and Applications. Boston, Jones and Bartlett, pp. 153–178 3. DAVIS, S.B., and MERMELSTEIN, P. (1980): Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Trans. Acoust., Speech and Signal Processing, ASSP-28(4). 4. ERZIN, E., CETIN, A.E., and YARDIMCI, Y. (1995): Subband Analysis for Robust Speech Recognition in the Presence of Car Noise. Proc. Int. Conf. on Acoust., Speech and Signal Processing (ICASSP95), 1:417–420. 5. FAROOQ, O., and DATTA, S. (2001): Mel Filter-like Admissible Wavelet Packet Structure for Speech Recognition. IEEE Signal Processing Letters 8(7). 6. JIN, H., KUBALA, F., and SCHWARTZ, R. (1997): Automatic Speaker Clustering. Proc. Speech Recognition Workshop, 108–111. 7. KERSTA, L.G. (1962): Voiceprint Identification. Nature 196(4861):1253–1257. 8. MALLAT, S. (1999): A Wavelet Tour of Signal Processing 2nd Edition. Academic Press. 9. PATIL, H.A., and Basu, T.K. (2004): Comparison of Subband Cepstrum and Mel Cepstrum for Open Set Speaker Classification. IEEE INDICON, 35–40. IIT Kharagpur, India 10. SARIKAYA, R., PELLON, B.L., and HANSEN, J.H.L. (1998): Wavelet Packet Transforms Features with Application to Speaker Identification. IEEE Nordic Signal Processing Symposium, 81–84.
A New Effective Algorithm for Stepwise Principle Components Selection in Discriminant Analysis Ekaterina Serikova and Eugene Zhuk Belarus State University, 4 Fr. Skariny av., 220050 Minsk, Belarus Abstract. The problem of reducing the dimensionality of multivariate Gaussian observations is considered. The efficiency of discriminant analysis procedure based on well-known method of principle components selection is analytically investigated. The average decrease of interclass distances square is presented as a new criterion of feature selection directly connected with the classification error probability. New stepwise discriminant analysis procedure in the space of principal components based on this criterion is proposed and its efficiency is experimentally and analytically investigated.
1
Introduction: Mathematical Model
Let a sample of n jointly independent random observations x1 , ..., xn from L ≥ 2 classes {Ω1 , ..., ΩL } be registered in the feature space RN . Let dot ∈ S = {1, ..., L} be an unknown random class index to which xt belongs: P{dot = i} = πi > 0,
i∈S
(π1 + ... + πL = 1),
(1)
where {πi }i∈S are prior class probabilities (FUKUNAGA (1990)). Under fixed dot = i (i ∈ S) the observation xt ∈ RN is described by the conditional probability density function: pi (x) ≥ 0, x ∈ RN : p (x)dx = 1, i ∈ S. i N R The classes {Ωi }i∈S are completely determined by the introduced characteristics {πi , pi (·)}i∈S . Often in practice these characteristics are unknown, however the vector of true classification indices Do = (do1 , ..., don )T ∈ S n for the sample X = {x1 , ..., xn } (so-called training sample) is observed. (”T ” is the transposition symbol). The discriminant analysis problem consists in construction of decision rule (DR) d = d(x; X, Do ) ∈ S for classifying a random observation x ∈ RN with true class index do ∈ S. However, often in practice the initial feature space is redundant. It means that its dimensionality N is too large (AIVAZYAN at al.(1989), ANDERSON (1963), FUKUNAGA (1990)) and new sample Y = {y1 , ..., yn } must be ∗ constructed from sample X: yt = f (xt ) ∈ RN , N ∗ < N (t = 1, n), so as ∗ classification d = d(y; Y, Do ) ∈ S (y = f (x) ∈ RN ) remains acceptable. In this paper the well-known Fisher model (AIVAZYAN at al.(1989)) of multivariate normal (Gaussian) distribution mixture is investigated:
pi (x) = nN (x|µi , Σ),
x ∈ RN ,
i ∈ S,
(2)
Stepwise Principle Components Selection
143
where nN (x|µi , Σ) is N -variate Gaussian probability density function with mean vector µi = E{x| do = i} ∈ RN and non-singular covariance (N × N )matrix Σ = E{(x − µi )(x − µi )T | do = i} (det(Σ) = 0), common for all classes.
2
Bayesian Decision Rule and Probability of Error Classification
For the Fisher model (1), (2) to classify a random observation x ∈ RN the well-known Bayesian DR (BDR) (AIVAZYAN at al.(1989), ANDERSON (1963), FUKUNAGA (1990)), which minimizes the risk (the classification error probability) ro = P{do (x) = do } is used: do (x) = arg max{2 ln πi − (x − µi )T Σ −1 (x − µi )}, i∈S
ro =
i∈S
+ ∆2jk πk T −1 πi I (x − µj ) Σ (µj − µk ) + − ln × 2 πj j∈S k∈S j=i
RN
(3)
k=j
×nN (x|µi , Σ)dx, where I(z) = {1, if z ≥ 0; 0, if z < 0} is the unit function and ∆jk = (µj − µk )T Σ −1 (µj − µk ), j = k ∈ S.
(4)
is the Mahalanobis interclass distance between classes Ωj , Ωk . The risk (3) is the primary efficiency criterion in statistical classification theory, but its analytic investigation is difficult. Let state the helpful theorem. Theorem 1. Under the conditions of Fisher model (1), (2) the following inequality for classification error probability (3) is true: ∆ij ln(πi /πj ) ro ≤ πi Φ − − , (5) 2 ∆ij j∈S i∈S
j=i
z where Φ(z) = −∞ ϕ(w)dw is the standard Gaussian distribution function with the probability density function ϕ(w) = √12π exp(−w2 /2). In case of two classes (L = 2) the exact equality takes place.
Proof. Let calculate the upper estimate of the risk value (3):
+ ∆2ij πi T −1 ro ≤ πi I (x − µj ) Σ (µj − µi ) + − ln nN (x|µi , Σ)dx. 2 πj j∈S i∈S
j=i R
N
144
E. Serikova and E. Zhuk
Make following substitutions in turn y = (x−µj )T Σ −1 (µj −µi )+
2 N ∆ij
2
−ln
πi , πj
z=−
y ln(πi /πj ) N ∆ij − − , 2 N ∆ij N ∆ij
and fined (5). The theorem is proved. It is seen from (5) that increasing of interclass distances (4) amounts to the classification error probability decreasing, if interclass distances are sufficiently great or if prior class probabilities are equal. So in order to evaluate the classification efficiency, when dimensionality reducing of initial feature space takes place, it is necessary to investigate the behavior of interclass distances.
3
Interclass Distance Properties in the Space of Principle Components
Let investigate the efficiency of well-known principle components selection method (AIVAZYAN at al.(1989), ANDERSON (1963), FUKUNAGA (1990)). According to this method N -vector x = (˜ x1 , ..., x ˜N )T ∈ RN with the covariance (N × N )-matrix Σ is linearly transformed: y˜k = y˜k (x) = ΨkT x,
k = 1, N ,
(6)
where {Ψk }N k=1 are orthonormalized eigenvectors of matrix Σ: ΣΨk = λk Ψk ,
ΨkT Ψl = {1, if k = l; 0, if k = l},
k, l = 1, N ,
(7)
and {λk }N k=1 are descent ordered eigenvalues of matrix Σ: λ1 ≥ λ2 ≥ ... ≥ λN . Obtained values y˜1 , ..., y˜N are called principle components for initial observation x = (˜ x1 , ..., x ˜N )T . Principle components are uncorrelated, dispersion value of component number k is equal to appropriate eigenvalue: D{˜ yk } = N N N λk > 0, and tr Σ = D{˜ xk } = D{˜ yk } = λk . k=1
k=1
k=1
To detect informative principle components the criterion of large variability of these components is applied. Components with small dispersion are rejected and N ∗ (N ∗ ≤ N ) first principle components y˜1 , ..., y˜N ∗ are used. The number of informative components N ∗ is defined by the following rule: ∗
N (ε) = min{k : 1 − ν(k) ≤ ε, k = 1, N },
k l=1 ν(k) = N l=1
λl
λl
(8)
where ε ∈ [0, 1) is a predetermined value, 0 < ν(k) ≤ 1 is the relative summarized dispersion fraction of first k principle components (ν(N ) = 1). Let new sample Y = {y1 , ..., yn } be constructed from X = {x1 , ..., xn } using principle components method (6),(7): yt = f (xt ) = Ψ N xt , t = 1, n,
Stepwise Principle Components Selection
145
. . where Ψ N = (Ψ1 .......ΨN )T is (N × N )-matrix composed of eigenvectors of matrix Σ. Observations {yt }nt=1 from Y are described by the Fisher model with the following parameters {mi }i∈S and Σy : Σy = diag{λ1 , ..., λN }.
mi = (mi,1 , mi,2 , ..., mi,N )T = Ψ N µi ,
(9)
Let N ∗ ∆yij be the Mahalanobis interclass distance between classes Ωi , Ωj in the space of first N ∗ principle components (AIVAZYAN at al.(1989)): ∗ ∆ = (mi (N ∗ ) − mj (N ∗ ))T (Σy (N ∗ ))(mi (N ∗ ) − mj (N ∗ )), i = j ∈ S, N yij
∗ where mi (N ∗ ) ∈ RN i∈S and Σy (N ∗ ) are obtained from {mi }i∈S , Σy by removal of last N − N ∗ rows and columns. Notice that in the case of N = N ∗ : N ∆yij is the Mahalanobis interclass distance between classes Ωi , Ωj in the space of all N principle components: ∆ = (mi − mj )T Σy−1 (mi − mj ), i = j ∈ S. N yij To investigate the efficiency of principle components selection based on their variability at first let us investigate the behavior of interclass distances. Theorem 2. The Mahalanobis interclass distance N ∗ ∆yij in the space of first N ∗ principle components (N ∗ < N ) is related to appropriate interclass distance N ∆yij in the space of all N principle components by expression: 2 N ∗ ∆yij
N
= N ∆2yij −
l=N ∗ +1
(mi,l − mj,l )2 , λl
i = j ∈ S,
(10)
and in the case of N = N ∗ it coincides with appropriate distance (4) in the initial space: N ∆yij = ∆ij , i = j ∈ S. Proof. Using obvious properties: Σy−1 = diag{1/λ1 , . . . , 1/λN }, Ψ N (Ψ N )T = N (Ψ N )T Ψ N = l=1 Ψl ΨlT = IN , and expressions (7): Σ −1 Ψj = λ−1 j Ψj , j = 1, N , transform interclass distance in the space of N ∗ principle components: 2 N ∗ ∆yij
∗
∗
= (µi − µj )T (Ψ N )T (Σy (N ∗ ))−1 Ψ N (µi (N ∗ ) − µj (N ∗ )) = ∗
∗
N N 1 = (µi − µj ) Ψl ΨlT (µi − µj ) = (µi − µj )T Σ −1 Ψl ΨlT (µi − µj ). λl T
l=1
∗
If N = N :
2 N ∗ ∆yij
2 N ∗ ∆yij
l=1
=
∆2ij .
∗
For N < N continue the transformation:
= (µi − µj )T Σ −1 IN −
N l=N ∗ +1
Ψl ΨlT (µi − µj ) =
146
E. Serikova and E. Zhuk N
= N ∆2yij −
λl ((µi −µj )T Σ −1 Ψl )2 = N ∆2yij −
l=N ∗ +1
N
l=N ∗ +1
2 1 (µi − µj )T Ψl . λl
According to (9) (mi,l − mj,l )T = (µi − µj )T Ψl . The theorem is proved. Corollary 1. Under the conditions of Theorem 2 the inequality is true: 2 2 2 N ∆yij − N ∗ ∆yij ≤ |µi − µj |
N l=N ∗ +1
where |z| =
√
1 , λl
i = j ∈ S,
(11)
z T z is the Euclidean norm of vector z ∈ RN .
It is seen from (10), (11) that the interclass distances decrease when features are rejected from the space of N principle components: N ∗ ∆yij ≤ N ∆yij , i = j ∈ S, and the value of this decreasing is inversely proportional to dispersions D{˜ yl } = λl > 0, l = N ∗ + 1, N . Therefore the rejection of principle components with small dispersions as in (8) can cause the acute increase of the classification error probability. The results of Theorem 1 and Theorem 2 allow to introduce new (directly connected with the classification error probability) criterion of principle component rejection. The rejected component number k (k ∈ {1, ..., N }) should minimize the average decrease of interclass distances square: δ∆2y (k) =
L2
1 2 2 N ∆yij − N \k ∆yij , −L j∈S i∈S
j=i
where N \k ∆yij is the Mahalanobis interclass distance in the space of N − 1 principle components after rejecting the component number k. Notice that this criterion is universal and can be applied not in the space of principle components only (see SERIKOVA (2004)). According to (10) the expression for the average decrease of interclass distances square can be written in the form: δ∆2y (k) =
(mi,k − mj,k )2 2 . L(L − 1) λk j∈S i∈S
4
(12)
j>i
Stepwise Discriminant Procedure in the Space of Principle Components
Now let us describe the classical procedure of discriminant analysis in the space of principle components and using obtained analytic results propose new stepwise discriminant procedure based on interclass distance behaviour in the space of principle components.
Stepwise Principle Components Selection
147
ˆ of class characteristics Stage 1. Statistical estimates {ˆ πi , µ ˆi }i∈S , Σ N {πi , µi}i∈S , Σ are calculated in the initial space R (AIVAZYAN at al.(1989)): ni ; n
π ˆi =
n 1 δdo ,i xt ; ni t=1 t
µ ˆi =
ni =
n
δdot ,i ,
i ∈ S;
(13)
t=1
1 (xt − µ ˆdot )(xt − µ ˆdot )T . n − L t=1 n
ˆ= Σ
ˆy of class characteristics {mi }i∈S , Stage 2. Statistical estimates {m ˆ i }i∈S , Σ Σy are calculated in the space of principle components by applying (9): m ˆ i = Ψˆ N µ ˆi ,
i ∈ S;
ˆ 1 , ..., λ ˆ N }, ˆy = diag{λ Σ
ˆ 1 ≥ ... ≥ λ ˆ N are decrease ordered eigenvalues of matrix Σ ˆ from (13), where λ . . ˆ Ψˆ N = (Ψˆ .......Ψˆ )T is (N × N )-matrix composed of eigenvectors of matrix Σ. 1
N
Stage 3. For classical procedure based on components variability the number of informative principle components N ∗ = N ∗ (ε) is defined according to the rule (ε ∈ [0, 1) is any predetemined value): k ˆ λl ˆ ∗ (ε) = min{k : 1 − κ N ˆ (k) ≤ ε, k = 1, N }, κ ˆ (k) = l=1 . (14) N ˆ l=1 λl
For new stepwise procedure based on interclass distance behavior the following steps are performed. Let M (o) = {1, ..., N } be the initial set of principle component numbers. At step g (g = 1, N − 1) component with number k (g) is rejected from the set of principle components M (g−1) : M (g) = M (g−1) \{k (g) }. Rejected component minimizes the average decrease of interclass distances square:
k (g) = arg
min l∈M (g−1)
ˆ 2 (l) = δ∆ y
ˆ 2 (l), δ∆ y
(m 2 ˆ i,k − m ˆ j,k )2 . ˆk L(L − 1) λ j∈S i∈S
j>i
The relative decrement of the average interclass distancse square is calculated: g ˆ 2 (s) ˆ 2 (k (g) ) ) δ∆ y s=1 δ∆y (k (g) (g−1) δK = + = δK . N ˆ 2 N ˆ 2 δ∆ (l) δ∆ (l) l=1
y
l=1
y
Stage 4. The behavior of interclass distances is analysed by means of increasing consequence of relative decrements 0 ≤ δK (g) < 1, g = 0, N − 1 ˆ∗ = N ˆ ∗ (ε) (δK (o) := 0). The number of informative principle components N is defined according to the rule (ε ∈ [0, 1) is any predetemined value):
ˆ∗ = N ˆ ∗ (ε) = N − g ∗ (ε), N
g ∗ (ε) = max{g : δK (g) ≤ ε, g = 0, N − 1}.
148
E. Serikova and E. Zhuk ˆ∗
Stage 5. In the space of informative principle components RN plug-in BDR is constructed:
ˆ y (N ˆ ∗) − m ˆ ∗ ))T × ˆ ∗ )) = arg max{2 ln π ˆi − (ˆ y (N ˆ i (N d(ˆ i∈S
ˆy (N ˆ ))−1 (ˆ ˆ ∗) − m ˆ ∗ ))}, ×(Σ y (N ˆ i (N
(15)
ˆ∗
ˆ ∗ ) ∈ RN are the features from observation yˆ = Ψˆ N x ∈ RN where yˆ(N ˆ ∗ )}i∈S and using for classification (x ∈ RN is an initial observation); {m ˆ i (N ∗ ˆ ˆ Σy (N ) are estimates of mean vectors and covariance matrix in the space ˆ ∗ principle components obtained from {m ˆy by removal rows of N ˆ i }i∈S , Σ and columns with numbers of rejected components. For classical procedure ˆ ∗ }. For ones are components with numbers not presented in the set {1, · · · , N ∗ new procedure these are components with numbers not from the set M (g ) , ∗ ∗ ˆ . g =N −N
5
Experimental Efficiency Investigation
To investigate the efficiency of proposed discriminant procedure the computing experiment was carried out. The real data of oncological disease was considered. The dimension of initial feature space was equal to twelve (N = 12). It is necessary to distinguish between the absence of a desease and three cancer stages (number of classes L = 4) for new incoming observations by carry out discriminant analysis of training sample X = {x1 , ..., xn } with size n = 140. The problem was solved applying new stepwise discriminant procedure based on interclass distance behavior. Obtained results were compared with classical procedure based on well-known method of principle component selection from (14). The classification was performed on every stage g (g = 0, N − 1) of discriminant procedures in the space of principle components. The results are presented in the Table 1. ∗ The values of experimental error fraction (AIVAZYAN at al.(1989) ) γnN ∗ N and γn,m for training sample X = {xn , . . . , xn } and for test sample X(m) = {xn+1 , . . . , xn+m } of new registered observations with size m = 40 were calculated as indicators of accepted decisions efficiency (ˆ yn+j = Ψˆ N xn+j ∈ RN , j = 1, m): ∗ γnN
1 = 1 − δd(ˆ ˆ yt (N ∗ )),do , t n t=1 n
N∗ γn,m
1 = 1 − δd(ˆ , ˆ yn+j (N ∗ )),do n+j m j=1 m
ˆ yn+j (N ∗ )) ∈ S are decisions accepted by DR (15) for ˆ yt (N ∗ )), d(ˆ where d(ˆ ∗ ∗ observations yˆ(N ), yˆn+j (N ∗ ) ∈ RN (N ∗ = N − g) correspondingly. From the Table 1 it is seen that the well-known procedure based on dispersion fraction is less effective and doesn’t allow to detect the number of
Stepwise Principle Components Selection Step number,
g
0 1 2 3 4 5 6 7 8 9 10 11
Classical procedure based on dispersion fraction
New procedure based on interclass distance behavior
k(g) 1 − κ ˆ (N − g) γnN −g
N −g γn,m k(g) δK (g)
γnN −g
N −g γn,m
0 12 11 10 9 8 7 6 5 4 3 2
0.000 0.050 0.050 0.125 0.125 0.100 0.125 0.100 0.175 0.250 0.450 0.600
0.007 0.029 0.043 0.043 0.036 0.043 0.036 0.050 0.043 0.043 0.071 0.293
0.000 0.000 0.000 0.000 0.025 0.000 0.175 0.150 0.075 0.075 0.125 0.475
0.000000 0.000000 0.000000 0.000001 0.000004 0.000008 0.000022 0.000065 0.000199 0.000589 0.001505 0.007420
0.007 0.007 0.029 0.043 0.036 0.028 0.093 0.236 0.371 0.486 0.521 0.671
149
0 1 4 3 2 7 10 8 5 6 9 11
0.000 0.003 0.011 0.020 0.036 0.056 0.083 0.127 0.175 0.259 0.371 0.561
Table 1. Experimental results.
informative principle components adequately. Whereas new procedure based on interclass distance behavior is more effective. The acute decrease of interclass distances takes place after rejecting of ten components and for acceptable classification it is necessary to use at least three principle components. Note that new procedure leaves components with numbers nine, eleven and twelve as the most informative ones. Whereas according to classical procedure these components have the smallest dispersion fractions and were falsely rejected on first steps.
References AIVAZYAN S., BUCHSTABER V., YENYUKOV I., MESHALKIN L. (1989): Applied statistics: Classification and Dimensionality Reduction. Finansy i Statistika, Moskow. ANDERSON Y. (1963): An Introduction to Multivariate Statistical Analysis. Viley, New York. FUKUNAGA K. (1990): Introduction to statistical pattern recognition. Academic Press, New York. SERIKOVA E. (2004): Admissible sample size for stepwise discriminant procedure based on interclass distance behavior. Computer Data Analysis and Modeling: robustness and computer intensive methods. September, Minsk, 189–192.
A Comparison of Validation Methods for Learning Vector Quantization and for Support Vector Machines on Two Biomedical Data Sets David Sommer and Martin Golz Department of Computer Science, University of Applied Sciences Schmalkalden
Abstract. We compare two comprehensive classification algorithms, support vector machines (SVM) and several variants of learning vector quantization (LVQ), with respect to different validation methods. The generalization ability is estimated by ”multiple-hold-out” (MHO) and by ”leave-one-out” (LOO) cross validation method. The ξα-method, a further estimation method, which is only applicable for SVM and is computationally more efficient, is also used. Calculations on two different biomedical data sets generated of experimental data measured in our own laboratory are presented. The first data set contains 748 feature vectors extracted of posturographic signals which were obtained in investigations of balance control in upright standing of 48 young adults. Two different classes are labelled as ”without alcoholic impairment” and ”with alcoholic impairment”. This classification task aims the detection of small unknown changes in a relative complex signal with high inter-individual variability. The second data set contains 6432 feature vectors extracted of electroencephalographic and electroocculographic signals recorded during overnight driving simulations of 22 young adults. Short intrusions of sleep during driving, so-called microsleep events, were observed. They form examples of the first class. The second class contains examples of fatigue states, whereas driving is still possible. If microsleep events happen in typical states of brain activity, the recorded signals should contain typical alterations, and therefore discrimination from signals of the second class, which do not refer to such states, should be possible. Optimal kernel parameters of SVM are found by searching minimal test errors with all three validation methods. Results obtained on both different biomedical data sets show different optimal kernel parameters depending on the validation method. It is shown, that the ξα-method seems to be biased and therefore LOO or MHO method should be preferred. A comparison of eight different variants of LVQ and six other classification methods using MHO validation yields that SVM performs best for the second and more complex data set and SVM, GRLVQ and OLVQ1 show nearly the same performance for the first data set.
1
Introduction
Support Vector Machines and Learning Vector Quantization are two efficient methods of machine learning which are approved e.g. in handwritten word recognition, robotic navigation, textual categorization, face recognition and
A Comparison of Validation Methods for LVQ and SVM
151
time series prediction [M¨ uller et al. (2001), Osuna et al. (1997), Cao and Tay (2003)]. The aim of this paper is to compare both methods on two real world biomedical data sets using several variants of LVQ and of SVM and of some other classification algorithms. Among them are several methods of automatic relevance detection, e.g. recently introduced GRLVQ [Hammer and Villmann (2002)]. Calculations were done on two fully different biomedical data sets coming from two different disciplines: biomechanics and electrophysiology applied to psychophysiology. The first data set comes out of an investigation of balance control in upright standing of 48 young volunteers. They were investigated without impairment and 40 minutes after consumption of 32 grams of alcohol. Therefore we have two different classes which are labelled as ”without alcoholic impairment” and ”with alcoholic impairment”. Subjects had to stand on a solid plate with elevated arms and turned hands, the so-called supination position [Golz et al. (2004)]. Signals of four force sensors located between plate and ground are combined to calculate the two-dimensional signal of the centre-of-foot-pressure, which is a sensitive measure of postural sway. From both signals the power spectral densities were estimated by Burg’s autoregressive modelling method. This two-class problem is nearly weight out and consists of 376 feature vectors of 40 components. This classification task aims the detection of small unknown changes in a relative complex signal with high inter-individual variability. The second and clearly more extensive and higher-dimensional data set contains power spectral densities of electroencephalograms (EEG) and electrooculograms (EOG) recorded during strong fatigue states and during microsleep events of 16 young car drivers [Sommer and Golz (2003)]. Microsleep events are defined as short intrusions of sleep into ongoing wakefulness during attentional tasks and are coupled to dangerous attention losses. The decision which behavioural event belongs to ”microsleep events” and which to ”strong fatigue” was made by two independent experts. This was mainly done by visual scoring of video recordings. Subjects had to drive overnight starting at 1:00 a.m. (7 x 40 min) in our driving simulation lab under monotonic conditions. Small segments (duration 6 sec) of EEG and EOG were taken during both events. A comparison of several spectral estimation methods yields that Burg’s autoregressive method is outperformed by the simple periodogram method [Sommer and Golz (2003)]. In this paper we therefore report only on results for the second data set using the latter method. The extracted data set contains 5728 feature vectors of 207 components. This classification task also aims the detection of small unknown changes in a relative complex signal with high inter-individual variability. If microsleep events happen in typical states of brain activity, the recorded signals should contain typical alterations, and therefore discrimination from signals of the second class, which do not refer to such states, should be possible. There exists no expert knowledge to solve both classification tasks. Knowl-
152
D. Sommer and M. Golz
edge extraction in both fields is strongly impaired due to high inter-individual differences in the observed biosignals and due to high noise. Therefore, adaptive and robust methods of machine learning are essential. Learning Vector Quantization (LVQ) [Kohonen (2001)] is a supervised learning and prototype-vector based classification method which adapts a piecewice linear discriminant function using a relative simple learning rule due to the principle of competitive learning. Activation of neurons is based on distance measures and therefore depends on metrics used. A known disadvantage of LVQ is its high dependence on initialization of the weight matrix [Song and Lee (1996)] which can be decreased by an initial unsupervised phase of training [Golz et al. (1998)]. [Sato (1999)] developed a modification, the so-called Generalized LVQ to decrease variance due to initializations. Other developments are LVQ methods which iteratively adapt a feature weighting during training to improve results and to give back a feature relevance measure. Here we used three representatives, the Distinctive Selection LVQ (DSLVQ), the Relevance LVQ (RLVQ) and the Generalized Relevance LVQ (GRLVQ) (for references we refer to [Hammer and Villmann (2002)]. The Support Vector Machine (SVM) [Vapnik (1995)] is also a supervised learning method and is more computationally expensive than LVQ. In its basic version, SVM can only adapt to linearly separable two-class problems. Advantageously, training is restricted to search for only those input vectors which are crucial for classification. They are called support vectors and are found by solving a quadratic optimization problem. For real world applications the soft-margin SVM [Cortes and Vapnik (1995)] is commonly used which allows a restricted number of training set errors. Another advantage of SVM in comparison to many other classification methods is the uniqueness of the solution found and the resulting independence on initialization and on training sequencing. Important parameters are the slack variable and the type and parameters of the kernel function. Disadvantages of SVM like the relative large memory allocation during training and the relative slow convergence can be removed by optimization of the training algorithm [Joachims (2002)]. This is essential to apply SVM to larger sized problems.
2
Performance Measurement
The performance of a classification algorithm is generally problem dependent. The ability of generalization is a measure of expected correct classifications of unknown patterns of the same underlying distribution function as of the training set. It can be estimated empirically by calculation of test set error rate. Here, we utilize two cross validation method, the ”multiple hold-out” (MHO) and the ”leave-one-out” (LOO) method [Devroye et al. (1996)]. Both methods require a learning set (training + test set) of statistically independent feature vectors. This is e.g. violated in time series processing when using overlapping segmentation; otherwise too optimistic estimates are resulting.
A Comparison of Validation Methods for LVQ and SVM
153
Fig. 1. Semilogarithmic plot of mean training (left) and mean test (right) errors of SVM vs. parameter gamma of Gaussian kernel function applied to posturography data. Estimates of LOO method are shown by left upper graph and by right lower graph, estimates of MHO method are shown by graph with errorbars, and ξα-estimate by right, upper graph. The regularization parameter of C = 10 was separately found to be sufficient.
The acquisition of statistically independent patterns is expensive. In biomedical problems this process often requires an independent scoring process mostly done by experts and requires experimental and organisational effort. As a consequence, relative small data sets on small groups of test subjects are mostly available. Processing of those data sets should be as efficient as possible under the restriction of computational resources [Joachims (2002)]. The MHO validation consumes less computational time than the LOO method. The first method has the ratio of sizes of test and training set as a free selectable parameter for which upper and lower bounds are estimable [Kearns (1996)]. After repeating N times the random partition in test and training set following up by single hold-out estimation one can conclude estimates of adaptivity and ability of generalization by descriptive statistics. We calculate mean and standard deviations of training and test errors. Disadvantageously MHO is biased, because of the limited hypothesis space [Joachims (2002)]. This limitation is minimal in case of LOO because the size of the training set is reduced by only one feature vector. Therefore, this method supplies an almost unbiased estimate of the true classification error. In the special case of the SVM classificator the ξα-estimate was proposed [Joachims (2002)]. This estimator avoids high computational effort. There is no common criterium for the choice of kernel function [M¨ uller et al. (2001)]. Each function type has few parameters which can be defined empirically. Mostly this is done by variation of parameters and calculation of classification errors or the VC-dimension [Van Gestel et al. (2002), Joachims (2002)]. The slack variable is determined in the same manner. For our data sets we have tested the linear, the polynomial and the Gaussian kernel function. In the following we refer only to results of Gaussian kernel SVM because
154
D. Sommer and M. Golz
Fig. 2. Semilogarithmic plot of mean training (left) and mean test (right) errors of SVM vs. parameter gamma of Gaussian kernel function applied to microsleep data. Estimates of LOO method are shown by left upper graph and by right lower graph, estimates of MHO method are shown by graph with errorbars, and ξα-estimate by right, upper graph. The regularization parameter of C = 1.5 was separately found to be sufficient.
they performed best in all cases. Variation of the parameter gamma, which predefines the influence region of single support vectors, shows even in a semilogarithmic plot a gradually decreasing test error which is abruptly increasing after the optimum (Fig. 1 right). Test errors are in case of SVM efficiently computable by LOO method and are mostly slightly lower than mean errors of MHO method. The same plot, but calculated for training errors (Fig. 1 left), shows an inverse result. Training errors of LOO method are mostly slightly higher than MHO results. The ξα-estimate shows a different dependence on gamma and is in the vicinity of both other estimate only in a small range of gamma. Therefore, the ξα-estimate should not be suitable for selection of parameters. Results of the second data set (Fig. 2) are similar to the first, though the processes of data generation are fundamentally distinct. A difference is seen in optimal value of gamma and another in optimal value of mean test errors (Fig. 2 right). The optimal mean test error of the microsleep data set is about 9.8% and the standard deviation is clearly lower, which is argued by the clearly higher size of the data set. On this data set the ξα-method is resulting in the same optimal parameter gamma than both other estimations, but is estimating clearly higher errors.
3
Comparison of Different Classification Methods
In the following we want to compare several variants of LVQ, SVM and other classification methods applied to both data sets. In addition to the originally proposed variants, LVQ1, LVQ2.1, LVQ3, OLVQ1 [Kohonen (2001)], we used
A Comparison of Validation Methods for LVQ and SVM
155
four further variants for relevance detection and feature weighting as mentioned above. Furthermore, some unsupervised learning methods which are calibrated by class labels after training. We compare well-known k-Means and Self-Organizing Map to a representative of incremental neural networks, the Growing Cell Structures [Fritzke (1994)]. All three methods find out a trade-off between a quantized adaptation of the probability density function and a minimization of the mean squared error of vector quantization. In all three unsupervised methods we tested also the modification ”supervised” (sv) which is using the class label as a further component in input vectors of the training set [Kohonen (2001)]. The term is somewhat misleading because training remains unsupervised. Though this modification has only a small effect on distance calculations during training, the algorithm should be able to adapt better. Therefore, training errors are always lower than without modification ”sv”. The posturography data set (Tab. 1A) is very well adaptable reflecting in very low training errors, especially for supervised learning methods which perform nearly equally by mean errors of about 1% and lower. The ability of generalization is also nearly equal suggesting by mean test errors of about 4% which is unusually low for real world biosignals. The quickly converging method OLVQ1 arrives at same level than modern methods GRLVQ and SVM. As expected, in (Tab. 1A) a large difference in test errors to unsupervised learning methods is evident. The modification ”sv” allows the algorithm to find a more generalizable discriminant function. The second and more complex data set (microsleep data) supplies different results (Tab. 1B). Training errors are much higher despite the exception of no errors of SVM. Unsupervised learning methods with modification ”sv” perform better than all LVQ variants with respect to training errors. Among all LVQ variants OLVQ1 performs best. Two modified LVQ algorithms for relevance detection perform slightly worse, but better than standard LVQ. The higher complexity is also reflected in test errors. They are between 14% and 16% for all LVQ variants and are best for OLVQ1. Here, SVM shows lowest errors and the best ability to handle higher complexity. The relative improvement (∆E / E) compared to LVQ variants is about 30%. As expected, unsupervised methods are not able to perform comparably. Interestingly, in case of microsleep data there is no difference in test errors between unsupervised methods with and without modification ”sv”. This modification shows better adaptivity in all cases shown by lower training errors (Tab. 1), but doesn’t improve test errors in more complex data.
4
Conclusions
Both real world two-class problems have been solved with low error rates using prototype-vector based classification methods. The posturography data set has shown very good discriminability indicating high sensitivity of this measurement technique to small and unknown changes. This result is achiev-
156
D. Sommer and M. Golz
Table 1. mean and standard deviations of test and training errors of different classification methods applied to posturography (A) and to microsleep data (B)
able only by processing spectral domain features. As not reported here, we failed in achieving similar results using alternatively 23 time domain features which were reported of several authors in the posturography literature of the last two decades. As well as processing of all 23 features and as also processing some combinations of them did not lead to similar results as by spectral domain features. This indicates that no simple effects, like changes in amplitude histogram, but dynamical aspects of postural time series are influenced the effect of alcohol intake on posture. OLVQ1, SVM and the recently introduced GRLVQ perform best. The first method is the most simplest and fastest in convergence. Their iterative adaptation rule of step size during training seems to be the key point to outperform other adjacently associated methods, like LVQ1. In a more complex data set (microsleep) which has much more feature vectors and higher dimensionality than the posturography data set SVM outperforms all other methods. In contrast to all other methods SVM is not dependent on initializations and always finds out the global minimum of the error function [M¨ uller et al. (2001)]. Utilizing LOO method to estimate the ability of generalization is computationally expensive but in case of SVM an efficient calculation using support vectors only can be used. The ξα-estimator is also an efficient method, but as our empirical results on both biomedical data sets indicate, this estimator seems to be biased. Therefore, SVM combined with
A Comparison of Validation Methods for LVQ and SVM
157
LOO validation exposes to be the most recommendable combination. Nevertheless, in some parameter settings the SVM combined with all three mentioned validation methods needs up to 100 times more computational effort than OLVQ1 combined with MHO validation. For extensive scanning of parameters in the whole processing cue, we therefore recommend to apply OLVQ1 / MHO and for subsequent fine tuning we recommend to apply SVM / LOO.
References CAO, L.J. and TAY, F.E.H. (2003): Support Vector Machine With Adaptive Parameters in Financial Time Series Forecasting. IEEE Transactions on Neural Networks, 14, 1506–1518. CORTES, C. and VAPNIK, V. (1995): Support Vector Networks. Machine Learning, 20, 273–297. DEVROYE, L.; GYORFI, L.; LUGOSI, G. (1996): A probabilistic theory of pattern recognition. Springer; New York. FRITZKE, B. (1994): Growing Cell Structures - A Self-Organizing Network for Unsupervised and Supervised Learning. Neural Networks, 7, 1441–1460. GOLZ, M.; SOMMER, D.; LEMBCKE, T.; KURELLA, B. (1998): Classification of the pre-stimulus-EEG of k-complexes using competitive learning networks. ´ Aachen, 1767–1771. EUFIT 98. GOLZ, M.; SOMMER, D.; WALTHER, L.; EURICH, C. (2004): Discriminance Analysis of Postural Sway Trajectories with Neural Networks SCI2004, VII. Orlando, USA, 151–155. HAMMER, B. and VILLMANN, T. (2002): Generalized relevance learning vector quantization. Neural Networks, 15, 1059–1068. JOACHIMS, T. (2002): Learning to Classify Text Using Support Vector Machines. Kluwer; Boston. KEARNS, M. (1996): A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split. Advances in Neural Information Processing Systems, 8, 183–189. KOHONEN, T. (2001): Self-Organizing Maps (third edition). Springer, NewYork. ¨ ¨ ¨ MULLER, K.-R.; S. MIKA; RATSCH, G.; TSUDA, K.; SCHOLKOPF, B. (2001): An Introduction to Kernel-Based Learning Algorithms. IEEE Transactions on Neural Networks, 12(2), 181–201. OSUNA, E.; FREUND, R.; GIROSI, F.; (1997): Training Support Vector Machines: ´ Puerto Rico. an Application to Face Detection. Proceedings of CVPR 97. SATO, A. (1999): An Analysis of Initial State Dependence in Generalized LVQ. In: ´ D. Willshaw et al. (Eds.): (ICANN 99). IEEE Press; , 928–933. SOMMER, D. and GOLZ, M. (2003): Short-Time Prognosis of Microsleep Events by Artificial Neural Networks. Proc. Medizin und Mobilit¨ at. Berlin, 149–151. SONG, H. and LEE, S. (1996): LVQ Combined with Simulated Annealing for Optimal Design of Large-set Reference Models. Neural Networks, 9, 329–336. VAN GESTEL, T.; SUYKENS, J.; BAESENS, B.; VIAENE, S.; VANTHIENEN, J.; DEDENE, G.; DE MOOR, B.; VANDEWALLE, J. (2002): Benchmarking least squares support vector machine classifiers. Machine Learning. VAPNIK, V. (1995): The Nature of Statistical Learning Theory. Springer, New York.
Discriminant Analysis of Polythetically Described Older Palaeolithic Stone Flakes: Possibilities and Questions Thomas Weber Landesamt f¨ ur Denkmalpflege und Arch¨ aologie Sachsen-Anhalt
Abstract. Archaeological inventories of flaked stone artefacts are the most important sources for the reconstruction of mankind’s earliest history. It is necessary to evaluate also the blanks of tool production (“waste”) as the most numerous artefact category using statistical methods including features like absolute measurements and form quotients of the pieces and their striking platforms, the flaking angles, and the dorsal degradation data. In Central Europe, these three major chrono-technological groups of finds can be determined: from the Middle Pleistocene interglacial(s) 250,000 or 300,000, the Early Saalian glacial perhaps 200,000, and from the Early Weichselian glacial 100,000–60,000 years ago—represented by the inventories from Wallendorf, Markkleeberg, and K¨ onigsaue B. In this study, the attempt has been undertaken to separate these flake inventories using linear discriminant analysis and to use the results for the comparison with other artefact complexes with rather unclear chrono-technological positions.
Archaeological inventories of flaked stone artefacts are the most important sources for the reconstruction of mankind’s earliest history. In most archaeological sites they are much more numerous than the human (bone) remains itself, and when the survival conditions for calcareous material are difficult, we only find the worked stones as the only traces representing the Early Man. To analyze these oldest traces of our ancestors’ culture, it is, of course, not sufficient to study only the modified implements (retouched tools): they are clearly influenced by the functional requirements of the site represented by the inventory (as a hunting area, or a habitation structure, a short or a long term settlement etc.). To produce a lithic tool by flaking technique, a large number of half-products arose. These blanks of tool production, the “waste”, however, may tell us much more how our ancestors produced the artefacts. As flakes—each a product of a single beating process (Fig. 1)—in most of the cases are the most numerous finds from the archaeological inventories statistical methods should be used to compare the flake assemblages. These flakes can be the result of a modification process making a pebble or nodule to a “core tool”—or the origin pieces for a modification with retouch to “flake tools”. Sometimes it is possible to distinguish these two groups in an “inventory” of all flaked stones found in an archaeological entity (from the viewpoint of size, compared with the existing modified implements, etc.) but it is impossible to answer the question of the purpose for each flake exactly.
Fig. 1. Scheme of artefact production by stone flaking with a hammerstone (after Toth 1987). The attributes of the flakes used in the discriminant analysis include the number of flake scars / negatives on the striking platform (NSFR), the flaking angle between striking platform and ventral face, the portion of dorsal worked surface, and the three form quotients calculated by length (l), breadth (b), thickness (d) of the piece: LBI, RDI (for further explanations, see the text), and width (w) and depth (t) of the striking platform: WTI.
The pieces are described analytically using features such as (i) measurements of the pieces (length measured in flaking direction, breadth and thickness perpendicular to the length): dependent on raw material size but helping us to calculate form quotients reflecting technological changes, (ii) measurements and condition (preparation) of the striking platform, or rather of its remnant on the flake, (iii) the flaking angle between the striking platform and the "ventral face" that arises as the splitting face between the flake and the remaining core (influenced by the flaking technique: hard hammer, indirect percussion, etc.), (iv) dorsal degradation data (number of negatives/flake scars, worked surface portion, number of flaking directions), (v) form of the flake, etc. These data come in different forms of quantification—as continuously scaled measurements, as discrete counts, as estimations along a given scale (e.g., in 10% steps for the dorsal worked surface), or as nominal-scaled variables. To compare the inventories, we could undertake as many investigations as we have variables included in our attribute analysis. And some univariate studies—of the relative thickness index, of the flaking angle, of the dorsal worked surface—have given us valuable indications for the evaluation of the flaking techniques used at the different sites. It seems necessary, however, also to condense the information and to search for pictures based on a synchronized view of all the included variables: multivariate methods of description are needed. An example of the use of multivariate mathematical methods is the technique of discriminant analysis. Using this method, we can condense the information from the univariate statistics, and we can find exactly those measurements enabling us to separate our different find-spots in the most efficient way. Different "operational taxonomic units" may form the
subject of such a discriminant analysis: one possibility is to draw (by random numbers) samples from inventories with known cultural and/or geological background and then to ask for the positions of other inventories, or of the single pieces found in these other inventories, in relation to the discriminant function separating the artefacts of known origin. In Central Germany we have a long history of human settlement during the Older Stone Age, but it is an episodic history—as the North European glaciers several times covered the land up to the feet of the mountains. Of course, there was no human life during these phases with glaciers several hundreds of metres thick. But these large continental glaciations were relatively short events of less than 10,000 years, and between these glacial periods there were several thermal phases ("interglacials") and, at first, long epochs of cooling climate (early glacials or "anaglacials") interrupted by more or less "warm" time spans ("interstadials"). In the interglacials, and at least in the earlier warmer interstadials of the large glaciations, our ancestors settled in Central Germany. As we find their traces between cryogenic sediments, we can establish an order of human cultural history in relation to earth history. Central Europe has been covered by the Northern glaciers at least three times—in the so-called Elsterian, Saalian and Weichselian Glaciations (named after the rivers where characteristic traces of these glaciers have been found). In the Elsterian, the glaciers reached their southernmost extent, covering the German lowland including the Thuringian Basin up to Eisenach, Weimar, and Zwickau. After the Holsteinian interglacial, and perhaps after some smaller cold ("Fuhne") and warm ("Dömnitz") climatic stages, the largest ice extension of the Saalian was comparable with the Elsterian border, but these glaciers did not reach the Thuringian Basin. The town of Zeitz is used to characterize the largest Saalian ice extension. Later there are several ice margins reflecting only oscillations (Leipzig, Petersberg and perhaps also Fläming phases) of the melting glaciers. After the (last) Eemian interglacial, the glaciation reached Central Europe for the last time in the Weichselian period—up to a line along the Lower Havel and—further eastwards—south of Berlin. This landscape is characterized today by the remains of the glaciers—the lakes in Brandenburg and Mecklenburg-Vorpommern. While we know quite exactly that the largest extension of the Last Glacial occurred approximately 20,000 years ago, we have problems giving absolute dates for the earlier climatic periods. For the Eemian interglacial we reckon with 115,000–130,000 years, so that the late Saalian may be put at about 150,000 years ago. As the duration of the Saalian ice age is the subject of a controversial discussion, it is impossible to give exact dates for the Holsteinian Interglacial (between 250,000 and <400,000 years) and for the Elsterian (between 400,000 and 500,000 years). We have to take into account that the Elsterian glaciers, as the most extensive glacial advance of the whole Quaternary Ice Age, brought into our landscape the most "popular" raw material for the Stone Age people—the Baltic flint ("firestone")—and we have not yet found older stone artefacts (made from other raw materials) in Central Germany.
For our univariate studies and especially for the discriminant analysis of different flake features we selected three inventories:

Wallendorf, Kr. Merseburg-Querfurt. Gravel pit in the (post-Elsterian) "Older Middle Pleistocene river terrace". Here up to 12 metres of sands and gravels were accumulated on the bed of a river during a period in which the climate became colder and colder, beginning with a temperate wooded phase and ending under arctic permafrost conditions. This climatic development is reflected by the content of limestone (Muschelkalk) pebbles in the sediments, by the presence of different molluscs in several fine-grained sediment lenses, and by remains of ice wedges (cryoturbation structures) as permafrost indicators in the upper layers (Thum & Weber 1991). The artefacts were found mostly in the lower parts of the profile. It is possible that these remains of human culture belong to the transition from the warmer period immediately after the Elsterian (Holsteinian) to the first cooler event in the Saalian sensu lato (Fuhne). In calendar years, this may be between 250,000 and 300,000 years ago. From the viewpoint of traditional tool typology, the Wallendorf artefacts have been assigned to the Clactonian (after the finds from Clacton-on-Sea in southern England), with raw cores, thick flakes, tools with rough retouch and at least a few bifacially worked tools.

Markkleeberg near Leipzig. Gravels of a river immediately above the Tertiary (Oligocene) sea sands, covered by cryogenic sediments. As the ice last lay south of Leipzig in the Early Saalian, the Early Saalian (Zeitz phase ice margin) may give us a terminus ante quem for these gravels of the so-called "main terrace" ("Hauptterrasse"). In Markkleeberg most of the artefacts have also been found in layers more or less immediately above the gravel base. Perhaps there was a process of sediment accumulation comparable to the one observed in Wallendorf. The absolute dating of the Markkleeberg inventory depends on the Saalian chronology. As the southernmost Saalian extension to Zeitz, with the later ice margins near the Petersberg (Halle) and the Fläming, was a more or less integral climatic event, the terminus ante quem for this archaeological material can be given as not more than perhaps 160,000 years (though even in this case the development of the main terrace may have taken a long time span of perhaps several tens of thousands of years)—otherwise the date could be higher (but not as high as for the Wallendorf inventory). Typologically speaking, the Markkleeberg finds form part of the Acheulian (after St. Acheul in northern France), with characteristic handaxes and a flaking preparation technology (Levallois).

Königsaue B, Kr. Aschersleben-Staßfurt. The upper layers of a lignite mine come from the former Aschersleben lake, which originated by salt elutriation (halokinesis) and was drained in the 18th century. During the coal mining it was possible to observe a long sequence of lake sediments from the last (Eemian) interglacial through the Weichselian glacial up to the geological present (Holocene). In the deposits of an early Weichselian interstadial (Königsaue Ib) three archaeological horizons were observed, of which two—
Königsaue A and Königsaue C (the lower- and the uppermost)—are typologically characterized by the presence of bifacial knives ("Keilmesser", as "Central European Micoquien"), whereas the third—middle—horizon, Königsaue B, has yielded a "Mousterian without bifacial working". As this flake production probably comprises primarily pieces intended for further modification into retouched implements (as the Wallendorf and Markkleeberg inventories do) and not a remarkable number of Keilmesser resharpening waste (like, probably, Königsaue A and C), Königsaue B was taken for the comparison. The Königsaue interstadial shows indicators of a temperate cool climate, so that it can be paralleled with one of the first two warmer interstadials of the Weichselian, in calendar years between more than 60,000 and 100,000 years ago. Comparing these three inventories, the flaking features show clear changes during this time span of perhaps 200,000 years. The following features, taken at different parts of the pieces, were selected for the attempt to distinguish between the three sites only on the basis of the measured flake attributes: (i) number of negatives (flake scars) on the striking platform remnant; (ii) flaking angle between striking platform remnant and ventral face (FLANG), measured in degrees; (iii) portion of dorsal worked surface (ANTD), estimated in 10% steps between 0 and 10; (iv) length–breadth index of the flake (LBI): l/b (length measured in flaking direction, breadth perpendicular to it); (v) relative thickness index (RDI): 200d/(l + b)—thickness divided by the mean of length and breadth of the piece; (vi) width–depth index of the striking platform (remnant) (WTI): w/t—width measured in the direction of the breadth of the piece, depth perpendicular to it. As the three flake inventories are of quite different sizes, 100 pieces from each of them were selected by random numbers to establish linear discriminant functions, primarily between the chronologically "neighboured" samples from Wallendorf and Markkleeberg and from Markkleeberg and Königsaue. The two samples of 100 flakes each from Wallendorf and Markkleeberg can be classified correctly at a rate of 74% (77 from Wallendorf, 71 from Markkleeberg) using a classical linear discriminant function on the basis of the Mahalanobis distance, including the relative thickness and flaking angle measurements: LDF = 0.1031 · RDI + 0.02907 · FLANG. The value of 6.8375 calculated using this formula can be used as the best possible separation of the two samples: pieces with larger values are classified as "rather Wallendorf", artefacts with smaller values as "rather Markkleeberg". This result can also be plotted, and it is possible to show the positions of the "ideal flakes" (represented by the arithmetic means for RDI and flaking angle) from other inventories for which we try to give an assessment of the technological affinity (Fig. 2).
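The classification step just described is very simple to reproduce. The following sketch uses the coefficients and separation value quoted above, but the example flakes are made-up illustrative measurements, not data from the inventories.

```python
# Minimal sketch of the Wallendorf/Markkleeberg LDF classification; the
# coefficients and threshold are those quoted in the text, the example
# flake measurements below are hypothetical.
def ldf_wallendorf_markkleeberg(rdi, flang):
    """Linear discriminant score from relative thickness (RDI) and flaking angle (FLANG)."""
    return 0.1031 * rdi + 0.02907 * flang

THRESHOLD = 6.8375  # separation value quoted in the text

def classify(rdi, flang):
    score = ldf_wallendorf_markkleeberg(rdi, flang)
    return "rather Wallendorf" if score > THRESHOLD else "rather Markkleeberg"

if __name__ == "__main__":
    # hypothetical flakes: (RDI, flaking angle in degrees)
    for rdi, flang in [(34.0, 125.0), (26.0, 118.0)]:
        score = ldf_wallendorf_markkleeberg(rdi, flang)
        print(rdi, flang, round(score, 3), classify(rdi, flang))
```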
Fig. 2. Discriminant analysis of 100 randomly sampled flakes from each of the two inventories, with the arithmetic means of the samples (large symbols) and the standard deviations (dashed lines) for the discriminant variables RDI and FLANG, and the discrimination line 0.1031·RDI + 0.02907·FLANG = 6.8375 given by the discriminant function. Further inventories, represented by the arithmetic means of the two variables, come from the Clactonian (filled circles), the Palaeolithic from the Older Middle Pleistocene terrace (open circles), the Saalian Palaeolithic (mostly Acheulian with Levallois technique) (filled triangles), the rather Saalian Acheulian (open triangles), and the Early Weichselian Palaeolithic (squares): AR – Arneburg; BE – Bertingen; BIE – Biere; BMN – Barleben/Magdeburg-Neustadt; BO – Bottrop; BY – Barby; CL – Clacton; DH – Delitzsch-Südwest, main terrace; DI – Delitzsch-Südwest, Interstadial (Zwochau); EY – Eythra; FB – Froser Berg; GE – Gerwisch; GR – Gröbzig; GÜ – Gübs; HEY – Heyrothsberge; HU – Hundisburg; HX – Hoxne; KB – Königsaue B; LÜ – Lübbow; M – Markkleeberg; MA – Markröhlitz; ML – Memleben; MR – Magdeburg-Rothensee; MS – Magdeburg-Salbke; NIE – Niegripp; VA – Vahrholz; WD – Wallendorf; WN – Wangen; WO – Woltersdorf; WÖ – Wörbzig.
Most of the find spots classified as "Acheuloid" (by the presence of characteristic handaxes or the "Levallois" core preparation technology—triangles in the diagram) show a flake similarity rather to the Markkleeberg than to the Wallendorf pieces. This group of "Acheulian" finds includes material from the main terrace of northern Saxony, from the Middle Elbe river, from the Rhineland (Bottrop), and from Southeast England (Hoxne). The ideal flake from Königsaue A, evaluated with this LDF, shows a "hyper-Markkleeberg behaviour" (graphically, it lies in the lower left corner of the diagram). Studying the differences between Markkleeberg and Königsaue B, we obtain a combination of the flakes' dorsal worked surface, the width–depth index of the striking platform, and again the relative thickness: LDF = 0.3459 · ANTD + 0.1497 · WTI + 0.03518 · RDI. The separation value of 4.187 (lower values for Markkleeberg, higher for Königsaue B) enables us to classify correctly 86% of the Königsaue flakes (but only 56% of the Markkleeberg pieces, so that 44% of them are misclassified). The Middle Elbe river finds (discovered in and near Magdeburg) show here (and also in comparison with the Wallendorf–Markkleeberg LDF) results comparable to those from Markkleeberg (Fig. 3)—in agreement with our geological ideas about the Early Saalian dating of the Pleistocene sediments in which they are embedded.
Fig. 3. Discriminant analysis of 100 complete flint flakes selected by random numbers from each of Wallendorf (wd), Markkleeberg (m) and Königsaue B (kb), with the relative frequencies of the assignments to the Palaeolithic of Wallendorf (in the opposition wd–m; upper row), of Königsaue B (in the opposition m–kb; middle row), and to both Wallendorf and Königsaue B (flakes which are classified by both discriminant functions as not belonging to Markkleeberg). Black columns show statistically relevant inventories (with more than 50 observations), white columns irrelevant ones (fewer than 50 observations).
Another site—Salzgitter-Lebenstedt—has been debated with regard to its chronological position in the Saalian or Weichselian Glaciation for several decades. Asking for the highest similarity, we find the Salzgitter flakes in an intermediate position between Markkleeberg and Königsaue (Fig. 4). Studying these results of linear discriminant analyses, some interesting questions arise from the archaeological point of view. Even if such an attempt to distinguish Stone Age artefacts by their technological standards using multivariate statistical methods leads to interpretable results (the LDFs show clear differences between the included sites): what may we do with the obtained functions? (i) Can we test them statistically (as we do with the immediately observed—directly measured—attributes of the flakes)? In the case of Salzgitter-Lebenstedt, a Kolmogorov–Smirnov test of the maximal distance between cumulative distributions would give a value of 37.2% between SZ and M and of 22.3% between SZ and KB, compared with a highest random value Dλ,α = 21.28% for α = 0.95. The highest observed value, for M–KB, is 43% (maximum random value 19.23%). The observed different absolute numbers of pieces with affinities to Wallendorf (in the Wallendorf–Markkleeberg comparison) and to Königsaue (in the Markkleeberg–Königsaue comparison—cf. the diagrams in Fig. 3) could also be evaluated using statistical test techniques (e.g. χ2; see the sketch after this list).
Fig. 4. Discriminant analysis: the cumulative relative frequency distributions of the linear discriminant function values for the Markkleeberg and Königsaue B random samples compared with Salzgitter-Lebenstedt. The maximal distance from Salzgitter-Lebenstedt to Königsaue B is smaller than that to Markkleeberg.
Comparing the M–KB values for SZ with the LDF random sample for Markkleeberg, we find a value of χ2 = 19.7, and with the sample for Königsaue B χ2 = 1.55. (The LDF samples from M and KB differ more than randomly, with a value of 15.28; the—much larger—complete inventories with 276.81, whereas the M sample / M inventory comparison gives a value of 0.16 and the KB sample / KB inventory comparison 1.13—all results to be compared with a critical value of 4.76.) We see that the differences between the random samples (for Markkleeberg and Königsaue B) and the whole inventories from which these samples come are the result of random oscillations, whereas comparisons between the Markkleeberg and Königsaue B assemblages (LDF samples as well as whole inventories) show significant deviations. Even the Salzgitter material can be ordered within this trend—with its significant difference to Markkleeberg and higher affinity (only a random difference) to Königsaue B. But we must not forget that we constructed the LDF—at least for Markkleeberg and Königsaue B—with the express purpose of discriminating these two samples as clearly as possible. (ii) Can we then evaluate the distances to and between the other inventories, expressed as differences in the LDF distributions, perhaps in terms of technological development? (iii) How can we compare these different LDFs?
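A minimal sketch of the kind of two-sample tests mentioned in question (i): a Kolmogorov–Smirnov test on LDF value distributions and a chi-square test on classification counts. scipy is an assumption of this sketch, and the LDF values and counts below are hypothetical placeholders, not the Markkleeberg, Königsaue B or Salzgitter data reported above.

```python
# Hypothetical illustration of the two-sample tests mentioned in question (i).
# The arrays below are made-up placeholders, not the inventory data.
import numpy as np
from scipy.stats import ks_2samp, chi2_contingency

rng = np.random.default_rng(0)
ldf_site_a = rng.normal(3.8, 1.0, size=100)   # hypothetical LDF values, site A sample
ldf_site_b = rng.normal(4.6, 1.0, size=100)   # hypothetical LDF values, site B sample

ks_stat, ks_p = ks_2samp(ldf_site_a, ldf_site_b)
print(f"KS distance between LDF distributions: {ks_stat:.3f} (p = {ks_p:.3f})")

# counts of pieces classified to either side of the separation value, per sample
table = np.array([[56, 44],    # hypothetical site A: below / above threshold
                  [14, 86]])   # hypothetical site B: below / above threshold
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-square on classification counts: {chi2:.2f} (df = {dof}, p = {p:.3f})")
```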
References
THUM, J. and WEBER, T. (1991): Prospektion in Tagebaugebieten und Rekonstruktion der Siedlungsgeschichte im Paläolithikum — Oberflächenfunde versus tiefere Einschnitte. Veröffentlichungen des Museums für Ur- und Frühgeschichte Potsdam, 25, 21–25. Berlin.
TOTH, N. (1987): Die ersten Steinwerkzeuge. Spektrum der Wissenschaft, 6, 124–134.
Model-based Density Estimation by Independent Factor Analysis Daniela G. Calò, Angela Montanari, and Cinzia Viroli Department of Statistics, University of Bologna, Italy {calo,montanari,viroli}@stat.unibo.it
Abstract. In this paper we propose a model-based density estimation method which is rooted in Independent Factor Analysis (IFA). IFA is a generative latent variable model whose structure closely resembles that of an ordinary factor model, but which assumes that the latent variables are mutually independent and distributed according to Gaussian mixtures. From these assumptions it follows that the observed data density can also be modelled as a mixture of Gaussian distributions. The number of free parameters is controlled through the dimension of the latent factor space. The model is shown to be a special case of the mixture of factor analyzers which is less heavily parameterized than the original proposal by McLachlan and Peel (2000). We illustrate the use of IFA density estimation for supervised classification on both real and simulated data.
1 Introduction
Finite mixtures of distributions represent a widely used and flexible approach to model based density estimation (see, for instance, McLachlan and Peel, 2000a). For multivariate continuous data, the preferred solution is based on the use of multivariate normal components, because of their computational convenience. This approach is usually named Gaussian mixture modelling. In this context, the p-dimensional density of a random variable x is modelled as a mixture of m multivariate normal densities in some unknown proportions w1, . . . , wm:

f(x) = ∑_{l=1}^{m} w_l φ(x; μ_l, Σ_l),   (1)
where φ(x, µl , Σ l ) denotes the p-variate normal density function with mean µl and covariance matrix Σ l . Here, the set of unknown parameters consists of the mixing proportions wl , the elements of the component means µl and the distinct elements of the component-covariance matrix Σ l for l = 1, ..., m. It is worth noting that an m-component mixture can be thought of as the density of an heterogeneous population consisting of m groups. For each observed unit an allocation variable, z, may be defined which denotes the identity of the group from which the object is drawn. More precisely, z may be thought of as a multinomial random variable consisting of 1 draw on m
categories with probabilities w1, . . . , wm. If we assume that the vector x is conditionally normally distributed given z, then the unconditional density of x (that is, its marginal density) yields a Gaussian mixture model with m components:

f(x) = ∑_z f(z) f(x|z) = ∑_{l=1}^{m} w_l φ(x; μ_l, Σ_l).   (2)
The Gaussian mixture model (1) can be fitted iteratively to an observed sample by maximum likelihood via the expectation-maximization (EM) algorithm (Dempster et al., 1977). The number of components m can be taken sufficiently large to provide an arbitrarily accurate estimate of the underlying density function. Model (1) with unrestricted component-covariance matrices is a highly parameterized model with a total of m − 1 (the mixing proportions) + m × p (the components of the mean vectors) + m × p(p+1)/2 (the distinct elements of the component-covariance matrices) parameters. As the number of components m in the mixture model increases, the total number of parameters can quickly become very large relative to the sample size n, thus leading to overfitting. With the aim of reducing the number of parameters which must be estimated in order to fit a Gaussian mixture model, Banfield and Raftery (1993) introduced a parameterization of the generic component-covariance matrix Σ_l based on a variant of the standard spectral decomposition of Σ_l, which reaches its simplest structure when the component-covariance matrices are assumed to be spherical. A different approach has been proposed by McLachlan and Peel (2000b), who suggest adopting a mixture of factor analyzers model. In this paper we briefly review McLachlan and Peel's approach and present a new one based on Independent Factor Analysis (Attias, 1999), which still gives a mixture of factor analyzers but involves fewer parameters than McLachlan and Peel's solution. The proposed method is applied to real and simulated data in a supervised classification context.
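As a concrete illustration of model (1), the following sketch fits an unrestricted Gaussian mixture by EM and evaluates the estimated density. scikit-learn and the synthetic data are assumptions of the sketch; the paper does not prescribe a particular implementation.

```python
# Minimal sketch of Gaussian mixture density estimation, model (1), fitted by EM.
# scikit-learn is assumed here only for illustration; the data are synthetic.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# synthetic two-group sample in p = 2 dimensions
X = np.vstack([rng.normal(0.0, 1.0, size=(150, 2)),
               rng.normal(3.0, 1.0, size=(150, 2))])

m = 2  # number of mixture components
gmm = GaussianMixture(n_components=m, covariance_type="full", random_state=0).fit(X)

# estimated mixture density f(x) = sum_l w_l phi(x; mu_l, Sigma_l)
log_density = gmm.score_samples(X[:5])
print("mixing proportions:", np.round(gmm.weights_, 3))
print("log f(x) at the first five points:", np.round(log_density, 3))

# unrestricted parameter count: (m - 1) + m*p + m*p*(p+1)/2
p = X.shape[1]
print("free parameters:", (m - 1) + m * p + m * p * (p + 1) // 2)
```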
2 Mixtures of Factor Analyzers
McLachlan and Peel (2000b) assume that, given a sample of n observations, the distribution of each observation can be modelled, with probability wl (l = 1, . . . , m), according to an ordinary factor analysis model as

x = μ_l + B_l y_l + e_l,   (3)
where y_l is a q-dimensional vector of common latent variables or factors and B_l is a p × q matrix of factor loadings. The y_l are assumed to be distributed as N(0, I_q), independently of the errors e_l, which are distributed as N(0, D_l), where D_l is a diagonal matrix.
Thus, the mixture of factor analyzers model is given by (1) where the lth component-covariance matrix has the form

Σ_l = B_l B_l^T + D_l.   (4)
The set of unknown parameters now consists of the elements of the μ_l (m × p), the B_l (m × (p × q)) and the D_l (m × p), along with the mixing proportions w_l (m − 1). In this way the mixture of factor analyzers provides a way of controlling the number of parameters through the reduced model for the component-covariance matrices, yielding a solution which is intermediate between the independence and unrestricted models. The mixture of factor analyzers model can be fitted by using the alternating expectation-conditional maximization (AECM) algorithm (see McLachlan et al., 2003). A formal test for the number of factors can be performed using the likelihood ratio statistic; however, in situations when n is not large relative to the number of unknown parameters, the BIC criterion might be preferable.
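The following sketch simply assembles component-covariance matrices of the form (4) and compares the resulting parameter count with the unrestricted model; the dimensions used are arbitrary illustrative choices, not values from the paper.

```python
# Sketch of the covariance structure (4) of a mixture of factor analyzers:
# Sigma_l = B_l B_l^T + D_l, with B_l of size p x q and D_l diagonal.
# The dimensions below are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(2)
p, q, m = 10, 3, 4  # observed dimension, factors, mixture components

sigmas = []
for l in range(m):
    B_l = rng.normal(size=(p, q))              # factor loadings of component l
    D_l = np.diag(rng.uniform(0.5, 1.5, p))    # diagonal error covariance
    sigmas.append(B_l @ B_l.T + D_l)           # component covariance, eq. (4)

# parameter counts: mixture of factor analyzers vs. unrestricted Gaussian mixture
npar_mfa = (m - 1) + m * p + m * p * q + m * p
npar_unrestricted = (m - 1) + m * p + m * p * (p + 1) // 2
print("MFA free parameters:        ", npar_mfa)
print("unrestricted GMM parameters:", npar_unrestricted)
```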
3 Independent Factor Analysis
Independent Factor Analysis (IFA) has been originally developed as a latent variable model for solving the problem of blind source separation (Attias, 1999) but it may also be interpreted as an approach to model based density estimation. In effect, Independent Factor Analysis defines a latent variable probabilistic model for the observed multivariate data:

x = Λy + e.   (5)
The mean centered p observed variables x are assumed to arise from a smaller set of q latent factors y, that are mixed together by the matrix Λ. A p-dimensional Gaussian noise term e with zero mean and diagonal covariance matrix Ψ is added in order to account for the intrinsic variability of the observed random vector. The factors are assumed to be mutually statistically independent (and also independent from the error terms) and to have arbitrary distributions. In order to make the model flexible enough to account for arbitrary factor densities, while being analytically tractable, each factor marginal density is modelled by a mixture of m_i univariate Gaussian components:

f(y_i) = ∑_{l=1}^{m_i} w_{il} φ(y_i; μ_{il}, ν_{il}),   (6)
for i = 1, . . . , q, where µil and νil are the mean and the variance of the unidimensional Gaussian components.
As a consequence of the independence condition and of the mixture modelling assumption, in the latent space the factor joint density takes the form

f(y) = ∏_{i=1}^{q} f(y_i) = ∏_{i=1}^{q} ∑_{l=1}^{m_i} w_{il} φ(y_i; μ_{il}, ν_{il}) = ∏_{i=1}^{q} ∑_{z_i} f(z_i) f(y_i|z_i),   (7)

where the last part has been rephrased in terms of the q allocation variables z_i. Let z = [z_1, . . . , z_q]^T be a q-variate allocation variable. Then, from (7),

f(y) = ∑_z [ ∏_{i=1}^{q} f(z_i) ] [ ∏_{i=1}^{q} f(y_i|z_i) ] = ∑_z f(z) f(y|z),   (8)

where f(y|z) = φ(μ_z, V_z) and μ_z and V_z are respectively defined as:

μ_z = ( ∑_{l=1}^{m_1} z_{1,l} μ_{1,l}, . . . , ∑_{l=1}^{m_q} z_{q,l} μ_{q,l} ),   V_z = diag( ∑_{l=1}^{m_1} z_{1,l} ν_{1,l}, . . . , ∑_{l=1}^{m_q} z_{q,l} ν_{q,l} ).
Thus f(y) is but a q-dimensional mixture model whose generic component density is the product of q normal densities, and is therefore normal too. Fitting an IFA model therefore amounts to fitting a Gaussian mixture model in a low-dimensional space; just like model (1), its fit to an observed sample can be performed by maximum likelihood via the EM algorithm. But the aspect which is more interesting from our perspective is that in so doing it also allows us to model the density of the observed variables as a Gaussian mixture. The density of the observed random vector x may be derived by integrating the complete data distribution with respect to the factors and by summing with respect to all the possible states of the allocation vector z. After some calculus, the following expression is obtained:

f(x) = ∑_z ∫ f(x, y, z) dy = ∑_z ∫ f(z) f(x, y|z) dy = ∑_z f(z) f(x|z),   (9)

where f(x|z) = ∫ f(x|y, z) f(y|z) dy is the convolution of the following densities:

f(y|z) = φ(μ_z, V_z)   and   f(x|y, z) = φ(Λy, Ψ),

since e ∼ N(0, Ψ). Since the convolution of two Gaussian densities is still Gaussian,

f(x|z) = φ(x|z; Λμ_z, Λ V_z Λ^T + Ψ),   (10)

and therefore, from (9), f(x) is a Gaussian mixture model with ∏_{i=1}^{q} m_i = m components.
Equation (10) clearly shows that the IFA model yields a mixture of factor analyzers too, where the generic component-covariance matrix may be expressed as B_l B_l^T + Ψ, with B_l = Λ V_z^{1/2}. It is evident from this formulation that such a model gives component-covariance matrices which vary from one component to another but which involve fewer parameters than McLachlan and Peel's mixture of factor analyzers. In fact, assuming a number q of IFA factors equal to the number of factors involved in (3) and the same number m of mixture components for both models, the parameters needed to estimate all the Σ_l, for l = 1, ..., m, in the IFA model are only the p × q factor loadings in Λ, the p diagonal elements of Ψ and the ∑_{i=1}^{q} m_i diagonal elements of the matrices V_z. This is a total of pq + p + ∑_{i=1}^{q} m_i parameters, which can easily be proved to be less than (pq + p)m, the number of parameters involved in McLachlan and Peel's modelling of the whole set of component-covariance matrices. Equation (10) also shows that a further reduction in the number of free parameters concerns the component mean vectors, which are constrained to lie on the q-dimensional subspace spanned by the columns of Λ. Just as in the approach based on the mixture of factor analyzers, the correct IFA model specification in terms of the optimal q can be derived by making use of information criteria.
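A small sketch of the parameter-count comparison just derived, for the covariance part of the two models. The dimensions chosen below are illustrative (they mirror the waveform example of Section 4.1: 21 variables, 2 factors, 2 components per factor), not a prescription.

```python
# Comparison of the number of parameters needed for the component-covariance
# matrices: IFA (pq + p + sum_i m_i) versus McLachlan and Peel's mixture of
# factor analyzers ((pq + p)m). Dimensions are illustrative choices.
def ifa_cov_params(p, q, m_i):
    return p * q + p + sum(m_i)

def mfa_cov_params(p, q, m):
    return (p * q + p) * m

p, q = 21, 2               # e.g. 21 observed variables, 2 factors
m_i = [2, 2]               # 2 Gaussian components per factor
m = 1
for mi in m_i:             # total number of components in the x-space: prod_i m_i
    m *= mi

print("IFA covariance parameters:", ifa_cov_params(p, q, m_i))   # 21*2 + 21 + 4 = 67
print("MFA covariance parameters:", mfa_cov_params(p, q, m))     # (21*2 + 21)*4 = 252
```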
4 Some Results in Supervised Classification
The estimation of an unknown probability density function plays a central role in many applications of multivariate techniques. For instance, in the general classification context the goal is to define a rule for the assignment of a new unit, on which a p-variate vector of variables x has been observed, to the class, out of G unordered ones, from which it comes. Denoting by f_g, with g = 1, . . . , G, the class conditional densities and by π_g the a priori probability of observing an individual from population g, the so-called Bayes decision rule suggests allocating x to the population ĝ such that

ĝ = arg max_{g=1,...,G} {f_g(x) π_g}.   (11)
In most applications neither f_g(x) nor π_g (g = 1, . . . , G) is known. In this context the use of mixture models for density estimation represents a relevant solution, which has recently been receiving increasing attention. In the following, our proposed solution, based on IFA, is employed for the analysis of both simulated and real data sets, and its performance is compared with that of other methods based on mixtures, including the one by McLachlan and Peel.
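The following sketch applies rule (11) with class-conditional densities estimated by Gaussian mixtures. scikit-learn and the generated data are assumptions of the sketch, not part of the authors' implementation.

```python
# Sketch of the Bayes rule (11): estimate each class-conditional density f_g by a
# Gaussian mixture and allocate x to the class maximizing f_g(x) * pi_g.
# scikit-learn and the synthetic data are assumptions of this illustration.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X_train = {0: rng.normal(0.0, 1.0, size=(200, 5)),   # class 0 training sample
           1: rng.normal(1.5, 1.0, size=(200, 5))}   # class 1 training sample

priors = {g: 0.5 for g in X_train}                   # equal a priori probabilities
densities = {g: GaussianMixture(n_components=2, random_state=0).fit(Xg)
             for g, Xg in X_train.items()}

def bayes_allocate(x):
    # log f_g(x) + log pi_g for each class; pick the maximizing class
    scores = {g: densities[g].score_samples(x.reshape(1, -1))[0] + np.log(priors[g])
              for g in densities}
    return max(scores, key=scores.get)

x_new = rng.normal(1.0, 1.0, size=5)
print("allocated to class:", bayes_allocate(x_new))
```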
Technique                             Training        Test
LDA                                   0.121 (.006)    0.191 (.006)
QDA                                   0.039 (.004)    0.205 (.006)
CART                                  0.072 (.003)    0.289 (.004)
FDA/MARS (degree=1)                   0.100 (.006)    0.191 (.006)
FDA/MARS (degree=2)                   0.068 (.004)    0.215 (.002)
MDA (3 subclasses)                    0.087 (.005)    0.169 (.006)
MDA (3 subclasses, penalized 4df)     0.137 (.006)    0.157 (.005)
PDA (penalized 4df)                   0.150 (.005)    0.171 (.005)
Factor analyzers (4 subclasses)       0.129 (.010)    0.187 (.005)
IFA (2 factors)                       0.054 (.010)    0.133 (.004)

Table 1. Results for waveform data: training and test error rates. The values are averages over 10 simulations, with the standard error of the average in parentheses. The first eight entries are taken from Hastie and Tibshirani (1996). The last line indicates the error rates of IFA with 2 components for each factor.
4.1 Simulated Data
The first case study is the popular waveform data. This example has been taken from (Breiman et al., 1984) and subsequently used in many works on classification, since it is considered a difficult pattern recognition problem. It is a three class problem with 21 variables, which are defined by xi = uh1 (i) + (1 − u)h2 (i) + εi xi = uh1 (i) + (1 − u)h3 (i) + εi
Class 1
xi = uh2 (i) + (1 − u)h3 (i) + εi
Class 3
Class 2
where i = 1, . . . , 21, u is uniform on [0,1], the ε_i are standard normal random variables and h_1, h_2 and h_3 are the following shifted triangular forms: h_1(i) = max(6 − |i − 11|, 0), h_2(i) = h_1(i − 4) and h_3(i) = h_1(i + 4). The optimal error rate for this data set is 0.14. The method discussed here is compared with the following classification procedures: linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), mixture discriminant analysis (MDA), flexible discriminant analysis (FDA), penalized discriminant analysis (PDA) and the CART procedure. The training sample consists of 300 observations and the test sample has size 500. Both of them have been generated with equal priors. Table 1 reports the classification results taken from Hastie and Tibshirani (1996) and includes the performances of IFA over 10 simulations. IFA based discriminant analysis shows the lowest classification error rate in the test samples (which is lower than the optimal one only because of sampling error).
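A minimal sketch of the waveform generating process just described; the class sizes used here are arbitrary.

```python
# Sketch of the waveform data generator: 21 variables, three classes, each
# observation a random convex combination of two shifted triangular functions
# plus standard normal noise. Class sizes here are arbitrary.
import numpy as np

def h1(i): return np.maximum(6 - np.abs(i - 11), 0)
def h2(i): return h1(i - 4)
def h3(i): return h1(i + 4)

CLASS_PAIRS = {1: (h1, h2), 2: (h1, h3), 3: (h2, h3)}

def waveform_sample(n_per_class, rng):
    i = np.arange(1, 22)                       # i = 1, ..., 21
    X, y = [], []
    for label, (ha, hb) in CLASS_PAIRS.items():
        u = rng.uniform(0, 1, size=(n_per_class, 1))
        eps = rng.standard_normal((n_per_class, 21))
        X.append(u * ha(i) + (1 - u) * hb(i) + eps)
        y.append(np.full(n_per_class, label))
    return np.vstack(X), np.concatenate(y)

X, y = waveform_sample(100, np.random.default_rng(0))
print(X.shape, np.bincount(y)[1:])             # (300, 21) [100 100 100]
```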
Technique                           Training    Test
LDA                                 0.091       0.083
MDA                                 0.028       0.042
MDA/FDA                             0.049       0.014
FDA                                 0.049       0.042
Neural network (10 hidden units)    0.000       0.027
Factor analyzers (4 subclasses)     0.028       0.069
IFA (2 factors)                     0.056       0.027

Table 2. Results for Thyroid data: training and test error rates. The first five lines are taken from an extended version (technical report) of the paper by Hastie and Tibshirani (1996). The last entry indicates the error rates of IFA with 2 components for each factor.
4.2 Real Data
We also applied the proposed method to the thyroid data (Coomans et al., 1983). The example consists of 5 measurements (T3-resin uptake test, total serum thyroxin, total serum triiodothyronine, basal thyroid-stimulating hormone, and maximal absolute difference of the TSH value after injection of 200 micrograms of thyrotropin-releasing hormone as compared to the basal value) on 215 patients, who are divided into three groups on the basis of their thyroid status (normal, hyper and hypo). The data have been randomly split into a training sample of size 143 and a test sample consisting of the remaining patients. Table 2 shows a summary of the performance of several classification procedures. In order to compare our results with those published in a technical report which represents an extended version of Hastie and Tibshirani (1996), only one split into training and test set has been considered. IFA based discriminant analysis performs very well and is competitive with non-linear methods such as neural networks and the MDA/FDA procedure.
5 Conclusions
Density estimation based on independent factors seems to give very good results, which are comparable to, and sometimes better than, those obtained by using other approaches also based on mixture models but requiring the estimation of more parameters. The most relevant limitation of the procedure is that it does not allow exploring the whole range of possible numbers of mixture components, since the number of estimated components in the x space depends on both the number q of independent factors and the number of components used to model each factor density.
References
ATTIAS, H. (1999): Independent Factor Analysis. Neural Computation, 11, 803–851.
BANFIELD, J.D. and RAFTERY, A.E. (1993): Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803–821.
BREIMAN, L., FRIEDMAN, J., OLSHEN, R., and STONE, C. (1984): Classification and Regression Trees. Wadsworth, Belmont, California.
COOMANS, D., BROECKAERT, M. and BROEACKAERT, D.L. (1983): Comparison of multivariate discriminant techniques for clinical data — application to the thyroid functional state. Meth. Inform. Med., 22, 93–101.
DEMPSTER, A.P., LAIRD, N.M., and RUBIN, D.B. (1977): Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B, 39, 1–38.
HASTIE, T. and TIBSHIRANI, R. (1996): Discriminant Analysis by Gaussian Mixtures. Journal of the Royal Statistical Society B, 58, 155–176.
MCLACHLAN, G.J. and PEEL, D. (2000a): Finite Mixture Models. John Wiley & Sons, New York.
MCLACHLAN, G.J. and PEEL, D. (2000b): Mixtures of Factor Analyzers. In: Langley, P. (Ed.): Proceedings of the Seventeenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, 599–606.
MCLACHLAN, G.J., PEEL, D., and BEAN, R.W. (2003): Modelling high-dimensional data by mixtures of factor analyzers. Computational Statistics and Data Analysis, 41, 379–388.
Identifying Multiple Cluster Structures Through Latent Class Models Giuliano Galimberti and Gabriele Soffritti Dipartimento di Scienze Statistiche, Università di Bologna, 40126 Bologna, Italy Abstract. Many studies addressing the problem of selecting or weighting variables for cluster analysis assume that all the variables define a unique classification of units. However it is also possible that different classifications of units can be obtained from different subsets of variables. In this paper this problem is considered from a model-based perspective. Limitations and drawbacks of standard latent class cluster analysis are highlighted and a new procedure able to overcome these difficulties is proposed. The results obtained from the application of this procedure on simulated and real data sets are presented and discussed.
1 Introduction
Cluster analysis seeks to classify units such that units in the same group are similar with respect to a given set of variables. Therefore, one of the critical issues in cluster analysis is the choice of the relevant variables that are used to describe units. Many studies have addressed this problem (see for example Fowlkes et al. (1988), Milligan and Cooper (1988), Green et al. (1990), Gnanadesikan et al. (1995), Mirkin (1999), Modha and Spangler (2003), Dy and Brodley (2004)); generally they assume that all the variables define only one classification of units (cluster structure). However, as stressed in Gordon (1999), it is important to note that there can be more than one relevant classification of units based on different (but possibly overlapping) subsets of the observed variables. Nowadays technology advances have emphasized this problem, since data collection is becoming easier and faster, resulting in larger, more complex data sets with many units and variables. Some solutions have recently been proposed in the statistical literature (e.g.: Hastie et al. (2000), Vichi (2001), Soffritti (2003), Friedman and Meulman (2004)); they are based on very different approaches and in some cases have been conceived for the analysis of specific types of data. In this paper the problem of identifying multiple cluster structures in a multivariate data set is considered from a model-based perspective. In Section 2 standard latent class models for clustering (Vermunt and Magidson (2002)) are briefly introduced. Their limitations and drawbacks when used to solve this problem are highlighted in Section 3. Section 4 describes a new procedure able to overcome these difficulties. The results obtained from the application of this procedure on simulated and real data sets are presented and discussed in Section 5.
2 Latent Class Models for Clustering
In latent class cluster analysis it is assumed that group membership is a nominal latent (unobservable) variable Y with K classes, and that the probability distribution of a given set of variables X = (X1, . . . , Xp) depends on Y; more precisely, a different probability distribution of the p observed variables is associated to each latent class. The unconditional distribution of the p variables can be expressed as a mixture of the K class-specific distributions:

f(X|θ) = f(X_1, . . . , X_p|θ) = ∑_{k=1}^{K} π_k f_k(X_1, . . . , X_p|θ_k),   (1)
where π_k is the prior probability that a unit belongs to the k-th latent class. The classification of the i-th sample unit can be based on the K posterior class membership probabilities:

π_{k|x_i} = π_k f_k(x_{i1}, . . . , x_{ip}|θ_k) / ∑_{k′=1}^{K} π_{k′} f_{k′}(x_{i1}, . . . , x_{ip}|θ_{k′}),   k = 1, . . . , K,   (2)
where x_i = (x_{i1}, . . . , x_{ip}) are the scores of the p variables on the i-th sample unit. Unknown parameters can be estimated through the maximum likelihood method, and each unit is then assigned to the class with the highest estimated posterior probability. The number of clusters and the form of the conditional distributions can be chosen through model selection techniques; one of the most popular tools is the Bayesian Information Criterion:

BIC_M = 2 log L_M − npar_M log n,   (3)

where log L_M = ∑_{i=1}^{n} log f_M(x_i|θ) is the loglikelihood for model M given a sample of n units, and npar_M is the number of independent parameters to be estimated in that model (Fraley and Raftery (2002a)). This criterion allows one to trade off fit and parsimony of a model: the larger the value of the BIC, the better the model. For further details on this approach to cluster analysis see for example McLachlan and Peel (2000) and Vermunt and Magidson (2002).
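A small sketch of criterion (3) in the "larger is better" form used here, computed for Gaussian mixture models with different numbers of latent classes. scikit-learn and the synthetic data are assumptions of the sketch (the procedure proposed later in the paper was implemented in R with MCLUST).

```python
# Sketch of the BIC criterion (3), BIC_M = 2 log L_M - npar_M log n, in the
# "larger is better" form used in the text, for Gaussian mixtures with a
# varying number of latent classes. scikit-learn and the data are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(4, 1, size=(100, 2))])
n, p = X.shape

for K in range(1, 5):
    model = GaussianMixture(n_components=K, covariance_type="full",
                            random_state=0).fit(X)
    loglik = model.score(X) * n                      # score() returns the mean log-likelihood
    npar = (K - 1) + K * p + K * p * (p + 1) // 2    # proportions + means + covariances
    bic = 2 * loglik - npar * np.log(n)
    print(f"K = {K}: BIC = {bic:.1f}")
```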
3 Multiple Cluster Structures and Latent Class Models
The effect of multiple cluster structures on latent class cluster analysis can be illustrated through the following example. Suppose that a four-dimensional sample of 60 units has been observed. In Fig. 1(a) the scatter plot of the variables X1 and X2 suggests the presence of three clusters. The second scatter plot (Fig. 1(b)) refers to X3 and X4: in this second subspace there are also three clusters. Labelling the units in both scatter plots according to cluster membership in the first subspace, it emerges that the two cluster structures are different.
Fig. 1. Example of two cluster structures: scatter plots of (X1 , X2 ) and (X3 , X4 ).
Table 1 shows the cross classification with respect to them; as the value of χ2, equal to 3, is not significantly greater than 0 (its p-value is 0.5578), the two partitions turn out to be independent. In the framework of latent class models, this situation can be modelled through two independent nominal latent variables: Y1 induces a partition of units with respect to (X1, X2), Y2 defines a second partition of units with respect to (X3, X4). Two independent mixtures of densities can be defined, and the joint distribution of the four variables will be equal to:

f(X|θ) = [ ∑_{l=1}^{3} π_l f_l(X_1, X_2|θ_l) ] [ ∑_{h=1}^{3} π_h f_h(X_3, X_4|θ_h) ],   (4)

which can also be expressed as a single mixture of nine densities:

f(X|θ) = ∑_{l=1}^{3} ∑_{h=1}^{3} π_l π_h f_l(X_1, X_2|θ_l) f_h(X_3, X_4|θ_h) = ∑_{k=1}^{9} π_k f_k(X|θ_k),   (5)

where, for l = 1, 2, 3, h = 1, 2, 3 and k = 1, . . . , 9:

π_k = π_l π_h,
f_k(X_1, . . . , X_4|θ_k) = f_l(X_1, X_2|θ_l) f_h(X_3, X_4|θ_h),   (6)
θ_k = θ_l ∪ θ_h.
Analyzing the whole data set by means of standard latent class methods without imposing restrictions (6), nine clusters of units, one for each component of (5), can be discovered, but from this result it can be very difficult to recover the two independent cluster structures hidden in the data, particularly when there are many variables and many clusters of units. Furthermore, this unrestricted model will be overparametrized.
                        Cluster in (X3, X4)
Cluster in (X1, X2)      1     2     3   Total
        1                4     8     8    20
        2                7     6     7    20
        3                9     6     5    20
      Total             20    20    20    60

Table 1. Example of two cluster structures: cross classification.
In order to identify both cluster structures it is necessary to perform two different analyses, one for each group of variables, or equivalently a joint analysis through a restricted model obtained by imposing (6).
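A small numerical sketch of restrictions (6): the nine-component joint mixture is built from two independent three-component mixtures, so its weights and component densities factorize. The parameter values below are arbitrary illustrative choices, not those of the example above.

```python
# Sketch of restrictions (6): building the 9-component joint mixture from two
# independent 3-component mixtures, one on (X1, X2) and one on (X3, X4).
# All parameter values are arbitrary illustrative choices.
import numpy as np
from scipy.stats import multivariate_normal

w1 = np.array([1/3, 1/3, 1/3])                 # weights of the first structure
mu1 = [np.array([0, 0]), np.array([5, 5]), np.array([10, 10])]
w2 = np.array([1/3, 1/3, 1/3])                 # weights of the second structure
mu2 = [np.array([0, 0]), np.array([6, 0]), np.array([0, 6])]

def joint_density(x):
    """f(X) = sum_l sum_h pi_l pi_h f_l(x1,x2) f_h(x3,x4), eq. (5) under (6)."""
    f12 = sum(w1[l] * multivariate_normal.pdf(x[:2], mean=mu1[l]) for l in range(3))
    f34 = sum(w2[h] * multivariate_normal.pdf(x[2:], mean=mu2[h]) for h in range(3))
    return f12 * f34                           # factorizes because the structures are independent

x = np.array([0.2, -0.1, 6.3, 0.4])
print("joint density at x:", joint_density(x))
```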
4 Identification of Multiple Cluster Structures: A Proposal
As shown in the previous Section, whenever it is known that the p observed variables can be partitioned into g groups such that the corresponding g cluster structures are independent, g independent latent class models can be defined, one for each group of variables, leading to a joint restricted latent class model. However, as in many real situations this information is not known a priori, a specific analysis to detect the possible existence of such groups of variables should be carried out. This goal can be pursued by comparing different latent class models with different sets of restrictions through model selection techniques. The general form of these models is:

f(X|θ) = ∏_{j=1}^{g} [ ∑_{h=1}^{H_j} π_{hj} f_{hj}(X_{(j)}|θ_{hj}) ],   (7)
where g is the number of independent latent variables (1 ≤ g ≤ p), X_{(j)} is the subset of variables which defines the j-th cluster structure (X_{(j)} ∩ X_{(j′)} = ∅ and ∪_{j=1}^{g} X_{(j)} = X), H_j is the number of classes of the j-th latent variable, π_{hj} is the prior probability of the h-th latent class for the j-th latent variable, and f_{hj} is the corresponding class-specific distribution of X_{(j)}. This Section focuses on the use of BIC as a tool to detect multiple cluster structures by choosing the best model within the class defined by (7). Given a sample of n units, for a model M_g of this class it is easy to show that:
g
BICMX(j) ,
(8)
j=1
where BICMX(j) is the BIC of the model associated to the j-th independent cluster structure. If the number g of latent variables and the partition
{X(1) , . . . , X(g) } of the observed variables were known, in order to identify the best model it would be necessary to determine the number of latent classes Hj , the prior probabilities πhj and the class-specific distributions fhj , for each independent latent variable. This could be carried out simply by maximizing each addendum of (8) separately. When the number of latent variables and the partition of the observed variables are not known, it would be necessary to determine the best model for each partition of the p observed variables in g groups (g = 1, . . . , p), and then to select the one with the highest value of BIC. It is evident that such an exhaustive search becomes computationally unfeasible when p is large. To overcome these difficulties the following stepwise procedure is proposed, which considers only a subset of the possible models. It hierarchically aggregates the observed variables, and it is composed by three main steps. 1. The model Mg with g = p independent latent variables is considered; this model assumes that each observed variable defines an independent cluster structure (X(j) = {Xj }, j = 1, . . . , p). Using standard techniques (see for example Section 2) p independent latent class models are selected, and the BICMX(j) of these models are computed. 2. Every pair of groups of variables X(j) and X(j ) (j, j = 1, . . . , g and j < j ) is considered: a new partition of the p variables in g − 1 groups is ∗ obtained by aggregating X(j) and X(j ) and the best model Mg−1 with g − 1 independent latent variables associated to this new partition is selected. This model is compared with the model Mg identified at the previous step through the following quantity: ∗ δj,j = BICMg − BICMg−1
= BICMX(j) + BICMX
(j )
− BICMX(j) ∪X
(j )
(9)
where BICMX(j) ∪X is the BIC of the best model associated to the (j ) independent cluster structure defined by X(j) ∪ X(j ) . 3. All the quantities δj,j (j, j = 1, . . . , g and j < j ) are examined to identify the minimum value. If it is negative, a new partition of the variables is obtained by aggregating the corresponding pair of groups of variables (g is set to g − 1), the best model Mg associated to this new partition is selected, and the procedure returns to step 2; otherwise it stops. The stopping rule allows not only to automatically determine the number of independent cluster structures but also to avoid the analysis of irrelevant aggregations, and hence to control the computational complexity which increases with the number of observed variables. Furthermore, this is a very general procedure, as it is based neither on a specific software nor on a specific algorithm for parameter estimation. In order to apply the procedure to data sets with missing values it is necessary to perform a preliminary missing value imputation, or to exclude units with missing values.
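A compact sketch of the stepwise aggregation just described. Here `best_bic(group)` stands for whatever routine fits and selects the best latent class model on a subset of variables and returns its BIC in the "larger is better" form; it is a placeholder for illustration, not part of the authors' code.

```python
# Sketch of the hierarchical variable-aggregation procedure (steps 1-3).
# best_bic(frozenset_of_columns) is a placeholder for fitting the best latent
# class model on that group of variables and returning its BIC (larger = better).
from itertools import combinations

def identify_structures(columns, best_bic):
    groups = [frozenset([c]) for c in columns]            # step 1: one group per variable
    bic = {g: best_bic(g) for g in groups}
    while len(groups) > 1:
        # step 2: evaluate delta_{j,j'} for every pair of groups, eq. (9)
        deltas = {}
        for gj, gk in combinations(groups, 2):
            merged = gj | gk
            if merged not in bic:
                bic[merged] = best_bic(merged)
            deltas[(gj, gk)] = bic[gj] + bic[gk] - bic[merged]
        # step 3: aggregate the pair with the minimum (negative) delta, else stop
        (gj, gk), d_min = min(deltas.items(), key=lambda kv: kv[1])
        if d_min >= 0:
            break
        groups = [g for g in groups if g not in (gj, gk)] + [gj | gk]
    return groups
```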
5 Experimental Results
This section contains the results obtained on some simulated and real data sets with continuous variables. The proposed procedure has been implemented in R code, resorting to the package MCLUST to select the best latent class model for each group of variables (Fraley and Raftery (2002b)). A Monte Carlo study has been carried out to evaluate the performance of the proposed procedure. The simulated data sets have been generated from mixtures of Gaussian distributions in a four-dimensional space (p = 4) with two independent cluster structures (g = 2): one is defined in the subspace of the variables X1 and X2, the other in the subspace spanned by X3 and X4. Table 2 contains the prior probabilities and the mean vectors of each cluster for both cluster structures; the parameter used to define the mean vectors controls the separation between clusters along each coordinate variable. As can be seen, in the second structure two clusters perfectly overlap when each coordinate variable is considered separately. The variables within each cluster are uncorrelated and homoscedastic, with variance σ2. In order to assess the ability of the proposed procedure to detect the presence of independent cluster structures, the effects of three different factors have been evaluated: the degree of separation between clusters (with levels 2.0, 2.5, 3.0); the degree of heterogeneity within clusters (σ, with levels 1.0, 1.5, 2.0, 2.5, 3.0); the sample size (n, with levels 150, 300, 450). For each combination of these factors, 100 data matrices have been generated and analyzed. Table 3 shows the number of data matrices for which the proposed procedure succeeded in identifying the two groups of variables for each combination of factors. As can be seen, the procedure generally has a good performance. Broadly speaking, it improves as the separation between clusters increases, given the levels of the other factors. Similar conclusions hold when the heterogeneity within clusters decreases, or the sample size increases. In the worst case 34 successes are obtained; this happens when not very separated and highly heterogeneous clusters are considered with the smallest sample size.

The procedure has also been applied to a real data matrix which contains seven indicators for the 103 Italian provinces, published in 2002 by an Italian financial newspaper (www.ilsole24ore.com). Two indicators concern public health (X1: number of hospital beds used for "day hospital" per 100 beds, X2: number of patients hospitalized outside their region per 100 patients), three concern cultural activities (X3: number of books bought per 100 inhabitants, X4: number of cultural and artistic associations per 100,000 inhabitants, X5: number of cinema tickets bought per inhabitant), and two concern sport activities (X6: number of gymnasiums per 100,000 inhabitants, X7: number of national sport club members (CONI) per 1,000 inhabitants). According to the results, these three groups of indicators define three different cluster structures. Table 4 summarizes some information about these structures. The values of some measures of association between each pair of the identified cluster structures (Table 5) confirm that they are independent.
Table 2. Monte Carlo study: parameters used to generate the data. Each cluster structure consists of three equiprobable clusters (π = 1/3); the component mean vectors of each structure are defined in terms of the cluster-separation parameter along its two coordinate variables.
                      n = 150              n = 300              n = 450
σ \ separation    2.0   2.5   3.0      2.0   2.5   3.0      2.0   2.5   3.0
1.0                99   100   100      100   100   100      100   100   100
1.5                92    98    99       94    99   100      100   100   100
2.0                90    98    99       92    94    98       98    99   100
2.5                66    94    98       90    91    98       98    99    99
3.0                34    72    91       74    90    95       98    97    99

Table 3. Monte Carlo study: successes in identifying the correct groups of variables (columns: sample size n and degree of cluster separation; rows: within-cluster standard deviation σ).
j                1           2               3
X(j)             {X1, X2}    {X3, X4, X5}    {X6, X7}
Hj               2           2               2
Cluster sizes    30, 73      32, 71          39, 64

Table 4. Real data: information about the identified cluster structures.
Pairs of structures    χ2       (p-value)    Tschuprow index
X(1) and X(2)          1.182    (0.277)      0.107
X(1) and X(3)          0.369    (0.544)      0.060
X(2) and X(3)          0.150    (0.699)      0.038

Table 5. Real data: association between the identified cluster structures.
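A brief sketch of how the association measures in Table 5 can be computed for two cluster partitions; scipy is an assumption of the sketch and the membership vectors below are random placeholders, not the province data.

```python
# Sketch: chi-square test of independence and Tschuprow's T between two cluster
# partitions. scipy is an assumption; the memberships are random placeholders.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(5)
part_a = rng.integers(0, 2, size=103)          # cluster labels from structure A
part_b = rng.integers(0, 2, size=103)          # cluster labels from structure B

# contingency table of the two partitions
r, c = part_a.max() + 1, part_b.max() + 1
table = np.zeros((r, c), dtype=int)
for a, b in zip(part_a, part_b):
    table[a, b] += 1

chi2, p, dof, _ = chi2_contingency(table, correction=False)
n = table.sum()
tschuprow = np.sqrt(chi2 / (n * np.sqrt((r - 1) * (c - 1))))
print(f"chi2 = {chi2:.3f}, p = {p:.3f}, Tschuprow T = {tschuprow:.3f}")
```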
6 Conclusions and Open Issues
As the results obtained so far on simulated and real data sets show, the proposed procedure seems to represent a useful tool to identify multiple cluster structures in a multivariate data set. However some aspects deserve further investigation. First of all, in order to obtain a wider assessment of its performances, more data sets (both artificial and real) should be analyzed. As the procedure relies on the use of BIC, it could be interesting to consider other model selection criteria and to evaluate to which extent this choice can influence the results. Furthermore, the performances of the procedure in
the presence of nearly independent cluster structures should be studied, since in real situations the hypothesis of independence between cluster structures can be violated. Finally, a comparison with other methods for the identification of multiple cluster structures should be carried out.
References DY, G., BRODLEY, C.E. (2004): Feature Selection for Unsupervised Learning. Journal of Machine Learning Research, 5, 845–889. FOWLKES, E.B., GNANADESIKAN, R., KETTENRING, J.R. (1988): Variable Selection in Clustering. Journal of Classification, 5, 205–228. FRALEY, C. and RAFTERY, A.E. (2002a): Model-Based Clustering, Discriminant Analysis and Density Estimation. Journal of the American Statistical Association, 97, 611–631. FRALEY, C. and RAFTERY, A.E. (2002b): MCLUST: Software for Model-Based Clustering, Density Estimation and Discriminant Analysis. Technical Report No. 415, Department of Statistics, University of Washington. FRIEDMAN, J.H. and MEULMAN, J.J. (2004): Clustering Objects on Subsets of Attributes. Journal of the Royal Statistical Society B, 66, 815–849. GNANADESIKAN, R., KETTENRING, J.R., TSAO, S.L. (1995): Weighting and Selection of Variables for Cluster Analysis. Journal of Classification, 12, 113– 136. GORDON, A.D. (1999): Classification, 2nd Edition. Chapman & Hall, Boca Raton. GREEN, P.E., CARMONE, F.J., KIM, J. (1990): A Preliminary Study of Optimal Variable Weighting in k-means Clustering. Journal of Classification, 7, 271– 285. HASTIE, T., TIBSHIRANI, R., EISEN, M.B., ALIZADEH, A. et al. (2000): Gene Shaving as a Method for Identifying Distinct Sets of Genes with Similar Expression Patterns. Genome Biology, 1, 1–21. MCLACHLAN, G., PEEL, D. (2000): Finite Mixture Models. John Wiley & Sons, Chichester. MILLIGAN, G.W., COOPER, M.C. (1988): A Study of Standardization of Variables in Cluster Analysis. Journal of Classification, 5, 181–204. MIRKIN, B. (1999): Concept Learning and Feature Selection Based on SquareError Clustering. Machine Learning, 35, 25–39. MODHA, D.S., SPANGLER, W.S. (2003): Feature Weighting in k-means Clustering. Machine Learning, 52, 217–237. SOFFRITTI, G. (2003): Identifying Multiple Cluster Structures in a Data Matrix. Communications in Statistics: Simulation and Computation, 32, 1151–1177. VERMUNT, J.K. and MAGIDSON, J. (2002): Latent Class Cluster Analysis. In: J.A. Hagenaars and A.L. McCutcheon (Eds.): Applied Latent Class Analysis. Cambridge University Press, Cambridge, 89–106. VICHI, M. (2001): Double k-means Clustering for Simultaneous Classification of Objects and Variables. In: S. Borra, R. Rocci, M. Vichi and M. Schader (Eds.): Advances in Classification and Data Analysis. Springer-Verlag, Berlin, 43–52.
Gene Selection in Classification Problems via Projections onto a Latent Space Marilena Pillati and Cinzia Viroli Statistics Department, University of Bologna, Italy
Abstract. The analysis of gene expression data involves the observation of a very large number of variables (genes) on a few units (tissues). In such a context the recourse to conventional classification methods may be hard both for analytical and interpretative reasons. In this work a gene selection procedure for classification problems is addressed. The dimensionality reduction is based on the projections of genes along suitable directions obtained by Independent Factor Analysis (IFA). The performances of the proposed procedure are evaluated in the context of both supervised and unsupervised classification problems for different real data sets.
1 Introduction
The numerous statistical questions posed by the analysis of gene expression measurements, as currently determined by microarray technology, include both the problem of distinguishing between cancer classes and the problem of identifying and discovering various subclasses of cancer. These are two distinct classification problems, supervised and unsupervised respectively, that can be addressed by discriminant analysis and clustering techniques. The peculiarity of gene expression data is the very large number of variables (genes) with respect to the number of units (tissues or cells). A reduction in dimensionality is needed not only to allow the employment of standard statistical methods but also for biological interpretation. As many genes turn out not to be relevant to the tumor classification, a natural choice may consist in performing a variable selection to avoid the inclusion of irrelevant or noisy genes. Their presence may exert a negative influence on the overall performance of an estimated classification rule or hide meaningful patterns or structures in the data. There is a vast literature on gene selection for cell classification; a comparative study of several discrimination methods based on filtered sets of genes can be found in Dudoit et al. (2002). Following Calò et al. (2005), we propose an unsupervised multivariate strategy that allows gene interactions to be taken into account, and that may be employed to solve both discriminant and clustering problems. We start from a well known result within the biological community: only a few genes have distinct levels of activity between the conditions of interest (such as cancer and non-cancer, or different types of disorders). Most of the genes
demonstrate a “regular” expression profile and so they are not relevant to class prediction or to recovering subclasses. Therefore, we think that a reasonable criterion for dimension reduction in this context could consist in detecting and selecting the genes showing a behavior across the cells that most differs from that of all the other genes. We start by regarding genes as points in an n-dimensional space. Then, the genes are projected onto a lower dimensional space. In order to highlight the genes showing the greatest expression variability with respect to the tissues, the directions of the projections should exhibit non-gaussian gene expression profiles. We propose to use Independent Factor Analysis (IFA) to identify the latent space. Finally, a ranking of the genes in the independent factor space is derived to detect subsets of relevant genes. Section 2 provides a brief introduction to independent factor analysis. Section 3 describes the proposed gene selection procedure. Finally, some applications to real data sets are illustrated.
2 Independent Factor Analysis (IFA)
The historical background of IFA lies in the signal processing context, as a solution to the so-called “blind source separation” problem (Attias, 1999). This identifies a situation in which a number of signals emitted by some physical sources are observed: the objective is to recover the unobserved sources from their signal mixtures. Despite its origin, IFA can be reinterpreted as a particular latent variable model with independent and non-gaussian factors. In fact, the p observed variables x_j are modelled in terms of a smaller set of k unobserved independent latent variables y_i and an additive specific term u_j:

$$ x_j = \sum_{i=1}^{k} \lambda_{ji} y_i + u_j, \qquad j = 1, \dots, p. \qquad (1) $$

In compact form the IFA model is x = Λy + u, where the random vector u represents the noise, assumed to be normally distributed, u ∼ N(0, Ψ), and the factor loading matrix Λ = {λ_{ji}} is also termed the mixing matrix. The density of the i-th factor is modelled by a mixture of n_i gaussians with means μ_{i,q_i}, variances ν_{i,q_i} and mixing proportions w_{i,q_i} (q_i = 1, . . . , n_i). The parameter estimation problem can be solved quite effectively by the EM algorithm. The dimensionality of the latent structure can be identified according to the so-called information criteria, which are based on penalized forms of the likelihood.
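For reference, the implied marginal density of each latent factor can be written out explicitly; this is simply the Gaussian-mixture assumption stated above in formula form, not an additional modelling choice:

$$ p(y_i) \;=\; \sum_{q_i = 1}^{n_i} w_{i,q_i}\, \mathcal{N}\!\left(y_i \mid \mu_{i,q_i},\, \nu_{i,q_i}\right), \qquad i = 1, \dots, k. $$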
3 Gene Selection in the Independent Factor Space
In order to perform gene selection by IFA, we start by regarding genes as points in the n-dimensional space of the tissues (in doing this, the roles of units and variables are exchanged).
Fig. 1. Latent spaces detected by IFA (left graph: first two estimated factors out of four for the SRBCT data; right graph: latent space of the LK data).
When considering genes as units, the distribution of gene expression levels in the cell space must be taken into account. Empirical evidence shows that these distributions are typically leptokurtic, with heavy tails and a pronounced central peak. This implies that the observed variables (in this new perspective, the expression profiles across cells) have non-gaussian distributions; hence the variance-covariance matrix does not suffice to describe the relations between them, and it is necessary to consider higher-order moments as well. This is the reason why we look for a subspace in which the projections of the genes along the latent directions exhibit non-gaussian expression profiles. In this perspective Calò et al. (2005) suggested the use of independent component analysis (Comon, 1994), but left unsolved the problem of selecting the correct number of independent components, i.e. the dimension of the latent space. Following our proposal, the p n-dimensional gene expression profiles are projected by IFA onto a k-dimensional space, k << p, where k is chosen on the basis of information criteria (AIC, BIC). The most “irregular” genes are then detected by looking at the tails of the distributions of the gene projections along the IFA directions: highly induced or repressed genes should lie in the tails of the distributions of the y_i (i = 1, . . . , k). Genes are ranked in the reduced space according to their maximum absolute scores across the factors, after rescaling the factor scores so that they have the same range. The genes located in the last m positions of this ranking (with m << p) should be used for class prediction or to discover subclasses. Ranking the genes according to the above criterion is equivalent to ordering them on the basis of their distance from the mean vector in the latent space in terms of the Minkowski metric with parameter l → ∞. Therefore, as alternatives, we also consider other distance measures, such as those obtained from the Minkowski metric for l = 1 (Manhattan distance), l = 2 (Euclidean distance) and l = 3 (cubic distance), in order to evaluate the sensitivity of the procedure to the choice of the metric.
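The ranking step just described can be illustrated with a short sketch. This is not the authors' code: the expression matrix X is a random placeholder, scikit-learn's FastICA is used only as a convenient stand-in for a full IFA fit (which would additionally model the Gaussian-mixture factor densities and the noise term), and the values of k and m are arbitrary.

```python
import numpy as np
from sklearn.decomposition import FastICA

# X: expression matrix with n cells as rows and p genes as columns (placeholder data).
rng = np.random.default_rng(0)
X = rng.standard_normal((63, 2308))
k = 4                                      # latent dimension (chosen via information criteria in the paper)

# Treat genes as units: project the p genes onto a k-dimensional non-gaussian latent space.
scores = FastICA(n_components=k, random_state=0).fit_transform(X.T)   # shape (p, k)

# Rescale every factor so that all factors share the same range.
scores = scores / (scores.max(axis=0) - scores.min(axis=0))

def gene_ranking(S, l=np.inf):
    """Order genes from most to least 'irregular' using a Minkowski-type criterion."""
    if np.isinf(l):
        dist = np.abs(S).max(axis=1)       # maximum absolute score, i.e. l -> infinity
    else:
        dist = (np.abs(S) ** l).sum(axis=1) ** (1.0 / l)
    return np.argsort(dist)[::-1]

selected = gene_ranking(scores, l=np.inf)[:20]   # indices of the m = 20 most irregular genes
```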
Fig. 2. Cross-validated misclassification rates for different subsets of genes (first picture: SRBCT data set; second picture: LK data set).

LK data      2226   1347   262    68     37     20     17     10     6      3
IFA          0.042  0.028  0.042  0.056  0.069  0.069  0.069  0.111  0.139  0.139
ICA          0.042  0.083  0.083  0.083  0.125  0.153  0.167  0.264  0.375  0.569
PCA          0.042  0.097  0.139  0.111  0.111  0.125  0.125  0.181  0.306  0.375
SC           0.042  0.042  0.056  0.056  0.056  0.083  0.097  0.194  0.194  0.333

SRBCT data   2308   761    423    310    86     33     20     16     9      2
IFA          0.032  0.032  0.016  0.016  0.048  0.111  0.079  0.111  0.175  0.175
ICA          0.032  0.032  0.032  0.016  0.048  0.063  0.143  0.190  0.333  0.444
PCA          0.032  0.111  0.127  0.190  0.381  0.603  0.603  0.635  0.714  0.667
SC           0.032  0.048  0.032  0.000  0.000  0.000  0.095  0.111  0.333  0.556

Table 1. Supervised classification. Cross-validated misclassification rates for different subsets of genes.
4 Some Applications and Concluding Remarks
The proposed procedure has been applied to some publicly available data sets: the Leukemia (LK) data set of Golub et al. (1999) and the Small Round Blue Cell Tumor (SRBCT) data set of Khan et al. (2001). The first data set contains gene expression levels for p = 2226 genes in 72 cells.
k-means      2226   1347   262    68     37     20     17     10     6      3
IFA          0.431  0.139  0.111  0.111  0.139  0.097  0.097  0.125  0.167  0.208
ICA          0.431  0.139  0.097  0.139  0.083  0.097  0.097  0.125  0.153  0.167
PCA          0.431  0.139  0.139  0.444  0.444  0.458  0.639  0.639  0.639  0.611

Ward method  2226   1347   262    68     37     20     17     10     6      3
IFA          0.167  0.180  0.111  0.083  0.153  0.069  0.069  0.153  0.167  0.250
ICA          0.167  0.125  0.139  0.069  0.083  0.056  0.139  0.153  0.180  0.153
PCA          0.167  0.194  0.056  0.389  0.403  0.444  0.639  0.639  0.639  0.611

Table 2. Unsupervised classification. LK data set: k-means and hierarchical Ward method misclassification rates for different numbers of selected genes (with 2 factors).
It consists of 38 cases of B-cell acute lymphoblastic leukemia, 9 cases of T-cell acute lymphoblastic leukemia and 25 cases of acute myeloid leukemia. The small round blue cell data set contains gene expression levels for p = 2308 genes in 63 cells and consists of 8 cases of Burkitt lymphoma, 23 cases of Ewing sarcoma, 12 cases of neuroblastoma and 20 cases of rhabdomyosarcoma. Independent factor analysis has been performed on the two data sets and, according to the information criteria, 2-dimensional and 4-dimensional latent spaces have been considered for the LK and the SRBCT data sets respectively. The right graph of Figure 1 displays the non-gaussian projections onto the LK latent space. The left graph shows the plot of the genes projected onto the first two estimated factors for the SRBCT data set. For each data set and for different distance measures, a gene ranking is produced. As the following empirical analysis shows, the proposed strategy seems to represent a useful and promising tool for detecting subsets of relevant genes for cell classification based on microarray data. In fact, the detected subsets of genes succeed in capturing the class structure in the data, from both a supervised and an unsupervised perspective.

Supervised classification
A sequence of classification rules is built for 30 different values of the number m of selected genes, ranging from p to 1. The performance of the classification rules based on these subsets of genes is compared with that obtained by principal component analysis and independent component analysis as alternatives in the projection phase. The nearest shrunken centroid (SC) method of Tibshirani et al. (2002) is also evaluated on the same data sets. In order to compare the results of our gene selection procedure with those obtained through the shrunken centroids, the nearest centroid method is used in class prediction, but any other postprocessing classifier could be applied. Given the small number of cells in the two data sets, the classification error rates have been estimated by balanced cross-validation.
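The supervised evaluation loop can be sketched as follows: for each subset size m, keep the m top-ranked genes and estimate the error of a nearest centroid classifier by cross-validation. The arrays X, y and ranking are placeholders, and ordinary stratified 5-fold cross-validation stands in for the balanced cross-validation used by the authors.

```python
import numpy as np
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.standard_normal((63, 2308))        # cells x genes (placeholder)
y = rng.integers(0, 4, size=63)            # class labels (placeholder, 4 tumor types)
ranking = rng.permutation(X.shape[1])      # gene indices ordered by the selection procedure

for m in (2308, 310, 86, 20):
    X_m = X[:, ranking[:m]]                # keep only the m top-ranked genes
    accuracy = cross_val_score(NearestCentroid(), X_m, y, cv=5)
    print(m, round(1.0 - accuracy.mean(), 3))   # cross-validated misclassification rate
```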
k-means      2308   761    423    310    86     33     20     16     9      2
IFA          0.571  0.540  0.540  0.540  0.397  0.397  0.317  0.397  0.571  0.254
ICA          0.571  0.571  0.444  0.540  0.556  0.397  0.492  0.349  0.413  0.508
PCA          0.571  0.571  0.571  0.619  0.581  0.651  0.667  0.698  0.698  0.635

Ward method  2308   761    423    310    86     33     20     16     9      2
IFA          0.429  0.524  0.540  0.524  0.540  0.317  0.286  0.349  0.413  0.254
ICA          0.429  0.571  0.571  0.587  0.524  0.286  0.397  0.397  0.492  0.508
PCA          0.429  0.508  0.540  0.587  0.571  0.635  0.571  0.667  0.698  0.683

Table 3. Unsupervised classification. SRBCT data set: k-means and hierarchical Ward method misclassification rates for different numbers of selected genes (with 4 factors).
Fig. 3. K-means misclassification rates for different numbers of selected genes in the SRBCT and LK data sets respectively.
As the performances do not seem to be influenced by different choices of the metric when defining the gene ranking, we report only the results based on the Euclidean distance. The first graph of figure 2 clearly displays the superiority of the classification rules based on subsets of genes selected in non-gaussian latent spaces. The poor performance of the PCA-based solution confirms the need to take non-linear structures into account as well, particularly for the SRBCT data set. The IFA-based procedure achieves the best performance, and this is particularly evident for small values of m (see also Table 1).
Fig. 4. Unsupervised classification. Hierarchical Ward method misclassification rates for different numbers of selected genes in the SRBCT and LK data sets respectively.
SRBCT data set

m = 2308           Actual
Predicted     BL   EWS   NB   RMS
BL             4     3    0     0
EWS            0     0    0     0
NB             4     6    9     6
RMS            0    14    3    14

m = 20             Actual
Predicted     BL   EWS   NB   RMS
BL             0     0    0     0
EWS            8    22    0     0
NB             0     0    9     8
RMS            0     1    3    12

Table 4. Unsupervised classification. SRBCT data set: confusion matrices for k-means classification with 2308 and 20 selected genes in the IFA latent space.
The IFA-based procedure also outperforms the others in the LK data classification, where it achieves small cross-validated errors with fewer than 20 genes.

Unsupervised classification
In order to check whether the selected subsets of genes are able to recover the clustering structure of the data, we applied two different cluster analysis techniques to the same data sets.
LK data set

m = 2226           Actual
Predicted     ALL B   ALL T   AML
ALL B            16       5     0
ALL T             0       0     0
AML              22       4    25

m = 20             Actual
Predicted     ALL B   ALL T   AML
ALL B            32       0     0
ALL T             6       9     4
AML               0       0    21

Table 5. Unsupervised classification. LK data set: confusion matrices for k-means classification with 2226 and 20 selected genes in the IFA latent space.
As shown in Tables 2 and 3, the use of the whole gene set does not allow the tumor classes to be accurately detected by clustering methods. Only after a selection of relevant genes do the performances improve, which confirms the usefulness of variable selection in unsupervised classification. The IFA-based gene selection allows the classification error to be halved by reducing the number of genes from some thousands to fewer than 20. The confusion matrices for all the genes and for a subset of 20 genes confirm the effectiveness of the proposed gene selection procedure.
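The corresponding unsupervised check can be sketched as below; the arrays are placeholders, and clusters are matched to classes by a simple majority rule, which is only one of several possible ways of turning the confusion matrix into a misclassification rate.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.cluster import contingency_matrix

rng = np.random.default_rng(2)
X_sel = rng.standard_normal((72, 20))      # cells x selected genes (placeholder for the reduced LK data)
y = rng.integers(0, 3, size=72)            # true classes (placeholder, 3 leukemia types)

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_sel)

# Confusion matrix of true classes (rows) against clusters (columns), cf. Tables 4 and 5.
C = contingency_matrix(y, clusters)
# Assign each cluster to its majority class and count the remaining cells as errors.
error_rate = 1.0 - C.max(axis=0).sum() / y.size
print(C, round(error_rate, 3))
```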
References
ATTIAS, H. (1999): Independent Factor Analysis. Neural Computation, 11, 803–851.
CALÒ, D.G., GALIMBERTI, G., PILLATI, M. and VIROLI, C. (2005): Variable selection in classification problems: a strategy based on independent component analysis. In: M. Vichi, P. Monari, S. Mignani and A. Montanari (Eds.): New Developments in Classification and Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization, Springer-Verlag, Berlin, 21–30.
COMON, P. (1994): Independent component analysis, a new concept? Signal Processing, 36, 287–314.
DUDOIT, S., FRIDLYAND, J. and SPEED, T.P. (2002): Comparison of Discrimination Methods for the Classification of Tumors using Gene Expression Data. Journal of the American Statistical Association, 457, 77–87.
GOLUB, T.R., SLONIM, D.K., TAMAYO, P. et al. (1999): Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 286, 531–537.
KHAN, J., WEI, J., RINGNER, M. et al. (2001): Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial Neural Networks. Nature Medicine, 7, 673–679.
TIBSHIRANI, R., HASTIE, T., NARASIMHAN, B. and CHU, G. (2002): Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression. Proceedings of the National Academy of Sciences, 99, 6567–6572.
The Recovery Performance of Two–mode Clustering Methods: Monte Carlo Experiment
Sabine Krolak-Schwerdt1 and Michael Wiedenbeck2
1 Department of Psychology, Saarland University, Germany
2 Centre for Survey Research and Methodology, Mannheim, Germany
Abstract. In this paper, a Monte Carlo study on the performance of two–mode cluster methods is presented. The synthetic data sets were generated to correspond to two types of data consisting of overlapping as well as disjoint clusters. Furthermore, the data sets differed in cluster number, degrees of within-group homogeneity and between-group heterogeneity as well as degree of cluster overlap. We found that the methods performed very differently depending on the type of data, the number of clusters, homogeneity and cluster overlap.
1 Introduction
This paper is concerned with the cluster analysis of two–mode data, which consist of two sets of entities, that is, objects i, i = 1, . . . , n, and attributes j, j = 1, . . . , m. The aim of two–mode techniques is to classify the objects and simultaneously to classify the attributes and, most importantly, to characterize the dependence structure between the objects and their corresponding attributes. A great number of different methods have been introduced in the realm of two–mode clustering (cf. Van Mechelen, Bock and De Boeck (2004) for an overview). Some authors subdivide these methods into three categories (Eckes and Orlik (1993), Schwaiger (1997), Krolak-Schwerdt (2003)). The first category consists of methods which are generalizations of the ADCLUS model proposed by Shepard and Arabie (1979; see DeSarbo (1982)). The basic assumption is the following: the similarity s(i, j) of an object and an attribute is an additive function of weights v_{kk'} associated with those clusters k and k' to which the two entities jointly belong. DeSarbo (1982) developed the first method within this category and Baier, Gaul and Schader (1997) introduced a probabilistic formulation of the model. In general, the input data are required to represent nonsymmetric similarities and to be interval scaled. The clustering solution provides overlapping clusters. The second category contains methods which fit additive or ultrametric tree structures to two–mode data. From the two–way input data, a sort of 'grand matrix' (with objects and attributes as rows as well as columns) is constructed, from which an ultrametric tree structure is estimated. These are
hierarchical methods generating non-overlapping solutions. ESOCLUS developed by Schwaiger and Rix (in press) and the centroid effect method (Eckes and Orlik (1993)) belong to this category. Methods of the third category may be termed ’reordering approaches’. They permute the rows and columns of the data matrix in order to make clusters visible as blocks within the data matrix with objects showing identical values across the attributes. In contrast to the other groups of methods, two–mode profile data are assumed as input which are interpreted as categorical and the algorithms operate directly on the input data without the construction of a grand matrix. Example methods within this category are Hartigan’s (1975) two–way joining, the modal block method (Hartigan (1975)) or GRIDPAT (cf. Krolak-Schwerdt (2003)). Though reordering methods have a very different rationale to construct clusters than the ADCLUS generalizations, most of them can be represented by the ADCLUS model (cf. Van Mechelen et al. (2004)). The Monte Carlo study presented in the following was conducted to compare the ability of the different methods to recover a given clustering within the data.
2 Monte Carlo Experiments: Generation of the Data
Our simulation study comprises two types of data: in Experiment 1 we dealt with non-overlapping clusters and in Experiment 2 we generated data with overlapping clusters. The construction of the data rests upon a basic pattern of blockwise constant data matrices M which are superimposed by random numbers. Such data matrices M can be represented as

$$ M = R V C' + U \qquad (1) $$

where V is of order K × K, R and C are binary matrices of order n × K and m × K respectively, and U is an n × m matrix with constant entries; it represents a joint shift of all the entries of RVC'. This model is similar to the generalized ADCLUS model (cf. DeSarbo (1982)), where nonsymmetric similarity matrices S are modelled as S = PVQ' + T. Here P and Q are binary, too, T is a matrix with constant entries, and V is an unrestricted square matrix. For Experiment 1, data with non-overlapping blocks were generated; that is, the row sums of R and C in (1) were restricted to 1. Then, for the entries of the diagonal blocks, basic constant numbers were chosen in a way explained below. Afterwards, normally distributed random numbers with expectation zero were added. The variances of the random numbers were varied systematically. The main characteristics of the clusters are the diagonal blocks and the sizes of their entries. Therefore, an entry at an off-blockdiagonal position is
given a value which is uniformly less than the diagonal-block entries of the same row as well as of the same column. The separability of clusters defined in this way depends jointly on the size of the entire data matrix, the number of clusters, the distribution of the sizes of the diagonal blocks, and the means and variances of the entries in the diagonal blocks. Throughout all simulated data, the size of the data matrix was bounded from above by 50 rows × 20 columns according to the upper limit of the program with the lowest capacity. Data sets with three, five and eight clusters were generated. Means of the diagonal blocks were always equidistant (for example, in the case of three clusters the means were 2, 4 and 6). Variances σ² varied between clusters (in the case of three clusters, the values were chosen as 2, 3 and 4). For each number of clusters, the "structure" of the clusters was varied by generating a balanced or unbalanced number of entities. Structure 1 is balanced (or nearly so) in the number of objects and attributes per cluster within the main diagonal block. For three clusters, these are 17, 17 and 16 objects and 7, 7 and 6 attributes. Structure 2 is balanced in the number of objects per cluster and unbalanced in the number of attributes. For three clusters, the object numbers are again 17, 17 and 16 and the attribute numbers are 5, 7 and 8. Structure 3 consists of unbalanced object numbers and balanced (or nearly so) attribute numbers. Structure 4 is unbalanced in both the number of objects and the number of attributes per cluster. High numbers of objects were combined with low numbers of attributes, and vice versa. For three clusters, this means that the frequencies are 28, 15 and 7 objects and 3, 7 and 10 attributes. The basic data pattern of Experiment 2 was extended in the following way: again, R and C were chosen to be binary, but now maximally two clusters were allowed to overlap with respect to objects or attributes or both. The entries in the diagonal blocks were again chosen equidistant, but the basic pattern of V was constructed using Toeplitz matrices of the following type:

$$
\begin{pmatrix}
1 & \lambda & \lambda^2 & \lambda^3 & \cdots & \lambda^{K-1} \\
\lambda & 1 & \lambda & \lambda^2 & \cdots & \lambda^{K-2} \\
\lambda^2 & \lambda & 1 & \lambda & \cdots & \lambda^{K-3} \\
\vdots & & & & \ddots & \vdots \\
\lambda^{K-1} & \lambda^{K-2} & \lambda^{K-3} & \cdots & \cdots & 1
\end{pmatrix}
$$
For 0 ≤ λ < 1, each entry of the main diagonal is a maximum of all the entries of the row and column it belongs to. This property is preserved if the entries of row 1 and column 1 are multiplied by D1 ≥ 1, the entries of row/column 2 which do not belong to row/column 1 by D2 ≥ D1, and so forth, until the last entry at (K, K) is multiplied by DK ≥ DK−1. The numbers Dk are cluster-specific constants of the diagonal blocks and were chosen as 20, 30, 40, etc. The resulting matrices were superimposed by normally distributed random numbers with expectation zero.
                   1a) Number of clusters:         1b) Structure of clusters:
Method             3         5         8           1         2         3         4
ESOCLUS            .99(.00)  .90(.03)  .52(.04)    .88(.04)  .84(.04)  .80(.06)  .71(.07)
CEM                .96(.02)  .78(.03)  .57(.04)    .87(.04)  .84(.04)  .78(.04)  .59(.06)
Baier et al.       .54(.04)  .53(.04)  .50(.03)    .61(.04)  .50(.05)  .50(.04)  .47(.05)
GRIDPAT            .90(.02)  .76(.01)  .30(.02)    .67(.05)  .69(.07)  .60(.06)  .65(.07)
two-way joining    .63(.06)  .48(.04)  .94(.03)    .70(.05)  .70(.07)  .71(.07)  .63(.07)

Table 1. Mean values and standard errors (in parentheses) of ARI from Experiment 1 as a function of method and cluster number (1a) and as a function of method and cluster structure (1b).
As to the sizes of the row and column blocks, we limited the data to balanced clusters. As an additional feature, however, large and small overlap was introduced. The degree of overlap depends on the number of clusters. As an example, for three clusters the large overlap of cluster 1 and cluster 2 was 5 objects and 3 attributes, and the overlap of clusters 2 and 3 was 5 objects and 1 attribute. Small overlap was always 3 objects and 1 attribute. The design of our data was then the following: one group of data with large and one group with small overlap; within both groups, again, data with three, five and eight clusters; and for each number of clusters, one Toeplitz matrix with a larger (λ = .6) and one with a smaller (λ = .2) parameter. This leads to two different levels by which the entries of the off-diagonal blocks are dominated by the entries of the cluster blocks themselves. Within both versions, we varied the variance of the superimposed random numbers with expectation zero; the variances were chosen proportional to the values of V with the two baseline variances σ² = 1 and σ² = .5. In sum, the data were generated according to a design with four factors: 1. overlap (large – small), 2. number of clusters (three, five, eight), 3. parameter of the Toeplitz matrix (large – small), 4. size of variance (large – small). For each of the 24 cells we generated three replications.
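As an illustration of this generation scheme, the following minimal NumPy sketch builds one non-overlapping data set of the kind used in Experiment 1 (three clusters, structure 1). It is not the original simulation code: the off-diagonal value of V, the zero shift U and the way the cluster-specific variances are attached to the object clusters are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# 50 objects x 20 attributes, K = 3 clusters with 17/17/16 objects and 7/7/6 attributes (structure 1).
obj_sizes, att_sizes = [17, 17, 16], [7, 7, 6]
K = len(obj_sizes)

# Binary membership matrices with row sums equal to 1 (non-overlapping case).
R = np.repeat(np.eye(K), obj_sizes, axis=0)        # n x K
C = np.repeat(np.eye(K), att_sizes, axis=0)        # m x K

# Equidistant diagonal-block means (2, 4, 6); off-diagonal entries kept uniformly smaller.
V = np.full((K, K), 1.0)
np.fill_diagonal(V, [2.0, 4.0, 6.0])

U = np.zeros((R.shape[0], C.shape[0]))             # joint shift (here zero)
M = R @ V @ C.T + U

# Zero-mean Gaussian noise with cluster-specific variances (2, 3, 4), attached here per object cluster.
noise_sd = np.sqrt(R @ np.array([2.0, 3.0, 4.0]))[:, None]
data = M + rng.standard_normal(M.shape) * noise_sd
```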
3 Data Analysis and Results
Our aim was to include methods from each category of procedures (cf. Section 1) in the Monte Carlo analysis to investigate the relative performance of methods across the categories. In the final analysis the following methods were used: the Baier et al. approach, the centroid effect method (CEM), ESOCLUS, two-way joining and GRIDPAT.
For the GRIDPAT analysis, the data had to be converted into a binary format. In a first step, the data were double centered by removing the row and column means. Then, the resulting data were dichotomized at the median. For Experiment 1, the case of non-overlapping clusters, the Adjusted Rand Index (ARI) of Hubert and Arabie (1985, p. 198) was chosen as an indicator of the solution recovery quality. We performed analysis of variance (ANOVA) on the ARI indices with the number of clusters (3, 5 or 8), type of structure (structure 1, 2, 3 or 4) and type of method (Baier et al. approach, CEM, ESOCLUS, two-way joining or GRIDPAT) as experimental factors. Thus, the entire investigation consisted of 360 separate analyses. For Experiment 2, which involves overlapping clusters, the Omega index of Collins and Dent (1988) was selected. Omega is a generalization of ARI to the case of overlapping clustering solutions and can be applied to situations where both, one, or neither of the solutions being compared is non-disjoint. In the special case of non-overlapping clusters, this index reduces to ARI. Analysis of variance (ANOVA¹) was performed on the Omega indices with the number of clusters (3, 5 or 8), degree of cluster overlap (large or small), variance of the normal distribution (.5 or 1.0), magnitude of the parameter in the Toeplitz matrix (.2 or .6) and type of method (Baier et al. approach, CEM, ESOCLUS, two-way joining or GRIDPAT) as experimental factors. The entire experiment also consisted of 360 separate analyses.
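For the non-overlapping case, the ARI computation itself is routine; the sketch below shows one possible convention (not fixed in the text) in which object and attribute memberships of the true and the recovered two-mode partitions are concatenated and compared jointly with scikit-learn's adjusted_rand_score. All label vectors are illustrative.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

true_obj = np.repeat([0, 1, 2], [17, 17, 16])      # object clusters used to generate the data
true_att = np.repeat([0, 1, 2], [7, 7, 6])         # attribute clusters used to generate the data

est_obj = true_obj.copy()
est_obj[:3] = 1                                    # a slightly imperfect recovered solution
est_att = true_att.copy()

# One convention: score objects and attributes jointly by concatenating the label vectors.
ari = adjusted_rand_score(np.concatenate([true_obj, true_att]),
                          np.concatenate([est_obj, est_att]))
print(round(ari, 3))
```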
3.1 Results for Disjoint Clusters from Experiment 1
We found a significant main negative effect of cluster number, F(2, 180) = 112.20, p < .001, and a main effect of increasing complexity of cluster structure, F(3, 180) = 19.86, p < .001: the more complex the input data structure (in terms of cluster numbers and variations in cluster sizes), the lower the ARI indices that were found across all methods. As to the central aim of the Monte Carlo study, the main effect of methods, F(4, 180) = 56.66, p < .001, is of importance. Across all data sets, ESOCLUS (x̄ = .81) performed best, followed by CEM (x̄ = .81), Hartigan's two-way joining (x̄ = .69) and GRIDPAT (x̄ = .66), with the Baier et al. approach performing poorest (x̄ = .52). However, the main effect of methods was accentuated by two interactions. The first is a strong interaction of cluster number and method, F(8, 180) = 68.44, p < .001. Table 1a shows the mean ARI indices and standard errors as a function of the corresponding factor level combinations. In this as well as in Experiment 2 (see Section 3.2), standard errors were low.
¹ The values of ARI and Omega may not be normally distributed, violating an assumption of analysis of variance. However, in designs in which the observations in the cells are independent and which are orthogonal, the requirement of a normal distribution is rather unimportant (Hays (1991)). Thus, the designs in both experiments allowed for ANOVAs.
ESOCLUS, CEM and GRIDPAT follow the main effect of cluster number on the ARI index outlined above. Hartigan's two-way joining, however, shows the reverse behavior, in that recovery quality is rather low in the three- and five-cluster data sets but dramatically increases in the eight-cluster configurations. Furthermore, it is the only method with nearly perfect recovery in the presence of eight clusters. Finally, the Baier et al. procedure is unaffected by the cluster number, at a comparatively low level of cluster recovery. The second interaction involves cluster structure and method, F(12, 180) = 3.12, p < .001; the corresponding mean values of ARI are shown in Table 1b. From the balanced structure 1 to the most unbalanced structure 4, ESOCLUS, CEM and the Baier et al. approach display a monotonic decrease in performance, while GRIDPAT and Hartigan's two-way joining remain rather constant.
3.2 Results for Overlapping Clusters from Experiment 2
In the data from Experiment 2, a significant main effect of cluster overlap, F(1, 240) = 29.72, p < .001, was found, indicating that the Omega values for clustering solutions with large overlap (x̄ = .84) were lower than for solutions with small overlap (x̄ = .87). Another main effect concerned the standard deviation of the normal distributions the synthetic data were drawn from, F(1, 240) = 28.66, p < .001, where recovery indices were better for small (x̄ = .87) than for large standard deviations (x̄ = .84). Furthermore, we found a significant main effect of methods, F(4, 240) = 27.22, p < .001. In contrast to Experiment 1, the Baier et al. approach (x̄ = .85) and GRIDPAT (x̄ = .86) performed as well as ESOCLUS (x̄ = .86) and CEM (x̄ = .86), while two–way joining (x̄ = .81) performed somewhat more poorly. The significant interaction of cluster number and method, F(8, 240) = 25.25, p < .001, was replicated in Experiment 2. However, the interaction was accentuated by the triple interaction of cluster number, method and degree of cluster overlap, F(8, 240) = 6.61, p < .001. Table 2 shows the mean Omega indices as a function of the corresponding factor level combinations. One group of methods, consisting of ESOCLUS and CEM, exhibits rather constant recovery values across all levels of cluster number. These recovery values are high in the presence of small clustering overlap, but decrease considerably in the data sets involving large overlap. The performance of the Baier et al. approach, Hartigan's two–way joining and GRIDPAT depends on cluster number and cluster overlap. For a smaller number of clusters, these methods exhibit high recovery values in the presence of large cluster overlap, but perform more poorly in the case of small overlap. Furthermore, recovery values decrease with increasing cluster number if cluster overlap is large, while the Omega indices remain rather constant across cluster numbers in data sets with small overlap.
                   Large overlap:                 Small overlap:
Method             3         5         8          3         5         8
ESOCLUS            .80(.01)  .81(.01)  .81(.01)   .88(.03)  .90(.01)  .91(.01)
CEM                .81(.01)  .81(.01)  .82(.01)   .87(.03)  .91(.01)  .92(.01)
Baier et al.       .92(.01)  .85(.00)  .78(.02)   .85(.01)  .84(.03)  .86(.00)
GRIDPAT            .92(.01)  .87(.01)  .82(.02)   .82(.02)  .87(.02)  .87(.10)
two-way joining    .80(.01)  .89(.01)  .76(.01)   .68(.01)  .78(.01)  .79(.01)

Table 2. Mean values and standard errors (in parentheses) of Omega from Experiment 2 as a function of method, cluster number and cluster overlap.
Finally, compared to ESOCLUS and CEM, the recovery of the methods of the second group is poorer for clusters with small overlap. Altogether, then, the Baier et al. approach, two–way joining and GRIDPAT outperform ESOCLUS and CEM in the analysis of data structures with a smaller number of highly overlapping clusters. In the presence of only a small degree of cluster overlap, the results are similar to those of Experiment 1 in that the non–overlapping methods (e.g., ESOCLUS and CEM) are superior to the second group of methods in recovering the true clustering structures.
4 Conclusions
For our simulation study a data model was introduced which may offer the potential for an integration of methods. The proposed model has close relationships to the two–mode formulation of the ADCLUS model. The ADCLUS framework establishes the mathematical link between ADCLUS generalizations and reordering methods (cf. Van Mechelen et al. (2004)), and so does the data model introduced in this paper. Additionally, it allows for a restriction of parameters such that the type of data underlying the non–overlapping hierarchical methods is derived. Thus, the proposed model of two–mode data serves as a common frame of reference in which parameter restrictions generate data with the properties of one or the other group of methods. Future research will show whether the proposed model may serve as a frame of reference to integrate methods from the different categories. The results on the recovery performance of the two–mode methods revisited in this paper suggest that the ability of the methods to recover a given clustering structure strongly depends on the type and complexity of the data structure. Most notably, methods of each category performed best if the input data corresponded to the data structure presumed by the method. That is, for ADCLUS generalizations and reordering methods, we found superior performance in the presence of highly overlapping clusters, while methods estimating ultrametric tree structures performed best in analyzing non–overlapping clusters or clusters with only a small degree of overlap. In Experiment 1, involving non-overlapping clusters, the most extreme cases in recovery performance were the following: in the presence of a low number of clusters with rather homogeneous within-cluster structures, ESOCLUS and the centroid effect method performed superior to all other methods; in contrast, if a comparatively high number of clusters with heterogeneous within-cluster structures is to be expected, Hartigan's two-way joining performed best. In Experiment 2, using overlapping clusters, the Baier et al. approach, two–way joining and GRIDPAT turned out to be superior methods if the data structure consists of a low number of highly overlapping clusters. However, the results from Experiment 1 were replicated in the presence of a small cluster overlap. Consequently, if methods are to be used in a heuristic way due to a lack of a-priori knowledge, the best that can be done is to select one of the hierarchical methods, as they performed best across all conditions. However, as our Monte Carlo results have shown, this might not guarantee an optimal solution in every case.
References
BAIER, D., GAUL, W. and SCHADER, M. (1997): Two–Mode Overlapping Clustering with Applications to Simultaneous Benefit Segmentation and Market Structuring. In: R. Klar and O. Opitz (Eds.): Classification and Knowledge Organization. Springer, Berlin, 557–566.
COLLINS, L.M. and DENT, C.W. (1988): Omega: A general formulation of the Rand index of cluster recovery suitable for non–disjoint solutions. Multivariate Behavioral Research, 23, 231–342.
DESARBO, W.S. (1982): GENNCLUS: New Models for General Nonhierarchical Clustering Analysis. Psychometrika, 47, 449–475.
ECKES, T. and ORLIK, P. (1993): An error variance approach to two-mode hierarchical clustering. Journal of Classification, 10, 51–74.
HARTIGAN, J. (1975): Clustering Algorithms. Wiley, New York.
HUBERT, L. and ARABIE, P. (1985): Comparing partitions. Journal of Classification, 2, 193–218.
KROLAK-SCHWERDT, S. (2003): Two-mode clustering methods: Compare and contrast. In: M. Schader, W. Gaul and M. Vichi (Eds.): Between Data Science and Applied Data Analysis. Springer, Berlin, 270–279.
SCHWAIGER, M. (1997): Two–mode classification in advertising research. In: R. Klar and O. Opitz (Eds.): Classification and Knowledge Organization. Springer, Berlin, 596–603.
VAN MECHELEN, I., BOCK, H. and DE BOECK, P. (2004): Two-mode clustering methods: A structured overview. Statistical Methods in Medical Research, 13, 363–394.
On the Comparability of Reliability Measures: Bifurcation Analysis of Two Measures in the Case of Dichotomous Ratings
Thomas Ostermann1 and Reinhard Schuster2,3
1 Department of Medical Theory and Complementary Medicine, University of Witten/Herdecke, Gerhard-Kienle-Weg 4, 58313 Herdecke, Germany
2 Institute of Mathematics, University of Luebeck, Wallstr. 40, 23560 Luebeck, Germany
3 North German Biometrical Centre, Medical Advisory Board of the Statutory Health Insurance, Katharinenstr. 11a, 23554 Luebeck, Germany
Abstract. The problem of analysing interrater agreement and reliability is known both in human decision making and in machine interaction. Several measures have been developed over the last 100 years for this purpose, with Cohen's Kappa coefficient being the most popular one. Due to methodological considerations, the validity of kappa-type measures of interrater agreement has been discussed in a variety of papers. However, a global comparison of the properties of these measures is still lacking. In our approach, we constructed an integral measure to evaluate the differences between two reliability measures for dichotomous ratings. Additionally, we studied bifurcation properties of the difference of these measures in order to quantify areas of minimal differences. From a methodological point of view, our integral measure can also be used to construct other measures of interrater agreement.
1 Introduction
The problem of analysing interrater agreement and reliability arises both in research on human decision making and in machine interaction. In the first field it occurs, for example, in quantifying the amount of agreement in therapeutic goal attainment between patient and physician (Ostermann et al. (2001)), in the agreement between two or more experts rating the same subject, e.g. in the visual assignment of pedographic examination results to anatomical reference areas of the forefoot (Greiner et al. (1999)), or in the evaluation of interrater reliability in assessment situations such as the selection of medical students (Ostermann et al. (2005)). In engineering science, the agreement between humans and machines is a subject of current research, e.g. in comparing the accuracy of partitionings of image sets (Squire and Pun (1998)). But also in the comparison of different algorithms or computational methods, the evaluation of agreement between results is an important marker. One example is the comparison of results of supervised learning algorithms for word sense disambiguation, which has been carried out by Escudero et al. (2000).
For these purposes, several measures have been developed over the last 100 years, starting with Yule's Gamma (1911), Scott's Pi (1954), Cohen's Kappa (1960), Aickin's Alpha (1990) or Klauer's Lambda (1996), to mention some of them. The most popular of these is Cohen's Kappa coefficient Kc, which has also been used in all of the introductory examples for measuring the amount of agreement. Kappa measures the percentage f_o of observed agreement in the main diagonal of a k × k table and then adjusts this value for the amount of agreement f_e that could be expected due to chance alone:

$$ K_c = \frac{f_o - f_e}{1 - f_e} \qquad (1) $$

with $f_o := \sum_{i=1}^{k} f_{ii}$ and $f_e := \sum_{i=1}^{k} f_{i.} f_{.i}$ (see Tab. 1 for details). Due to methodological considerations, the validity of kappa and kappa-type measures of interrater agreement has been discussed in a variety of papers. In particular, Feinstein and Cicchetti (1990) mentioned the phenomenon of a low Kappa value despite high concordance rates, which mainly appears in situations with strongly asymmetrical marginal distributions. With reference to the paradoxes described by Feinstein and Cicchetti, Mayer et al. (2004) and our working group (Ostermann and Schuster (2005)) showed that the value of Kappa rises when the distribution in the main diagonal is shifted symmetrically while keeping the sum of the diagonal constant. However, a global comparison of the properties of these measures is still lacking. Therefore, we constructed an integral measure to evaluate the differences between measures of interrater agreement and exemplify our approach for the case of dichotomous (binary) ratings.
2 Material and Methods
We assume a situation of a dichotomous attribution (e.g. "yes" or "no") of two raters on one subject, which leads to the cross tabulation given in Tab. 1. To demonstrate our approach, we chose two alternative measures of interrater agreement in addition to the classical Kappa coefficient, namely the two Lambda coefficients L1 and L2 developed by Klauer:

$$ L_1 = \frac{f_o - f_e}{2 f_{1.}(1 - f_{1.})} \qquad (2) $$

$$ L_2 = \frac{f_o - f_e}{(f_{1.} + f_{.1})(f_{2.} + f_{.2})/2} \qquad (3) $$
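A minimal sketch of equations (1)–(3) for a 2×2 table of relative frequencies is given below; the example table with asymmetric marginals is purely illustrative and not taken from the paper.

```python
import numpy as np

def agreement_measures(table):
    """Cohen's Kappa and Klauer's L1, L2 for a 2x2 table, following equations (1)-(3)."""
    f = np.asarray(table, dtype=float)
    f = f / f.sum()                        # relative frequencies
    f_o = np.trace(f)                      # observed agreement on the main diagonal
    row, col = f.sum(axis=1), f.sum(axis=0)
    f_e = float(np.dot(row, col))          # agreement expected by chance
    kappa = (f_o - f_e) / (1.0 - f_e)
    L1 = (f_o - f_e) / (2.0 * row[0] * (1.0 - row[0]))
    L2 = (f_o - f_e) / ((row[0] + col[0]) * (row[1] + col[1]) / 2.0)
    return kappa, L1, L2

# High raw agreement (0.90) with asymmetric marginals: the three measures differ.
print(agreement_measures([[0.85, 0.09],
                          [0.01, 0.05]]))
```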
To evaluate the global difference, the agreement parameter a introduced in Tab. 1 is shifted steadily from 0 to 1. As already shown in Ostermann and Schuster (2005), the symmetrisation

$$ x = \frac{a - u - v}{2}, \qquad y = \frac{1 - a - u + v}{2} $$
                      Rater I
                      YES               NO                   Total
Rater II   YES        f11 = x           f12 = (1 − a) − y    f1. = 1 − a + (x − y)
           NO         f21 = y           f22 = a − x          f2. = a − (x − y)
           Total      f.1 = x + y       f.2 = 1 − (x + y)    1

Table 1. Crosstabulation of the ratings
with (u, v) ∈ G (Fig. 1) transforms the equations of Cohen's Kappa and Klauer's Lambdas given in (1)–(3) into the following terms:

$$ K_c = \frac{1 - 2a + 4uv}{-1 + 4uv} \qquad (4) $$

$$ L_1 = \frac{1 - 2a + 4uv}{-1 + 4v^2} \qquad (5) $$

$$ L_2 = \frac{1 - 2a + 4uv}{-1 + (u + v)^2} \qquad (6) $$
With this transformation, the points P_i in the x-y-plane pass into the following coordinates Q_i of the u-v-plane:

P1(a) := (0, 0) ⟶ (1/2, −1/2 + a) =: Q1(a)
P2(a) := (0, a) ⟶ (1/2 − a, −1/2) =: Q2(a)
P3(a) := (a, a) ⟶ (−1/2, 1/2 − a) =: Q3(a)
P4(a) := (a, 0) ⟶ (−1/2 + a, 1/2) =: Q4(a)

Due to the linearity of the transformation in x and y, the connecting lines from Pi to Pj in the x-y-plane are directly converted into the connecting lines from Qi to Qj (i, j = 1, 2, 3, 4, i ≠ j) in the u-v-plane. The resulting rectangle Q1–Q2–Q3–Q4 is not aligned with the axes of the u-v-coordinate system for 0 < a < 1 (Fig. 1). The envelope for all 0 ≤ a ≤ 1 is the square Q0 with corners (±1/2, ±1/2) in the u-v-plane. The major advantage is the linearity of the transformed measures in Eqs. (4) to (6) with respect to the parameter a, whereas the original equations are non-linear functions of a (quotients of a third-degree polynomial in a and a quadratic polynomial in a). Hence, this greatly simplifies the calculations in the comparison of the two measures of Klauer with Cohen's Kappa. In a first step, we will analyse singularities and bifurcations of the critical points of each measure. Singularities in general are defined as points at which a given mathematical object is not defined.
Fig. 1. Region G of (u, v)
Additionally, we look at bifurcations. Bifurcations are points at which a system's behaviour undergoes a qualitative change; in our case this happens when the agreement measures of Klauer show values identical to Cohen's original Kappa. Finally, with the help of the integral measure

$$ I = \frac{1}{a(1-a)} \int_G \left| K_c(u, v, a) - L_{1,2}(u, v, a) \right| \, du \, dv \qquad (7) $$

we will try to answer the question whether these measures tend to differ in their value of interrater agreement with respect to the percentage a of agreement. Together with the qualitative analysis of the system's behaviour, we can analyse whether, and especially under which conditions, differences between measures of interrater agreement are significant or rather small. The following results were all calculated with Mathematica 4.1 for Windows. Algorithms for numerical integration were adapted from Schuster (1995).
3 Results
3.1 Singularities
In our case, singularities occur under the following conditions:
1. For Kc we have the condition 4uv = 1, which is true if u = v = ±1/2. Only in the practically irrelevant case a = 1 does this solution lie within the domain (and in this case it turns out to be degenerate, namely a line).
2. For L1, singularities are given by the solutions of −1 + 4v² = 0, i.e. v = ±1/2. For every 0 ≤ a ≤ 1 we therefore have exactly two singularities in the domain.
Fig. 2. Bifurcations for the differences of every two measures
3. For L2, singularities are given by u + v = ±1, the solutions of the equation −1 + (u + v)² = 0. This leads directly back to the first case (Kc), with singularities at u = v = ±1/2 and a = 1.
3.2 Bifurcations
In our case, bifurcations occur when two measures show identical values. With the exception of the above-mentioned singularities, all measures, and hence also their differences, are zero if 1 − 2a + 4uv = 0. This is apparently true for the vertices of the domains. With the exception of a = 1/2, every two pairs of vertices in the same quadrant can be connected by a curve which, for convexity reasons, lies in the inner region of the domain. In the border case a = 1/2 the lines −1/2 ≤ u ≤ 1/2, v = 0 and u = 0, −1/2 ≤ v ≤ 1/2 are complete solutions in the domain region. Other possibilities of agreement between the measures exist if the a-independent denominators have identical values. Despite their independence of a, there is a certain restriction imposed by the parameter a through the domain region, which affects whether the corresponding curves have a point of intersection within the domain.
Fig. 3. Integral measure of global difference (numbered from the upper left beginning) between a) Kc and L1 b) Kc and L2 and c)L1 and L2
1. Kc and L1 are equal if −1 + 4uv = −1 + 4v². This is true in the cases v = 0 and u = v.
2. Kc and L2 are equal if −1 + 4uv = −1 + (u + v)². This is equivalent to u = v.
3. L1 and L2 are equal if −1 + 4v² = −1 + (u + v)². Here we have two solutions: u = −3v and u = v.
All measures are at least equal for u = v. A closer investigation leads to the conclusion that there is no local extreme value inside the boundaries of the domains. A plot for selected values of a is given in Fig. 2.
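The agreement of all three measures on the line u = v can be checked numerically from the transformed forms (4)–(6); the value of a and the grid below are arbitrary.

```python
import numpy as np

def Kc(u, v, a): return (1 - 2*a + 4*u*v) / (-1 + 4*u*v)
def L1(u, v, a): return (1 - 2*a + 4*u*v) / (-1 + 4*v**2)
def L2(u, v, a): return (1 - 2*a + 4*u*v) / (-1 + (u + v)**2)

a = 0.7
for u in np.linspace(-0.4, 0.4, 5):
    v = u                                  # on u = v all three denominators coincide
    print(round(Kc(u, v, a), 6), round(L1(u, v, a), 6), round(L2(u, v, a), 6))
```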
3.3 A Global Measure of Difference
Given the parameter a of observed agreement between the raters, it is now possible to determine the difference between the measures while shifting the marginal distribution through u and v. The global difference between the measures is defined by the integral given in (7). Significant differences between Cohen's Kappa and Klauer's Lambdas are located in the boundary regions. Especially for pairs (u, v) with v ≥ 0.8 or v ≤ 0.2 and u = 1 − v, we find extreme values, while the difference is almost zero for u ≈ v. The global difference with respect to a is plotted in Fig. 3.
4 Discussion
We aimed to compare the performance of two different measures of interrater agreement. Our results showed that large differences mainly occur when the amount of agreement is either very high (a > 0.8) or quite low (a < 0.2). In between, the overall difference is negligible. As the measurement of agreement is only of interest for a ≥ 0.5, we can conclude that the main problems for these measures occur in the case of high agreement rates.
This, of course, is in accordance with the findings of other research groups mentioned in the introduction. Nevertheless, our approach makes it possible to quantify the amount of difference not only for a given situation, but for all possible permutations in a cross tabulation with a given percentage of agreement a. This allows us to assess measures of interrater agreement on a more abstract level. Such a global comparison of different measures might be helpful in interpreting unequal values of agreement derived with different measures and may help to decide which measure should be chosen in situations with special requirements. First results have already been calculated by our group. A next step would be a meta-analysis (Rosenthal (1991)) of all these measures in the way suggested by Miller (1999, 2000) and Djoki et al. (2001) in their analysis of combining empirical results of different experiments in software engineering. This would allow the aggregation of empirical findings in a constructive way and might help to interpret situations in which large differences between these measures occur. Such situations should be simulated with a sample set of data consisting of typical situations of agreement or disagreement between judges. Such a meta-analytical approach is currently being worked out in our group.
References
AICKIN, M. (1990): Maximum likelihood estimation of agreement in the constant predictive probability model, and its relation to Cohen's kappa. Biometrics 46(2), 293–302.
COHEN, J. (1960): A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20, 37–46.
DJOKI, S., SUCCI, G., PEDRYCZ, W. and MINTCHEV, M. (2001): Meta Analysis – A Method of Combining Empirical Results and its Application in Object-Oriented Software Systems. In: Y. Wang, S. Patel and R. Johnston (Eds.): Proceedings of the OOIS'01. Springer, Berlin, 103–112.
ESCUDERO, G., MARQUEZ, L. and RIGAU, G. (2000): A Comparison between Supervised Learning Algorithms for Word Sense Disambiguation. Proceedings of CoNLL-2000 and LLL-2000. Lisbon, Portugal, 31–36.
FEINSTEIN, A.R. and CICCHETTI, D.V. (1990): High agreement but low Kappa: I. The problem of two paradoxes. Journal of Clinical Epidemiology 43, 543–549.
GREINER, B., DOHLE, J., SCHULZE, W., OSTERMANN, T. and HAMEL, J. (1999): The visual assignment of pedographic examination results to anatomical reference areas of the forefoot. Foot and Ankle Surgery 5, 219–226.
KLAUER, K.C. (1996): Urteilerübereinstimmung bei dichotomen Kategoriensystemen. Diagnostica 42, 101–118.
MAYER, H., NONN, C., OSTERBRINK, J. and EVERS, G.C. (2004): Qualitätskriterien von Assessment-Instrumenten – Cohen's Kappa als Maß der Interrater-Reliabilität (Teil 1). Pflege 17, 36–46.
MILLER, J. (1999): Can Results from Software Engineering Experiments be Safely Combined? IEEE Metrics 1999, 152–158.
MILLER, J. (2000): Applying meta-analytical procedures to software engineering experiments. Journal of Systems and Software 54, 29–39.
OSTERMANN, T., BEER, A.-M. and MATTHIESSEN, P.F. (2001): Evaluation stationärer naturheilkundlicher Behandlung – Konzeption und erste Ergebnisse des Blankensteiner Modells. Qualitätsmanagement in Klinik und Praxis 9(4), 104–111.
OSTERMANN, T. and SCHUSTER, R. (2005): On the comparability and construction of reliability measures for dichotomous ratings – a unified algebraic approach. Methodology, submitted.
OSTERMANN, T., VERMAASEN, W. and MATTHIESSEN, P.F. (2005): Evaluation des Auswahlverfahrens von Medizinstudenten an der Universität Witten/Herdecke – Teil I: Inter-Rater-Reliabilität des Interviewverfahrens. GMS Z Med Ausbild 22(1): Doc13.
ROSENTHAL, R. (1991): Meta-analytic procedures for social research. Sage, Beverly Hills.
SCHUSTER, R. (1995): Grundkurs Biomathematik. Teubner-Verlag, Stuttgart.
SCOTT, W.A. (1954): Reliability of Content Analysis: The Case of Nominal Scale Coding. Public Opinion Quarterly 19, 321–325.
SQUIRE, D.M. and PUN, T. (1998): Assessing Agreement Between Human and Machine Clusterings of Image Databases. Pattern Recognition, 31(12), 1905–1919.
YULE, G.U. (1911): On the methods of measuring the association between two attributes. Journal of the Royal Statistical Society 75, 579–652.
On Active Learning in Multi-label Classification
Klaus Brinker
Data and Knowledge Engineering, Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, D-39106 Magdeburg, Germany
Abstract. In conventional multiclass classification learning, we seek to induce a prediction function from the domain of input patterns to a mutually exclusive set of class labels. As a straightforward generalization of this category of learning problems, so-called multi-label classification allows for input patterns to be associated with multiple class labels simultaneously. Text categorization is a domain of particular relevance which can be viewed as an instance of this setting. While the process of labeling input patterns for generating training sets already constitutes a major issue in conventional classification learning, it becomes an even more substantial matter of relevance in the more complex multi-label classification setting. We propose a novel active learning strategy for reducing the labeling effort and conduct an experimental study on the well-known Reuters-21578 text categorization benchmark dataset to demonstrate the efficiency of our approach.
1 Introduction
The conventional multiclass classification setting assumes each input pattern to be associated with one specific class label, i.e., the target space consists of a mutually exclusive finite set of class labels. However, many applications require a more flexible setting which allows for input patterns to be associated with multiple class labels simultaneously. We refer to this setting as multi-label classification. Semantic image classification [2] and text categorization [6] form learning problems of particular relevance in practice where each input pattern is potentially associated with multiple class labels, such as building, beach, animal and sports, economy, politics, respectively. Thus, these learning problems can be cast as instances of multi-label classification in a straightforward manner. We are particularly interested in the latter domain of text categorization and consider the well-studied Reuters-21578 text categorization benchmark dataset. Many machine learning algorithms are inherently restricted to binary classification learning. In multiclass classification, one principal means for the generalization of binary algorithms is to construct and combine several binary classifiers using a one-versus-all decomposition scheme [4]. The one-versus-all technique trains a separate classifier for each possible class against the rest of classes with the training examples being binary relabeled in a
suitable manner and predicts class labels according to the maximum output1 among all binary classifiers (MAX-wins). This technique can be generalized to the multi-label setting by a modification of the decomposition and prediction step, where input patterns are submitted as positive examples to all binary problems corresponding to the associated (relevant) class labels and submitted to the remaining problems as negative examples. Moreover, the MAX-wins prediction technique has to be replaced by an ALL-positive procedure. The effort necessary to construct training sets of labeled examples in a supervised machine learning scenario is often disregarded, though in many applications, it is a time-consuming and expensive procedure. While this process already constitutes a major issue in classification learning, it becomes an even more substantial matter of relevance in multi-label learning, which considers the more complex target space of a set of not necessarily mutually exclusive class labels. The superordinate concept of active learning refers to a collection of approaches which aim at reducing the labeling effort in supervised machine learning. We consider the pool-based active learning model, which was originally introduced by [7] in the context of text classification learning. If not noted otherwise, we refer to the pool-based active learning model as active learning herein after to simplify our presentation. In contrast to conventional supervised learning, pool-based active learning considers an extended learning model in which the learning algorithm is granted access to a set of initially unlabeled examples and the algorithm is provided with the ability to determine the order of assigning target objects, i.e., associated subsets of class labels. The essential idea behind active learning is to select promising unlabeled examples with the objective of attaining a high level of accuracy without requesting the complete set of corresponding target objects. In particular, text categorization is a characteristic learning problem which is amenable to the active learning approach [12, 8, 10]: While a relatively cheap source of unlabeled examples is available, acquiring the associated sets of target labels is an expensive procedure. More precisely, large corpora of text documents are readily available in many domains. However, assigning given text documents to target categories to generate labeled training sets is a time-consuming task as it requires human decisions. This general pattern is not only characteristic for text categorization problems, but also arises in many other domains. We propose a novel generalization of pool-based active learning to reduce the labeling effort based on the one-versus-all technique for representing multi-label classifiers. The remainder of this paper is organized as follows: The subsequent section establishes the notational basis and reviews the aforementioned binary decomposition approach to multi-label classification. 1
In the following, we consider real-valued classifiers which are thresholded at zero to make binary predictions {−1, +1}.
In Section 3, we discuss active learning in the context of multi-label classification and propose our novel generalization. Experimental results on the Reuters-21578 text categorization benchmark dataset which demonstrate the efficiency of our approach are discussed in Section 4.
2 Multi-label Classification
Assume we are given a nonempty input space X and a finite set {1, . . . , d} of class labels. Then, the target space Y in multi-label classification is defined as Y := P({1, . . . , d}), where P(A) denotes the power set of a given set A. The fundamental learning task consists in inducing a prediction function f : X → Y based on a given training set of labeled examples

$$ L = \{(x_1, Y_1), \dots, (x_m, Y_m)\} \subset X \times Y. \qquad (1) $$
We consider support vector machines as the binary base learning algorithm as they have demonstrated excellent generalization ability in the domain of text categorization [6]. A common binary decomposition method for solving multi-label problems is to train a separate binary classifier hi : X → {−1, +1} for each of the d target classes against the remaining set of classes [2]. More precisely, all examples (x, Y) ∈ L with i ∈ Y are relabeled as positive examples in the process of training the binary classifier hi, whereas the remaining examples are relabeled as negative examples. Target objects Y for unseen patterns x are predicted according to positive classification of the underlying set of binary classifiers (ALL-positive):
h : X → Y,   (2)
x → argpos_{i=1,...,d} hi(x) = { i ∈ {1, . . . , d} | hi(x) = +1 }.   (3)
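To make the decomposition (2)–(3) and the ALL-positive prediction rule concrete, the following sketch (not taken from the paper) trains the d binary problems with scikit-learn's LinearSVC as a stand-in for the SVM base learner; the library, the helper names, and the choice of C are illustrative assumptions.

```python
# Minimal sketch of the one-versus-all decomposition (2)-(3) with the
# ALL-positive prediction rule.  LinearSVC stands in for the SVM base
# learner; any real-valued binary classifier thresholded at zero would do.
import numpy as np
from sklearn.svm import LinearSVC

def train_one_versus_all(X, Y_sets, d):
    """X: (m, n_features) array; Y_sets: list of label sets; d: number of classes."""
    classifiers = []
    for i in range(d):
        # relabel: +1 if class i is relevant for the example, -1 otherwise
        y_bin = np.array([+1 if i in Y else -1 for Y in Y_sets])
        classifiers.append(LinearSVC(C=10.0).fit(X, y_bin))
    return classifiers

def predict_all_positive(classifiers, x):
    """Return the set of labels whose binary classifier outputs +1 (ALL-positive)."""
    x = np.atleast_2d(x)
    return {i for i, clf in enumerate(classifiers)
            if clf.decision_function(x)[0] > 0}
```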
3
Active Multi-label Learning
As mentioned in the preceding section, we employ support vector machines [13] as the binary base learning algorithm. Support vector machines and, more generally, the class of kernel machines form linear learning algorithms which as a distinctive feature perform an implicit embedding of input patterns in a kernel-induced feature space F. In the following, we will denote the given kernel by k : X × X → R and the corresponding kernel feature map by φ : X → F. Given a weight vector w ∈ F, the corresponding (binary) kernel classifier is defined as
hw : X → {−1, +1},   (4)
x → sign(⟨w, φ(x)⟩_F),   (5)
where sign(t) = +1 for t > 0 and sign(t) = −1 otherwise. Let us assume that we are given a linearly separable (in the feature space) binary classification training set {(x1, y1), . . . , (xm, ym)}. This assumption can be relaxed using a suitable modification of the kernel matrix when using the quadratic-loss [11]. The nonempty set
V = {w ∈ F | hw(xi) = yi for i = 1, . . . , m, ‖w‖_F ≤ 1},   (6)
which consists of weight vectors in the unit hyperball corresponding to linear classifiers in the feature space which are consistent with the training set, is called version space [9]. Learning can be viewed as a search process within the version space: Each labeled example (xi, yi) imposes a constraint on the version space because, to correspond to a consistent classifier, a weight vector has to satisfy sign(⟨w, φ(xi)⟩_F) = yi ⇔ yi ⟨w, φ(xi)⟩_F > 0. In other words, consistent classifiers are restricted to a halfspace whose boundary is the hyperplane with normal vector yi φ(xi). For a fixed feature vector φ(xi), the class label yi determines the orientation of the halfspace. Furthermore, V is the intersection of m halfspaces (a convex polyhedral cone) with the unit hyperball in the feature space F. So far, we considered the conventional batch learning scenario where the completely labeled set of examples is required as a prerequisite for training. Moreover, the version space model provides the basis for active learning strategies which sequentially select the most promising unlabeled examples and then request the corresponding class label. From a theoretical perspective, there exists an appealing connection to the theory of convex sets which provides further insights on appropriate active selection strategies: [5] showed that any halfspace containing the center of mass of a convex set contains a fraction of at least 1/e of the overall volume. Assume we are able to repeatedly select unlabeled examples which correspond to restricting hyperplanes passing exactly through the current center of mass of the version space w_center. Then, independently of the actual class label, the volume of the version space is reduced exponentially in terms of the number of labeled examples. For computational efficiency, the exact center of mass can be approximated by the center of the maximum radius hyperball inscribable in the version space. In the case of normalized feature vectors², this center is given by the weight vector w^(svm) of the support vector machine trained on the labeled set of examples. As a consequence of the finite number of unlabeled examples which in general does not allow to satisfy this criterion, a common approach in pool-based active learning with kernel machines is to select the unlabeled example whose restricting hyperplane is closest to the center of the maximum radius hyperball, i.e., unlabeled examples minimizing |⟨w^(svm), φ(x)⟩_F| [12].
² Normalization can be achieved by a straightforward kernel modification: k^(NORM)(x, x′) = k(x, x′) / √(k(x, x) k(x′, x′)).
For generalizing this selection strategy from binary to multi-label classification, we have to take into account that instead of a single version space the aforementioned one-versus-all decomposition technique yields a set of d version spaces. In the case of label ranking learning, where similar decomposition techniques are required, a best worst-case approach with respect to individual volume reduction was demonstrated to achieve a substantial reduction of the labeling effort [3]. We propose an analogous generalization for multi-label classification in the following.
For a labeled binary example, the (rescaled) margin (1 + y⟨w^(svm), φ(x)⟩_F)/2 can be viewed as a (coarse) measure of the reduction of the version space volume. Indeed, a straightforward derivation reveals that the above-defined selection strategy can be interpreted as measuring the volume reduction for the worst-case class label. For multi-label classification, the notion of worst-case can be generalized to the case of a set of binary classification problems by evaluating the minimum absolute distance min_{i=1,...,d} |⟨w_i^(svm), φ(x)⟩_F| among all binary problems, where w_i^(svm) denotes the weight vector of the support vector machine trained on the one-versus-i subproblem. From a different perspective, we aim at selecting an unlabeled multi-label example which maximizes the (binary) volume reduction with respect to the worst-case set of associated target class labels. Denoting the sets of labeled and unlabeled examples by L and U, respectively, the active selection strategy is formally given by
(U, L) → argmin_{x∈U} min_{i=1,...,d} |⟨w_i^(svm), φ(x)⟩_F|.   (7)
Note that the right-hand side (implicitly) depends on L through the weight vectors w_1^(svm), . . . , w_d^(svm).
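A minimal sketch of selection rule (7) follows, assuming linear kernels so that ⟨w_i^(svm), φ(x)⟩_F is simply the decision value of the i-th binary classifier; for d = 1 it reduces to the standard closest-to-the-hyperplane strategy. The function name and data layout are illustrative, not taken from the paper.

```python
# Sketch of selection rule (7): pick the unlabeled example whose smallest
# absolute decision value over all one-versus-all classifiers is minimal.
# `classifiers` are the binary SVMs of the previous sketch, trained on the
# currently labeled pool L.
import numpy as np

def select_example(classifiers, X_unlabeled):
    """Return the index into X_unlabeled of the example chosen by rule (7)."""
    dec = np.column_stack([clf.decision_function(X_unlabeled)
                           for clf in classifiers])      # shape (n_unlabeled, d)
    worst_case_margin = np.abs(dec).min(axis=1)           # min_i |<w_i, phi(x)>|
    return int(np.argmin(worst_case_margin))
```

In an active learning loop, the label set of the selected example would be queried, the example moved from U to L, the d classifiers retrained, and the selection repeated.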
4
Experiments
The Reuters-21578 newswire benchmark dataset is the currently most widely used test collection for text categorization research.³ Our experiments are based on the standard ModApte split which divides the dataset into 7,769 training and 3,019 test documents. Each document is associated with a subset of the 90 categories present in the dataset. In compliance with related research, the documents were represented using stemmed word frequency vectors with a TFIDF weighting scheme and elimination of common words resulting in roughly 10,000 features. For computational reasons, we restricted our experimental setup to the 10 most frequent categories in the Reuters dataset. Moreover, we used linear kernels with the default choice of C = 10 (and quadratic-loss) as they were demonstrated to provide an excellent basis for accurate classifiers on this dataset [6]. For normalizing the data to unit norm, we employed the aforementioned kernel modification.
³ The Reuters-21578 newswire benchmark dataset is publicly available at http://www.daviddlewis.com/resources/testcollections/reuters21578/.
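The document representation described above can be sketched as follows; scikit-learn's TfidfVectorizer is an assumed implementation choice (the paper names no toolkit), the tiny corpus is a placeholder, and the additional stemming step would require an external stemmer.

```python
# Illustrative sketch of the representation: TF-IDF weighted term vectors
# with common-word removal.  The corpus below is a placeholder only.
from sklearn.feature_extraction.text import TfidfVectorizer

train_documents = ["wheat prices rose sharply", "the central bank cut interest rates"]
test_documents = ["grain exports increased"]

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_documents)
X_test = vectorizer.transform(test_documents)
```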
Fig. 1. Experimental learning curves for the random and active selection strategies on the Reuters-21578 text categorization benchmark dataset. This figure shows average α-evaluation scores (α = 1) and corresponding standard errors of the mean for different numbers of labeled examples.
An initial subsample of 10 multi-label examples was randomly drawn from the training set and submitted to the active learning algorithm. Then, the target objects of the remaining examples were masked out prior to selection and the active learning strategy sequentially selected 190 examples. The accuracy of the multi-label classifiers trained on the currently labeled sets of examples was evaluated every 10 iterations. As the evaluation measure, we used the α-evaluation score proposed by [2]: Denote by Y, Y′ ∈ Y sets of labels. Then the score^(α) is defined as
score^(α)(Y, Y′) = (|Y ∩ Y′| / |Y ∪ Y′|)^α.   (8)
This similarity measure has varying properties depending on the parameter α: For α = ∞, score^(α) evaluates to 1 only in the case of identical sets Y and Y′, whereas for α = 0, it evaluates to 1 except for the case of completely disjoint sets. We considered the intermediate choice of α = 1, which provides a finer scale. Based on this underlying measure, the accuracy of a multi-label classifier h : X → Y was evaluated on the test set T:
accuracy_T(h) = (1/|T|) Σ_{(x,Y)∈T} score^(α)(h(x), Y).   (9)
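A direct transcription of (8) and (9) for label sets represented as Python sets is sketched below; the convention that two empty sets score 1 is an assumption for a degenerate case not discussed above.

```python
# Sketch of the alpha-evaluation score (8) and the induced test-set accuracy (9).
def alpha_score(Y_pred, Y_true, alpha=1.0):
    if not Y_pred and not Y_true:          # both empty: treated as perfect agreement
        return 1.0
    return (len(Y_pred & Y_true) / len(Y_pred | Y_true)) ** alpha

def accuracy(h, test_set, alpha=1.0):
    """test_set: iterable of (x, Y) pairs; h maps x to a predicted label set."""
    scores = [alpha_score(h(x), Y, alpha) for x, Y in test_set]
    return sum(scores) / len(scores)
```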
To compensate for effects based on the random choice of the initially labeled set, we repeated the above-described procedure 30 times and averaged the
results over all runs. In addition to the proposed active selection strategy, we employed random selection of new training examples as a baseline strategy. As depicted in Figure 1, active learning significantly outperforms random selection starting at about 40 selection steps (at least at the 0.05 significance level). This pattern is not only typical for active learning in multi-label classification but also for other categories of learning problems where active learning becomes more effective once the labeled data is sufficient to train an adequate intermediate model.
5
Related Work
In the field of active learning, there are two principal categories of approaches: So-called query learning [1] refers to a learning model where the learning algorithm is given the ability to request true class labels corresponding to examples generated from the entire input domain. In contrast to this, in selective sampling the learner is restricted to request labels associated with examples from a finite set of examples (pool-based model) or the learning algorithm has to decide whether to request the corresponding true labels for sequentially presented single examples (stream-based model). Research in the field of pool-based active learning with kernel machines has mainly focused on binary classification. Beyond this category, multiclass classification [12] and label-ranking [3] are among those categories of learning problems which were demonstrated to benefit substantially from the active learning framework in terms of the number of labeled examples necessary to attain a certain level of accuracy.
6
Conclusion
We introduced a novel generalization of pool-based active learning to the category of multi-label classification problems which is based on the common one-versus-all binary decomposition scheme. From a theoretical perspective, a generalized view of the version space model provides an appealing motivation of our approach. An experimental study on the well-known Reuters-21578 text categorization benchmark dataset demonstrates the efficiency of our approach in terms of the number of labeled examples necessary to attain a certain level of accuracy. Moreover, as it is reasonable to assume that acquiring target objects in multi-label classification learning is more expensive than for less complex domains like binary classification, the benefits of active learning in this context become even more obvious and suggest that it is a promising approach to reducing the cost of learning.
References
1. ANGLUIN, D. (1988). Queries and concept learning. Machine Learning, 2:319–342.
2. BOUTELL, M.R., LUO, J., SHEN, X., and BROWN, C.M. (2004). Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771. 3. BRINKER, K. (2004). Active learning of label ranking functions. In Greiner, R. and Schuurmans, D., editors, Proceedings of the Twenty-First International Conference on Machine Learning (ICML 2004), pages 129–136. 4. CORTES, C., and VAPNIK, V. (1995). Support-vector networks. Machine Learning, 20:273–297. 5. GRÜNBAUM, B. (1960). Partitions of mass-distributions and convex bodies by hyperplanes. Pacific J. Math., 10:1257–1261. 6. JOACHIMS, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Nédellec, C. and Rouveirol, C., editors, Proceedings of the European Conference on Machine Learning (ECML 1998), pages 137–142, Berlin. Springer. 7. LEWIS, D.D., and GALE, W.A. (1994). A sequential algorithm for training text classifiers. In Croft, W.B. and van Rijsbergen, C.J., editors, Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, pages 3–12, Dublin, IE. Springer Verlag, Heidelberg, DE. 8. McCALLUM, A.K., and NIGAM, K. (1998). Employing EM in pool-based active learning for text classification. In Shavlik, J.W., editor, Proceedings of the Fifteenth International Conference on Machine Learning (ICML 1998), pages 350–358, Madison, US. Morgan Kaufmann Publishers, San Francisco, US. 9. MITCHELL, T.M. (1982). Generalization as search. Artificial Intelligence, 18:203–226. 10. ROY, N., and McCALLUM, A. (2001). Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pages 441–448. Morgan Kaufmann, San Francisco, CA. 11. SHAWE-TAYLOR, J., and CRISTIANINI, N. (1999). Further results on the margin distribution. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory (COLT 1999), pages 278–285. ACM Press. 12. TONG, S., and KOLLER, D. (2001). Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45–66. 13. VAPNIK, V. (1998). Statistical Learning Theory. John Wiley, N.Y.
From Ranking to Classification: A Statistical View
Stéphan Clémençon¹,³, Gábor Lugosi², and Nicolas Vayatis³
¹ MODAL'X, Université Paris X, 92001 Nanterre, France
² Departament d'Economia i Empresa, Universitat Pompeu Fabra, 08005 Barcelona, Spain
³ Laboratoire de Probabilités et Modèles Aléatoires, Universités Paris VI et Paris VII, 75013 Paris, France
Abstract. In applications related to information retrieval, the goal is not only to build a classifier for deciding whether a document x among a list X is relevant or not, but to learn a scoring function s : X → R for ranking all possible documents with respect to their relevancy. Here we show how the bipartite ranking problem boils down to binary classification with dependent data when accuracy is measured by the AUC criterion. The natural estimate of the risk being of the form of a U statistic, consistency of methods based on empirical risk minimization is studied using the theory of U -processes. Taking advantage of this specific form, we prove that fast rates of convergence may be achieved under general noise assumptions.
1
Introduction
Numerous practical problems related, for instance, to document retrieval have recently advocated the use of learning algorithms to rank labelled objects (see Freund et al. (2002), Bach et al. (2004)), instead of simply classifying them. In order to take a statistical approach, we model these labelled objects as an i.i.d. sample Dn = {(X1, Y1), ..., (Xn, Yn)}. Each pair (Xi, Yi) is a copy of a pair (X, Y) where X is a random input observation taking its values in a space X and Y is a binary random label in {−1, +1}. In the binary classification problem, the goal is to construct a classifier C : X → {−1, +1}, given the sample Dn, which optimizes the performance measured by the probability of misclassification P(Y ≠ C(X)). In the bipartite ranking problem, we are concerned with building a scoring function s : X → R from the training data Dn, so as to rank the observations x by increasing order of their score s(x): the higher the score s(X) is, the more likely one should observe Y = 1. The accuracy of the ranking induced by s is classically measured by the ROC curve (ROC standing for Receiver Operating Characteristic, see Green and Swets (1966)), which is defined as the plot of the true positive rate P(s(X) ≥ u | Y = 1) against the false positive rate P(s(X) ≥ u | Y = −1), u ∈ R. This accuracy measure induces a partial order on the set S of all scoring functions: for any s1, s2 in S, we shall say that s1 is more accurate than s2 if and only if its ROC curve is above the one of s2 everywhere (namely if the test defined by s1 for
testing the hypothesis that Y = −1 is uniformly more powerful than the one defined by s2). With respect to this criterion, one may straightforwardly show that the optimal ranking is the one induced by increasing transformations of the regression function η(x) = P(Y = 1 | X = x), using Neyman-Pearson's lemma. Given the difficulty of optimizing the ROC curve itself over a class of scoring functions, a simple idea consists in maximizing instead the Area Under the ROC Curve (AUC), which leads to a much more practical criterion that may be classically interpreted in a probabilistic fashion (see Hanley and McNeil (1982)): for any s ∈ S, we have
AUC(s) = P(s(X) ≥ s(X′) | Y = 1, Y′ = −1),   (1)
where the pair (X′, Y′) is an independent copy of (X, Y). Indeed, maximizing the AUC criterion amounts to choosing a scoring function s such that, given two independent input observations X and X′ with labels Y = 1 and Y′ = −1 respectively, the probability that s ranks the instance X′ higher than X is minimum.
2
Reduction to some Classification Problem
In this section we explain how the problem of bipartite ranking can be understood as a classification problem (see also Herbrich et al. (2000) for a similar observation in the context of ordinal regression). From expression (1), one may write, for all s ∈ S,
AUC(s) = 1 − (1/(2p(1 − p))) P((s(X) − s(X′)) · (Y − Y′) < 0),   (2)
where p = P(Y = 1). Thus, maximizing AUC(s) amounts to minimizing
L(s) = P((s(X) − s(X′)) · (Y − Y′) < 0).   (3)
Now this last quantity can be interpreted as a classification error in the following framework: given two independent observations X and X′, predict the ranking label
Z = (Y − Y′)/2 ∈ {−1, 0, +1},   (4)
with a ranking rule of the form r(X, X′) = 2 I{s(X) ≥ s(X′)} − 1 (where I{A} denotes the indicator function of the event A, say I{A} = 1 if A is true and I{A} = 0 otherwise), such that the ranking risk
L(r) = P(Z · r(X, X′) < 0)   (5)
is minimum (where we have written, with a slight abuse of notations, L(s) = L(r)). In this setting, the optimal rules can easily be derived as in the case of binary classification (see Devroye et al. (1996)).
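Identity (2) can be checked numerically. In the following sketch the data-generating mechanism and the scoring function s(x) = x are arbitrary choices made only for illustration; the two printed values agree up to ties and a factor n/(n − 1).

```python
# Numerical illustration of identity (2): the empirical AUC of a score s
# approximately equals 1 - L_n(s) / (2 p (1 - p)).
import numpy as np

rng = np.random.default_rng(0)
n = 2000
y = rng.choice([-1, +1], size=n, p=[0.6, 0.4])
x = rng.normal(loc=(y == 1).astype(float), scale=1.0)   # informative feature
s = x                                                    # scoring function

pos, neg = s[y == 1], s[y == -1]
auc = np.mean(pos[:, None] >= neg[None, :])              # P(s(X) >= s(X') | Y=1, Y'=-1)

diff_s = s[:, None] - s[None, :]                         # misranking rate over all pairs i != j
diff_y = y[:, None] - y[None, :]
mask = ~np.eye(n, dtype=bool)
L = np.mean((diff_s * diff_y < 0)[mask])

p = np.mean(y == 1)
print(auc, 1 - L / (2 * p * (1 - p)))                    # the two values are close
```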
Proposition 1. Set r*(X, X′) = 2 I{η(X) ≥ η(X′)} − 1 and L* = L(r*); we have, for any ranking rule r based on a scoring function s ∈ S,
L* ≤ L(r).   (6)
Moreover, we have the bound: 0 ≤ L* ≤ 1/4.
Sketch of proof. In order to prove the optimality of r*, we observe that, for any s ∈ S,
L(s) − L* = E{ |η(X) − η(X′)| I{(s(X) − s(X′))(η(X) − η(X′)) < 0} }.   (7)
The bound on L* is obtained thanks to the following expression:
L* = Var((Y + 1)/2) − (1/2) E|η(X) − η(X′)|.   (8)
3
Empirical Criterion and U-statistics
In the previous section, we explained how the bipartite ranking problem can be viewed as a three-class classification problem. Now we turn to the construction of scoring functions of low ranking risk based on training data. Suppose that n independent copies of (X, Y) are available and denote by Dn = {(X1, Y1), ..., (Xn, Yn)} the corresponding data set. The natural empirical risk functional to minimize in the bipartite ranking problem is:
L_n(s) = (2/(n(n − 1))) Σ_{i<j} I{(s(Xi) − s(Xj)) · (Yi − Yj) < 0}.   (9)
For a given s ∈ S, this quantity is no longer an empirical average of independent random elements from Dn as in the standard classification setting, but rather an average of pairs of random elements from Dn. This quantity is known in the statistics literature as a U-statistic (we refer to the book by Serfling (1980) for an overview on this topic). Here, for s ∈ S, the U-statistic L_n(s) is a natural (unbiased) estimate of L(s), with symmetric kernel
q_s((x, y), (x′, y′)) = I{(s(x) − s(x′)) · (y − y′) < 0},
for all (x, y) and (x′, y′) in X × {−1, 1}. In the sequel, we will use two different representations of the U-statistic. For simplicity, we will omit the reference to the labels in the expressions below. The first one is the representation as an average of independent blocks:
(1/(n(n − 1))) Σ_{i≠j} q_s(Xi, Xj) = (1/n!) Σ_π (1/⌊n/2⌋) Σ_{i=1}^{⌊n/2⌋} q_s(X_{π(i)}, X_{π(⌊n/2⌋+i)}),   (10)
where the first sum in the right-hand side is taken over all permutations π of {1, . . . , n}. The second representation is a crucial tool for studying the asymptotic properties of U-statistics and it is known as Hoeffding's decomposition.
In order to formulate it in our framework, we first need some notations. For a given scoring function s, set
h_s(x) = E(q_s(x, X)) − L(s),   (11)
and
ĥ_s(x, x′) = q_s(x, x′) − L(s) − h_s(x) − h_s(x′).   (12)
Now, Hoeffding's decomposition of L_n(s) is the orthogonal expansion:
L_n(s) = L(s) + 2T_n(s) + W_n(s),   (13)
where T_n(s) = (1/n) Σ_{i=1}^n h_s(Xi) and W_n(s) = (1/(n(n−1))) Σ_{i≠j} ĥ_s(Xi, Xj). Since W_n(s) is a degenerate U-statistic with variance of order 1/n², the sample mean statistic T_n(s) is thus the dominating term in this decomposition, and the limit distribution of √n (L_n(s) − L(s)) is the normal distribution N(0, 4 Var[E(q_s(X, X′) | X)]). By a standard projection argument, the U-statistic L_n(s) is shown to be the estimator of L(s) with minimal variance among all unbiased estimates. In particular, we have Var[E(q_s(X, X′) | X)] ≤ Var[q_s(X, X′)]. This simple fact suggests that the construction of scoring functions based on the minimization of L_n(s) is preferable to procedures based on minimization of other empirical estimates of the ranking risk, such as the sample mean:
L̃_n(s) = ⌊n/2⌋⁻¹ Σ_{i=1}^{⌊n/2⌋} q_s(Xi, X_{i+⌊n/2⌋}),   (14)
where the initial data set Dn has been split into two halves. Such procedures have been studied in Agarwal et al. (2005). Although this remarkable point plays no role in the study of consistency at the first order developed in the next section, we shall show in Section 5 that the U-statistic structure of L_n(s) provides an essential advantage for establishing fast rates of convergence.
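The variance advantage of L_n(s) over the split-sample estimate (14) can be illustrated by a small Monte Carlo experiment; the simulated model, the sample size and the number of repetitions below are arbitrary choices made only for illustration.

```python
# Monte Carlo check (under an arbitrary simulated model) that the U-statistic
# L_n(s) has smaller variance than the split-sample estimate (14).
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

def q(s_i, y_i, s_j, y_j):
    return float((s_i - s_j) * (y_i - y_j) < 0)

def u_stat(s, y):                       # L_n(s), all pairs
    n = len(y)
    return 2.0 / (n * (n - 1)) * sum(q(s[i], y[i], s[j], y[j])
                                     for i, j in combinations(range(n), 2))

def split_estimate(s, y):               # estimate (14), disjoint pairs
    h = len(y) // 2
    return np.mean([q(s[i], y[i], s[i + h], y[i + h]) for i in range(h)])

def sample(n):
    y = rng.choice([-1, 1], size=n)
    s = (y == 1) + rng.normal(scale=2.0, size=n)   # noisy score
    return s, y

reps = [(u_stat(*sample(40)), split_estimate(*sample(40))) for _ in range(500)]
u_vals, split_vals = zip(*reps)
print(np.var(u_vals), np.var(split_vals))   # the U-statistic variance is typically much smaller
```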
4
Consistency Result
We will now consider the strategy based on the minimization of L_n(s) over a class S′ ⊂ S of scoring functions. This strategy will make sense if we can prove a consistency result in terms of the ranking risk. We denote by
s_n = argmin_{s∈S′} L_n(s),   (15)
the empirical minimizer of the ranking risk over S′. We investigate its performance by evaluating the excess risk:
Λ(s_n) = L(s_n) − inf_{s∈S′} L(s).   (16)
This quantity is classically studied in a first-order approach using the standard bound (see Devroye et al. (1996)):
L(s_n) − inf_{s∈S′} L(s) ≤ 2 sup_{s∈S′} |L_n(s) − L(s)|.   (17)
This bound shows that the asymptotics of the excess risk may be derived from properties of U-processes (we refer to de la Peña and Giné (1999) for a detailed account on the theory of U-processes). Using the first representation of the U-statistic, it is easy to derive the following consistency result.
Theorem 1. Let S′ be a class of scoring functions of VC dimension V. Then there exists some constant c > 0 such that, for any δ > 0, with probability larger than 1 − δ, we have
Λ(s_n) ≤ c √(V/n) + 2 √(log(1/δ)/(n − 1)).
Sketch of proof. We obtain an exponential inequality on X = sup_{s∈S′} |L_n(s) − L(s)| by Chernoff's bounding method: P{X > t} ≤ inf_{λ>0} E exp(λX − λt). Here X is a U-process, but using (10) and the fact that the exponential is convex and non-decreasing, it is easy to show that the Laplace transform 'E exp(λX)' factor can be bounded by a similar expression where 'X' is replaced by an empirical process (meaning a supremum of a sum of ⌊n/2⌋ independent variables). Then, the bounded differences inequality (see McDiarmid (1989)) leads to:
log E exp(λ sup_{s∈S′} |L_n(s) − L(s)|) ≤ λ²/(4(n − 1)) + λ E sup_{s∈S′} | (2/n) Σ_{i=1}^{⌊n/2⌋} I{(Yi − Y_{⌊n/2⌋+i})(s(Xi) − s(X_{⌊n/2⌋+i})) < 0} − L(s) |.
The expected value on the right-hand side may now be bounded by the chaining method (see, e.g., Lugosi (2002)) so that we finally get:
log E exp(λ sup_{s∈S′} |L_n(s) − L(s)|) ≤ λ²/(4(n − 1)) + λ c √(V/n)   (18)
for a universal constant c. Eventually, optimizing in λ gives the result.
Remark. We actually get a rate of convergence, but the dependence structure of the U-statistic has not been exploited at this point. We would have obtained a similar result by using the empirical criterion L̃_n defined in (14). In the next section, we provide a significant improvement which reveals the advantage of the strategy based on L_n.
5
Fast Rates of Convergence
In binary classification, sharp bounds for the excess risk have been proved under specific assumptions on the underlying distribution which are known as margin or noise conditions (see Massart and Nédélec (2003), Tsybakov (2004)). These assumptions typically concern the behavior of the regression function η(x) = P(Y = 1 | X = x) near the boundary {x : η(x) = 1/2}. In this section, we adapt this idea in the framework of ranking and derive a simple sufficient condition for the excess risk to achieve fast rates of convergence. We also exploit the reduced variance property of the empirical ranking risk L_n(s), due to its U-statistic structure. First, we make some additional assumptions in order to keep things simple: (i) the class S′ of scoring functions is finite with cardinality N, (ii) the optimal scoring function s* is in the class S′.
Noise condition. There exist constants c > 0 and α ∈ [0, 1] such that
∀x ∈ X,   E_{X′}(|η(x) − η(X′)|^{−α}) ≤ c.   (19)
We point out that the condition above is not restrictive when α = 0, while for α = 1, it does not allow η to be differentiable, if for instance X is uniformly distributed on [0, 1]. Furthermore, it is noteworthy that, as may be easily checked, the noise condition is satisfied for any α < 1 when the distribution of η(X) is absolutely continuous with a bounded density. The next result claims that under this noise condition a fast rate of convergence (smaller than n^{−1/2}) for the excess ranking risk can be guaranteed.
Proposition 2. Under the noise condition, for every δ ∈ (0, 1) there is a constant C such that the excess ranking risk of the empirical minimizer s_n satisfies
Λ(s_n) = L(s_n) − L* ≤ C (log(N/δ)/n)^{1/(2−α)}.   (20)
We will need the following lemma.
Lemma 1. Under the noise condition, we have, for all s ∈ S′,
Var(H_s(X, Y)) ≤ c Λ(s)^α,
where we have set H_s(x, y) = h_s(x, y) − h_{s*}(x, y) and h_s is as in (11).
Proof of the lemma. We first write that
Var(H_s(X, Y)) ≤ E_X[ (E_{X′}(I{(s(X) − s(X′))(η(X) − η(X′)) < 0}))² ].
Multiplying and dividing by a factor |η(X) − η(X′)|^{−α/2} under the expectation, and by a judicious application of the Cauchy-Schwarz inequality, we
finally get the result thanks to Jensen's inequality and the use of the noise assumption.
Proof of the proposition. For any s ∈ S′, the empirical counterpart of the excess ranking risk Λ(s) = L(s) − L* is
Λ_n(s) = (1/(n(n − 1))) Σ_{i≠j} Q_s(Xi, Xj),
which is a U-statistic of degree 2 with symmetric kernel Q_s = q_s − q_{s*}. Observing that the minimizer s_n of the empirical ranking risk L_n(s) over S′ also minimizes the empirical excess risk Λ_n(s) = L_n(s) − L_n(s*), write the Hoeffding decomposition of Λ_n(s):
Λ_n(s) = Λ(s) + 2T̂_n(s) + Ŵ_n(s),
where T̂_n(s) = T_n(s) − T_n(s*) and Ŵ_n(s) = W_n(s) − W_n(s*). Therefore, by applying a version of Bernstein's inequality for degenerate U-statistics (see Theorem 4.1.12 in de la Peña and Giné (1999)) to Ŵ_n(s), we have, with probability larger than 1 − δ, that
∀s ∈ S′,   |Ŵ_n(s)| ≤ C log(N/δ)/n,
where C is a constant. Using the standard Bernstein's inequality, we get with probability larger than 1 − δ that
∀s ∈ S′,   |T̂_n(s)| ≤ √(2 Var(H_s(X, Y)) log(N/δ)/n) + 2 log(N/δ)/(3n).
There is thus a constant C such that, with probability larger than 1 − δ,
∀s ∈ S′,   Λ(s) ≤ Λ_n(s) + 2 √(2 Var(H_s(X, Y)) log(2N/δ)/n) + C log(2N/δ)/n.   (21)
Now considering the scoring function s_n minimizing L_n(s) over S′, we have Λ_n(s_n) ≤ 0 since s* ∈ S′. Thus, with probability larger than 1 − δ,
Λ(s_n) ≤ 2 √(2 Var(H_{s_n}(X, Y)) log(2N/δ)/n) + C log(2N/δ)/n.   (22)
Lemma 1 shows that the variance factor can be upper bounded by the excess risk; the bound on Λ(s_n) is then established by solving a simple inequality.
Remark. We emphasize that the reduced variance of the U-statistic L_n(s) is used here in a crucial fashion to derive fast rates from the rather weak noise condition. Applying a similar reasoning for a risk estimate like L̃_n as defined in (14) would have led to a very restrictive condition. Indeed, in that case, we would have had to consider the variance of q_s((X, Y), (X′, Y′)), which leads to a noise condition of the form: ∀x ≠ x′, |η(x) − η(x′)| ≥ c, for some constant c > 0. The last statement is satisfied only when the distribution of η(X) is discrete.
6
Concluding Remarks
We have provided a theoretical framework for the bipartite ranking problem in the spirit of statistical learning theory as it has been developed for the binary classification problem. We have highlighted the fact that the empirical criterion is a U -statistic and that consistency results can be achieved, as well as fast rates of convergence under weak assumptions on the distribution. In Cl´emen¸con et al. (2005), we explain how to obtain more general results covering convex risk minimization with massive classes of scoring functions as in boosting or Support Vector Machines. We also show that this framework can be extended to the case of regression data.
References
AGARWAL, S., HAR-PELED, S., and ROTH, D. (2005): A uniform convergence bound for the area under the ROC curve. In: Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, Barbados. BACH, F.R., HECKERMAN, D., and HORVITZ, E. (2004): On the path to an ideal ROC Curve: considering cost asymmetry in learning classifiers. Technical report MSR-TR-2004-24, University of California, Berkeley. CLÉMENÇON, S., LUGOSI, G., and VAYATIS, N. (2005): Ranking and scoring using empirical risk minimization. Preprint. DE LA PEÑA, V. and GINÉ, E. (1999): Decoupling: from dependence to independence. Springer. DEVROYE, L., GYÖRFI, L., and LUGOSI, G. (1996): A Probabilistic Theory of Pattern Recognition. Springer. FREUND, Y., IYER, R., SCHAPIRE, R.E., and SINGER, Y. (2003): An Efficient Boosting Algorithm for Combining Preferences. Journal of Machine Learning Research, 4, 933–969. GREEN, D.M. and SWETS, J.A. (1966): Signal detection theory and psychophysics. Wiley, New York. HANLEY, J.A. and McNEIL, J. (1982): The meaning and use of the area under a ROC curve. Radiology, 143, 29–36. HERBRICH, R., GRAEPEL, T., and OBERMAYER, K. (2000): Large margin rank boundaries for ordinal regression. In: A. Smola, P.L. Bartlett, B. Schölkopf, and D. Schuurmans (Eds.): Advances in Large Margin Classifiers. The MIT Press, 115–132. LUGOSI, G. (2002): Pattern classification and learning theory. In: Györfi, L. (Ed.), Principles of Nonparametric Learning, Springer, Wien, New York, 1–56. MASSART, P. and NÉDÉLEC, E. (2003): Risk bounds for statistical learning. Preprint, Université Paris XI. McDIARMID, C. (1989): On the method of bounded differences. In: Surveys in Combinatorics 1989, Cambridge University Press, 148–188. TSYBAKOV, A. (2004): Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32, 135–166.
Assessing Unidimensionality within PLS Path Modeling Framework
Karin Sahmer, Mohamed Hanafi, and El Mostafa Qannari
Unité de sensométrie et de chimiométrie, ENITIAA / INRA, rue de la Géraudière, BP 82 225, F-44322 Nantes Cedex 03, France
Abstract. In very many applications and, in particular, in PLS path modeling, it is of paramount importance to assess whether a set of variables is unidimensional. For this purpose, different methods are discussed. In addition to methods generally used in PLS path modeling, methods for the determination of the number of components in principal components analysis are considered. Two original methods based on permutation procedures are also proposed. The methods are compared to each other by means of a simulation study.
1
Introduction
In many applications, the practitioners are interested in assessing whether a set of variables is unidimensional or not. For instance, in PLS path modeling, there are two measurement models which relate the manifest variables to their associated latent variables, namely the reflective model and the formative model (Tenenhaus et al. (2005)). It is advocated that the choice of one or the other of these strategies depends on whether a block of variables can be considered as unidimensional or multidimensional. Tenenhaus et al. (2005) propose three tools to assess the unidimensionality of a block of variables: (i) the relative importance of the first two eigenvalues of the correlation matrix, (ii) Cronbach’s alpha coefficient, (iii) Dillon-Goldstein’s coefficient ρ. The present paper investigates this issue in a wider perspective and compares these strategies of analysis to other approaches linked to the problem of determining the appropriate number of components in principal components analysis (PCA). These methods are discussed in sections 2 and 3. In section 4, we undertake a simulation study in order to assess the efficiency of the various techniques of analysis.
2
Review of Some Existing Methods
In the following, we consider a set of p random variables Xi (i = 1, . . . , p) that are measured on n objects. The observed data matrix is denoted by X, and the ith column of X is denoted by xi . The eigenvalues of the population covariance matrix, sample covariance matrix, population correlation matrix and sample correlation matrix are respectively denoted by µk , mk , λk and lk . The first principal component of the data matrix X is denoted by t1 .
In order to assess the unidimensionality of X, Tenenhaus et al. (2005) recommend to perform PCA on X and to compute the coefficients α of Cronbach (1951) and ρ of Dillon and Goldstein (1984), which are respectively given by
α = [ Σ_{i=1}^p Σ_{j≠i} cor(x_i, x_j) / (p + Σ_{i=1}^p Σ_{j≠i} cor(x_i, x_j)) ] × p/(p − 1)
and
ρ = (Σ_{i=1}^p cor(x_i, t_1))² / [ (Σ_{i=1}^p cor(x_i, t_1))² + Σ_{i=1}^p (1 − cor²(x_i, t_1)) ].
A set of variables is considered unidimensional if these coefficients are larger than 0.7. The outcomes of PCA on X advocated by Tenenhaus et al. (2005) are used according to the Kaiser-Guttman Rule (Kaiser (1992)), which consists in assessing X as unidimensional if the first eigenvalue of the correlation matrix is larger than 1 and the second eigenvalue is smaller than 1. Considering that this rule does not take into account the sampling variation, Karlis et al. (2003) proposed to consider an eigenvalue l_i as significantly greater than 1 if
l_i > 1 + 2 √((p − 1)/(n − 1)).
The rationale of these techniques is to consider that if X is unidimensional, then the information contained in X is reflected by the first principal component. In other words, the second and higher principal components should reflect noise only. The rule of the broken stick and Bartlett's test are based on the same rationale. According to the former rule (broken stick), X should be considered as unidimensional if l_1 exceeds b_1 and l_2 is smaller than b_2, where b_k = Σ_{i=k}^p 1/i. This rule is derived from the fact that if a stick of length p is broken at random into p pieces, then the expected length of the k-th longest piece is b_k (Jackson (1991)). The procedure based on Bartlett's test (Bartlett (1950)) consists in two stages. The purpose in the first stage is to ensure that the distribution of (X_1, X_2, . . . , X_p) is not spherical. Thereafter, we test whether the set of variables is unidimensional or not. More formally, this consists in performing two successive hypothesis tests corresponding to k = 0 and k = 1, respectively:
H_0: µ_{k+1} = µ_{k+2} = · · · = µ_p   versus   H_1: not all (p − k) roots are equal.
Obviously, the case k = 0 corresponds to the sphericity test (first stage) whereas k = 1 corresponds to the assessment of the unidimensionality (second stage). The test statistic is −(2(n − 1)/n) ln(LR), where
LR = [ ∏_{i=k+1}^p m_i / ( (1/(p − k)) Σ_{i=k+1}^p m_i )^{p−k} ]^{n/2}.   (1)
It has an asymptotic χ² distribution with ½(p − k − 1)(p − k + 2) degrees of freedom (Jackson (1991)). Another possible approach is to assess the prediction accuracy that can be achieved by a model with one component. A prediction strategy on the basis of a PCA model and using a cross-validation approach is described in Krzanowski and Kline (1995). This approach can be used for the assessment of the unidimensionality as follows. If the prediction with one component is better than the prediction by the mean value of each variable, it is assumed that there is a structure in the data (at least one component). If, in addition, the accuracy of the prediction cannot be improved by introducing the second principal component, then the structure can be considered as unidimensional. An advantage of the methods derived from principal components analysis over the coefficients α and ρ is that they are more informative. Indeed, they make a distinction between noise data (0 components), unidimensional data (1 component) and at least two components, whereas the strategy based on coefficients α and ρ does not give such details if the structure of the data turns out to be non-unidimensional. This advantage can be useful in some situations. For instance, in cluster analysis of variables (see for example Vigneau and Qannari (2003) and the procedure VARCLUS, SAS/STAT (1999)) it is recommended to check beforehand whether there is a structure with more than one component.
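For concreteness, the indices of this section can be computed as in the following sketch (an illustration, not code from the paper); X is an n × p data matrix, t_1 the first principal component, and the eigenvalues are those of the sample correlation matrix.

```python
# Sketch of Cronbach's alpha, Dillon-Goldstein's rho, and the eigenvalues
# used by the Kaiser-Guttman rule, all computed from an (n, p) data matrix X.
import numpy as np

def unidimensionality_indices(X):
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)                 # sample correlation matrix
    off_diag_sum = R.sum() - p                       # sum_i sum_{j != i} cor(x_i, x_j)
    alpha = (off_diag_sum / (p + off_diag_sum)) * (p / (p - 1))

    eigvals, eigvecs = np.linalg.eigh(R)
    Xc = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    t1 = Xc @ eigvecs[:, -1]                         # first principal component
    loadings = np.array([np.corrcoef(X[:, i], t1)[0, 1] for i in range(p)])
    rho = loadings.sum() ** 2 / (loadings.sum() ** 2 + (1 - loadings ** 2).sum())
    return alpha, rho, eigvals[::-1]                 # eigenvalues sorted decreasingly
```

A block would then be declared unidimensional when α and ρ exceed 0.7, or when l_1 is larger than 1 and l_2 smaller than 1 under the Kaiser-Guttman rule.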
3
New Methods
Some of the procedures described above depend on the distribution of the data. For example, Bartlett's test and the approximation of the eigenvalues distribution proposed by Karlis et al. (2003) are based on the assumption of a multinormal setting and concern asymptotic results. In order to overcome these difficulties, we propose to carry out permutation tests. They are based on a permutation procedure which is also proposed by Peres-Neto et al. (2005) for other criteria. It consists in permuting randomly the rows for each column of the observed data matrix X. The resulting permuted matrix X* reflects n realizations of p uncorrelated variables with the same variance as the variables in X. The permutation procedure is repeated B times, resulting in B permuted matrices X*. The permutation test procedure for the Kaiser-Guttman rule considers in a first stage the hypotheses H01 : λ1 = 1 versus H11 : λ1 > 1. The p-value for this test is given by the proportion of the l1* (the largest eigenvalues of the correlation matrices associated with X*) which are equal to or larger than the observed eigenvalue l1. If H01 is accepted, this indicates that the block of variables is without structure, i.e. the variables within the
block are uncorrelated. If H01 is rejected, a second test is performed with H02 : λ2 ≥ 1 versus H12 : λ2 < 1. The p-value is given by the proportion of the l2∗ that are equal to or smaller than the observed l2 . If H02 is rejected, the block of variables is considered as unidimensional, otherwise we decide upon the existence of more than one component. We can also adapt this permutation procedure to the context of Bartlett’s test. At the heart of the test statistic in Bartlett’s test, there is the ratio of the geometric mean to the arithmetic mean of the last p − k (k = 0, 1) eigenvalues of the covariance matrix (see equation 1). We can compute this ratio for the actual data set and for each simulated data set X ∗ . Thereafter, the decision between H0 and H1 can be made on the basis of how many times the simulated ratios are smaller than the actual ratio.
4
Simulation Study
In order to compare the methods, we performed a simulation study based on the structures which we refer to as A, B, C, D, E and F defined as follows: A, noise only: the variables in these data sets do not have any common structure. B, unidimensional data: the variables in these data sets have one common factor. The covariance matrix has the structure Σ = λλ′ + Ψ where λ is the vector of loadings which are comprised between 0.8 and 0.9. Ψ is a diagonal matrix containing the error variances that are comprised between 0.2 and 0.7. More details concerning the error variances are given below. For data sets with more than one factor, we considered four different structures. For all of them, the covariance matrix is given by Σ = ΛΛ′ + Ψ where Λ is the matrix with two or three columns of loadings; each column of loadings being associated with one factor. C, one variable not correlated with the other ones: for all but one of the variables, we have the same structure as in B. One variable has a zero loading on the common factor and a loading of 0.9 on a second factor. D, one common factor and two group factors: there are two distinct groups of variables. All the variables load on a common factor (as in B, but with slightly smaller loadings) and on a group factor (with loadings equal to 0.7). E, two group factors: there are two distinct groups of variables. The loadings
are all equal to 0.9. There is no common factor. F, three group factors: There are three distinct groups of variables. The loadings are all equal to 0.9. There is no common factor. We combined each of the six structures with three different patterns of noise: Unequal error variances between 0.2 and 0.7, equal error variances of 0.3 and equal error variances of 0.6. For the distribution of the factors and the errors, we used the normal distribution. For each combination of structure and noise pattern, we simulated data sets with p = 6, p = 10 and p = 20 variables and n = 20 (for 6 and 10 variables) or n = 22 (for 20 variables), n = 100 and n = 500 individuals. For the data sets with 10 variables and 100 observations, Table 1 gives the percentage of correct decisions for each method. A decision is considered as correct if the method correctly states the structure as noise data, unidimensional or with a higher dimension without specifying the actual number of components. Good results are highlighted using bold characters. This case (p = 10, n = 100) is characteristic of the overall pattern of the other situations considered herein. Important exceptions are outlined when appropriate. Cronbach’s α and Dillon-Goldstein’s ρ clearly distinguish between the hypothesis of no structure in the data and the hypothesis of unidimensionality. However, they fail in situations where there are more than one factor. For the simulation study with p = 6 and p = 20 variables (not shown herein), the two criteria have a tendency to reject unidimensionality either in situations where there was no structure in the data or in presence of small groups of variables. This latter case is also reflected in Table 1 for structure F (10 variables partitioned in three groups). As a conclusion regarding these two criteria, we can state that they never fail in situations where there is no structure in the data but they have a tendency to indicate unidimensionality even in situations where there are more than one factor. This result is in accordance with Hattie’s remark (Hattie (1985)) that ”despite the common use of alpha as an index of unidimensionality, it does not seem to be justified”. The Kaiser-Guttman Rule performs satisfactorily except with noise data where it can not be applied. Indeed, according to this rule, the number of significant components is equal to the number of eigenvalues of the correlation matrix which are larger than one. Obviously, the first eigenvalue of the sample correlation matrix of noise data is larger than one. The improved versions discussed in this paper overcome this problem. The statistically improved Kaiser-Guttman Rule detects in 80% of the cases that there is only noise whereas with the permutation test, this percentage reaches 95%. Nevertheless, they fail to detect the absence of unidimensionality in structure C in which the first eigenvalue of the population correlation matrix is larger than one, the second eigenvalue is equal to one and the other eigenvalues are smaller than one. A possible direction of research is to adjust the test statistic regarding the second eigenvalue taking account of the first eigenvalue. Furthermore, these results indicate that a combination of the permutation test
Structure                                     A       B       C       D       E       F
                                            noise  unidim.          not unidimensional
Cronbach's α                                100.0   100.0     0.0     0.0     0.9    69.0
Dillon-Goldstein's ρ                        100.0   100.0     0.0     0.0     8.5    70.3
Kaiser-Guttman Rule                             –   100.0    91.1   100.0   100.0   100.0
statistically improved Kaiser-Guttman Rule   79.5   100.0     0.0    83.3   100.0   100.0
permutation test Kaiser-Guttman              94.7   100.0     0.0    99.5   100.0   100.0
Broken Stick Model                          100.0   100.0     0.0    35.9    99.9    72.8
Cross validation                            100.0   100.0    12.3   100.0    88.9    18.1
Unequal error variances
Bartlett's test                               0.0     0.0   100.0   100.0   100.0   100.0
Bartlett's test as a permutation test        94.6     0.0   100.0   100.0   100.0   100.0
Equal error variances
Bartlett's test                              93.1    92.0    95.4   100.0   100.0   100.0
Bartlett's test as a permutation test        96.0    79.9    98.3   100.0   100.0   100.0
Table 1. Results of the simulation study for the data sets with 10 variables and 100 observations: percentage of correct decisions.
with the Kaiser-Guttman Rule of thumb would give better results. The permutation test can be used to reject the hypothesis that the data reflect noise only. Thereafter, the Kaiser-Guttman rule can be used to decide between unidimensionality and more than one factor. In Table 1, it can be seen that the performance of the Broken Stick Model and the cross validation procedure depends on the structure of the data. In real situations where the true structure of the data is unknown, we need a method that performs globally well to decide on the unidimensionality. For this reason, we cannot recommend these methods to assess unidimensionality. As stated above, Bartlett’s test was originally designed for settings with equal error variances. In Table 1, we set apart the results of this test for equal and unequal variances. As it can be expected, Bartlett’s test fails when there are unequal error variances. The good performance of the test in situations C to F (more than one component) for the data sets with unequal error variances can be explained by the fact that the test has a tendency to decide upon a structure with more than one component without indicating, however,
the actual number of components. Bartlett’s test based on a permutation procedure seems to correctly identify noise data from data structures with one or more components. For data sets with a structure (not only noise), it performs better in situations with equal error variances than unequal error variances. Being based on asymptotic properties, Bartlett’s test naturally has a bad performance in situations with small samples and many variables. Unfortunately, the permutation procedure does not overcome this pitfall. In conclusion, it is clear that the best methods for all structures considered herein are the Kaiser-Guttman rule and the permutation test version of this rule. The permutation test should be used to assess if the first eigenvalue of the correlation matrix is significantly larger than one. If this is not the case, we can state that the data set does not have any structure. Otherwise, the Kaiser-Guttman rule can be used to decide whether the structure is unidimensional or has more than one factor.
5
Conclusion and Perspectives
We compared several methods to assess unidimensionality by a simulation study. A combination of the Kaiser-Guttman rule with a permutation test procedure emerges as having the best performance. The comparison was based on normally distributed data and should be extended to other distributions. Since the Kaiser-Guttman rule is not based on assumptions regarding the distribution we believe that the conclusions will still hold.
References BARTLETT, M.S. (1950): Tests of Significance in Factor Analysis. British Journal of Psychology (Statistical Section), 3, 77–85. CRONBACH, L.J. (1951): Coefficient Alpha and the Internal Structure of Tests. Psychometrika, 16, 297–334. DILLON, W.R., and GOLDSTEIN, M. (1984): Multivariate Analysis. Methods and Applications. John Wiley and Sons, New York. HATTIE, J. (1985): Methodology Review: Assessing Unidimensionality of Tests and Items. Applied Psychological Measurement, 9, 139–164. JACKSON, J.E. (1991): A User’s Guide to Principal Components. John Wiley and Sons, New York. KAISER, H.F. (1992): On Cliff’s Formula, the Kaiser-Guttman Rule, and the Number of Factors. Perceptual and Motor Skills, 74, 595–598. KARLIS, D., SAPORTA, G., and SPINAKIS, A. (2003): A simple Rule for the Selection of Principal Components. Communications in Statistics. Theory and Methods, 32, 643–666. KRZANOWSKI, W.J., and KLINE, P. (1995): Cross-Validation for Choosing the Number of Important Components in Principal Component Analysis. Multivariate Behavioral Research, 30, 149–165.
PERES-NETO, P.R., JACKSON, D.A., and SOMERS, K.M. (2005): How many principal components? stopping rules for determining the number of non-trivial axes revisited. Computational Statistics and Data Analysis, 49, 974–997. SAS/STAT (1999): User’s guide, Version 8, SAS Institute Inc.: Cary, North Carolina. TENENHAUS, M., VINZI, V.E., CHATELIN, Y.-M. and LAURO, C. (2005): PLS path modeling. Computational Statistics and Data Analysis, 48, 159–205. VIGNEAU, E., and QANNARI, E.M. (2003): Clustering of Variables around Latent Components. Communications in Statistics – Simulation and Computation, 32, 1131–1150.
The Partial Robust M-approach
Sven Serneels¹, Christophe Croux², Peter Filzmoser³, and Pierre J. Van Espen¹
¹ Department of Chemistry, University of Antwerp, 2610 Antwerp, Belgium
² Department of Applied Economics, KULeuven, 3000 Leuven, Belgium
³ Department of Statistics and Probability Theory, Technical University of Vienna, 1040 Wien, Austria
Abstract. The PLS approach is a widely used technique to estimate path models relating various blocks of variables measured from the same population. It is frequently applied in the social sciences and in economics. In this type of applications, deviations from normality and outliers may occur, leading to an efficiency loss or even biased results. In the current paper, a robust path model estimation technique is being proposed, the partial robust M (PRM) approach. In an example its benefits are illustrated.
1
Introduction
Consider the situation where one disposes of j blocks of observable variables, each of which one supposes to be the effect of a sole unobservable, latent variable. Furthermore, structural relations between the latent variables of the different groups are assumed to exist. Different techniques to estimate these latent variables, as well as the relations between them, have been proposed in the literature. On the one hand, one can use maximum likelihood techniques such as LISREL (Jöreskog and Sörbom 1979), where rigid model assumptions concerning multinormality have to be verified. If one desires less rigid assumptions, so-called soft modelling might prove a viable alternative. The most successful approach to soft modelling of the problem described before is the so-called PLS approach (Wold 1982), which moreover gives the benefit of estimating the latent variables at the level of the individual cases, in contrast to LISREL. The PLS approach is also known as PLS path modelling or as PLS structural equation modelling. A myriad of applications of the PLS approach have been reported in the literature, the most salient one probably being the European Customer Satisfaction Index (Tenenhaus et al. 2005). The simplest path model one can consider is a path model relating a block of variables x to a univariate variable y, through a latent variable ξ. Model estimates for this setting can also be used for prediction of y. Hence the PLS approach, relating two blocks of variables to each other over a single
latent variable (which may be a vector variable), can be used as a regression technique. The PLS estimator can be seen as a partial version of the least squares estimator. The latter has properties of optimality at the normal model. However, at models differing from the normal model, other estimators such as the M-estimator may have better properties (Huber 1981). Especially for heavy-tailed distributions such as the Cauchy distribution or the ε-contaminated normal distribution, partial versions of robust estimators may be expected to outperform PLS. Hence, in a recent paper we have proposed the partial robust M-regression estimator (Serneels et al. 2005). Simulations have corroborated the aforementioned assumptions. As the PLS regression estimator is very sensitive to outliers and extreme values, the same holds for the PLS approach as a whole, since a PLS regression is carried out at each iteration. In the current paper, we propose a robust version of the PLS approach based on the robust M-estimator, which will be called the Partial Robust M-approach. An example will show the beneficial properties of the novel approach introduced here.
2
The Model and the Partial Robust M-Approach
Before we can proceed with the description of the partial robust M-approach, we first provide a brief introduction to the PLS approach. More elaborate introductions can be found in the works of Tenenhaus (1999) and Chin and Newsted (1999). Suppose one disposes of j blocks of centred observable variables x_i = (x_i1, · · · , x_ik_i) (i ∈ 1, 2, · · · , j), where k_i denotes the number of variables in block i. These variables are referred to as the manifest variables. Each of these groups of variables can be considered to be essentially univariate: they are the observable counterpart of a single latent variable ξ_i. Manifest and latent variables are related to each other by the linear model (h ∈ 1, · · · , k_i):
x_ih = π_ih ξ_i + ε_ih.   (1)
It is supposed that the random error term ε_ih has zero expectation and is non-correlated to the latent variable. The studied phenomenon is assumed to have been generated by structural relations between the latent variables
ξ_i = Σ_q β_iq ξ_q + φ_i,   (2)
where it is assumed that the random error term φ_i has zero expectation and is not correlated to the latent variable ξ_i. In practice, the latent variables are estimated as linear combinations y_i of the manifest variables x_ih:
y_i = Σ_h w_ih x_ih = w_i^T x_i.   (3)
The vectors w_i are called the weights. However, due to the structural relations (2), another estimate z_i of ξ_i is given by:
z_i ∝ Σ_{q≠i} c_qi y_q.   (4)
The sign ∝ indicates that the variable on the left-hand side of the equation sign is the standardized version of the expression on the right-hand side. Several estimation schemes exist. In this paper we will limit ourselves to the so-called centroid scheme, as this is the only scheme which will be used in the following section (a motivation thereto can be found in Tenenhaus, 1998). In the centroid scheme, it is necessary for the operator to specify the expected sign c_iq = sgn(corr(ξ_i, ξ_q)), where c_iq is set to zero if the latent variables considered are not expected to be correlated. In the original work by H. Wold, two modes for estimation of the weights were proposed. Here we will limit our discussion to what Wold referred to as "mode A", which corresponds to the definition of the weights in PLS regression:
w_i = cov(x_i, z_i).   (5)
This leads to the following condition of stationarity:
y_i ∝ x_i^T x_i Σ_{q≠i} c_qi y_q.   (6)
From Equation (6) it can be seen that the estimates for ξi can be obtained iteratively, starting from an initial guess yi . It can also be seen from Equation (5) that in each iteration, the computation of the new values for yi can be done by computing the first component of a PLS regression of zi on xi . A robustification of the PLS approach is now straightforward. The same iterative estimation scheme is being maintained, albeit at each step the respective PLS regressions are replaced by partial robust M-regressions (Serneels et al. 2005). Partial robust M-regression is an extension of robust M-regression to the latent variable multivariate regression scheme; in this context it has been proven to be superior to PLS if the data come from a non-normal distribution such as a Cauchy or a Laplace distribution. It has been shown that the partial robust M-regression estimator can be implemented as an iteratively re-weighted PLS algorithm (Serneels et al. 2005), where the weights correct for both leverage and vertical outlyingness. A good robust starting value for the algorithm has been described. The use of an iterative re-weighting algorithm makes the method very fast in the computational sense.
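The iterative scheme implied by Equation (6) can be sketched as follows for mode A with the centroid scheme; the function name, the initialization and the convergence criterion are illustrative choices not fixed in the text, and the comment marks the single step where a partial robust M-regression would replace the PLS step to obtain the PRM approach.

```python
# Minimal sketch of the mode-A / centroid-scheme iteration of Section 2.
# Each inner step computes the first PLS component of a regression of the
# inner estimate z_i on block X_i, which for a single response reduces to
# covariance weights w_i = cov(x_i, z_i).
import numpy as np

def standardize(v):
    return (v - v.mean()) / v.std(ddof=1)

def pls_path_mode_a(blocks, C, n_iter=100, tol=1e-8):
    """blocks: list of (n, k_i) centred arrays; C: (j, j) matrix of signs c_qi."""
    j = len(blocks)
    y = [standardize(X[:, 0]) for X in blocks]        # crude initial guesses
    for _ in range(n_iter):
        y_old = [v.copy() for v in y]
        for i, X in enumerate(blocks):
            z = standardize(sum(C[q, i] * y[q] for q in range(j) if q != i))
            w = X.T @ z                                # mode-A weights (covariances)
            # PRM variant: recompute w with case weights from a robust M-regression
            y[i] = standardize(X @ w)
        if max(np.abs(y[i] - y_old[i]).max() for i in range(j)) < tol:
            break
    return y
```

For the example of the next section, the three blocks would be the agricultural, industrial and political blocks, with the sign matrix built from the coefficients c_iq given there.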
3 Example: Economic Inequality Leads to Political Instability
In this section we study a data set first published by Russett (1964). It has been analyzed by PLS and PLS path modelling by Tenenhaus (1998, 1999). The data set contains five variables which were at the time thought to be representative of a country's economic situation; their relation to seven variables which correspond to political (in)stability was studied. It has been shown that some data pre-processing is necessary in order to obtain interpretable results. In the current paper we do not further discuss the data pre-processing, but assume that the variables have been pre-processed as described by Tenenhaus (1999). The same pre-processing has been used for the classical and the robust estimation. Furthermore, 3 observations out of 45 contained missing data; these observations have been left out of the results obtained here.

The block of variables which corresponds to the countries' economic situation in fact consists of two blocks. The first block, comprising the first three manifest variables, describes the (in)equality in terms of the possession of land fit for agriculture. The second block of manifest variables, consisting of the remaining two variables describing a country's economic situation, corresponds to the degree of industrialization of the respective country. Hence, Tenenhaus (1999) proposed a path model where it is assumed that each of the blocks has been generated by a single latent variable, i.e. the agricultural inequality ($\xi_1$), the degree of industrialization ($\xi_2$) and the political instability ($\xi_3$). It is assumed that agricultural inequality increases political instability, whereas industrialization reduces it. Hence, we obtain the coefficients $c_{iq}$ used in Equation (6): $c_{13} = c_{31} = 1$ and $c_{23} = c_{32} = -1$. Both remaining coefficients $c_{12}$ and $c_{21}$ are set equal to zero.

From Equation (6) we see how to build up the iterative estimation scheme. We start from an initial guess, e.g. $y_1^{(1)}$ and $y_3^{(1)}$ are the first X and Y components obtained from a PLS regression of the political variables on the agricultural variables, whereas $y_2^{(1)}$ is taken as the first $x_2$ component from a PLS regression of the political variables on the industrial variables. The superscripts indicate the iteration step. Suppose that in the $r$-th step of the algorithm we have $y_i^{(r)}$ as the current best estimates of the latent variables. Then we can update them in the $(r+1)$-th step by the following scheme, based on Equation (6) (a code sketch of one such update is given after the list):
• the variable $y_1^{(r+1)}$ is the first PLS component of a PLS regression of $y_3^{(r)}$ on $X_1$ ($X_1$ is the matrix consisting of the $n$ observations of $x_1$);
• $y_2^{(r+1)}$ is the first PLS component obtained from a PLS regression of $-y_3^{(r)}$ on $X_2$;
• $y_3^{(r+1)}$ is the first PLS component obtained from a PLS regression of $y_2^{(r)} - y_1^{(r)}$ on $X_3$.
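To make the update scheme concrete, the following minimal sketch (in Python/NumPy; the function names are ours, and the data matrices X1, X2, X3 and the starting scores are assumed to be available) implements the generic centroid-scheme update of Equation (6). Replacing the PLS step by a partial robust M-regression step would give the robust variant discussed above.

```python
import numpy as np

def first_pls_component(X, z):
    """First PLS (mode A) component of a PLS1 regression of z on X:
    the weights are proportional to cov(x_h, z) and the resulting score
    is standardized, as required by the centroid scheme."""
    Xc = X - X.mean(axis=0)
    zc = z - z.mean()
    w = Xc.T @ zc                      # proportional to the covariances cov(x_h, z)
    t = Xc @ w
    return t / t.std(ddof=1)

def centroid_update(y, X_blocks, C):
    """One update of all latent-variable scores: y[i] becomes the first PLS
    component of the signed sum of its neighbouring scores regressed on X_i,
    i.e. the sample version of Equation (6)."""
    j = len(X_blocks)
    return [first_pls_component(X_blocks[i],
                                sum(C[q, i] * y[q] for q in range(j) if q != i))
            for i in range(j)]

# Expected signs from the text: c13 = c31 = 1, c23 = c32 = -1, c12 = c21 = 0.
# C = np.array([[0, 0, 1], [0, 0, -1], [1, -1, 0]])
# y = centroid_update(y, [X1, X2, X3], C)   # repeat until convergence
```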
Fig. 1. Causality scheme estimated by Tenenhaus (1998) by means of the PLS approach, relating economic inequality and political instability.
This process is repeated until convergence. The robust estimates reported later in this section are obtained by the same iterative procedure, except that the estimates $y_j^{(r+1)}$ are in that case the first components of the corresponding PRM regressions.

In path modelling it is customary to represent the path model by a flowchart. Manifest variables are displayed in boxes; latent variables are displayed in circles. The arrows show the direction in which the variables influence each other. The correlation coefficients between the manifest and latent variables are shown above the respective arrows. In order to describe the relations among the latent variables, the regression coefficients $d_i$ of the linear relation $y_3 = d_1 y_1 + d_2 y_2$ are shown above the arrows relating the latent variables. The results obtained by Tenenhaus (1998) are shown in Figure 1, which leads to the conclusion that political instability is caused by a lack of industrialization rather than by inequality in the possession of land. However, based on economic arguments, in the original analysis by Russett it had been expected that each of the five economic variables would contribute roughly equally to political instability.

In the data set considered here, no outliers are present in the sense of bad measurements which should be deleted before applying the PLS approach. However, some influential observations are present. A good
Fig. 2. Squared Influence Diagnostic plot for PLS1 regression of the variable “demostab” on X2 .
diagnostic to detect influential observations in the PLS context is the Squared Influence Diagnostic (SID), which is based on the univariate PLS influence function (Serneels et al. 2004). As it is a univariate diagnostic, it has to be applied separately to each of the variables of $X_3$. A SID plot for the PLS regression of the variable "demostab" on $X_2$ is shown in Figure 2. It reveals that the observation corresponding to India (observation 22) is very influential. This has also been signalled by Tenenhaus (1999), who notices that India is the only democracy whose level of industrialization is below the mean value. When computing the SID for other combinations of the $X_i$ blocks and individual variables of $X_3$, a few other influential observations can be discerned. The presence of some observations which are very influential on the final estimate suggests that a robust estimate might in this case suffer less from these individual observations and might be better able to discern the general trend in the data.

As a robust estimation technique, we applied the partial robust M (PRM) approach to estimate the desired quantities. The tuning constant was set to 4 (for further details see Serneels et al. 2005), and convergence of the PRM approach was obtained after 3 iterations, as was the case for the PLS approach. The obtained estimates are shown in Figure 3. From Figure 3 it can be seen that the robust estimates differ somewhat from the estimates obtained by the classical PLS approach. The correlations between the manifest variables and the latent variables show the same trend as in Figure 1, although some small differences may be observed: the variable "einst" is shown to be less informative whereas the variable "ldeat" is more
Fig. 3. Causality scheme estimated by means of the PRM approach, relating economic inequality and political instability.
informative to the robust model. Note that the correlations shown in the robust causality scheme (Figure 3) are Spearman correlations, as the usual Pearson correlations might also yield unreliable results due to deviations from normality. The main difference between the classical and robust estimates resides in the estimation of the latent variables and in the way these are related to each other. Tenenhaus (1999) concluded that the latent variable corresponding to the level of industrialization ($\xi_2$) determines the country's political instability ($\xi_3$) to a much greater extent than the agricultural inequality ($\xi_1$) does. From the robust estimates one observes that industrialization is still the more important of the two, although the difference is much smaller. One could indeed conclude that agricultural inequality and industrialization contribute about equally to political instability.
4 Conclusions
The PLS approach is a technique which is widely applied to estimate path models between several blocks of variables. It is believed that the path model unveils the general trend of the structural relations which exist between these variables.
The PLS approach is very sensitive to influential observations such as outliers, which may distort the final estimate in their direction. In the social sciences and economics, where the PLS approach is widely used, influential observations are frequently not outliers caused by bad measurements, which should be removed before model estimation; rather, they often correspond to individuals which behave differently from the majority of the data. Hence, the information these observations carry should be used in the model estimation step, but their influence on the final estimate should be controlled. The aforementioned arguments suggest the use of a robust estimation technique for the path model. Robust M-estimators are resistant with respect to outliers, but remain highly efficient at the normal model. In the current paper, the partial robust M-approach has been proposed as a robust estimation technique for path modelling. It is based on several steps of partial robust M-regression (Serneels et al. 2005). In an example it has been shown to yield improvements over the PLS approach, in the sense that it can better unveil the general trend in the path model relations when the data do not follow a normal model.
References

CHIN, W.W., and NEWSTED, P.R. (1999): Structural equation modelling analysis with small samples using partial least squares. In: Hoyle, R.H. (Ed.): Statistical strategies for small-sample research. Sage, Thousand Oaks (CA), pp. 307–341.
HUBER, P.J. (1981): Robust Statistics. Wiley, New York.
JÖRESKOG, K.G., and SÖRBOM, D. (1979): Advances in factor analysis and structural equation models. Abt Books, Cambridge.
RUSSETT, B.M. (1964): Inequality and instability. World Politics, 21, 442–454.
SERNEELS, S., CROUX, C., and VAN ESPEN, P.J. (2004): Influence properties of partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 71, 13–20.
SERNEELS, S., CROUX, C., FILZMOSER, P., and VAN ESPEN, P.J. (2005): Partial robust M-regression. Chemometrics and Intelligent Laboratory Systems, 79, 55–64.
TENENHAUS, M. (1998): La régression PLS. Technip, Paris.
TENENHAUS, M. (1999): L'approche PLS. Revue de Statistique Appliquée, XLVII (2), 5–40.
TENENHAUS, M., ESPOSITO VINZI, V., CHATELIN, Y.-M., and LAURO, C. (2005): PLS path modelling. Computational Statistics and Data Analysis, 48, 159–205.
WOLD, H. (1982): Soft modeling: the basic design and some extensions. In: K.G. Jöreskog and H. Wold (Eds.): Systems under indirect observation, Vol. 2. North-Holland, Amsterdam, pp. 1–54.
Classification in PLS Path Models and Local Model Optimisation

Silvia Squillacciotti

EDF Research and Development, 92140 Clamart, France

Abstract. In this paper, a methodology is proposed which can be used for the identification of classes of units showing homogeneous behavioural models estimated through PLS Path Modelling. The proposed methodology aims at discovering or validating the existence of classes of units in PLS Path models in a predictive-oriented logic, as has been proposed, in the framework of PLS Regression, with PLS Typological Regression. An application to a study on customer satisfaction and loyalty is shown.
1 Introduction: State of the Art in PLS and Classification
Classification and discrimination in the framework of PLS Regression have traditionally been performed through two main approaches: the SIMCA approach (Wold et al., 1984) and PLS Discriminant Analysis (PLS-DA, Sjöström et al., 1986). In the SIMCA approach, PLS regression is performed over all the units. A cluster analysis is then performed on the retained components, in order to assign the units to the clusters. Local models can then be estimated (one for each group), and new units can then be assigned to one of the clusters according to the DmodX index, i.e. the distance of the unit from the model in the explanatory variables' space. PLS-DA basically consists of a PLS Regression where the dependent variable is the group indicator vector. PLS-DA searches for the explanatory variables allowing the best separation among the classes. Hence, in the SIMCA approach, classes are defined once and for all at the beginning of the analysis. Moreover, the assignment of a new unit to a class is performed according to its distance from the model in the explanatory variables' space; hence, when predicting the class membership of a new unit, the main predictive purpose of PLS is put aside. In PLS-DA, instead, the major drawback is that no dependent variable other than the one containing the group membership information is allowed: hence, the predictive purpose of PLS Regression may only concern the explanatory variables' capacity to express the best separation among the classes. More recently, a technique for an iterative prediction-oriented classification inside PLS Regression has been proposed: PLS Typological Regression (Esposito Vinzi and Lauro, 2003). The starting point of this procedure coincides
with the SIMCA approach to classification: a PLS Regression is performed over all the units, a cluster analysis on the retained components leads to the choice of the number of clusters and to the assignment of the units, and local PLS Regressions are then performed over the clusters. PLS Typological Regression goes further than the SIMCA approach: after having estimated the local models, each unit is re-assigned to the class corresponding to the closest local model. The distance from the models is computed as a distance in the dependent variables' space, so as to take into account the predictive purpose of PLS Regression. Such a distance closely follows the DmodY,N index (Tenenhaus, 1998). If there is any change in the composition of the classes, the local models are re-estimated, the distances are computed once again and the units are re-assigned to the closest group. The iterations stop when there is no change in the composition of the classes from one step to the next. Hence, the composition of the clusters at the final step of the algorithm is the result of an optimisation of the predictive capacity of the local models, according to the main PLS criterion. Finally, a compromise model, coherent with the local ones, is estimated.
2 Context: The Analysis of Customer Loyalty and Customer Satisfaction Through PLS Path Modelling
When wishing to identify the drivers of customer satisfaction and its influence on customer loyalty, a more complex model than a two-block regression may have to be defined. In fact, two aspects should be taken into account. First, we may suppose that different items have an impact on customer satisfaction and customer loyalty, such as the image of the firm in the customers' minds, the perceived quality of the product or service, etc., and that these items may also influence one another before impacting on satisfaction and loyalty. At the same time, concepts such as customer satisfaction, image, and the other items taken into account in the model may be seen as "complex" concepts, not directly observable, but measurable through the different aspects that in some way "reflect" such underlying constructs. In a survey, different questions may be asked in order to have an overall measure of customer satisfaction, perceived value, etc. In such a case, the PLS approach to Structural Equation Modelling is far better suited to the definition and the estimation of the model parameters. The importance and diffusion of PLS techniques in the study of customer satisfaction is also witnessed by their use in the estimation of an economic model used for the definition of a standard customer satisfaction indicator: the European Customer Satisfaction Index (ECSI) (Tenenhaus et al., 2005). ECSI is an adaptation of the Swedish customer satisfaction barometer (Fornell, 1992) and is compatible with the American Customer Satisfaction Index. EDF (Electricité de France) is strongly concerned with the measurement of customer satisfaction and customer loyalty. The energy market is undergoing
Fig. 1. Model specification
a large number of relevant changes: France, like the whole European Community, is witnessing the opening of this market to competition. The measurement of customer satisfaction and the capacity to predict customers' behaviour in terms of loyalty are therefore of great importance. The aim of this paper is to define a classification technique inside PLS latent variable Structural Equation Modelling allowing either to identify homogeneous consumer groups or, alternatively, to assign the units to classes which are known a priori, according to a distance criterion which will be specified in the following. Such a classification technique should be defined so as to take into account the predictive purpose of PLS techniques. Applied to customer satisfaction modelling, this may help us discover whether "loyal" customers have the same satisfaction model as "switching" customers (i.e. whether the drivers of satisfaction have the same importance for both groups).
3 The Data and the Model
A satisfaction survey was conducted on 791 customers. 32 manifest variables were measured; they concern 6 latent concepts, i.e. satisfaction, image, billing, communication, service and price. Among the 791 customers, 133 eventually switched to a new supplier. We therefore have two groups of individuals ("loyal" customers and "switching" customers). We wished to find out whether the two groups show different satisfaction models (i.e. different drivers of satisfaction), under the assumption that satisfaction (or the lack of it) is the only element which determines the decision to change supplier. The model is shown in Figure 1.
4 Method and Results
According to the objectives of our work as defined in the previous section, we wish to validate the existence of the a priori groups inside the PLS path model, so as to obtain groups of consumers that are homogeneous with respect to the defined model. While in PLS Regression a number of possibilities exist for performing a classification, such as those described in Section 1, this is not the case for PLS Path Modelling. Traditionally, in PLS Path Modelling, groups are defined according to external criteria (i.e. a priori information or results of external analyses). A model can then be computed for each group, and the local models can be compared to one another. The existence of groups showing different models, either known to exist a priori or defined through external analyses, can then be validated by means of a partial analysis criterion (Amato and Balzano, 2003). However, the composition of the classes is not further optimised in order to improve the models' predictivity or homogeneity.

The aim of this work is to propose a generalisation of PLS Typological Regression to PLS Path Modelling. First of all, a PLS path model is estimated over all the units. If classes are not known to exist a priori, they may be defined either according to the results of the global model or by randomly assigning the units to the classes. Local models are then estimated (one for each class), and a measure of the distance of each unit from each local model is computed. Units are then re-assigned to the class corresponding to the closest local model: if this causes any change in the composition of the classes, the local models are re-estimated and the distances are computed once again. When there is no change in the composition of the classes from one step to the next, the obtained local models are compared in terms of predictivity ($R^2$) and of intensity of the structural links on the final latent variables (in our study, satisfaction).

The main problematic issue in the generalisation of PLS Typological Regression to a PLS path model is the calculation of a measure of the distance of each unit from the local models: no such index exists in PLS Path Modelling. The required distance should be based on the local model's capacity to reconstruct the observed values of the final endogenous manifest variables (the three observed variables of the loyalty block), and it should take into account the local model's redundancy (the capacity of the exogenous latent variables to predict the endogenous manifest variables) and the number of units in each cluster. In other words, the distance measure should express the statistical proximity of a unit to the model in the endogenous variables' space. Following the DmodY,N distance and its use in PLS Typological Regression, the distance we have defined is the following:
$$D_g = \sqrt{\frac{\sum_k \left[e_{gik} / Rd(\xi, y_k)\right]^2}{\sum_i \sum_k \left[e_{gik} / Rd(\xi, y_k)\right]^2 \Big/ \big[(n_g - m_g - 1)(q - m_g)\big]}}$$
Fig. 2. Results of the PLS path model over the entire sample (791 units)
where:
• $e_{gik}$ is the residual of the "redundancy" model, i.e. the regression of the final endogenous manifest variables on the exogenous latent variables,
• $Rd(\xi, y_k)$ is the redundancy index of the final endogenous manifest variables (the three variables in the loyalty block) for group $g$,
• $n_g$ is the number of units in group $g$,
• $m_g$ is the number of exogenous latent variables in the local model for group $g$,
• $q$ is the number of final endogenous manifest variables.

In more detail, the procedure is the following:
Step 1: Estimation of the PLS path model over the entire sample.
Step 2: Definition of the G classes.
Step 3: Estimation of the G local models.
Step 4: Computation of the distance $D_g$ as defined above.
Step 5: Attribution of each unit to the closest local model: if there is any change in the composition of the classes, repeat steps 3, 4 and 5; otherwise move to step 6.
Step 6: Comparison of the final local models.

A sketch of steps 4 and 5 is given below. The results of the first step (estimation of the global model) are shown in Figure 2. The model is satisfactory overall: all structural coefficients are significant and show a slightly stronger impact of image on satisfaction. The model predictivity for customer satisfaction, however, is rather low (0.39). In any case, the reason for searching for classes of customers is not the need to improve the global model's results, but rather the intention of finding groups of customers showing different satisfaction models.
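As an illustration of steps 4 and 5, the sketch below (Python/NumPy; all names are ours) computes the distance D_g as printed above and reassigns each unit to the closest local model. The estimation of the local PLS path models and of their redundancy-model residuals is not shown and is represented by the hypothetical placeholder redundancy_residuals.

```python
import numpy as np

def unit_distances(E, Rd, n_g, m_g):
    """Distance D_g of each unit (row of E) from local model g, following the
    DModY-type measure printed above.  E holds the redundancy-model residuals
    e_gik (one column per final endogenous manifest variable), Rd the q
    redundancy indices Rd(xi, y_k), n_g the size of group g and m_g the
    number of exogenous latent variables in its local model."""
    q = len(Rd)
    scaled = (E / Rd) ** 2                       # [e_gik / Rd(xi, y_k)]^2
    num = scaled.sum(axis=1)                     # sum over k, one value per unit
    den = scaled.sum() / ((n_g - m_g - 1) * (q - m_g))
    return np.sqrt(num / den)

# Hypothetical reassignment loop (steps 3-5); redundancy_residuals(g, labels)
# stands for re-estimating local model g and returning (E, Rd, n_g, m_g).
# labels = initial_labels
# while True:
#     D = np.column_stack([unit_distances(*redundancy_residuals(g, labels))
#                          for g in range(G)])
#     new_labels = D.argmin(axis=1)
#     if np.array_equal(new_labels, labels):
#         break
#     labels = new_labels
```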
Fig. 3. Local models results for classes 1 and 2 in step 1.
id        distance from class 1   distance from class 2   cluster step 1   cluster step 2
23504     0.3656087               0.4269022               1                1
25406     0.5938988               0.5278993               1                2
26004     0.9897991               0.9728581               1                2
26110     0.5419516               0.4213784               1                2
...       ...                     ...                     ...              ...
1546603   0.8696395               0.7988576               2                2
1546703   0.0945292               0.0388062               2                2
1554103   0.2917057               0.2474925               2                2
1554803   0.6116967               0.7163918               2                1
1555603   0.4918778               0.4570386               2                2
1586703   0.2869769               0.2820377               2                2
1600203   0.345102                0.3415433               2                2
Table 1. Computation of the distances and imputation to classes in step 2.
After the estimation of the global model, units were randomly assigned to two classes (of 390 and 401 units respectively). The results of the local models are shown in Figures 3.a and 3.b. The main difference between the two classes concerns the importance of price and image in their impact on satisfaction. In class 1 image has a stronger impact on satisfaction than in class 2, while in class 2 price seems to have the strongest impact on the definition of image among its explanatory variables. Class 1 contains 71 of the 133 lost customers (53.4%), while the remaining 62 (46.6%) are in class 2. The classes are therefore scarcely characterised by the number of lost customers. The distances of each unit from each local model were then computed according to the equation above. Table 1 shows the values of the distances for a portion of the 791 customers. In order to define the final local models, 27 iterations were needed. The final results for the local models at the last iteration are given in Figures 4.a and 4.b.
Fig. 4. Results for the local models at the last iteration.
5 Conclusions and Future Perspectives
The final models in Figures 4.a and 4.b show more evident differences compared to the first-step local models. In the model for class 1, price seems to be a very important variable. First of all, its impact on satisfaction is stronger than that of image. Moreover, the two latent variables having the strongest impact on image are once again price and billing (which concerns the readability of the bill, the simplicity of payments, etc.). On the other hand, class 2 is more strongly characterised by image as a driver of satisfaction. The most important dimensions influencing image for this class are no longer price and billing, but rather communication and service (the perceived quality of the service). These results become even more interesting when taking into account the composition of the classes. Class 1 has 189 units, among which 107 have chosen to switch to a new supplier. Hence, 80% of the lost customers are in class 1. We may therefore suppose, also taking into account the high model predictivity ($R^2 = 0.86$), that this class is made up of customers who are extremely sensitive to price, and that for this class price is the most important element in the decision to change supplier. Class 2, instead, is mainly made up of customers for whom satisfaction is more strongly related to image, and for whom the quality of the service, the communication to the customer and the global image are more important than elements related to "monetary" value in determining satisfaction. The lower model predictivity may be due to a lesser homogeneity of the class. Further investigations may take into account the existence of a third group.

Research is still ongoing on different aspects of the methodology. The first issue concerns the definition of the classes, namely when classes are not known a priori: we are investigating other options, such as defining the groups from the global model residuals (assuming that units which show high residuals are not well represented by the model). Another subject of future research concerns the possibility of defining, if necessary, during the iterations, different structural models for each group: in many cases we may suppose that the differences among the classes concern not only the intensity of the structural links and the model predictivity, but also the definition of the model itself. Obviously, this may lead to problems in the comparison of
groups, since it is easier to compare models having the same structure. A third issue concerns the definition of a compromise model, such as the one defined in PLS Typological Regression, which describes all the units and is coherent with the local ones. This may require the definition of a measure of distance among the models.
References

AMATO, S. and BALZANO, S. (2003): Exploratory approaches to group comparison. In: M. Vilares, M. Tenenhaus, P. Coelho, V. Esposito Vinzi and A. Morineau (Eds.): PLS and Related Methods. DECISIA, France, 443–452.
ESPOSITO VINZI, V. and LAURO, C. (2003): PLS Regression and Classification. In: M. Vilares, M. Tenenhaus, P. Coelho, V. Esposito Vinzi and A. Morineau (Eds.): PLS and Related Methods. DECISIA, France, 45–56.
FORNELL, C. (1992): A national customer satisfaction barometer: the Swedish experience. Journal of Marketing, 56, 6–21.
SAS (1999): SAS/STAT User's Guide, Version 8. SAS Institute Inc., Cary, NC.
SJÖSTRÖM, M. et al. (1986): PLS Discriminant Plots. In: Proceedings of PARC in Practice. Elsevier, North Holland.
TENENHAUS, M. (1998): La Régression PLS: théorie et pratique. Technip, Paris.
TENENHAUS, M. et al. (2005): PLS Path Modeling. Computational Statistics and Data Analysis, 48, 159–205.
WOLD, S. et al. (1984): Multivariate Data Analysis in Chemistry. SIAM Journal of Scientific and Statistical Computing, 5, 735–744.
Hierarchical Clustering by Means of Model Grouping

Claudio Agostinelli1 and Paolo Pellizzari2

1 Dipartimento di Statistica, Università Ca' Foscari, 30121 Venezia, Italia, email: [email protected]
2 Dipartimento di Matematica Applicata, Università Ca' Foscari, 30121 Venezia, Italia, email: [email protected]
Abstract. In many applications we are interested in finding clusters of data that share the same properties, like linear shape. We propose a hierarchical clustering procedure that merges groups if they are fitted well by the same linear model. The representative orthogonal model of each cluster is estimated robustly using iterated LQS regressions. We apply the method to two artificial datasets, providing a comparison of results against other non-hierarchical methods that can estimate linear clusters.
1 Introduction
Hierarchical cluster analysis is a widely used method to group data. The procedure is based on a distance between the observations and is completely non-parametric. For a review of this method see Kaufman and Rousseeuw (1990) and Everitt (1993). The hierarchical structure is valuable in descriptive and exploratory analysis and gives visual suggestions on how groups are merged together. Moreover, the dendrogram offers some guidance in the non-trivial problem of selecting the "optimal" number of clusters. Recently several authors have proposed, in a non-hierarchical framework, methods that incorporate parametric information about the clusters. This is done by introducing a measure of similarity of the observations that is a function of the agreement with respect to a parametric model. There is a strong interest in linearly shaped clusters, which are very simple and useful in a host of applications (for example, edge detection and image processing). See Müller and Garlipp (2005), Hennig (2003) and Van Aelst et al. (2005) for recent work in this area. Our contribution aims to provide a method that retains the advantages of a hierarchy of clusters described by linear models. As it is often difficult or inappropriate to select a dependent variable, we estimate these models using robust and orthogonal regressions obtained by repeated application of Least Quantile of Squares (LQS) regressions and rotations. The following section describes the hierarchical method we propose. In Section 3 we apply the method to two artificial datasets and sketch a comparison
with other techniques. Some final remarks and possible future research avenues are given in Section 4.
2 A Parametric Hierarchical Cluster Analysis
Given $n$ observations $x_i = (x_{i,1}, x_{i,2}, \dots, x_{i,p})$, $i = 1, \dots, n$, from a multivariate random variable $X = (X_1, \dots, X_p)$ with $p$ components, we denote by $d_{i,j} = d(x_i, x_j)$, $\forall i, j = 1, \dots, n$, the distance between the observations $x_i$ and $x_j$. In many cases $d_{i,j}$ is the Euclidean distance between $x_i$ and $x_j$, that is,
$$d_{i,j} = \left[\sum_{k=1}^{p} (x_{i,k} - x_{j,k})^2\right]^{1/2}.$$
Let $P_n, P_{n-1}, \dots, P_k, \dots, P_1$ be the partitions generated by a clustering algorithm and $C_{k,1}, \dots, C_{k,k}$ the clusters in each partition $P_k$. In the partition $P_n$ there are $n$ clusters $C_{n,1}, \dots, C_{n,n}$, where each cluster contains only one observation, while $P_1$ has only one cluster, namely $C_{1,1}$, which contains the whole dataset. Let $\#C_{k,m}$ denote the number of observations in the cluster $C_{k,m}$ (the $m$-th cluster in the $k$-th partition) and $I(C_{k,m})$ be the set of indexes of the observations belonging to the cluster $C_{k,m}$. In a classical hierarchical cluster analysis we have to define the distance between a pair of clusters at stage $k$ (${}_k D_{l,m}$) depending on the distances of the observations in the two clusters:
$${}_k D_{l,m} = D(C_{k,l}, C_{k,m}) = f\left(\{d(i,j),\ i \in I(C_{k,l}),\ j \in I(C_{k,m})\}\right),$$
where $f$ is a function from $\mathbb{R}^{\#C_{k,l} \times \#C_{k,m}}$ to $\mathbb{R}^+$ (the positive real line). Some examples of $f$ are the max, min, median or mean operators. The distances ${}_k D_{l,m}$ ($l, m = 1, \dots, k$) can be represented in a symmetric $k \times k$ matrix, whose main diagonal has zero values. The partition $P_{k-1}$ is obtained from $P_k$ by merging the two groups with the minimal ${}_k D_{l,m}$ ($l \neq m$) into a new cluster.

We now assume that each cluster can be described by some linear model. As an example, assume that $X$ is multivariate normally distributed and $E(\beta_{1km} X_1 + \dots + \beta_{pkm} X_p - \beta_{0km}) = 0$, $\|(\beta_{1km}, \dots, \beta_{pkm})\| = 1$, within cluster $C_{k,m}$. Different clusters might have different $\beta$ coefficients and are merged when the respective models are close. For notational convenience we set $(\beta_{0km}, \beta_{1km}, \dots, \beta_{pkm}) = M_{k,m}$, which is to be interpreted as the (linear) model describing cluster $m$ in partition $k$. The model $M_{k,m}$ is estimated by an orthogonal and robust regression procedure that is described below. The distance between a pair of clusters (${}_k D_{l,m}$) is accordingly modified as follows to take into account the representative linear models $M_{k,l}$ and $M_{k,m}$. Let $r_{k,l} = (r_{k,l,1}, \dots, r_{k,l,n})$ be the vector of standardized residuals of
all points with respect to the representative model $M_{k,l}$. For all clusters for which it is impossible to estimate a linear model (say, because too few observations are available) we set all the residuals to some predefined constant, such as a suitable quantile of the standard normal distribution (e.g. 1.96 or 2.57). Then we let the distance between clusters depend also on the residuals of the model, yielding
$${}_k D_{l,m} = D(C_{k,l}, C_{k,m}) = f\left(g\left(\{d(i,j),\, r_{k,l,j},\ i \in I(C_{k,l}),\ j \in I(C_{k,m})\}\right)\right),$$
where $g$ is a function from $\mathbb{R}^{\#C_{k,l} \times \#C_{k,m}} \times \mathbb{R}^n$ to $\mathbb{R}^{\#C_{k,l} \times \#C_{k,m}}$ and $f$ is as before. In other words, the distance between two clusters depends on the distance between pairs of points and on the residuals of the points in one cluster with respect to the model estimated in the other cluster. The intuition is that two clusters are merged when points are close and the same linear parametric model fits both groups well. Among the possible $g$ functions we have used
• $g_1(d(i,j), r_{k,l,j}) = d(i,j)\,(|r_{k,l,j}| + 1)$
• $g_2(d(i,j), r_{k,l,j}) = d(i,j)^{(|r_{k,l,j}| + 1)}$
The two choices give similar behaviour and, for brevity, we report in the sequel only results for the second function (a small computational sketch of this cluster distance is given just before Section 2.1). The $k \times k$ matrix with entries ${}_k D_{l,m}$, i.e. the distance of the $m$-th from the $l$-th cluster, is in general not symmetric. Moving from partition $P_k$ to $P_{k-1}$, the two groups with the smallest ${}_k D_{l,m}$ ($l \neq m$) are joined into a new cluster, subject to the condition that $\#C_{k,l} \geq \#C_{k,m}$. This is done to allow only the joining of a smaller cluster to a bigger one, ensuring stability of the newly estimated model in the merging process through the breakdown point of the estimation method. We leave to future research the exploration of symmetric distances taking into account both $r_{k,l}$ and $r_{k,m}$ in the computation of ${}_k D_{l,m}$. The description of our algorithm is completed in the next subsection, devoted to the procedure used to estimate the robust orthogonal regressions $M_{k,m}$.
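The following small sketch (Python/NumPy; the function and argument names are ours) shows how the modified inter-cluster distance can be evaluated from the matrix of pairwise point distances and the residual vector of one cluster's model, for either choice of g.

```python
import numpy as np

def cluster_distance(d, r_l, idx_l, idx_m, g="g2", f=np.min):
    """Model-based distance _kD_{l,m} between clusters l and m.
    d     : n x n matrix of pairwise point distances d(i, j),
    r_l   : standardized residuals of all n points from the model M_{k,l},
    idx_l : indexes I(C_{k,l}),  idx_m : indexes I(C_{k,m}),
    g     : "g1" (product) or "g2" (power),  f : aggregation (min, max, ...)."""
    D = d[np.ix_(idx_l, idx_m)]               # distances between the two clusters
    R = np.abs(r_l[idx_m]) + 1.0              # residual factor of the points of C_{k,m}
    vals = D * R if g == "g1" else D ** R     # g1 or g2 as defined above
    return f(vals)
```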
2.1 Robust Orthogonal Regression
Denote by $T_{\Delta X, \Delta Y}$ and $R_\theta$ the translation and rotation matrices (for simplicity, we describe the case of two variables only)
$$R_\theta = \begin{bmatrix} \cos(\theta) & \sin(\theta) & 0 \\ -\sin(\theta) & \cos(\theta) & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad T_{\Delta X, \Delta Y} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ \Delta X & \Delta Y & 1 \end{bmatrix}.$$
After observing that orthogonal and ordinary regressions coincide when the slope of the linear model is null, we iteratively translate and rotate the dataset computing LQS regressions along the following lines:
1. Run two LQS regressions to determine initial estimates using "dependent" y and "independent" x variables;
2. $(\hat\alpha_0, \hat\beta_0) := \mathrm{LQS}(x, y)$;
3. Use $(\hat\alpha_0, \hat\beta_0)$ to compute the angle $\theta_0 := \mathrm{atan}(\hat\alpha_0)$ and the translation $\Delta_0 := -\hat\beta_0$;
4. Set $i := 0$;
5. Repeat
6. Translation and rotation: $[x^*, y^*, 1] := [x, y, 1]\, R_{\theta_i}\, T_{0, \Delta_i}$;
7. $(\hat\alpha_{i+1}, \hat\beta_{i+1}) := \mathrm{LQS}(x^*, y^*)$;
8. Compute the angle $\psi_i := \mathrm{atan}(\hat\alpha_{i+1})$;
9. Update $\theta_{i+1} := \theta_i + \psi_i$ and $\Delta_{i+1} := \Delta_i - \hat\beta_{i+1}$;
10. $i := i + 1$;
11. Until $\psi_i \approx 0$.
The algorithm keeps rotating the data points to adjust the angle $\theta$ between the x and y components and the translation $\Delta$ until no further adjustment is needed. This procedure retains the robustness properties of LQS (Rousseeuw and Hubert, 1997), and usually terminates in a few iterations.
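A minimal sketch of this rotate-and-translate scheme is given below (Python/NumPy). It is not the authors' implementation: in particular, lqs_fit is only a crude random-pairs stand-in for a proper LQS routine, used here so that the example is self-contained.

```python
import numpy as np

def lqs_fit(x, y, q=0.5, n_trials=500, seed=None):
    """Crude stand-in for LQS: among lines through random pairs of points,
    keep the one minimising the q-th quantile of the squared residuals."""
    rng = np.random.default_rng(seed)
    best_a, best_b, best_crit = 0.0, 0.0, np.inf
    for _ in range(n_trials):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue
        a = (y[j] - y[i]) / (x[j] - x[i])                # candidate slope
        b = y[i] - a * x[i]                              # candidate intercept
        crit = np.quantile((y - (a * x + b)) ** 2, q)
        if crit < best_crit:
            best_a, best_b, best_crit = a, b, crit
    return best_a, best_b

def robust_orthogonal_line(x, y, tol=1e-6, max_iter=50):
    """Iterated rotate-and-translate scheme of Section 2.1: accumulate the
    angle theta and the translation delta until the residual angle vanishes."""
    a, b = lqs_fit(x, y)
    theta, delta = np.arctan(a), -b
    for _ in range(max_iter):
        xs = np.cos(theta) * x - np.sin(theta) * y       # [x*, y*, 1] = [x, y, 1] R_theta T_{0, delta}
        ys = np.sin(theta) * x + np.cos(theta) * y + delta
        a, b = lqs_fit(xs, ys)
        psi = np.arctan(a)
        theta, delta = theta + psi, delta - b
        if abs(psi) < tol:
            break
    return theta, delta
```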
3 Numerical Examples
We consider two artificial datasets, named triangle and π-shaped data, available upon request from the authors. The first dataset, made of 450 points $[X, Y]$, is depicted in Figure 1 (left) and is generated according to the model $[X, Y, 1] = [Z, W, 1]\, R_\theta\, T_{\Delta X, \Delta Y}$, where $Z \sim \mathrm{Unif}(0, 30)$ and $W \sim \mathrm{N}(0, 1)$. Each third of the sample is taken from a different linear model:
• n = 150 from θ = 0, ΔX = 0, ΔY = 0;
• n = 150 from θ = π/3, ΔX = 1, ΔY = −5;
• n = 150 from θ = −π/3, ΔX = 15, ΔY = 15.
The π-shaped data are obtained similarly, merging together observations from
• n = 150 from θ = −π/3, ΔX = 10, ΔY = 30;
• n = 150 from θ = π/6, ΔX = 1, ΔY = 10;
• n = 150 from θ = π/6, ΔX = −3, ΔY = 15,
where $Z \sim \mathrm{Unif}(0, 20)$ and $W \sim \mathrm{N}(0, 1)$. This dataset is represented, together with the generating linear models, in Figure 1 (right). A sketch of this data-generating scheme is given below. Observe that in both cases the orthogonal residuals have the same distribution as W. Standard graphs of the distances of aggregation can help in detecting the "right" number of clusters. Figure 2 shows that three groups are a good choice for the triangle data. The corresponding three linear models are shown in Table 1. The classification of the points to the correct cluster is very satisfactory, taking into account that the groups widely overlap.
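The data-generating scheme can be sketched as follows (Python/NumPy; the seed and helper name are ours, so the resulting point configuration will differ from the authors' datasets).

```python
import numpy as np

def linear_cluster(n, theta, dx, dy, z_max, rng):
    """n points from [X, Y, 1] = [Z, W, 1] R_theta T_{dx,dy},
    with Z ~ Unif(0, z_max) and W ~ N(0, 1)."""
    z = rng.uniform(0, z_max, n)
    w = rng.normal(0, 1, n)
    x = z * np.cos(theta) - w * np.sin(theta) + dx       # row-vector convention
    y = z * np.sin(theta) + w * np.cos(theta) + dy
    return np.column_stack([x, y])

rng = np.random.default_rng(0)
triangle = np.vstack([linear_cluster(150, 0.0,       0,  0, 30, rng),
                      linear_cluster(150, np.pi/3,   1, -5, 30, rng),
                      linear_cluster(150, -np.pi/3, 15, 15, 30, rng)])
pi_shaped = np.vstack([linear_cluster(150, -np.pi/3, 10, 30, 20, rng),
                       linear_cluster(150, np.pi/6,   1, 10, 20, rng),
                       linear_cluster(150, np.pi/6,  -3, 15, 20, rng)])
```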
Fig. 1. Triangle (left) and π-shaped data (right), with the linear model generating the points.
Fig. 2. Distance of aggregation for triangle (left panel) and π-shaped (right panel) data, last 20 partitions.
The error rate is 16.5%, but a simple reclassification using the estimated models narrows the error to 10.9%. The results for the π-shaped data are reported in Table 2, where we again identify the "correct" three clusters using the aggregation distances, see Figure 2, right panel. The error rate of 24.5% can be reduced to 13.5% after reclassification, as shown in Figure 3. It is interesting to compare these results with other non-hierarchical grouping techniques, such as the mixture model by Hennig and the orthogonal regression clustering by Müller and Garlipp. We perform all the computations using the R packages fpc and edci provided by the authors, with the functions regmix and
True values
Cluster  size  θ              ΔX   ΔY      σ
1        150   0              0    0       1
2        150   π/3 = 1.047    1    −5      1
3        150   −π/3 = −1.047  15   15      1

Estimated values
Cluster  size  θ       ΔX   ΔY      σ
1        178   0.031   0    −0.382  0.940
2        156   1.016   1    −4.481  1.079
3        116   −1.066  15   14.603  0.892
Table 1. Estimated linear models for triangle data.
True values
Cluster  size  θ              ΔX   ΔY      σ
1        150   −π/3 = −1.047  10   30      1
2        150   π/6 = 0.524    1    10      1
3        150   π/6 = 0.524    −3   15      1

Estimated values
Cluster  size  θ       ΔX   ΔY      σ
1        260   −1.053  10   30.370  1.452
2        91    0.648   1    9.077   0.767
3        99    0.581   −3   18.750  0.688
Table 2. Estimated linear models for the π-shaped data.
Classification errors
Method          Triangle  π-shaped
Hennig          14.22%    69.11%
Müller-Garlipp  12.44%    43.11%
Table 3. Classification errors for the two datasets using the Hennig and Müller-Garlipp methods.
oregMclust, respectively. Both methods produce very good results for the triangle data, but the classification of the points in the other dataset turns out to be somewhat more difficult. The classification errors are given in Table 3. The poor results in the π-shaped case are due to the selection of an erroneous number and/or description of the representative linear models. Figure 4 shows the classification and offers some explanation for the high error rates. The clusters were selected by maximizing BIC for Hennig's mixreg and using 0.3 and 0.2 as bandwidths for Müller-Garlipp's oregMclust in the triangle and π-shaped data, respectively.
0
5
10 x
15
20
0
5
10
15
20
x
Fig. 4. Classification for Hennig (left) and M¨ uller-Garlipp (right) methods.
4 Conclusion
We propose in this paper a hierarchical clustering method that incorporates parametric information about each cluster. The groups are represented by a linear orthogonal model that is robustly estimated using iterated LQS regressions. The method retains the descriptive power of the hierarchical framework and merges two clusters when (points are close and) they are fitted well by the same linear model. We compare the results of our method to a couple of non-hierarchical techniques. The results are comparable for the triangle data, while they are somewhat encouraging for the π-shaped data, where we achieve a lower classification error.
We leave to future research the study of the properties of different agglomerative functions f and g. The strong assumption that points can only be added to a cluster might be relaxed, providing a quasi-hierarchical procedure that might drop some points with poor fit. Finally, observe that parametric models other than linear ones could be used to describe a cluster under specific circumstances.
References

EVERITT, B. (1993): Cluster Analysis. Edward Arnold, London, third edition.
HENNIG, C. (2003): Clusters, outliers, and regression: fixed point clusters. Journal of Multivariate Analysis, 86, 183–212.
KAUFMAN, L., and ROUSSEEUW, P. (1990): Finding Groups in Data. An Introduction to Cluster Analysis. Wiley, New York.
MÜLLER, C.H., and GARLIPP, T. (2005): Simple consistent cluster methods based on redescending M-estimators with an application to edge identification in images. Journal of Multivariate Analysis, 92, 359–385.
ROUSSEEUW, P., and HUBERT, M. (1997): Recent developments in PROGRESS. In: Y. Dodge (Ed.): L1-Statistical Procedures and Related Topics, IMS Lecture Notes, Volume 31, 201–214.
VAN AELST, S., WANG, X., ZAMAR, R., and ZHU, R. (2005): Linear grouping using orthogonal regression. Computational Statistics and Data Analysis, in press.
Deepest Points and Least Deep Points: Robustness and Outliers with MZE

Claudia Becker1 and Sebastian Paris Scholz2

1 Wirtschaftswissenschaftliche Fakultät, Martin-Luther-Universität Halle-Wittenberg, 06099 Halle, Germany
2 Stresemannstr. 50, 47051 Duisburg, Germany
Abstract. Multivariate outlier identification is often based on robust location and scatter estimates and usually performed relative to an elliptically shaped distribution. On the other hand, the idea of outlying observations is closely related to the notion of data depth, where observations with minimum depth are potential outliers. Here, we are not generally bound to the idea of an elliptical shape of the underlying distribution. Koshevoy and Mosler (1997) introduced zonoid trimmed regions which define a data depth. Recently, Paris Scholz (2002) and Becker and Paris Scholz (2004) investigated a new approach for robust estimation of convex bodies resulting from zonoids. We follow their approach and explore how the minimum volume zonoid (MZE) estimators can be used for multivariate outlier identification in the case of non-elliptically shaped null distributions.
1 Introduction
In statistical data analysis, not only do the data sets to be analyzed grow in size, dimension and structural complexity; the analyst is also confronted with situations where the standard assumption of elliptically contoured distributions cannot be maintained. A first step in moving away from this strict assumption may be to assume some sort of convex contours of the underlying model distribution instead. In this case, statistical inference, like the construction of confidence or critical regions or the identification of outliers, has to be based on the estimation of multivariate convex bodies. For elliptically contoured distributions, the regions of interest are ellipsoids, and their estimation can be based on estimators of multivariate location and scatter. Together with the growing demands on statistical methods we see a growing need for developing procedures which can cope with model violations. For large and high-dimensional data, departures from the pattern given by the majority of the data points will be harder to detect and to protect against. One demanding goal hence may be to find procedures which are able to reliably detect outlying observations. In the situation of elliptically contoured distributions, constructing robust estimators of multivariate location and scatter serves this purpose, since good outlier identification rules can be based on them (e.g. Becker and Gather (1999, 2001)). For the case of convex contoured distributions, Paris Scholz (2002) and Becker and Paris
Scholz (2004) propose an approach to robustly estimate a convex body. It is the aim of this paper to investigate whether this approach can also be used to construct useful outlier identification procedures. The paper is organized as follows. In the next section, we briefly review the problem of outlier identification. Section 3 is dedicated to the types of robust estimators used in such outlier identification procedures, and also to the newly introduced estimators for convex bodies, which are based on socalled zonoids. In Section 4, we take the connection between outlyingness and data depth to discuss the idea of zonoid trimmed regions for outlier identification, followed by a proposal for a more robust approach. We finish with some concluding remarks.
2 Outliers
The problem of outlier detection in multivariate samples has been extensively discussed in the literature (see Barnett and Lewis (1994) and Gather and Becker (1997) for an overview of basic concepts). One possible approach is to work with a suitably chosen distance of each observation within a sample of size n, say. This distance is usually calculated from the center of the observations with respect to the sample's scatter. Observations $x_i \in \mathbb{R}^p$ with a distance larger than some appropriate critical value are identified as outliers (e.g. see Barnett and Lewis (1994), Rocke (1996), Rocke and Woodruff (1993), Rousseeuw and Leroy (1987), Rousseeuw and van Zomeren (1990)). This corresponds to the definition of a region of the underlying model distribution where observations will occur only with some low probability α (α outlier region, see Davies and Gather (1993), Becker and Gather (1999)). Figure 1 shows such a region for the bivariate standard normal distribution: outside the marked circle, an observation will only occur with probability α. In the classical approach (Healy (1968)), an observation $x_i$ is declared outlying if its Mahalanobis distance
$$d_i^2 = (x_i - \bar{x}_n)^T S_n^{-1} (x_i - \bar{x}_n)$$
exceeds some quantile of the $\chi^2_p$ distribution, since asymptotically the $d_i^2$ are $\chi^2_p$ distributed. Here, $\bar{x}_n = \sum_{i=1}^n x_i / n$ and $S_n = \sum_{i=1}^n (x_i - \bar{x}_n)(x_i - \bar{x}_n)^T / (n-1)$ denote the sample mean and sample covariance matrix, respectively. Since this rule is not robust against outliers itself, several robustified versions exist. Usually, $\bar{x}_n$ and $S_n$ are replaced by some robust estimators of multivariate location and covariance. In general, this leads to robust outlier identification procedures with respect to some suitably chosen robustness criteria. For example, the use of high-breakdown robust estimators in such procedures bounds the occurrence of masking and swamping effects (Becker and Gather (1999)).
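As an illustration, a minimal sketch of the classical rule is given below (Python with NumPy/SciPy; the function name and the choice of α are ours). The robustified versions mentioned above are obtained by replacing the sample mean and covariance with robust estimates such as MCD.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.025):
    """Classical rule of Healy (1968): flag x_i whose squared Mahalanobis
    distance from the sample mean exceeds the (1 - alpha) chi-square(p) quantile."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))    # sample covariance, n - 1 denominator
    diff = X - xbar
    d2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)  # d_i^2 = (x_i - xbar)' S^{-1} (x_i - xbar)
    return d2 > chi2.ppf(1 - alpha, df=p)             # boolean outlier flags
```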
3 Minimizers and MZE Estimates
High breakdown robust estimators for location and covariance which have become more or less standard during the last years are the minimum volume
Fig. 1. Boundary of α outlier region of the bivariate standard normal
ellipsoid (MVE) and, to a larger extent, the minimum covariance determinant (MCD) estimators (Rousseeuw (1985)), due to growing computer facilities and improved algorithms (e.g. Rousseeuw and van Driessen (1999)). These estimators are solutions of a certain type of minimization problem. The idea is to find an outlier-free subsample of the data and to base the estimation on this "clean" subsample, calculating its mean and its suitably standardized empirical covariance matrix. The corresponding problem is to determine a subset of size h of the given sample such that a measure of variability is minimized over all possible subsets of the data of at least size h. We will call such estimators minimizers (also see Becker and Paris Scholz (2004)). While both MVE and MCD are based essentially on an elliptical structure of the underlying distribution, the minimum volume zonoid estimation (MZE) approach (Paris Scholz (2002)) relates to a convex contoured distributional structure. The MZE minimization problem consists of finding the subsample of size h which minimizes the volume of the estimated centered zonoid. Here, the estimated centered zonoid of a distribution F, based on a sample $\{x_1, \dots, x_n\}$, $x_i \in \mathbb{R}^p$, is defined by
$$\hat{Z}(\tilde{X}_n) = \mathrm{conv}\,\bigcup_{k=0}^{n}\left\{\frac{1}{n}\sum_{j=1}^{k}\tilde{x}_{i_j} : \{i_1, \dots, i_k\} \subseteq \{1, \dots, n\}\right\}$$
(Koshevoy and Mosler (1997, 1998)), where $\mathrm{conv}\{\cdot\}$ denotes the convex hull of a point set, and $\tilde{X}_n = \{\tilde{x}_1, \dots, \tilde{x}_n\} = \{x_1 - \bar{x}_n, \dots, x_n - \bar{x}_n\}$.
The MZE approach, like MVE and MCD, yields affine equivariant location and scatter estimates (Paris Scholz (2002)). For some impressions of the behaviour of these estimates also see Becker and Paris Scholz (2004). Other estimation principles connected with zonoids can be found in Koshevoy et al. (2003). For all three minimizers (MVE, MCD, MZE), the choice of $h = \lfloor(n+p+1)/2\rfloor$ yields pairs of location and covariance estimators with maximum possible finite sample breakdown points (Davies (1987), Lopuhaä and Rousseeuw (1991), Paris Scholz (2002)). Also see Davies and Gather (2005a,b) for intensive discussions of problems connected with such breakdown derivations. More recently, other choices of h have been discussed to obtain estimators with acceptable breakdown and higher efficiency (e.g. Croux and Haesbroeck, 2000). We will see that such considerations also become relevant in outlier identification based on the MZE approach.
4 Outliers, Data Depth, and Robust Trimmed Regions
The approaches of defining outlier regions and identifying observations lying in these regions as outliers with respect to some underlying distribution are – although fairly general in their concept – usually determined by the case of elliptically contoured distributions. On the other hand, the idea of outlyingness of an observation is closely related to the idea of data depth (e.g. Liu (1992)). Outliers, as observations lying "at the outer bounds" of a sample, are least deep points; hence least deep points can be seen as candidates to be potentially identified as outliers. From such considerations comes the approach of defining outlier identification via least deep data points, which is also closely related to outlier identification by trimming. Koshevoy and Mosler (1997) introduce zonoid trimmed regions, which can be used to define a data depth. Paris Scholz (2002) proposes to use this approach for identifying outliers in the case of a convex shaped target distribution. For a sample of size n, an α trimmed zonoid region $ZR_n(\alpha)$ is defined by
$$ZR_n(\alpha) = \mathrm{conv}\left\{\frac{1}{\alpha n}\sum_{j=1}^{k} x_{i_j} + \left(1 - \frac{k}{\alpha n}\right)x_{i_{k+1}} :\ \{i_1, \dots, i_{k+1}\} \subseteq \{1, \dots, n\}\right\}, \qquad \alpha = \frac{k}{n},\ k = 1, \dots, n-1$$
(Koshevoy and Mosler (1997)). Figure 2 shows the 50% trimmed zonoid region for an example of a sample of size n = 10 from a bivariate standard normal. For outlier identification purposes, one could just prescribe α, calculate the α trimmed region and declare all observations lying outside this region as
Fig. 2. Zonoid trimmed region, α = 0.5, n = 10, data from bivariate standard normal
Fig. 3. Zonoid trimmed region, α = 0.5, n = 10, data from bivariate standard normal, one observation replaced by outlier
outliers. The problem with this simple approach is that the α trimmed zonoid regions can themselves be disturbed by outliers, as can be seen in Figure 3. One observation of the sample of Figure 2 is replaced by some point far from the rest of the data, and the trimmed region is heavily influenced by this. Essentially, we have here the same problem as in convex hull peeling. As a first solution to the problem described above, Paris Scholz (2002) proposes to replace the original sample by the MZE sample and hence to use an MZE based estimation of the α trimmed zonoid regions. The advantage of this approach is that estimation of such trimmed regions can be done in a highly robust way (with respect to the criterion of finite sample breakdown). The clear disadvantage is that with this proposal we are completely restricted to the MZE sample. Figure 4 illustrates the problem. The convex contour
Fig. 4. MZE sample for outlier identification, n = 10, p = 2
inscribed into the point cloud shows the convex hull of the MZE sample in this case (note that with n = 10, p = 2, we have h = 6 here). All points lying outside this contour would be potential outliers. It is obvious that the number of potential outliers will usually be too high with this approach. As a remedy we propose to proceed similarly to the trade-off between high breakdown and efficiency for multivariate robust estimators: relax the condition on the size h of the subsamples in the MZE minimization problem. Proceed as follows (a computational sketch is given at the end of this section):
1. Draw all h-subsets of size h ≥ (n + p + 1)/2.
2. Compute the volume of the estimated centered zonotope for all subsets.
3. Choose the subset with the smallest volume for estimation.
We expect that this MZE(h) approach will yield estimators of location, covariance and also zonoid trimmed regions with a lower breakdown point but with depth contours applicable for outlier identification. Figure 5 shows the result for the choice h = 8 > 6 = (n + p + 1)/2. Now only two points are found to be outlying, mirroring the visual impression quite well.
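The following small sketch (Python/NumPy; the function names are ours) illustrates the MZE(h) search. It uses the fact that the estimated centered zonoid is the zonotope generated by the segments [0, (x_i − mean)/n], whose volume equals a sum of absolute determinants over all p-subsets of generators; the brute-force enumeration of h-subsets is only feasible for small n, as in the example of Figure 5.

```python
import numpy as np
from itertools import combinations

def centered_zonoid_volume(X):
    """Volume of the estimated centered zonoid of the rows of X (n x p):
    sum of |det| over all p-subsets of the generators (x_i - mean)/n."""
    n, p = X.shape
    G = (X - X.mean(axis=0)) / n
    return sum(abs(np.linalg.det(G[list(s)])) for s in combinations(range(n), p))

def mze_h_subset(X, h):
    """Brute-force MZE(h): the h-subset whose estimated centered zonoid
    has minimal volume."""
    best, best_vol = None, np.inf
    for s in combinations(range(X.shape[0]), h):
        vol = centered_zonoid_volume(X[list(s)])
        if vol < best_vol:
            best, best_vol = s, vol
    return np.array(best), best_vol

# For the situation of Figure 5 (n = 10, p = 2, h = 8), points outside the
# convex hull of X[mze_h_subset(X, 8)[0]] would be flagged as potential outliers.
```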
5 Conclusions
Based on the notion of zonoids, introduced by Koshevoy and Mosler (1997, 1998), Paris Scholz (2002) proposed a way to compute robust estimators of
Fig. 5. MZE(h) sample for outlier identification, n = 10, p = 2, h = 8
multivariate location, covariance, and convex bodies, belonging to the class of so-called minimizers. In this paper, we investigated how this approach could be used for outlier identification with respect to multivariate distributions which are no longer elliptically shaped but show convex contours. We found that the approach can also be extended to identifying outliers, when relaxing the strong goal of gaining estimators with maximum breakdown point. Similar to discussions in the robustness literature, where the breakdown demand is relaxed in favour of efficiency, we recommend here to pursue the goal of using the MZE based outlier identification approach for samples with moderate numbers of outliers instead of constructing insurance against maximum possible numbers of outlying observations. If we try the latter, the method investigated here will tend to overestimate the number of outliers, as could be seen in the examples. Finally, changing the point of view from outlier identification to finding outlier free data subsets, the MZE procedure could hence also provide an alternative way of finding such subsets as starting sets for procedures like forward search based methods (see for example Atkinson et al. (2004)).
References

ATKINSON, A.C., RIANI, M., and CERIOLI, A. (2004): Exploring multivariate data with the forward search. Springer, New York.
BARNETT, V., and LEWIS, T. (1994): Outliers in statistical data. 3rd ed., Wiley, New York.
BECKER, C., and GATHER, U. (1999): The masking breakdown point of multivariate outlier identification rules. J. Amer. Statist. Assoc., 94, 947–955.
BECKER, C., and GATHER, U. (2001): The largest nonidentifiable outlier: A comparison of multivariate simultaneous outlier identification rules. Comput. Statist. and Data Anal., 36, 119–127.
BECKER, C., and PARIS SCHOLZ, S. (2004): MVE, MCD, and MZE: A simulation study comparing convex body minimizers. Allgemeines Statistisches Archiv, 88, 155–162.
CROUX, C., and HAESBROECK, G. (2000): Principal component analysis based on robust estimators of the covariance or correlation matrix: Influence functions and efficiencies. Biometrika, 87, 603–618.
DAVIES, P.L. (1987): Asymptotic behaviour of S-estimates of multivariate location parameters and dispersion matrices. Ann. Statist., 15, 1269–1292.
DAVIES, P.L., and GATHER, U. (1993): The identification of multiple outliers. Invited paper with discussion and rejoinder. J. Amer. Statist. Assoc., 88, 782–801.
DAVIES, P.L., and GATHER, U. (2005a): Breakdown and groups (with discussion and rejoinder). To appear in Ann. Statist.
DAVIES, P.L., and GATHER, U. (2005b): Breakdown and groups II. To appear in Ann. Statist.
GATHER, U., and BECKER, C. (1997): Outlier identification and robust methods. In: G.S. Maddala and C.R. Rao (Eds.): Handbook of statistics, Vol. 15: Robust inference. Elsevier, Amsterdam, 123–143.
HEALY, M.J.R. (1968): Multivariate normal plotting. Applied Statistics, 17, 157–161.
KOSHEVOY, G., and MOSLER, K. (1997): Zonoid trimming for multivariate distributions. Ann. Statist., 9, 1998–2017.
KOSHEVOY, G., and MOSLER, K. (1998): Lift zonoids, random convex hulls, and the variability of random vectors. Bernoulli, 4, 377–399.
KOSHEVOY, G., MÖTTÖNEN, J., and OJA, H. (2003): A scatter matrix estimate based on the zonotope. Ann. Statist., 31, 1439–1459.
LIU, R.Y. (1992): Data depth and multivariate rank tests. In: Y. Dodge (Ed.): L1-Statistical analysis and related methods. North Holland, Amsterdam, 279–294.
LOPUHAÄ, H.P., and ROUSSEEUW, P.J. (1991): Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. Ann. Statist., 19, 229–248.
PARIS SCHOLZ, S. (2002): Robustness concepts and investigations for estimators of convex bodies. Thesis, Department of Statistics, University of Dortmund (in German).
ROCKE, D.M. (1996): Robustness properties of S-estimators of multivariate location and shape in high dimension. Ann. Statist., 24, 1327–1345.
ROUSSEEUW, P.J. (1985): Multivariate estimation with high breakdown point. In: W. Grossmann, G. Pflug, I. Vincze, W. Wertz (Eds.): Mathematical statistics and applications, Vol. 8. Reidel, Dordrecht, 283–297.
ROUSSEEUW, P.J., and VAN DRIESSEN, K. (1999): A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41, 212–223.
ROUSSEEUW, P.J., and LEROY, A.M. (1987): Robust regression and outlier detection. Wiley, New York.
Robust Transformations and Outlier Detection with Autocorrelated Data

Andrea Cerioli and Marco Riani

Department of Economics – Section of Statistics, University of Parma, 43100 Parma, Italy
Abstract. The analysis of regression data is often improved by using a transformation of the response rather than the original response itself. However, finding a suitable transformation can be strongly affected by the influence of a few individual observations. Outliers can have an enormous impact on the fitting of statistical models and can be hard to detect due to masking and swamping. These difficulties are enhanced in the case of models for dependent observations, since any anomalies are with respect to the specific autocorrelation structure of the model. In this paper we develop a forward search approach which is able to robustly estimate the Box-Cox transformation parameter under a first-order spatial autoregression model.
1 Introduction
The development of robust high-breakdown methods for spatially autocorrelated data is an important research topic. Models for such data are usually fitted through maximum likelihood under a Gaussian assumption. It is notorious that maximum likelihood estimation is not robust to the presence of outliers. Furthermore, spatial autocorrelation can be the reason for additional troubles in the outlier detection process, since any anomalies have to be checked with respect to the assumed spatial model and neighbourhood structure. Cressie (1993) provides a wide description of exploratory tools that can be applied to uncover spatial outliers and that make use of neighbourhood information. However, these methods are based on case-deletion diagnostics and are prone to masking and swamping with a cluster of spatial outliers. Most high-breakdown methods for regression and multivariate estimation, such as least median of squares regression and minimum volume ellipsoid estimation (Rousseeuw and van Zomeren (1990)), are difficult to extend to autocorrelated observations, both conceptually and computationally. Cerioli and Riani (2002) and Atkinson et al. (2004) suggest a forward search approach to robustly fit spatial models. Their technique rests upon a computationally simple and statistically efficient forward algorithm, where at each step observations are added to the fitted subset in such a way that outliers and influential observations enter at the end. In this paper we show how the forward search approach can be extended to robustly improve normality of spatially autocorrelated data, a topic that
has been somewhat neglected in the statistical literature (Griffith and Layne (1999) and Pace et al. (2004) are two non-robust exceptions). Specifically, in §2 we focus on the popular first-order Simultaneous Autoregressive (SAR) model. Transformation of the response using the Box-Cox family of power transformations is considered in §3, where we introduce the notion of a transformed SAR model. §4 gives an overview of the forward search algorithm used for fitting the transformed model. The usefulness of our method is shown in §5 through a number of examples.
2 The Simultaneous Autoregressive (SAR) Model
Let S ≡ {s_1, . . . , s_n} be a collection of n spatial locations and y_i be a random variable observed at site s_i, i = 1, . . . , n. Spatial relationships between pairs of locations are represented through the simple weighting scheme

$$w_{ij} = \begin{cases} 1 & \text{if sites } s_i \text{ and } s_j \text{ are neighbours,} \\ 0 & \text{otherwise,} \end{cases}$$

and w_{ii} = 0. For a regular grid the most common definition of a neighbourhood structure is that for which w_{ij} = 1 if s_j is immediately to the north, south, east or west of s_i. We write y = (y_1, . . . , y_n)' and W = (w_{ij}) for i, j = 1, . . . , n.

Edge points typically raise problems in the statistical analysis of spatial systems. The basic difficulty is that they have fewer neighbours than interior points. For this reason we assume that, whenever possible, W has been suitably modified to account for edge effects. A simple but widely adopted technique is toroidal correction, which wraps a rectangular region onto a torus. Edge points on opposite borders are thus considered to be close, and all sites have the same number of neighbours.

At each location we might have additional non-stochastic information about p − 1 spatial covariates. Let X denote the corresponding design matrix of dimension n × p, allowing also for the mean effect. The first-order Simultaneous Autoregressive (SAR) model is defined as (Cressie, 1993)

$$(I_n - \rho W)(y - X\beta) = \varepsilon, \qquad (1)$$

where β = (β_0, . . . , β_{p−1})' is a p-dimensional parameter vector, I_n is the n × n identity matrix, ρ is a measure of spatial interaction between neighbouring sites, and ε = (ε_1, . . . , ε_n)' is an n-dimensional vector of disturbances. Errors ε_i are defined to be independent and normally distributed with mean 0 and common variance σ². It is assumed that (I_n − ρW)^{−1} exists. It is not essential for W to be symmetric, although in practice this is often the case. Estimation of parameters in (1) is by maximization of the likelihood

$$l(\beta, \sigma^2, \rho) = (2\pi\sigma^2)^{-n/2}\,|I_n - \rho W|\,\exp\Big\{-\frac{1}{2\sigma^2}(y - X\beta)'\Sigma(y - X\beta)\Big\}, \qquad (2)$$

where

$$\Sigma = (I_n - \rho W)'(I_n - \rho W).$$
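To make the ingredients of (1) and (2) concrete, the following sketch (our own, with hypothetical function names) builds the weight matrix W for a regular grid with toroidal edge correction and evaluates the Gaussian log-likelihood corresponding to (2).

```python
# Sketch of the building blocks of model (1)-(2); the function names are ours.
import numpy as np

def torus_weight_matrix(nrow, ncol):
    """First-order (north/south/east/west) neighbour matrix on a regular grid,
    wrapped onto a torus so that every site has exactly four neighbours."""
    n = nrow * ncol
    W = np.zeros((n, n))
    for r in range(nrow):
        for c in range(ncol):
            i = r * ncol + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                W[i, ((r + dr) % nrow) * ncol + (c + dc) % ncol] = 1.0
    return W

def sar_loglik(beta, sigma2, rho, y, X, W):
    """Log of the likelihood (2); |I_n - rho W| is assumed positive for the
    admissible values of rho."""
    n = len(y)
    A = np.eye(n) - rho * W
    r = A @ (y - X @ beta)            # (I_n - rho W)(y - X beta), so r'r = (y-Xb)' Sigma (y-Xb)
    _, logdet = np.linalg.slogdet(A)  # log |I_n - rho W|
    return -0.5 * n * np.log(2 * np.pi * sigma2) + logdet - 0.5 * (r @ r) / sigma2
```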
3 The Transformed SAR Model
One crucial assumption underlying model (1) is normality of the additive errors ε_i. If this requirement is not satisfied on the original scale of measurement of the response, there may be a nonlinear transformation of y which yields normality, at least approximately. In this paper we adopt the popular class of nonlinear transformations suggested by Box and Cox (1964). Let y_* = (y_{*1}, . . . , y_{*n})' = (I_n − ρW)y. Under the first-order SAR model (1), y_* has mean (I_n − ρW)Xβ and scalar covariance matrix. For this modified data vector, the Box-Cox normalized power transformation to normality is

$$z(\lambda) = \begin{cases} \dfrac{y_*^{\lambda} - 1}{\lambda\,\dot{y}_*^{\lambda-1}} & \lambda \neq 0 \\[2mm] \dot{y}_*\,\log y_* & \lambda = 0, \end{cases} \qquad (3)$$

where $\dot{y}_* = \exp(\sum_i \log y_{*i}/n)$ is the geometric mean of y_{*1}, . . . , y_{*n}. We define the transformed SAR model to be a linear regression model with response z(λ), design matrix (I_n − ρW)X and Gaussian disturbance ε, as in model (1). That is,

$$z(\lambda) = (I_n - \rho W)X\beta + \varepsilon. \qquad (4)$$

When λ = 1 there is no transformation and we recover the standard SAR model; λ = 1/2 is the square root transformation, λ = 0 gives the log transformation and λ = −1 the reciprocal. These are the most widely used transformations in practical applications.

Maximum likelihood estimation of λ could be performed by suitable modification of equation (2). However, likelihood analysis for spatial Gaussian processes can encounter numerical difficulties, such as convexity or multimodality of the resulting profile likelihood function (see Ripley (1988), §2.1), and adding a further parameter to (2) might have unpredictable consequences. In addition, repeated evaluation of model (4) at subsequent steps of the forward algorithm requires a fast computational procedure. For these reasons, we do not resort to numerical maximization of the likelihood function with respect to the extended parameter set (β, σ², ρ, λ). Following Atkinson and Riani (2000, §4.2), we instead derive an approximate score statistic by Taylor series expansion of (3) about a known value λ_0. The score statistic does not require computation of the maximum likelihood estimate of λ. As a result, the transformed SAR model (4) is approximated as

$$z(\lambda_0) \doteq (I_n - \rho W)X\beta + \gamma\,w(\lambda_0) + \varepsilon, \qquad (5)$$

where $\gamma = -(\lambda - \lambda_0)$ and $w(\lambda_0) = \partial z(\lambda)/\partial\lambda\,\big|_{\lambda=\lambda_0}$ is known as a constructed variable. The t test for γ = 0 in model (5) is then the approximate score statistic for testing

$$H_0: \lambda = \lambda_0 \qquad (6)$$

in the transformed SAR model (4). This statistic makes proper allowance for spatial autocorrelation in the process of finding the best transformation, thus achieving the desirable goal of a joint spatial and transformation analysis.
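The following sketch (ours; the function names are hypothetical) evaluates the transformation (3) and the approximate score test derived from (5)–(6), treating ρ as known (plugged in) and obtaining the constructed variable by numerical differentiation rather than an analytic formula.

```python
# Sketch (ours) of the transformation (3) and the approximate score test of
# (5)-(6); rho is treated as known and w(lambda0) is a numerical derivative.
import numpy as np

def z_boxcox(y_star, lam):
    """Normalized Box-Cox transformation (3); y_star is assumed positive."""
    gm = np.exp(np.mean(np.log(y_star)))            # geometric mean of y*
    if lam == 0:
        return gm * np.log(y_star)
    return (y_star**lam - 1.0) / (lam * gm**(lam - 1.0))

def score_test_lambda(y, X, W, rho, lam0, eps=1e-4):
    """t statistic for gamma = 0 in the auxiliary regression (5)."""
    n = len(y)
    A = np.eye(n) - rho * W
    y_star, X_star = A @ y, A @ X
    z0 = z_boxcox(y_star, lam0)
    w0 = (z_boxcox(y_star, lam0 + eps) - z_boxcox(y_star, lam0 - eps)) / (2 * eps)
    D = np.column_stack([X_star, w0])               # regressors of model (5)
    coef, *_ = np.linalg.lstsq(D, z0, rcond=None)
    resid = z0 - D @ coef
    s2 = (resid @ resid) / (n - D.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(D.T @ D)[-1, -1])
    return coef[-1] / se                            # approximate score statistic
```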
4 Robust Fitting of the Transformed SAR Model and Diagnostic Monitoring
The transformed SAR model is repeatedly fitted through a block forward search (BFS) algorithm similar to the one suggested by Cerioli and Riani (2002). This algorithm is both efficient and robust. It is efficient because it makes use of the Gaussian likelihood machinery underlying models (1) and (4). It is robust because the outliers enter in the last steps of the procedure and their effect on the parameter estimates is clearly depicted. More generally, our approach allows evaluation of the inferential effect each location, either outlying or not, exerts on the fitted model. The key features of the BFS for finding the best transformation under model (4) are summarized as follows.

Choice of the initial subset. We take blocks of contiguous spatial locations as the basic elemental sets of our algorithm. Blocks are intended to retain the spatial dependence properties of the whole study region and are defined to resemble as closely as possible the shape of that region. Confining attention to subsets of neighbouring locations ensures that spatial relationships are preserved by the BFS algorithm, so that ρ can be consistently estimated within each block. Atkinson et al. (2004) provide details about practical selection of blocks and empirical evidence of the effects produced by different choices. The initial subset for the BFS algorithm is then obtained without loss of generality through a least median of squares criterion applied to blocks.

Progressing in the search. The transformed SAR model is repeatedly fitted to subsets of observations of increasing sizes, selected in such a way that outliers are included only at the end of the search. For this reason, progression in the BFS algorithm is performed by looking at the smallest squared standardized regression residuals from the fit at the preceding step. At each step, model (4) can be fitted either by exact maximum likelihood given the available data subset, or by a faster approximation to it. The weight matrix W is usually corrected for edge effects for the reason sketched in §2.

Diagnostic monitoring. One major advantage of the forward search over other high-breakdown techniques is that a number of diagnostic measures can be computed and monitored as the algorithm progresses. Under model (4), we are particularly interested in producing forward plots of regression parameter estimates and transformation statistics. In the latter instance, we produce forward plots of the approximate score statistic for testing (6) under different values λ0, using a separate search for each λ0. These plots are then combined into a single picture which is named a “fan plot” after Atkinson and Riani (2000, p. 89). In most applications five values of λ0 are sufficient for selecting the appropriate transformation: 1, 0.5, 0, -0.5, -1, thus running from no transformation to the reciprocal.
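The progression step can be summarized schematically as follows; this is our own skeleton, not the authors' block implementation, and the `fit` argument is a placeholder for refitting the transformed SAR model on the current subset.

```python
# Schematic skeleton of the forward progression (not the authors' block
# implementation); `fit(subset)` is a placeholder that refits the transformed
# SAR model on the given subset and returns the standardized residuals for
# all n observations together with the score statistic of interest.
import numpy as np

def forward_search(n, initial_subset, fit):
    subset = list(initial_subset)
    trace = []
    while True:
        resid, score = fit(subset)                  # fit model (4)/(5) on the current subset
        trace.append((len(subset), score))
        if len(subset) == n:
            return trace                            # forward plot: score vs. subset size
        order = np.argsort(np.asarray(resid)**2)    # smallest squared standardized residuals
        subset = list(order[:len(subset) + 1])      # next subset has one more observation
```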
[Figure: forward plots of the score test (with torus correction) against subset size]
Fig. 1. Fan plot for the clean dataset of Example 1.
5 Examples

5.1 Example 1: Clean Data
In our first example we analyze the behaviour of the BFS algorithm for robustly fitting the transformed SAR model (4) in a dataset without outliers, to check that it does not produce spurious information. We first simulate n = 256 observations from model (1), with S a 16 × 16 regular grid, ρ = 0.1, p = 4, wij = 1 if sj is immediately to the north, south, east or west of si , and toroidal edge correction. Then we square the response values. The dataset is available at http://www.riani.it/gfkl2005. Any sensible transformation analysis should point to the square root transformation of y, i.e. to λ = 0.5. Figure 1 is the fan plot showing the forward plots of the approximate score statistic for testing hypothesis (6) under six values λ0 , ranging from -1 to 2, when the BFS is run with blocks of size 4 × 4 and toroidal edge correction. The central horizontal bands are at ±2.58, the 99% percentage points of the reference asymptotic normal distribution. The fan plot clearly depicts the correct transformation λ = 0.5, as the corresponding score statistic varies around zero along the search. Evidence against the other values of λ increases as the fitting subset grows. There is no effect of outlying observations at the end of the search. We conclude that our method provides the appropriate transformation, as well as the effect on the choice of λ exerted by each spatial location, in this “clean” example. We complement our transformation analysis by seeing how the forward plot of the maximum likelihood estimate of ρ changes under different values of λ. The corresponding plots are in Figure 2. Apart from the initial steps, where results from the search may be unstable, it is seen that estimation of ρ is not much affected by the specific transformation parameter. This indicates lack of appreciable interaction between the strength of spatial autocorrelation and the scale on which y is represented.
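For readers who wish to reproduce a data set of this type, the following sketch generates data in the spirit of Example 1 (16 × 16 torus grid, ρ = 0.1, p = 4, squared response). The design matrix and the coefficient vector are arbitrary placeholders, since the paper reports the simulation settings but not the exact values used; the intercept is chosen here only so that the response stays positive before squaring.

```python
# Sketch of a data set in the spirit of Example 1; design and coefficients
# below are placeholders, not the values actually used by the authors.
import numpy as np

rng = np.random.default_rng(1)
nrow = ncol = 16
n, p, rho = nrow * ncol, 4, 0.1

W = np.zeros((n, n))                       # first-order neighbours, torus-wrapped
for r in range(nrow):
    for c in range(ncol):
        i = r * ncol + c
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            W[i, ((r + dr) % nrow) * ncol + (c + dc) % ncol] = 1.0

X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([10.0, 1.0, 1.0, 1.0])     # placeholder coefficients
eps = rng.normal(size=n)
y = X @ beta + np.linalg.solve(np.eye(n) - rho * W, eps)   # model (1)
y_obs = y**2          # squared response: lambda = 0.5 should be recovered
```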
[Figure: forward plots of the autocorrelation parameter (with torus correction) against subset size under different transformations]
Fig. 2. Example 1. Forward plots of the maximum likelihood estimate of ρ under different transformations.
[Figure: forward plots of the score test (with torus correction) against subset size]
Fig. 3. Fan plot for the contaminated dataset of Example 2.
5.2 Example 2: Contaminated Data
In our second example we evaluate the robustness properties of the BFS approach for transformation to normality with correlated data. For this purpose, we introduce a cluster of 16 spatial outliers in the simulated dataset of Example 1, by modifying the response values in the 4×4 area in the left-hand corner of S. Also this dataset is available at http://www.riani.it/gfkl2005. The outliers are masked and hard to detect by standard exploratory methods, such as visual inspection of the scatterplot matrix and diagnosis of the regression residuals. On the contrary, Figure 3 is extremely clear in picturing the influence that the outliers have on the selection of the transformation parameter. The true λ = 0.5 is correctly supported by all the non-contaminated data: the forward plot of the score statistic for testing λ = 0.5 varies around zero until the first spatial outlier is included in the fitted subset, at step
[Figure: forward plots of the autocorrelation parameter (with torus correction) against subset size under different transformations]
Fig. 4. Example 2. Forward plots of the maximum likelihood estimate of ρ under different transformations.
241. Even allowing for spatial autocorrelation, progressive inclusion of the outliers renders the correct transformation increasingly less plausible. Non-robust transformation analysis based on all the data would then wrongly suggest that this dataset does not need to be transformed (λ = 1). Furthermore, the outliers now have a disproportionate effect on estimation of ρ, again irrespective of the value of λ (Figure 4).

5.3 Example 3: Simulation Envelopes
In §3 we stressed the point that the approximate score statistic for testing (6) in the transformed SAR model makes proper allowance for spatial autocorrelation. However, it is not known how well the asymptotic normal distribution approximates the true null distribution of the score statistic in small or moderate spatial samples. Therefore, it is useful to provide simulation evidence of the finite sample accuracy of approximation (5) and of the effect of spatial autocorrelation on the actual significance level of the score statistic.

Figure 5 reports 90%, 95% and 99% envelopes of the distribution of the score statistic obtained from 200 independent simulations of the transformed SAR model (4) under the null hypothesis, in the setting of Example 2 with ρ estimated at the step before the inclusion of the first outlier. These envelopes are compared with the corresponding percentage points of the normal distribution (the horizontal lines in the figure). After the first steps, it is seen that there is good agreement between asymptotic and simulated percentage points. This result strengthens our confidence in pointwise inference based on simple displays such as the fan plots of Figures 1 and 3 with spatially autocorrelated data, at least when the sample size is moderately large and the transformed SAR model fits the data well. Some preliminary simulation results (not reported here) seem to show that, for a fixed sample size, the accuracy of the normal approximation
[Figure: score test against subset size, with simulated envelope bands and asymptotic percentage points]
Fig. 5. Simulation envelopes (dashed bands) and asymptotic percentage points (solid lines) of the score statistic for testing λ = λ0, in the setting of Example 2.
deteriorates as the model fit worsens. The development of a general approach for calibrating asymptotic confidence bands of the score statistic under different SAR model fits is currently under investigation.
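Given a matrix of score-statistic trajectories simulated under the null hypothesis (obtained by repeatedly running the forward search on data generated from the fitted null model), pointwise envelopes such as those in Figure 5 reduce to simple quantiles, as in the following sketch; the placeholder trajectories below are random noise, purely to keep the snippet self-contained.

```python
# Sketch of the envelope computation: one row per null simulation, one column
# per subset size; the placeholder trajectories are standard normal noise.
import numpy as np

rng = np.random.default_rng(2)
score_paths = rng.normal(size=(200, 240))            # placeholder for 200 null simulations

asymptotic = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}
envelopes = {}
for level, z in asymptotic.items():
    lo = np.quantile(score_paths, (1 - level) / 2, axis=0)   # pointwise lower band
    hi = np.quantile(score_paths, (1 + level) / 2, axis=0)   # pointwise upper band
    envelopes[level] = (lo, hi)            # to be compared with the lines at -z and +z
```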
References

ATKINSON, A.C. and RIANI, M. (2000): Robust Diagnostic Regression Analysis. Springer, New York.
ATKINSON, A.C., RIANI, M. and CERIOLI, A. (2004): Exploring Multivariate Data with the Forward Search. Springer, New York.
BOX, G.E.P. and COX, D.R. (1964): An Analysis of Transformations (with discussion). Journal of the Royal Statistical Society B, 26, 211–246.
CERIOLI, A. and RIANI, M. (2002): Robust Methods for the Analysis of Spatially Autocorrelated Data. Statistical Methods and Applications - Journal of the Italian Statistical Society, 11, 335–358.
CRESSIE, N.A.C. (1993): Statistics for Spatial Data. Wiley, New York.
GRIFFITH, D.A. and LAYNE, L.J. (1999): A Casebook for Spatial Statistical Data Analysis. Oxford University Press, New York.
PACE, R.K., BARRY, R., SLAWSON, V.C. Jr. and SIRMANS, C.F. (2004): Simultaneous Spatial and Functional Form Transformations. In: L. Anselin, R.J.G.M. Florax and S.J. Rey (Eds.): Advances in Spatial Econometrics. Springer, New York.
RIPLEY, B.D. (1988): Statistical Inference for Spatial Processes. Cambridge University Press, Cambridge.
ROUSSEEUW, P.J. and van ZOMEREN, B.C. (1990): Unmasking Multivariate Outliers and Leverage Points. Journal of the American Statistical Association, 85, 633–639.
Robust Multivariate Methods: The Projection Pursuit Approach

Peter Filzmoser¹, Sven Serneels², Christophe Croux³, and Pierre J. Van Espen²

¹ Department of Statistics and Probability Theory, Vienna University of Technology, A-1040 Vienna, Austria
² Department of Chemistry, University of Antwerp, B-2610 Antwerp, Belgium
³ Department of Applied Economics, K.U. Leuven, B-3000 Leuven, Belgium
Abstract. Projection pursuit was originally introduced to identify structures in multivariate data clouds (Huber, 1985). The idea of projecting data to a low-dimensional subspace can also be applied to multivariate statistical methods. Robustness of the methods can be achieved by applying robust estimators in the lower-dimensional space. Robust estimation in high dimensions can thus be avoided, which usually results in faster computation. Moreover, flat data sets, where the number of variables is much higher than the number of observations, can more easily be analyzed in a robust way. We will focus on the projection pursuit approach for robust continuum regression (Serneels et al., 2005). A new algorithm is introduced and compared with the reference algorithm as well as with classical continuum regression.
1 Introduction
Multivariate statistical methods are often based on analyzing covariance structures. Principal Component Analysis (PCA) for example corresponds to a transformation of the data to a new coordinate system where the directions of the new axes are determined by the eigenvectors of the covariance matrix of the data. In factor analysis the covariance or correlation matrix of the data is the basis for determining the new factors, where usually the diagonal of this scatter matrix is reduced by a variance part that is unique for each variable (“uniqueness”). In Canonical Correlation Analysis (CCA) one is concerned with two sets of variables that have been observed on the same objects, and the goal is to determine new directions in each of the sets with maximal correlation. The problem comes down to an eigenvector decomposition of a matrix that uses information of the joint covariance matrix of the two variable sets. In discriminant analysis the group centers and group covariance matrices are used for finding discriminant rules that are able to separate two or more groups of data coming from different populations. Traditionally, the population covariance matrix is estimated by the empirical sample covariance matrix. However, it is well known that outliers in
the data can have a severe influence on this estimator (see, e.g., Hampel et al., 1986). For this reason, more robust scatter estimators have been introduced in the literature; for a review see Maronna and Yohai (1998). Although robustness is paid for by lower efficiency of the estimator and a higher computational effort, the resulting estimation will usually be more reliable for the data at hand. Plugging robust covariance matrices into the aforementioned methods leads to robust counterparts of the multivariate methods. The robustness properties of the resulting estimators have been studied, e.g. by Croux and Haesbroeck (2000) for PCA, or Pison et al. (2003) for factor analysis.

There exists another approach to robustify multivariate methods, without going through a robust estimate of the covariance structure. This so-called Projection Pursuit (PP) approach uses the idea of projecting the multivariate data onto a lower dimensional space where robust estimation is much easier. PP was initially proposed by Friedman and Tukey (1974), and the original goal was to pursue directions that show the structure of the multivariate data if projected on these directions. This is done by maximizing a PP index, and the direction(s) resulting in a (local) maximum of the index are considered to reveal interesting data structures. Huber (1985) pointed out that PCA is a special case of PP, where the PP index is the variance of the projected data, and where orthogonality constraints have to be included in the maximization procedure. Li and Chen (1985) used this approach to robustify PCA by taking a robust scale estimator. Croux and Ruiz-Gazen (2005) investigated the robustness properties of this robust PCA approach, and they introduced an algorithm for fast computation. Robust estimation using PP was also considered for canonical correlation analysis (Branco et al., 2005), and this approach was compared with the method of robustly estimating the joint covariance matrix and with a robust alternating regression method.

The PP approach has several advantages, including the following:
(a) As mentioned earlier, robust estimation in lower dimension is computationally easier and faster, although on the other hand the search for “interesting” projection directions is again time consuming.
(b) Robust covariance estimation is limited to data sets where the number of observations is larger than the number of variables. Thus, for many problems – like in chemometrics – PP based methods are the methods of choice for a robust data analysis.
(c) The search for projection directions is sequential. Thus, the user can determine a certain number of directions he/she is interested in, and is not forced to perform a complete eigenanalysis of the covariance matrix. Especially for high dimensional problems the computation time can be reduced drastically by PP based methods as the number of interesting directions to be considered is often small.

In this article we will focus on Continuum Regression (CR), a multivariate method introduced by Stone and Brooks (1990) that combines ordinary least
squares, partial least squares and principal components regression. Serneels et al. (2005) introduced robust CR using the PP approach. In the next section we will describe CR and outline how the parameters can be estimated in a robust way. A new algorithm for computation will be introduced in Section 3, and the precision of this algorithm will be compared with the proposed algorithm of Serneels et al. (2005). Section 4 underlines the robustness of this method by presenting simulation results for the case of outliers in the space of the regressor variables. The final section provides a summary.
2 Robust Continuum Regression by Projection Pursuit
CR is a regression technique that was designed for problems with high dimensional regressors and few observations. Therefore, let X be the n × p matrix of regressors where typically n ≪ p. Let y be a vector with n observations of the response variable. Like in the regression setting, the model

$$y = X\beta + \varepsilon \qquad (1)$$

with the error term ε is considered and the focus is on estimating the regression coefficients β. Since the regressors are usually highly collinear, the coefficients β are not directly estimated, but a so-called latent variable model

$$y = T_h\,\xi + \varepsilon \qquad (2)$$
with new regression coefficients ξ is considered. The score matrix T_h is of size n × h and h, the number of latent variables, is taken much smaller than p. The score matrix is related with the original regressors through T_h = XW_h with W_h = (w_1, . . . , w_h) being a matrix with weights. The weight vectors are defined by

$$w_i = \operatorname{argmax}_a \big\{\operatorname{Cov}(Xa, y)^2\, \operatorname{Var}(Xa)^{\frac{\delta}{1-\delta}-1}\big\} \qquad (3)$$

(i = 1, . . . , h) under the constraints

$$\|w_i\| = 1 \quad \text{and} \quad \operatorname{Cov}(Xw_i, Xw_j) = 0 \ \text{ for } j < i. \qquad (4)$$
The tuning parameter δ can be chosen in the interval [0, 1]. By taking δ = 0 the criterion corresponds to ordinary least squares, δ = 0.5 is the Partial Least Squares (PLS) criterion, and δ = 1 results in principal component regression (see Stone and Brooks, 1990). The definition (3) of the weight vectors can be understood as PP index that has to be maximized for a projection direction a, and for subsequent projection directions the constraints (4) have to be fulfilled. The typically high dimensional regressor matrix X is projected to one dimension, namely Xa, and the variance “Var” of the projected data as well as the covariance
“Cov” between two univariate variables are the basis for finding the weight vectors. “Var” and “Cov” are usually taken as sample variance and covariance estimators, respectively. By using more robust estimators instead, the influence of outliers will be reduced and the projection directions will be determined in a robust manner, resulting in a robust CR method. Serneels et al. (2005) suggested to take the α-trimmed variance and covariance because these estimators are easy to understand and fast to compute. The algorithm for (robust) CR based on PP can be summarized as follows:

(a) Fix the number h of latent variables and the tuning parameter δ. The appropriate choice of h and δ is described in Serneels et al. (2005).
(b) Define E_1 as the mean centered data matrix X. For robust CR, robust mean centering can be achieved by using the L1-median (for an efficient algorithm see Hössjer and Croux, 1995).
(c) Suppose that the weight vectors W_{i−1} = (w_1, . . . , w_{i−1}) have already been computed.
  (i) The i-th weight vector w_i is determined according to criterion (3) by scanning the projection directions a. In Section 3 we will provide more details on this. Multiplying the matrix E_i (see below) with these weights gives the i-th score vector t_i.
  (ii) The parameter vector ξ in model (2) is estimated by ordinary least squares in the classical case and in the robust case by any robust regression method, like Huber M-regression (Huber, 1981). Premultiplication with W_{i−1} gives the estimation of the coefficients β in the original model (1).
  (iii) Carry out a deflation in order to fulfill the model constraints (4):

$$E_{i+1} = \Big(I_n - \sum_{j=1}^{i-1} \frac{t_j t_j'}{t_j' t_j}\Big) X. \qquad (5)$$

3 Algorithms for Finding the PP Directions
A crucial point of CR is the maximization of the criterion (3) for the weights. In principle, all possible projection directions a ∈ ℝ^p have to be scanned, which is impossible especially in situations where p is large. For this reason, the number of candidate directions is limited to a set that is still computable in reasonable time. Serneels et al. (2005) suggested to construct k directions that are arbitrary linear combinations of the n data points at hand, the first n directions being directly the n observations. The computation time as well as the precision of this algorithm will thus strongly depend on the number k of candidate directions. Here a new algorithm will be introduced and compared with the other proposal. This so-called grid algorithm works as follows. Let x_i (i = 1, . . . , p) be the columns, or “variables”, of the data matrix X.
(a) If p = 2:
  (i) A first approximation a_1 of the projection direction a is obtained by maximizing

$$C(\gamma_{1j}x_1 + \gamma_{2j}x_2) = \operatorname{Cov}(\gamma_{1j}x_1 + \gamma_{2j}x_2,\, y)^2\, \operatorname{Var}(\gamma_{1j}x_1 + \gamma_{2j}x_2)^{\frac{\delta}{1-\delta}-1} \qquad (6)$$

under the constraints $\gamma_{1j}^2 + \gamma_{2j}^2 = 1$ for j = 1, . . . , N. The unknowns γ_{1j} and γ_{2j} are the coordinates of G grid points regularly chosen on the unit circle in the interval [−π/2, π/2), and the maximum is taken among these G candidate directions.
  (ii) The second approximation a_2 is searched like before, but in a smaller interval [−π/(2f), π/(2f)) with f = 2. In each new iteration f is increased by 1, until after F interval halving steps the grid is fine enough to leave the solution essentially unchanged (marginal improvement smaller than a tolerance bound).
(b) If p > 2:
  (i) Compute for each regressor variable i = 1, . . . , p the value of the objective function

$$C(x_i) = \operatorname{Cov}(x_i, y)^2\, \operatorname{Var}(x_i)^{\frac{\delta}{1-\delta}-1} \qquad (7)$$
and sort the variables x_(1), . . . , x_(p), being in the columns of X, according to C(x_(1)) ≥ C(x_(2)) ≥ . . . ≥ C(x_(p)).
  (ii) The maximization is done now in the plane like in (a): Maximizing C(γ_{1j}x_(1) + γ_{2j}x_(2)) results in the approximation a_(1). A next approximation a_(2) is obtained by maximizing C(γ_{1j}Xa_(1) + γ_{2j}x_(3)). This procedure is repeated until the last variable has entered the optimization. In a next cycle each variable is considered again for improving the value of the objective function. The algorithm terminates when the improvement is considered to be marginal.

The precision of both algorithms is computed using the “Fearn” data (Fearn, 1983), which consists of 24 observations and 6 regressor variables. For δ = 0.5 we compute all h = 6 latent variables. Since δ = 0.5 corresponds to PLS, the solutions of both algorithms can be compared with the exact solution resulting from the SIMPLS algorithm (de Jong, 1993) in the case when the empirical sample variance and covariance are used in the criterion (3). The resulting regression coefficients are compared by computing the sum of all elementwise squared differences to the exact regression coefficients. This can be considered as a measure of precision of the algorithm, which needs to be as small as possible. Since the precision measure could depend on the specifically generated directions for the algorithm described in Serneels et al. (2005), we average the precision measure over 100 runs. In Figure 1 the resulting precisions are presented for different parameter choices of the algorithms. For the algorithm of Serneels et al. (2005) different numbers of
[Figure: average precision (log scale); top scale: number k of directions for the algorithm of Serneels et al. (2005); bottom scale: choices of G and F for the grid algorithm]
Fig. 1. Average precision for the regression coefficients of the Fearn data resulting from two different algorithms.
[Figure: average computation time in seconds (log scale); top scale: number k of directions for the algorithm of Serneels et al. (2005); bottom scale: choices of G and F for the grid algorithm]
Fig. 2. Average computation time (in seconds) for both algorithms, see Figure 1.
directions k are considered (scale on top), and for the grid algorithm different numbers of grid points G and interval halving steps F are used (scale on bottom). From Figure 1 we see that the precision is comparable for k = 1000 directions and the choice G = 10 and F = 5. By taking more computational effort, the precision is getting much better for the grid algorithm. It is also interesting to compare the algorithms with respect to computation time. Figure 2 presents the average computation time corresponding to the results of Figure 1. While the precision is about the same for k = 1000 and G = 10 and F = 5, the grid algorithm needs roughly twice as much time. On the other hand, the time for both algorithms is about the same for the parameters k = 5000 and G = 20, F = 10, but the precision of the grid algorithm is about 2 · 10−5 compared to 2 · 10−4 for the other algorithm. In general, if higher precision is needed, the grid algorithm will be much faster and at the same time more precise. On the other hand, if moderate precision is sufficient, the Serneels et al. (2005) algorithm is to be preferred.
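The following sketch (ours) illustrates the grid search of step (a) for p = 2, with α-trimmed variance and covariance as the robust plug-ins mentioned in Section 2. The particular trimming rule used below (dropping the α share of the largest squared deviations and absolute cross-products around the median) is our own simple stand-in, not necessarily the estimator of Serneels et al. (2005), and δ is assumed to lie in [0, 1).

```python
# Sketch (ours) of the grid search for p = 2 with alpha-trimmed plug-ins.
# The trimming rule is a simple stand-in for the estimators named in the text.
import numpy as np

def trimmed_var(u, alpha=0.1):
    d = np.sort((u - np.median(u))**2)
    return d[:int(np.ceil((1 - alpha) * len(d)))].mean()

def trimmed_cov(u, y, alpha=0.1):
    prod = (u - np.median(u)) * (y - np.median(y))
    keep = np.argsort(np.abs(prod))[:int(np.ceil((1 - alpha) * len(prod)))]
    return prod[keep].mean()

def cr_index(a, X, y, delta, var=trimmed_var, cov=trimmed_cov):
    """PP index (3)/(6): Cov(Xa, y)^2 * Var(Xa)^(delta/(1-delta) - 1), 0 <= delta < 1."""
    u = X @ a
    return cov(u, y)**2 * var(u)**(delta / (1.0 - delta) - 1.0)

def grid_direction_2d(X, y, delta, G=10, F=5):
    """First weight vector for p = 2: grid of G angles, then F refinement
    steps on successively halved intervals around the current best angle."""
    theta, half_width = 0.0, np.pi / 2
    for _ in range(F + 1):
        grid = theta + np.linspace(-half_width, half_width, G, endpoint=False)
        values = [cr_index(np.array([np.cos(t), np.sin(t)]), X, y, delta) for t in grid]
        theta = grid[int(np.argmax(values))]
        half_width /= 2.0
    return np.array([np.cos(theta), np.sin(theta)])
```

For p > 2 the same one-dimensional search would be applied repeatedly while cycling through the sorted variables, as described in step (b) above.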
4 Simulation
The advantage of robust CR over classical CR in presence of contamination was already demonstrated in Serneels et al. (2005) by simulations and an
[Figure: parallel boxplots of squared errors for δ = 0.1, 0.25, 0.5, 0.75, 0.9; left boxplots: classical CR, middle boxplots: algorithm of Serneels et al. (2005), right boxplots: grid algorithm]
Fig. 3. Squared Errors from the simulation with outliers in the regressor variables.
example. In the simulations different distributions of the error term ε in the model (1) were considered. We recomputed the simulations for the grid algorithm and obtained similar results as for the previously proposed algorithm. Here we will consider the situation of outliers in the regressor variables.

The matrix X of size n × p with n = 100 and p = 10 is generated from N_p(0, C), a multivariate normal distribution with mean 0 and covariance matrix C = diag(1, 1/2, . . . , 1/p). W_h is constructed to fulfill the constraints (4) with h = 3, and ξ is generated from a uniform random distribution in [0.5, 1]. These matrices are fixed for a particular simulation setup. Hence, the true regression parameter β = W_h ξ is known. Then the error term is generated according to ε ∼ N(0, 1/10) and 10% of the rows of X are replaced by outliers coming from N_p(5 · 0, I_p). For several values of the tuning parameter δ the classical CR algorithm, the algorithm of Serneels et al. (2005) and the grid algorithm were applied in 1000 simulation replications. The resulting Squared Errors $(\beta - \hat{\beta}^{(i)}_{\delta,h})'(\beta - \hat{\beta}^{(i)}_{\delta,h})$ were computed for the estimated
regression coefficients in the i-th simulation obtained from the different algorithms, and the results are presented by parallel boxplots in Figure 3. Each group of three boxplots corresponds to a different value of δ. Both algorithms for robust CR lead to comparable results, at least for the choice k = 1000 directions, G = 10 grid points and F = 2, and α = 10% trimmed variance and covariance estimators. For all choices of δ the notches of the classical boxplots do not overlap with the robust ones, which is a strong evidence that the Squared Errors of the classical procedure is higher as for the robust ones, due to the presence of contamination.
5 Summary
The robustification of multivariate methods by plugging in robust covariance matrix estimates is limited to the case n > p. This limitation does not hold
for methods based on PP, and the robustness can be achieved by applying robust estimators to the projected data. Here we outlined the procedure for robust CR, and a new algorithm was introduced. Robust CR turns out to be robust against outliers in the error terms, but also robust with respect to outliers in the regressor variables, as was shown by the simulations in this paper. Programs for computation are available in the Matlab programming environment from the first author.
References

BRANCO, J.A., CROUX, C., FILZMOSER, P., and OLIVEIRA, M.R. (2005): Robust Canonical Correlations: A Comparative Study. Computational Statistics, 2. To appear.
CROUX, C. and HAESBROECK, G. (2000): Principal Component Analysis based on Robust Estimators of the Covariance or Correlation Matrix: Influence Functions and Efficiencies. Biometrika, 87, 603–618.
CROUX, C. and RUIZ-GAZEN, A. (2005): High Breakdown Estimators for Principal Components: The Projection-pursuit Approach Revisited. Journal of Multivariate Analysis. To appear.
DE JONG, S. (1993): SIMPLS: An Alternative Approach to Partial Least Squares Regression. Chemometrics and Intelligent Laboratory Systems, 18, 251–263.
FEARN, T. (1983): A Misuse of Ridge Regression in the Calibration of a Near Infrared Reflectance Instrument. Applied Statistics, 32, 73–79.
FRIEDMAN, J.H. and TUKEY, J.W. (1974): A Projection Pursuit Algorithm for Exploratory Data Analysis. IEEE Transactions on Computers, 9, 881–890.
HAMPEL, F.R., RONCHETTI, E.M., ROUSSEEUW, P.J. and STAHEL, W. (1986): Robust Statistics. The Approach Based on Influence Functions. John Wiley & Sons, New York.
HÖSSJER, O. and CROUX, C. (1995): Generalizing Univariate Signed Rank Statistics for Testing and Estimating a Multivariate Location Parameter. Nonparametric Statistics, 4, 293–308.
HUBER, P.J. (1981): Robust Statistics. John Wiley & Sons, New York.
HUBER, P.J. (1985): Projection Pursuit. The Annals of Statistics, 13, 435–525.
LI, G. and CHEN, Z. (1985): Projection-Pursuit Approach to Robust Dispersion Matrices and Principal Components: Primary Theory and Monte Carlo. Journal of the American Statistical Association, 80, 391, 759–766.
MARONNA, R.A. and YOHAI, V.J. (1998): Robust Estimation of Multivariate Location and Scatter. In: S. Kotz, C. Read and D. Banks (Eds.): Encyclopedia of Statistical Sciences. John Wiley & Sons, New York, 589–596.
PISON, G., ROUSSEEUW, P.J., FILZMOSER, P., and CROUX, C. (2003): Robust Factor Analysis. Journal of Multivariate Analysis, 84, 145–172.
SERNEELS, S., FILZMOSER, P., CROUX, C. and VAN ESPEN, P.J. (2005): Robust Continuum Regression. Chemometrics and Intelligent Laboratory Systems, 76, 197–204.
STONE, M. and BROOKS, R.J. (1990): Continuum Regression: Cross-validated Sequentially Constructed Prediction Embracing Ordinary Least Squares, Partial Least Squares and Principal Components Regression. Journal of the Royal Statistical Society B, 52, 237–269.
Finding Persisting States for Knowledge Discovery in Time Series

Fabian Mörchen and Alfred Ultsch

Data Bionics Research Group, Philipps-University Marburg, 35032 Marburg, Germany
Abstract. Knowledge Discovery in time series usually requires symbolic time series. Many discretization methods that convert numeric time series to symbolic time series ignore the temporal order of values. This often leads to symbols that do not correspond to states of the process generating the time series. We propose a new method for meaningful unsupervised discretization of numeric time series called “Persist”, based on the Kullback-Leibler divergence between the marginal and the self-transition probability distributions of the discretization symbols. In evaluations with artificial and real life data it clearly outperforms existing methods.
1 Introduction
Many time series data mining algorithms work on symbolic time series. For numeric time series they usually perform unsupervised discretization of the values as a preprocessing step. For the discovery of knowledge that is interpretable and useful to the expert, it is of great importance that the resulting interval boundaries are meaningful within the domain. If the time series is produced by an underlying process with recurring persisting states, intervals in the value dimension should describe these states. The most commonly used discretization methods are equal width and equal frequency histograms. Both histogram methods potentially place cuts in high density regions of the observed marginal probability distribution of values. This is a disadvantage, if discretization is performed not merely for quantization and speedup of processing, but rather for gaining insight into the process generating the data. The same applies to other methods, e.g. setting cuts based on location and dispersion measures. While static data sets offer no information other than the actual values themselves, time series contain valuable temporal structure that is not used by the methods described above. We propose a new method for meaningful unsupervised discretization of univariate time series by taking the temporal order of values into account. The discretization is performed optimizing the persistence of the resulting states. In Section 2 we give a brief overview of related methods. The new discretization algorithm is described in Section 3. The effectiveness of our approach is demonstrated in Section 4. Results and future work are discussed in Section 5.
2 Related Work and Motivation
A recent review of discretization methods for data mining is given in Liu et al. (2002). The only unsupervised methods mentioned are equal width and equal frequency histograms. With unsupervised discretization no class labels are available, thus there can be no optimization w.r.t. classification accuracy. But for time series data in particular there is rarely some sort of labeling available for the time points. The choice of parameters for the Symbolic Approximation (SAX) (Lin et al. (2003)) (similar to equal frequency histograms) has been analyzed in the context of temporal rule mining in (Hetland and Saetrom (2003)). The authors suggest to use the model with the best performance on the validation data. But using support and confidence of rules as a quality score is ignoring a simple fact. Rules are typically created to gain a deeper understanding of the data and the patterns therein. Arguably, rules with high support and confidence are less likely to be spurious results. But they will not be useful if the interval boundaries of the discretization are not meaningful to the domain expert. The related task of time series segmentation (e.g. Keogh (2004)) is beyond the scope of this paper. Segmentation does not lead to recurring state labels per se. Instead of dividing the value dimension in intervals, the time dimension is segmented to produce line or curve segments homogeneous according to some quality measure. Postprocessing the segments can lead to recurring labels like increasing for segments with similar positive slopes.
3 Persistence in Time Series
We propose a new quality score for meaningful unsupervised discretization of time series by taking the temporal information into account and searching for persistence. We argue, that one discretization is better than another if the resulting states show more persisting behavior. We expect many knowledge discovery approaches to profit from more meaningful symbols that incorporate the temporal structure of the time series, e.g. rule discovery in univariate (e.g. Hetland and Saetrom (2003), Rodriguez et al. (2000)) and multivariate (e.g. Guimaraes and Ultsch (1999), H¨ oppner (2002), Harms and Deogun (2004), M¨ orchen and Ultsch (2004)) time series, or anomaly detection (e.g. Keogh et al. (2002)). Let S = {S1 , ..., Sk } be the set of possible symbols and s = {si |si ∈ S i = 1..n} be a symbolic time series of length n. Let P (Sj ) be the marginal probability of the symbol Sj . The k × k matrix of transition probabilities is given by A(j, m) = P (si = Sj |si−1 = Sm ). The self-transition probabilities are the values on the main diagonal of A. If there is no temporal structure in the time series, the symbols can be interpreted as independent observations of a random variable according to
the marginal distribution of symbols. The probability of observing each symbol is independent from the previous symbol, i.e. P(s_i = S_j | s_{i−1}) = P(S_j). The transition probabilities are A(j, m) = P(S_j). The simplest temporal structure is a first order Markov model (Rabiner (1989)). Each state depends only on the previous state, i.e. P(s_i = S_j | s_{i−1}, . . . , s_{i−m}) = P(S_j | s_{i−1}).

Persistence can be measured by comparing these two models. If there is no temporal structure, the transition probabilities of the Markov model should be close to the marginal probabilities. If the states show persisting behavior, however, the self-transition probabilities will be higher than the marginal probabilities. If a process is less likely to stay in a certain state, the particular transition probability will be lower than the corresponding marginal value.

A well known measure for comparing two probability distributions is the Kullback-Leibler divergence (Kullback and Leibler (1951)). For two discrete probability distributions P = {p_1, . . . , p_k} and Q = {q_1, . . . , q_k} of k symbols the directed (KL) and symmetric (SKL) versions are given in Equation 1:

$$KL(P, Q) = \sum_{i=1}^{k} p_i \log\frac{p_i}{q_i}, \qquad SKL(P, Q) = \frac{1}{2}\big(KL(P, Q) + KL(Q, P)\big). \qquad (1)$$

For binary random variables we define the shortcut notation in Equation 2:

$$SKL(p, q) := SKL(\{p, 1 - p\}, \{q, 1 - q\}) \qquad \forall\, p, q \in\; ]0, 1]. \qquad (2)$$

The persistence score of state j is defined in Equation 3 as the product of the symmetric Kullback-Leibler divergence of the transition and marginal probability distribution for self vs. non-self with an indicator variable. The indicator determines the sign of the score. States with self-transition probabilities higher than the marginal will obtain positive values and states with low self-transition probabilities obtain negative values. The score is zero if and only if the probability distributions are equal:

$$Persistence(S_j) = \operatorname{sgn}\big(A(j, j) - P(S_j)\big)\, SKL\big(A(j, j), P(S_j)\big). \qquad (3)$$
A summary score for all states can be obtained as the mean of the values per state. This captures the notion of mean persistence, i.e. all or most states need to have high persistence for achieving high persistence scores. The calculation of the persistence scores is straight forward. Maximum likelihood estimates of all involved probabilities can easily be obtained by counting the number of symbols for each state for the P (Sj ) and the numbers of each possible state pair for A. The persistence score is used to guide the selection of bins in the Persist algorithm. The first step is to obtain a set of candidate bin boundaries from the data, obtained e.g. by equal frequency binning with a large number of bins. In each iteration of the algorithm all available candidate cuts are individually added to the current set of cuts and the persistence score is calculated. The cut achieving the highest persistence is chosen. This is repeated until the desired number of bins is obtained. The time complexity is O(n).
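A direct, non-optimized transcription of this procedure might look as follows (our sketch; the function names are ours): the persistence score (1)–(3) is averaged over states, and candidate cuts from an equal-frequency grid are added greedily. The authors' implementation attains the O(n) cost quoted above; the sketch below recomputes the counts from scratch at every step for clarity.

```python
# Sketch of the Persist idea (ours, not the authors' code): mean persistence
# score (1)-(3) and greedy selection of cuts from equal-frequency candidates.
import numpy as np

def skl_binary(p, q, eps=1e-12):
    """Symmetric Kullback-Leibler divergence (1)-(2) of {p, 1-p} and {q, 1-q}."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    kl = lambda a, b: a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))
    return 0.5 * (kl(p, q) + kl(q, p))

def mean_persistence(states, k):
    """Mean of the persistence scores (3) over the k states in `states`."""
    scores = []
    for j in range(k):
        marg = np.mean(states == j)                          # P(S_j)
        n_from_j = np.sum(states[:-1] == j)
        if marg == 0 or n_from_j == 0:
            scores.append(0.0)
            continue
        self_tr = np.sum((states[1:] == j) & (states[:-1] == j)) / n_from_j   # A(j, j)
        scores.append(np.sign(self_tr - marg) * skl_binary(self_tr, marg))
    return np.mean(scores)

def persist(x, k, n_candidates=100):
    """Greedily choose k-1 cuts maximizing the mean persistence of the states."""
    x = np.asarray(x, dtype=float)
    qs = np.linspace(0, 1, n_candidates + 2)[1:-1]
    candidates = list(np.unique(np.quantile(x, qs)))         # equal-frequency candidates
    cuts = []
    while len(cuts) < k - 1 and candidates:
        best_cut, best_score = None, -np.inf
        for c in candidates:
            trial = sorted(cuts + [c])
            score = mean_persistence(np.digitize(x, trial), len(trial) + 1)
            if score > best_score:
                best_cut, best_score = c, score
        cuts.append(best_cut)
        candidates.remove(best_cut)
    return sorted(cuts)
```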
4 Experiments
We evaluated the performance of the Persist algorithm by extensive experiments using artificial data with known states and some real data where the true states were rather obvious. We compared the novel algorithm with the following eight methods: EQF (equal frequency histograms), SAX (equal frequency histograms of a normal distribution with the same mean and standard deviation as the data; this is a special case of SAX with window size 1 and no numerosity reduction), EQW (equal width histograms), M±S (mean ± standard deviation of data), M±A (median ± adjusted median absolute deviation (AMAD) of data), KM (k-Means with uniform initialization and Manhattan distance), GMM (Gaussian mixture model), HMM (Hidden Markov Model). HMM is the only competing method using the temporal information of the time series. It is not quite comparable to the other methods, however, because it does not return a set of bins. The state sequence is created directly and the model is harder to interpret.

Artificial data: We generated artificial data using a specified number of states and Gaussian distributions per state. We generated 1000 time series of length 1000 for k = 2, . . . , 7 states. For each time series 10 additional noisy versions were created by adding 1% to 10% outliers uniformly drawn from the interval determined by the mean ± the range of the original time series. An example for 4 states and 5% outliers is shown in Figure 3(a). The horizontal lines indicate the true means of the 4 states. The large spikes are caused by the outliers. Figure 3(b) shows the Pareto Density Estimation (PDE) (Ultsch (2003)) of the marginal empirical probability distribution.

We applied all discretization methods using the known number of states. We measured the accuracy of the discretization by comparing the obtained state sequence with the true state sequence used for generating the data. The median accuracies and the deviations (AMAD) for k = 5 states and three levels of outlier contamination are listed in Table 1. The Persist algorithm always has a higher median accuracy than any static method, with large distances to the second best. The deviation is also much smaller than for the other methods, indicating high consistency. Even with 10% outliers, the performance of the new algorithm is still better than for any static method applied to the same data without outliers! Compared to the only other temporal method, HMM, the performance of Persist is slightly worse for 0% outliers. But with larger levels of outlier contamination, the HMM results degrade rapidly, even below the results from several static methods.

The results for other values of k were similar. The absolute differences in accuracy were smaller for k = 2, 3, 4 and even larger for k = 6, 7. The performance of HMM degraded later w.r.t. outlier contamination for fewer states and earlier for more states. Figure 1 plots the median accuracies for 3 states, all methods, and all outlier levels. Again, the Persist method is always the best except for HMM at low outlier levels.
Table 1. Median accuracy for 5 states

Method     0% outliers    5% outliers    10% outliers
EQF        0.74 ± 0.08    0.71 ± 0.08    0.69 ± 0.07
SAX        0.74 ± 0.09    0.74 ± 0.08    0.72 ± 0.08
EQW        0.67 ± 0.16    0.33 ± 0.09    0.32 ± 0.08
M±S        0.56 ± 0.11    0.48 ± 0.10    0.43 ± 0.09
M±A        0.51 ± 0.16    0.48 ± 0.15    0.45 ± 0.13
KM         0.71 ± 0.21    0.66 ± 0.22    0.61 ± 0.24
GMM        0.79 ± 0.18    0.27 ± 0.12    0.24 ± 0.11
HMM        0.94 ± 0.08    0.52 ± 0.34    0.44 ± 0.29
Persist    0.90 ± 0.03    0.86 ± 0.03    0.83 ± 0.03

[Figure: median accuracy against percentage of outliers (0–10%) for all nine methods]
Fig. 1. Median accuracy 3 states

Table 2. Test decisions of Persist vs. HMM

States    0%   1%   2%   3%   4%   5%   6%   7%   8%   9%   10%
2         −    −    −    −    −    −    −    −    −    −    ◦
3         −    −    −    −    ◦    +    +    +    +    +    +
4         −    −    ◦    +    +    +    +    +    +    +    +
5         ◦    +    +    +    +    +    +    +    +    +    +
6         +    +    +    +    +    +    +    +    +    +    +
7         +    +    +    +    +    +    +    +    +    +    +
In order to check the results for statistical significance, we tested the hypothesis that the accuracy of Persist is better than the accuracy of the competing algorithms with the rank sum test. The test was performed for all k and all noise levels. For the competing static methods all p-values were smaller than 0.001, clearly indicating superior performance that can be attributed to the incorporation of temporal information. Compared to HMM, the results are significantly better for the larger amounts of outliers and worse for no or few outliers. The more states are present, the less robust HMM tends to be. Table 2 shows the result of the statistical tests between Persist and HMM. A plus indicates Persist to be better than HMM, for a minus the accuracy is significantly lower, circles are placed where the p-values were larger than 0.01.

In summary, the Persist algorithm was able to recover the original state sequence with significantly higher accuracy and more consistency than all competing static methods. The temporal HMM method is slightly better than Persist for no or few outliers, but much worse for more complicated and realistic settings with more states and outliers.

Real data: For real life data the states of the underlying process are typically unknown. Otherwise a discretization into recurring states wouldn't be necessary. We explored the behavior of the Persist algorithm in comparison with the other methods on two datasets that clearly show several states. The
[Figure panels: (a) artificial series, (b) PDE of (a), (c) Muscle series, (d) Persist bins, (e) EQF bins, (f) EQW bins]
Fig. 3. Artificial data and Muscle data with results
muscle activation of a professional inline speed skater (Mörchen et al. (2005)) is expected to switch mainly between being active and relaxed. Five seconds of the data are shown in Figure 3(c). Consulting an expert we chose k = 3 states. The resulting bin boundaries of three selected methods are shown in Figures 3(d)-3(f) as vertical lines on top of a probability density estimation plot. All methods (including the other methods not shown) except Persist place cuts in high density regions. EQF sets the first cut very close to the peak corresponding to the state of low muscle activation. This will result in a large amount of transitions between the first two states. EQW does the same for the second peak in the density, corresponding to high muscle activation. Persist is the only method that places a cut to the right of the second peak. This results in a state for very high activation. The validity of this state can also be seen from Figure 3(c), where the horizontal lines correspond to the bins selected by Persist. The very high values are not randomly scattered during the interval of high activation but rather concentrate toward the end of each activation phase. This interesting temporal structure is not visible from the density plot, is not discovered by the other methods, and was validated by the expert as the push off with the foot.

The Power data describes the power consumption of a research center over a year (van Wijk (1999), Keogh (2002)). The data was de-trended to remove seasonal effects, half a week is shown in Figure 4(a). Persist with four states (Figure 4(b)) corresponded to (1) very low power usage, (2) usually low usage at nighttime, (3) rather low usage at daytime, and (4) usual daytime consumption. In contrast, the EQF method places a very narrow bin around the high density peak for nighttime consumption (Figure 4(c)). There will be
284
F. M¨ orchen and A. Ultsch −3
4
very low low medium high
−3
x 10
4
3.5
3
Likelihood
Power
Likelihood
3
EQF
x 10
3.5
2.5 2 1.5
2.5 2 1.5
1
1
0.5
0.5
Persist 1000
1050
1100
1150
1200 Time
1250
1300
1350
(a) Time series with
0 −500
0
500
Power
(b) Persist bins
1000
0 −500
0
500
1000
Power
(c) EQF bins
Fig. 4. Power data
frequent short interruptions of this state with symbols from the two neighboring states. This is demonstrated in Figure 4(a). Below the original data the state sequences created by EQF (top) and Persist (bottom) are shown as shaded rectangles. While the high states are almost identical, Persist creates much ’cleaner’ low states with higher persistence.
5
Discussion
The proposed quality score and algorithm for detecting persisting states has been shown to outperform existing methods on artificial data. In the Muscle data a state of very high activity was detected, that is neglected by the other methods. In the Power data less noisy states were found. The method is simple, exact, and easy to implement. The only competing method, HMM, is far more complex. The EM algorithm needs a good initialization, is sensitive to noise, and only converges to a local maximum of the likelihood. HMM models are also harder to interpret than the result of binning methods like Persist. Using each time point or a small window for discretization will usually produce consecutive stretches of the same symbol. In Daw et al. (2003) the authors state that “from the standpoint of observing meaningful patterns, high frequencies of symbol repetition are not very useful and usually indicate oversampling of the original data”. But interesting temporal phenomena do not necessarily occur at the same time scale. Trying to avoid this so called oversampling would mean to enlarge the window size, possibly destroying short temporal phenomena in some places. We think that with smooth time series it is better to keep the high temporal resolution and search for persisting states. The resulting labeled interval sequences that can be used to detect higher level patterns (e.g. H¨ oppner (2002), M¨ orchen and Ultsch (2004)).
References DAW, C.S., FINNEY, C.E.A., and TRACY, E.R. (2003): A review of symbolic analysis of experimental data. Review of Scientific Instruments, 74:0 916–930.
Finding Persisting States for Knowledge Discovery in Time Series
285
GUIMARAES, G. and ULTSCH, A. (1999): A method for temporal knowledge conversion In Proc. 3rd Int. Symp. Intelligent Data Analysis, 369–380. HARMS, S. K. and DEOGUN, J. (2004): Sequential association rule mining with time lags. Journal of Intelligent Information Systems (JIIS), 22:1, 7–22. HETLAND, M.L. and SAETROM, P. (2003): The role of discretization parameters in sequence rule evolution. In Proc. 7th Int. KES Conf., 518–525. ¨ HOPPNER, F. (2002): Learning dependencies in multivariate time series. Proc. ECAI Workshop, Lyon, France, 25–31. KEOGH, E. (2002): The UCR Time Series Data Mining Archive http://www.cs.ucr.edu/˜eamonn/TSDMA/index.html KEOGH, E., LONARDI, S., and CHIU, B. (2002): Finding Surprising Patterns in a Time Series Database in Linear Time and Space In Proc. 8th ACM SIGKDD, 550–556. KEOGH, E., CHU, S., HART, D., and PAZZANI, M. (2004): Segmenting time series: A survey and novel approach. Data Mining in Time Series Databases, World Scientific, 1–22. KULLBACK, S. and LEIBLER, R.A. (1951): On information and sufficiency Annals of Mathematical Statistics, 22, 79–86. LIN, J., KEOGH, E., LONARDI, S., and CHIU, B. (2003): A symbolic representation of time series, with implications for streaming algorithms. In Proc. 8th ACM SIGMOD, DMKD workshop, 2–11. LIU, H., HUSSAIN, F., TAN, C.L., and DASH, M. (2002): Discretization: An Enabling Technique. Data Mining and Knowledge Discovery, 4:6, 393–423. ¨ MORCHEN, F. and ULTSCH, A. (2004): Discovering Temporal Knowlegde in Multivariate Time Series In Proc. GfKl, Dortmund, Germany, 272–279. ¨ MORCHEN, F., ULTSCH, A., and HOOS, O. (2005): Extracting interpretable muscle activation patterns with time series knowledge mining. Intl. Journal of Knowledge-Based & Intelligent Engineering Systems (to appear). ¨ RODRIGUEZ, J.J., ALSONSO, C.J., and BOSTROM, H. (2000): Learning First Order Logic Time Series Classifiers In Proc. 10th Intl. Conf. on Inductive Logic Programming, 260–275. RABINER, L. R. (1989): A tutorial on hidden markov models and selected applications in speech recognition. In Proc. of IEEE, 77(2):0 257–286. ULTSCH, A. (2003): Pareto Density Estimation: Probability Density Estimation for Knowledge Discovery. In Proc. GfKl, Cottbus, Germany, 91–102. VAN WIJK, J. J., VAN SELOW, E. R. (1999): Cluster and Calendar Based Visualization of Time Series Data. In Proc. INFOVIS, 4-9.
Restricted Co-inertia Analysis Pietro Amenta1 and Enrico Ciavolino2 1
2
Department of Analysis of Economic and Social Systems, University of Sannio, 82100, Benevento, Italy Research Centre on Software Technology, University of Sannio, 82100, Benevento, Italy
Abstract. In this paper, an extension of the Co-inertia Analysis is proposed. This extension is based on a objective function which takes into account directly the external information, as linear restrictions about one set of variables, by rewriting the Co-inertia Analysis objective function according to the principle of Restricted Eigenvalue Problem (Rao (1973)).
1
Introduction
In applied or theoretical ecology as well as in other contexts (ex. chemometric field, customer satisfaction analysis, data sensory analysis) we often have to deal with the study of numerical data tables obtained in experimental applications. The study of this tables often requires the use of multivariate analyses in order to investigate the relationships between the two data sets. The asymmetric relationships between two sets of quantitative variables has been, at first, studied by Rao (1964) in the multivariate regression approach while, during these years, a good deal of attention has been paid to the Partial Least Square regression (Wold (1966)) and its generalizations. In the same way, in literature, in order to study symmetrical interdependence relationships have been proposed several techniques originated from Canonical Correlation Analysis (Hotteling (1936)) or from Tucker’s Inter-Battery Analysis (1958) and Co-inertia Analysis (COIA) (Chessel and Mercier (1993)) and their generalizations. Often, we have additional information about the structure of the experiment (ex. on statistical units or on the variables) that can be incorporated as ”external information” in the analysis, in order to improve the interpretability of the analysis of the phenomenon. These external information can take a variety of form: vector of ones, a matrix of dummy variables or a matrix of continuous variables. Orthogonal contrasts used in Analysis of Variance (AOV) by which different linear principal mean effects can be possible highlighted are a particular case of external information. In this paper we suppose linear restrictions about one set of variables. Many techniques have been proposed for incorporation of the external information (ex. Takane and Shibayama (1991)), but often they are based on a suitable pre-processing treatment of the data sets. Other approaches
Restricted Co-inertia Analysis
287
(H¨ oskuldsson (2001), Martens et al. (2005)) outline several ways to combine three or more matrices within the PLS framework. Aim of this paper is to provide an extension of Co-inertia Analysis based on an objective function which takes into account directly the external information. The COIA objective function is rewrited according to the principles of Restricted Eigenvalue Problem. We call this approach Restricted Co-inertia Analysis (RCOIA).
2
Restricted Co-inertia Analysis
The mathematical model of RCOIA may be examined by using the duality diagram (Cailiez and Pages (1976)). Let (X, QX , D) be the statistical study associated with the (n × p) matrix X, collecting a set of p quantitative/qualitative variables observed on n statistical units. QX is the (p × p) metric in p and D is the (diagonal) weights metric into vectorial space of variables n . Moreover, let (Y, QY , D) be the statistical study associated with the (n × q) matrix Y , collecting a set of q (quantitative/qualitative) variables observed on the same n statistical units. QY is the (q × q) metric of the statistical units in q . We assume that all the variables have zero means as regards the weights diagonal metric D. The statistical triplets (X, QX , D) and (Y, QY , D) are characterized by the same statistical units on which are observed two sets of different variables, so that, statistical units belong to different spaces. The study of a statistical triplet (X, QX , D) is equivalent, from a geometrical point of view, to search the inertia axes of a cloud of n points of p (principal axes) or, in similar way, looking for the inertia axes of a cloud of p points in n (principal components). In order to study the common geometry of the two clouds (co-structure), Chessel and Mercier proposed the Co-inertia Analysis which is a symmetric coupling method that provides a decomposition of the coinertia criterion trace(Y T DXQX X T DY QY ) on a set of orthogonal vectors. COIA maximizes the square covariance between the projection of X on wk (ψk = XQX wk ) and the projection of Y on cj (ϕj = Y QY cj ): cov 2 (ψk , ϕj ) = corr2 (ψk , ϕj ) × var(ψk ) × var(ϕj ). Note that the square of the latter entity corr(.) is maximized via canonical correlation analysis while a co-inertia axis maximizes cov 2 (.). An extension of COIA, considering the external information and according to the principle of the Restricted Eigenvalue Problem, can be written in the form of the following objective function (s = 1, . . . , min(p, q)): ⎧ max Cov 2 (XQX ws , Y QY cs ) ⎪ ⎪w ⎪ ⎨ s ,cs 2 ws QX = 1 (1) 2 ⎪ cs QY = 1 ⎪ ⎪ ⎩ T H ws = 0
288
P. Amenta and E. Ciavolino
where H T ws = 0 is the restriction criterion and H is the matrix of external information (as linear restrictions) on X of order (p × l) with l < p. Solutions are obtained by Lagrange multipliers method. The system (1) can be rewritten as: L = (wsT QX X T DY QY cs )2 − λ(wsT QX ws − 1) − µ(cTs QY cs − 1) − wsT Hγ (2) where λ, µ and γ are the Lagrange multipliers associated to the constraints, respectively. By applying the Lagrange method to the equation (2) we obtain the general eigenvalue problem T T Q−1 X (I − PH/Q−1 )QX X DY QY Y DXQX ws = λws X
(3)
−1 − T −1 where PH/Q−1 = H(H T Q−1 X H) H QX is the QX -orthogonal projection X operator onto the vectorial subspace spanned by the column vectors of matrix H. The eigenvectors of (3) are the stationary points of (1) with the eigenvalues as the corresponding values of the maximand. This leads us to the extraction of the eigenvalues λ and eigenvectors cs associated to the − T T eigen-system: Y T DX[QX − H(H T Q−1 X H) H ]X DY QY cs = λcs or − T T {QY Y T DX[QX − H(H T Q−1 X H) H ]X DY QY }vs = λvs 1/2
1/2
(4)
Let Vz be the matrix that contains the first z normalized eigenvectors of (4) and Λz = diag(λ1 , . . . , λz ). The first z (QY -normed) restricted co-inertia axes −1/2 cz in q are given by Cz = QY Uz while the first z (QX -normed) co-inertia p T T −1/2 axes wz in are obtained as Wz = (I − PH/Q such −1 )X DY QY Cz Λ X
that CzT QY Cz = I and WzT QX Wz = I. Finally, the RCOIA scores of X and Y z rows are given by TX = XQX Wz and TYz = Y QY Cz , respectively. Restricted −1/2 ˆ 1/2 T 1/2 column component loadings are obtained as ξzX = λz Q X X DY QY vz −1/2 1/2 −1/2 1/2 −1 ˆ X = QX − H(H T Q H)− and ξzY = (1/QY vz λz 2 )QY vz λz with Q X H T such that < ξzX , ξzX >= δz,z and < ξzY , ξzY >QY = δz,z (z, z = 1, . . . , min(p, q)). It is possible to show that RCOIA(Y, X, H) is equivalent to the ˆ X ). We remark that, in absence study of the statistical triplet (X T Y, QY , Q of external information, we have the same COIA solutions. If QX = I and QY = I then RCOIA(Y, X, H) leads us to the extraction of the eigenvalues and eigenvectors associated to the eigen-system ⊥ T Y T DXPH X DY vs = λvs .
We highlight as the first eigenvector is also solution of a restricted version of Partial Least Squares (PLS). In fact, if we apply a pre-processing treatment of ˆ= matrix X, in the sense of Takane and Shybayama decomposition (1991), X ⊥ ˆ XPH then, for the first solution, we have RCOIA(Y, X, H) = P LS(Y, X). For the other solutions it is sufficient to take into account in RCOIA the Wold deflation process of PLS. In this way, we can propose a PLS with
Restricted Co-inertia Analysis
289
external information coming from a restricted co-inertia criteria (Amenta et al., (2005)) not like a strategy. Another property of RCOIA is that the component scores tsX and tsY are not correlated for s = s and s, s = 1, . . . , z ≤ min(p, q), while scores of the same table are not independent as in usual Principal Component Analysis. A suitable deflation algorithm can be applied in order to obtain independent scores (Amenta and Ciavolino (2005)). From the geometric point of view, it is possible to show that the restricted co-inertia axis is given by the projection of the un-restricted co-inertia axis onto the Q−1 X null space N (H) of the matrix H associated to the linear constraints. Finally, it is possible consider RCOIA like a framework: moving from the restricted co-inertia problem (1) with different nature and coding of X and Y and different choices of QX and QY , it is possible to obtain (Amenta (2005)) the restricted versions (a lot of them are not proposed in literature) of several symmetrical and non-symmetrical methods. For example, we find the following methods proposed in literature as particular cases of RCOIA: Canonical analysis of contingency tables with linear constraints (B¨ ockenholt and B¨ ockenholt (1990)), Non Symmetrical Multiple Correspondence Analysis with linear constraints (Amenta and D’Ambra (1994)), Constrained Principal Component Analysis with external information (Amenta and D’Ambra (1996, 2000)), Generalized Constrained Principal Component Analysis with external information (Amenta and D’Ambra (2000)). In the same way, for example, it is possible to obtain unpublished restricted version of: Correspondence Analysis, Correspondence Analysis of juxtaposed contingency tables, Multiple Correspondence Analysis, PLS-Discriminant Analysis, PCA with respect to instrumental variables, Barycentric Discriminant Analysis, Inter-table PCA and many others.
3
Dune Meadow Data
The case study concerns the analysis of a data set studied in Jongman et al. (1987) based on measurements of some dune meadow flora and environmental characteristics. A fundamental property of biological systems is their ability to evolve depending on the system structure as well as on the relationships between the species and their environment (Prodon (1988)). Dune meadow represents a zone between the sea and the land where a very interesting biological system is found and where the relationships between the presence of some vegetables species and the physicals and managements variables are studied. Aim of this section is to study the relationships between two data sets taking into account some external information, as linear constraints on the X coefficients, in order to answer the following questions: (a) Which combination of environmental variables may be related to species abundance?; (b) What happens when are considered “External Information” in the analysis? The results will be presented without reference to all the re-
290
P. Amenta and E. Ciavolino Physical Variables 1 - Thickness of the A1 horizon 2 - Moisture content of the soil 3 - Agricultural grassland use 4 - Quantity of manure applied Management Variables 5 - Standard farming management 6 - Biological farming management 7 - Hobby-farming management 8 - Nature conservation management
Scale Cm five-point scale Ordinal four-point scale Scale 0: no; 1: yes; 0: no; 1: yes; 0: no; 1: yes; 0: no; 1: yes;
Labels Thickness Moisture Use Manuring Labels Farming-M Biological-M Hobby-M Nature-M
Table 1. Physical and Management Variables Ach mil Cal cus Agr sto Jun buf Air pra Leo aut Alo gen Lol per Ant odo Pla lan Bel per Poa pra Bro hor Poa tri Che alb Pot pal Cir arv Ran fla Ele pal Rum ace Ely rep Sag pro Emp nig Sal rep Hyp rad Tri pra Jun Art Tri rep Bra rut Vic lat
Table 2. Dune Meadow species Labels. Latin names. Variables Thickness Moisture Use Manuring Farming-M Biological-M Hobby-M Nature-M Constr. 1 1 1 1 -3 0 0 0 0 Constr. 2 1 1 1 1 -1 -1 -1 -1
Table 3. The external Information.
ports and graphical outputs provided by RCOIA method to make the paper more readable. The environmental variables (physicals and managements) are reported in table 1 and collected in the X matrix formed by 20 samples times 8 variables. In table 2 we have the labels (latin names) of the 30 dune meadow species collected in the Y matrix of 20 rows (samples) times 30 columns (species). The study of the relationships between two data sets try to find the optimal combination where the space spanned by the matrices reach to the maximization of the covariance between the environmental and species abundance. The covariance can be viewed as the combination of the environmental variables that characterize the presence of vegetables species, or reverse point of view, the combination of dune meadow species present for that kind of biological system. The interpretation of the biological system analysis is improved by incorporating the external information, as ANOVA orthogonal contrasts, on the environmental variables matrix. The constrains are developed on the hypothesis on the X matrix structure. The first one is established considering the results of the co-inertia analysis, where the ”Manuring” variable is set versus the other physical variables ”Moisture”, ”Thickness” and ”Use”, without considering the effect of the ”Management” ones. The second constraint compares the effect of the ”Management” variables versus the ”Physical” variables. The constraints are collected in a matrix H of order (8 × 2). The results of the co-inertia analysis are reported in the figure 1, where the Environmental Plane (1.a) shows the variables ”Mois-
Restricted Co-inertia Analysis
291
(b) Vegetables Species
(a) Environmental Variables 1
1
Manuring
0.8
0.8
0.6
0.6
Alo gen
Poa tri
Agr sto 0.4
0.4
Farming-M Use Hobby-M Biological-M
0
Lol per 0.2
Thickness
F(2)
F(2)
0.2
Moisture
Poa pra Ely
0
-0.2
-0.2
Pla
Nature-M -0.4
-0.4
-0.6
-0.6
-0.8
-0.8
-1 -1
-0.8
-0.6
-0.4
-0.2
0 F(1)
0.2
0.4
0.6
0.8
1
-1 -1
-0.8
-0.6
-0.4
Sag
Ele Pal
Jun Bro Cir Bel TriRum pra CheJunCalRan Emp Pot Ach Vic TriAirrep Hyp Bra
-0.2
Ant Leo
Sal
0 F(1)
0.2
0.4
0.6
0.8
1
Fig. 1. Co-inertia Analysis
ture” and ”Thickness” correlated to the first axis, while the second axis is influenced by the ”Manuring” while the four types of management variables have no influence on this factorial plane. The Vegetable plane (1.b) shows the dune meadow species correlated to the first axis are ”Lol per” ”Poa pra” and ”Ele pal” while the second one is characterized by the species ”Poa tri”, ”Alo gen” and ”Agr sto”. The conclusion on these results lead us to hyphotesize that in dune meadow biological system analyzed, the vegetables species of the ”Lol per” ”Poa pra” and ”Ele pal” are related to the ”Moisture” and ”Thickness” of the land, while the species ”Poa tri”, ”Alo gen” and ”Agr sto” can be found where the land is manured. Based on the results of the co-inertia analysis and on the hypothesis on the distinction of the physical and management variables, the introduction of constraints tries to respond the following questions: ”What happens if it is compared the effects ”Moisture”, ”Thickness” and ”Use” against the ”Manuring” without the effect of ”Managements” variables”? and ”What about the comparison effect of the ”Managements” variables versus the ”Physicals” ones ?”. In the Environmental Plane (2.a) the variables ”Farming-M” and ”Use” come out the center of the axes while the ”Manuring” is moved in the center as well as the ”Nature-M” and ”Thickness” are now correlated. The plane of environmental variables (2.a) is also changed: the first axis is now also correlated to the ”Nature conservation management” while the second one is characterized by the variables ”Use” and ”Standard farming management”. In the vegetables plane (2.b) there are not significant changes. These new planes give an improved interpretability in way to understand the species abundance, where in the simple co-inertia analysis on the first axis the species are related only to the physical variables, now there is also an effect of the ”Nature conservation management”, moreover on the second axis the effect of the environmental
292
P. Amenta and E. Ciavolino
(b) Vegetables Species
(a) Environmental Variables 1 1
0.8
0.8
Use
Farming-M
0.6
Hobby-M Biological-M Manuring
0.4
Poa pra
0
-0.6
-0.8
-0.8
-0.8
-0.6
-0.4
Ely Jun JunCir TriBro pra RanCal Che Rum EmpVicBel Ach Tri rep Hyp Pot Air Bra Leo Pla Sal Ant
-0.4
-0.6
-1 -1
Ele pal
-0.2
Thickness Nature-M
-0.4
Lol per
Sag
0
-0.2
Poa tri
Agr sto
0.2 F(2)
F(2)
Moisture
Alo gen
xxx xxx xxx xxx xxx
0.4
0.2
0.6
-0.2
0 F(1)
0.2
0.4
0.6
0.8
1
-1 -1
-0.8
-0.6
-0.4
-0.2
0 F(1)
0.2
0.4
0.6
0.8
1
Fig. 2. Restricted Co-inertia Analysis
variables is totally changed, in fact, there is an high impact factor of the ”Standard farming management” and the land ”Use”.
4
Conclusion
The showed method is an alternative way to include the external information (as linear constraints) improving the interpretability of the analysed phenomenon. The proposed method can be useful in many experimental applications. For instance, an indirect measure of customer satisfaction (Servqual) is to evaluate the gap between the elements of a service, grouped in five dimensions/aspects of Quality (Parasuraman et al., 1994), which consumers would expect as ideal (Expectations Score) and those they have recently experienced (Perceptions Score). The importance on the elements of service/product is measured by giving an judgment on the five dimensions (Tangibles, Assurance, Responsiveness, Reliability, Tangibles). These judgments can be considered as external information on the variables so RCOIA can be a suitable method to improve the analysis of the customers evaluations. Moreover, linear constraints on both matrices X and Y can improve the quality and the interpretability of the analysed phenomenon (Amenta, (2005)). Acknowledgement: This research has been supported with a PRIN 2004 grant (Resp: P.Amenta)
References AMENTA, P. (2005): Double Restricted Co-inertia Analysis. Submitted. AMENTA, P. and CIAVOLINO, E. (2005): Single Restricted Co-inertia-PLS Analysis. Submitted.
Restricted Co-inertia Analysis
293
AMENTA P. and D’AMBRA L. (1994): Analisi non Simmetrica delle Corrispondenze Multiple con Vincoli Lineari. Book of XXXVII Riunione S.I.S., Sanremo, Italy. AMENTA P. and D’AMBRA L. (1996): L’Analisi in Componenti Principali in rapporto ad un sottospazio di riferimento con informazioni esterne. Quaderni di Statistica del Dipartimento di Metodi Quantitativi e Teoria Economica, Universit` a “G. D’Annunzio” di Pescara, n. 18/1996. AMENTA P. and D’AMBRA L. (2000): Constrained Principal Components Analysis with External Information. Rivista di Statistica Applicata, Vol. 12, n. 1. AMENTA P. and D’AMBRA L. (2000): Generalized Constrained Principal Component Analysis with external information. Book of XL Riunione S.I.S. Firenze, Italy. AMENTA, P., DURAND J.F and D’AMBRA L. (2005): The objective function of Restricted Partial Least Squares. Submitted. ¨ ¨ BOCKENHOLT U. and BOCKENHOLT I. (1990): Canonical analysis of contingency tables with linear constraints. Psychometrika, 55, pp. 633-639. CAILLIEZ, F. and PAGES, J.P. (1976): Introduction a ` l’analyse des donn´ees. Smash, Paris. CHESSEL, D. and MERCIER P. (1993): Couplage de triplets statistiques et liaisons especes-environement. In J.D. LEBRETON and B. ASSELAIN (Eds.) Biometrie et environment. Masson, Paris. ¨ HOSKULDSSON, A. (2001): Casual and path modelling. Chemometrics and Intell. Lab. Systems, 58, 2, pp. 287-311. HOTTELLING, H. (1936): Relations between two sets of variates. Biometrika, 28. JONGMAN, R.H., TER BRAAK, C.J.F. and VAN TONGEREN, O.F.R. (1987): Data analysis in community and landscape ecology. Pudoc, Wageningen. MARTENS, H., ANDERSSEN, E., FLATBERG, A., GIDSKEHAUG, L.H., H∅Y, M., WESTAD, F., THYBO, A. and MARTENS M. (2005): Regression of a data matrix on descriptors of both rows and of its columns via latent variables: L-PLSR. Computational Statistics and Data Analysis, 48, pp. 103-123. PARASURAMAN A., ZEITHAML V., BERRYL. (1994) Reassessement of expectations as a comparison standard in measuring service quality: implications for future research. Journal of Marketing, 58, 1. PRODON R. (1988): Dynamique des syst´emes avifaune-v´eg´etation apr´es d´eprise rurale et incendies dans les py´en´ees m´editerran´eennes siliceuses. Th´ ese, Universit´e Paris VI. 333 pp.. RAO, C.R. (1964): The use and interpretation of principal component analysis in applied research. Sankhya, A, 26, pp. 329-358. RAO, C.R. (1973): Linear Statistical Inference and Its Application. Wiley, New York. TAKANE, Y. and SHIBAYAMA, T. (1991): Principal Component Analysis with External Information on both subjects and variables. Psychometrika, 56, pp. 97-120. TUCKER, L.R. (1958): An inter-battery method of factor analysis. Psycometrika, 23, n.2, pp. 111-136. WOLD, H. (1966): Estimation of principal components and related models by iterative least squares. In P.R. KRISHNAIAH (Eds.) Multivariate Analysis. Academic Press, New York.
Hausman Principal Component Analysis Vartan Choulakian1 , Luigi Dambra2 , and Biagio Simonetti2 1
2
D´ept. de Math/Statistique, Universit´e de Moncton, Moncton, N.B., Canada E1A 3E9 Dept. of Mathematics and Statistics, University of Naples ”Federico II”, 80126, Napoli, Italy
Abstract. The aim of this paper is to obtain discrete-valued weights of the variables by constraining them to Hausman weights ( –1, 0, 1) in principal component analysis. And this is done in two steps: First, we start with the centroid method, which produces the most restricted optimal weights –1 and 1; then extend the weights to –1,0 or 1.
1
Introduction
A number of methods have been proposed to modify principal component analysis (PCA) to improve the interpretation of results. The oldest method being the rotation of the pc weights to achieve simple structure proposed by Thurstone, see for a description, Jackson (1991, ch. 8). Hausman (1982) proposed a branch and bound algorithm to obtain discrete-valued weights of the variables by constraining them to be -1, 0 and 1 (the weights -1,0,1 will be named Hausman weights). Choulakian (2001) named the Hausman weights extended simple structure and presented an algorithm to construct (3p − 1)/2 nonredundant directions, where p is the number of variables. It is evident that this propblem is NP hard for moderately large values of p. Chipman and Gu (2003) named the Hausman weights homogeneity constraints and proposed an algorithm to obtain them based on thresholding the original pc weights, such that the resulting Hausman weights have the minimum angle with the original pc weights. Vines (2000) proposed simple principal components: She considered simple directions as directions that can be written as proportional to a vector of integers. Sometimes, simple directions have the Hausman weights. Joliffe (2002, ch. 11) presents an overview of several pre and post simplified approximations to PCA. Rousson and Gasser (2004) asserted that Hausman’s algorithm is computationally extremely slow, and they proposed a method in two steps to obtain interpretable components. The aim of this paper is to propose a simple and efficient computational approach to do PCA with Hausman weights. So the mathematical problem that we name Hausman PCA (HPCA) is to solve maxv
||Yv||2 subject to vi = −1, 0, 1, ||v||2
(1)
Hausman Principal Component Analysis
295
where vi is the ith coordinate of the vector v. This will be done in two steps: First, we start with the centroid method of calculating principal axes and components, which produces the most restricted optimal weights -1 or 1; then extend the weights simply to -1, 0 or 1. The proposed approach will help us in quantifying the exact change in the variance of the pc by the deletion of a variable or a group of variables. It will be seen that this exact change is a function of many terms, one of the terms being the absolute value of the loading of the deleted variable on the pc. This result will compliment Cadima and Jolliffe (1995), who discussed the pitfalls in interpreting a pc by ignoring variables with small absolute value loadings; a summary of their results can also be found in Jolliffe (2002, p.193). An important advantage of our approach is that we start with optimal centroid weights having values -1 or 1. This paper is organized as follows: In section 2, we summarize the mathematical theory behind the centroid method taken from Choulakian (2003, 2005a, 2005b). In section 3, we present the main results concerning HPCA. In section 4, we apply the HPCA to the data set considered by Cadima and Jolliffe (1995). Finally, in section 6, we conclude with some remarks.
2
The Centroid Principal Component Analysis
The centroid principal component analysis (CPCA), proposed by Burt (1917) and developed by Thurstone (1931), was a widely used method to calculate principal components before the advent of the computers. It was considered a heuristic method that provided an approximate solution to the classical PCA. It was defined as an algorithmic procedure based on equation (3). However, recently Choulakian (2003, 2005a) showed that the centroid method is an optimal procedure: It is a particular singular value decomposition based on particular matrix norms. Let X be a data set of dimension nxp, where n observations are described by the p variables. Let Y represent the standardized data set, and V = 2 Y Y/n is the correlation matrix. For a vector u ∈ Rn ,we define ||u||2 = u u. The p-th vector norm of a vector v = (v1 , ..., vm ) is defined to be ||v||p = m p ( i=1 |vi | )1/p for p ≥ 1 and ||v||∞ = maxi |vi | . The variational definitions of the centroid method are λ1 = maxv
||Yv||2 ||Y u||1 u Yv = maxu = maxu,v , ||v||∞ ||u||2 ||u||1 ||v||∞
= max(vY Yv)
1/2
subject to vi = 1 or − 1 for i = 1, ...n.
(1) (2)
The exact computation of the first centroid principal component weights, v1 , and the associated dispersion measure, λ1 , can be done by combinatorial optimization (3), where v1 = arg max v Vv v
and λ21 = v1 Vv1 .
(4)
296
V. Choulakian et al.
Let u1 = arg max ||Y u||1 subject to ||u||2 = 1. u
(5)
Let s1 be the vector of the first principal component scores, and c1 the vector of the first principal component loadings vector. The interplay between s1 and c1 , known as transitional formulae, are 1st pc scores vector: s1 = Yv1 and ||s1 ||2 = s1 u1 = λ1 ,
(6)
1st pc loadings vector: c1 = Y u1 and ||c1 ||1 = c1 v1 = λ1 ,
(7)
sgn(c1 )= v1
and s1 = λ1 u1 and λ1 = u1 Yv1 ,
(8)
where sgn(c1 ) is the coordinatewise sign function. Note that we have distinguished between the first pc weights vector, v1 , and the first pc loadings vector, c1 . However, v1 and c1 are related in non linear way by the first equation in (8). The vector v1 represents the first principal axis on which the projected sample points have the greatest variance. The jth element of c1 , the loading of the jth variable on the first principal axis of the column points, is proportional to the ordinary covariance between the jth column yj of the data set Y and the latent factor variable s1 /λ1 = u1 , that is, c1j = ncov(yj , u1 ). To calculate the second principal component we apply the algorithm to the residual data matrix Y(1) = Y − s1 c1 /λ1 .
(9)
The residual covariance matrix can be expressed as, V(1) = Y(1) Y(1) = V − c1 c1 .
(10)
Let vi , ci and si be the ith centroid principal component weights, covariances and scores, respectively, calculated from Y(i−1) .Then:
vj ci = 0 for j < i,
(11)
sj si = 0 for j = i.
(12)
Equations (11 and 12) show that the vectors of principal scores,si ’s, are mutually orthogonal, but the vectors of the principal loadings, ci ’s, are not orthogonal. To have results similar to the ordinary PCA, where the factor loadings, ci ’s, are orthogonal, we construct the residual data set Y(1) = Y − Y(c1 c1 /c1 c1 ).
(13)
Hausman Principal Component Analysis
2.1
297
Alternating Algorithm
Choulakian (2005b) presented three ascent algorithms to calculate the quantities s1 , c1 and λ1 , one of them being based on the transitional formulas (6) and (7), similar to Wold’s (1966) NIPALS algorithm. The algorithm can be summarized in the following way. Let c be any starting value: Step 0: Put k = 0 and v(k) = sgn(c); Step 1: s(k+1) = Yv(k) and λ(s(k+1) ) = s(k+1) 2 ; √ Step 2: u(k+1) = s(k+1) / s(k+1) s(k+1) , c(k+1) = Y u(k+1) and λ(c(k+1) ) = (k+1) c ; 1 Step 3: If λ(c(k+1) )−λ(s(k+1) ) > 0, v(k+1) =sgn(c(k+1) ), k = k + 1, and go to Step 1; otherwise, stop. The algorithm may converge to a local solution. So to find the optimal solution, it should be restarted from multiple initial points; and that good starting points can be the rows or the columns of the data matrix Y.
3
Hausman Principal Component Analysis
The main computational problem is to solve equation (1). The theoretical foundations of our computational approach are based on the following two arguments. First, we note that theoretically, (1) can be solved by doing (2p − 1) CPCAs: Consider CPCA of the the nonnull subsets of the columns of Y, and choose the one that satisfies (1). This is also a NP hard problem. Second, let us designate by vC = arg maxvi =±1 v Vv, sC = YvC , µ2C = vC VvC /vC vC = sC sC /p = λ2C /p, because p = v v; these are the optimal estimates obtained by the CPCA. Partition vC into two parts: vC+ and vC− , vC+ containing the +1 values and vC− containing the −1 values. Let us designate by vH = arg maxvi =±1,0 v Vv, sH = YvH , µ2H = vH VvH /vH vH , the optimal estimates of problem (1). Partition vH into three parts: vH+ , vH0 and vH− , vH+ containing the +1 values, vH0 containing the 0 values, and vH− containing the −1 values. Our basic hypothesis is the following: Because of the optimality of the CPCA and of the HPCA, we should have vH+ ⊆ vC+ and vH− ⊆ vC− .
(14)
Limited empirical studies showed that (14) is true. From which, we can deduce a very simple way to obtain vH : First obtain vC ; then, by zeroing operations, that is changing some of the elements of vC into zero, obtain vH . Here we describe the effects of the zeroing operation. To simplify notation, we eliminate the subindex c from vC , sC and λ2C . We designate by v−i the 2 vector v whose ith element is replaced by 0, s−i = Yv−i and λ2−i = ||s−i ||2 . We have s−i = Yv−i = Y(v − vi ei ), = s − vi yi ,
(3)
298
V. Choulakian et al.
where yi is the ith column of Y and ei is the p dimensional vector having 0’s everywhere and 1 at the ith coordinate. λ2−i = ||s−i ||2 = ||s||2 + ||yi ||2 − 2vi yi s, 2
2
2
2
= λ2 + ||yi ||2 − 2vi λci , by (6), (7) and (8), 2
= λ2 + ||yi ||2 − 2λ|ci |, because by (8) vi = sgn(ci ),
(4)
where ci is the loading of the ith variable. The change in the average variance by the elimination of the ith variable is dif f−i =
λ2−i λ2 − , v v−i v v −i 2
=
2
2
λ2 + ||v||2 ||yi ||2 − 2 ||v||2 λ|ci | 2 ||v||2
2 (||v||2 −1)
,
(5)
2
because ||v||2 represents the number of nonnull elements in v. Equation (17) is quite illuminating: It shows that the effect of the deletion of the ith variable on the change in the average variance depends on three terms, one of them is a function of the loading of the variable. This compliments Cadima and Jolliffe (1995), who discussed the pitfalls in interpreting a pc by ignoring variables with small absolute value loadings. A criterion for the deletion of the ith variable is dif f−i > 0, or equivalently 2
2
2
d−i = λ2 + ||v||2 ||yi ||2 − 2 ||v||2 λ|ci | > 0
(18)
Equations (15) through (18) can easily be generalized for the case of deletion of two variables i and j. A criterion for the deletion of the couple (i, j) of the variables is d−(i,j) = d−i + d−j + 2vi vj yi yj (19) And in the case of k variables, the criterion becomes d−(i1 ,...,ik ) =
k j=1
d−ij +
k
vij vim yi j yim
(j=m)=1
The algorithm that we propose to delete variables is based on equations (18) and (19); that is, at most two variables are deleted at each iteration. The starting values are the centroid solutions. Step 1: Put v = vc , s = sc , λ2 = λ2c , and α = {1, ..., p}; Step 2: Calculate d−i , for i ∈ α; Step 3: Define ω = {i|d−i > 0}; Step 4: If ω is not empty, let i∗ = arg maxi∈ω d−i ; otherwise stop.
Hausman Principal Component Analysis
299
Step 5: If ω − {i∗ } is empty, put v = v−i∗ , s = s−i∗ , λ2 = λ2−i∗ , and α = α − {i∗ }, and go to Step 2; otherwise, let j ∗ = arg maxi∈ω−{i∗ } d−(i∗ ,j) , put v = v−{i∗ ,j ∗ } , s = s−{i∗ ,j ∗ } , λ2 = λ2−{i∗ ,j ∗ } , and α = α − {i∗ , j ∗ }, and go to Step 2.
4
Example
We shall reconsider the foodstuff data set of dimension 12x7 found in Lebart, Morineau and F´enelon (1982). Cadima and Jolliffe (1995) used it to show that the common practice of subjectively deleting variables with ’small’ absolute loadings in the ordinary PCA can produce misleading results. The crux of the problem is how to quantify smallness. The following table displays the first three pc’s produced by HPCA and in the parentheses are found the corresponding results given by Cadima and Jolliffe for the 2nd and 3rd pcs. The two highlighted cases and the one case in italics are illuminating. On the second pc the standardized PCA loading of the variable poultry is -0.24, and it is considered small in absolute value in comparison with the three large values (0.58, 0.41 and 0.63), and thus given 0 weight by Cadima and Jolliffe; while the corresponding standardized HPCA loading is -0.27, and its HPCA weight is -1. So in the ordinary PCA the truncated version of the component will be z2 = 0.58x1 + 0.41x2 − 0.24x5 + 0.63x6 ,
(20)
and using HPCA the truncated version of the component will be the linear composite z2 = x1 + x2 − x5 + x6 .
(21)
Equations (20) and (21) will produce very similar results, because their correlation is 0.939. The comparison of the 3rd pc results is much more interesting: The standardized PCA loading of the variable vegetables is -0.29, and it is considered small in absolute value and thus given 0 weight by Cadima and Jolliffe; while the corresponding HPCA loading is -0.32, and its HPCA weight is -1. The standardized PCA loading of the variable milk is -0.23, and the standardized HPCA loading of the variable milk is -0.15, and both are considered small and given 0 weight. Now what is interesting is that the values -0.24 and -0.23, which are quite near on two different axes of the ordinary PCA, are given different weights by HPCA, -1 for the first and 0 for the second; and this explains the complexity of the problem of the deletion of the variables. Here, we point out that the exact combinatorial algorithm, maximization of (1), and the iterative algorithm described in this paper produced the same results on this data set.
300
V. Choulakian et al.
Variables v1 , bread vegetables fruits meat poultry milk wine
5
c1 / c1 c1 v2 , c2 / c2 c2 v3 , c3 / c3 c3 0, 0.07 1, 0.55 (1, 0.58) 1, 0.45 (1, 0.40) 1, 0.33 1, 0.43 (1, 0.41) -1, -0.32 (0, -0.29) 1, 0.30 0, 0.08 (0, -0.10) -1, -0.40 (1, -0.34) 1, 0.75 0, -0.11 (0, -0.11) 0, 0.10 (0, 0.07) 1, 0.47 -1, -0.27 (0, -0.24) 1, 0.37 (1, 0.38) 0, 0.09 1, 0.64 (1, 0.63) 0, -0.15 (0, -0.23) 0, 0.06 0, 0.10 (0, 0.14) 1, 0.61 (1, 0.66)
Conclusion
We conclude by summarizing the principal results of this paper. First, HPCA is equivalent to CPCA on a well chosen subset of the variables. And this remark provides a solid theoretical support to our computational procedure. We consider HPCA an objective way of eliminating variables on the principal axes. Acknowledgement Choulakian’s research was financed by the Natural Sciences and Engineering Research Council of Canada.
References BURT, C. (1917): The Distribution And Relations Of Educational Abilities. P.S. King & Son: London. CADIMA, J., and JOLLIFFE, I.T. (1995): Loadings and correlations in the interpretation of principal components. Journal of Applied Statistics, 22, 203–214. CHIPMAN, H.A., and GU, H. (2003): Interpretable dimension reduction. To appear in Journal of Applied Statistics. CHOULAKIAN, V. (2001): Robust Q-mode principal component analysis in L1 . Computational Statistics and Data Analysis, 37, 135–150. CHOULAKIAN, V. (2003): The optimality of the centroid method. Psychometriks, 68, 473–475. CHOULAKIAN, V. (2005a): Transposition invariant principal component analysis in L1 . Statistics and Probability Letters, 71, 1, 23–31. CHOULAKIAN, V. (2005b): L1 -norm projection pursuit principal component analysis. Computational Statistics and Data Analysis, in press. Hausman, R.E. (1982): Constrained multivariate analysis. Studies in Management Sciences, 19, 137–151. JACKSON, J.E. (1991): A User’s Guide To Principal Components. Wiley: New York. JOLLIFFE, I.T. (2002): Principal Component Analysis. Springer Verlag: New York, 2nd edition. ROUSSON and GASSER (2004): Simple component analysis. Applied Statistics, 53, 539–556.
Hausman Principal Component Analysis
301
THURSTONE, L.L. (1931): Multiple factor analysis. Psychological Review, 38, 406–427. VINES, S.K. (2000): Simple principal components. Applied Statistics, 49, 441–451. WOLD, H. (1966): Estimation of principal components and related models by iterative least squares. In Krishnaiah, P.R., ed.: Multivariate Analysis, Academic Press, N.Y., pp. 391-420.
Nonlinear Time Series Modelling: Monitoring a Drilling Process Amor Messaoud, Claus Weihs, and Franz Hering Fachbereich Statistik, Universit¨ at Dortmund, Germany Abstract. Exponential autoregressive (ExpAr) time series models are able to reveal certain types of nonlinear dynamics such as fixed points and limit cycles. In this work, these models are used to model a drilling process. This modelling approach provides an on-line monitoring strategy, using control charts, of the process in order to detect dynamic disturbances and to secure production with high quality.
1
Introduction
Deep hole drilling methods are used for producing holes with a high length-todiameter ratio, good surface finish and straightness. For drilling holes with a diameter of 20 mm and above, the BTA (Boring and Trepanning Association) deep hole machining principle is usually employed. The process is subject to dynamic disturbances usually classified as either chatter vibration or spiralling. Chatter leads to excessive wear of the cutting edges of the tool and may also damage the boring walls. Spiralling damages the workpiece severely. The defect of form and surface quality constitutes a significant impairment of the workpiece. As the deep hole drilling process is often used during the last production phases of expensive workpieces, process reliability is of primary importance and hence disturbances should be avoided. In this work, we used exponential autoregressive (ExpAr) time series models to model the drilling process. This modelling approach provides an on-line monitoring strategy, using control charts, of the process in order to detect chatter vibration as early as possible.
2
Amplitude-dependent Exponential Autoregressive (ExpAr) Time Series Models
ExpAr models are introduced in an attempt to construct time series models which reproduce certain features of nonlinear random vibration theory, see Haggan and Ozaki (1981). An ExpAr model is given by 2 2 xt = φ1 + π1 e−γxt−1 xt−1 + . . . + φp + πp e−γxt−1 xt−p + εt , (1) where {εt } is a sequence of i.i.d. random variables, usually with zero mean and finite variance, γ, φi , πi , i = 1, . . ., p, are constants, p is the model order. The
Nonlinear Time Series Modelling: Monitoring a Drilling Process
303
autoregressive (AR) coefficients of the model are made to be instantaneously dependent on the state xt−1 . They change from {φi + πi } to {φi } as |xt−1 | changes from zero to +∞. The nonlinear coefficient γ acts as a scaling factor. 2 It modifies the effect of xt−1 in the term e−γxt−1 . Haggan and Ozaki (1981) showed that the ExpAr model exhibits a limit cycle behavior under the following conditions i) All the roots of the characteristic equation λp − φ1 λp−1 − φ2 λp−2 − . . . − φp = 0. lie inside the unit circle. Therefore xt starts to damp out when | xt−1 | becomes too large. ii) Some roots of the characteristic equation λp − (φ1 + π1 )λp−1 − (φ2 + π2 )λp−2 − . . . − (φp + πp ) = 0. lie outside the unit circle. Therefore xt starts to oscillate and diverge for small | xt−1 |. The results of these two effects are expected to produce a sort of self excited oscillation. The above two conditions are necessary for the existence of a limit cycle but not sufficient. A sufficient condition is iii)
1−
p i=1
φi
/
p
πi > 1 or < 0.
i=1
The condition (iii) guarantees that a fixed point does not exist for the ExpAr model. Some ExpAr models without satisfying condition (iii) still have a limit cycle. Ozaki (1982) noted that this is because the fixed points themselves of the model are unstable. He gave a condition to check whether the singular points are stable or not whenever condition (iii) is unsatisfied.
3
Estimation of the ExpAr Model
The maximum likelihood estimate of γ is obtained by minimizing the variance of the prediction errors, see Shi et al. (2001). Such estimation is a commonly time consuming nonlinear optimization procedure. Moreover, it can be proved that the objective function for the nonlinear coefficient γ is not convex and multiple local optima may exist. Therefore, there is no guarantee that a derivative-based method will converge to the global optimum. To overcome this problem, a straightforward estimation procedure was proposed by Haggan and Ozaki (1981). Shi and Aoyama (1997) and Baragona et al. (2002) used a genetic algorithm to estimate the model parameters. However,
304
A. Messaoud et al.
these two methods involve computation difficulties and are not suitable for use in manufacturing systems (real-time), where CPU-time and memory are important. The important task of a real-time estimation procedure is the fast determination of the nonlinear coefficient γ. The estimation of the other coefficients {φi , πi }, i = 1, 2, . . ., p, in the model is only a linear least squares problem whenever γ is determined. Shi et al. (2001) proposed a heuristic determination of the nonlinear coefficient γ from the original data set. They defined logδ γˆ = − , (2) max x2i
where δ is a small number (e. g., 0.0001), 1 ≤ i < N and N is the length 2 of the data series. As mentioned, the AR coefficients {φi + πi e−γxt−1 }, and hence the roots of the characteristic equation, of the ExpAr time series model are made to be instantaneously dependent on the state xt−1 . In terms of the mechanism of the ExpAr time series model to reveal the limit cycle, the scaling parameter γ takes the role of adjusting the instantaneous roots of the model. Using equation 2, if the observation xt−1 is far away from 2 the equilibrium point, e−ˆγ xt−1 = δ. Therefore, the AR coefficients become equal to {φi } and the resulting model has all roots less than unit to force the next state xt not to diverge further. Moreover, if the observation xt−1 moves to zero, the AR coefficients become equal to {φi + πi }. Therefore, the instantaneous model may have some roots outside the unit circle to force the next state to increase. For further details, see Shi et al. (2001).
4
Modelling the Drilling Process
In order to study the dynamics of the process, several drilling experiments are conducted and several on-line measurements were sampled, see Weinert et al. (2002). Chatter is easily recognized in the on-line measurements by a fast increase of the dynamic part of the torque, force and acceleration signals. However, the drilling torque measurements yield the earliest and most reliable information about the transition from stable operation to chatter. Weinert et al. (2002) modelled the transition from stable drilling to chatter vibration by a Hopf bifurcation in a van der Pol equation. Therefore, the drilling torque are described by x ¨(t) + h(t) b2 − x(t)2 x(t) ˙ + w2 x(t) = W (t), (3) where x(t) is the drilling torque, b and w are constants, h(t) is a bifurcation parameter and W (t) is a white noise process. In this case, Hopf bifurcation occurs in the system when a stable fixed point becomes unstable to form a limit cycle, as h(t) vary from positive to negative values. As mentioned, ExpAr time series models are able to reveal complex nonlinear dynamics such as singular point and limit cycle. Therefore, we propose
Nonlinear Time Series Modelling: Monitoring a Drilling Process
305
Fig. 1. Time series of the drilling torque
to use the ExpAr time series models to describe the drilling torque and to set up a monitoring strategy based on control charts. 4.1
Experimental Results
The ExpAr model is used to fit the drilling torque moments in an experiment with feed f = 0.231 mm, cutting speed vc = 69 m/min and amount of oil V˙ oil = 229 l/min. For more details see Weinert et al. (2002). The data are recorded with a sampling rate of v = 20000 Hz and consists on 7131136 observations, see Figure 1. In this experiment the transition from stable operation to chatter occurs before depth 340 mm. Indeed, by eye inspection, the effect of chatter in this experiment is apparent on the bore hole wall after depth 340 mm. For the problem of on-line monitoring of the drilling process, a common way is to segment on-line measurements of the drilling torque. Therefore, the data are divided into segments of length 4096, which is used by Theis (2004) to calculate the periodograms. In each segment, the ExpAr(p) time series model is fitted to centered data. The parameters are estimated using the real-time estimation procedure with δ = 0.0001 in equation (2). A time lag p= 40 is selected. It is a reasonable choice but not optimal. For further details, see Messaoud et al. (2005). 4.2
Diagnostic Checks
For model diagnostic, the residuals are plotted against hole depth in mm in Figure 2. We also check to see whether the errors are probably centered, symmetric and Gaussian. Figures 3 a and b show the histograms of the errors over two segments during stable drilling and chatter. They have a symmetric shape around zero and Gaussian appearance. However, the null hypothesis
306
A. Messaoud et al.
Fig. 2. Plot of the residuals
Observation number 3559424- 3563520 (Hole depth 250 mm)
(a)
Observation number 4993024- 4997120 (Hole depth 350 mm)
(b)
Fig. 3. Histograms of the predicted errors (a) before chatter (b) after chatter
of normality of the residuals is rejected in all time segments using the Kolmogorov Smirnov test. This is explained by the presence of outliers. As a final check, the fitted model is simulated using the estimated coefficients and residual variance. The first p = 40 values of the drilling torque in each segment are used as initial values. In fact, a model which cannot reproduce a similar series by simulation is certainly not interesting to statisticians and engineers. The results show that the simulated values behave similar to the observed data, see Messaoud et al. (2005). In conclusion, the estimated ExpAr(40) model provides a good fit to the drilling torque measurements.
Nonlinear Time Series Modelling: Monitoring a Drilling Process
5 5.1
307
Monitoring the Process ExpAr(40) Time Series based Control Charts
Usually, time series based control charts are used to monitor the residuals. That is, a time series model is used to fit the data and to calculate the residuals. In this work, more than 7000000 are available. This causes the inapplicability of monitoring the residuals. Therefore, we propose to monitor the series {ˆ σε2 } using the nonparametric exponentially weighted moving average (NEWMA) control chart proposed by Hackl and Ledolter (1992). Note that in the following, the index t refers to the segment number t of length 4096. For this control chart the sequential 2 2 2 rank St∗ is the rank of σ ˆε,t among σ ˆε,t−m , . . ., σ ˆε,t−1 given by St∗ = 1 +
t
2 2 I(ˆ σε,t >σ ˆε,i ),
(4)
i=t−m+1
where I(.) is the indicator function. For tied observations, we used the midrank, see Gibbons and Chakraborti (1992). The standardized sequential rank Stm is given by 2 m+1 m ∗ St = St − . (5) m 2
The control statistic Tt is the EWMA of standardized ranks, computed as follows Tt = max{A, (1 − λ)Tt−1 + λStm }, t = 1, 2, . . ., where A is a lower reflection boundary, T0 is a starting value, usually set equal to zero, and 0 < λ ≤ 1. The process is considered in-control as long as Tt < h, where h > 0 is an upper control limit. Note that the upper-sided NEWMA control chart is considered because the statistic Stm is “higher the better”. Indeed, a decrease in σ ˆε2 means a process improvement. The parameters of the control chart are selected according to a performance criteria of the chart. Usually, the performance of control charts is evaluated by the average run length (ARL). The run length is defined as the number of observations that are needed to exceed the control limit for the first time. The ARL should be large when the process is statistically in-control (in-control ARL) and small when a shift has occurred (out-of-control ARL). Messaoud et al. (2005) proposed to use an integral equation to approximate the in-control ARL of the NEWMA control chart. 5.2
Experimental Results
In this section, the series {ˆ σε2 } is monitored using different EWMA control charts. The parameters of these charts are selected so that all the charts have the same in-control ARL equal to 500. This choice should not give
308
A. Messaoud et al.
Hole Depth Observation Monitoring σ ˆ e2 (mm) number λ 0.1 0.3 0.5 ≤32 ≤107 0 0 0 32-50 108-117 61 27 18 50-75 118-150 17 0 1 75-100 151-249 0 0 0 100-125 250-366 0 0 0 125-150 370-416 1 0 0 150-175 417-665 0 0 0 175-200 666-832 0 0 0 200-225 833-849 0 0 0 225-250 850-865 14 0 0 250-275 866-966 0 0 0 275-300 966-999 0 0 0 300-325 850-865 37 33 25 325-350 866-966 56 5 2
Total
186 65
46
Table 1. Out-of-control signals of the different control charts (m=100)
a lot of false alarm signals because all control charts are applied to 1200 observations. For the smoothing parameters, we used λ= 0.1, 0.3 and 0.5. The corresponding values for h are respectively 0.330, 0.608 and 0.769. The values of the reflection boundaries A are respectively −0.330, −0.608, −0.769. The 2 2 m = 100 recent observations σ ˆε,t−100 , . . ., σ ˆε,t−1 are considered as reference sample. A larger sample cannot be used because the monitoring procedures should start before depth 35 mm (data segment 120). In fact, chatter may be observed after that depth because the guiding pads of the BTA tool leave the starting bush, this is discussed next. Table 1 shows the results for depth ≤ 350 mm. Table 1 shows that all control charts signal at 32 ≤ depth ≤ 35 mm. In fact, it is known that approximately at depth 32 mm the guiding pads of the BTA tool leave the starting bush, which induces a change in the dynamics of the process. From previous experiments, the process has been observed to either stay stable or start with chatter vibration. The three control charts produced many out-of-control signals at depth 300 ≤ depth ≤ 325 mm. Indeed, the transition from stable drilling to chatter vibration have started after approximately at depth 300 mm. Therefore, in this experiment chatter vibration may be avoided if corrective actions are taken after these signals.
6
Conclusion
This work is an attempt to integrate nonlinear time series and control charts in order to monitor a drilling process. The results show the potential use
Nonlinear Time Series Modelling: Monitoring a Drilling Process
309
of this strategy. The future research should focus on the estimation of the coefficients and choice of the time lag p of the ExpAr time series model. Moreover, a practical procedure to select the control charts parameters (λ, h and A) is needed. This issue is not considered in this work.
Acknowledgements This work has been supported by the Graduate School of Production Engineering and Logistics at the university of Dortmund and the Collaborative Research Centre “Reduction of Complexity in Multivariate Data Structures” (SFB 475) of the German Research Foundation (DFG).
References BARAGONA, R.; BATTAGLIA, F.; and CUCINA, D. (2004): A Note on Estimating Autoregressive Exponential Models. Quaderni di Statistica, 4, 1–18. GIBBONS, J. D. and CHAKRABORTI, S. (1992): Nonparametric Statistical Inference. 3rd ed. Marcel Dekker, New York, NY. HACKL, P. and LEDOLTER, J. (1992): A New Nonparametric Quality Control Technique. Communications in Statistics-Simulation and Computation 21, 423–443. HAGGAN, V. and OZAKI, T. (1981): Modelling Nonlinear Random Vibration Using Amplitude Dependent Autoregressive Time Series Model. Biometrika, 68, 189–196. MESSAOUD, A.; W.; WEIHS, C.; and HERING, F. (2005): Modelling the Nonlinear Time Varying Dynamics of a Drilling Process. Technical Report of SFB 475, University of Dortmund. OZAKI, T. (1982): The Statistical Analysis of Perturbed Limit Cycle Processes Using Nonlinear time Series Models. Journal of Time Series Analysis, 3, 29– 41. SHI, Z. and AOYAMA, H. (1997): Estimation of the Exponential Autoregressive time Series Model by Using the Genetic algorithm. Journal of Sound and Vibration, 205, 309–321. SHI, Z.; TAMURA, Y.; and OZAKI, T. (2001): Monitoring the Stability of BWR Oscillation by Nonlinear Time Series Modelling. Annals of Nuclear energy, 28, 953–966. THEIS, W. (2004): Modelling Varying Amplitudes. PhD dissertation, Department of Statistics, University of Dortmund. URL http://eldorado.unidormund.de:8080/FB5/ls7/forschung/2004/Theis ¨ WEINERT, K.; WEBBER, O.; HUSKEN, M.; MEHNEN, J.; and THEIS, W. (2002): Analysis and prediction of dynamic disturbances of the BTA deep hole drilling process. In: R. Teti (Ed.): Proceedings of the 3rd CIRP International Seminar on Intelligent Computation in Manufacturing Engineering. 297–302.
Word Length and Frequency Distributions in Different Text Genres
Gordana Antić¹, Ernst Stadlober¹, Peter Grzybek², and Emmerich Kelih²
¹ Department of Statistics, Graz University of Technology, A-8010 Graz, Austria
² Department for Slavic Studies, Graz University, A-8010 Graz, Austria
Abstract. In this paper we study word length frequency distributions of a systematic selection of 80 Slovenian texts (private letters, journalistic texts, poems and cooking recipes). The adequacy of four two-parametric Poisson models is analyzed according to their goodness-of-fit properties, and the corresponding model parameter ranges are checked for their suitability to discriminate the text sorts given. As a result we obtain that the Singh-Poisson distribution seems to be the best choice for both problems: first, it is an appropriate model for three of the text sorts (private letters, journalistic texts and poems); and second, the parameter space of the model can be split into regions constituting all four text sorts.
1
Text Base
The relevance of word length studies in general, and for purposes of text classification in particular, has recently been studied in detail and is well documented – cf. Grzybek (ed.) (2005), Antić et al. (2005), Grzybek et al. (2005). On the basis of multivariate analyses, convincing evidence has been obtained that word length may play an important role in the attribution of individual texts to specific discourse types, rather than to individual authors. The present study continues this line of research, in so far as the word length frequency distributions of 80 Slovenian texts are analyzed. Yet, this study goes a step further in a specific direction. Most studies in this field, particularly the ones mentioned above, have thus far conducted discriminant analyses on the basis of characteristics derived from the empirical frequency distributions; in this paper, however, an attempt is made to introduce an additional new aspect to this procedure, by carrying out discriminant analyses based on the parameters of a theoretical discrete probability model fitted to the observed frequency distribution. The texts which serve as a basis for this endeavor represent four different text types (private letters, journalistic texts, poems, and cooking recipes), twenty texts of each text type being analyzed. These texts have been chosen in a systematic fashion: based on previous insight from the studies mentioned above (cf. Grzybek et al. 2005), the specific selection has been deliberately made in order to cover the broad textual spectrum, or at least its extreme realizations. Table 1 shows the composition of the sample. The paper aims at giving answers to the following questions.
AUTHOR             TEXT TYPE            AMOUNT
Ivan Cankar        Private letters          20
Journal Delo       Journalistic text        20
Simon Gregorčič    Poems                    20
anonymous          Cooking recipes          20
Total                                       80

Table 1. Text Sample: 80 Slovenian Texts
a. Can the word length frequency distributions of our sampled texts be theoretically described, and if so, is one discrete probability model sufficient to describe them, or is more than one model needed?
b. Based on the answer to the first set of questions, it is interesting to find out whether one can discriminate the texts by using the parameters of the given model(s) as discriminant variables. In case of a positive answer, this would give us the possibility to assign a given text to a text group by classifying the parameter values of the fitted model.
Before going into the details, it should be mentioned that word length is measured by the number of syllables per word. Since our texts are taken from a pre-processed corpus (Graz Quantitative Text Analysis Server QuanTAS), the length of a word, defined as an orthographic-phonological unit (cf. Antić et al. 2005), can be automatically analyzed using specially designed programs.1
2
Searching for a Model
In finding a suitable model for word length frequency distributions, an ideal solution for future interpretations of the model parameters would be the existence of a unique model, appropriate for all analyzed texts of the text basis under study. The totality of all texts of a given natural language would be an extreme realization of this procedure. Furthermore, it is important to find the simplest model possible, i.e., a model with a minimal number of parameters (model of low order). If more than one model is necessary for the description of a particular text sample, it may be important to establish the connections between these models, and to find out whether they can be derived as special cases of one unifying, higher-order model. 1
The text base is part of the text database developed in the interdisciplinary Graz research project on “Word Length Frequencies in Slavic Texts”. Here, each text is submitted to unified tagging procedures (as to the treatment of headings, numbers, etc.). For details, see http://www-gewi.uni-graz.at/quanta.
Fig. 1. Results of Fitting Four Two-Parameter Poisson Models to 80 Slovenian Texts
Due to the fact that we are concerned with words that have at least one syllable, these models will be considered to be 1-displaced. In the subsequent discussion we restrict our study to generalizations of the 1-displaced Poisson distribution, having two parameters each. Based on the observation that the standard Poisson model with one parameter is insufficient to describe all of our texts, we investigate four different two-parametric generalizations which proved to be adequate models for specific texts of several languages (cf. Best 1997): (a) Cohen-Poisson, (b) Consul-Jain-Poisson, (c) Hyper-Poisson, and (d) Singh-Poisson. In order to test the goodness of fit of these probability models, we apply the standardized discrepancy coefficient C = χ²/N, where N is the text length (number of words in the text). As an empirical rule of thumb we consider the fit of the model (i) as not appropriate in case of C > 0.02, (ii) as sufficient if 0.01 < C ≤ 0.02, and (iii) as extremely good if C ≤ 0.01. The result of fitting the four models to each of the 80 members of our text base is given in Figure 1, where geometrical symbols represent the different models. The horizontal line in the graphical display is the reference bound C = 0.02. It can be observed that for the text group of recipes, the values of C are far beyond the reference line for all probability models; therefore none of the models is appropriate for recipes.2
2 As more detailed studies have shown, recipes are generally quite “resistant” to modelling, and cannot be described by other models either. This is in line with linguistic research emphasizing the particular text structure of recipes.
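The discrepancy coefficient and the rule of thumb above are straightforward to compute; the following is a generic sketch (not the authors' code) for an observed word length distribution and fitted class probabilities.

def discrepancy_coefficient(observed, fitted_probs):
    """C = chi^2 / N for observed class frequencies and fitted model probabilities."""
    N = sum(observed)
    chi2 = sum((o - N * p) ** 2 / (N * p)
               for o, p in zip(observed, fitted_probs) if p > 0)
    return chi2 / N

def fit_quality(C):
    # Empirical rule of thumb used in the text.
    if C <= 0.01:
        return "extremely good"
    if C <= 0.02:
        return "sufficient"
    return "not appropriate"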
Additionally, Fig. 1 shows that the Consul-Jain-Poisson model is appropriate neither for journalistic texts nor for poems. Compared to this, the Cohen-Poisson model provides more or less good fits for private letters, journalistic texts, and poems, but further analyses showed that this model is not able to discriminate journalistic texts from private letters. Consequently, we now restrict our attention to the Hyper-Poisson and Singh-Poisson distributions only.
2.1 The 1-Displaced Hyper-Poisson (a, b) Distribution
This distribution has repeatedly been discussed as a model for word length frequency and sentence length frequency distributions. It is a generalization of the Poisson distribution with parameter a, obtained by introducing a second parameter b. In its 1-displaced form, the Hyper-Poisson distribution is given as
$$P_x = \frac{a^{x-1}}{b^{(x-1)}\,{}_1F_1(1;b;a)}, \qquad x = 1, 2, 3, \ldots, \quad a > 0,\ b > 0, \qquad (1)$$
where ${}_1F_1(1;b;a)$ is the confluent hypergeometric series with first argument 1 and $b^{(x-1)} = b(b+1)\cdots(b+x-2)$ (cf. Wimmer/Altmann 1999, 281). The first raw and the second central moment of the 1-displaced Hyper-Poisson distribution are
$$\mu = E(X) = a + (1-b)(1-P_1) + 1, \qquad \mathrm{Var}(X) = (a+1)\mu + \mu(2-\mu-b) + b - 2. \qquad (2)$$
The estimates $\bar{x}$ and $m_2$ can be used for calculating the unknown parameters a and b as
$$\hat{a} = \bar{x} - (1-\hat{b})(1-\hat{P}_1) - 1, \qquad \hat{b} = \frac{\bar{x}^2 - m_2 + \bar{x}(1+\hat{P}_1) - 2}{\bar{x}\hat{P}_1 - 1}. \qquad (3)$$
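A numerical sketch of (1)–(3), truncating the confluent hypergeometric series and reading m₂ as the empirical second moment Σx²/N (illustrative only, not the authors' implementation):

def hyperpoisson_pmf(x, a, b, terms=200):
    """1-displaced Hyper-Poisson probability P_x for x = 1, 2, 3, ... (eq. (1))."""
    # 1F1(1; b; a) = sum_{k>=0} a^k / (b)_k with rising factorial (b)_k
    f11, term = 0.0, 1.0
    for k in range(terms):
        f11 += term
        term *= a / (b + k)
    num = 1.0                      # a^(x-1) / b^(x-1 rising)
    for k in range(x - 1):
        num *= a / (b + k)
    return num / f11

def hyperpoisson_estimates(frequencies):
    """Moment estimates (3); frequencies[i] = number of words with i+1 syllables."""
    N = sum(frequencies)
    xbar = sum((i + 1) * f for i, f in enumerate(frequencies)) / N
    m2 = sum((i + 1) ** 2 * f for i, f in enumerate(frequencies)) / N
    P1 = frequencies[0] / N
    b_hat = (xbar ** 2 - m2 + xbar * (1 + P1) - 2) / (xbar * P1 - 1)
    a_hat = xbar - (1 - b_hat) * (1 - P1) - 1
    return a_hat, b_hat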
Detailed analysis shows that the fits of the Hyper-Poisson distribution to some of the journalistic texts are not appropriate. As listed in Table 2, only five of twenty journalistic texts can be adequately described by the Hyper-Poisson model. A closer look at the structure of the journalistic texts shows that the frequencies of 2- and 3-syllable words tend to be almost the same; however, a good fit of the Hyper-Poisson model demands a rather monotonically decreasing trend of these frequencies. This may be illustrated by the following two examples. Let us consider two typical journalistic texts from the journal Delo (# 29 and # 32). The observed word length frequencies of these two texts are represented in Fig. 2. For one of the two texts (# 32), we obtain a good fit (C = 0.0172), for the other one a bad fit (C = 0.0662). For each of these two texts, we independently simulated ten artificial texts from the Hyper-Poisson distribution
Text    â      b̂      C         Text    â      b̂      C
21     2.14   2.12   0.06       31     2.95   3.17   0.04
22     2.81   3.09   0.03       32     2.85   3.66   0.02
23     2.60   3.31   0.04       33     2.06   1.81   0.06
24     2.16   2.40   0.05       34     2.53   2.68   0.04
25     3.02   3.25   0.03       35     3.10   3.67   0.02
26     2.11   2.36   0.03       36     2.58   2.77   0.06
27     2.82   3.07   0.02       37     2.75   3.52   0.02
28     3.04   3.46   0.02       38     1.81   1.68   0.04
29     2.26   2.33   0.07       39     2.53   2.52   0.05
30     2.09   1.87   0.06       40     1.82   1.82   0.04

Table 2. Fitting the Hyper-Poisson Distribution to Journalistic Texts
Fig. 2. Word Length Frequencies for Texts #29 and #32 (from Delo)
with parameter combinations fluctuating around the estimated parameters of the given texts. These ten simulations as well as the empirical (black line) and theoretical (dashed line) values are plotted in the same graph (see Figure 3) to exhibit the random effect and to study the distributional characteristics of an “ideal” text following the Hyper-Poisson distribution. Figure 3 shows that the monotonically decreasing trend is essential for modelling texts with the Hyper-Poisson distribution, but this criterion is not satisfied in the case of the text with the bad fit (C = 0.0662).
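Such artificial texts can be generated by inverse-CDF sampling from the fitted distribution; a minimal sketch (illustrative only) that draws N word lengths from a 1-displaced Hyper-Poisson with parameters (a, b):

import random

def sample_hyperpoisson(a, b, N, max_len=50, rng=random):
    """Draw N word lengths from the 1-displaced Hyper-Poisson(a, b) by inverse CDF."""
    # P_1 = 1 / 1F1(1; b; a); successive probabilities obey P_{x+1} = P_x * a / (b + x - 1)
    f11, term = 0.0, 1.0
    for k in range(500):
        f11 += term
        term *= a / (b + k)
    lengths = []
    for _ in range(N):
        u, x, p = rng.random(), 1, 1.0 / f11
        cum = p
        while u > cum and x < max_len:
            p *= a / (b + x - 1)
            x += 1
            cum += p
        lengths.append(x)
    return lengths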
2.2 The 1-Displaced Singh-Poisson (a, α) Distribution
The next model to be tested is the 1-displaced Singh-Poisson model, which introduces a new parameter α changing the relationship between the probability of the first class and the probabilities of the other classes. It is given as
$$P_x = \begin{cases} 1 - \alpha + \alpha e^{-a}, & x = 1, \\[4pt] \dfrac{\alpha\, a^{x-1} e^{-a}}{(x-1)!}, & x = 2, 3, \ldots, \end{cases}$$
Fig. 3. Simulating Hyper-Poisson Distributions; (a) Text # 29 (C ≈ 0.07), parameters (2.26; 2.33); (b) Text # 32 (C ≈ 0.02), parameters (2.85; 3.66)
where $a > 0$ and $0 \le \alpha \le 1/(1 - e^{-a})$ (cf. Wimmer/Altmann 1999: 605). The first raw and the second central moment of the 1-displaced Singh-Poisson distribution are
$$\mu = E(X) = \alpha a + 1, \qquad \mathrm{Var}(X) = \alpha a (1 + a - \alpha a).$$
The estimated parameters $\hat{a}$ and $\hat{\alpha}$ are functions of the empirical moments of the distribution, given as
$$\hat{a} = \frac{m_{(2)}}{\bar{x} - 1} - 2, \qquad \hat{\alpha} = \frac{(\bar{x}-1)^2}{m_{(2)} - 2\bar{x} + 2},$$
where $m_{(2)}$ is an estimation of the second factorial moment $\mu_{(2)}$. The 1-displaced Singh-Poisson model proves to be appropriate for the majority of private letters and journalistic texts. In the case of the poems, where the fitting results are less convincing, we obtain α ≈ 1 for all twenty texts analyzed; this is a clear indication that for poems, even the 1-displaced Poisson model seems to be satisfactory. On the other hand, for the group of recipes, this model is not appropriate, due to peculiar relationships between the frequencies: in some cases two or more frequency classes are nearly equal, in other cases there are tremendous ups and downs of frequency classes; the model, however, demands rather monotone relationships between frequency classes.
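A sketch of the corresponding computations, reading m₍₂₎ as the empirical second factorial moment Σ x(x−1)/N of the 1-displaced lengths (illustrative, not the authors' code):

import math

def singhpoisson_pmf(x, a, alpha):
    """1-displaced Singh-Poisson probability P_x for x = 1, 2, 3, ..."""
    if x == 1:
        return 1 - alpha + alpha * math.exp(-a)
    return alpha * a ** (x - 1) * math.exp(-a) / math.factorial(x - 1)

def singhpoisson_estimates(frequencies):
    """Moment estimates of (a, alpha); frequencies[i] = number of words with i+1 syllables."""
    N = sum(frequencies)
    xbar = sum((i + 1) * f for i, f in enumerate(frequencies)) / N
    m2f = sum((i + 1) * i * f for i, f in enumerate(frequencies)) / N  # second factorial moment
    a_hat = m2f / (xbar - 1) - 2
    alpha_hat = (xbar - 1) ** 2 / (m2f - 2 * xbar + 2)
    return a_hat, alpha_hat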
3
Interpretation of Parameters
Since the 1-displaced Singh-Poisson distribution turns out to be an appropriate model for three of the four text groups (private letters, journalistic
Fig. 4. Parameter Values (left) and Regions of the Parameters a and α (right) of the Singh-Poisson Model Fitted to Texts of Four Text Types
Conf. interval    Letters            Journalistic       Poems              Recipes
â                 [0.914; 0.954]     [1.602; 1.703]     [0.705; 0.796]     [1.629; 1.756]
α̂                 [0.880; 0.909]     [0.801; 0.822]     [0.972; 1.009]     [0.926; 0.952]

Table 3. Confidence Intervals for Both Singh-Poisson Parameters
texts, and poems), the next step includes an analysis of possible connections between the parameters of this model. Figure 4 (left panel) represents the results of this analysis as a scatter plot: the estimated parameter α̂ is represented by circles, the estimated parameter â by triangles. It is evident that each group of texts leads to a different pattern of the parameters: in case of private letters, both parameters are very close to each other in a very small interval [0.88; 0.95]; in case of journalistic texts, as opposed to this, they are quite distant from each other; and for poems, their placement on the scatter plot is reversed with respect to the order in the previous two cases, as can be seen in Table 3 and Fig. 4. The parameter values for the recipes are also added to the same plot, irrespective of the fact that there is a bad fit. One can observe that they are placed in a specific parameter region. According to a, there is an overlapping of the confidence intervals of journalistic texts and recipes; with respect to α, there is an overlapping of poems and recipes. However, as shown in Figure 4 (right panel), both parameters taken together lead to a good discrimination of all four text groups, regardless of the fact that the model fit for recipes is not appropriate. One can observe four homogeneous groups, indicating the power of the parameters (a, α) of the 1-displaced Singh-Poisson distribution for the classification of the four text types.
4
Conclusions
In this study, 80 Slovenian texts from four different text types are analyzed: private letters, journalistic texts, poems, and cooking recipes. In trying to find a unique model within the Poisson family for all four groups, Poisson models with two parameters proved to be adequate for modelling three out of the four text types. The relatively simple 1-displaced Singh-Poisson distribution yielded the best results for the first three text groups. However, texts belonging to the group of cooking recipes have a peculiar structure which cannot be modelled within the Poisson family, since this family requires a certain monotonic relationship between frequency classes. Different texts from a given language (in our case Slovenian) can thus be compared and distinguished on the basis of the specific model parameters. As an additional result, we demonstrated that, at least in our case, the parameters of the 1-displaced Singh-Poisson distribution are suited to discriminate between all four text sorts. This discrimination yields better results than the other three Poisson models studied.
References
ANTIĆ, G., KELIH, E. and GRZYBEK, P. (2005): Zero-syllable Words in Determining Word Length. In: P. Grzybek (Ed.): Contributions to the Science of Language. Word Length Studies and Related Issues. Kluwer, Dordrecht, 117–157.
BEST, K.-H. (Ed.) (1997): The Distribution of Word and Sentence Length. WVT, Trier. [= Glottometrika; 16]
GRZYBEK, P. (Ed.) (2005): Contributions to the Science of Language. Word Length Studies and Related Issues. Kluwer, Dordrecht.
GRZYBEK, P., STADLOBER, E., KELIH, E. and ANTIĆ, G. (2005): Quantitative Text Typology: The Impact of Word Length. In: C. Weihs and W. Gaul (Eds.): Classification – The Ubiquitous Challenge. Springer, Heidelberg, 53–64.
KELIH, E., ANTIĆ, G., GRZYBEK, P. and STADLOBER, E. (2005): Classification of Author and/or Genre? The Impact of Word Length. In: C. Weihs and W. Gaul (Eds.): Classification – The Ubiquitous Challenge. Springer, Heidelberg, 498–505.
WIMMER, G. and ALTMANN, G. (1999): Thesaurus of Univariate Discrete Probability Distributions. Essen.
Bootstrapping an Unsupervised Morphemic Analysis
Christoph Benden
Department of Linguistics – Linguistic Data Processing, University of Cologne, 50923 Cologne
[email protected]
Abstract. Unsupervised morphemic analysis may be divided into two phases: 1) establishment of an initial morpheme set, and 2) optimization of this generally imperfect first approximation. This paper focuses on the first phase, that is, the establishment of an initial morphemic analysis, whereby methodological questions regarding ‘unsupervision’ will be touched on. The basic segmentation algorithm employed goes back to Harris (1955). Proposals for the antecedent transformation of graphemic representations into (partial) phonemic ones are discussed, as well as the postprocessing step of reapplying the initially gained morphemic candidates. Instead of directly using numerical (count) measures, a proposal is put forward which exploits numerical interpretations of a universal morphological assumption on morpheme order for the evaluation of the computationally gained segmentations and their quantitative properties.
1
Introduction
In this paper, a bootstrapping method for unsupervised morphemic analysis and possible extensions thereof are explored. At the outset, a few words on the notion ‘unsupervised’ seem to be appropriate (section 2). The method described is ‘bootstrapping’ in the sense that it only yields a provisional list of the morphemes of a language (here: German). This is a good starting point but clearly needs refinements. This method, based on proposals by Harris (1955), is briefly introduced in section 3.3. Since the basic segmentation algorithm heavily depends on the kind of representation of words (i.e. graphemic vs. (partially) phonemic), two experiments with (partially) phonemically transformed representations have been carried out (section 3.2). After the segmentation itself (sections 3.3-3.4), a parsing and evaluation step is applied that refines the analysis and reduces the number of morpheme candidates. The preliminary results are presented in section 4, followed in section 5 by a discussion of possible refinements and evaluation processes to succeed the morphological bootstrapping.
2
Being ’unsupervised’
With regard to the qualification of an analysis as being ‘unsupervised’ or ‘knowledge-free’, to my knowledge, there does not exist any clear definition
of that notion. Within the context of ‘unsupervised (computational) morphological analysis’, one could however agree that the term implies something like ‘without (any) interference’, which can be understood quite differently. The following quotation is typical of many attempts and might be a good starting point:
[A] Given an unannotated input corpus, the algorithm [...] extracts a list of candidate content words. This is simply a list of all the alphabetic space- or punctuation-delimited strings in the corpus that have a corpus frequency below .01% of the total token count. [...] [B] We do not attempt to define a phonologically sensible distance scoring function, as this would require making assumptions about how the phonology of the target language maps onto its orthography, thus falling outside the domain of knowledge-free induction. (Baroni et al. 2003:3-4)
It is obvious that every analysis producing alternative or concurrent results needs decision procedures, as for instance stated in [A] above: a threshold is defined on word frequencies to differentiate content words from function words. While the authors take this as a valid step during an ‘unsupervised’ analysis, I would not: thresholds usually arise from the experiences of the scientist with the results of previous runs of the analysis. As I see it, decision procedures, such as defining a threshold, should follow hypotheses or guiding parameters that are themselves independently and (ideally) well founded. On the other hand, Baroni et al. (2003) regard manipulation of the symbolic representation of the data as a supervising intervention (cf. above, [B]), which again I would interpret differently. The elements in question are those of the second level of articulation (cf. Martinet 1960): simply distinctive and meaningless parts which form meaningful elements (morphemes). Transformation of the data on the basis of the mapping rules between graphemic and phonemic representation is no supervising intervention. It is a conversion of an improper (written) representation of language into a more adequate representation (phonemic, or some variation thereof), thereby retaining the original intention of the speaker. As far as I can see, there are three candidate operations on data which one could claim leave an analysis ‘unsupervised’:
• phonologically justified adaptation of the graphemic/phonemic representation (e.g. letter-to-phoneme conversions, archiphonemic representations),
• distributional parameters (ultimately the only information available to computational linguistics),
• absolute typological parameters, which seem well suited but are rare and describe trivial facts or rather tendencies.
Fig. 1. Overview of elements and succession of the segmentation process
3
The Process Chain
The analysis is implemented as a process chain, which means that the distinct parts are executed successively, with each component taking over the result processed by the previous one. The overall process is depicted in Fig. 1. The following subsections will explain the distinct components in their respective order.
3.1 Corpus Selection, Preprocessing and Indexing
The corpora used are two selections of German texts from the science section of the Frankfurter Allgemeine Zeitung (FAZ) published between 1991 and 1993. Two corpus selections were taken into account, one consisting of 1111 texts (∼500,000 tokens, ∼61,000 types), the other consisting of 100 texts (∼50,000 tokens, ∼13,000 types). In preprocessing, only words consisting of genuine German orthographic symbols are accepted for analysis. Indexing consists only of putting together a list of the words (types) and their absolute frequency.
3.2 Letter Conversion
Element conversion on the second level of articulation can be justified even within the requirements of unsupervised analysis (cf. section 2). Since the
algorithm used heavily depends on the symbolic representations of words (cf. sections 3.3-3.4) and was originally designed to deal with phonetic (or at least phonemic) representations, the first experimental step is to transform the graphemic representation into a (partially) phonemic one. The hypothesis is that phonetically more adequate representations will yield better segmentations and eventually better overall analyses. Two grapheme-to-phoneme conversions, called GermanFull and GermanLight, are actually implemented. GermanFull uses a modified version of the rule-based subsystem of the IMS German Festival1 and attempts a maximally adequate phonemic transformation. The second conversion GermanLight only reduces digraphs (<mm>,
etc.) and the trigraph <sch>, as well as geminate writings of single vowels, to single graphs. Furthermore, it tries to resolve the <h> either as an indicator of vowel length (<droht> [dro:t]) or as a (possible) glottal fricative. The error rate for GermanLight (1000 words) is 1.9% (all errors occurred in derivational or compositional forms like <er(r)egen> etc.); for GermanFull (1000 words) it is 32.7%. GermanFull is evaluated somewhat warily because phonetic errors that the system has no means of avoiding (e.g. schwas are never resolved by rule and thus will not appear) are not taken into account. The actual error rate is therefore higher. Compared to more elaborate rule-based systems (e.g. Bernstein & Pisoni 1980), which achieve up to 80% correctly converted words, work with GermanFull and its < 67.3% correct conversions is postponed, and only the results for GermanLight are taken into consideration for the time being.
3.3 Counting Successors Forward and Backward
Successor counts (SCs) and, mutatis mutandis, predecessor counts (PCs) describe the number of different letters that may follow or precede a given sequence of letters within a word, respectively. Given the inflected word <gegenständlichen>, the 100-text corpus produces the following sequence: (1)
      g   e   g   e   n   s   t   ä   n   d   l   i   c   h   e   n
SC   14  22   5   3   8   5   3   1   1   2   1   1   1   1   1   0
PC    0   1   1   1   1   1   1   1   5   6  12   4  15  11  25  16
Starting with the first letter <g>, there are, according to the data, 14 possible letters which could follow. By expanding the sequence to <ge>, 22 letters can follow this sequence. Having done this forward and backward, the above values emerge. Fig. 2 shows the outcome for <gegenständlichen>, applying the graphemic and the GermanLight representation respectively. Further details regarding sections 3.2 and 3.3 can be found in Harris (1955 and later), Hafer & Weiss (1974), Déjean (1998), Goldsmith (2001), Benden (2005). 1
A text-to-speech system available at http://www.ims.uni-stuttgart.de.
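A compact way to obtain SCs and PCs from a list of word types is shown in the following sketch (an illustration, not the original implementation; the vocabulary is assumed to be a set of corpus types):

def successor_counts(word, vocabulary):
    """SC_i = number of distinct letters that follow the prefix word[:i] in the vocabulary."""
    counts = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        followers = {w[i] for w in vocabulary if w.startswith(prefix) and len(w) > i}
        counts.append(len(followers))
    return counts

def predecessor_counts(word, vocabulary):
    """PC_i = number of distinct letters that precede the suffix word[i:] in the vocabulary."""
    reversed_vocab = {w[::-1] for w in vocabulary}
    return successor_counts(word[::-1], reversed_vocab)[::-1]

# Usage (toy vocabulary):
# successor_counts("gegen", {"gegen", "geben", "gabe", "gegner"})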
Fig. 2. SC and PC with graphemic (left) and GermanLight (right) representation

Corpus                                        1111 texts    100 texts
Tokens                                        ∼500,000      ∼50,000
Types                                         ∼61,000       ∼13,000
Morphemes, graphemic                          28,484        8,613
Morphemes, GermanLight                        28,621        8,708
Morphemes, graphemic [source tokens > 10]     4,697         918
Morphemes, GermanLight [source tokens > 10]   4,610         871
Morphemes, graphemic [source types > 10]      1,587         290
Morphemes, GermanLight [source types > 10]    1,503         255

Table 1. Proportions of word types and morphemes after segmentation
3.4
Segmentation
In order to justify segmentation within unsupervised analysis, general distributional (or information theoretical) considerations come into play. Since only a small fraction of possible letter combinations is actually in use, the emerging patterns of SCs and PCs obtain significance: within a morpheme, the SC/PC is generally expected to decrease, while at the end of a morpheme the counts generally increase. Hence morphemic breaks are determined after local maxima to the right (SC) or left (PC) of the actual letter (cf. Benden 2005 for more detail). Obviously, SCs emphasize ’early’ morphemes (prefixes), PCs emphasize ’late’ morphemes (suffixes).2 The reduction gained by the segmentation (without considering correctness here) might be read off from the ratio of types to purported morphemes in Table 1 (the last four proportions using a threshold are only illustrative here; they are not taken into account further). 2
Because the values of both directions overlap, only the respective higher count of both directions is taken into account. This is admittedly arbitrary and difficult to motivate independently.
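The exact directional break criterion is given in Benden (2005); purely to illustrate the idea of cutting at count maxima, a simplified version that combines SC and PC by taking the higher value (as described in the footnote above) could look as follows:

def segment(word, sc, pc):
    """Split `word` after positions where the combined count has a local maximum.

    sc, pc : successor/predecessor counts aligned with the letters of `word`.
    This is a simplification of the directional criterion used in the paper.
    """
    combined = [max(s, p) for s, p in zip(sc, pc)]
    cuts = []
    for i in range(1, len(word) - 1):
        if combined[i] > combined[i - 1] and combined[i] >= combined[i + 1]:
            cuts.append(i + 1)          # cut after letter i (0-based)
    pieces, start = [], 0
    for c in cuts:
        pieces.append(word[start:c])
        start = c
    pieces.append(word[start:])
    return pieces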
Word                   Count of source types    Sum     Count of source tokens       Sum
gegen-ständ-liCen      32-49-62                 143     94-299-129                   522
gegen-ständ-liCe-n     32-49-301-2221           2603    94-299-1623-29801            31817
gegen-ständ-liC-en     32-49-219-2749           3049    94-299-1677-23046            25116
gegen-ständ-liC-e-n    32-49-219-1277-2221      3798    94-299-1677-37324-29801      69195
...                    ...                      ...     ...                          ...

Table 2. First four tokenizations for <gegenständlichen>
3.5
Tokenization: Reapplication of the Morphemes
The morphemes gained through segmentation are reapplied in a tokenization process whereby the order of analyses is determined by the length of the leftmost morphemes. Since the status of the heuristic morphemes with respect to notions like root, prefix or suffix is unclear at this stage of the process, all possible tokenizations are produced. The analysis with the 1111-text corpus, for instance, supplies 858 possible tokenizations for and 3013 for , of which the first four are given in Table 2.
3.6 Selection
This is the most critical and as yet least explored part of unsupervised analyses in general. As can be seen in Table 2, it is not simply a matter of picking out the first analysis after a left-to-right, longest-first tokenization given a certain order. To me, it is as yet uncertain whether such an order might count as ‘supervision’ in the sense understood here, or not. The decision to simply give preference to longer morphemes during selection does not seem to be justifiable on an external (linguistic) basis. However, arbitrary orders and thresholds should, if possible, be excluded as a defining measure. Currently, quantitative interpretations of universal assumptions about morpheme order as shown in (2) are tested.
(2)   inflectional   derivational   root   derivational   inflectional
      $n_{\mathrm{infl}} > n_{\mathrm{deriv}} > n_{\mathrm{stem}} < n_{\mathrm{deriv}} < n_{\mathrm{infl}}$
On the assumption that an order of inflectional - derivational - root - derivational - inflectional3 morphemes is a universally (at least as a tendency) underlying pattern, one has an expectation as to the freedom of combination and its occurrence as a quantitative pattern. Inflectionals combine more freely than derivationals; the latter, in turn, combine more freely than roots, so that the proportions are as indicated in the second row of (2).4 Since only
3 Although a common issue in introductory lectures, the original source of the schema is difficult to trace in the literature.
4 Of course partial validity of the schema, e.g. in case of exclusive prefixation or suffixation, is also taken into account, e.g. in <sicht(root)-ung(deriv)-en(infl)>.
Total number of word types:                           13,252
Total of analyses after tokenization:                 12,243,981
Reduced analyses after selection:                     125,348
Percentage of selection with correct segmentation:    56%
Missing segmentations total:                          44%
  Missing but in discarded analyses:                  7%
  Missing but also not in discarded analyses:         37%
Error types of the 37% completely missing segmentations:
  Missing root morphemes:                             95%
  Missing derivationals:                              2.5%
  Bad transcription:                                  2.5%

Table 3. Results of selection
the potential of combination should be taken into account, only type frequencies are considered. The application of this schema works like a filter in that the number of possible analyses provided by the tokenization is reduced by a factor of about 100, cf. Table 3 for the 100-text corpus5. The one schema actually applied is not enough to yield a good precision; it only provides a reduced set of analyses. Therefore, the overall precision is 13,252 / 125,348 = 0.11, that is, for every word about 10 analyses are provided. The recall measure 56% / (100% − 37%) = 0.89, on the other hand, affirms that the schema yields a selection that serves as a reasonable filter and first approximation.
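One possible operationalization of the schema in (2) is to keep only those tokenizations whose per-morph type counts fall towards a single minimum (the presumed root) and rise again; the sketch below is such an interpretation, not the exact procedure used in the paper.

def matches_schema(type_counts):
    """True if the counts fall to a single minimum and rise again (n_infl > n_deriv > n_root < ...)."""
    if len(type_counts) == 1:
        return True
    lowest = type_counts.index(min(type_counts))
    falling = all(a >= b for a, b in zip(type_counts[:lowest], type_counts[1:lowest + 1]))
    rising = all(a <= b for a, b in zip(type_counts[lowest:], type_counts[lowest + 1:]))
    return falling and rising

def select(tokenizations):
    """Filter candidate tokenizations, each given as (morphs, per-morph type counts)."""
    return [(morphs, counts) for morphs, counts in tokenizations if matches_schema(counts)]

# e.g. select([("gegen-ständ-liCen".split("-"), [32, 49, 62]), ...])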
4
Discussion
The first observation is that an (at least partial) conversion from graphemic to phonemic representations does not provide significantly better overall results, although a number of individual segmentations (cf. Fig. 2) do actually improve. Even the list of morphemes provided is, for the most frequent ones (> 10 types), remarkably similar. Different results are expected from an improved version of GermanFull, although this remains to be attested. The selection component achieves a reduction of possible analyses by a factor of about 100 by using a linguistically founded and quantitatively interpreted morphological schema. Because of its general linguistic foundation, it is not an arbitrarily introduced threshold that itself needs an explanation. The most prominent errors during segmentation, tokenization and selection can be traced back to a missing word type providing a root or affix, as in the example in Table 4, where the suffix (a postvocalic form of the derivational suffix ) could not be established during the analysis. A word type in the corpus would have yielded the missing form. This could be interpreted as a consequence of the sparse data problem and is one 5
The percentages are based on a sample of 100 words that was chosen arbitrarily from the whole list of word types.
Word             Count of source types    Sum     Count of source tokens    Sum
amerika-niSe     13,25                    38      61,143                    204
ameri-ka-niSe    18,548,25                591     531,2945,143              3619
ameri-k-aniSe    18,104,5                 127     531,518,14                1063
ameri-k-aniS-e   18,104,1,1277            1400    531,518,1,37324           38374
...              ...                      ...     ...                       ...

Table 4. First four tokenizations for <amerikanische>
major problem of unsupervised morphological analysis. The types provided by the initial corpus do not supply an analysis with a sufficient range of morphological diversity. There are two possible solutions to this problem:
• improvements on the level of morphology itself (cf. Goldsmith 2001, Baroni 2003, i.e. extensions to linguistically founded guidelines along the lines of section 3.6),
• the possibility of integrating hypotheses originating from other levels of analysis (especially phonology and syntax) into the actual analysis, with a recursive adaptation of every affected form.
The latter approach has, as far as I can see, not been elaborated on and will be the main topic of my future research.
References
BARONI, M. (2003): Distribution-driven Morpheme Discovery: A Computational Experimental Study. Yearbook of Morphology 2003, 213–248.
BENDEN, C. (2005): Automated Detection of Morphemes Using Distributional Measurements. In: C. Weihs and W. Gaul (Eds.): Classification – The Ubiquitous Challenge. Proceedings of the 28th Annual Conference of the Gesellschaft für Klassifikation, Dortmund University. Springer, Berlin, 490–497.
BERNSTEIN, J. and PISONI, D. (1980): Unlimited Text-to-speech System: Description and Evaluation of a Microprocessor Based Device. Proceedings of the 5th International Conference on Acoustics, Speech, and Signal Processing. Denver, 576–579.
DÉJEAN, H. (1998): Morphemes as Necessary Concepts for Structures Discovery from Untagged Corpora. Workshop on Paradigms and Grounding in Natural Language Learning. Adelaide, 295–299.
GOLDSMITH, J. (2001): Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics, 27.2, 153–198.
HAFER, M. and WEISS, S. (1974): Word Segmentation by Letter Successor Varieties. Information Storage and Retrieval, 10, 371–385.
HARRIS, Z. (1955): From Phoneme to Morpheme. Language, 31, 190–222.
MARTINET, A. (1960): Eléments de linguistique générale. Librairie Armand Colin, Paris.
Automatic Extension of Feature-based Semantic Lexicons via Contextual Attributes
Chris Biemann¹ and Rainer Osswald²
¹ Institut für Informatik, Abteilung Automatische Sprachverarbeitung, Universität Leipzig, 04109 Leipzig, Germany
² Fachbereich Informatik, Lehrgebiet Intelligente Informations- und Kommunikationssysteme, FernUniversität in Hagen, 58084 Hagen, Germany
Abstract. We describe how a feature-based semantic lexicon can be automatically extended using large, unstructured text corpora. Experiments are carried out using the lexicon HaGenLex and the Wortschatz corpus. The semantic classes of nouns are determined via the adjectives that modify them. It turns out to be reasonable to combine several classifiers for single attributes into one for complex semantic classes. The method is evaluated thoroughly and possible improvements are discussed.
1
Introduction
Natural language processing systems for text retrieval and question answering that go beyond mere statistical pattern matching require the semantic analysis of large collections of text. In particular, such systems rely on a reasonably large computational lexicon that provides not only morphosyntactic but also semantic information about lexical units. While building a high quality semantic lexicon might presumably not be possible without manually created lexical entries, there is no doubt that, especially in the case of nouns, automatic classification methods have to be exploited for reasons of quantity and coverage. This paper describes how an automatic semantic classification using co-occurrence statistics on very large text corpora can successfully extend a manually created semantic lexicon.
2 Resources
2.1 The Computational Lexicon HaGenLex
The lexicon used for our experiments is the semantically based computational lexicon HaGenLex (Hartrumpf et al. 2003). HaGenLex is a domain independent lexicon for German that currently comprises about 25,000 lexical entries, roughly half of which are nouns. All HaGenLex entries are semantically annotated, where the semantic description is based on the MultiNet paradigm, a knowledge representation formalism developed for the representation of natural language semantics (Helbig 2001).
MultiNet provides classificatory as well as relational means of representation. The experiments reported here are restricted to the classification of nouns with respect to their ontological sort and semantic features. MultiNet defines a hierarchy of 45 ontological sorts like d (discrete object) and abs (situational object), of which 17 apply to nouns (cf. Figure 4). In addition, nouns are classified with respect to 16 binary semantic features like human and movable (cf. Figure 3). These features and sorts are not independent of each other; e.g., human+ implies animate+, artificial−, and discrete object. In order to exclude inconsistent choices, all possible combinations are explicitly combined into (complex) semantic classes, on which a natural specialization hierarchy is defined. In total, there are 50 semantic classes, of which the most frequent 22 in our training data are listed in Figure 5.
2.2
The German Corpus ‘Projekt Deutscher Wortschatz’
Our text resource is the German main corpus of the ‘Projekt Deutscher Wortschatz’ (35 million sentences, 500 million tokens).1 By calculating statistically significant neighboring co-occurrences (Biemann et al. 2004) and part-of-speech filtering, pairs of adjectives and nouns are determined that typically co-occur next to each other. If word B immediately follows word A in a corpus, then A is called the left neighbor of B and B the right neighbor of A. To determine pairs of statistically significant neighbors, a significance measure is applied that indicates the amount of “surprise” of seeing frequent co-occurrences of A and B under the assumption of independence – the larger the significance value, the smaller the probability that they co-occurred just by chance. If this measure exceeds a certain threshold, we call A a (left) neighboring co-occurrent of B and define the (left) neighboring profile of B as the set of all its (left) neighboring co-occurrents. Our method for classifying nouns is based on the Distributional Hypothesis (Harris 1968), which implies that semantic similarity is a function over global contexts (cf. Miller and Charles 1991). Concretely, we try to classify nouns by considering their modifying adjectives. The set of modifying adjectives for a given noun is here approximated by the statistical adjective profile of the noun, which is defined as the set of adjectives in the left neighboring profile of the noun. (Correspondingly, the noun profile of an adjective is the set of nouns in its right neighboring profile.) These profiles contain lemmatized words and consist of the union of the full form profiles. From our corpus we extracted over 160,000 nouns that co-occur with one or more of 23,400 adjectives (where half of the nouns have only one adjective in their profile). It has turned out that taking into account the actual significance values has no impact on the classification results; what is important is merely that adjective-noun pairs show up multiple times and typically in the corpus. 1
See www.wortschatz.uni-leipzig.de.
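For illustration, adjective profiles can be built along the following lines. The sketch uses a simple Poisson-style log-likelihood ratio as significance measure and an illustrative threshold; the measure actually used follows Biemann et al. (2004), and part-of-speech filtering is assumed to have happened beforehand.

import math
from collections import Counter

def loglikelihood(k_ab, k_a, k_b, n):
    """Significance of seeing A directly before B k_ab times among n bigrams."""
    expected = k_a * k_b / n
    if k_ab == 0 or expected == 0:
        return 0.0
    return 2.0 * (k_ab * math.log(k_ab / expected) - (k_ab - expected))

def adjective_profiles(bigrams, adjectives, nouns, threshold=6.63):
    """Map each noun to the set of adjectives that significantly precede it."""
    bigrams = list(bigrams)                 # (left_word, right_word) pairs from the corpus
    pair_freq = Counter(bigrams)
    left_freq = Counter(a for a, _ in bigrams)
    right_freq = Counter(b for _, b in bigrams)
    n = len(bigrams)
    profiles = {}
    for (a, b), k in pair_freq.items():
        if a in adjectives and b in nouns and \
                loglikelihood(k, left_freq[a], right_freq[b], n) >= threshold:
            profiles.setdefault(b, set()).add(a)
    return profiles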
3
Method
3.1
Constructing a Classifier for Single Attributes
For every relevant semantic attribute of nouns, a classifier is constructed in the following way: For every adjective that modifies at least one noun from the training set, a profile is calculated stating the proportion how often this adjective favors which class (class probabilities). The classifier is not limited in the number of classes. Unclassified nouns are then classified on the basis of their adjective profiles; this alternation between profile calculation and classification of new nouns is iterated in an EM-bootstrapping style (cf. Dempster et al. 1977) until no more nouns can be classified.

  Initialize adjective and noun profiles;
  Initialize the training set;
  As long as new nouns get classified:
      Calculate adjective class probabilities;
      For each unclassified noun n:
          Multiply class probabilities class-wise;
          Assign class with highest probability to noun n;

Fig. 1. Bootstrapping algorithm for assigning semantic attributes to nouns
Figure 1 gives an overview of the algorithm. In the outer loop, class probabilities are assigned to each adjective that indicate how often this adjective can be found in adjective profiles of nouns of the respective class, i.e., how strong this adjective votes for which class. The probability is calculated from the frequency distribution per class, divided by the total number of nouns per class and normalized in sum to one. Division by the total number of nouns per class is motivated by distributing the same probability mass for all classes and has turned out to be crucial when dealing with skewed class distributions. Because the number of classified nouns increases in every iteration step, the class probabilities per adjective have to be re-calculated in each iteration. Within the inner loop, the algorithm tries to assign classes to nouns that have not been classified in the previous steps: the class probabilities of the adjectives occurring in the respective adjective profile are multiplied classwise. Only adjectives occurring in at least one adjective profile of an already classified noun are taken into consideration. The class with the highest value is then assigned to the noun. To increase classificatory precision, one can introduce a threshold α for the minimal number of adjectives in the adjective profile of a noun. The experiments described in Section 4 make use of such a threshold.
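A compact Python rendering of the algorithm in Fig. 1 is sketched below; naming and details are ours, and the minimum number of adjectives α from Section 4 appears as min_adjectives.

from collections import defaultdict

def bootstrap_classify(adj_profiles, seed_labels, min_adjectives=5):
    """adj_profiles: noun -> set of adjectives; seed_labels: noun -> class (training set)."""
    labels = dict(seed_labels)
    classes = set(seed_labels.values())
    changed = True
    while changed:
        changed = False
        # 1. class probabilities per adjective, normalized by class size and then in sum to one
        class_size = defaultdict(int)
        for c in labels.values():
            class_size[c] += 1
        adj_class = defaultdict(lambda: defaultdict(float))
        for noun, c in labels.items():
            for adj in adj_profiles.get(noun, ()):
                adj_class[adj][c] += 1.0 / class_size[c]
        for dist in adj_class.values():
            total = sum(dist.values())
            for c in dist:
                dist[c] /= total
        # 2. classify still-unlabelled nouns by multiplying class probabilities class-wise
        for noun, adjs in adj_profiles.items():
            if noun in labels or len(adjs) < min_adjectives:
                continue
            known = [a for a in adjs if a in adj_class]   # only adjectives seen with classified nouns
            if not known:
                continue
            scores = {c: 1.0 for c in classes}
            for a in known:
                for c in classes:
                    scores[c] *= adj_class[a].get(c, 0.0)
            best = max(scores, key=scores.get)
            if scores[best] > 0.0:
                labels[noun] = best
                changed = True
    return labels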
3.2
Combining Attribute Classifications
The overall goal is to classify nouns with respect to the (complex) semantic classes introduced in Section 2.1. In principle, such a classifier could be constructed along the lines of Section 3.1. However, first experiments in that direction have led to a rather unsatisfying precision (tradeoff between 60% precision at 45% recall and 76% precision at only 2.8% recall). The method described here, in contrast, uses separate classifiers for each semantic feature and each ontological sort and combines their results as follows: (1) Determine all complex semantic classes that are compatible with all results of the individual classifiers. (2) From the results of (1) select those classes that are minimal with respect to the specialization relation on the set of complex semantic classes. (3) If the set determined in (2) contains exactly one element, then take this as the result class, otherwise refuse a classification. The classifier is weak in the sense that it does not always assign a class (which is already the case for the individual classifiers). The results presented in Section 4.2 are based on the combination method just described. In order to improve the recall, the following two modifications suggest themselves for future experiments: If the set determined in step (2) contains more than one element, select the most specialized semantic class that is more general than all elements in the set. If no class can be found by step (1) then ignore the results of the most unreliable single classifiers step-by-step until a compatible class is found and proceed with (2).
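A sketch of this combination step is given below; it assumes the class hierarchy is available as a mapping from each complex class to the attribute values it implies and to its more general classes (the data structures and names are hypothetical).

def combine(attribute_results, class_signatures, more_general):
    """attribute_results : dict attribute -> predicted value (e.g. {"human": "+", "sort": "d"})
    class_signatures  : dict class -> dict of attribute values implied by that class
    more_general      : dict class -> set of strictly more general classes
    Returns the single compatible, minimal class, or None (classification refused)."""
    # (1) classes compatible with every individual classifier decision
    compatible = [
        c for c, sig in class_signatures.items()
        if all(sig.get(att) in (None, val) for att, val in attribute_results.items())
    ]
    # (2) keep only classes that are minimal w.r.t. the specialization relation
    minimal = [
        c for c in compatible
        if not any(other != c and c in more_general[other] for other in compatible)
    ]
    # (3) accept only an unambiguous result
    return minimal[0] if len(minimal) == 1 else None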
4
Evaluation
For evaluation, we used 10-fold cross validation on a set of 6045 HaGenLex nouns in all experiments. In the preselection of the training set, care was taken to exclude polysemous nouns. The precision (number of correct classifications divided by the number of total classifications) was calculated on the basis of the union of the test sets, although in all experiments a much larger number of nouns could be classified. The threshold α for the minimum number of adjectives in the adjective profile of a noun was varied from 2 to 20, which led to different numbers of total classifications, as shown in Figure 2. For all further experiments, we (arbitrarily) fixed the minimum number of classifying adjectives to five, which led to a classification of over 31,000 nouns in all experiments. Since for only 5133 nouns from the HaGenLex training set more than four co-occurring adjectives could have been extracted from the corpus, the a priori upper bound on the recall (number of correctly classified items divided by the number of total items) is 84.9%. Section 4.1 discusses the results for the individual classifiers for semantic features and ontological sorts, Section 4.2 presents the results for the combined classifier for complex semantic classes.
Fig. 2. Minimal adjective number α vs. corpus coverage and classifier precision
4.1
Assignment of Semantic Features and Ontological Sorts
As mentioned in Section 3, a separate binary classifier was constructed for all 16 features. Figure 3 shows the distribution in the training data for the semantic features and the fraction of the smaller class (bias). It can be seen that the classifiers are able to assign the right features to the test nouns if their bias is not smaller than 0.05. In the other cases we observe a high total precision per feature (method, instit, mental, info, animal and geogr) which was more or less obtained by always assigning the more frequent attribute. The less frequent +-attribute is recognized poorly in these cases. The overall precision is 93.8% (87.6% for +-attributes), overall recall is 75.8% (76.9% for +-attributes). As for the ontological sorts, we constructed for each of the 17 sorts a binary training set that contains words where the sort is present (attribute +) or absent (attribute −). Nouns not specified with respect to the respective sort were excluded from the training set. Figure 4 shows a similar picture as Figure 3: sorts having a bias over 0.1 can be differentiated well or even very well, less frequent sorts lead to problems. Notice that for the sorts ab and o, the attribute − was taken into consideration in the diagram in Figure 4, because this was the less frequent attribute. Overall precision is 93.3% (90.35% for attribute +) at an overall recall of 79.2% (76.3% for attribute +). It is worthwhile to recall from Section 2.1 that neither the semantic features nor the ontological sorts are independent of each other. (The ontological sorts are even arranged in a tree hierarchy.) Ideally, the individual classifiers respect these dependencies, which is prerequisite for combining their results to (complex) semantic classes.
feature    #      +      −      bias
method     6004     12   5992   0.0020
instit     6032     39   5993   0.0065
mental     9008    162   8846   0.0180
info       6015    119   5896   0.0198
animal     5995    143   5852   0.0239
geogr      6015    188   5827   0.0313
thconc     6028    518   5510   0.0859
instru     5932    969   4963   0.1634
human      5995   1313   4682   0.2190
legper     6009   1352   4657   0.2250
animate    6010   1505   4505   0.2504
potag      6015   1664   4351   0.2766
artif      5864   2204   3660   0.3759
axial      5892   2260   3632   0.3836
movable    5827   2345   3482   0.4024
spatial    6033   2910   3123   0.4823

Fig. 3. Left: distribution of features in the training set; right: total precision and recall and precision and recall of +-attributes versus bias in training set
sort    #      +      −      bias
re      6033      7   6026   0.0012
mo      6033      8   6025   0.0013
oa      6033     39   5994   0.0065
o−      6033   5994     39   0.0065
me      6045     41   6004   0.0068
qn      6045     41   6004   0.0068
ta      6033    107   5926   0.0177
s       6010    224   5786   0.0373
as      6031    363   5668   0.0602
na      6033    411   5622   0.0681
at      6033    450   5583   0.0746
io      6033    664   5369   0.1101
ad      6031   1481   4550   0.2456
abs     6033   1846   4187   0.3060
d       6010   2663   3347   0.4431
co      6033   2910   3123   0.4823
ab−     6033   3082   2951   0.4891

Fig. 4. Left: Distribution of +/− attributes in training sets; right: precision and recall in total per sort and for attributes + versus bias in training data.
4.2
Assignment of Complex Semantic Classes
With respect to the task of extending the given semantic lexicon, the most important point of our approach is the quality of the assignment of complex semantic classes as described in Section 3.2. Figure 5 lists the cross-validation results for all complex semantic classes with at least 40 (≈ 0.68%) occurrences in the training set. For the remaining classes, which comprise about 5.9% of the training set, Figure 5 presents a collective evaluation (class “rest”). An obvious thing to notice is the fact that certain semantic classes are assigned with very good precision whereas others show a rather bad performance. A first conclusion could be that certain semantic properties of nouns
class                         #     prec     rec
nonment-dyn-abs-situation    1421   92.25   26.81
human-object                 1313   95.05   78.98
prot-theor-concept            516   59.05   12.02
nonoper-attribute             411    0.00    0.00
ax-mov-art-discrete           362   51.94   37.02
nonment-stat-abs-situation    226   48.39    6.64
animal-object                 143  100.00   16.08
nonmov-art-discrete           133   57.41   23.31
ment-stat-abs-situation       126   70.00    5.56
nonax-mov-art-discrete        108   40.82   18.52
tem-abstractum                107   97.06   30.84
mov-nonanimate-con-potag       98   73.21   41.84
art-con-geogr                  96   55.26   21.88
abs-info                       94   35.71   10.64
art-substance                  88   65.52   21.59
nat-discrete                   88  100.00   25.00
nat-substance                  86   64.29   10.47
prot-discrete                  73  100.00   53.42
nat-con-geogr                  63   80.00   19.05
prot-substance                 50   94.44   34.00
mov-art-discrete               45  100.00   31.11
meas-unit                      41  100.00    2.44
rest                          357   52.17   10.08
Fig. 5. Precision and recall for complex semantic classes
are reflected by modifying adjectives while others are not. Notice that the assignment of complex semantic classes does not show the same close correspondence between class size and precision that has been observed in the previous section on the classification by single attributes. The overall precision of the assignment of semantic classes is about 82.3% at a recall of 32.8%. The fairly low recall is due to the fact that the method of Section 3.2 refuses a classification in case the results of the single attribute classifiers are not fully consistent with each other. Despite this low recall, our approach gives us classification results for about 8500 unknown nouns. If we relax the minimal number α of co-occurring adjectives from five to two, the number of newly classified nouns rises even to almost 13,000, with a reduction of precision of only 0.2%.
5
Conclusion and Future Work
We have presented a method to automatically extend noun entries of semantic lexica via modifying adjectives. Given a moderate number of training items, the approach is able to classify a high number of previously unclassified nouns at more than 80% overall precision. An evaluation for the different semantic noun classes shows that certain semantic classes can be characterized by modifying adjectives while others can not. It would be interesting to see whether there is a similar distinction for other contextual constellations as, for instance, role filler positions in verb frames, but this requires much more preprocessing.
To improve the recall of our method, the combination of the single attribute classifiers as described in Section 3.2 could be relaxed by taking the quality of the classifiers into account. Another way to circumvent the sparse data problem is to abstract from single adjectives by means of semantic adjective classes like ‘physical property’; cf. (Biemann and Osswald 2005, Sect. 6.1). However, this would require a large scale classification of adjectives by appropriate semantic classes. A further important issue for the extension of the method is the treatment of polysemy: If a word has multiple readings that differ in at least one attribute, the method as proposed here classifies the word according to the most frequent reading in the corpus in the best case. In the worst case, the word will not get classified at all, because the adjectives seem to contradict each other in some attributes. A possibility to split an adjective profile into several profiles, which reflect the different readings, is shown in (Bordag 2003) for untyped co-occurrences and can be paraphrased for the task described here as follows: Presuming one reading per sentence, weak co-occurrence between the context words of the different readings, and strong co-occurrence within the context words of the same reading, the adjective profiles can be split in disjoint subsets that collect modifiers of different noun readings, respectively.
References
BIEMANN, C., BORDAG, S., HEYER, G., QUASTHOFF, U. and WOLFF, C. (2004): Language-independent Methods for Compiling Monolingual Lexical Data. In: Proceedings of CicLING 2004. LNCS 2945, Springer, Berlin, 215–228.
BIEMANN, C. and OSSWALD, R. (2005): Automatische Erweiterung eines semantikbasierten Lexikons durch Bootstrapping auf großen Korpora. In: B. Fisseni, H.-C. Schmitz, B. Schröder and P. Wagner (Eds.): Sprachtechnologie, mobile Kommunikation und linguistische Ressourcen – Beiträge zur GLDV-Tagung 2005 in Bonn. Peter Lang, Frankfurt am Main, 15–27.
BORDAG, S. (2003): Sentence Co-Occurrences as Small-World-Graphs: A Solution to Automatic Lexical Disambiguation. In: Proceedings of CicLING 2003. LNCS 2588, Springer, Berlin, 329–333.
DEMPSTER, A.P., LAIRD, N.M. and RUBIN, D.B. (1977): Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.
HARRIS, Z. (1968): Mathematical Structures of Language. John Wiley & Sons, New York.
HARTRUMPF, S., HELBIG, H. and OSSWALD, R. (2003): The Semantically Based Computer Lexicon HaGenLex – Structure and Technological Environment. Traitement automatique des langues, 44(2), 81–105.
HELBIG, H. (2001): Die semantische Struktur natürlicher Sprache: Wissensrepräsentation mit MultiNet. Springer, Berlin.
MILLER, G.A. and CHARLES, W.G. (1991): Contextual Correlates of Semantic Similarity. Language and Cognitive Processes, 6(1), 1–28.
Learning Ontologies to Improve Text Clustering and Classification
Stephan Bloehdorn¹, Philipp Cimiano¹, and Andreas Hotho²
¹ Institute AIFB, University of Karlsruhe, D–76128 Karlsruhe, Germany
² KDE Group, University of Kassel, D–34321 Kassel, Germany
Abstract. Recent work has shown improvements in text clustering and classification tasks by integrating conceptual features extracted from ontologies. In this paper we present text mining experiments in the medical domain in which the ontological structures used are acquired automatically in an unsupervised learning process from the text corpus in question. We compare results obtained using the automatically learned ontologies with those obtained using manually engineered ones. Our results show that both types of ontologies improve results on text clustering and classification tasks, whereby the automatically acquired ontologies yield an improvement competitive with the manually engineered ones.
1
Introduction
Text clustering and classification are two promising approaches to help users organize and contextualize textual information. Existing text mining systems typically use the bag–of–words model known from information retrieval (Salton and McGill (1983)), where single terms or term stems are used as features for representing the documents. Recent work has shown improvements in text mining tasks by means of conceptual features extracted from ontologies (Bloehdorn and Hotho (2004), Hotho et al. (2003)). So far, however, the ontological structures employed for this task are created manually by knowledge engineers and domain experts which requires a high initial modelling effort. Research on Ontology Learning (Maedche and Staab (2001)) has started to address this problem by developing methods for the automatic construction of conceptual structures out of large text corpora in an unsupervised process. Recent work in this area has led to improvements concerning the quality of automatically created taxonomies by using natural language processing, formal concept analysis and clustering (Cimiano et al. (2004), Cimiano et al. (2005)). In this paper we report on text mining experiments in which we use automatically constructed ontologies to augment the bag–of–words feature representations of medical texts. We compare results both (1) to the baseline given by the bag–of–words representation alone and (2) to results based on the MeSH Tree Structures as a manually engineered medical ontology. We show that both types of conceptual feature representations outperform
the Bag–of-Words model and that results based on the automatically constructed ontologies are highly competitive with those of the manually engineered MeSH Tree Structures. The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 describes our approach for automatically constructing ontology structures. Section 4 reviews the concept extraction strategies used to augment bag–of–words vectors. Section 5 finally reports on the results of the text classification and clustering experiments. We conclude in section 6.
2 Related Work
To date, the work on integrating background knowledge into text classification, text clustering or related tasks is quite heterogeneous. Green (1999) uses WordNet to construct chains of related synsets from the occurrence of terms for document representation and subsequent clustering. We have recently reported promising results when using additional conceptual features extracted from manually engineered ontologies in Bloehdorn and Hotho (2004) and in Hotho et al. (2003). Other results from similar settings are reported in Scott and Matwin (1999) and Wang et al. (2003). One of the earlier works on automatic taxonomy construction is reported in Hindle (1990), in which nouns are grouped into classes. Hearst's seminal work on using linguistic patterns also aimed at discovering taxonomic relations (Hearst (1992)). More recently, Reinberger and Spyns (2005) present an application of term clustering techniques in the biomedical domain. An overview of term clustering approaches for learning ontological structures as used in this paper is given in Cimiano et al. (2005). Alternative approaches for conceptual representations of text documents that do not require explicit manually engineered background knowledge are for example Latent Semantic Analysis (Deerwester et al. (1990)) or Probabilistic Latent Semantic Analysis (Cai and Hofmann (2003)). These approaches mainly draw from dimension reduction techniques, i.e. they compute concept-like structures statistically from term co-occurrence information. In contrast to our approach, these structures are, however, not easily interpretable by humans.
3 Ontology Learning as Term Clustering
In this paper we adopt the approach described in Cimiano et al. (2004) and Cimiano et al. (2005) to derive concept hierarchies from text using clustering techniques. In particular we adopt a vector-space model of the texts, but use syntactic dependencies as features of the terms (here we also refer to multi-word expressions if detected from the syntax alone) instead of relying only on word co-occurrence. The approach is based on the distributional hypothesis
(Harris (1968)), which claims that terms are semantically similar to the extent to which they share similar syntactic contexts. For this purpose, for each term in question we extract syntactic surface dependencies from the corpus. These surface dependencies are extracted by matching text snippets tagged with part–of–speech information against a library of patterns encoded as regular expressions. In the following we list the syntactic expressions we use and give examples of the features extracted from them, whereby a:b ++ means that the count for attribute b of instance a is incremented by 1:
adjective modifiers: alveolar macrophages
  macrophages: alveolar ++
prepositional phrase modifiers: a defect in cell function
  defect: in cell function ++, cell function: defect in ++
possessive modifiers: the dorsal artery's distal stump
  dorsal artery: has distal stump ++
noun phrases in subject or object position: the bacterium suppresses various lymphocyte functions
  bacterium: suppress subj ++, lymphocyte function: suppress obj ++
prepositional phrases following a verb: the revascularization occurs through the common penile artery
  penile artery: occurs through ++
copula constructs: the alveolar macrophage is a bacterium
  alveolar macrophage: is bacterium ++
verb phrases with the verb to have: the channel has a molecular mass of 105 kDa
  channel: has molecular mass ++
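To make the pattern-matching step more concrete, the following sketch (our own illustration under simplifying assumptions, not the authors' implementation) extracts two of the feature types above from part-of-speech-tagged text by matching regular expressions over token/tag sequences; the tag set, the two patterns and all identifiers are invented for the example.

```python
import re
from collections import defaultdict

# Toy POS-tagged input: (token, tag) pairs, e.g. from any off-the-shelf tagger.
tagged = [("alveolar", "JJ"), ("macrophages", "NNS"), ("suppress", "VBP"),
          ("a", "DT"), ("defect", "NN"), ("in", "IN"),
          ("cell", "NN"), ("function", "NN")]

# Encode the sentence as "token/TAG" units so that patterns can be written as
# regular expressions over tag sequences, as described in the text.
encoded = " ".join(f"{tok}/{tag}" for tok, tag in tagged)

# Two illustrative patterns: adjective modifier (JJ NN) and
# prepositional phrase modifier (NN IN NN ... NN).
ADJ_MOD = re.compile(r"(\S+)/JJ (\S+)/NNS?")
PP_MOD = re.compile(r"(\S+)/NNS? (\S+)/IN ((?:\S+/NNS? )*\S+)/NNS?")

counts = defaultdict(lambda: defaultdict(int))   # counts[instance][attribute]

for adj, noun in ADJ_MOD.findall(encoded):
    counts[noun][adj] += 1                        # macrophages: alveolar ++

for head, prep, rest in PP_MOD.findall(encoded):
    np = " ".join(w.split("/")[0] for w in rest.split())
    counts[head][f"{prep} {np}"] += 1             # defect: in cell function ++
    counts[np][f"{head} {prep}"] += 1             # cell function: defect in ++

print(dict(counts))
```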
On the basis of these vectors we calculate the similarity between two terms t1 and t2 as the cosine between their corresponding vectors:
$$\cos(\vec{t}_1, \vec{t}_2) = \frac{\vec{t}_1 \cdot \vec{t}_2}{\|\vec{t}_1\| \, \|\vec{t}_2\|}.$$
The concept hierarchy is built using hierarchical clustering techniques, in particular hierarchical agglomerative clustering (Jain et al. (1999)) and divisive Bi-Section KMeans (Steinbach et al. (2000)). While agglomerative clustering starts by merging single terms, each considered as one initial cluster, until only a single cluster remains, Bi-Section KMeans repeatedly splits the initial cluster of all terms into two until every term corresponds to a leaf cluster. The result is a concept hierarchy which we consider as a raw ontology. Due to the repeated binary merges and splits the hierarchy typically has a higher overall depth than manually constructed ones. For this reason we consider in our experiments a reasonably higher number of superconcepts than with manually engineered ontologies. More details of the ontology learning process can be found in Cimiano et al. (2004) and Cimiano et al. (2005).
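A minimal sketch of the clustering step, assuming the syntactic context counts are already collected in a term-by-feature matrix; it uses SciPy's average-link agglomerative clustering over cosine distances as a stand-in for the exact procedure, and the toy terms and counts are invented.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy term-by-feature matrix of syntactic context counts (rows = terms).
terms = ["macrophage", "lymphocyte", "artery", "vein"]
X = np.array([[3, 1, 0, 0],
              [2, 2, 0, 1],
              [0, 0, 4, 2],
              [0, 1, 3, 3]], dtype=float)

# Pairwise cosine distances (1 - cosine similarity) between term vectors.
D = pdist(X, metric="cosine")

# Average-link agglomerative clustering; the resulting dendrogram is read as a
# binary concept hierarchy with the terms at its leaves.
Z = linkage(D, method="average")

# Cutting the dendrogram at a distance threshold yields concept clusters;
# walking up the merge tree yields the superconcepts of each term.
labels = fcluster(Z, t=0.5, criterion="distance")
print(list(zip(terms, labels)))
```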
4 Conceptual Document Representations
In our approach, we exploit the background knowledge given by the ontologies to extend the bag–of–words feature vector with conceptual features on a
higher semantic level. In contrast to the simple term features, these conceptual features overcome a number of shortcomings of the bag–of–words representation by explicitly capturing multi–word expressions and by conceptually generalizing expressions through the concept hierarchy. In our approach we only consider concepts which are labelled by noun phrases. As a lot of additional information is still hidden in the standard bag–of–words model, we use a hybrid representation combining concepts and the conventional term stems.
Concept Annotation. We describe here the main aspects of the concept annotation steps; the interested reader is referred to the more detailed description in Bloehdorn and Hotho (2004). (1) Candidate Term Detection: due to the existence of multi-word expressions, the mapping of terms to the initial set of concepts cannot be accomplished directly by compiling concept vectors out of term vectors. We use a candidate term detection strategy that moves a window over the input text, analyzes the window content and either decreases the window size if unsuccessful or moves the window further if a valid expression is detected. (2) To avoid unnecessary queries to the ontology we analyze the part–of–speech patterns in the window and only consider noun phrases for further processing. (3) Morphological Transformations: typically the ontology will not contain all inflected forms of its entries. Therefore we use a fallback strategy that utilizes stem forms maintained in a separate index for the ontology, if the search for a specific inflected form is unsuccessful.2
Generalization. The generalization step consists in adding more general concepts to the specific concepts found in the text, thus leading to some kind of 'semantic smoothing'. The intuition behind this is that if a term like arrhythmia appears, the document should not only be represented by the concept [arrhythmia], but also by the concepts [heart disease] and [cardiovascular disease] etc., up to a certain level of generality. This increases the similarity with documents talking about some other specialization of [cardiovascular disease]. We realize this by compiling, for every concept, all superconcepts up to a maximal distance h into the concept representation. The result of this process is a "concept vector" that can be appended to the classical term vector representation. The resulting hybrid feature vectors can be fed into any standard clustering or classification algorithm.
2 Typically, the problem of disambiguating polysemous window content has to be addressed properly (Hotho et al. (2003)). The ontologies we report on in this paper contained only concepts that were unambiguously referred to by a single lexical entry, thus eliminating the need for word sense disambiguation strategies.
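The following sketch illustrates the candidate term detection and the superconcept generalization described above on a toy taxonomy; the data structures, the window handling and the depth parameter h are simplifications of the actual system (morphological fallback and part-of-speech filtering are omitted).

```python
# Toy taxonomy: concept -> direct superconcept (None for the root).
parent = {"arrhythmia": "heart disease",
          "heart disease": "cardiovascular disease",
          "cardiovascular disease": None}
lexicon = {"arrhythmia": "arrhythmia", "heart disease": "heart disease"}

def annotate(tokens, max_window=3):
    """Detect concept mentions with a shrinking window, as in step (1)."""
    concepts, i = [], 0
    while i < len(tokens):
        for w in range(min(max_window, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + w]).lower()
            if phrase in lexicon:        # ontology lookup (stem fallback omitted)
                concepts.append(lexicon[phrase])
                i += w
                break
        else:
            i += 1                       # no concept found, move the window on
    return concepts

def generalize(concepts, h=2):
    """Add superconcepts up to distance h ('semantic smoothing')."""
    expanded = list(concepts)
    for c in concepts:
        cur = c
        for _ in range(h):
            cur = parent.get(cur)
            if cur is None:
                break
            expanded.append(cur)
    return expanded

tokens = "the patient suffers from arrhythmia".split()
print(generalize(annotate(tokens)))
# ['arrhythmia', 'heart disease', 'cardiovascular disease']
# Counting these entries yields the concept vector appended to the term vector.
```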
5 Experiments
We have conducted extensive experiments using the OHSUMED text collection (Hersh et al. (1994)), which was also used for the TREC-9 filtering track (http://trec.nist.gov/data/t9_filtering.html).
It consists of titles and abstracts from medical journals indexed with multiple MeSH descriptors and a set of queries with associated relevance judgements.
Ontologies and Preprocessing Steps: In our experiments we used domain ontologies that were extracted automatically from the text corpus on the one hand and the Medical Subject Headings (MeSH) Tree Structures Ontology as a competing manually engineered ontology on the other. The automatically extracted ontologies were built according to the process described in section 3 using the 1987 portion of the collection, i.e. a total of 54,708 documents. The actual concept hierarchy was built using hierarchical agglomerative clustering or divisive Bi-Section KMeans. In overview, we performed experiments with the following configurations:
agglo-7000: automatically constructed ontology, linguistic contexts for the 7,000 most frequent terms4, taxonomy creation via agglomerative clustering;
bisec-7000: automatically constructed ontology, linguistic contexts for the 7,000 most frequent terms4, taxonomy creation via Bi-Section KMeans divisive clustering;
bisec-14000: automatically constructed ontology, linguistic contexts for the 14,000 most frequent terms, taxonomy creation via Bi-Section KMeans divisive clustering;
mesh: manually constructed ontology compiled out of the Medical Subject Headings (MeSH)5, containing more than 22,000 concepts enriched with synonymous and quasi-synonymous language expressions.
In all experiments, term stems6 were extracted as a first set of features from the documents. Conceptual features were extracted as a second set of features using the ontologies above and a window length of 3.
Text Classification Setting: For the experiments in the text classification setting, we also used the 1987 portion of the OHSUMED collection. Two thirds of the entries were randomly selected as training documents while the remainder was used as test set, resulting in a training corpus containing 36,369 documents and a test corpus containing 18,341 documents. The assigned MeSH terms were regarded as categories for the documents and binary classification was performed on the top 50 categories that contained the highest number of positive training documents. In all cases we used AdaBoost (Freund and Schapire (1995)) with 1000 iterations as classification algorithm and binary weighting for the feature vectors. As evaluation measures for text classification we report classification error, precision, recall, F1-measure and breakeven point7.
4 More accurately, we used the intersection of the 10,000 most frequent terms with the terms present in the MeSH Thesaurus, resulting in approx. 7,000 distinct terms here.
5 The controlled vocabulary thesaurus of the United States National Library of Medicine (NLM), http://www.nlm.nih.gov/mesh/
6 In these experiments, term stem extraction comprises the removal of the standard stopwords for English defined in the SMART stopword list and stemming using the Porter stemming algorithm.
macro-averaged (in %)
Ontology     Configuration         Error  Prec   Rec    F1     BEP
[none]       term                  00.53  52.60  35.74  42.56  45.68
agglo-7000   term & concept.sc10   00.53  52.48  36.52  43.07  46.30
agglo-7000   term & concept.sc15   00.53  52.57  36.31  42.95  46.46
agglo-7000   term & concept.sc20   00.53  52.49  36.44  43.02  46.41
bisec-7000   term & concept.sc10   00.52  53.39  36.79  43.56  46.92
bisec-7000   term & concept.sc15   00.52  54.36  37.32  44.26  47.31
bisec-7000   term & concept.sc20   00.52  55.12  36.87  43.86  47.25
bisec-14000  term & concept.sc10   00.53  51.92  36.12  42.60  45.35
bisec-14000  term & concept.sc15   00.53  52.17  36.86  43.20  45.74
bisec-14000  term & concept.sc20   00.52  53.37  36.85  43.60  45.96
mesh         term & concept        00.52  53.65  37.56  44.19  47.31
mesh         term & concept.sc5    00.52  52.72  37.57  43.87  47.16

micro-averaged (in %)
Ontology     Configuration         Error  Prec   Rec    F1     BEP
[none]       term                  00.53  55.77  36.25  43.94  46.17
agglo-7000   term & concept.sc10   00.53  55.83  36.86  44.41  46.84
agglo-7000   term & concept.sc15   00.53  55.95  36.67  44.30  46.99
agglo-7000   term & concept.sc20   00.53  55.76  36.79  44.33  46.97
bisec-7000   term & concept.sc10   00.52  56.59  37.25  44.92  47.49
bisec-7000   term & concept.sc15   00.52  57.24  37.71  45.46  47.76
bisec-7000   term & concept.sc20   00.52  57.18  37.21  45.08  47.68
bisec-14000  term & concept.sc10   00.53  54.88  36.52  43.85  45.86
bisec-14000  term & concept.sc15   00.53  55.27  37.27  44.52  46.27
bisec-14000  term & concept.sc20   00.52  56.39  37.27  44.87  46.44
mesh         term & concept        00.52  56.81  37.84  45.43  47.78
mesh         term & concept.sc5    00.52  55.94  37.94  45.21  47.63

Table 1. Performance Results in the Classification Setting.
Table 1 summarizes some of the classification results. In all cases, the integration of conceptual features improved the results, in most cases at a significant level. The best results for the learned ontologies could be achieved with the bisec-7000 ontology and a superconcept integration depth of 15, resulting in 44.26% macro-averaged F1, which is comparable to the results for the MeSH ontology.
Text Clustering Setting: For the clustering experiments we first compiled a corpus which contains only one label per document. We used the 106 queries provided with the OHSUMED collection and regarded every answer set of a query as a cluster. We extracted all documents which occur in the answer set of only one query. This results in a dataset with 4389 documents and 106 labels (clusters). Evaluation measures for text clustering are entropy, purity, inverse purity, and F1-measure7. Table 2 presents the results of the text clustering task, averaged over 20 repeated clusterings with random initialization. With respect to macro-averaging, the integration of conceptual features always improves results, and it also does so in most cases with respect to micro-averaging. Best macro-averaged results were achieved for the bisec-14000 ontology with 20 superconcepts.
7 For a review of evaluation measures refer to Sebastiani (2002) in the text classification setting and to Hotho et al. (2003) in the text clustering setting.
macro-averaged (in %)
Ontology     Configuration         Entropy  F1      Inv. Purity  Purity
[none]       terms                 2,6674   19,41%  17,22%       22,24%
agglo-7000   term & concept.sc1    2,6326   19,47%  17,68%       21,65%
agglo-7000   term & concept.sc10   2,5808   19,93%  17,55%       23,04%
agglo-7000   term & concept.sc20   2,5828   19,88%  17,69%       22,70%
bisec-7000   term & concept.sc1    2,5896   19,84%  17,72%       22,53%
bisec-7000   term & concept.sc10   2,5361   20,17%  17,38%       24,02%
bisec-7000   term & concept.sc20   2,5321   20,01%  17,38%       23,59%
bisec-14000  term & concept.sc1    2,5706   19,96%  17,76%       22,80%
bisec-14000  term & concept.sc10   2,4382   21,11%  17,68%       26,18%
bisec-14000  term & concept.sc20   2,4557   20,77%  17,46%       25,67%
mesh         term & concept.sc1    2,4135   21,63%  17,70%       27,78%
mesh         term & concept.sc10   2,3880   21,93%  17,64%       28,98%

micro-averaged (in %)
Ontology     Configuration         Entropy  F1      Inv. Purity  Purity
[none]       terms                 3,12108  14,89%  14,12%       15,74%
agglo-7000   term & concept.sc1    3,1102   15,34%  14,56%       16,21%
agglo-7000   term & concept.sc10   3,1374   15,21%  14,43%       16,08%
agglo-7000   term & concept.sc20   3,1325   15,27%  14,62%       15,97%
bisec-7000   term & concept.sc1    3,1299   15,48%  14,84%       16,18%
bisec-7000   term & concept.sc10   3,1533   15,18%  14,46%       15,98%
bisec-7000   term & concept.sc20   3,1734   14,83%  14,23%       15,48%
bisec-14000  term & concept.sc1    3,1479   15,19%  14,63%       15,80%
bisec-14000  term & concept.sc10   3,1972   14,83%  14,33%       15,37%
bisec-14000  term & concept.sc20   3,2019   14,67%  14,07%       15,36%
mesh         term & concept.sc1    3,2123   14,92%  14,91%       14,93%
mesh         term & concept.sc10   3,2361   14,61%  14,64%       14,59%

Table 2. Performance Results in the Clustering Setting.
This result is competitive with the one we obtained with the mesh ontology. Surprisingly, the best micro-averaged results could be found for the strategy adding a single superconcept only.
6 Conclusion
The contribution of this paper is twofold. We presented a novel approach for integrating higher-level semantics into the document representation for text mining tasks in a fully unsupervised manner that significantly improves results. In contrast to other approaches, the discovered conceptual structures are easily understandable while not based on manually engineered resources. On the other hand, we see our approach as a new way of evaluating learned ontologies in the context of a given text clustering or classification application. Further work is directed towards improving the automatically learned ontologies on the one hand. On the other, it will aim at a tighter integration of the conceptual knowledge, including the exploration of more fine-grained and unparameterized generalization strategies.
Acknowledgements. This research was partially supported by the European Commission under contract IST-2003-506826 SEKT (http://www.sekt-project.com) and by the German Federal Ministry of Education, Science, Research and Technology (BMBF) in the project SmartWeb (http://smartweb.dfki.de).
References
BLOEHDORN, S. and HOTHO, A. (2004): Text Classification by Boosting Weak Learners based on Terms and Concepts. In: Proceedings of ICDM 2004. IEEE Computer Society.
CAI, L. and HOFMANN, T. (2003): Text Categorization by Boosting Automatically Extracted Concepts. In: Proceedings of ACM SIGIR 2003. ACM Press.
CIMIANO, P.; HOTHO, A. and STAAB, S. (2004): Comparing Conceptual, Partitional and Agglomerative Clustering for Learning Taxonomies from Text. In: Proceedings of ECAI'04. IOS Press.
CIMIANO, P.; HOTHO, A. and STAAB, S. (2005): Learning Concept Hierarchies from Text Corpora using Formal Concept Analysis. Journal of Artificial Intelligence Research. To appear.
DEERWESTER, S.; DUMAIS, S.T.; LANDAUER, T.K.; FURNAS, G.W. and HARSHMAN, R.A. (1990): Indexing by Latent Semantic Analysis. Journal of the Society for Information Science, 41, 391–407.
FREUND, Y. and SCHAPIRE, R.E. (1995): A Decision Theoretic Generalization of On-Line Learning and an Application to Boosting. In: Second European Conference on Computational Learning Theory (EuroCOLT-95).
GREEN, S.J. (1999): Building Hypertext Links By Computing Semantic Similarity. IEEE Transactions on Knowledge and Data Engineering, 11, 713–730.
HARRIS, Z. (1968): Mathematical Structures of Language. Wiley, New York, US.
HEARST, M.A. (1992): Automatic Acquisition of Hyponyms from Large Text Corpora. In: Proceedings of the 14th International Conference on Computational Linguistics (COLING).
HERSH, W.R.; BUCKLEY, C.; LEONE, T.J. and HICKAM, D.H. (1994): OHSUMED: An Interactive Retrieval Evaluation and New Large Test Collection for Research. In: Proceedings of ACM SIGIR 1994. ACM Press.
HINDLE, D. (1990): Noun Classification from Predicate-Argument Structures. In: Proceedings of the Annual Meeting of the ACL.
HOTHO, A.; STAAB, S. and STUMME, G. (2003): Ontologies Improve Text Document Clustering. In: Proceedings of ICDM 2003. IEEE Computer Society.
JAIN, A.K., MURTY, M.N. and FLYNN, P.J. (1999): Data Clustering: A Review. ACM Computing Surveys, 31, 264–323.
MAEDCHE, A. and STAAB, S. (2001): Ontology Learning for the Semantic Web. IEEE Intelligent Systems, 16, 72–79.
REINBERGER, M.-L. and SPYNS, P. (2005): Unsupervised Text Mining for the Learning of DOGMA-inspired Ontologies. In: Ontology Learning from Text: Methods, Evaluation and Applications. IOS Press. To appear.
SALTON, G. and MCGILL, M.J. (1983): Introduction to Modern Information Retrieval. McGraw-Hill, New York, NY, US.
SCOTT, S. and MATWIN, S. (1999): Feature Engineering for Text Classification. In: Proceedings of ICML 1999. Morgan Kaufmann, 379–388.
SEBASTIANI, F. (2002): Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34, 1–47.
STEINBACH, M., KARYPIS, G. and KUMAR, V. (2000): A Comparison of Document Clustering Techniques. In: KDD Workshop on Text Mining 2000.
WANG, B.; MCKAY, R.I.; ABBASS, H.A. and BARLOW, M. (2003): A Comparative Study for Domain Ontology Guided Feature Extraction. In: Proceedings of ACSC-2003. Australian Computer Society.
Discovering Communities in Linked Data by Multi-view Clustering
Isabel Drost, Steffen Bickel, and Tobias Scheffer
Humboldt-Universität zu Berlin, Institut für Informatik, Unter den Linden 6, 10099 Berlin, Germany
{drost, bickel, scheffer}@informatik.hu-berlin.de
Abstract. We consider the problem of finding communities in large linked networks such as web structures or citation networks. We review similarity measures for linked objects and discuss the k-Means and EM algorithms, based on text similarity, bibliographic coupling, and co-citation strength. We study the utilization of the principle of multi-view learning to combine these similarity measures. We explore the clustering algorithms experimentally using web pages and the CiteSeer repository of research papers and find that multi-view clustering effectively combines link-based and intrinsic similarity.
1 Introduction
Citation analysis was originally carried out manually (Garfield, 1972), but many discovery tasks in this problem area can be automated. Finding communities in linked networks is a sub-problem of citation analysis. The task here is to find clusters of thematically related papers or web pages (White & McCain, 1989, Kautz et al., 1997, Getoor, 2003) where objects within clusters are similar and dissimilar between clusters. When clustering publications or web pages it seems appropriate to make use of the similarity of their textual content. Yet the inbound and outbound links can also be used to define the similarity of two documents. The k-means algorithm has already been applied to citation analysis (Hopcroft et al., 2003). The EM algorithm (Dempster et al., 1977) and the recently developed multi-view clustering method (Bickel & Scheffer, 2004) appear to be suitable as well. But it is not clear how these approaches differ in terms of cluster quality. We discuss how partitioning cluster algorithms can be applied to linked data. We review vector space representations of linked documents and their correspondence to the bibliographic coupling and co-citation similarity measures. We study appropriate distributional models that can be used to instantiate EM. When different measures of similarity are at hand, the natural question is whether algorithms can use a combination of them. We develop an undirected graph model and use multi-view clustering algorithms. A comparative analysis of the resulting clustering methods leads us to results on their cluster quality. We obtain results on the benefit of the co-citation, bibliographic
coupling, the undirected, and the multi-view model. Additionally, we compare link-based clustering to clustering based on the textual content of papers or web pages. The rest of the paper is organized as follows. Section 2 reviews related work; Section 3 describes the problem setting. In Section 4, we discuss clustering algorithms and their application for citation analysis. Section 5 presents empirical results, and Section 6 concludes.
2 Related Work
Citation analysis dates back to Garfield (1972) who proposed the impact factor as a performance measure for journals. White and McCain (1989) coined the term bibliometrics for automated analysis of citation data. Bibliometrics focuses on two graphs: the co-citation graph (White & McCain, 1989) relates papers by the proportion of jointly cited work. The collaboration graph (White, 2003), by contrast, relates papers by jointly authored research papers (the mathematician Pál Erdős is believed to be the node with highest degree, having more than 500 co-authors). It is known that many properties (such as the degree of the nodes) of naturally grown graphs, such as citation or social networks, follow power laws (Redner, 1998). This distinguishes them from random graphs (Liljeros et al., 2001, Alberich et al., 2002). Small-world properties are typical for such compounds (Watts & Strogatz, 1998). In this respect, the web exhibits the same properties as a citation network and the same algorithms can be applied to analyze its cluster structure (Gibson et al., 1998, Getoor, 2003). The problem of clustering web search results has been addressed using modified versions of k-means (Modha & Spangler, 2000, Wang & Kitsuregawa, 2001) as well as a spectral clustering algorithm (He et al., 2001); here, the instances are represented using a combination of document content, inbound, and outbound links. The multi-view EM and multi-view k-means clustering methods can be applied when each instance has a representation in two distinct vector spaces. In our problem area, those spaces can be inbound links, outbound links, and text content. Multi-view clustering appears interesting for citation analysis because, if this requirement is met, then it often outperforms the regular EM substantially (Bickel & Scheffer, 2004).
3 Problem Setting
We consider the problem of clustering linked objects. More precisely, we assume that each document has an unknown “true” class membership. This true class label is not visible to the clustering algorithm, but we use the labels to evaluate the quality of the resulting clusters as the homogeneity of true class memberships within the returned clusters. The homogeneity measure is the entropy of the true classes within the generated clusters (Equation 1). C
is a partitioning of the instances X into clusters c_i, and L is the (manual) partitioning into true classes l_j. Hence, p(l_j|c_i) is the fraction of instances in c_i that have true class label l_j. Intuitively, the entropy is the average number of bits needed to encode the true class label of an instance, given its cluster membership. Since the true class memberships are not visible, no algorithm can directly optimize this criterion.
$$E_{C,L} = \sum_{c_i \in C} \frac{|c_i|}{|X|} \left( - \sum_{l_j \in L} p(l_j \mid c_i) \log p(l_j \mid c_i) \right) \qquad (1)$$
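As a small illustration of Equation 1, the sketch below computes the average entropy of the true class labels within generated clusters; the toy cluster assignment and labels are invented.

```python
import math
from collections import Counter, defaultdict

def cluster_entropy(clusters, labels):
    """Average entropy of true labels within clusters (Equation 1), in bits."""
    members = defaultdict(list)
    for c, l in zip(clusters, labels):
        members[c].append(l)
    n = len(labels)
    total = 0.0
    for group in members.values():
        freq = Counter(group)
        h = -sum((f / len(group)) * math.log2(f / len(group)) for f in freq.values())
        total += (len(group) / n) * h
    return total

# Toy example: two clusters, three (hidden) true classes.
print(cluster_entropy([0, 0, 0, 1, 1], ["a", "a", "b", "c", "c"]))
```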
The k-means and EM algorithms require instances to be represented in a vector space. Let V = {1, ..., n} be a universe of documents of which we wish to cluster a subset X ⊆ V. Let E ⊆ V × V be the citation graph; (x_j, x_k) ∈ E if x_j cites x_k. For every x_j ∈ X, we define a vector x_j^in of inbound links: x_jk^in = 1 if document x_j is cited by x_k, and 0 otherwise. The outbound vector x_j^out is defined analogously: x_jk^out = 1 if x_j cites x_k. In addition, we consider the intrinsic, text-based representation x_j^txt. In the context of k-means, x_j^txt is a normalized tfidf vector; in the context of multinomial EM, it is a vector that counts, for every word in the dictionary, the number of occurrences in document x_j.
Let us review common concepts of similarity for linked documents. Intuitively, the bibliographic coupling measures the number of common citations in two papers whereas the co-citation is a measure of how frequently two papers are being cited together. That is, the bibliographic coupling of two papers is the correspondence of their sets of documents connected by outbound links whereas the co-citation strength of two papers equals the similarity of their sets of documents connected by inbound links. The general EM algorithm is instantiated with a model-specific likelihood function. Based on the bibliographic coupling, this likelihood has to quantify how well the vector of outbound links x_j^out of a document x_j corresponds to some cluster; based on co-citation, the vector of inbound links x_j^in has to be considered. The k-means algorithm requires a similarity measure. A natural similarity function based on the bibliographic coupling is the cosine between two vectors of outbound links,
$$bc(x_j, x_k) = \frac{\langle x_j^{out}, x_k^{out} \rangle}{\|x_j^{out}\| \, \|x_k^{out}\|};$$
the co-citation similarity cc(x_j, x_k) is defined as the cosine similarity of x_j^in and x_k^in. In the textual view, text similarity ts(x_j, x_k) can naturally be calculated as the cosine between document vectors x_j^txt and x_k^txt. In addition to the concepts of co-citation and bibliographic coupling, we will also study an undirected model, x_j^undir = x_j^in + x_j^out.
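These similarity measures can be computed directly from the citation adjacency matrix; the sketch below (an illustration on an invented toy graph, not the authors' code) derives the bibliographic coupling, co-citation, and undirected cosine similarities.

```python
import numpy as np

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0 or nb == 0 else float(a @ b) / (na * nb)

# Toy citation graph over 5 documents: A[j, k] = 1 iff document j cites document k.
A = np.array([[0, 1, 1, 0, 0],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 0, 0]])

out_vec = A          # row j = outbound links of document j
in_vec = A.T         # row j = inbound links of document j (who cites j)
undir = A + A.T      # undirected model

j, k = 0, 1
print("bibliographic coupling:", cosine(out_vec[j], out_vec[k]))  # shared references
print("co-citation:          ", cosine(in_vec[j], in_vec[k]))     # cited together
print("undirected:           ", cosine(undir[j], undir[k]))
```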
4 Clustering Algorithms for Citation Analysis
In this section, we discuss how k-means and EM clustering can be applied to citation analysis.
4.1 Clustering by k-Means
The well-known k-means algorithm starts with k random mean vectors and then, in turn, assigns each instance to the cluster with the nearest mean vector and re-calculates the means by averaging over the assigned instances, as long as there is a change in the cluster assignments.
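A compact sketch of this procedure with cosine similarity, which matches the similarity functions used here for link vectors; the initialization, the convergence test and all names are our own simplifications, not the implementation used in the experiments.

```python
import numpy as np

def kmeans_cosine(X, k, iters=50, seed=0):
    """k-means on the row vectors of X with cosine similarity (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    means = Xn[rng.choice(len(Xn), size=k, replace=False)].copy()
    assign = None
    for _ in range(iters):
        new_assign = (Xn @ means.T).argmax(axis=1)   # nearest mean by cosine
        if assign is not None and np.array_equal(new_assign, assign):
            break                                    # no change in the assignments
        assign = new_assign
        for i in range(k):                           # recompute (normalized) means
            members = Xn[assign == i]
            if len(members):
                m = members.mean(axis=0)
                means[i] = m / max(np.linalg.norm(m), 1e-12)
    return assign

# Example: cluster documents by their outbound-link vectors (rows of a 0/1 matrix).
# labels = kmeans_cosine(adjacency_matrix, k=2)
```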
4.2 EM for Citation Analysis
The Expectation Maximization algorithm (Dempster et al., 1977) can be used for maximum likelihood estimation of mixture model parameters. Applied to citation analysis, the mixture components are the clusters of related papers that we wish to identify. We get cluster assignments from the estimated mixture model by assigning each instance x_j to the cluster of highest a posteriori probability argmax_i P(c_i|x_j).
We introduce the multinomial citation model for clustering linked data. In this model, a paper has a certain number n of links, where n is a random variable governed by P(n). Each of these n links is a random variable that can take |V| distinct values; it is governed by a cluster-specific distribution θ_i(x_k). References are drawn without replacement as there can be at most one link between each pair of papers. The distribution of n random variables with |V| values, drawn without replacement, is governed by the multi-hypergeometric distribution, the generalization of the hypergeometric distribution for non-binary variables. Unfortunately, it is computationally infeasible because the calculation of probabilities requires summation over a huge trellis and even a lookup table is impractically large. Since the number of links in a paper is much smaller than the number of papers in V, it can be approximated by the multinomial distribution. This corresponds to drawing citations with replacement. The likelihood in the multinomial citation model is given in Equation 2. The "n!" term reflects that there are n! ways of drawing any given set of n citations in distinct orderings.
$$P_\Theta(x_j \mid c_i) = P(n)\, n! \prod_{x_k \in V} \theta_i(x_k)^{x_{jk}} \qquad (2)$$
Again, x_j = x_j^in for co-citation and x_j = x_j^out for bibliographic coupling. The E and M steps for the multinomial model are given in Equations 3, 5, and 6 (posterior and maximum likelihood estimator for the multinomial distribution are well-known). As we see in Equation 4, it is not necessary to know P(n) if only the posterior P_Θ(c_i|x_j) is of interest. We can apply Laplace smoothing by adding one to all frequency counts.
E step:
$$P_\Theta(c_i \mid x_j) = \frac{\pi_i P_\Theta(x_j \mid c_i)}{\sum_k \pi_k P_\Theta(x_j \mid c_k)} = \frac{\pi_i \prod_{x_l \in V} P(n)\, n!\, \theta_i(x_l)^{x_{jl}}}{\sum_k \pi_k \prod_{x_l \in V} P(n)\, n!\, \theta_k(x_l)^{x_{jl}}} \qquad (3)$$
$$= \frac{\pi_i \prod_{x_l \in V} \theta_i(x_l)^{x_{jl}}}{\sum_k \pi_k \prod_{x_l \in V} \theta_k(x_l)^{x_{jl}}} \qquad (4)$$
M step:
$$\theta_i(x_k) = \frac{\sum_{x_l \in X} x_{lk}\, P(c_i \mid x_l, \Theta)}{\sum_{j \in V} \sum_{x_l \in X} x_{lj}\, P(c_i \mid x_l, \Theta)} \qquad (5)$$
$$\pi_i = \frac{1}{|X|} \sum_{x_k \in X} P_\Theta(c_i \mid x_k) \qquad (6)$$
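The sketch below implements the E step of Equation 4 in log space and the M steps of Equations 5 and 6 with Laplace smoothing, for count vectors such as x^out or x^in; it is an illustrative reimplementation under simplifying assumptions, not the original code.

```python
import numpy as np

def multinomial_em(X, k, iters=50, seed=0):
    """EM for a mixture of multinomials over count vectors X (rows = documents).
    E step follows Eq. (4) in log space; M steps follow Eqs. (5) and (6),
    with Laplace smoothing on the expected frequency counts."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)
    theta = rng.dirichlet(np.ones(d), size=k)       # cluster-specific link distributions
    for _ in range(iters):
        # E step: posterior proportional to pi_i * prod_l theta_i(l)^x_jl
        log_post = np.log(pi) + X @ np.log(theta).T  # shape (n, k)
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M step: theta from smoothed expected counts, pi from the mean posterior
        counts = post.T @ X + 1.0                    # Laplace smoothing
        theta = counts / counts.sum(axis=1, keepdims=True)
        pi = post.mean(axis=0)
    return post.argmax(axis=1), theta, pi

# Example: X[j, l] = 1 iff document j cites document l (for co-citation, use the transpose).
```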
The multinomial distribution is also frequently used as a model for text. In the multinomial text model, words are drawn with replacement according to a cluster-specific distribution θ_i(x_k). The likelihood of a document x_j = x_j^txt in cluster c_i can be characterized analogously to Equation 2; the E and M steps for the multinomial text model follow Equations 4 and 6, respectively (with x = x^txt).
4.3 Combining Text Similarity, Co-Citation, and Bibliographic Coupling
The methods that we studied so far can be applied using text similarity, co-citation, or bibliographic coupling as similarity metric. It is natural to ask for the most effective way of combining these measures. A baseline for the combination of inbound and outbound links that we consider is the undirected model (Section 3) in which inbound and outbound links are treated alike.
We study the multi-view clustering model (Bickel & Scheffer, 2004). Multi-view clustering can be applied when instances are represented in two distinct (ideally independent) views. Here, distinct views naturally are x^in, x^out, and x^txt. Two interleaving EM algorithms then learn the parameters of distinct models, each model clustering the data in one of the views. The parameters are estimated such that they maximize the likelihood plus an additional term that quantifies the consensus between the two models. This approach is motivated by a result of Dasgupta et al. (2002) who show that the probability of a disagreement of two independent hypotheses is an upper bound on the probability of an error of either hypothesis. Table 1 briefly summarizes the multi-view clustering algorithm (Bickel & Scheffer, 2004). In our experiments, we study multi-view k-means and multi-view EM with multinomials.
The multi-view clustering algorithm returns two parameter sets Θ^(1) and Θ^(2) and two clustering hypotheses, one in each view. A unified cluster assignment can be obtained by using the argmax of a combined posterior, applying Bayes' rule and a conditional independence assumption (Equation 7). Equation 7 needs the definition of a combined prior π_i; we use π_i = (π_i^(1) + π_i^(2))/2.
$$P_\Theta(c_i \mid x_j) = \frac{\pi_i P_\Theta(x_j \mid c_i)}{\sum_k \pi_k P_\Theta(x_j \mid c_k)} = \frac{\pi_i\, P_{\Theta^{(1)}}(x_j^{(1)} \mid c_i)\, P_{\Theta^{(2)}}(x_j^{(2)} \mid c_i)}{\sum_k \pi_k\, P_{\Theta^{(1)}}(x_j^{(1)} \mid c_k)\, P_{\Theta^{(2)}}(x_j^{(2)} \mid c_k)} \qquad (7)$$
In the multi-view k-means algorithm, we assign an example x_j to the cluster with
$$\arg\max_i \; \frac{\langle x_j^{(1)}, m_i^{(1)} \rangle}{\|x_j^{(1)}\| \, \|m_i^{(1)}\|} \cdot \frac{\langle x_j^{(2)}, m_i^{(2)} \rangle}{\|x_j^{(2)}\| \, \|m_i^{(2)}\|},$$
where m_i^(1) and m_i^(2) are the mean vectors of the i-th cluster in the respective view.
Input: instances {(x_1^(1), x_1^(2)), ..., (x_m^(1), x_m^(2))}.
1. Randomly initialize parameters Θ^(2) in view (2).
2. E step in view (2): compute posterior P(c_i | x_j^(2), Θ^(2)) of cluster membership given the model parameters in view (2).
3. Until convergence:
   (a) For v ∈ {(1), (2)}:
       i. M step in view v: find model parameters Θ^v that maximize the likelihood given the posterior P(c_i | x_j^v̄, Θ^v̄) computed in the last step (v̄ denotes the other view).
       ii. E step in view v: compute posterior P(c_i | x_j^v, Θ^v) of cluster membership given the model parameters in the current view.
   (b) End For.
4. Return combined model Θ = Θ^(1) ∪ Θ^(2).
Table 1. Multi-view Clustering.
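A skeleton of the interleaved procedure of Table 1 on top of the multinomial model sketched above; for simplicity it initializes the posterior in view (2) directly instead of the parameters Θ^(2), runs a fixed number of iterations instead of a convergence test, and directly returns the combined assignment of Equation 7. It is an illustration, not the implementation used in the experiments.

```python
import numpy as np

def multiview_multinomial_em(X1, X2, k, iters=20, seed=0):
    """Interleaved two-view EM over count matrices X1, X2 (rows = documents)."""
    rng = np.random.default_rng(seed)

    def m_step(X, post):
        counts = post.T @ X + 1.0                  # Laplace smoothing
        theta = counts / counts.sum(axis=1, keepdims=True)
        return post.mean(axis=0), theta            # (pi, theta)

    def e_step(X, pi, theta):
        log_post = np.log(pi) + X @ np.log(theta).T
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)

    # Steps 1-2 (simplified): start from a random posterior in view (2).
    n = X1.shape[0]
    post2 = rng.dirichlet(np.ones(k), size=n)
    params = {}
    for _ in range(iters):                         # Step 3: alternate the views
        params[1] = m_step(X1, post2)              # M step in view 1 uses posterior of view 2
        post1 = e_step(X1, *params[1])
        params[2] = m_step(X2, post1)              # M step in view 2 uses posterior of view 1
        post2 = e_step(X2, *params[2])

    # Combined assignment (Equation 7) with pi = average of the two view priors.
    pi = 0.5 * (params[1][0] + params[2][0])
    log_comb = (np.log(pi) + X1 @ np.log(params[1][1]).T
                           + X2 @ np.log(params[2][1]).T)
    return log_comb.argmax(axis=1)
```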
5 Comparative Analysis
In this section, we will investigate the relative benefit of the different algorithms and representations in terms of cluster quality and regarding different applications (scientific publications or web pages). In order to measure the cluster quality as the average entropy (Equation 1) we use manually defined labels that are hidden from the clustering algorithms. For our experiments we use the CiteSeer data set (3,312 scientific publications, six classes) (Lu & Getoor, 2003) and the well-known WebKB collection (8,318 university web pages, six classes).
Let us first study how the different clustering methods compare in terms of cluster quality for purely link-based representations. Fig. 1 shows the cluster quality averaged over ten runs of multinomial EM and k-means for both data sets. Error bars indicate the standard error (in most cases imperceptibly small). The multinomial model fits the CiteSeer data best. Simple k-means clustering gives the best performance for WebKB. For this problem, the inbound links (co-citation) contain the most relevant information and lead to the best results. For the CiteSeer data, the undirected model works best.
In Figure 2 we want to answer the question whether the usage of textual content has a positive impact on cluster quality. For CiteSeer, we combine outbound link information and text because outbound links lead to better clustering results; for WebKB we combine inbound link information and text for the same reason. For CiteSeer, combining textual content and link information by multi-view EM works better than each of the single approaches. For the WebKB data, combining link and text information did not lead to an improvement in clustering quality. It is remarkable that for the WebKB data the inbound links seem to contain far more valuable information for clustering than the textual content of the web pages.
[Figure 1: bar charts of cluster entropy for k-Means and multinomial EM on the CiteSeer and WebKB data sets, comparing the inbound-link, outbound-link, multi-view, and undirected models.]
Fig. 1. Cluster entropy for link-based clustering.
[Figure 2: bar charts of cluster entropy for k-Means and multinomial EM; the CiteSeer panel compares outbound links, text, and text+outbound links, the WebKB panel compares inbound links, text, and text+inbound links.]
Fig. 2. Cluster entropy for link- and text-based clustering.
We also ran experiments with concatenated text and link vectors. Yet for all datasets and algorithms, clustering quality was significantly worse in comparison to multi-view clustering.
6 Conclusion
We analyzed how partitioning clustering algorithms can be applied to the problem of finding communities in linked data, using similarity metrics based on co-citation, bibliographic coupling, and textual similarity as well as combinations of them. For the combination of different similarity metrics we considered an undirected and a multi-view model. We motivated and discussed the multinomial distributional model for citation data that can be used to instantiate the general EM algorithm. Experiments show that for publication citation analysis (CiteSeer data) the combination of different measures always improves the clustering performance. The best performance is achieved with the multi-view model based on outlink and textual data. By contrast, for web citation analysis (WebKB data) the inbound links are most informative and combining this measure with others (outbound links or text) deteriorates the performance.
Acknowledgment This work was supported by the German Science Foundation DFG under grant SCHE 540/10-1. We thank Lise Getoor for kindly providing us with the CiteSeer data set.
References
ALBERICH, R., MIRO-JULIA, J., & ROSSELLÓ, F. (2002): Marvel universe looks almost like a real social network (Preprint). arXiv id 0202174.
BICKEL, S., & SCHEFFER, T. (2004): Multi-view clustering. IEEE International Conference on Data Mining.
DASGUPTA, S., LITTMAN, M.L., & McALLESTER, D. (2002): PAC generalization bounds for co-training. Advances in Neural Information Processing Systems 14 (pp. 375–382). Cambridge, MA: MIT Press.
DEMPSTER, A., LAIRD, N., & RUBIN, D. (1977): Maximum likelihood from incomplete data via the EM algorithm. Journ. of Royal Stat. Soc. B, 39.
GARFIELD, E. (1972): Citation analysis as a tool in journal evaluation. Science, 178, 471–479.
GETOOR, L. (2003): Link mining: A new data mining challenge. SIGKDD Explorations 5.
GIBSON, D., KLEINBERG, J.M., & RAGHAVAN, P. (1998): Inferring web communities from link topology. UK Conference on Hypertext (pp. 225–234).
HE, X., DING, C.H.Q., ZHA, H., & SIMON, H.D. (2001): Automatic topic identification using webpage clustering. ICDM (pp. 195–202).
HOPCROFT, J., KHAN, O., & SELMAN, B. (2003): Tracking evolving communities in large linked networks. Proceedings of the SIGKDD International Conference on Knowledge Discovery and Data Mining.
KAUTZ, H., SELMAN, B., & SHAH, M. (1997): The hidden web. AI Magazine, 18, 27–36.
LILJEROS, F., EDLING, C., AMARAL, L., STANLEY, H., & ABERG, Y. (2001): The web of human sexual contacts. Nature, 411, 907–908.
LU, Q., & GETOOR, L. (2003): Link-based text classification. IJCAI Workshop on Text Mining and Link Analysis, Acapulco, MX.
MODHA, D.S., & SPANGLER, W.S. (2000): Clustering hypertext with applications to web searching. ACM Conference on Hypertext (pp. 143–152).
REDNER, S. (1998): How popular is your paper? An empirical study of the citation distribution. European Physical Journal B, 4, 131–134.
WANG, Y., & KITSUREGAWA, M. (2001): Link based clustering of Web search results. Lecture Notes in Computer Science, 2118.
WATTS, D., & STROGATZ, S. (1998): Collective dynamics of small-world networks. Nature, 393, 440–442.
WHITE, H. (2003): Pathfinder networks and author cocitation analysis: a remapping of paradigmatic information scientists. Journal of the American Society for Information Science and Technology, 54, 423–434.
WHITE, H., & McCAIN, K. (1989): Bibliometrics. Annual Review of Information Science and Technology, 24, 119–186.
Crosslinguistic Computation and a Rhythm-based Classification of Languages
August Fenk1 and Gertraud Fenk-Oczlon2
1 Institut für Medien- und Kommunikationswissenschaft, Universität Klagenfurt, 9020 Klagenfurt, Austria
2 Institut für Sprachwissenschaft und Computerlinguistik, Universität Klagenfurt, 9020 Klagenfurt, Austria
Abstract. This paper is in line with the principles of numerical taxonomy and with the program of holistic typology. It integrates the level of phonology with the morphological and syntactical level by correlating metric properties (such as n of phonemes per syllable and n of syllables per clause) with non-metric variables such as the number of morphological cases and adposition order. The study of crosslinguistic patterns of variation results in a division of languages into two main groups, depending on their rhythmical structure. Syllable-timed rhythm, as opposed to stress-timed rhythm, is closely associated with a lower complexity of syllables and a higher number of syllables per clause, with a rather high number of morphological cases and with a tendency to OV order and postpositions. These two fundamental types of language may be viewed as the “idealized” counterparts resulting from the very same and universal pattern of variation.
1 Holistic Typology and Numerical Taxonomy
The goal of linguistic typology was from the very beginning a “classification” of languages not from the perspective of genetic and areal relations (Altmann & Lehfeldt (1973: 13)), but a “typological classification” such as the “morphological typology of the nineteenth and early twentieth centuries” (Croft (1990: 1)). In Croft the term “classification” is used in the sense of a superordinate concept, and not, as in several other authors, as a neighbouring concept of “typology”. Hempel & Oppenheim, however, suggest using “typological system” as a superordinate concept comprising “ordnende” as opposed to “klassifizierende Form” (Hempel & Oppenheim (1936: 79, 121)). In its modern form, the domain of typology is “the study of cross-linguistic patterns of variation”, says Croft (1990: 43) and attributes its earnest beginnings to Greenberg’s (1966) discovery of implicational universals of morphology and word order. Greenberg’s work was indeed very modern as compared with those recent studies confining themselves to seeking dependencies within syntax, within morphology, or within phonology. But his studies are, from the point of view of a “holistic typology”, instances of a “partial typology”. The program of a “holistic” or “systemic typology” is much older and even more
ambitious with its claim to integrate also phonological properties, in addition to grammatical properties, i.e. syntactic parameters (such as word order) and morphological parameters. In the words of Georg von der Gabelentz, who introduced the term "typology" into linguistics: "Jede Sprache ist ein System, dessen sämmtliche Theile organisch zusammenhängen und zusammenwirken. /.../ Ich denke an Eigenthümlichkeiten des Wort- und des Satzbaues, an die Bevorzugung oder Verwahrlosung gewisser grammatischer Kategorien. Ich kann, ich muss mir aber auch denken, dass alles dies zugleich mit dem Lautwesen irgendwie in Wechselwirkung stehe. /.../ Aber welcher Gewinn wäre es auch, wenn wir einer Sprache auf den Kopf zusagen dürften: Du hast das und das Einzelmerkmal, folglich hast du die und die weiteren Eigenschaften und den und den Gesammtcharakter!" (von der Gabelentz (1901: 481); cited from Plank (1991: 421)). [Roughly: "Every language is a system all of whose parts are organically connected and interact. /.../ I am thinking of peculiarities of word and sentence structure, of the preference for or neglect of certain grammatical categories. I can, indeed I must, also imagine that all this somehow interacts with the sound system. /.../ But what a gain it would be if we could tell a language outright: you have this and that individual feature, consequently you have these and those further properties and such and such an overall character!"] Predictivity is the goal of the "hopeful" program of holistic typology (Plank (1998)), and "numerical taxonomy" specifies the appropriate methodological principle, i.e. the principle to construct taxonomic groups with great "content of information" on the basis of "diverse character correlations in the group under study" (Sokal & Sneath (1963: 50), cited from Altmann & Lehfeldt (1973: 17)).
2 Crosslinguistic Patterns found in Previous Studies
Our previous studies, and the present study as well, use two rather uncommon methods in order to identify crosslinguistic patterns of variation. The first facet of this new correlational device is a "crosslinguistic" computation in the literal sense of the word: Each single language is represented by a single data pair (concerning two variables X and Y), and the computation is across the whole corpus of (a, b, c, ..., n) languages. The second facet is the use of two correlational findings as the premises from which one may infer a third correlational assumption: Given high correlations of a certain variable X with two different partners (Y, Z), this is a good hint that there might be a correlation between Y and Z as well. The higher the correlations XY and XZ, and the higher therefore the respective determination coefficients, the more plausible the inference regarding a correlation YZ. An example in the form of a syllogistic inference:
the higher Y, the lower X.
the lower X, the higher Z.
Therefore: the higher Y, the higher Z.
"Therefore" in the conclusion means: "Therefore" it is plausible to proceed to the assumption of a positive correlation YZ. To put it more precisely and more generally: In the absence of any differing content-specific arguments we have to expect a positive rather than a negative sign of a third correlation in cases of equal signs in the "premises", and a negative rather than a positive sign of a third correlation in cases of different signs (+, −) in the "premises".
language    syll./clause  phon./syll.
Dutch       5.045         2.9732
...
English     5.772         2.6854
...
Italian     7.500         2.1212
...
Japanese    10.227        1.8756

Table 1. The principle of a "crosslinguistic correlation" in the literal sense of the term (see correlation (a) in the text)
Needless to say, any specific expectation of this sort may prove to be wrong despite its a priori plausibility. This way of statistical thinking is, in principle, known from the methods of partial correlation and path analysis. What seems to be new, at least within typological research, is its explicit use in order to generate new assumptions or to judge the plausibility of new assumptions respectively. Both facets of this inferential device can best be demonstrated by means of and together with the results of our previous studies.
The first of these studies is a statistical reanalysis (Fenk-Oczlon & Fenk (1985)) of experimental data by Fenk-Oczlon (1983): In the experimental study, native speakers of 27 different languages were asked to give a written translation of a set of 22 simple declarative sentences, e.g. The sun is shining; I thank the teacher, and to determine the number of syllables of each of the sentences. These written translations (completely represented in the appendix of Fenk-Oczlon (1983)) made it possible, moreover, to count the words per sentence and to determine the number of phonemes with the aid of grammars of the respective languages. (The results of these procedures and calculations, i.e. the characteristic values of each single language, such as mean n of syll./clause and mean n of words/clause, are listed in Fenk & Fenk-Oczlon (1993, Table 4).) As expected, the languages' mean number of syllables per clause was approximately in the region of Miller's (1956) magical number seven, plus or minus two. But obviously the single languages' position within this range on the continuum "n of syllables/clause" was not accidental: Dutch, which is known for its complex syllables, encoded the semantic units with a mean of 5.05 syllables/clause; Japanese with its extremely simple syllables (or morae) marked the other end of the range with a mean of 10 syllables (or morae) per clause. We suspected the syllable complexity (n of phonemes/syllable) to be the relevant determinant. This assumption was tested by correlating the languages' mean number of syllables/clause with their mean number of phonemes/syllable.
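As a toy illustration of such a crosslinguistic correlation, the sketch below correlates the two per-language means shown in Table 1 for the four languages listed there; the published analyses of course use the full language samples.

```python
from scipy.stats import pearsonr

# Per-language means taken from Table 1 (remaining languages omitted here).
syll_per_clause = {"Dutch": 5.045, "English": 5.772, "Italian": 7.500, "Japanese": 10.227}
phon_per_syll = {"Dutch": 2.9732, "English": 2.6854, "Italian": 2.1212, "Japanese": 1.8756}

languages = list(syll_per_clause)
x = [syll_per_clause[l] for l in languages]
y = [phon_per_syll[l] for l in languages]

r, p = pearsonr(x, y)   # one data pair per language, correlated across languages
print(f"r = {r:.3f}, p = {p:.3f}")   # strongly negative, as in correlation (a) below
```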
This was, as far as we can see, the first "crosslinguistic correlation" in the literal sense of the word, and it turned out to be highly significant (Fenk-Oczlon & Fenk (1985)):
(a) the more syllables per clause, the fewer phonemes per syllable
In a later study (Fenk & Fenk-Oczlon (1993)) with a slightly extended sample of languages we tested three further assumptions (b, c, and d). Correlation (a) indicates the view of systemic balancing effects providing a crosslinguistically "constant" or "invariant size" of simple declarative sentences. If this view holds, one has to assume a further balancing effect between word complexity (in terms of n of syllables) and the complexity of sentences (in terms of n of words):
(b) the more words per clause, the fewer syllables per word
Correlation (b) is a crosslinguistic version of Menzerath's generalization "the bigger the whole, the smaller its parts", while the following correlation (c) is a crosslinguistic version of a law actually verified by Menzerath (1954) in German. Here, the "whole" is not the sentence but the word:
(c) the fewer phonemes per syllable, the more syllables per word
Correlations (a) and (c) taken together as "premises" (see above) indicated a positive correlation (d):
(d) the more syllables per clause, the more syllables per word
The whole set of mutually dependent linear correlations (a, b, c, d) proved to be significant, and the calculations of higher-order (e.g. quadratic) functions resulted, for obvious reasons, in even higher determination coefficients. This pattern of crosslinguistic variation seems to reflect time-related constraints in sentence production and perception.
A follow-up study (Fenk-Oczlon & Fenk (1999)) with an again extended sample of now 34 languages (18 Indo-European including German, and 16 non-Indo-European) could not only verify this set of correlations between metric properties but revealed, moreover, a significant association between such metric properties and the predominant word order of languages. Comparisons between Object-Verb order and Verb-Object order and the respective t-tests showed significantly that OV order is associated with a low number of phonemes per syllable and a high number of syllables per word and per clause, and VO order with the opposite characteristics. These results encouraged our search for further connections between metric and non-metric properties.
3 Connecting Metric with Non-metric Properties
The formulation of the following hypotheses was, first of all, guided by more or less provisional ideas about interdependences between linguistic characteristics, but was assisted by the "inferential principle" described above. The linguistic arguments and the relevant chain of reasoning (for more details see Fenk-Oczlon & Fenk (2005)) resulted in a set of new hypotheses.
[Figure 1: correlation diagram over the variables syll./clause, syll./word, words/clause, phon./syll., "tendency to prepositions", and "n of cases".]
Fig. 1. A correlational model connecting metric properties (in the left part of the figure) with the two non-metric properties "tendency to prepositions" and "number of cases". Significant correlations: solid lines; non-significant coefficients > 0.32: broken lines; non-significant coefficients < 0.32: dotted lines; e = expected sign differing from the sign obtained.
Actually, the following list contains only 5 different correlations, because B3 is a paraphrase of A3.
A Number of morphological cases: a high number of cases is associated
A1 with a low number of phonemes per syllable (r = −),
A2 with a high number of syllables per clause (r = +), and
A3 with a low proportion of prepositions (r = −), i.e. a tendency to postpositions.
B Adposition order: a tendency to prepositions (as opposed to a tendency to postpositions) is associated
B1 with a high number of phonemes per syllable (r = +),
B2 with a low number of syllables per clause (r = −), and
B3 with a low number of morphological cases (r = −).
The tendency to suffixing is generally stronger than the tendency to prefixing (e.g. Greenberg (1966)), and postpositions get more easily attached to the stem, thus forming a new semantic case (e.g. a local case). This is the linguistic argument for hypothesis A3. One might add a formal argument connecting our metric parameters with the non-metric properties A and B: Given a plausible assumption of a correlation of A or B with either "syll./clause" or "phon./syll.", this is sufficient, most apparently in the case of a "diagonal" relation in the right part of Figure 1, for the construction of this correlational model. A point-biserial correlation revealed a highly significant result regarding correlation A3: A high proportion of postpositions, or a low proportion of prepositions respectively, coincides with a high number of cases (Fenk-Oczlon & Fenk (2005)). The negative correlation A1 between the number of cases and the number of phonemes per syllable proved to be "almost significant" when calculated for only those 20 languages having case.
Figure 1 illustrates in its left part the correlations between the metric variables and connects these complexity measures with the non-metric variables (adposition order, number of cases) in the right part. All significant correlations correspond to the plausibility arguments explicated above. In the right part, even the non-significant correlations correspond to those arguments. Exceptions are the two non-significant correlations in the left part of Figure 1: Seeing the significant correlations (solid lines) of the parameter syll./word with its partners syll./clause and words/clause one should expect rather a negative sign (e = −) in a possible correlation between these two partners, while the two significant correlations between syll./word and its “partners” words/clause and phon./syll. have the same sign and would rather suggest a positive sign (e = +) between those two partners. Actually, the result was a positive coefficient in the first case (r = +0.328, broken diagonal line) and a negative coefficient near zero in the second case (r = −0.013, dotted line).
4 A Rhythm-based Distinction Between Two Fundamental Types of Language
The comparison in Table 2, though not statistically corroborated in every detail, offers a synopsis of our results so far. We should add that a high number of morphological cases (right column) will go hand in hand with separatist case exponents and a low number of morphological cases (left column) with cumulative case exponents. And it is really tempting to associate the pattern in the right column with agglutinative morphology and the pattern on the left with fusional or isolating morphology. Instead we take the speech rhythm as an anchor of typological distinction, as did Auer (1993) within phonology and Donegan & Stampe (1983) as well as Gil (1986) in the sense of a holistic approach, and as a determinant of a pattern of variation affecting phonology, morphology, and syntax. Our correlational results match the findings and interpretations of Donegan & Stampe rather than those of Gil.
All natural languages show a segmentation into intonation units, due to our breath cycle, and a segmentation of intonation units into syllables. Intonation units may be considered a special case of action units (Fenk-Oczlon & Fenk (2002)) comprising a limited number of syllables as their basic element. Smaller parts of syllables, such as vowels and consonants, are not more than "analytical devices" or "convenient fictions for use in describing speech" (Ladefoged (2001: 175)). The syllables are not only the basic elements of speech and the most appropriate crosslinguistic measure for the "size" of sentences; they also represent the single "pulses" of a language's rhythmic pattern. And this pattern is closely associated with syllable complexity: syllable-timed rhythm with low syllable complexity (low n of phonemes per syllable), stress-timed rhythm with high syllable complexity (e.g. Roach (1982), Auer (1993), Ramus et al. (2000)).
                       stress-timed rhythm                syllable-timed rhythm
metric properties:     high n of phonemes per syllable    low n of phonemes per syllable
                       low n of syllables per clause      high n of syllables per clause
                       low n of syllables per word        high n of syllables per word
                       high n of words per clause         low n of words per clause
non-metric properties: VO order                           OV order
                       tendency to prepositions           tendency to postpositions
                       low n of cases                     high n of cases
Table 2. Two fundamental types of language
in our Figure 1 is the point of impact: Changes in the rhythmic structure of a language, induced for instance by language contact, will induce changes and balancing effects in other parameters of the system. This “moving” pattern of variation, and the boundaries of variation, may be viewed as universal facts about language. The two patterns set out in Table 2 may well be considered “idealized” counterparts resulting from the very same and universal pattern of variation. Our model of this universal “groundplan” of languages includes, first of all, metric variables or otherwise quantitative variables, such as a language’s number of cases. This was an advantage in constructing a correlational model of that groundplan. After integrating the data from our most recently obtained translations of an English version of our test-sentences into Austronesian languages, we hope to improve the model by some kind of path analysis including, where possible, a search for the “best fitting” function between any two partners related to each other.
References
ALTMANN, G. and LEHFELDT, W. (1973): Allgemeine Sprachtypologie. Wilhelm Fink, München.
AUER, P. (1993): Is a Rhythm-based Typology Possible? A Study of the Role of Prosody in Phonological Typology. KontRI Working Paper (University of Konstanz) 21.
CROFT, W. (1990): Typology and Universals. Cambridge University Press, Cambridge.
DONEGAN, P. and STAMPE, D. (1983): Rhythm and the Holistic Organization of Language Structure. In: J.F. Richardson et al. (Eds.): Papers from the Parasession on the Interplay of Phonology, Morphology and Syntax. Chicago: CLS 1983, 337–353.
FENK, A. and FENK-OCZLON, G. (1993): Menzerath’s Law and the Constant Flow of Linguistic Information. In: R. Köhler and B. Rieger (Eds.): Contributions to Quantitative Linguistics. Kluwer Academic Publishers, Dordrecht, 11–31.
FENK-OCZLON, G. (1983): Bedeutungseinheiten und sprachliche Segmentierung. Eine sprachvergleichende Untersuchung über kognitive Determinanten der Kernsatzlänge. Narr, Tübingen.
FENK-OCZLON, G. and FENK, A. (1985): The Mean Length of Propositions is 7 Plus Minus 2 Syllables—but the Position of Languages within this Range is not Accidental. In: G. D’Ydewalle (Ed.): Cognition, Information Processing, and Motivation. XXIII Int. Congress of Psychology. (Selected/revised papers). North-Holland, Elsevier Science Publishers B.V., Amsterdam, 355–359.
— (1999): Cognition, Quantitative Linguistics, and Systemic Typology. Linguistic Typology, 3–2, 151–177.
— (2002): The Clausal Structure of Linguistic and Pre-linguistic Behavior. In: T. Givón and B.F. Malle (Eds.): The Evolution of Language out of Pre-Language. (Typological Studies 53). John Benjamins, Amsterdam, 215–229.
— (2005): Crosslinguistic Correlations between Size of Syllables, Number of Cases, and Adposition Order. In: G. Fenk-Oczlon and Ch. Winkler (Eds.): Sprache und Natürlichkeit. Gedenkband für Willi Mayerthaler. Narr, Tübingen, 75–86.
GABELENTZ, G. von der (1901): Die Sprachwissenschaft, ihre Aufgaben, Methoden und bisherigen Ergebnisse. Tauchnitz, Leipzig.
GIL, D. (1986): A Prosodic Typology of Language. Folia Linguistica, 20, 165–231.
GREENBERG, J.H. (1966): Some Universals of Grammar with Particular Reference to the Order of Meaningful Elements. In: J.H. Greenberg (Ed.): Universals of Language. MIT Press, Cambridge, MA, 73–113.
HEMPEL, C.G. and OPPENHEIM, P. (1936): Der Typusbegriff im Lichte der neuen Logik. A.W. Sijthoff’s Uitgeversmaatschappij N.V., Leiden.
LADEFOGED, P. (2001): Vowels and Consonants: an Introduction to the Sounds of Languages. Blackwell Publishing, Oxford.
MENZERATH, P. (1954): Die Architektonik des deutschen Wortschatzes. Dümmler, Bonn.
MILLER, G.A. (1956): The Magical Number Seven, Plus or Minus Two: some Limits on our Capacity for Processing Information. Psychological Review, 63, 81–97.
PLANK, F. (1986): Paradigm Size, Morphological Typology, and Universal Economy. Folia Linguistica, 20, 29–48.
— (1991): Hypology, Typology: The Gabelentz Puzzle. Folia Linguistica, 25, 421–458.
— (1998): The Co-variation of Phonology with Morphology and Syntax: A Hopeful History. Linguistic Typology, 2, 195–230.
RAMUS, F., HAUSER, M.D., MILLER, C., MORRIS, D., and MEHLER, J. (2000): Language Discrimination by Human Newborns and by Cotton-top Tamarin Monkeys. Science, 288, 349–351.
ROACH, P. (1982): On the Distinction between “Stress-Timed” and “Syllable-Timed” Languages. In: D. Crystal (Ed.): Linguistic Controversies. Edward Arnold, London, 73–79.
SOKAL, R.R. and SNEATH, P.H.A. (1963): Principles of Numerical Taxonomy. W.H. Freeman, San Francisco.
Using String Kernels for Classification of Slovenian Web Documents
Blaž Fortuna and Dunja Mladenič
J. Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
Abstract. In this paper we present an approach for classifying web pages obtained from the Slovenian Internet directory, where the web sites covering different topics are organized into a topic ontology. We tested two different methods for representing text documents, both in combination with the linear SVM classification algorithm. The first representation used is a standard bag-of-words approach with TFIDF weights and the cosine distance used as similarity measure. We compared this to String kernels, where text documents are compared not by words but by substrings. This removes the need for stemming or lemmatisation, which can be an important issue when documents are in languages other than English and tools for stemming or lemmatisation are unavailable or are expensive to build or learn. In highly inflected natural languages, such as Slovene, the same word can have many different forms, so String kernels can be expected to have an advantage here over the bag-of-words. While previous research found that bag-of-words outperforms String kernels on English documents, in this paper we show that in classification of documents written in a highly inflected natural language the situation is opposite and String kernels significantly outperform the standard bag-of-words representation. Our experiments also show that the advantage of String kernels is more evident for domains with unbalanced class distribution.
1
Introduction
Classification of documents is usually performed by representing the documents as word-vectors using the bag-of-words document representation and applying some classification algorithm on the vectors (Sebastiani, 2002). The bag-of-words document representation usually cuts a document text into words and represents the document with the frequency of the words that occur in the document. Even though it ignores the order of the words, it was found to perform well in combination with different classification algorithms and usually outperforms alternative representations on the standard problems of document categorization. However, experiments are usually performed on standard document categorization datasets, and most of them contain documents written in English. There are mixed results on the performance change due to using word stemming as a pre-processing step on English documents. However, when dealing with non-English documents, especially documents written in highly inflected languages, applying stemming or lemmatisation can be crucial. Namely, in highly inflected natural languages, a word having the same or very similar meaning can occur in several tens of slightly different forms
(depending on the gender, number, case, etc.). Unfortunately, we do not always have a stemmer or lemmatiser available for a particular natural language (it may not be publicly available or it may not even exist). This paper investigates the performance of an alternative document representation, String kernels, on non-English documents. String kernels cut the document text into sequences of characters regardless of the word boundaries. This can be seen as an alternative approach to handling the problem of having slightly different words carrying almost the same meaning. Namely, in most cases, these words differ in the word suffix, so taking the first k letters of the word (where k is smaller than the average length of the words) can be seen as a way of obtaining a word stem. For illustration, in the following examples of Slovenian sentences, talking about traffic problems, bag-of-words does not find any connection between them. However, String kernels identify that the words ’cesti’, ’obcestnega’, ’cestisce’ and ’cestninsko’, all different forms of the word ’road’, share common substrings. Note that in the case of String kernels of length 5, the substring ’cesti’ does not necessarily contain letters from the same words (see bold letters in the example); a small illustrative sketch follows the example sentences.
• ’Prevrnjeni tovornjak povzroca zastoje na cesti . . . ’
• ’Zaradi zamasenega obcestnega jarka in odtoka je popljavneno cestisce na . . . ’
• ’Pred cestninsko postajo nastajajo daljsi zastoji.’
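The toy illustration below (not the authors' code) shows why substring features help: the inflected and derived forms of 'road' share character n-grams even though bag-of-words treats them as unrelated tokens. The string kernel used in the paper is more general still, since it scores gapped subsequences, which may even span word boundaries.

```python
# Shared contiguous character n-grams between the word forms from the example.
def char_ngrams(word, n):
    return {word[i:i + n] for i in range(len(word) - n + 1)}

forms = ["cesti", "obcestnega", "cestisce", "cestninsko"]
common = set.intersection(*(char_ngrams(w, 4) for w in forms))
print("4-grams shared by all four forms:", sorted(common))   # ['cest']
```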
Previous research has shown that on categorization of English documents with linear SVM, the bag-of-words document representation outperforms String kernels (Saunders et al., 2002). We show that String kernels outperform the bag-of-words representation on documents written in a highly inflected natural language, namely Slovenian. The difference in performance is larger on problems with unbalanced class distribution. To the best of our knowledge this is the first experimental comparison of these two document representations on documents written in a highly inflected natural language. This paper is organized as follows. Section 2 describes the methodology used, including the Support Vector Machine classifier and String kernels. Section 3 describes the datasets used. An experimental comparison of the two document representations is provided in Section 4, followed by a discussion in Section 5.
2
Methodology
2.1
Support Vector Machine
The most common technique for representing text documents is bag-of-words (BOW) using word frequency with TFIDF weighting. In the bag-of-words representation there is a dimension for each word; a document is then encoded as a feature vector with word frequencies as elements. Document classification has been performed using different classification algorithms on the bag-of-words document representation. The linear Support Vector Machine (SVM)
(Boser et al., 1992) algorithm is known to be one of the best performing for text categorization, e.g., in (Joachims, 1999). Thus, in this paper we report on experiments using linear SVM for classifying web documents. The Support Vector Machine is a family of algorithms that has gained wide recognition in recent years as one of the state-of-the-art machine learning algorithms for tasks such as classification, regression, etc. In the basic formulation they try to separate two sets of training examples by a hyperplane that maximizes the margin (the distance between the hyperplane and the closest points). In addition, one usually permits a few training examples to be misclassified. For unbalanced datasets, different costs can be assigned to examples according to the class value (Morik et al., 1999). The cost is controlled by the parameters j and C, where C corresponds to the misclassification cost (C+ = jC and C− = C). An alternative approach to handling unbalanced datasets, based on shifting the SVM-induced hyperplane, was proposed in (Brank et al., 2003). In this paper we consider only changing the value of the SVM parameter j in order to improve performance on unbalanced datasets. We avoided hyperplane shifting by using an evaluation measure that does not depend on the threshold. When constructing the SVM model, only the inner product between training examples is needed for learning the separating hyperplane. This allows the use of a so-called kernel function. The kernel function is a function that calculates the inner product between two mapped examples in feature space. Since explicit extraction of features can have a very high computational cost, a kernel function can be used to tackle this problem by implicit use of mapped feature vectors.
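As a concrete, non-authoritative illustration of this setup (the paper uses the authors' own Text Garden implementation rather than the library assumed here), a bag-of-words pipeline with an asymmetric misclassification cost roughly corresponding to the j parameter could look as follows.

```python
# Sketch only: TFIDF bag-of-words + linear SVM with a j-like class weight.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

def train_bow_svm(docs, labels, C=1.0, j=1.0):
    """docs: list of raw document strings; labels: 0/1 with 1 = positive class."""
    model = make_pipeline(
        TfidfVectorizer(lowercase=True),                  # BOW with TFIDF weights
        LinearSVC(C=C, class_weight={0: 1.0, 1: j}),      # j > 1 penalises missed positives
    )
    model.fit(docs, labels)
    return model
```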
2.2
String Kernels
The main idea of string kernels (Lodhi et al., 2002; Saunders et al., 2002) is to compare documents not by words, but by the substrings they contain – there is a dimension for each possible substring and each document is encoded as a feature vector with substring weights as elements. These substrings do not need to appear contiguously in the document, but they receive different weighting according to the degree of contiguity. For example, the substring ’c-a-r’ is present both in the word ’card’ and in the word ’custard’, but with different weighting. The weight depends on the length of the substring and on the decay factor λ. In the previous example, the substring ’car’ would receive weight λ³ as part of ’card’ and λ⁶ as part of ’custard’. Feature vectors for documents are not computed explicitly because this is computationally very expensive. However, there exists an efficient dynamic programming algorithm (Lodhi et al., 2002) that computes the inner product between two feature vectors. We use this algorithm as a kernel in the SVM. The advantage of this approach is that it can detect words with different suffixes or prefixes: the words ’microcomputer’, ’computers’ and ’computerbased’ all share common substrings. The disadvantage of this approach is that its computational cost is higher than that of BOW.
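To make the weighting above concrete, here is a deliberately brute-force sketch of the gapped-substring features; it reproduces the λ³/λ⁶ example and is only practical for single words, whereas the dynamic-programming kernel of Lodhi et al. (2002) is what the paper actually relies on for whole documents.

```python
from collections import defaultdict
from itertools import combinations

def subsequence_features(s, k, lam):
    """phi_u(s) = sum of lam**(span) over all occurrences of u as a gapped subsequence."""
    phi = defaultdict(float)
    for idx in combinations(range(len(s)), k):
        u = "".join(s[i] for i in idx)
        phi[u] += lam ** (idx[-1] - idx[0] + 1)
    return phi

def string_kernel(s, t, k=3, lam=0.5):
    fs, ft = subsequence_features(s, k, lam), subsequence_features(t, k, lam)
    return sum(w * ft[u] for u, w in fs.items() if u in ft)

lam = 0.5
print(subsequence_features("card", 3, lam)["car"])      # lam**3: contiguous occurrence
print(subsequence_features("custard", 3, lam)["car"])   # lam**6: gapped occurrence
print(string_kernel("card", "custard"))
```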
We have used our own implementation of SVM, bag-of-words and string kernels which are all part of our Text Garden 1 suite of tools for text mining. The SVM implementation is very efficient and gives similar performance to SVMlight. Its advantage is a tight integration with the rest of Text Garden.
3
Dataset Description
We compared the performance of bag-of-words and String kernels on several domains containing documents from the Mat’kurja directory of Slovenian web documents (a directory similar to the Open Directory or Yahoo!). Each web page is described with a few sentences and is assigned to a topic from the directory’s taxonomy. The whole directory contains 52,217 documents and 63,591 words. Similarly to what was proposed in some previous experiments on Yahoo! documents (Mladenic and Grobelnik, 2003), we selected some top-level categories and treated each as a separate problem. The top-level category ’Arts’, having 3,557 documents, and ’Science and Education’, having 4,046 documents, were used, ignoring the hierarchical structure of the documents they contain. From each of them we selected three subcategories of different sizes, thus obtaining different percentages of positive examples. This way we obtained domains with different proportions of positive examples, ranging from unbalanced (where only 4% of the examples are positive and 96% are negative) to balanced with 45% of the examples being positive. The selected domains are as follows. From Arts we selected three subcategories: Music having 45% of the documents, Painting having 7% of the documents and Theatre having 4% of the documents. From ’Science and Education’ the following three subcategories were selected: Schools having 25% of the documents, Medicine having 14% of the documents and Students having 12% of the documents. For each subcategory we define a separate domain having all the documents from the subcategory as positive documents and all the documents from the other subcategories of the same top-level category as negative documents.
4
Experiments
All the experimental results are averaged over five random splits using the holdout method, randomly splitting each category into a training part (30%) and a testing part (70%). A classifier is generated from the training documents and evaluated on the testing documents. The evaluation is performed using the Break Even Point (BEP) – a hypothetical point at which precision (the ratio of positive documents among the retrieved ones) and recall (the ratio of retrieved positive documents among all positive documents) are the same. There was no special pre-processing performed on the documents used in the experiments except removing HTML tags and changing all the characters to lowercase.
1 http://www.textmining.net
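A minimal sketch of how the BEP just described can be computed from ranked classifier outputs is given below; it is an illustration of the measure rather than the evaluation code used in the paper.

```python
import numpy as np

def break_even_point(scores, labels):
    """scores: real-valued classifier outputs; labels: 1 for positive, 0 for negative."""
    order = np.argsort(-np.asarray(scores))
    y = np.asarray(labels)[order]
    tp = np.cumsum(y)                       # true positives at each rank cutoff
    k = np.arange(1, len(y) + 1)
    precision = tp / k
    recall = tp / y.sum()
    i = int(np.argmin(np.abs(precision - recall)))   # point where precision ~= recall
    return (precision[i] + recall[i]) / 2.0
```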
Category     Subcategory   BOW [%]     SK [%]
M-Arts       Music         80 ± 1.9    88 ± 0.4
             Painting      22 ± 5.5    60 ± 2.6
             Theatre       24 ± 3.1    61 ± 6.6
M-Science    Schools       81 ± 3.8    78 ± 2.6
             Medicine      32 ± 1.9    75 ± 2.0
             Students      30 ± 4.0    59 ± 1.1
Table 1. Results for the classification task; BEP is used as the evaluation measure. String kernel of length 5 and λ = 0.2. Bold numbers are significantly higher.
For classification we use the linear SVM algorithm with the cost parameter C set to 1.0. We ran experiments for different values of the length of the substrings used in the string kernel, of the decay parameter λ, and of the parameter j. We tested the following hypotheses, all assuming the usage of a linear SVM classifier for document classification: string kernels outperform bag-of-words on documents written in highly inflected natural languages (Section 4.1), with the difference being more evident on data with unbalanced class distribution (Section 4.2), and using the SVM mechanism for handling unbalanced data improves the performance of the two representations (Section 4.3). We have also, to a limited extent, investigated the influence of two String kernel parameters, with the main hypothesis being that using too short String kernels hurts the performance (Section 4.4).
4.1
String Kernels vs. Bag-of-words on Inflected Languages
The first hypothesis we tested is that String kernels achieve better results than bag-of-words on documents written in a highly inflected language. Our experiments confirm that hypothesis, as on eight out of nine domains String kernels achieve significantly higher BEP than bag-of-words (with significance level 0.001 on seven domains and 0.05 on one domain). Table 1 gives the results of categorization for six domains of Slovenian documents.
4.2
String Kernels vs. Bag-of-words on Unbalanced Datasets
From the initial experiments on a few domains, we noticed that the difference in performance between the two representations varies between domains. Thus we performed experiments on more domains, selecting them to have different percentages of positive examples. Our hypothesis was that the difference is larger on domains with more unbalanced class distributions. As can be seen from Figure 1, this is the case: on domains having less than 15% of positive examples (the four leftmost domains in Figure 1), String kernels achieve much higher BEP compared to bag-of-words. On one domain having 25% of positive examples the difference
Fig. 1. Comparison of SVM performance on Slovenian documents using bag-of-words (BOW) and String kernels (SK). The domains are sorted according to the percentage of positive examples, from M-Theatre (4%) to M-Music (45%).
             Bag-of-words                      String kernels
j            1.0        5.0        10.0        1.0        5.0        10.0
Music        80 ± 1.8   84 ± 1.0   84 ± 0.8    88 ± 0.4   87 ± 0.8   87 ± 0.8
Painting     22 ± 5.5   48 ± 2.6   48 ± 2.6    60 ± 2.6   58 ± 2.7   58 ± 2.7
Theatre      24 ± 3.1   38 ± 7.8   38 ± 7.7    61 ± 6.6   62 ± 5.8   62 ± 5.8
Schools      81 ± 3.9   80 ± 0.8   80 ± 1.1    78 ± 2.6   77 ± 2.1   77 ± 1.9
Medicine     32 ± 1.8   55 ± 3.1   55 ± 3.1    75 ± 2.0   73 ± 0.8   73 ± 0.6
Students     30 ± 4.0   50 ± 3.3   50 ± 3.0    59 ± 1.1   58 ± 0.8   58 ± 1.0
Table 2. Influence of the SVM parameter j on six domains of Slovenian documents, using bag-of-words and String kernels of length 5 with λ = 0.2.
in performance is not significant, while on the balanced class domains (the last column in Figure 1) String kernels are again significantly better than bag-of-words (but the absolute difference in performance is much lower). 4.3
Setting SVM Parameters to Handle Unbalanced Datasets
The categorization algorithm we are using, SVM, already has a mechanism for handling domains with unbalanced class distribution (commonly referred to as the parameter j). The j parameter enables assigning different misclassification costs to examples of the positive and of the negative class. The default value of j is 1.0. Setting it to some value greater than 1.0 is equivalent to over-sampling by having j copies of each positive example. We investigated the influence of changing the value of j and found that changing it from 1.0 to 5.0 significantly improves the performance (significance level 0.01) of bag-of-words on all but one domain (see Table 2). Setting j to higher values (j = 10.0) does not significantly change the performance. Changing the value of the parameter j when using String kernels does not significantly influence the performance of SVM, as can also be seen in Table 2.
4.4
Changing Parameters of String Kernels
String kernels work with sequences of characters. It was shown in previous work on English documents (Lodhi et al., 2002) that the length of the sequence influences the performance to some degree. As expected, our experiments confirmed that finding for Slovenian documents, too. Namely, using too short string kernels (in our case 3 characters) results in significantly (significance level 0.05) lower performance than using longer string kernels, achieving on average over the six domains a BEP of 65.5 compared to a BEP of 70 achieved when using String kernels of length 4. Having length 4, 5 or 6 results in similar performance on Slovenian documents. However, one would expect that this might depend on the natural language, as in some cases length 4 or 5 may still be too short. We also varied the value of the decay factor of the string kernel (parameter λ) from 0.1 to 0.4 and found that it does not influence the performance on our domains.
5
Discussion
We have tested two methods for representing text documents, bag-of-words and String kernels, both in combination with the linear SVM classification algorithm. We have shown that when dealing with documents written in a highly inflected natural language, such as Slovene, String kernels significantly outperform the commonly used bag-of-words representation. Moreover, the advantage of String kernels is more evident for the domains with unbalanced class distribution having less than 15% of positive examples. As string kernels use substrings instead of whole words for representing document content, this seems to compensate for stemming or lemmatisation, which can be important for documents in highly inflected languages. This is especially important when tools for stemming or lemmatisation are unavailable or expensive. Because we are dealing with highly inflected natural languages, bag-of-words fails to match different forms of the same word. On the other hand, string kernels are able to match them because they use substrings (in our case of length 5) rather than words as features and allow gaps between parts of the substrings. We also found that using the SVM mechanism (parameter j) for handling unbalanced domains significantly improves the bag-of-words performance, but it still stays significantly lower than the performance of String kernels. The same parameter does not significantly influence the performance of String kernels. The performance of String kernels is significantly influenced by the length of the kernel, but only if the kernel is very short (using length 3 yields significantly worse performance than using length 4, but there is no difference between length 4 and length 5).
In future work, it would be interesting to repeat the experiments on some other natural languages and possibly try to relate the advantage of String kernels over bag-of-words to the degree of inflection of the language. In our experiments we use the Break Even Point as the evaluation measure, as commonly used in document categorization. However, we have noticed that when using the threshold proposed by the SVM for predicting the class value, the value of precision or recall is very low, in most cases close to 0. A closer look revealed that even though both bag-of-words and String kernels have problems with setting the right threshold, this is more evident for String kernels. In future work we want to investigate possibilities of improving the threshold, e.g., as post-processing by shifting the SVM-induced hyperplane as proposed in (Brank et al., 2003) for handling unbalanced domains using bag-of-words.
Acknowledgements This work was supported by the Slovenian Research Agency and the IST Programme of the European Community under SEKT Semantically Enabled Knowledge Technologies (IST-1-506826-IP) and PASCAL Network of Excellence (IST-2002-506778).
References
B.E. BOSER, I.M. GUYON, and V.N. VAPNIK (1992): A training algorithm for optimal margin classifiers. Proc. 5th Annual ACM Workshop on Computational Learning Theory, 144–152. Pittsburgh, PA, July 1992. ACM Press.
J. BRANK, M. GROBELNIK, N. MILIC-FRAYLING, and D. MLADENIC (2003): Training text classifiers with SVM on very few positive examples. Technical report, MSR-TR-2003-34.
T. JOACHIMS (1999): Making large-scale SVM learning practical. In: B. Scholkopf, C. Burges, and A. Smola (eds.): Advances in Kernel Methods – Support Vector Learning. MIT Press.
H. LODHI, C. SAUNDERS, J. SHAWE-TAYLOR, N. CRISTIANINI, and C. WATKINS (2002): Text classification using string kernels. Journal of Machine Learning Research, 2, 419–444.
D. MLADENIC and M. GROBELNIK (2003): Feature selection on hierarchy of web documents. Decision Support Systems, 35(1), 45–87.
K. MORIK, P. BROCKHAUSEN, and T. JOACHIMS (1999): Combining statistical learning with a knowledge-based approach – a case study in intensive care monitoring. Int. Conf. Machine Learning.
J. PLISSON, N. LAVRAC, and D. MLADENIC (2004): A rule based approach to word lemmatization. Proc. 7th Int. Conf. Information Society IS-2004, 83–86. Ljubljana: Institut Jozef Stefan.
C. SAUNDERS, H. TSCHACH, and J. SHAWE-TAYLOR (2002): Syllables and other string kernel extensions. Proc. 19th Int. Conf. Machine Learning.
F. SEBASTIANI (2002): Machine learning for automated text categorization. ACM Computing Surveys, 34(1), 1–47.
Semantic Decomposition of Character Encodings for Linguistic Knowledge Discovery
Dafydd Gibbon1, Baden Hughes2, and Thorsten Trippel1
1 Fakultät für Linguistik und Literaturwissenschaft, Universität Bielefeld, Postfach 100 131, D–33501 Bielefeld, Germany
2 Department of Computer Science and Software Engineering, University of Melbourne, Parkville 3010, Australia
Abstract. Analysis and knowledge representation of linguistic objects tends to focus on larger units (e.g. words) than print medium characters. We analyse characters as linguistic objects in their own right, with meaning, structure and form. Characters have meaning (the symbols of the International Phonetic Alphabet denote phonetic categories, the character represented by the glyph ‘∪’ denotes set union), structure (they are composed of stems and parts such as descenders or diacritics or are ligatures), and form (they have a mapping to visual glyphs). Character encoding initiatives such as Unicode tend to concentrate on the structure and form of characters and ignore their meaning in the sense discussed here. We suggest that our approach of including semantic decomposition and defining font–based namespaces for semantic character domains provides a long–term perspective of interoperability and tractability with regard to data–mining over characters by integrating information about characters into a coherent semiotically–based ontology. We demonstrate these principles in a case study of the International Phonetic Alphabet.
1
Introduction and Preliminaries
High quality language documentation according to agreed professional standards is becoming an essential part of the empirical resources available for linguistic analysis, and a new subdiscipline, documentary linguistics, has emerged in this area [Himmelmann, 1998]. The main emphasis of the language documentation enterprise lies in three areas: the provision of extensive and consistently annotated development data for the human language technologies, the sustainable and interpretable preservation of endangered languages data [Gibbon et al, 2004] and the professional archiving of documents of any kind by the methods of text technology. In contrast, little attention has been paid from a linguistic point of view to the incorporation of the smallest structural units of written texts, characters, into this enterprise. On closer inspection, characters, character sets and encodings which are used to represent textual data turn out to be a linguistic domain in their own right, but one which has hardly been explored. Our contribution is to introduce a new approach to character decomposition and classification, and an outline formalisation of this approach. First we discuss encoding strategies, from legacy practice through current Unicode
practice to the need for a more generic approach. We provide a case study around the International Phonetic Alphabet, defining characters as linguistic signs, and examining their properties according to a linguistic model which relates meaning, structure and form, with properties represented as feature vectors or attribute–value matrices (AVMs) according to current notational conventions in general and computational linguistics. The generic character descriptions are used to explicate conventional Unicode and non–Unicode character encodings. We then show how semantic character decomposition brings advantages for the representation of user–oriented properties of characters, such as their linguistic meanings, their structures, or their context–sensitive rendering. In order to show how to overcome problems of missing characters in typical uses we discuss an ontological approach to character mapping, based on the idea of fonts as namespaces with mappings to a variety of encodings. The domain of character encoding has a number of importantly differentiated terms and concepts which are often employed loosely in everyday use, e.g. character, letter, text element, and glyph. These terms need to be clearly defined in order to appreciate the context of the remainder of this work. We follow the model of [Dürst et al, 2004] and [Unicode Consortium, 2003] in defining a character, its various renderings, and the text processes, input methods, collation approaches and storage requirements; these sources should be consulted for further detail on encoding. In addition to the character, its rendering and its role in text processes we are also interested in the semantics and pragmatics of characters, i.e. the meaning and role of characters in the usage contexts of language communities, and in the development of a generic classification of characters from this point of view in a coherent and comprehensive character ontology. We avoid both glyph–based ‘lookalike’ and code–based criteria, and take a linguistic approach to solving the problem of unifying the linguistic properties of both best–practice and legacy character encodings. We have developed an analytical, classificatory and representational approach independent of specific fonts or character encodings, and at a higher generic level than Unicode, in that provision is made for including coherent user–oriented semantics and pragmatics of characters. The representational meta–syntax we use is attribute–value based; for applications in interchange and archiving there is a straightforward mapping into the more verbose XML notational conventions.
2
Characters as Signs: A Case Study of IPA Characters
The body of this work is a short case study of an application area for semantic character decomposition in which feature–based character descriptions are developed as the basic units of a character ontology for character–based data–mining tasks in the context of the semantic web.
We define a character as a linguistic sign and decompose its semantics into linguistic feature vectors representing semantic interpretation (in phonetic, phonemic and orthographic worlds), structure, and glyph rendering interpretation. The decomposition inherits a range of properties from Unicode concepts such as inherent directionality and combining behaviour, and the result is applicable both to Unicode and non–Unicode character encodings. A commonly used standard character set is the International Phonetic Alphabet (IPA). The standardizing body is the International Phonetic Association, which periodically considers revisions to the character set. The organisation of the properties of characters in this set may be expressed as a vector [SYN, STY, SEM], where the components of the vector are defined as follows. The SYN component constitutes the syntax of the characters. Characters may be either stem characters, as in ‘p’, or complex characters consisting of a simple character with one or more diacritics, such as ‘pʰ’. The stem character may be analysed in terms of component functions such as circles, descenders and ascenders. The IPA stem characters are represented by a standard coding known as the IPA coding [International Phonetic Association, 1999], sometimes known as the Esling codes [Esling and Gaylord, 1993], in which each character or diacritic has a numerical code, and the syntax of diacritic arrangements over, under, left and right of characters is defined. Unlike the Unicode code–blocks, the IPA numbers cover the entire IPA character set, and the mappings to IPA semantics and glyphs are technically complete and sound. The IPA code numbers are therefore suitable as a representation at the generic level which we introduce in the present contribution, and for practical purposes these numbers can be mapped into other less straightforward codes (e.g. Unicode, LATEX macros, TrueType or OpenType font tables). The STY component constitutes the style semantics (rendering semantics) of the character, i.e. a mapping of the character (represented by its Esling code, or its code in another code table) to a glyph (or a glyph structure consisting of an arrangement of glyphs) in the sense already defined. A standard description of the IPA glyphs is provided by Pullum and Ladusaw [Pullum and Ladusaw, 1986]; this description pre–dates the most recent revisions of the IPA in 1993 and 1996, however. The style semantics is thus an interpretation function from the character syntax into the style semantic domain of glyph configurations: R : SYN → STY. The SEM domain constitutes the domain semantics of the character, e.g. the sound type denoted by an IPA character as defined by the International Phonetic Association. In the ASCII code set, the hex code 07 denotes a warning, and is rendered by the acoustic beep. The hex code 58 denotes the upper case version of the 24th letter of the English alphabet and is rendered by ‘X’. The denotational semantics is thus an interpretation function from the character syntax into the user–oriented semantic domain: D : SYN → SEM. Examples of denotations of IPA characters are: • the voiceless velar fricative denoted by the simple character ‘x’,
• the aspirated voiceless bilabial plosive denoted by ‘pʰ’. In fact, phoneticians define a number of subdomains for the IPA characters, one of which is language independent (the narrow phonetic domain of physical sounds), the others being language dependent (the phoneme sets of individual languages). The narrow phonetic domain is indicated by square bracket quotes [p], and the phonemic domains are indicated by forward slash quotes /p/. The quotes represent semantic interpretation functions from the character rendered by the glyph or glyphs which they enclose into the relevant denotation domain of the character. The mappings I : SYN → SEM and I : SYN → STY are traditionally defined implicitly and simultaneously in the IPA chart (http://www2.arts.gla.ac.uk/IPA/fullchart.html).
META:  SCHEME: IPA;  CHAR: aspirated-p
SYN:   STEM [1]:  CASE: lower;  CHAR: p;  IPA-NUMBER: 101;
                  UNICODE: NAME: latin small letter p, CODE: U+0070
       DIA  [2]:  CASE: lower;  CHAR: h;  IPA-NUMBER: 404;
                  UNICODE: NAME: latin small letter h, CODE: U+0068
STY:   STEM [1]:  GLYPH: p;  NAME: ‘pee’;  PULLUM-LADUSAW: lower-case p
       DIA  [2]:  GLYPH: h;  NAME: ‘aitch’;  PULLUM-LADUSAW: superscript h
       REL <[1],[2]>:  DIA-X-POS: post;  DIA-Y-POS: super
SEM:   SEG [1]:  DOMAIN: narrow-phonetic;  PLACE: bilabial;  MANNER: plosive;  VOICING: voiceless
       VOT [2]:  aspirated
PRAG:  regulated by International Phonetic Association
Fig. 1. Structure of semiotic vector extract for [pʰ] in IPA name-space.
For IPA characters, the vector components SYN, STY and SEM are further analysed into component–specific vectors specifying syntactic composition, glyph structure style, and sound type semantics respectively. These
vectors can be represented as attribute–value structures in a standard linguistic notation; the example illustrated is ‘pʰ’. The composition of the syntactic sub–vector SYN, the glyph structure style vector STY, and the sound type semantic vector SEM for ‘pʰ’ is shown in Figure 1. The full description cannot be given in this context for reasons of space. Indeed there may not be a full description, in the sense that alternative codings for this character also exist, in the form of the Esling codes, LATEX macros, code–points in legacy fonts, or even the SAMPA mapping to basic latin characters [Gibbon et al. 2000], and these can be included in the attribute–value structure. Following computational linguistic conventions, the mappings between the main vectors are shown here by co–indexing the related properties in the three main vectors. The detailed technical formalisation of the mappings between syntax, rendering and semantics is not the subject of this contribution, however.
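Purely as an illustration of how such a description might be handled programmatically (this is not the authors' formalism, and the attribute names follow the reconstruction of Figure 1 above), the semiotic vector of ‘pʰ’ can be written down as a nested attribute–value structure:

```python
# Illustrative nested attribute-value structure; coindexing is mimicked
# with shared "TAG" values.
ASPIRATED_P = {
    "META": {"SCHEME": "IPA", "CHAR": "aspirated-p"},
    "SYN": {
        "STEM": {"TAG": "1", "CASE": "lower", "CHAR": "p",
                 "IPA-NUMBER": 101, "UNICODE": "U+0070"},
        "DIA":  {"TAG": "2", "CASE": "lower", "CHAR": "h",
                 "IPA-NUMBER": 404, "UNICODE": "U+0068"},
    },
    "STY": {
        "STEM": {"TAG": "1", "GLYPH": "p", "NAME": "pee"},
        "DIA":  {"TAG": "2", "GLYPH": "h", "NAME": "aitch"},
        "REL":  {"DIA-X-POS": "post", "DIA-Y-POS": "super"},
    },
    "SEM": {
        "SEG": {"TAG": "1", "DOMAIN": "narrow-phonetic", "PLACE": "bilabial",
                "MANNER": "plosive", "VOICING": "voiceless"},
        "VOT": {"TAG": "2", "VALUE": "aspirated"},
    },
    "PRAG": "regulated by International Phonetic Association",
}
```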
3
Knowledge Discovery from Character Encodings
Having laid the foundation of characters as complex constituents, described the relationships between characters and higher level constructs such as fonts and explored the various types of properties applied to characters, we can now turn to a discussion of how these properties can be manipulated and explored in various different ways to realise new linguistic knowledge from the underlying characters themselves. With this analytical and representational mechanism we are able to classify characters from a number of perspectives, including their proximity in the semiotic vector space, in linguistic meaning, structure and context–sensitive rendering, provenance throughout a family of related fonts etc. The details of the nomenclature will no doubt lead to controversial debate, but the architecture of our approach to generic character classification is clear. From the semiotic vector model illustrated in Figure 1 we can derive a number of different types of classification and relation mining strategies for different application domains:
• multi–dimensional classifications based on similarity of any combination of components of the semiotic vectors;
• computation of tree representations, graph representations or matrix representations for visualisation, search, sorting and merging with standard unification grammar operations;
• similarity definition, determined by generalisations (attribute–value structure intersections) over feature structures at various hierarchical levels:
  SYN: UNICODE values (or other font or encoding values such as ASCII, SILDOULOS); CASE, CHAR, CODE values (by further decomposition on Unicode principles); STEM, DIACRITIC values;
  STY: GLYPH, HOR–POS, Y–POS values; GLYPH STATUS, DIACRITIC values;
  SEM: DOMAIN, PLACE, MANNER, VOICING values; SEGMENT, VOICEONSET values;
  META: CHAR; SCHEME;
  PRAG: regulatory criteria and versioning; definitions of orthographic and phonemic coverage of a given language.
The classification task in this context is relatively straightforward, since for most cases the questions will be related to the similarity or differences of a given character or font. In our more formal context, we can not only identify the differences, but quantify them and ground them in a domain of interpretation. This represents a significant advancement over the ad hoc, manual inspection methods which currently characterise the field of comparative linguistic encoding analysis.
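A sketch of the similarity-by-generalisation idea is given below; the recursive intersection and the crude count of shared leaf values are illustrative choices, not the metric actually proposed in the paper.

```python
# Intersection (unifiable common part) of two attribute-value structures,
# with the number of shared leaf values as a simple similarity score.
def avm_intersection(a, b):
    if isinstance(a, dict) and isinstance(b, dict):
        out = {}
        for key in a.keys() & b.keys():
            sub = avm_intersection(a[key], b[key])
            if sub not in ({}, None):
                out[key] = sub
        return out
    return a if a == b else None       # atomic values generalise only if identical

def leaf_count(avm):
    return sum(leaf_count(v) for v in avm.values()) if isinstance(avm, dict) else 1

# Example SEM sub-vectors for two plosives (illustrative data).
p_sem = {"PLACE": "bilabial", "MANNER": "plosive", "VOICING": "voiceless"}
b_sem = {"PLACE": "bilabial", "MANNER": "plosive", "VOICING": "voiced"}
shared = avm_intersection(p_sem, b_sem)
print(shared, leaf_count(shared))      # two shared features: PLACE and MANNER
```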
4
Towards a Character Mapping Ontology
The AVM–based metrics can be displayed in a number of ways for interpretation. For mappings to specific fonts we favour an ontological approach, considering the character encodings used within a single font as a type of name–space, thus enabling mappings to many different encodings. In the simplest case, we could utilise the simple character mapping ontology discussed in [Gibbon et al, 2004], which defined an XML data structure for a given character set, and hence the basis on which different character sets could be compared. More complex comparisons and mappings may be expressed in a character mapping markup language, e.g. CharMapML [Davis and Scherer, 2004]. A fully elaborated character ontology based on the principles outlined in the present discussion requires extensive further discussion in order to achieve a working consensus. As a minimal requirement, a distinction between the SYN, SEM and STY attributes is needed; further distinctions, as in Figure 1, will have variable granularity and be extensible on demand. Assuming coherent definitions of characters as signs with SYN and SEM attributes for a particular character set (of which the IPA code numbers and their definitions as given in [International Phonetic Association, 1999] are a suitable example), the remaining issue is how to map the syntactically and semantically coherent system into other encodings, both into Unicode and into code points for glyph collections in specific fonts. At the present state of the art, there are two options, illustrated here for the IPA:
1. Mapping of IPA code numbers directly into code points (or sets of code points) in specific fonts such as IPAKIEL, SILDOULOS or TIPA.
2. Mapping of IPA code numbers into Unicode, in which codes may be scattered over different code–blocks, with a second layer of mapping into specific fonts.
If these mappings are known, then in principle the properties defined in the ontology can be associated with other encodings and their glyph renderings. But note that with the current Unicode regime, an inverse function is not available: since the basic latin codes are massively ambiguous with regard to their SEM, i.e. user–oriented semantic, properties, there is no simple way of inducing a mapping from glyphs, or even from Unicode numbers, into the semantically oriented encoding. In this respect, Unicode numbers are no different from the codes for glyphs in any arbitrary font. The solution to this problem is to map ontological codes to font code–points with a convention such as name–space assignment. A biunique mapping is created by distinguishing between, say, ‘ipa:basic latin’ (the IPA relevant subset of the basic latin code block) and ‘english alphabet:basic latin’ (the subset containing the 52 upper and lower case characters of the English alphabet) or ‘ascii keyboard:basic latin’ (the subset including digits, some punctuation marks and some cursor control codes). The IPA Unicode mapping is then from the ontological representation to the union of two character blocks: ipa:basic latin ∪ ipa:ipa extensions.
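The small sketch below spells out the name–space idea in code form; the membership tests are illustrative simplifications (only the two IPA-number-to-Unicode mappings attested in Figure 1 are used, and the block ranges are the standard Unicode ranges), not a proposal for the actual ontology.

```python
# IPA number -> Unicode code point (the two entries given in Figure 1).
IPA_TO_UNICODE = {101: 0x0070, 404: 0x0068}

NAMESPACES = {
    # namespace-qualified subsets of Unicode code blocks (illustrative membership tests)
    "ipa:basic_latin":    lambda cp: 0x0061 <= cp <= 0x007A,   # lower-case a-z as used by the IPA
    "ipa:ipa_extensions": lambda cp: 0x0250 <= cp <= 0x02AF,   # the IPA Extensions block
    # a different namespace over (partly) the same code points
    "english_alphabet:basic_latin":
        lambda cp: 0x0041 <= cp <= 0x005A or 0x0061 <= cp <= 0x007A,
}

def in_ipa_union(cp):
    """ipa:basic_latin union ipa:ipa_extensions, as in the text."""
    return NAMESPACES["ipa:basic_latin"](cp) or NAMESPACES["ipa:ipa_extensions"](cp)

print([hex(cp) for cp in IPA_TO_UNICODE.values() if in_ipa_union(cp)])   # both code points
```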
5
Future Directions
For the purpose of defining interoperable text processes over characters, these mappings can be expressed straightforwardly in XML and manipulated at the levels of ontology, unicode, font and glyph properties by an appropriate language such as XSL. The next steps in the present enterprise are:
1. translation of the formal properties illustrated by our IPA example into interoperable XML;
2. definition of inter–level mappings between ontological information and both Unicode blocks and specific fonts;
3. development of an encoding definition language as a tool for specifying the <SYN, STY, SEM> vector and its subvectors;
4. practical characterisation of the properties of legacy documents which use non–standard fonts.
6
Conclusion
The analytical and representational model presented here permits complex data mining operations over linguistic data regardless of its expression in particular character encodings. Furthermore, the approach permits complex linguistic properties to be used coherently as query terms, a dimension not associated with either legacy fonts or Unicode. Using a semiotically–based ontological approach to character encoding, a new dimension to the definition of text processes for search and text classification can be defined. For example, an electronic document which contains
uses of a font such as IPAKIEL or SILIPA can be assigned to the semantic domain of linguistics with a high degree of confidence, and can thus be assumed to have been authored by a linguist with that degree of confidence. This is only the case, of course, if the relation between the font and the relevant ontology has been defined. The same applies to other specialised fonts which relate to other semantic domains, with far–reaching consequences for document classification in the context of the semantic web. With an ontological approach to character description of the kind introduced in the present contribution, generic search tools can be developed with a far higher degree of granularity than is currently available. An important issue for future work will be how the development of ontologies of this kind can be supported by machine learning techniques. Given that characters are the smallest units of text, they are available in sufficient numbers to permit the application of sophisticated induction techniques for this purpose.
References
DAVIS, M. and SCHERER, M. (2004): Character Mapping Markup Language (CharMapML). Unicode Technical Report #22, Unicode Consortium. http://www.unicode.org/reports/tr22/
DÜRST, M., YERGEAU, F., ISHIDA, R., WOLF, M. and TEXIN, T. (2005): Character Model for the World Wide Web 1.0: Fundamentals. World Wide Web Consortium. http://www.w3.org/TR/charmod/
ESLING, J. H. and GAYLORD, H. (1993): Computer Codes for Phonetic Symbols. Journal of the International Phonetic Association, 23(2), pp. 83–97.
GIBBON, D., BOW, C., BIRD, S. and HUGHES, B. (2004): Securing Interpretability: The Case of Ega Language Documentation. Proceedings of the 4th International Conference on Language Resources and Evaluation, Lisbon, 2004. European Language Resources Association: Paris, pp. 1369–1372.
GIBBON, D., MERTINS, I. and MOORE, R. (2000): Handbook of Multimodal and Spoken Language Systems: Resources, Terminology and Product Evaluation. New York etc.: Kluwer Academic Publishers.
HIMMELMANN, N. P. (1998): Documentary and descriptive linguistics. Linguistics, 36, pp. 161–195.
INTERNATIONAL PHONETIC ASSOCIATION (1999): Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet. Cambridge University Press: Cambridge. http://www2.gla.ac.uk/IPA/
PULLUM, G. K. and LADUSAW, W. A. (1986): Phonetic Symbol Guide. The University of Chicago Press: Chicago.
UNICODE CONSORTIUM (2003): The Unicode Standard, Version 4.0. Reading, MA: Addison–Wesley. http://www.unicode.org/versions/Unicode4.0.0/
Applying Collaborative Filtering to Real-life Corporate Data
Miha Grcar, Dunja Mladenič, and Marko Grobelnik
Jozef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
Abstract. In this paper, we present our experience in applying collaborative filtering to real-life corporate data. The quality of collaborative filtering recommendations is highly dependent on the quality of the data used to identify users’ preferences. To understand the influence that highly sparse server-side collected data has on the accuracy of collaborative filtering, we ran a series of experiments in which we used publicly available datasets on the one hand and, on the other hand, a real-life corporate dataset that does not fit the profile of ideal data for collaborative filtering.
1
Introduction and Motivation
The goal of collaborative filtering is to explore a vast collection of items in order to detect those which might be of interest to the active user. In contrast to content-based recommender systems which focus on finding contents that best match the user’s query, collaborative filtering is based on the assumption that similar users have similar preferences. It explores the database of users’ preferences and searches for users that are similar to the active user. The active user’s preferences are then inferred from preferences of the similar users. The content of items is usually ignored. The accuracy of collaborative filtering recommendations is highly dependent on the quality of the users’ preferences database. In this paper we would like to emphasize the differences between applying collaborative filtering to publicly available datasets and, on the other hand, to a dataset derived from real-life corporate Web logs. The latter does not fit the profile of ideal data for collaborative filtering. The rest of this paper is arranged as follows. In Sections 2 and 3 we discuss collaborative filtering algorithms and data quality for collaborative filtering. Our evaluation platform and the three datasets used in our experiments are described in Sections 4 and 5. In Sections 6 and 7 the experimental setting and the evaluation results are presented. The paper concludes with the discussion and some ideas for future work (Section 8).
2
Collaborative Filtering
There are basically two approaches to the implementation of a collaborative filtering algorithm. The first one is the so called “lazy learning” approach
(also known as the memory-based approach) which skips the learning phase. Each time it is about to make a recommendation, it simply explores the database of user-item interactions. The model-based approach, on the other hand, first builds a model out of the user-item interaction database and then uses this model to make recommendations. “Making recommendations” is equivalent to predicting the user’s preferences for unobserved items. The data in the user-item interaction database can be collected either explicitly (explicit ratings) or implicitly (implicit preferences). In the first case the user’s participation is required. The user is asked to explicitly submit his/her rating for the given item. In contrast to this, implicit preferences are inferred from the user’s actions in the context of an item (that is why the term “user-item interaction” is used instead of the word “rating” when referring to users’ preferences in this paper). Data can be collected implicitly either on the client side or on the server side. In the first case the user is bound to use modified client-side software that logs his/her actions. Since we do not want to enforce modified client-side software, this possibility is usually omitted. In the second case the logging is done by a server. In the context of the Web, implicit preferences can be determined from access logs that are automatically maintained by Web servers. Collected data is first preprocessed and arranged into a user-item matrix. Rows represent users and columns represent items. Each matrix element is in general a set of actions that a specific user took in the context of a specific item. In most cases a matrix element is a single number representing either an explicit rating or a rating that was inferred from the user’s actions. Since a user usually does not access every item in the repository, the vector (i.e. the matrix row), representing the user, is missing some/many values. To emphasize this, we use the terms “sparse vector” and “sparse matrix”. The most intuitive and widely used algorithm for collaborative filtering is the so called k-Nearest Neighbors algorithm which is a memory-based approach. Technical details can be found, for example, in Grcar (2004).
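For readers who want the mechanics spelled out, a compact sketch of such a memory-based predictor is given below; it is a simplified illustration with Pearson similarity and mean-centred rating combination, not the evaluation platform used in this paper.

```python
import numpy as np

def predict(R, user, item, k=80):
    """R: users x items rating matrix with np.nan for missing entries."""
    mask_u = ~np.isnan(R[user])
    sims = []
    for v in range(R.shape[0]):
        if v == user or np.isnan(R[v, item]):
            continue
        common = mask_u & ~np.isnan(R[v])       # co-rated items
        if common.sum() < 2:
            continue                            # too little overlap for a reliable similarity
        r = np.corrcoef(R[user, common], R[v, common])[0, 1]
        if not np.isnan(r):
            sims.append((r, v))
    if not sims:
        return np.nan                           # sparsity: no usable neighbours
    sims.sort(reverse=True)
    top = sims[:k]
    num = sum(r * (R[v, item] - np.nanmean(R[v])) for r, v in top)
    den = sum(abs(r) for r, v in top)
    return np.nanmean(R[user]) + num / den if den else np.nan
```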
3
Sparsity Problem and Data Quality for Collaborative Filtering
The fact that we are dealing with a sparse matrix can result in the most concerning problem of collaborative filtering – the so-called sparsity problem. In order to be able to compare two sparse vectors, similarity measures require some values to overlap. What is more, the lower the amount of overlapping values, the lower the reliability of these measures. If we are dealing with a high level of sparsity, we are unable to form reliable neighborhoods. Sparsity is not the only reason for the inaccuracy of recommendations provided by collaborative filtering. If we are dealing with implicit preferences, the ratings are usually inferred from the user-item interactions, as already
Fig. 1. Data characteristics that influence the data quality, and the positioning of the three datasets used in our experiments, according to their properties.
mentioned earlier in the text. Mapping implicit preferences into explicit ratings is a non-trivial task and can result in false mappings. The latter is even more true for server-side collected data in the context of the Web since Web logs contain very limited information. To determine how much time a user was reading a document, we need to compute the difference in time-stamps of two consecutive requests from that user. This, however, does not tell us whether the user was actually reading the document or he/she, for example, went out to lunch, leaving the browser open. There are also other issues with monitoring the activities of Web users, which can be found in Rosenstein (2000). From this brief description of data problems we can conclude that for applying collaborative filtering, explicitly given data with low sparsity are preferred to implicitly collected data with high sparsity. The worst case scenario is having highly sparse data derived from Web logs. However, collecting data in such a manner requires no effort from the users and also, the users are not obliged to use any kind of specialized Web browsing software. This “conflict of interests” is illustrated in Figure 1.
4
Evaluation Platform
To understand the influence that highly sparse server-side collected data has on the accuracy of collaborative filtering, we built an evaluation platform. This platform is a set of modules arranged into a pipeline. The pipeline consists of the following four consecutive steps: (i) importing a user-item matrix (in the case of implicit preferences, data needs to be preprocessed prior to entering the pipeline), (ii) splitting data into a training set and a test set, (iii) setting a collaborative filtering algorithm (in the case of the kNN algorithm we also need to specify a similarity measure) and an evaluation protocol, (iv) making predictions about users’ ratings and collecting evaluation results. In the process of splitting the data into a training set and a test set, we randomly select a certain percentage of users (i.e. rows from the user-item matrix) that serve as our training set. The training set is, in the case of
the kNN algorithm, used to search for neighbors or, in the case of model-based approaches, as a source for building a model. Ratings from each user from the test set are further partitioned into “given” and “hidden” ratings, according to the evaluation protocol. These concepts are discussed in Breese et al. (1998) and are left out of this paper due to lack of space.
5
Data Description
For our experiments we used three distinct datasets. The first dataset was EachMovie (provided by Digital Equipment Corporation) which contains explicit ratings for movies. The service was available for 18 months. The second dataset with explicit ratings was Jester (provided by Goldberg et al.) which contains ratings for jokes, collected over a 4-year period. The third dataset was derived from real-life corporate Web logs. The logs contain accesses to an internal digital library of a fairly large company. The time-span of acquired Web logs is 920 days. In this third case the users’ preferences are implicit and collected on the server side, which implies the worst data quality for collaborative filtering. In contrast to EachMovie and Jester, Web logs first needed to be extensively preprocessed. Raw logs contained over 9.3 million requests. After all the irrelevant requests were removed we were left with only slightly over 20,500 useful requests, which is 0.22% of the initial database size. The next problem emerged from the fact that we needed to map implicit preferences contained in log files, into explicit ratings. As already explained, this is not a trivial task. Claypool et al. (2001) have shown linear correlations between the time spent reading a document and the explicit rating given to that same document by the same user. However, their test-users were using specialized client-side software, which made the collected data more reliable. Despite this fact we decided to take reading times into account when preprocessing Web logs. We plotted reading times inferred from consecutive requests onto a scatter plot shown in Figure 3. The x-axis shows requests ordered by their timestamps, and the y-axis shows the inferred reading time on a logarithmic scale. We can see that the area around 24 hours is very dense. These are the last accesses of a day. People went home and logged in again the next day, which resulted in approximately 24-hour “reading” time. Below the 24-hour line, at approximately 10-hour reading time, a gap is evident. We decided to use this gap to define outliers – accesses above the gap are clearly outliers. We decided to map reading times onto a discrete 3-score scale (scores being 1=“not interesting”, 2=“interesting”, and 3=“very interesting”). Since items were research papers and 20 seconds is merely enough to browse through the abstract, we decided to label documents with reading times below 20 seconds as “not interesting”. Documents with reading times between 20 seconds and 10 minutes were labelled as “interesting” and documents with reading times
Fig. 2. Mapping implicit preferences contained in the corporate Web logs onto a discrete 3-score scale.
Table 1. The comparison between the three datasets.
from 10 minutes to 10 hours were labelled as “very interesting”. We decided to keep the outliers due to the lack of data. In the first scenario they were labelled as “very interesting” and in the second one as “interesting”. Since we had no reliable knowledge about the outliers, the second scenario should have minimized the error made by taking them into account. Table 1 shows the comparison between the three datasets. It is evident that the low number of requests and the somewhat ad hoc mapping onto a discrete scale are not the biggest issues with our corporate dataset. More worrying is the fact that the average number of ratings per item is only 1.22, which indicates extremely poor overlap. Sparsity is consequently very high, 99.93%. The other two datasets are much more promising. The most appropriate is the Jester dataset with very low sparsity, followed by EachMovie with higher sparsity but a still relatively high average number of ratings per item. Also, the latter two contain explicit ratings, which means that they are more reliable than the corporate dataset (see also Figure 1).
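A minimal sketch of the mapping just described, assuming reading times are given in seconds; the 1/2/3/3 versus 1/2/3/2 choice corresponds to the two outlier scenarios.

```python
def rating_from_reading_time(seconds, outlier_score=3):
    """Map an inferred reading time onto the discrete 3-score scale.
    Accesses above the ~10-hour gap are treated as outliers and receive
    'outlier_score' (3 for the 1/2/3/3 variant, 2 for 1/2/3/2)."""
    if seconds < 20:                 # barely enough to skim the abstract
        return 1                     # "not interesting"
    if seconds <= 10 * 60:           # 20 s .. 10 min
        return 2                     # "interesting"
    if seconds <= 10 * 60 * 60:      # 10 min .. 10 h
        return 3                     # "very interesting"
    return outlier_score             # outlier above the gap
```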
6
Experimental Setting
We ran a series of experiments to see how the accuracy of collaborative filtering recommendations differs between the three datasets (from EachMovie
and Jester we considered only 10,000 randomly selected users to speed up the evaluation process). First, we randomly selected 70% of the users as our training set (the remaining 30% were our test set). Ratings from each user in the test set were further partitioned into “given” and “hidden” ratings according to the “all-but-30%” evaluation protocol. The name of the protocol implies that 30% of all the ratings were hidden and the remaining 70% were used to form neighborhoods in the training set. We applied three variants of memory-based collaborative filtering algorithms: (i) k-Nearest Neighbors using the Pearson correlation (kNN Pearson), (ii) k-Nearest Neighbors using the Cosine similarity measure (kNN Cosine), and (iii) the popularity predictor (Popularity). The latter predicts the user’s ratings by simply averaging all the available ratings for the given item. It does not form neighborhoods and it provides each user with the same recommendations. It serves merely as a baseline when evaluating collaborative filtering algorithms (termed “POP” in Breese et al. (1998)). For kNN variants, we used a neighborhood of 80 users (i.e. k=80), as suggested in Goldberg et al. (2001). We decided to evaluate both variants of the corporate dataset (the one where the outliers were labelled as “very interesting”, referred to as “1/2/3/3”, and the one where the outliers were labelled as “interesting”, referred to as “1/2/3/2”). For each dataset-algorithm pair we ran 5 experiments, each time with a different random seed (we also selected a different set of 10,000 users from EachMovie and Jester each time). We decided to use normalized mean absolute error (NMAE) as the accuracy evaluation metric. We first computed NMAE for each user and then we averaged it over all the users (termed “per-user NMAE”) (see Herlocker et al. (2004)).
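The per-user NMAE can be sketched as follows. The normalization used here (dividing each user's MAE by the rating range, cf. Goldberg et al. (2001), Herlocker et al. (2004)) is one common definition and is an assumption about the exact normalization, not a detail stated above.

```python
def per_user_nmae(predictions, truths, r_min, r_max):
    """Per-user normalized MAE: compute MAE for each user separately,
    divide by the rating range, then average over users.

    predictions/truths: dicts mapping user -> {item: rating}."""
    scores = []
    for user, true_ratings in truths.items():
        pred = predictions.get(user, {})
        diffs = [abs(pred[i] - r) for i, r in true_ratings.items() if i in pred]
        if diffs:
            mae = sum(diffs) / len(diffs)
            scores.append(mae / (r_max - r_min))   # normalize by rating range
    return sum(scores) / len(scores) if scores else None
```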
7
Evaluation Results
Our evaluation results are shown in Figure 3. The difference between applying kNN Pearson and kNN Cosine to EachMovie is statistically insignificant (we used a two-tailed paired Student's t-test to determine whether the differences in results are statistically significant). However, both significantly outperform Popularity. In the case of Jester, which has the smallest degree of sparsity, kNN Pearson slightly, yet significantly, outperforms kNN Cosine. Again, both significantly outperform Popularity. The evaluation results on the corporate datasets (more accurately, two variants of the same dataset) show that the predictions are less accurate and that the NMAE value is relatively unstable (hence the large error bars showing standard deviations of the NMAE values). The main reason for this is the little or no overlap between values (i.e. the extremely high sparsity), which makes it impossible to compute many of the predictions. In the first scenario (i.e. with the 1/2/3/3 dataset) we can see that the differences in NMAE between kNN Pearson, kNN Cosine and Popularity are all statistically
Fig. 3. The evaluation results.
insignificant. In the second scenario (i.e. with the 1/2/3/2 dataset), however, kNN Pearson outperforms kNN Cosine and Popularity, while the accuracies of kNN Cosine and Popularity are not significantly different.
8
Discussion and Future Work
What is evident from the evaluation results is that the corporate dataset does not contain many overlapping values, and that this is our biggest problem. Before we can really evaluate collaborative filtering algorithms on the given corporate dataset, we will need to reduce its sparsity. One idea is to apply LSI (latent semantic indexing) (Deerwester et al. (1990)) or pLSI (probabilistic latent semantic indexing) (Hofmann (1999)) to reduce the dimensionality of the user-item matrix, which consequently reduces sparsity. Another idea, which we believe is even more promising in our context, is to incorporate the textual contents of the items. Some research has already been done on how textual contents can be used to reduce sparsity and improve the accuracy of collaborative filtering (Melville et al. (2002)). Luckily, we are able to obtain the textual contents for the given corporate dataset. It is also evident that the mapping of implicit to explicit ratings has a great influence on the evaluation results. We can see that going from Corporate 1/2/3/3 to Corporate 1/2/3/2 is fatal for kNN Pearson (in contrast to kNN Cosine). This needs to be investigated in greater depth; we do not wish to draw conclusions until we manage to reduce the sparsity and consequently also the standard deviations of the NMAE values. It is also interesting that the Cosine similarity measure works just as well as the Pearson correlation on EachMovie and Jester; earlier studies reported much poorer performance of the Cosine similarity measure (Breese et al. (1998)). As a side-product we noticed that the true value of collaborative filtering (in general) only shows when computing NMAE over some top percentage
of eccentric users. We defined eccentricity intuitively as the MAE (mean absolute error) over the overlapping ratings between “the average user” and the user in question (a greater MAE yields greater eccentricity). The average user was defined by averaging the ratings for each particular item. Our preliminary results show that incorporating the notion of eccentricity can give the more sophisticated algorithms a fairer trial. In the near future, we will define an accuracy measure that weights per-user NMAE according to the user's eccentricity, and include it in our evaluation platform.
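A minimal sketch of this eccentricity measure, under the definition given above (the dictionary-based data layout is an assumption for illustration):

```python
def eccentricity(user_ratings, all_ratings):
    """Eccentricity of a user: MAE between the user's ratings and the
    'average user' (per-item mean rating), computed over the items the
    user has rated. Larger values mean a more eccentric user."""
    # Build the average user: mean rating per item over all users.
    sums, counts = {}, {}
    for ratings in all_ratings.values():
        for item, r in ratings.items():
            sums[item] = sums.get(item, 0.0) + r
            counts[item] = counts.get(item, 0) + 1
    avg_user = {item: sums[item] / counts[item] for item in sums}
    overlap = [abs(r - avg_user[item]) for item, r in user_ratings.items()
               if item in avg_user]
    return sum(overlap) / len(overlap) if overlap else 0.0
```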
Acknowledgements This work was supported by the 6FP IP SEKT (2004–2006) (IST-1-506826IP) and the Slovenian Ministry of Education, Science and Sport. The EachMovie dataset was provided by Digital Equipment Corporation. The Jester dataset is courtesy of Ken Goldberg et al. The authors would also like to thank Tanja Brajnik for her help.
References
BREESE, J.S., HECKERMAN, D., and KADIE, C. (1998): Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In: Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence.
CLAYPOOL, M., LE, P., WASEDA, M., and BROWN, D. (2001): Implicit Interest Indicators. In: Proceedings of IUI'01.
DEERWESTER, S., DUMAIS, S.T., and HARSHMAN, R. (1990): Indexing by Latent Semantic Analysis. In: Journal of the Society for Information Science, Vol. 41, No. 6, 391–407.
GOLDBERG, K., ROEDER, T., GUPTA, D., and PERKINS, C. (2001): Eigentaste: A Constant Time Collaborative Filtering Algorithm. In: Information Retrieval, No. 4, 133–151.
GRCAR, M. (2004): User Profiling: Collaborative Filtering. In: Proceedings of SIKDD 2004 at Multiconference IS 2004, 75–78.
HERLOCKER, J.L., KONSTAN, J.A., TERVEEN, L.G., and RIEDL, J.T. (2004): Evaluating Collaborative Filtering Recommender Systems. In: ACM Transactions on Information Systems, Vol. 22, No. 1, 5–53.
HOFMANN, T. (1999): Probabilistic Latent Semantic Analysis. In: Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence.
MELVILLE, P., MOONEY, R.J., and NAGARAJAN, R. (2002): Content-boosted Collaborative Filtering for Improved Recommendations. In: Proceedings of the 18th National Conference on Artificial Intelligence, 187–192.
RESNICK, P., IACOVOU, N., SUCHAK, M., BERGSTROM, P., and RIEDL, J. (1994): GroupLens: An Open Architecture for Collaborative Filtering of Netnews. In: Proceedings of CSCW'94, 175–186.
ROSENSTEIN, M. (2000): What is Actually Taking Place on Web Sites: E-Commerce Lessons from Web Server Logs. In: Proceedings of EC'00.
Quantitative Text Typology: The Impact of Sentence Length
Emmerich Kelih¹, Peter Grzybek¹, Gordana Antić², and Ernst Stadlober²
¹ Department for Slavic Studies, University of Graz, A-8010 Graz, Merangasse 70, Austria
² Department for Statistics, Technical University Graz, A-8010 Graz, Steyrergasse 17/IV, Austria
Abstract. This study focuses on the contribution of sentence length to a quantitative text typology. To this end, 333 Slovenian texts are analyzed with regard to their sentence length. By way of multivariate discriminant analyses (MDA) it is shown that a text typology based on sentence length alone is indeed possible; this typology, however, does not coincide with traditional text classifications such as text sorts or functional styles. Rather, a new categorization into specific discourse types seems reasonable.
1
Sentence Length and Text Classification: Methodological Remarks
Text research based on quantitative methods is characterized by two major spheres of interest: (1) quantitative text classification in general (cf., e.g., Alekseev 1988), and (2) authorship discrimination and the attribution of disputed authorship in particular (cf., e.g., Smith 1983). Both lines of research are closely interrelated and share a common interest in identifying and quantifying specific text characteristics, with sentence length obviously playing a crucial role. However, in most approaches sentence length is combined with other quantitative measures such as the proportion of particular parts of speech, word length (usually measured by the number of letters per word), the proportion of specific prepositions, etc. (cf. Karlgren/Cutting 1994, Copeck et al. 2000). This causes a major problem, since the specific amount of information that sentence length alone may provide for questions of text classification remains unclear. The present study starts at this particular point; the objective is an empirical analysis based on a corpus of 333 Slovenian texts. From a methodological perspective, the procedure includes the following steps before MDA is applied to quantitative text classification:
a. the theoretical discussion of qualitative approaches to text classification, mainly of research in the realm of text sorts and functional styles, and the relevance of these classifications for empirical studies;
b. the elaboration of an operational definition of ‘sentence’ as well as of a consistent measuring unit;
c. the derivation of adequate statistical characteristics from the frequency distribution of sentence lengths, in addition to average sentence length.
1.1 Definition of ‘word’ and ‘sentence’
In this study, ‘sentences’ are considered to be constitutive units of texts, separated from each other by punctuation marks; by way of a modification of the usual standards, the definition of sentence used in this study is as follows: Definition 1. The punctuation marks [.], [. . . ], [?], and [!] function as sentence borders, unless these characters are followed by a lower-case letter in the initial position of the subsequent word. This definition is not claimed to be of general linguistic validity; rather, it turns out to be adequate for our corpus of pre-processed texts, taken from the Graz Quantitative Text Analysis Server (QuanTAS).¹ Now, as far as the measuring unit of sentence length is concerned, the number of clauses is often claimed to be adequate, since clauses are direct constituents of sentences. Yet, in our study, the number of words (tokens) per sentence is preferred, a word being defined as an orthographic-phonetic unit. Apart from the fact that we thus have very operational definitions of units at our disposal, control studies including alternative definitions of both ‘word’ and ‘sentence’ have shown that both definitions are rather stable, and that a change of definition results in shifts of a systematic nature (Antić et al. 2005, Kelih/Grzybek 2005).
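A small sketch of a sentence-length extractor in the spirit of Definition 1: a mark closes a sentence when the next word begins with a capital letter (or the text ends), and length is counted in word tokens. The regular expression is an illustrative approximation, not the tagging actually used on the QuanTAS corpus.

```python
import re

# '.', '...', '?' or '!' followed by whitespace and a capital letter (or the
# end of the text) is treated as a sentence border.
_BORDER = re.compile(r'(?:\.\.\.|[.?!])(?=\s+[A-ZČŠŽ]|\s*$)')

def sentence_lengths(text):
    """Return the length of each sentence, measured in words (tokens)."""
    sentences = [s.strip() for s in _BORDER.split(text) if s.strip()]
    return [len(s.split()) for s in sentences]

# Example: sentence_lengths("To je prvi stavek. In to je drugi!") -> [4, 4]
```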
1.2 Text Basis, Methods of Classification, and Statistical Characteristics Applied
The 333 Slovenian texts under study have not been chosen arbitrarily; rather, they are intended to cover the broad spectrum of possible genres and thus to be representative of the textual world in its totality. Therefore, the texts were taken from the above-mentioned corpus, in which each text has been submitted to a qualitative a priori classification, according to which each text is attributed to a particular text sort; the theoretical distinction of text sorts is based on specific communicative-situational factors (cf. Adamzik 2000). For the present study, all text sorts have additionally been attributed to functional styles: as opposed to text sorts, the theory of functional styles (cf. Ohnheiser 1999) refers to rather general communicative characteristics. The degree of abstractness differs greatly between text sorts and functional styles: whereas contemporary research on text sorts distinguishes about 4,000 different text sorts, functional styles are usually confined to about six to eight. Any kind of qualitative generalization necessarily results
¹ This database contains ca. 5,000 texts from Croatian, Slovenian, and Russian; all texts are pre-processed and specifically tagged; this procedure guarantees a unified approach. – Cf. http://www-gewi.uni-graz.at/quanta
| Functional style | Text sort | m1 | s | h | S | total |
|---|---|---|---|---|---|---|
| Everyday style | Private letters | 15.40 | 10.08 | 3.79 | 7.55 | 31 |
| Administrative style | Recipes | 10.09 | 4.39 | 3.05 | 3.40 | 31 |
| Administrative style | Open letters | 26.07 | 14.25 | 4.63 | 15.66 | 29 |
| Science | Humanities | 21.53 | 11.71 | 4.55 | 22.31 | 46 |
| Science | Natural sciences | 20.88 | 11.10 | 3.75 | 13.55 | 32 |
| Journalistic style | Articles | 23.46 | 11.18 | 3.76 | 8.27 | 43 |
| Journalistic style | Readers' letters | 23.75 | 13.01 | 3.98 | 21.16 | 30 |
| Literary prose | Novels | 14.24 | 8.48 | 4.51 | 4.32 | 49 |
| Drama | Dramatic texts | 6.48 | 5.38 | 3.60 | 13.85 | 42 |

Table 1. Text sorts and functional styles: some statistical characteristics
in some kind of uncertainty relation and may lead to subjective decisions. On the one hand, such subjective decisions may be submitted to empirical testing, attempting to provide some intersubjectively approved agreements (cf. Grzybek/Kelih 2005). On the other hand, one may investigate to what extent qualitatively obtained classifications, taken as mere tentative a priori classifications, bear a closer empirical examination. This paper follows the second direction: our aim is to study (a) to what degree a classification of texts can be achieved on the basis of sentence length (or, in other words, to what degree sentence length may contribute to a classification of texts), and (b) to what extent qualitative classifications involving either (b1) text sorts or (b2) functional styles correspond to the empirical findings. Table 1 represents the involved spectrum of text sorts and functional styles, along with a number of statistical characteristics described below. As was mentioned above, in this study each individual text is treated as a separate object: for each individual text, sentence lengths are measured by the number of words per sentence. Thus, a frequency distribution of x-word sentences is obtained. From this frequency distribution, a set of statistical variables can be derived, such as: the mean ($\bar{x} = m_1$), the variance ($s^2 = m_2$), the standard deviation ($s$), the entropy ($h = -\sum p \cdot \operatorname{ld} p$), the first four central moments ($m_1, m_2, m_3, m_4$) and quotients such as the coefficient of variation ($v = \bar{x}/s$), Ord's $I = m_2/m_1$, Ord's $S = m_3/m_2$, and many others. This pool of variables – ca. 35 variables have been derived for our analyses (cf. Grzybek et al. 2005) – serves as a basis for MDA. Of course, the aim is to use only a minimum of these variables;² therefore, the 35 variables are tested for their relevance in text classification in a preliminary study. As a first result, it turns out that there are four dominant characteristics, which are important
² The corresponding procedures have previously proven to be efficient in the authors' word length studies, and they are applied here to sentence length studies.
Fig. 1. Results of Discriminant Analyses: (a) Text Sorts, (b) Functional Styles
for all subsequent steps: (i) average sentence length $\bar{x}$, (ii) standard deviation $s$, (iii) Ord's criterion $S$, and (iv) entropy $h$. Notwithstanding the treatment of individual texts, Table 1 offers a general orientation, representing the values of these statistical characteristics with regard to the nine text sorts. Obviously, there are tremendous differences between the various functional styles and, within each functional style, between different text sorts. On the one hand, these observations imply a clear warning against any corpus-based approach that does not pay due attention to genre diversity. On the other hand, they give reason to doubt the adequacy of merely qualitative classifications.
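A small sketch of how the four retained characteristics can be computed from a text's sentence lengths; the entropy uses the binary logarithm (ld), and Ord's S is taken as the ratio of the third to the second central moment, as defined above.

```python
from collections import Counter
from math import log2

def sentence_length_profile(lengths):
    """Compute mean, standard deviation, Ord's S and entropy h from the
    frequency distribution of sentence lengths (in words) of one text."""
    n = len(lengths)
    mean = sum(lengths) / n
    m2 = sum((x - mean) ** 2 for x in lengths) / n   # variance (2nd central moment)
    m3 = sum((x - mean) ** 3 for x in lengths) / n   # 3rd central moment
    s = m2 ** 0.5
    ord_s = m3 / m2 if m2 else 0.0
    freq = Counter(lengths)
    h = -sum((c / n) * log2(c / n) for c in freq.values())   # entropy, ld = log2
    return {"mean": mean, "s": s, "OrdS": ord_s, "h": h}

# Example: sentence_length_profile([12, 7, 19, 4, 9])
```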
2 Sentence Length and Discriminant Analyses
2.1 Submitting the Qualitative Classifications to Empirical Testing
On the basis of the above discussion, the question arises to what extent the tentative a priori attribution of individual texts (a) to text sorts and (b) to functional styles is corroborated by sentence length analyses. The results of MDA show that only 62.50% of the texts are correctly attributed to one of the nine text sorts; likewise, only 66.40% of the texts are correctly attributed to one of the six functional styles – cf. Fig. 1. This result indicates that neither text sorts nor functional styles are adequate categories for text classifications based on sentence length. The evidently necessary new classification should start with text sorts, since they are more specific than functional styles. Given the number of nine text sorts, a first step in this direction should include the stepwise elimination of individual text sorts.
| Text group | Scientific texts | Journalistic texts | total |
|---|---|---|---|
| Scientific texts | 65 | 13 | 78 |
| Journalistic texts | 19 | 83 | 102 |

Table 2. Attribution of Scientific and Journalistic Texts (rows: actual text group, columns: group membership assigned by the discriminant analysis)
2.2
Stepwise Reduction: Temporary Elimination of Text Sorts
An inspection of Fig. 1(a) shows that dramatic texts and cooking recipes cover relatively homogeneous areas in our sample of 333 texts. This is a strong argument in favor of assuming sentence length to be a good discriminating factor for these two text sorts. Consequently, temporarily eliminating these two text sorts from our analyses, we can gain detailed insight into the impact of sentence length on the remaining seven text sorts (private letters, scientific texts from the humanities and natural sciences, open letters, journalistic articles, readers' letters, and novels). As the MDA of these remaining 266 texts shows, an even smaller portion of only 51.9% is correctly classified. However, 98% (48 of 49) of our novel texts are correctly classified, followed by the private letters; of the latter, 64.5% are correctly classified, but 25.8% are misclassified as novels. Obviously, novel texts and private letters seem to have a similar form with regard to their sentence length; therefore, these two text sorts shall be temporarily eliminated in the next step.
2.3
Stepwise Reduction: Formation of New Text Groups
The remaining five text sorts (humanities, natural sciences, open letters, readers' letters, and articles) comprise 180 texts. Discriminant analyses with these five text sorts lead to the poor result of 40% correct classifications. Yet, the result obtained yields an interesting side-effect, since all text sorts combine into two major groups: (i) scientific texts, and (ii) open letters and letters. Attributing the readers' letters – which split almost evenly between these two groups – to the major group of journalistic texts, we thus obtain two major text groups: 78 scientific texts and 102 journalistic texts. A discriminant analysis with these two groups results in a relatively satisfying percentage of 82.20% correct classifications (cf. Table 2). Since the consecutive elimination of text sorts (recipes, dramatic texts, novel texts, private letters) has revealed that the remaining five text sorts form two global text groups, the next step should include the stepwise re-introduction of all temporarily eliminated text sorts.
| Text group | CR | ST | PL | JT | NT | DT | total |
|---|---|---|---|---|---|---|---|
| Cooking Recipes (CR) | 30 | 0 | 0 | 0 | 0 | 1 | 31 |
| Scientific Texts (ST) | 0 | 58 | 0 | 11 | 9 | 0 | 78 |
| Private Letters (PL) | 3 | 1 | 16 | 2 | 8 | 1 | 31 |
| Journalistic Texts (JT) | 0 | 16 | 7 | 71 | 8 | 0 | 102 |
| Novel Texts (NT) | 0 | 2 | 0 | 1 | 46 | 0 | 49 |
| Dramatic Texts (DT) | 0 | 0 | 0 | 0 | 2 | 40 | 42 |

Table 3. Six Text Groups
3
Re-integration: Towards a New Text Typology
Re-introducing the previously eliminated text sorts, particular attention has to be paid to the degree of correct classification, the percentage of 82.2% obtained above representing some kind of benchmark. In detail, the following percentages were obtained:
1. re-introducing the cooking recipes (three major text groups, n = 211) results in 82.5% correct classifications;
2. additionally re-integrating the dramatic texts (four major text groups, n = 253) even increases the percentage of correct classifications to 86.60%;
3. also re-integrating the novel texts (five major text groups, n = 302) still results in 82.5% correct classifications;
4. re-integrating the last missing text sort (private letters) finally yields 78.8% correct classifications of n = 333 texts (cf. Table 3).
The synoptical survey of our new classification allows for a number of qualitative interpretations. Obviously, sentence length is a good discriminant for dramatic texts, which probably represent oral speech in general (here, in its fictional form). The same holds true for the very homogeneous group of cooking recipes, which most likely represent technical language in general. Sentence length also turns out to be a good discriminating factor for novel texts, with a percentage of ca. 94% correct classifications. Scientific texts and journalistic texts form two major groups which are classified clearly worse than in the results above (74.35% and 69.61%, respectively); however, the majority of mis-classifications concern attributions to the opposite group rather than transitions to any other group. Compared to this, private letters – which were re-introduced in the last step – represent a relatively heterogeneous group: only 51.61% are correctly classified, with 25.81% being attributed to the group of novel texts.
Fig. 2. Discriminant Analysis of Six Text Groups
4
Summary
The present study is a first systematic approach to the problem of text classification on the basis of sentence length as a decisive discriminating factor. The following major results were obtained:
a. Taking the concept of functional styles as a classificatory basis, sentence length turns out not to be feasible for discrimination. However, this result does not depreciate sentence length as an important stylistic factor; rather, functional styles turn out to be a socio-linguistic rather than a stylistic category. The same conclusion has to be drawn with regard to text sorts.
b. With regard to our 333 Slovenian texts, four statistical characteristics turn out to be relevant in discriminant analyses based on sentence length as the only discriminating factor: mean sentence length ($\bar{x}$), standard deviation ($s$), Ord's $S$, and entropy $h$; at least these variables should be taken into account in future studies, though it may well turn out that other variables play a more decisive role.
c. Our discriminant analyses result in a new text typology involving six major text groups: in this typology, sentence length has a strong discriminating power particularly for dramatic texts (oral discourse), cooking recipes (technical discourse), and novel texts (everyday narration); with certain reservations, this holds true for scientific and journalistic discourse, too, with some transitions between these two discourse types. Only private letters represent a relatively heterogeneous group which cannot clearly be attributed to one of the major discourse types.
Given these findings, it will be tempting to compare the results obtained to those previously gained on the basis of word length as the discriminating variable. On the one hand, this will provide insight into the power of two (or
more) combined linguistic variables for questions of text classification; it will be particularly interesting to see to what extent classifications obtained on the basis of other variables (or specific combinations of variables) lead to identical or different results. Finally, insight will be gained into the stylistic structure of specific texts and discourse types in a more general sense.
References
ADAMZIK, K. (Ed.) (2000): Textsorten. Reflexionen und Analysen. Stauffenburg, Tübingen.
ALEKSEEV, P.M. (1988): Kvantitativnaja lingvistika teksta. LGU, Leningrad.
ANTIĆ, G., KELIH, E. and GRZYBEK, P. (2005): Zero-syllable Words in Determining Word Length. In: P. Grzybek (Ed.): Contributions to the Science of Language. Word Length Studies and Related Issues. Kluwer, Dordrecht, 117–157.
COPECK, T., BARKER, K., DELISLE, S. and SZPAKOWICZ, St. (2000): Automating the Measurement of Linguistic Features to Help Classify Texts as Technical. In: TALN-2000, Actes de la 7e Conférence Annuelle sur le Traitement Automatique des Langues Naturelle, Lausanne, Oct. 2000, 101–110.
GRZYBEK, P. (Ed.) (2005): Contributions to the Science of Language. Word Length Studies and Related Issues. Kluwer, Dordrecht.
GRZYBEK, P. and KELIH, E. (2005): Textforschung: Empirisch! In: J. Banke, A. Schröter and B. Dumont (Eds.): Textsortenforschungen. Leipzig. [In print]
GRZYBEK, P., STADLOBER, E., KELIH, E., and ANTIĆ, G. (2005): Quantitative Text Typology: The Impact of Word Length. In: C. Weihs and W. Gaul (Eds.), Classification – The Ubiquitous Challenge. Springer, Heidelberg, 53–64.
KARLGREN, J. and CUTTING, D. (1994): Recognizing text genres with simple metrics using discriminant analysis. In: M. Nagao (Ed.): Proceedings of COLING 94, 1071–1075.
KELIH, E., ANTIĆ, G., GRZYBEK, P. and STADLOBER, E. (2005): Classification of Author and/or Genre? The Impact of Word Length. In: C. Weihs and W. Gaul (Eds.), Classification – The Ubiquitous Challenge. Springer, Heidelberg, 498–505.
KELIH, E. and GRZYBEK, P. (2005): Satzlängen: Definitionen, Häufigkeiten, Modelle. In: A. Mehler (Ed.), Quantitative Methoden in Computerlinguistik und Sprachtechnologie. [= Special Issue of: LDV-Forum. Zeitschrift für Computerlinguistik und Sprachtechnologie / Journal for Computational Linguistics and Language Technology] [In print]
OHNHEISER, I. (1999): Funktionale Stilistik. In: H. Jachnow (Ed.): Handbuch der sprachwissenschaftlichen Russistik und ihrer Grenzdisziplinen. Harrassowitz, Wiesbaden, 660–686.
SMITH, M.W.A. (1983): Recent Experience and New Developments of Methods for the Determination of Authorship. Bulletin of the Association for Literary and Linguistic Computing, 11(3), 73–82.
A Hybrid Machine Learning Approach for Information Extraction from Free Text
Günter Neumann
LT-Lab, DFKI Saarbrücken, D-66123 Saarbrücken, Germany
Abstract. We present a hybrid machine learning approach for information extraction from unstructured documents by integrating a learned classifier based on Maximum Entropy Modeling (MEM) and a classifier based on our work on Data-Oriented Parsing (DOP). The hybrid behavior is achieved through a voting mechanism applied by an iterative tag-insertion algorithm. We have tested the method on a corpus of German newspaper articles about company turnover, and achieved 85.2% F-measure using the hybrid approach, compared to 79.3% for MEM and 51.9% for DOP when running them in isolation.
1
Introduction
In this paper, we investigate how relatively standard ML techniques can be used for IE from free texts. In particular, we present a hybrid ML approach in which a standard Maximum Entropy Modeling (MEM) based classifier is combined with a tree-based classifier built on Data-Oriented Parsing (DOP), a widely used paradigm for probabilistic parsing. The major motivations for the work presented in this paper are 1) to explore, for the first time, the benefits of combining these two leading ML paradigms in NLP for information extraction, and 2) to exploit ML-IE approaches for German documents. This issue is of interest because, so far, nearly all proposed ML-IE approaches consider English documents (in fact, we are not aware of any results reported for German using an ML-IE approach on a comparable IE task). However, since German is a language with linguistic phenomena that differ in important ways from English (e.g., rich morphology, free word order, word compounds), one cannot simply transfer the performance results of ML-IE approaches obtained for English to German. The core idea of a supervised ML-IE approach from free text is simple (see also fig. 1): given a corpus of raw documents annotated only with the relevant slot-tags from the template specification, enrich the corpus with linguistic features automatically extracted by the Linguistic Text Engine. Pass this annotated corpus to the Machine Learning Engine, which computes (through the application of its core learning methods) a set of template
Thanks to Volker Morbach for his great help during the implementation and evaluation phase of the project. This work was supported by a research grant from BMBF to the DFKI project Quetal (FKZ: 01 IW C02).
Fig. 1. Blueprint of the Machine Learning perspective of Information Extraction.
specific annotation functions, i.e., mappings from linguistic features to appropriate template slots. These learned mappings are then used to automatically annotate new documents – pre-processed by the same Linguistic Text Engine, of course – with template-specific information. We follow the standard view of IE “as classification”, in that we classify each token as belonging to one of the slot-tags or not. In particular, we want to explore the effect of the linguistic feature extraction on the performance of our ML-IE approach. The linguistic features are computed by our system Smes, a robust wide-coverage German text parser, cf. Neumann and Piskorski (2002). The features can roughly be classified into lexical (e.g., token class, stem, PoS, compounds) and syntactic (e.g., verb groups (VG), nominal phrases (NP), named entities (NE)). In order to explore the effects of features from different levels, classification is performed as an incremental tagging algorithm on the basis of the following two-level learning approach: 1) Token level (cf. sec. 2): each token is individually tagged with one of the slot-tags using only lexical features. 2) Token group level (cf. sec. 3): a sequence of tokens is recognized and tagged with one of the slot-tags by applying a set of tree patterns. Both levels are learned independently of each other, but they are combined in the application phase, which is why we call our ML-IE approach hybrid.
2
MEM for Exploiting the Token Level
The language model for the token level is obtained using Maximum Entropy Modeling (MEM). The major advantages of MEM for IE from unstructured texts are 1) that one can easily combine features from different linguistic
levels, and 2) that the estimation of the probabilities is based on the principle of making as few assumptions as possible, other than those imposed by the constraints on feature combinations and values, cf. Pietra et al. (1997). The probability distribution that satisfies these properties is the one with the highest entropy, and has the form

$$p(a \mid b) = \frac{1}{Z(b)} \prod_{j=1}^{n} \alpha_j^{f_j(a,b)} \quad \text{with} \quad Z(b) = \sum_{a \in A} \prod_{j=1}^{n} \alpha_j^{f_j(a,b)} \qquad (1)$$

where $a$ refers to the outcome (or tag) and $A$ to the tag set, $b$ refers to the history (or context), and $Z(b)$ is a normalization function. Features are the means through which an experimenter feeds problem-specific information to MEM ($n$ lexical features in our case), all of them bearing the form

$$f_j(a,b) = \begin{cases} 1 & \text{if } a = a' \text{ and } cp(b) = \text{true} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where $cp$ stands for a contextual predicate, which considers all information available for the tokens surrounding the given token $t_0$ (our context window is $[t_{-2}, t_{-1}, t_0, t_{+1}, t_{+2}]$) and all information available for $t_0$ itself. We use the following lexical feature set: token, token class, word stem, and PoS. The task of the MEM training algorithm is to compute the values of the feature weights $\alpha_j$. We use Generalized Iterative Scaling, a widely used estimation procedure, cf. Darroch and Ratcliff (1972).
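A sketch of how contextual-predicate features over the [t−2, …, t+2] window could be generated; the feature naming scheme is an illustrative assumption, not the one used by the system described here.

```python
def token_features(tokens, i, lexical):
    """Build contextual-predicate features for token position i.

    tokens  - list of token strings of one document
    lexical - function token -> dict of lexical attributes
              (e.g. {'tok': ..., 'class': ..., 'stem': ..., 'pos': ...})
    Each feature is a string 'offset:attribute=value'; a MEM learner then
    associates one weight alpha_j with every (feature, tag) pair."""
    features = []
    for offset in (-2, -1, 0, 1, 2):
        j = i + offset
        if 0 <= j < len(tokens):
            for name, value in lexical(tokens[j]).items():
                features.append(f"{offset}:{name}={value}")
        else:
            features.append(f"{offset}:PAD")      # position outside the document
    return features
```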
3
DOP for Exploiting the Token Chain Level
Data-Oriented Parsing (DOP) is a probabilistic approach to parsing that maintains a large corpus of analyses of previously occurring sentences, cf. Bod et al. (2003). New input is parsed by combining tree fragments from the corpus; the frequencies of these fragments are used to estimate which analysis is the most probable one. So far, DOP has mostly been applied to syntactic parse trees. In this paper, we show how DOP can be applied to IE. The starting point is the XML tree of an annotated template instance. Such a template tree t is extracted from an annotated document by labeling the root node with the domain type (see fig. 2) and the immediate child nodes with the slot-tags (called slot-nodes). Each slot-node s is the root of a sub-tree (called slot-tree and denoted as $t_s$) whose yield consists of the text fragment α spanned by s. All other nodes of $t_s$ result from the linguistic analysis of α performed by Smes. Note that, in contrast to the token level, all information computed by Smes is used at this level, i.e., in addition to the lexical features we also make use of the named entity (NE) and phrasal levels. Each template tree t obtained from the training corpus is generalized by cutting off certain sub-trees from t's slot-trees, which is basically performed
Fig. 2. Example of the tree generalization using DOP.
by deleting the link $n_i \to n_j$ between a non-terminal node $n_i$ and its child node $n_j$ and by removing the complete subtree rooted at $n_j$ (cf. the lower left tree in fig. 2). The resulting tree $t'$ is more general than $t$, since it has fewer terminal as well as non-terminal nodes than $t$ but otherwise respects the structure of $t$. All generalized trees are further processed by extracting all slot-trees. Finally, each slot-tree is assigned a probability $p(t_s)$ such that $\sum_{t_i:\,\mathrm{root}(t_i)=s} p(t_i) = 1$. The tree decomposition operation is linguistically guided by the head feature principle, which requires that the head features of a phrasal sign be shared with its head daughter, cf. Neumann (2003). For example, the head daughter of an NP is its noun N. Using this notion, tree decomposition traverses each slot-tree from the top downwards, cutting off the non-head daughters with the restriction that if the root label of a non-head daughter $d$ denotes a token class or a named entity, then we retain the root node of $d$ but cut off $d$'s sub-trees.
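The pruning step can be sketched as a small recursive procedure. The Node representation and the head-selection bookkeeping below are simplifications assumed for illustration, not the system's actual data structures.

```python
class Node:
    def __init__(self, label, children=None, head=None, kind="phrase"):
        self.label = label                # e.g. "NP", "N", "NE", "TOKEN_CLASS"
        self.children = children or []    # ordered daughters
        self.head = head                  # index of the head daughter, or None
        self.kind = kind                  # "phrase", "token_class", "ne", ...

def generalize(node):
    """Head-guided generalization of a slot-tree: keep the head daughter and
    recurse into it; for non-head daughters keep only their root node if it
    is a token class or a named entity, otherwise cut them off entirely."""
    if not node.children:
        return Node(node.label, kind=node.kind)
    kept, new_head = [], None
    for i, child in enumerate(node.children):
        if i == node.head:
            new_head = len(kept)
            kept.append(generalize(child))                    # follow the head
        elif child.kind in ("token_class", "ne"):
            kept.append(Node(child.label, kind=child.kind))   # keep root only
        # other non-head daughters are removed together with their subtrees
    return Node(node.label, kept, head=new_head, kind=node.kind)
```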
4
Hybrid Iterative Tag Insertion
The application phase is realized as a tag-insertion method that is iteratively applied by a central search control to a new document until no new slot-tag can be inserted (the slot unknown is used to initialize the tag sequence).
Fig. 3. The Hybrid Iterative Tag Insertion approach.
The slot-tags are predicted by a set of operators. Each operator corresponds to one of the learning algorithms, viz. MEM-op and DOP-op, see fig. 3. The hybrid property of the approach is obtained such that in each iteration all operators are applied independently of each other on the actual tagged sequence. This results in a set of operator-specific new tagged sequences, each having an individual weight. The N-best new tagged sequences are passed to the next iteration step, i.e., we perform a beam search with beam size N. The following common weighting scheme is used by each operator $op_k$:

$$w^{(j+1)} = \begin{cases} \dfrac{w^{(j)} \cdot \#p^{(j)} + f_k \cdot \Delta w \cdot \bigl(\#p^{(j+1)} - \#p^{(j)}\bigr)}{\#p^{(j+1)}} & \text{if } \#p^{(j+1)} > \#p^{(j)} \\[6pt] w^{(j)} & \text{if } \#p^{(j+1)} = \#p^{(j)} \end{cases} \qquad (3)$$
where $w^{(i)}$ denotes the weight of the tagged sequence determined in iteration step $i$ (setting $w^{(0)} = 0$ enforces $0 \le w^{(i)} \le 1$), and $\#p^{(i)}$ is the number of fixed tag positions after iteration $i$ (by fixed we mean that after the tag unknown has been mapped to a slot-tag $s$, $s$ cannot be changed in later iterations). $\Delta w$ is a feature weight, and $f_k$ an operator-specific performance number (both having values between 0 and 1), which is determined by applying $op_k$ with different parameter settings on a seen subset of the training corpus and recording the different values of the F-measures obtained. An operator $op_k$ applies the trained model of a learner to a new, linguistically preprocessed token sequence and computes predictions for new slot-tags. Since application can be done in different modes, each operator $op_k$ fixes different parameters. For MEM-op, we define specific instances of it depending on the search direction (e.g., leftmost not yet fixed tag unknown, rightmost unknown, or best unknown), the use of a lexicon, the use of previously made predictions, or the maximum number of iterations, cf. also Ratnaparkhi (1998). For DOP-op, different instances could implement different tree matching methods. Currently, we use the following generate-and-test
tree matching method: from the current token sequence, consider all possible sub-sequences (constrained by an automatically computed breadth lexicon, used to restrict the “plausible” length of a potential slot-filler); construct an XML tree with a root node whose label is the current slot-type in question; apply the same tree generalization method as used in the training phase; and finally check this generalized DOP-tree for equality with corresponding trees from the DOP-model.
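In code, the weighting scheme of Eq. (3) amounts to a running, performance-weighted average over the newly fixed tag positions. A minimal sketch (variable names are illustrative):

```python
def updated_weight(w_prev, fixed_prev, fixed_new, f_k, delta_w):
    """Update the weight of a tagged sequence after one iteration (Eq. 3).

    w_prev     - weight w^(j) from the previous iteration (w^(0) = 0)
    fixed_prev - #p^(j), number of tag positions fixed before this iteration
    fixed_new  - #p^(j+1), number of tag positions fixed after it
    f_k        - performance number of the operator (0..1)
    delta_w    - feature weight (0..1)"""
    if fixed_new > fixed_prev:
        gained = fixed_new - fixed_prev
        return (w_prev * fixed_prev + f_k * delta_w * gained) / fixed_new
    return w_prev   # no new position was fixed

# Example: starting from w = 0 with 0 fixed positions, fixing 3 positions with
# f_k = 0.8 and delta_w = 1.0 gives updated_weight(0.0, 0, 3, 0.8, 1.0) == 0.8
```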
5
Experiments
Since no standard IE corpus exists for German, we used a corpus of news articles reporting company turnover for the years 1994 and 1995. The corpus has been annotated with the following tags: Org (organization name), Quant (quantity of the message, which is either turnover or revenue), Amount (amount of the reported event), Date (reported time period), Tend (increase (+) or decrease (-) of turnover), Diff (amount of money announced for that time period). The corpus consists of 75 template instances with 5.878 tokens, of which we used 60 instances for training and 15 for testing. Evaluation of our hybrid ML-IE approach was done using the standard measures recall (R) and precision (P) and their combined version, the F-measure.¹ We were mainly interested in checking whether the combination of MEM and DOP improves the overall performance of our method compared to running MEM and DOP in isolation. Table 1 shows the results of running different instances of the MEM-op on the test set. Inspecting Table 1, we can see that the best result was obtained when MEM was run in best-search mode, taking previously made decisions into account and using no lexicon. Table 2 displays the performance of the DOP-op applied to different sizes of the training set (using the same test set in all runs). As one can see, precision decreases as the training size grows (see the next paragraph for a possible explanation). Table 3 shows that the overall performance of the system increases when MEM and DOP are combined. We can also see that not all instances of the MEM-op benefit from the combined approach. However, the first table row shows that the F1 value for the MEM-op increases from 79.3% to 85.2% when combined with DOP. The results suggest that MEM performs better than our current DOP tree matcher when running in isolation. The reason is that the tree patterns extracted by means of DOP are more restricted in predicting new tags than MEM. Furthermore, since we currently build tree patterns only for the slot-fillers, without taking context into account, they are probably too ambiguous. We assume that the degree of ambiguity increases with the number of documents, which might explain why the performance of DOP decreases. However, when MEM and DOP are combined, it seems that DOP actually
2
R F1= (ββ 2+1)P , where we are using β=1 in our experiments. P +R
| L? | P? | PRE (leftmost) | REC (leftmost) | FME (leftmost) | PRE (best) | REC (best) | FME (best) | PRE (rightmost) | REC (rightmost) | FME (rightmost) |
|---|---|---|---|---|---|---|---|---|---|---|
| • | • | 74.9 | 76.9 | 75.9 | 77.4 | 81.2 | 79.3 | 73.2 | 74.7 | 73.9 |
| • | ◦ | 65.6 | 80.1 | 72.2 | 65.6 | 80.1 | 72.2 | 65.6 | 80.1 | 72.2 |
| ◦ | • | 79.8 | 74.2 | 76.9 | 82.7 | 79.6 | 81.1 | 80.6 | 73.7 | 77.0 |

Table 1. Performance of different instances of the MEM-op on the single-slot task. All of them use the model obtained after i* = 76 iterations (which was determined during training as optimal). L? indicates whether a lexicon automatically determined from the slot-fillers of the training corpus was used by the MEM-op. P? specifies whether previously made predictions have been taken into account.
| DOP-op | PRE | REC | FME |
|---|---|---|---|
| C15 | 71.3 | 46.8 | 56.5 |
| C30 | 64.4 | 45.7 | 53.5 |
| C45 | 59.5 | 47.3 | 52.7 |
| C60 | 55.2 | 48.9 | 51.9 |

Table 2. Dependency of the DOP-op on the size of the training set C|doc|.
| L? | P? | PRE (leftmost) | REC (leftmost) | FME (leftmost) | PRE (best) | REC (best) | FME (best) | PRE (rightmost) | REC (rightmost) | FME (rightmost) |
|---|---|---|---|---|---|---|---|---|---|---|
| • | • | 75.3 | 76.9 | 76.1 | 85.4 | 85.0 | 85.2 | 77.0 | 77.4 | 77.2 |
| • | ◦ | 66.4 | 80.7 | 72.8 | 67.4 | 81.2 | 73.7 | 66.7 | 81.7 | 73.4 |
| ◦ | • | 79.2 | 73.7 | 76.3 | 82.7 | 79.6 | 81.1 | 80.6 | 73.7 | 77.0 |

Table 3. The single-slot performance values for combined MEM and DOP.
can contribute to the overall performance result of F1=85.2%. The reason is that, on the one hand, MEM implicitly contributes contextual information for DOP in that it helps to restrict the search space for tree matching, while on the other hand the more “static” tree patterns may help to filter out some unreliable tag sequences that would otherwise be predicted by MEM when running in isolation. Our results also suggest that not all possible combinations of operator instances improve the system performance and, moreover, that one cannot expect the best operator (when running in isolation) to automatically also be the best choice for a hybrid approach.
6
Related Work
Chieu and Ng (2002) present a MEM approach to IE and compare their system with eight other ML–IE methods for the single slot task. For English seminar announcements data, they report F1=86.9%, which ranks best
(F1=80.9% on average for all systems). Bender et al. (2003) have recently applied MEM to the CoNLL 2003 Named Entity task on English and German data, reporting F1=68.88% for German (83.92% for English). They used a different set of slots (viz. Org, Pers, Loc, Misc), as well as a cleaned-up corpus (i.e., linguistically completely disambiguated, which is not the case for our method). The best system (88.76% for English, 72.41% for German) also used a hybrid approach by combining MEM, HMM, transformation-based learning, and a winnow-based method called RRM, cf. Florian et al. (2003). They also report that MEM belongs to their best standalone performers, and that a combined approach achieved the best overall performance. The major differences with respect to our approach are the use of a cleaned-up corpus and the use of a non-incremental hybrid approach. A hybrid approach more closely related to our incremental method is described in Freitag (1998), where a dictionary learner, term-space text classification and relational rule reduction are combined. The experimental results presented here show that a hybrid ML-IE approach combining MEM and DOP can be useful for the problem of IE. So far, we have used our approach for the slot-filling task. However, since our approach is in principle open to the integration of deeper linguistic knowledge, the method should also be applicable to more complex tasks, like learning n-ary slot relations or even paragraph-level template filling.
References
BENDER, O., OCH, F., and NEY, H. (2003): Maximum Entropy Models for Named Entity Recognition. In: Proceedings of CoNLL-2003, 148–151.
BOD, R., SCHA, R. and SIMA'AN, K. (2003): Data-Oriented Parsing. CSLI Publications, University of Chicago Press.
CHIEU, H.L. and NG, H.T. (2002): A Maximum Entropy Approach to Information Extraction from Semi-Structured and Free Text. In: Proceedings of AAAI 2002.
DARROCH, J.N. and RATCLIFF, D. (1972): Generalized Iterative Scaling for Log-Linear Models. Annals of Mathematical Statistics, 43, 1470–1480.
FLORIAN, R., ITTYCHERIAH, A., JING, H., and ZHANG, T. (2003): Named Entity Recognition through Classifier Combination. In: Proceedings of CoNLL-2003, 168–171.
FREITAG, D. (1998): Multistrategy Learning for Information Extraction. In: Proceedings of the 15th ICML, 161–169.
NEUMANN, G. (2003): A Data-Driven Approach to Head-Driven Phrase Structure Grammar. In: R. Scha, R. Bod, and K. Sima'an (Eds.): Data-Oriented Parsing, 233–251.
NEUMANN, G. and PISKORSKI, J. (2002): A Shallow Text Processing Core Engine. Journal of Computational Intelligence, 18, 451–476.
PIETRA, S.D., PIETRA, V.J. and LAFFERTY, J.D. (1997): Inducing Features of Random Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 380–393.
RATNAPARKHI, A. (1998): Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. Thesis, University of Pennsylvania, Philadelphia, PA.
Text Classification with Active Learning
Blaž Novak, Dunja Mladenič, and Marko Grobelnik
Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
Abstract. In many real-world machine learning tasks, labeled training examples are expensive to obtain, while at the same time a lot of unlabeled examples are available. One such class of learning problems is text classification. Active learning strives to reduce the required labeling effort while retaining accuracy by intelligently selecting the examples to be labeled. However, very little comparison exists between different active learning methods. The effect of the ratio of positive to negative examples on the accuracy of such algorithms has also received very little attention. This paper presents a comparison of the two most promising methods and their performance on a range of categories from the Reuters Corpus Vol. 1 news article dataset.
1
Introduction
In many real-world machine learning tasks, labeled training examples are rare, expensive and/or difficult to obtain (e.g. in medical applications), while at the same time a lot of unlabeled examples are available. One such class of learning problems is text classification. While it is very easy to collect large quantities of textual data, labeling a sufficient amount of it requires significant human resources. Active learning methods are concerned with reducing this effort to the minimum possible by selecting a set of examples to be labeled by the oracle (domain expert, user, . . . ) that still provides all the necessary information about the data distribution while keeping the number of labeled examples small. The problem with the existing research is that there is practically no comparison of different active learning approaches on a standardized, useful dataset. This paper aims to fill this void and at the same time to analyze the performance of active learning with regard to the ratio of positive to negative labeled examples in the classification task.
2
Related Work
Active learning has tight links to the field of experiment design in the statistical literature. It is a generic term describing a special, interactive kind of learning process. In contrast to the usual (passive) learning, where the student (the learning algorithm in our case) is presented with a static set of examples that are then used to construct a model, the active learning paradigm means that the
‘student’ has the possibility to select the examples which appear to be the most informative. Angluin (1988) addressed the problem of learning a concept with several types of constructed queries. While some lower bounds on the number of required queries were given, the results aren't directly useful in text classification, since it is practically impossible to construct meaningful queries from currently used representations of textual documents. In order to avoid this problem, a ‘query filtering’ (Lewis and Gale (1994)) approach can be used. Seung et al. (1992) described a stream-based query filtering algorithm called ‘Query By Committee’ (QBC). In stream-based query filtering, queries are not constructed but selected from an infinite input stream of existing queries. The idea is that — in a noiseless concept learning scenario — the fastest way to find the correct concept is to always split the version space in half, where the version space is the set of all hypotheses still consistent with the data seen so far, i.e. $VS = \{h \in H \mid h(x_i) = y_i,\ \forall i \in \{1,\dots,n\}\}$. Since the version space can be arbitrarily complex, it is often not possible to analytically determine how to bisect it. The QBC approach solves this problem by observing that a query will bisect the version space if, for any answer from the oracle, exactly half of the hypotheses are removed, which means that the number of hypotheses for which the example is in the class must be equal to the number of hypotheses for which the example is not in the class. If an infinite number of hypotheses is sampled from the version space, each example from the query stream can simply be tested for this property. Even if only a small finite number of hypotheses is randomly sampled, the algorithm works well. Freund et al. (1993) have shown that this approach requires only a logarithmic number of queries compared to random sampling (i.e. labeling a random set of examples of the same size), but the results only hold for noiseless settings. Random sampling from the version space can also be problematic. Yet another approach to active learning is pool-based query filtering. In this setting, the learning algorithm is provided with a constant set of queries from which a subset is selected during the learning process. This in theory enables it to perform the most general optimization, while at the same time making it sensitive to outliers. Despite this problem, pool-based active learning algorithms currently appear to be the best in terms of performance. Based on previous comparisons (e.g. Baram et al. (2004)) and results found in the original papers, we selected two methods from this category to use in our comparison. Some ideas for combining active learning with multi-view settings have also been proposed. A multi-view setting occurs where several independent sets of attributes are available. The disagreement between models trained on different attribute sets can then be used to select query examples. Muslea et al. (2002) combined this idea with the expectation maximization algorithm. EM is used in the model learning phase to try to exploit the implicit distribution information present in the unlabeled data.
2.1
Simple Margin
Tong and Koller (2001) presented an SVM-based algorithm for active learning. Like QBC, it also tries to perform version space bisection. Since the SVM learns a decision function of the form $f(x) = w \cdot \phi(x)$, where $w = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i)$ and $\phi: X \to F$ is a map from the input space to some feature space, the set of all hypotheses can be written as

$$H = \left\{ f \,\middle|\, f(x) = \frac{w \cdot \phi(x)}{\|w\|} \right\}$$

and the version space then as $V = \{w \in W \mid \|w\| = 1,\ y_i (w \cdot \phi(x_i)) > 0,\ i = 1,\dots,n\}$, which is a subset of the unit hypersphere $\|w\| = 1$. We then wish to find an example $x$ from the unlabeled set $U$ for which the version space size (i.e. the area on the hypersphere) is split as evenly as possible, whether the example is considered to be in the positive or the negative class. After the user provides the real label $y$ for this example, it is added to the labeled training set $L$ and the procedure is repeated from the beginning. Because an analytic solution to this problem is not feasible, three different approximations are suggested in the paper.
• MaxMin Margin: Each example from the pool is separately added to both the positive and the negative class, after which the decision hyperplane is recalculated. Let $m_i^+$ and $m_i^-$ be the sizes of the corresponding SVM margins for the $i$-th example. The query is then chosen as $\operatorname{argmax}_i \min(m_i^+, m_i^-)$.
• Ratio Margin: Instead of maximizing $\min(m_i^+, m_i^-)$, the ratios of the margin sizes $\min\!\left(\frac{m_i^+}{m_i^-}, \frac{m_i^-}{m_i^+}\right)$ are maximized, therefore taking only the relative ratios of the margins into consideration.
• Simple Margin: The example closest to the current decision plane is chosen for querying, resulting in a kind of ‘uncertainty sampling’ (Lewis and Gale (1994)).
Because Tong and Koller (2001) report very little difference between the accuracies of all three algorithms, we applied the Simple Margin strategy due to its low computational complexity compared to the other two, since it does not have to recalculate the model for every example considered.
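A minimal sketch of Simple Margin query selection using scikit-learn's LinearSVC (an assumed stand-in for the SVM actually used): in each round the unlabeled example closest to the decision hyperplane is queried.

```python
import numpy as np
from sklearn.svm import LinearSVC

def simple_margin_query(X_labeled, y_labeled, X_pool):
    """Return the index of the pool example closest to the SVM hyperplane,
    together with the model trained on the current labeled set."""
    clf = LinearSVC(C=1.0)
    clf.fit(X_labeled, y_labeled)            # needs at least one example per class
    margins = np.abs(clf.decision_function(X_pool))   # distance-like score
    return int(np.argmin(margins)), clf

# One active-learning round (oracle(i) returns the true label of pool item i):
# i, model = simple_margin_query(X_lab, y_lab, X_pool)
# X_lab = np.vstack([X_lab, X_pool[i]]); y_lab = np.append(y_lab, oracle(i))
```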
2.2 Error Reduction Sampling
Roy and McCallum (2001) proposed a different approach to active learning. Instead of minimizing the version space, they try to directly minimize the criterion by which the model will be judged — its expected future error.
Let $P(y|x)$ be the unknown true probability distribution function and $\hat{P}_D$ the probability distribution estimated from the currently labeled dataset $D$. The expected error of the model is then

$$E_{\hat{P}_D} = \int_x L\bigl(P(y|x), \hat{P}_D(y|x)\bigr)\, P(x)\, dx,$$

i.e. a weighted disagreement between the distributions, measured by some loss function. If we choose $L$ to be the log-loss function $L = \sum_{y \in Y} P(y|x) \log \hat{P}_D(y|x)$ and approximate the unknown $P(y|x)$ with the current distribution estimate $\hat{P}_D(y|x)$, we get

$$\tilde{E}_{\hat{P}_D} = \frac{1}{|P|} \sum_{x \in P} \sum_{y \in Y} \hat{P}_D(y|x) \log\bigl(\hat{P}_D(y|x)\bigr)$$

for some sample $P$ also randomly taken from the unlabeled set, which is basically the negative average entropy of $\hat{P}_D$ measured over a random unlabeled sample. For the resulting ‘utility’ estimate of a possible query $x$, we label $x$ with every possible label $y$, temporarily add it to $D$, calculate the corresponding $\tilde{E}_{\hat{P}_D}$, weight it by the current model's posterior $\hat{P}_D(y|x)$, and finally choose the example with the lowest weighted-average expected error. A possible interpretation of this algorithm is that we select examples that reinforce our current belief (i.e. decrease the entropy of the model). The original paper used a bagged version of the naïve Bayes classifier in order to somewhat smooth the overly sharp posterior distribution derived from a single naïve Bayes model. As already suggested in the original paper, we instead used a more robust SVM model. Because the output of such a model is a real number, we implemented the algorithm from Platt (2002), which converts the output of the SVM into a number between 0 and 1, and used that as the classification probability. This algorithm gives no guarantees that the resulting probability is similar to that computed by a Bayesian model, but it still provides a reasonable estimate in comparison to that of a naïve Bayes.
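A sketch of the utility computation just described. Here predict_proba stands in for the Platt-calibrated SVM outputs, the candidate and evaluation pools are assumed to be given as arrays, and the label order is assumed to match the model's class order.

```python
import numpy as np

def expected_error(model, X_eval):
    """Negative average entropy of the model's posterior over the sample P."""
    proba = model.predict_proba(X_eval)                 # shape (|P|, |Y|)
    return np.mean(np.sum(proba * np.log(proba + 1e-12), axis=1))

def error_reduction_query(fit, X_lab, y_lab, X_cand, X_eval, labels=(0, 1)):
    """Pick the candidate whose posterior-weighted expected error is lowest.
    'fit(X, y)' must return a trained model exposing predict_proba;
    'labels' is assumed to be ordered like model.classes_."""
    current = fit(X_lab, y_lab)
    posteriors = current.predict_proba(X_cand)
    utilities = []
    for i, x in enumerate(X_cand):
        u = 0.0
        for j, y in enumerate(labels):                  # try every possible label
            model = fit(np.vstack([X_lab, x]), np.append(y_lab, y))
            u += posteriors[i, j] * expected_error(model, X_eval)
        utilities.append(u)
    return int(np.argmin(utilities))
```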
3
Experimental Setting
We have run all of our experiments on the Reuters Corpus Vol. 1 (Rose et al. (2002)). The corpus consists of about 810.000 news articles from 20 August 1996 to 19 August 1997, manually categorized into a shallow taxonomy of 103 categories. We used a commonly used split at 14 April 1997, giving 504.468 articles for training and 302.323 articles for testing. We removed common words using the standard English 523-word stopword list. A Porter stemmer was used to further simplify the dataset. Finally, the news articles were converted into TF-IDF (Salton (1991)) vectors. Out of the 103 categories, 11 were chosen such that they cover a large range of positive-to-negative example ratios, the most balanced having about 46% of positive examples and the most unbalanced only 0.76%.
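A rough sketch of this preprocessing with scikit-learn and NLTK. The exact 523-word stopword list and the stemmer configuration used above are not reproduced here, so both are assumptions for illustration.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def stem_tokens(text):
    """Lower-case, drop stopwords, and Porter-stem the remaining tokens."""
    return [stemmer.stem(t) for t in text.lower().split()
            if t.isalpha() and t not in ENGLISH_STOP_WORDS]

# TF-IDF vectors over the stemmed articles (one string per news article).
vectorizer = TfidfVectorizer(analyzer=stem_tokens)
# X_train = vectorizer.fit_transform(train_articles)
# X_test  = vectorizer.transform(test_articles)
```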
Fig. 1. Plot of average F1 on CCAT category having 46% of positive examples. The bottom line shows performance of Random sampling, used here as baseline algorithm.
The algorithms were run on a random subsample of the data consisting of 5,000 training and 10,000 testing examples, for a total of 30 runs. For each run, two labeled examples (one positive and one negative) were provided initially and then a total of 100 queries were made with a batch size of 1 — i.e. after every sample was labeled and added into the training set, everything was recalculated. For efficiency reasons, both algorithms only had 200 randomly chosen examples available as the unlabeled pool in each iteration. The size of the evaluation pool P for the Error Reduction Sampling algorithm was also set to 200 for all experiments. Random sampling was also implemented as a baseline reference. For three categories, Random sampling and Simple were allowed to make 1000 queries in order to better show how active learning performs compared to random labeling. Error Reduction Sampling was, however, stopped at 100 queries because of its computational complexity and the fact that we did not implement an incremental SVM learner.
Fig. 2. Plot of average F1 on M14 (left) and M143 (right) category having 10% and 2% of positive examples respectively. The highest line corresponds to Simple margin and the middle to Error reduction sampling.
Fig. 3. Plot of average F1 on the GSPO (left) and E13 (right) categories, having 4% and 0.8% of positive examples respectively. The highest line corresponds to Simple margin and the middle to Error reduction sampling, while Random sampling practically achieves an F1 close to 0.
4 Results
Figure 1 shows the behaviour of Random sampling and Simple margin on one of the largest categories from RCV1 in terms of average F1 as a function of the number of queries made. For clarity, Error reduction sampling is omitted from Figure 1. It can be seen that Simple margin outperforms Random sampling, with the difference getting smaller as the number of queries (labeled examples) grows. However, the advantage of active learning is more evident on categories with a smaller percentage of positive examples, as can be seen from Figures 2 and 3. Figure 2 shows the behaviour of Random sampling, Simple margin and Error reduction sampling on two of the largest categories from RCV1. On all three large categories, the Random sampling and Simple margin algorithms were allowed to make 1000 queries, so that we can better see the overall benefit of active learning. Figure 3 shows the same graph, but on some of the smallest
Category   Positive examples   Random sampling   Error reduction sampling   Simple
CCAT       46%                 0.729             0.572                      0.792
GCAT       29%                 0.389             0.527                      0.839
C15        18%                 0.451             0.658                      0.777
ECAT       14%                 0.030             0.234                      0.499
M14        10%                 0.124             0.613                      0.833
GPOL        6%                 0.009             0.280                      0.611
E21         5%                 0.020             0.362                      0.597
GSPO        4%                 0.008             0.537                      0.947
M143        2%                 0.024             0.687                      0.832
C183        0.89%              0.001             0.099                      0.295
E13         0.80%              0.002             0.275                      0.499
GHEA        0.76%              0.000             0.037                      0.093
Table 1. Average F1 for three active learning algorithms after 100 queries.
categories in the corpus and using only up to 100 queries. As can be seen from all the graphs, the difference between random sampling and active learning becomes progressively larger as the ratio of positive examples goes toward zero. Table 1 shows the average F1 of all three algorithms after making 100 queries. Not surprisingly, Simple margin always achieves the best results, and Error reduction sampling consistently performs somewhat worse than Simple margin. What is surprising is that on one of the largest categories (CCAT, shown in Figure 1) Error reduction sampling after 100 examples consistently performs worse than random sampling.
5 Conclusions and Future Work
We have shown that active learning really is useful for text classification problems. In our experiments, the benefits it achieves range from halving the required amount of labeling work for balanced datasets to a 50-fold reduction in labeling investment for a realistic, strongly unbalanced dataset. For instance, in order to achieve F1 = 0.8 on the fairly balanced CCAT category having 46% of positive examples, active learning needs 110 labeled examples, compared to random sampling (no active learning), which needs about 220 examples for the same performance (see Figure 1). From our comparison it is also evident that the Simple algorithm of Tong and Koller performs better on a large news article set than the Estimation of Error Reduction method, which is however still much better than simply randomly selecting data for labeling. In the future we hope to implement an efficient incremental SVM learning algorithm to see whether substantially increasing the unlabeled pool size significantly improves the performance of both active learning algorithms.
References
ANGLUIN, D. (1988): Queries and concept learning. Machine Learning, 2(3), 319–342.
BARAM, Y., EL-YANIV, R. and LUZ, K. (2004): Online Choice of Active Learning Algorithms. The Journal of Machine Learning Research, 2004, 255–291.
FREUND, Y., SEUNG, H. S., SHAMIR, E. and TISHBY, N. (1993): Information, prediction, and query by committee. Advances in Neural Information Processing Systems 5, 483–490.
LEWIS, D. D. and GALE, W. A. (1994): A sequential algorithm for training text classifiers. In: Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval.
MUSLEA, I., MINTON, S. and KNOBLOCK, C. (2002): Active + Semi-supervised Learning = Robust Multi-View Learning. In: Proc. of the 19th International Conference on Machine Learning, 435–442.
PLATT, J. C. (2002): Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. Advances in Large Margin Classifiers, MIT Press.
ROSE, T. G., STEVENSON, M. and WHITEHEAD, M. (2002): The Reuters Corpus Volume 1 – from Yesterday's News to Tomorrow's Language Resources. In: 3rd International Conference on Language Resources and Evaluation, May, p. 7.
ROY, N. and McCALLUM, A. (2001): Toward Optimal Active Learning through Sampling Estimation of Error Reduction. In: Proc. of the 18th International Conference on Machine Learning, 441–448.
SALTON, G. (1991): Developments in Automatic Text Retrieval. Science, 253, 974–979.
SEUNG, H. S., OPPER, M. and SOMPOLINSKY, H. (1992): Query by Committee. Computational Learning Theory, 287–294.
TONG, S. and KOLLER, D. (2000): Support Vector Machine Active Learning with Applications to Text Classification. In: Proc. of the 17th International Conference on Machine Learning, 999–1006.
Towards Structure-sensitive Hypertext Categorization

Alexander Mehler¹, Rüdiger Gleim¹, and Matthias Dehmer²

¹ Universität Bielefeld, 33615 Bielefeld, Germany
² Technische Universität Darmstadt, 64289 Darmstadt, Germany
Abstract. Hypertext categorization is the task of automatically assigning category labels to hypertext units. Like text categorization, it belongs to the area of function learning based on the bag-of-features approach. This scenario faces the problem of a many-to-many relation between websites and their hidden logical document structure. The paper argues that this relation is a prevalent characteristic which interferes with any effort to apply the classical apparatus of categorization to web genres. This is confirmed by a threefold experiment in hypertext categorization. In order to outline a solution to this problem, the paper sketches an alternative method of unsupervised learning which aims at bridging the gap between statistical and structural pattern recognition (Bunke et al. 2001) in the area of web mining.
1 Introduction
Web structure mining deals with exploring hypertextual patterns (Kosala and Blockeel (2000)). It includes the categorization of macro structures (Amitay et al. (2003)) such as web hierarchies, directories and corporate sites. It also includes the categorization of single web pages (Kleinberg (1999)) and the identification of page segments as a kind of structure mining on the level of micro structures (Mizuuchi and Tajima (1999)). The basic idea is to perform structure mining as function learning in order to map web units above, on or below the level of single pages onto at most one predefined category per unit (Chakrabarti et al. (1998)). The majority of these approaches utilizes text categorization methods. But in contrast to pure text categorization, they also use HTML markup, metatags and link structure beyond bag-of-words representations of the pages' wording as input of feature selection (Yang et al. (2002)). Chakrabarti et al. (1998) and Fürnkranz (2002) extend this approach by including pages into feature selection which are interlinked with the page to be categorized. Finally, the aggregation of representations of the wording, markup and linking of pages is demonstrated by Joachims et al. (2001). The basic assumption behind these approaches is as follows: web units of similar function/content tend to have similar structures. The central problem is that these structures are not directly accessible by segmenting and categorizing single web pages. This is due to polymorphism and its reversal relation of discontinuous manifestation (Mehler et al. (2004)): generally speaking, polymorphism occurs if the same (hyper-)textual unit manifests
Fig. 1. Types of links connecting hypertext modules (symbolized as circles).
several categories. This one-to-many relation of expression and content units is accompanied by a reversal relation according to which the same content or function unit is distributed over several expression units. This combines to a many-to-many relation between explicit, manifesting web structure and implicit, manifested functional or content-based structure. Polymorphism occurs when, for example, the same web page of the genre of conference websites provides information about the call for papers, the submission procedure and conference registration, that is, when it manifests more than one function. The reversal case occurs when, for example, a call for papers is manifested by different pages each informing about another conference topic. The former case results in multiple categorizations without corresponding to ambiguity of category assignment since actually several categories are manifested. The latter case results in defective or even missing categorizations since the pages manifest features of the focal category only in part. Both cases occur simultaneously if a page manifests several categories, but some of them only in part.

If this many-to-many relation is prevalent, proper hypertext categorization is bound to a preliminary structure analysis which first resolves polymorphism and discontinuous manifestation. We hypothesize this structure analysis to be constrained as follows:

• The functional structure of websites is determined by their membership in web genres (Yoshioka and Herman 2000). Hypertext categories are (insofar as they focus on the functions web pages are intended to serve) specific to genres, e.g. conference website (Yoshioka and Herman 2000), personal home page (Rehm (2002)) or online shop.
• What is common to instances of different web genres is the existence of an implicit logical document structure (LDS) – analogously to textual units whose LDS is described in terms of section, paragraph and sentence categories. In case of instances of web genres we hypothesize that their LDS includes at least three levels:
  – Document types, which are typically manifested by websites, constitute the level of pragmatically closed acts of web-based communication (e.g. conference organization or online shopping). They organize a system of dependent sub-functions manifested by modules:
  – Module types are functionally homogeneous units of web-based communication manifesting a single, but dependent function, e.g. call for papers, program or conference venue as sub-functions of the function of web-based conference organization.
  – Finally, elementary building blocks (e.g. logical lists, tables, sections) only occur as dependent parts of hypertext modules.
• Uncovering the LDS of websites contributes to breaking the many-to-many relation of polymorphism and discontinuous manifestation. It aims to explicate which modules are manifested by which (segments of which) visible web pages of the same site and which links of which types – as distinguished in figure (1) – interlink these modules.
• The central hypothesis of this paper is that hypertext categorization has to be reconstructed as a kind of structure learning focussing on prototypical, recurrent patterns of the LDS of websites as instances of web genres on the level of document types and their typing according to the functions their constitutive modules are intended to serve.

In order to support this argumentation, the following section describes an experiment in hypertext categorization. After that, an algorithm is outlined which reconstructs hypertext categorization in terms of a structure-sensitive model. Finally, we give some conclusions and prospect future work.
2 An Experiment in Hypertext Categorization
Our hypothesis is that if polymorphism is a prevalent characteristic of web units, web pages cannot serve as input of categorization since polymorphic pages simultaneously instantiate several categories. Moreover, these multiple categorizations are not simply resolved by segmenting the focal pages, since they possibly manifest categories only discontinuously so that their features do not provide a sufficient discriminatory power. In other words: We expect polymorphism and discontinuous manifestation to be accompanied by many multiple categorizations without being reducible to the problem of disambiguating category assignments. In order to show this, we perform a categorization experiment according to the classical setting of function learning, using a corpus of the genre of conference websites. Since these websites serve recurrent functions (e.g. paper submission, registration etc.) they are expected to be structured homogeneously on the basis of stable, recurrent patterns. Thus, they can be seen as good candidates of categorization. The experiment is performed as follows: We apply support vector machine (SVM) classification which proves to be successful in case of sparse, high dimensional and noisy feature vectors (Joachims (2002)). SVM classification is performed with the help of the LibSVM (Hsu et al. (2003)). We use a corpus of 1,078 English conference websites and 28,801 web pages. Hypertext representation is done by means of a bag-of-features approach using about
Category                           rec.   prec.
Abstract(s)                        0.2    1.0
Accepted Papers                    0.3    1.0
Call for Papers                    0.1    1.0
Committees                         0.5    0.8
Contact Information                0      0
Exhibition                         0.4    1.0
Important Dates                    0.8    1.0
Invited Talks                      0      0
Menu                               0.7    0.7
Photo Gallery                      0      0
Program, Schedule                  0.8    1.0
Registration                       0.9    1.0
Sections, Sessions, Plenary etc.   0.1    0.3
Sponsors and Partners              0      0
Submission Guidelines etc.         0.5    0.8
Venue, Travel, Accommodation       0.9    1.0
Table 1. The categories of the conference website genre applied in the experiment.
85,000 lexical and 200 HTML features. This representation was done with the help of the HyGraph system (Gleim (2005)), which explores websites and maps them onto hypertext graphs (Mehler et al. (2004)). Following Hsu et al. (2003), we use a Radial Basis Function kernel instead of a polynomial kernel, but other than in Mehler et al. (2004), we augment the corpus base and use a more fine-grained category set. Optimal parameter selection is based on a minimization of a 5-fold cross-validation error. Further, we perform a binary categorization for each of the 16 categories based on 16 training sets of positive/negative examples (see table 1). The size of the training set is 1,858 pages (284 sites); the size of the test set is 200 pages (82 sites). We perform 3 experiments:

Experiment A – one against all: First we apply a one-against-all strategy, that is, we use X \ Yi as the set of negative examples for learning category Ci, where X is the set of all training examples and Yi is the set of positive examples of Ci. The results are listed in table (1). They show the expected low level of effectiveness: recall and precision perform very low on average. In three cases the classifiers fail completely. This result is confirmed when looking at column A of table (2): it shows the number of pages with up to 7 category assignments. In the majority of cases no category could be applied at all – only one-third of the pages was categorized.

Experiment B – lowering the discriminatory power: In order to augment the number of categorizations, we lowered the categories' selectivity by restricting the number of negative examples per category to the number of the corresponding positive examples, sampling the negative examples according to the sizes of the training sets of the remaining categories. The results are shown in table (2): the number of zero categorizations is dramatically reduced, but at the same time the number of pages mapped onto more than one category increases dramatically. There are even more than 1,000 pages which are mapped onto more than 5 categories.

Experiment C – segment level categorization: Thirdly, we apply the classifiers trained on the monomorphic training pages to segments derived as follows: pages are segmented into spans of at least 30 tokens reflecting segment borders according to the third level of the pages' document object model trees. Column C of table (2) shows that this scenario does not solve the problem of multiple categorizations since it falls back to the problem of zero categorizations. Thus, polymorphism is not resolved by simply segmenting pages, as other segmentations along the same line of constraints confirmed.
#categorizations   A (page level)   B (page level)   C (segment level)
0                  12,403             346            27,148
1                   6,368           2,387             9,354
2                     160           5,076               137
3                       6           5,258                 1
4                       0           3,417                 0
5                       0             923                 0
6                       0           1,346                 0
7                       0             184                 0
Table 2. The number of pages mapped onto 0, 1, ..., 7 categories in exp. A,B,C.
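For orientation, the binary one-against-all setup with an RBF kernel and 5-fold cross-validated parameter selection described above can be sketched as follows; scikit-learn is used here in place of the LibSVM tools, and the parameter grid, matrices and category variables are placeholders rather than the values used in the experiment.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def train_one_vs_all(X, y_binary):
    """y_binary: 1 for pages of the focal category, 0 for all other training pages."""
    grid = GridSearchCV(
        SVC(kernel="rbf"),                                # RBF kernel as in the paper
        param_grid={"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]},
        cv=5, scoring="f1")                               # 5-fold cross-validation
    grid.fit(X, y_binary)
    return grid.best_estimator_

# one binary classifier per category (Experiment A):
# classifiers = {c: train_one_vs_all(X_train, (labels == c).astype(int))
#                for c in categories}
```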
There are competing interpretations of these results. The category set may be judged to be wrong, but it reflects the most differentiated set applied so far in this area (cf. Yoshioka and Herman (2000)). Next, the representation model may be judged to be wrong, but it is the one usually applied in text categorization. Third, the categorization method may be seen as ineffective, but SVMs are known to be one of the most effective methods in this area. Further, the classifiers may be judged to be wrong – of course the training set could be enlarged, but it already includes about 2,000 monomorphic training units. Finally, the focal units (i.e. web pages) may be judged to be unsystematically polymorphic in the sense of manifesting several logical units. It is this interpretation which we believe to be supported by the experiment.

Why are linear segmentations of web pages according to experiment C insufficient? The reason is twofold. Because a category may be distributed over several pages, it is possible that pages analyzed in isolation do not manifest category markers sufficiently. Thus, segmentations of pages of the same site are interdependent. But since not all pages belong to the same structural level of a website (a call for participation belongs to another level than an abstract of an invited talk), segmentation also needs to be aware of website structure. As categories are manifested by pages of different structural levels, these pages are not linearly ordered. This is also shown by structural recursion, since a call for papers, for example, may include several workshops each having its own call. That is, linear segmentations of pages do not suffice because of discontinuous manifestations. But linear orderings of pages do not suffice either, because of functional website structure. Fuzzy classification does not solve this problem as long as it only performs multiple category assignments to varying degrees of membership, since such mappings do not make it possible to distinguish between ambiguity of category assignment and polymorphism. Thus, web page categorization necessarily relies on resolving polymorphism and discontinuous manifestation and thus relates to learning the implicit logical hypertext document structure (LDS) outlined in the following section.
3 Reconstructing Hypertext Categorization
A central implication of the previous section is that, prior to hypertext categorization, the many-to-many relation of visible and hidden web structure has
to be resolved at least with respect to LDS. Thus, hypertext categorization is bound to a structural analysis. Insofar as this analysis results in structured representations of web units, function learning as performed by text categorization is inappropriate for mining web genres. It unsystematically leads to multiple categorizations when directly applied to web units whose borders do not correlate with the functional or content-based categories under consideration. Rather, a sort of structure learning has to be performed, mapping these units onto representations of their LDS, which only then are object to mining prototypical sub-structures of web genres. In this section, hypertext categorization is reconstructed along this line of argumentation. An algorithm is outlined, henceforth called LDS algorithm, which addresses structure learning from the point of view of prototypes of LDS. It is divided into two parts:

I. Logical Document Structure Learning

Websites as supposed instances of web genres have first to be mapped onto representations of their LDS. That is, polymorphism has to be resolved with respect to constituents of this structure level. This includes the following tasks: (i) Visible segments of web pages have to be identified as manifestations of constituents of LDS. (ii) Visible hyperlinks have to be identified as manifestations of logical links, i.e. as kernel links or up, down, across or external links. (iii) Finally, functional equivalents of hyperlinks have to be identified as manifestations of logical links according to the same rules, i.e. of links without being manifested by hyperlinks. Solving these tasks, each website is mapped onto a representation of its LDS based on the building blocks described in section (1). This means that websites whose visible surface structures differ dramatically may nevertheless be mapped onto similar LDS representations, and vice versa. So far, these intermediary representations lack any typing of their nodes, links and sub-structures in terms of functional categories of the focal web genre. This functional typing is addressed by the second part of the algorithm:

II. Functional Structure Learning

The representations of LDS are input to an algorithm of computing with graphs which includes the following steps:

1. Input: As input we use a corpus C = {Gi | i ∈ I} of labeled typed directed graphs Gi = (V, E, k(Gi), τ) with kernel hierarchical structure modeled by an ordered rooted tree k(Gi) = (V, D, x, O) with root x and order relation O ⊆ D², D ⊆ E. O is an ordering of kernel links e ∈ D only. Since k(Gi) is a rooted tree, it could equivalently be defined over the nodes. Typing of edges e ∈ E is done by a function τ : E → T, where T is a set of type labels. In the case of websites, vertices v ∈ V are labeled as either accessible or unaccessible web pages or resources and edges are typed as kernel, across, up, down, internal, external or broken links. In the case of logical hypertext document structure, vertices are logical modules, whereas the set of labels of edge types remains the same.

2. Graph similarity measuring: The corpus C of graphs is input to a similarity measure s : C² → [0, 1] used to build a similarity matrix (Bock (1974))
S = (s_kj), where s_kj is the similarity score of the pairing G_k, G_j ∈ C. s has to be sensitive to the graphs' kernel hierarchical structure as well as to the labels of their vertices and the types of their edges. We utilize the measure of Dehmer & Mehler (2004), which is of cubic complexity.

3. Graph clustering: Next, the similarity matrix is input to clustering, that is, to unsupervised learning without presetting the number of classes or categories to be learned. More specifically, we utilize hierarchical agglomerative clustering (Bock (1974)) based on average linkage with subsequent partitioning. This partitioning refers to a lower bound (Rieger (1989)) $\theta = \bar{\eta} + \frac{1}{2}\sigma$, where $\bar{\eta}$ is the mean and $\sigma$ the standard deviation of the absolute values of the differences of the similarity levels of consecutive agglomeration steps. This gives a threshold for selecting an agglomeration step for dendrogram partitioning whose similarity distance to the preceding step is greater than θ. We use the first step exceeding θ.

4. Graph prototyping: Next, for each cluster X = {G_{i_1}, ..., G_{i_n}} ⊆ C of the output partitioning of step (3) a graph median $\hat{G}$ has to be computed according to Bunke et al. (2001): $\hat{G} = \arg\max_{G \in X} \frac{1}{n} \sum_{k=1}^{n} s(G, G_{i_k})$, with n = |X|. The basic idea of applying this formula is to use $\hat{G}$ as a prototype of the cluster X in the sense that it prototypically represents the structuring of all members of that set of graphs.

5. Graph extraction: The last step is to use the prototypes $\hat{G}$ as kernels of instance-based learning. More specifically, the prototype graphs can be used as templates to extract sub-structures in new input graphs. The idea is to identify inside these graphs recurrent patterns and thus candidates of functional categories of the focal genre (e.g. paper submission or conference venue graphs in case of the genre of conference websites).
It is this last step which addresses the final categorization by using structured categories in order to categorize sub-structures of the input graphs. It replaces the mapping of visible segments of web units onto predefined categories by mapping sub-structures of the hidden LDS onto clusters of homogeneously structured instances of certain module types of the focal web genre.
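Steps 3 and 4 can be sketched schematically as follows, assuming a precomputed similarity matrix S with ones on the diagonal; SciPy's average-linkage routine stands in for the clustering used in the paper, the θ = η̄ + ½σ cut is approximated on the merge distances, and the median graph is selected by maximal average similarity within its cluster. Names and numerical details are illustrative, not part of the original algorithm specification.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def partition_by_theta(S):
    """Cluster graphs from a similarity matrix S (values in [0,1], ones on the diagonal)."""
    D = squareform(1.0 - S, checks=False)      # condensed distance matrix
    Z = linkage(D, method="average")           # average-linkage agglomeration
    levels = Z[:, 2]                           # distances of consecutive merge steps
    diffs = np.abs(np.diff(levels))
    theta = diffs.mean() + 0.5 * diffs.std()   # theta = mean + 1/2 std of the jumps
    # cut before the first merge whose jump from the preceding step exceeds theta
    big = np.where(diffs > theta)[0]
    cut = levels[big[0]] if len(big) else levels[-1]
    return fcluster(Z, t=cut, criterion="distance")

def median_graph(S, members):
    """Index of the graph with maximal average similarity to its cluster members."""
    sub = S[np.ix_(members, members)]
    return members[int(np.argmax(sub.mean(axis=1)))]
```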
4 Conclusion
This paper argued that websites are fuzzy manifestations of hidden logical document structure. As far as hypertext categorization deals with genre-based, functional categories, visible web document structures do not suffice as input to categorization because of their many-to-many relation to the hidden LDS. Thus, hypertext categorization is in need of reconstructing this LDS. This argumentation has been supported by means of a categorization experiment. In order to solve this problem, categorization has been conceptually reconstructed by means of an algorithm reflecting the distinction of visible and hidden structure. It utilizes unsupervised structure learning instead of supervised function learning. Future work aims at implementing this algorithm.
References
AMITAY, E., CARMEL, D., DARLOW, A., LEMPEL, R. and SOFFER, A. (2003): The connectivity sonar. Proc. of the 14th ACM Conference on Hypertext, 28–47.
BOCK, H.H. (1974): Automatische Klassifikation. Vandenhoeck & Ruprecht, Göttingen.
BUNKE, H., GÜNTER, S. and JIANG, X. (2001): Towards bridging the gap between statistical and structural pattern recognition. Proc. of the 2nd Int. Conf. on Advances in Pattern Recognition, Berlin, Springer, 1–11.
CHAKRABARTI, S., DOM, B. and INDYK, P. (1998): Enhanced hypertext categorization using hyperlinks. Proc. of ACM SIGMOD, International Conf. on Management of Data, ACM Press, 307–318.
DEHMER, M. and MEHLER, A. (2004): A new method of similarity measuring for a specific class of directed graphs. Submitted to Tatra Mountain Journal, Slovakia.
FÜRNKRANZ, J. (2002): Hyperlink ensembles: a case study in hypertext classification. Information Fusion, 3(4), 299–312.
GIBSON, D., KLEINBERG, J. and RAGHAVAN, P. (1998): Inferring web communities from link topology. Proc. of the 9th ACM Conf. on Hypertext, 225–234.
GLEIM, R. (2005): Ein Framework zur Extraktion, Repräsentation und Analyse webbasierter Hypertexte. Proc. of GLDV '05, 42–53.
HSU, C.-W., CHANG, C.-C. and LIN, C.-J. (2003): A practical guide to SVM classification. Technical report, Department of Computer Science and Information Technology, National Taiwan University.
JOACHIMS, T. (2002): Learning to classify text using support vector machines. Kluwer, Boston.
JOACHIMS, T., CRISTIANINI, N. and SHAWE-TAYLOR, J. (2001): Composite kernels for hypertext categorisation. Proc. of the 11th ICML, 250–257.
KLEINBERG, J. (1999): Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632.
KOSALA, R. and BLOCKEEL, H. (2000): Web mining research: A survey. SIGKDD Explorations, 2(1), 1–15.
MEHLER, A., DEHMER, M. and GLEIM, R. (2004): Towards logical hypertext structure – a graph-theoretic perspective. Proc. of I2CS '04, Berlin, Springer.
MIZUUCHI, Y. and TAJIMA, K. (1999): Finding context paths for web pages. Proc. of the 10th ACM Conference on Hypertext and Hypermedia, 13–22.
REHM, G. (2002): Towards automatic web genre identification. Proc. of the Hawai'i Int. Conf. on System Sciences.
RIEGER, B. (1989): Unscharfe Semantik. Peter Lang, Frankfurt a.M.
YANG, Y. (1999): An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval, 1(1/2), 67–88.
YANG, Y., SLATTERY, S. and GHANI, R. (2002): A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18(2-3), 219–241.
YOSHIOKA, T. and HERMAN, G. (2000): Coordinating information using genres. Technical report, Massachusetts Institute of Technology.
Evaluating the Performance of Text Mining Systems on Real-world Press Archives

Gerhard Paaß¹ and Hugo de Vries²

¹ Fraunhofer Institute for Autonomous Intelligent Systems, St. Augustin, Germany
² Macquarie University, Sydney, Australia
Abstract. We investigate the performance of text mining systems for annotating press articles in two real-world press archives. Seven commercial systems are tested which recover the categories of a document as well as named entities and catchphrases. Using cross-validation we evaluate the precision-recall characteristic. Depending on the depth of the category tree, a breakeven of 39–79% is achieved. For one corpus 45% of the documents can be classified automatically, based on the system's confidence estimates. In a usability experiment the formal evaluation results are confirmed. It turns out that with respect to some features human annotators exhibit a lower performance than the text mining systems. This establishes a convincing argument for using text mining systems to support the indexing of large document collections.
1 Introduction
In Germany the Deutsche Presse-Agentur (dpa) news agency and the PresseArchivNetzwerk (PAN), a joint subsidiary of German broadcasters for indexing press articles, wanted to select a commercial text mining system to support the annotation of press articles. They asked Fraunhofer AIS to perform a comparison of available text mining systems in a realistic setup. Currently the articles are read by professional human annotators, who assign each article to one or more content categories in a time-consuming process. In addition they identify important persons, locations and institutions, and characterize its contents by free catchphrases. As fully automatic indexing currently seems to be out of reach, two different application scenarios were considered. In a scenario called "supported annotation", a press article is categorized and indexed by a text mining system, whereupon a human annotator checks the proposals and makes corrections to produce the final annotation. A second scenario, "partly automatic annotation", is based on the confidence that the text mining system estimates for its proposals. The scenario assumes that in a large number of cases this confidence is very high – say larger than 95% – and that in these cases the annotation can be done automatically. In the rest of the cases a human annotator has to check and correct the proposals of the text mining system. To assess the performance of the text mining systems two different experiments were conducted. In a formal evaluation the systems' annotation
proposals (classes, named entities, catchphrases) for a test set were compared to the original annotations of press articles. In a usability test the text mining systems were deployed in the supported annotation scenario, to determine if good proposals really assist the annotators and lead to improved or faster indexing of the articles. Because of confidentiality reasons no results for specific text mining systems can be reported, but only the best and worst performance values for each evaluation will be given. Nevertheless they should describe the current quality spectrum of text mining results quite well. In the next section we describe the databases and text mining tasks and characterize the participating text mining systems. In section 3 we discuss evaluation criteria for text mining. The fourth section describes the formal evaluation and the fifth section gives details on the usability tests and the actual improvement of annotations. A final section summarizes the findings.
2 Text Mining Tasks and Evaluation Criteria
Both clients supplied us with large databases to allow realistic tests of the text mining systems. The database of the Presse Archiv-Netzwerk (PAN) consisted of 451,229 press articles from newspapers, journals and magazines. They were classified in an extensive hierarchical system of classes with seven layers and 2,300 categories. To give an example of the layers we have "economy", "economic sector", "consumer goods industry", "food industry", "drink", "non-alcoholic drink". These categories were elaborated in two ways. In the first place, there are 9,506 topic descriptors for each class which give further details. In the above example there are among others the topic descriptors "non-alcoholic beer", "cola drink" and "fruit juice". Secondly, each class may be enhanced by one of twenty aspects like "history", "reform", or "overview". These aspects are not independent categories but are assigned in conjunction with some other category. Additionally, the articles were indexed by named entities: persons, institutions, and locations. The annotation rules demand that not all but only "important" named entities are selected. To unify spelling, lists containing 174,185 persons, 72,005 institutions and 23,060 geographic locations were also supplied. Finally, part of the articles were indexed by free catchphrases.

The database provided by the Deutsche Presse-Agentur (dpa) contained 382,200 newswire stories covering the whole range of topics, the so-called "Basisdienst". Their length ranged from a few sentences to several hundred words. They were classified into a 3-level hierarchy of about 900 categories, which is a modification of the IPTC classification scheme. The stories were not originally annotated with named entities; only in the usability test was such an annotation tested. Seven leading German providers of text mining were selected to take part in the test: Amenotec, Digital Collections, Inxight, picturesafe, Recommind,
Fig. 1. Average document centered recall for classes of the PAN and dpa corpus (left). Average 4x-recall of PAN classes grouped by number of documents (right).
TEMIS, and XtraMind. All systems being proprietary, information about the employed technology was vague. Each text mining system generates a set of proposals (at most 40) and assigns a confidence score to it. Hence there is an ordered sequence of proposals for a single document. Precision is then the fraction of correct proposals in the total number of proposals for the document. Recall is the fraction of actually correct proposals in the total number of correct items. These values were aggregated over documents belonging to specific groups of classes. By ordering the proposals according to the confidence and accepting more and more proposals, the precision should drop while the recall increases. This yields the well-known ROC-curve. For evaluation we used the precision-recall breakeven or 1x-recall. If we take twice as many proposals as there are correct items, we get the 2x-recall. Similarly we get the 3x-recall and 4x-recall. As the final measure of effectiveness the clients settled on the 2x-recall. An overview of other evaluation measures is given by [4]. To assess the significance of differences we used the k-fold cross-validated t-test discussed by [1].
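For a single document, the nx-recall measures can be computed as in the following small helper; proposals are assumed to be sorted by descending confidence, and the example category codes and values are invented for illustration.

```python
def nx_recall(ranked_proposals, true_items, n=2):
    """Share of true items found among the top n * |true items| proposals."""
    if not true_items:
        return 1.0
    top = ranked_proposals[: n * len(true_items)]
    hits = sum(1 for p in top if p in true_items)
    return hits / len(true_items)

# example: 1x-recall (precision-recall breakeven per document)
# nx_recall(["C31", "M14", "E21"], {"M14", "GSPO"}, n=1)   # -> 0.5
```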
3 Formal Comparison of Text Mining Systems
The training time for a training set of 360,000 PAN records varied considerably: between 2 and 90 hours. The time required for annotating a single document was 0.1 to 1 second. To take into account the temporal dependency between stories we blocked the whole time range into two-day periods before forming the five-fold cross-validation sets. Figure 1 shows the document-centered recall for classes averaged over all documents and over the five cross-validation runs. To demonstrate the range
of results, the systems with the highest and lowest performance are plotted. Only exact hits of the topic classes in the original document are counted. For PAN we get a 40% 1x-recall, whereas for dpa we have a 78% 1x-recall. If four times more proposals are taken into account than there are true classes for a document, we get a 4x-recall of 61% for PAN and 87% for dpa. Hence these proposals miss 39% of the correct classes for PAN and only 13% for dpa. The standard deviation of the cross-validated 2x-recall is 0.0039, which is several magnitudes lower than the 2x-recall difference between the best and second-best system. Hence the difference was by any means significant. To assess the effect of the hierarchy structure we relaxed the requirements for a hit. We also counted a proposal as correct if the true class had the same parent in the hierarchy, or if the proposal was the parent or child of the true class. There is a 10-15% increase in the recall values for the best system, which indicates the effect of a reduced complexity of the classification hierarchy. An interesting question is the dependency of the recall on the number of "positive" documents for a class. The right part of figure 1 shows the dependency of the 4x-recall on the training set size, for the PAN corpus. The best algorithms already recover 35-50% of the true classes if the training set size is between 10 and 20. For classes with 50 to 100 documents the 4x-recall grows to 70-80%. Increasing the training set beyond 1,000 documents gives only small further increases in recall.

For the partly automatic annotation scenario we had to evaluate the confidence of a proposal. All text mining systems delivered some confidence measure, which however could not directly be interpreted as a probability of correct assignment. We therefore generated a normalized confidence value by the following procedure:

1. For a system, the range of actually produced confidence values was partitioned into up to 100 intervals with approximately the same number of proposals each.
2. For each interval the probability of correct assignment was determined on half of the test set. This yielded the normalized confidence value.
3. On the other half of each test set the number of correct assignments with the corresponding normalized confidence values was evaluated.

Thus, for each cross-validation test set, two different normalized confidence values were developed, which however were nearly identical. Figure 2 shows the average of the resulting curves. For PAN, the best text mining system could only assign 4% of the documents with 80% confidence. As for these documents there is still a 20% error, partly automatic annotation is not yet feasible for PAN. For dpa, however, 45% of the documents could be assigned with a normalized confidence of 95%, i.e. with 5% errors. For these documents partly automatic annotation seems to be possible, at least for classes. The remaining 55% of the documents may be assigned by human annotators using system proposals.
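A sketch of this normalization might look as follows, assuming arrays of raw confidence scores and 0/1 correctness indicators from one half of a test set; the bin count and all names are illustrative, not taken from the study.

```python
import numpy as np

def fit_normalized_confidence(raw_conf, correct, n_bins=100):
    """Steps 1-2: bin raw scores into roughly equal-sized intervals and record
    the empirical probability of a correct assignment per interval."""
    order = np.argsort(raw_conf)
    edges, probs = [], []
    for b in np.array_split(order, n_bins):
        if len(b) == 0:
            continue
        edges.append(raw_conf[b].max())        # upper edge of the interval
        probs.append(correct[b].mean())        # fraction of correct proposals
    edges, probs = np.asarray(edges), np.asarray(probs)

    def normalize(scores):                     # step 3: apply to held-out scores
        idx = np.clip(np.searchsorted(edges, scores), 0, len(probs) - 1)
        return probs[idx]
    return normalize

# normalize = fit_normalized_confidence(conf_half1, correct_half1)
# normalized = normalize(conf_half2)
```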
Fig. 2. Fraction of correctly assigned proposals with a normalized confidence value equal to or larger than the abscissa. The curves for the systems with the largest and smallest performance are shown.
Named entity extraction requires algorithms completely different from classification procedures. The participating systems use existing lists of names, rule sets, and/or statistical methods like hidden Markov models [3]. As the dpa corpus did not cover named entities, formal experiments could only be performed with the PAN corpus. To simplify the task, the order of the name components as well as their capitalization were ignored. Figures 3(a) and 3(b) show the ROC curves for named entities. The best results are achieved for persons and geographic locations, where a 1x-recall of 66% and 51% respectively is reached. It grows to 78% and 76% if four times as many proposals are considered as there are true names, so only about 20% of the names are missed. The values for topic descriptors, catchphrases and institutions are much lower and never exceed 50%. Hence a large number of true named entities is not recovered. For persons, institutions and geographic locations the normalized confidence curves for PAN were disappointing. Geographic location was best, as 10% of the records could be assigned with a normalized confidence of 70%. Confidence values larger than 80% were not encountered. Hence named entities currently cannot be assigned automatically.
Fig. 3. Average document centered recall for named entities of the PAN corpus: (a) recall for persons, geographic locations, and topic descriptors (left); (b) recall for institutions and free catchphrases (right). The values for the systems with highest and lowest performance are shown.
4 Usability Tests
A usability test was designed in order to find out whether proposals from good text mining systems would lead to improved quality and increased speed in human document annotation. For both corpora we selected a number of documents and presented these documents to human annotators together with the annotation proposal of a text mining system. We developed a neutral GUI for this purpose which also assisted the annotators in several respects: it presented the text of the document together with the proposed annotation and it offered extensive access to the classification hierarchy as well as to lists of named entities. For PAN we randomly selected 200 documents, each of which was classified seven times by different annotators using annotation proposals from different systems. The human annotator could then accept or correct these proposals. In this way we were able to estimate the effect of a proposal on the final annotation of a document. For dpa we used a similar setup. We compared the recall values of the formal tests to the recall values of the text mining proposals in the usability tests, taking the annotators' decisions as true. It turns out that for PAN the results are quite similar, with differences of about 5%, except for institutions and catchphrases. The latter are accepted far more often by the annotators than they fit the original annotation. In contrast, institutions are accepted less often by the annotators, perhaps because of inaccuracies in capitalization or sequence. For dpa far fewer system proposals are accepted by the annotators than the formal tests suggest. We will come back to this later. The significance of differences
Fig. 4. Recall of original classes by text mining systems and annotators Ui, for PAN classes (left) and dpa classes (right). SX and SY are the systems with highest and lowest values. The line indicates the mean annotator recall.
between the systems was tested by analysis of variance, taking into account the variations due to documents and annotators. While the differences between the leading systems and the others were significant for topic classes, persons and locations, the differences for the remaining attributes were partly insignificant. One aspect of the test was the times needed for annotation. The best systems lead to a reduction of 18% for PAN and 30% for dpa compared to the worst system. These figures can be taken as a proxy to the real time savings to be expected in the use of the text mining systems. It is illuminating to compare the final annotations delivered by the annotators to the original annotations in the corpora. Figure 4 shows the recall of the original annotation in the PAN and dpa corpus by the text mining system’s proposals as well as by the human annotator’s proposals. As can be seen, the best text mining systems are consistently closer to the original annotation than the human annotators. The same holds to an even larger extent for the dpa corpus. For persons the best text mining system is able to extract persons more precisely than the average human annotators, although for the notoriously difficult institutions the annotators on the average perform better than the best text mining system. These observations lead to the consequence that annotations are not unique and have some degrees of freedom. This is also supported by the observation that the same document is annotated differently by the different annotators. For PAN classes the annotators recovered the original annotations with an average 2x-recall value of 39.1% while they ”recovered” the annotations of the same document by other annotators with an average 2x-
recall of 41.0%. Therefore there are differences between annotators, which are as large as the differences to the original annotation. The same holds for the other attributes like persons, etc. Good text mining systems on the other hand are able to recover important attributes of the original annotation in a more reliable way.
5 Summary
The experiments show that text mining systems are able to annotate large text collections of 500,000 news articles with a hierarchy containing thousands of categories. The original annotation can be reproduced especially well for persons and geographic locations, with a 2x-recall of 75% and 66%. For classes a 2x-recall of 51% (PAN) and 85% (dpa) is reached. Other features can be reconstructed with a 2x-recall of 30-40%. We used cross-validation and corresponding tests to find the best system for the annotation of each feature. It turned out that no system was optimal for every feature. The recall levels as well as the ranking of systems were confirmed by a usability test with human annotators. Here it turned out that for some features the annotators exhibit larger differences to the original annotation than the text mining systems. There were large divergences between the annotators. Hence text mining systems have the potential to achieve more consistent annotations. In addition they may be used to review existing annotations and arrive at better class definitions. This does not mean that human annotators are superfluous. In some cases text mining systems produce entirely wrong assignments as they are not able to "understand" the documents. For the dpa corpus, however, it is possible to assign part of the annotations automatically if the confidence threshold is set high enough. Currently dpa is implementing this automatic assignment procedure as it offers considerable economic savings, while the PAN cooperation is investigating alternative retrieval procedures.
References
1. DIETTERICH, T.G. (1997): Approximate statistical tests for comparing supervised classification learning algorithms. Technical report, Dept. of Computer Science, Oregon State University.
2. NADEAU, C. and BENGIO, Y. (2001): Inference for the generalization error. Technical report, Health Canada and Cirano Montreal.
3. RAJMAN, M., VESELY, M. and ANDREWS, P. (2003): Document processing and visualization techniques. Technical report, Nemis Network of Excellence in Text Mining and its Applications in Statistics.
4. SEBASTIANI, F. (2002): Machine learning in automated text categorization. ACM Computing Surveys, 34, 1–47.
Part-of-Speech Induction by Singular Value Decomposition and Hierarchical Clustering

Reinhard Rapp

Fachbereich Angewandte Sprach- und Kulturwissenschaft, Universität Mainz, 76711 Germersheim, Germany
Abstract. Part-of-speech induction involves the automatic discovery of word classes and the assignment of each word of a vocabulary to one or several of these classes. The approach proposed here is based on the analysis of word distributions in a large collection of German newspaper texts. Its main advantage over other attempts is that it combines the hierarchical clustering of context vectors with a previous step of dimensionality reduction that minimizes the effects of sampling errors.
1 Introduction
Most previous statistical work concerning parts of speech has been on tagging. Here the possible parts of speech for each word are assumed to be known, and the task is to choose the correct one when given a particular context. For example, the ambiguous word sound can be a noun, a verb or an adjective. But in the sentence "it is sound work" only the adjective reading is correct. Rather than with tagging, this paper deals with part-of-speech induction. This involves the discovery of syntactically motivated word classes, and the assignment of each word to one or several of these classes. Although manually created lexicons of word classes exist for many languages, an automatic system is nevertheless of interest. One motivation can be to simulate certain aspects of human cognition, i.e. to resemble human intuitions on word classes. Another is that there is still some disagreement among linguists on the classification of words (especially function words), so the output of a potentially more objective automatic system might be useful. In addition, an automatic system can provide information on the likelihood distribution for each of a word's parts of speech. Early studies related to statistical part-of-speech induction used class-based n-gram models (Brown et al., 1992), iterative clustering (Kneser & Ney, 1993), or a combination of a statistical and a neural network approach (Schütze, 1993). More recent work includes Clark (2003), who combines distributional and morphological information, and Freitag (2004), who uses a hidden Markov model in combination with co-clustering. To our knowledge, almost all of the previous work concerning part-of-speech induction has been on English, whereas in the current paper our language of study is German. We also propose new methodologies for classification and evaluation. In particular, we do not use abstract statistical measures
                 left neighbors                    right neighbors
           das   der   er    ist   viel      das   der   er    ist   viel
Ball         0    37    0      0      1        0     0    0     24      1
Buch        35     0    0      0      0        0     0    2     28      0
liest        5     1   17      0      8       12     3    1      0      8
Ofen         0    45    0      0      0        0     0    1     23      0
schwarz      0     0    0     27      1        0     0    0     18      1
weiß         4     1   19     23     10       11     2    1     15      6
Table 1. Matrix of relative frequencies between adjacent words.
such as perplexity or the F-measure for evaluation, as these measures make it difficult to check if the results agree with human intuitions. Instead we verify whether the automatically generated word classes are consistent with the word classes agreed upon by linguists, and whether the class assignments for each word are correct.
2 Methodology
Our aim is to discover syntactically motivated word classes. This means that classes are to be compiled in such a way that words belonging to the same class can substitute for one another in a sentence without affecting its grammaticality. Our work is based on the assumption that such words are distributionally similar concerning their left and right neighbors, i.e. that words belonging to the same class typically have the same neighbors with similar frequency distributions. Information on word neighborhoods as extracted from a corpus can be stored in a matrix as exemplified in Table 1. Note that to provide for the wide range of possible corpus frequencies this matrix typically contains relative frequencies, i.e. the value of 45 between Ofen and its left neighbor der means that in 45% of its occurrences the word Ofen is preceded by der (Rapp, 1996). For the discovery of word classes we compute the vector similarities between the rows of the matrix and then cluster the words according to these similarities. It can be expected that words with similar syntactic behavior will be assigned to the same classes. Ambiguous words such as weiß that can belong to several classes tend to be assigned to the cluster of their most frequent reading. To find out their additional classes we compute the difference between the vector of each word and the centroid of its closest cluster, and assign the differential vector to the most similar other cluster. This process can be repeated until the length of the differential vector falls below a threshold or, alternatively, its agreement with any of the centroids is too low. This way an ambiguous word is assigned to several parts of speech, starting from the most common and proceeding to the least common. Figure 1 illustrates this process.
Fig. 1. Constructing the parts of speech for sound.
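A compact sketch of this differential-vector procedure might look as follows, assuming dense word vectors and precomputed cluster centroids; the similarity and length thresholds shown here are illustrative placeholders (the actual values used in the experiments are given in section 4).

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def assign_parts_of_speech(word_vec, centroids, min_sim=0.3, min_norm=1e-3):
    """Assign a word to one or more clusters, most frequent reading first."""
    assigned, vec = [], word_vec.astype(float).copy()
    while True:
        sims = [(-np.inf if k in assigned else cosine(vec, c))
                for k, c in enumerate(centroids)]
        best = int(np.argmax(sims))
        if sims[best] < min_sim:               # agreement with any centroid too low
            break
        assigned.append(best)
        vec = vec - centroids[best]            # differential vector
        if np.linalg.norm(vec) < min_norm:     # residual vector too short
            break
    return assigned
```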
The main difficulty arising with the procedure as described above is the problem of data sparseness, i.e. each corpus can contain only a small fraction of all possible contexts of a word. Therefore it is desirable to discover regularities in the matrix and to make appropriate generalizations. This can be achieved by reducing the dimensionality of the matrix to the most important dimensions. An algebraic method to achieve this is Singular Value Decomposition (SVD). It has the property that when reducing the number of columns the similarities between the rows are preserved in the best possible way. In this work our claim is that to discover the basic parts of speech of a language we need very strong generalizations, which is why we reduce the number of columns in our matrices from several hundred thousand to only around five. Note that this reduction is much stronger than in other studies, where usually a hundred or more columns are retained (Dumais, 1990).
3 Implementation
As our corpus we use all articles of the years 1995 and 1996 that appeared in the German newspaper Frankfurter Allgemeine Zeitung, comprising about 60 million words. By looking at the co-occurrence frequencies of adjacent words in this text collection we computed a matrix analogous to the one shown in Table 1. Punctuation marks were treated the same way as words. For computational reasons, and because we wanted to present our results in graphical form, we restricted the number of rows to two test vocabularies on which our evaluations are based. One test vocabulary is a list of 50 German words as shown in Table 2. This list comprises five different parts of speech (nouns, verbs, adjectives, prepositions, conjunctions), each of which is represented by ten words. Because in this list (and in German in general) most words are unambiguous with regard to their parts of speech, for testing our capability of resolving ambiguities we put together another vocabulary of again 50 words where we included 17 ambiguous words that can assume more than one part of speech (see Table 3). This list includes only verbs and adjectives. Nouns were not included because in German, due to their capitalization, they are generally not ambiguous with regard to part of speech.¹
¹ Ambiguities resulting from sentence initial capitalization are not treated here as this seems to be an artificial constraint that is not reflected in the basic mechanisms of human language processing.
Nouns    Verbs      Adjectives  Prepositions  Conjunctions
Berg     arbeiten   dunkel      an            aber
Frau     befehlen   glatt       auf           als
Hand     essen      grün        bei           damit
Haus     legen      kalt        bis           indem
Mann     pfeifen    kurz        hinter        obwohl
Musik    schlafen   langsam     mit           oder
Obst     singen     müde        über          sondern
Stuhl    träumen    tief        unter         sowie
Tisch    trinken    weich       von           und
Tür      wünschen   weise       vor           wie
Table 2. List of words and their parts of speech.
For the first vocabulary, which includes many function words, our matrix has a size of 50 rows times 590,847 columns, and for the second vocabulary (with only content words) a size of 50 rows times 29,006 columns. The question is now whether it is sensible to apply some association measure, such as TF-IDF or the log-likelihood ratio, to the values in the matrix. However, in accordance with previous work (Rapp, 2005), preliminary experiments showed that some caution is required with such association functions, as inappropriate value characteristics may prevent the SVD from finding optimal dimensions. In particular, for best performance it seems essential that the logarithm is applied to the co-occurrence counts. After some experimentation with various possibilities we finally chose, empirically, the following formula, which yielded good results:

a_{ij} = \lg\left(1 + \frac{10^5 \cdot h_{ij}}{h_i}\right)
Here, h_{ij} is the frequency of common occurrence of the two words i and j, and h_i is the corpus frequency of the word relating to the respective row of the matrix. Adding one and multiplying by 10^5 seems to help optimize the value characteristic in our matrix so that the SVD can find good dimensions. Note that in this framework we are not aware of any well-founded theoretical studies justifying one or another formula, and further work in this direction seems necessary (Dumais, 1990). The next step in our procedure is applying the SVD with the aim of reducing the dimensionality of our matrix to the main dimensions (Rapp, 2005). After that a clustering algorithm is applied to the rows of the dimensionality-reduced matrix. We used the hierarchical clustering algorithm readily available in the MATLAB (MAtrix LABoratory) programming language. As our similarity measure we chose the cosine coefficient, which computes the cosine of the angle between two vectors. Of the linkage types provided by MATLAB, Ward's method achieved the best results. Average linkage and the centroid
method were also reasonably good. In comparison, single and complete linkage performed considerably worse, as they provide no mechanism for limiting the negative effects of outliers.
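The following Python sketch mirrors this pipeline: the log-transformed matrix from the formula above, a truncated SVD keeping about five dimensions, and hierarchical clustering of the reduced rows. The paper used MATLAB with the cosine measure and Ward's method; here the rows are L2-normalized before Ward linkage, which approximates the cosine measure, and the small random matrix is only a stand-in for the real co-occurrence counts with several hundred thousand columns.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
h = rng.poisson(2.0, size=(50, 2000)).astype(float)    # toy co-occurrence counts
h_i = h.sum(axis=1, keepdims=True)                     # corpus frequency per row word

# a_ij = lg(1 + 1e5 * h_ij / h_i)
a = np.log10(1.0 + 1e5 * h / np.maximum(h_i, 1.0))

# reduce to ~5 dimensions via SVD (keep the leading singular directions)
U, s, Vt = np.linalg.svd(a, full_matrices=False)
k = 5
reduced = U[:, :k] * s[:k]

# normalize rows so that Euclidean distance reflects cosine similarity,
# then cluster hierarchically with Ward linkage
norm = reduced / np.linalg.norm(reduced, axis=1, keepdims=True)
Z = linkage(norm, method="ward")
labels = fcluster(Z, t=5, criterion="maxclust")        # cut the dendrogram into 5 clusters
print(labels)
```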
4 Results
Figures 2 and 3 show the resulting dendrograms after applying the procedure described in the previous section, with the only difference that in generating Figure 2 the SVD step was omitted, whereas for Figure 3 a reduction to five dimensions was conducted. Without SVD the expected clusters of nouns, verbs, adjectives, prepositions, and conjunctions were found but are not separated very clearly. With SVD the five discovered clusters are much more salient, so that an automatic system would have no difficulties in identifying the number of clusters. Although with both methods all 50 words are assigned to their correct clusters, the comparison of the two dendrograms indicates that the SVD was capable of making appropriate generalizations. In many cases the SVD was successful in assigning infrequent words or outliers to their appropriate clusters. This finding was also confirmed in experiments with other vocabularies, in particular if the number of dimensions chosen for the SVD step agreed with the number of clusters to be found. Having shown that our statistical method produces clusters that are consistent with word classes as proposed by linguists, our next investigation concerns the question whether the method is also capable of assigning ambiguous words to several clusters. However, unlike in English, ambiguity with regard to basic parts of speech is relatively rare in German, which is why we decided to use for this purpose a special vocabulary enriched with ambiguous words, as described in the previous section. Applying our algorithm to this vocabulary resulted in the dendrogram shown in Figure 4. As can be seen from this diagram, the two expected clusters for adjectives and verbs have been discovered. However, unlike in the previous experiment (Figure 3), this time the adjective cluster consists of two distinct subclusters, where the right subcluster contains only ambiguous words that can also occur as verbs. This separation gives some indication that our representation is suitable for dealing with ambiguities. In order to automatically assign each word to its appropriate clusters, we applied the method described in section 2, that is, for each word we compute its differential vector to the centroid of its cluster and then assign the differential vector to the most similar other cluster. Since differential vectors could in principle be computed ad infinitum, we set two rather conservative thresholds for aborting this process. One is that we assume that a cosine similarity above 0.99 between a vector and the centroid of its class means that this vector fully agrees with the respective class and that we should stop looking for additional class assignments. The other is that we only assign a differential vector to a cluster if its cosine similarity to the cluster's centroid
Table 3. List of words and their parts of speech as computed (A = adjective, V = verb). The 50 test words are: aktualisierte, arbeiten, beabsichtigten, beachtete, beanspruchten, berechnete, blau, dick, drucken, dunkel, erbaute, erwischte, essen, gehen, glatt, halten, heilig, heiligen, hell, hoffen, jung, kalt, klein, konzentrierten, kurz, langsam, legen, machen, perfektionierte, reduzierte, reichen, rufen, schauen, schlafen, schnell, schwarz, singen, sprechen, stark, tief, träumen, trinken, verdächtigten, verdient, verdienten, warm, weich, weise, weiß, wünschen.
is higher than 0.8. If such a cluster cannot be found, we assume that the respective vector is caused by sampling errors. As a consequence, the search for additional class assignments is also aborted. The results from this procedure are shown in Table 3, where for each of the 50 words all computed parts of speech (verbs or adjectives) are given in the order in which they were obtained by the algorithm, i.e. the dominating assignments are listed first. Assignments that the algorithm missed are printed in bold. In no case did the algorithm make a wrong assignment. Overall, the algorithm had to make 100 decisions on class assignments, of which according to our judgement 97 are correct. In three cases, namely for the words aktualisierte, beabsichtigten, and heiligen, the algorithm only found the adjective assignment but missed the verb assignment. However, it is not clear to what extent these errors should be attributed to deficiencies of the algorithm or to a lack of corpus representativeness.
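A compact sketch of this assignment loop is given below. The thresholds (0.99 and 0.8) follow the description above; the cosine helper and the toy vectors and centroids are illustrative assumptions.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def assign_parts_of_speech(word_vec, centroids, full_match=0.99, min_match=0.8):
    """Return cluster indices for a word, dominating reading first."""
    vec = word_vec.copy()
    sims = [cos(vec, c) for c in centroids]
    best = int(np.argmax(sims))
    assigned = [best]                       # cluster of the most frequent reading
    while sims[best] <= full_match and len(assigned) < len(centroids):
        vec = vec - centroids[best]         # differential vector
        rest = [i for i in range(len(centroids)) if i not in assigned]
        sims = {i: cos(vec, centroids[i]) for i in rest}
        best = max(sims, key=sims.get)
        if sims[best] < min_match:          # residual looks like sampling noise
            break
        assigned.append(best)
    return assigned

# toy usage: two centroids, an ambiguous word vector in between
centroids = [np.array([1.0, 0.1, 0.0]), np.array([0.0, 0.2, 1.0])]
print(assign_parts_of_speech(np.array([0.9, 0.3, 0.7]), centroids))  # [0, 1]
```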
5 Discussion and Prospects
This work was inspired by previous work in word sense induction (Rapp, 2005). The results achieved indicate that part-of-speech induction based on the analysis of distributional patterns in text can be performed with good success. The study also gives some insight into how SVD is capable of improving the results by means of generalization. As a technical plus, reducing the size of the matrices speeds up the process of clustering. Since, with the exception of word segmentation, our approach does not require any language-specific information, it should in principle work for any language. Some experiments conducted with English confirmed this prediction. However, additional validation for other languages is required. A potential weakness of our algorithm when applied to large vocabularies is the possibility of overgeneralization.
Fig. 2. Syntactic similarities without SVD.
Fig. 3. Syntactic similarities with SVD (5 dimensions).
That is, the dimensionality reduction can have the effect that information on rare word classes is discarded. However, this can possibly be avoided by an approach of stepwise refinement: we first discover the fundamental parts of speech by applying the algorithm to a list of words comprising the basic vocabulary of a language, and then reapply it to each of the word lists associated with the discovered clusters. This process can be repeated several times. For best results, a mechanism that allows for the reassignment of initially misclassified minority items should be provided.
Fig. 4. Clustering of ambiguous words with SVD (2 dimensions).
Acknowledgements

I would like to thank Manfred Wettler, Christian Biemann, and Gisela Zunker-Rapp for their help, Hinrich Schütze for the SVD software, and the DFG (German Research Society) for financial support.
References

BROWN, P.F.; DELLA PIETRA, V.J.; DESOUZA, P.V.; LAI, J.C.; MERCER, R.L. (1992). Class-based n-gram models of natural language. Computational Linguistics 18(4), 467–479.
CLARK, A. (2003). Combining distributional and morphological information for part of speech induction. Proceedings of 10th EACL Conference, Budapest, 59–66.
DUMAIS, S. (1990). Enhancing Performance in Latent Semantic Indexing (LSI) Retrieval. Technical Report TM-ARH-017527, Bellcore, Morristown.
FREITAG, D. (2004). Toward unsupervised whole-corpus tagging. Proceedings of COLING, Geneva, 357–363.
KNESER, R.; NEY, H. (1993). Forming word classes by statistical clustering for statistical language modelling. In: R. Köhler, B.B. Rieger (eds.): Contributions to Quantitative Linguistics. The Netherlands: Kluwer, 221–226.
RAPP, R. (1996). Die Berechnung von Assoziationen. Hildesheim: Olms.
RAPP, R. (2005). Discovering the senses of an ambiguous word by clustering its local contexts. In: C. Weihs, W. Gaul (eds.): Classification – the Ubiquitous Challenge. Proceedings of the 28th Annual Conference of the GfKl. Berlin: Springer, 521–528.
SCHÜTZE, H. (1993). Part-of-speech induction from scratch. Proceedings of ACL, Columbus, Ohio, 251–258.
Near Similarity Search and Plagiarism Analysis

Benno Stein¹ and Sven Meyer zu Eissen²

¹ Faculty of Media, Media Systems, Bauhaus University Weimar, 99421 Weimar, Germany
² Faculty of Computer Science, Paderborn University, 33098 Paderborn, Germany
Abstract. Existing methods for text plagiarism analysis are mainly based on "chunking", a process of grouping a text into meaningful units, each of which gets encoded by an integer number. Together these numbers form a document's signature or fingerprint. An overlap of two documents' fingerprints indicates a possibly plagiarized text passage. Most approaches use MD5 hashes to construct fingerprints, which is bound up with two problems: (i) it is computationally expensive, and (ii) a small chunk size must be chosen to identify matching passages, which additionally increases the effort for fingerprint computation, fingerprint comparison, and fingerprint storage. This paper proposes a new class of fingerprints that can be considered as an abstraction of the classical vector space model. These fingerprints operationalize the concept of "near similarity" and enable one to quickly identify candidate passages for plagiarism. Experiments show that a plagiarism analysis based on our fingerprints leads to a speed-up by a factor of five and higher, without compromising the recall performance.
1 Plagiarism Analysis
Plagiarism is the act of claiming to be the author of material that someone else actually wrote (Encyclopædia Britannica, 2005). This definition relates to text documents, which are also the focus of this paper. Clearly, a question of central importance is to what extent such and similar tasks can be automated. Several techniques for plagiarism analysis have been proposed in the past; most of them rely on one of the following ideas.

Substring Matching. Substring matching approaches try to identify maximal matches in pairs of strings, which are then used as plagiarism indicators (Gusfield, 1997). Typically, the substrings are represented in suffix trees, and graph-based measures are employed to capture the fraction of the plagiarized sections (Baker (1993), Monostori et al. (2002, 2000)). Alternatively, Finkel et al. (2002) as well as Baker (1993) propose the use of text compression algorithms to identify matches.

Keyword Similarity. The idea here is to extract and weight topic-identifying keywords from a document and to compare them to the keywords of other documents. If the similarity exceeds a threshold, the candidate documents
are divided into smaller pieces, which are then compared recursively (Si et al. (1997); Fullam and Park (2002)). Note that this approach assumes that plagiarism usually happens in topically similar documents.

Fingerprint Analysis. The most popular approach to plagiarism analysis is the detection of overlapping text sequences by means of fingerprinting: Documents are partitioned into term sequences, called chunks, from which digital digests are computed that form the document's fingerprint. When the digests are inserted into a hashtable, collisions indicate matching sequences. Recent work that describes details and variants of this approach includes Brin et al. (1995), Shivakumar and Garcia-Molina (1996), and Finkel et al. (2002).
1.1 Contributions of this Paper
The overall contribution of this paper relates to the usage of fuzzy-fingerprints as an effective tool for plagiarism analysis. To understand different intentions for similarity search and plagiarism analysis we first introduce the distinction of local and global similarity. In fact, fuzzy-fingerprints can be understood as a combination of both paradigms, where the parameter “chunk size” controls the degree of locality. In particular, we use this distinction to develop a taxonomy of methods for plagiarism analysis. These considerations are presented in the following section. Section 3 reports on experiments that quantify interesting properties of our approach.
2 Fingerprinting, Similarity, and Plagiarism Analysis
In the context of information retrieval a fingerprint h(d) of a document d can be considered as a set of encoded substrings taken from d, which serve to identify d uniquely.¹ Following Hoad and Zobel (2003), the process of creating a fingerprint comprises four areas that need consideration.

1. Substring Selection. From the original document substrings (chunks) are extracted according to some selection strategy. Such a strategy may consider positional, frequency-based, or structural information.

2. Substring Number. The substring number defines the fingerprint resolution. There is an obvious trade-off between fingerprint quality, processing effort, and storage requirements, which must be carefully balanced. The more information of a document is encoded in the fingerprint, the more reliably a possible collision of two fingerprints can be interpreted.

3. Substring Size. The substring size defines the fingerprint granularity. A fine granularity makes a fingerprint more susceptible to false matches, while with a coarse granularity fingerprinting becomes very sensitive to changes.
¹ The term "signature" is sometimes also used in this connection.
4. Substring Encoding. The selected substrings are mapped onto integer numbers. Substring conversion establishes a hash operation where, aside from uniqueness and uniformity, efficiency is also an important issue (Ramakrishna and Zobel (1997)). For this, the popular MD5 hashing algorithm is often employed (Rivest (1992)).

If the main issue is similarity analysis and not unique identification, the entire document d is used during the substring formation step, i.e., the union of all chunks covers the entire document. The total set of integer numbers represents the fingerprint h(d). Note that the chunks may not be of uniform length but should be formed with the analysis task in mind.
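As an illustration of such chunk-based fingerprinting, the sketch below splits a document into overlapping fixed-length word chunks and encodes each chunk with an MD5 digest truncated to an integer. The chunk length and the truncation to 64 bits are illustrative choices, not values prescribed by the paper.

```python
import hashlib

def md5_fingerprint(text, chunk_len=5):
    """Set of integer digests, one per overlapping chunk of chunk_len words."""
    words = text.lower().split()
    digests = set()
    for i in range(max(len(words) - chunk_len + 1, 1)):
        chunk = " ".join(words[i:i + chunk_len])
        h = hashlib.md5(chunk.encode("utf-8")).hexdigest()
        digests.add(int(h[:16], 16))    # keep 64 bits of the digest
    return digests

fp_a = md5_fingerprint("the quick brown fox jumps over the lazy dog")
fp_b = md5_fingerprint("a quick brown fox jumps over the lazy cat")
print(len(fp_a & fp_b))                 # number of colliding chunks (here: 3)
```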
2.1 Local and Global Similarity Analysis
For two documents A and B let h(A) and h(B) be their fingerprints with the respective resolutions |h(A)| and |h(B)|. Following Finkel et al. (2002), a similarity analysis between A and B that is based on h(A) and h(B) measures the portion of the fingerprint intersection:

\varphi_{local}(A, B) = \frac{|h(A) \cap h(B)|}{|h(A) \cup h(B)|}
We call such a similarity measure local similarity or overlap similarity, because it directly relates to the number of identical regions. By contrast, the vector space model along with the cosine measure does not depend on identical regions: Two documents may have a similarity of 1 even though they do not share any 2-gram. The vector space model along with the cosine measure has a global characteristic because it quantifies the term frequencies of the entire document; in particular, the model neglects word order. Figure 1 contrasts the principles of local and global similarity analysis pictorially. Basically, a fingerprint h(d) of a document d is nothing more than a special document model of d. In this sense, every information retrieval task that is based on a standard document model can also be operationalized with fingerprints. However, fingerprint methods are more flexible since they can be targeted specifically towards one of the following objectives: (1) compactness, with respect to the document length; (2) fidelity, with respect to a local similarity analysis. It is difficult to argue whether a fingerprint should be preferred to a standard document model in order to tackle a given information retrieval task. To better understand this problem of choosing an adequate document model we have developed a taxonomy of approaches to plagiarism analysis, which is shown in Figure 2. The approaches as well as the methods can be divided into local and global strategies. Note that in the literature on the subject local plagiarism analysis methods are encountered more often than global analysis methods. This is in the nature of things, since expropriating
Fig. 1. Two documents A and B which are analyzed regarding their similarity. The left-hand side illustrates a measure of local similarity: All matching contiguous sequences (chunks) with a length ≥ 5 words are highlighted. The right-hand side illustrates a measure of global similarity: Here the common word stems (without stopwords) of document A and B are highlighted. Observe that both similarity analyses may lead to the same similarity assessment.
the exact wording of another author often relates to text passages rather than to the entire text. At the second level our taxonomy differentiates the local approaches with respect to the comparison rigor, and the global approaches with respect to statistical analysis versus style analysis. Among the shown approaches, the chunk identity analysis, usually operationalized with the MD5 hashing algorithm, is the most popular approach to plagiarism analysis. Nevertheless, the method comes along with inherent disadvantages: (i) it is computationally expensive, and (ii) a small chunk size must be chosen (3-10 words), which has a negative impact on both retrieval and storage performance. Observe that all mentioned problems can be countered if the chunk size is drastically increased. This, however, requires some kind of fingerprint that operationalizes a "relaxed" comparison concept. The following subsection addresses this problem. It introduces fuzzy-fingerprints, which are specifically tailored to text documents and which provide the desired feature: an efficient means for near similarity analysis.
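To make the local/global distinction concrete, the following sketch compares two documents once by the overlap of their chunk fingerprints (local, as in the formula for φ_local above) and once by the cosine of their global term vectors. The fingerprint sets are stand-in integer digests and the two documents are toy inputs, used only for illustration.

```python
import math
from collections import Counter

def local_similarity(fp_a, fp_b):
    """Jaccard overlap of two fingerprint sets."""
    return len(fp_a & fp_b) / max(len(fp_a | fp_b), 1)

def global_similarity(text_a, text_b):
    """Cosine similarity of the global term-frequency vectors."""
    ta, tb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(ta[w] * tb[w] for w in ta)
    na = math.sqrt(sum(v * v for v in ta.values()))
    nb = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (na * nb)

a = "knowledge over search is obvious on the one hand but too simple on the other"
b = "search over knowledge is simple on the one hand but too obvious on the other"
fp_a = {101, 202, 303, 404}     # stand-in chunk digests for document A
fp_b = {202, 303, 505, 606}     # stand-in chunk digests for document B
print(global_similarity(a, b))          # close to 1: same words, different order
print(local_similarity(fp_a, fp_b))     # 0.33: only two of six distinct chunks match
```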
2.2 Fingerprints that Capture Near Similarity
While most fingerprint approaches rely on the original document d, from which chunks are selected and given to a hashing algorithm, our approach is based on the vector space model representation of the chunks. The key idea is a comparison of the distribution of the index terms in each chunk with regard to their expected term frequency classes.
[Figure 2 distinguishes, under plagiarism analysis, local similarity approaches (chunk similarity analysis, chunk identity analysis) from global similarity approaches (term occurrence analysis, style analysis, text structure analysis). The associated methods comprise order-preserving methods (shingling, suffix-tree model with tree-cover), order-neglecting methods (fuzzy-fingerprint, vector space model with cos-measure), MD5 hashing, and linguistic analysis.]
Fig. 2. A taxonomy of approaches and methods to plagiarism analysis.
In particular, we abstract the concept of term frequency classes towards prefix frequency classes by comprising index terms into a small number of equivalence classes, such that all terms from the same equivalence class start with a particular prefix. Then, grounded on the analysis of large corpora, a reference distribution of index term frequencies can be computed, and, for a predefined set of prefixes, the a priori probability of a term being a member of a certain prefix class can be stated. The deviation of a chunk's term distribution from these a priori probabilities forms a chunk-specific characteristic that can be encoded as a small integer. The basic procedure for constructing a fuzzy-fingerprint hϕ(d) for a document d is as follows:²

1. Formation of a set C of chunks for d such that the extracted substrings c ∈ C cover d.
2. For each chunk c ∈ C:
   (a) Computation of the vector space model c of c.
   (b) Computation of pf, the vector of relative frequencies of the prefix classes for the index terms in c.
   (c) Computation of ∆pf, the vector of relative deviations of pf w.r.t. the expected prefix class distribution in the British National Corpus.
² Actually, the procedure is technically much more involved. It includes an algorithm for chunking, the determination of suited prefix classes, the computation of a reference distribution, and the identification as well as application of fuzzification schemes. Details can be found in Stein (2005).
   (d) Fuzzification of ∆pf by abstracting the exact deviations towards a fuzzy deviation scheme with r intervals, and computation of a hash value γ:

\gamma = \sum_{i=0}^{k-1} \delta_i \cdot r^i, \quad \text{with } \delta_i \in \{0, \ldots, r-1\}
where k is the number of prefix classes and δi is the fuzzified deviation of the frequency of prefix class i.

3. Formation of hϕ(d) as the union of the hash values γc, c ∈ C.

Remarks. The granularity of the fingerprint is controlled within three dimensions: by the number of chunks, |C|, in Step 1, by the number of equivalence classes, k, in Step 2b, and by the resolution of the fuzzy deviation scheme, r, in Step 2d.
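The following sketch computes such a chunk hash γ for one chunk. The prefix classes, the reference probabilities, and the two-threshold fuzzification scheme are illustrative assumptions standing in for the BNC-derived reference distribution and the fuzzification schemes described in Stein (2005).

```python
def chunk_hash(chunk_terms, prefixes, reference, r=3, thresholds=(0.05, 0.15)):
    """Hash one chunk from the deviation of its prefix-class distribution."""
    k = len(prefixes)
    counts = [0] * k
    for term in chunk_terms:
        for i, p in enumerate(prefixes):
            if term.startswith(p):
                counts[i] += 1
                break
    total = max(sum(counts), 1)
    pf = [c / total for c in counts]                     # relative prefix frequencies
    dev = [abs(pf[i] - reference[i]) for i in range(k)]  # deviation from reference
    # fuzzify each deviation into one of r intervals (0, 1, or 2 here)
    delta = [sum(d > t for t in thresholds) for d in dev]
    return sum(delta[i] * r ** i for i in range(k))      # gamma = sum delta_i * r^i

prefixes = ("a", "s", "t")                 # hypothetical prefix classes
reference = (0.10, 0.12, 0.18)             # hypothetical a priori probabilities
chunk = "the analysis of similar text passages shows a strong overlap".split()
print(chunk_hash(chunk, prefixes, reference))
```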
3 Runtime Performance and Classification Characteristic
This section presents results from a comparative analysis of the fuzzy-fingerprinting approach. In particular we investigate the following questions:

1. Runtime Performance. To what extent is plagiarism identification accelerated compared to MD5 fingerprinting?
2. Classification Characteristic. How does fuzzy-fingerprinting relate to other local (MD5 fingerprinting) and global (vector space model) similarity measures?

To answer these questions we set up different plagiarism experiments. The following plots result from a setting where the RFC collection of the Internet Society was chosen as basis: It comprises about 3000 documents with a considerable part of versioned sections. From this collection 50 documents were drawn randomly and compared to eight collection subsets with sizes between 100 and 800 documents. Since this comparison relied on the documents' fingerprint representations, the number of observed collisions corresponds directly to the runtime of the plagiarism analysis. Figure 3 reflects this fact: It shows the development of the hash collisions (left) as well as the entire analysis time (right). The main reason for the large performance difference stems from the fact that fuzzy-fingerprinting allows for chunk sizes of 100 words on average, while MD5 fingerprinting works acceptably only for chunk sizes of 3 to 10 words. The plot on the left-hand side in Figure 4 gives an answer to the question of how a document's local and global similarity analyses are related. It shows the deviation of fingerprint-based similarity values compared to the respective cosine similarity values under the vector space model, averaged over all documents of the RFC collection; observe that fuzzy-fingerprints resemble the cosine similarity better than MD5 fingerprints do. Especially against this
Fig. 3. Runtime performance of a plagiarism analysis task: 50 documents are compared to different subsets of the RFC collection. The figures show the runtime expressed in the number of fingerprint collisions (left) as well as in seconds (right).
Fig. 4. Classification characteristics within the above plagiarism analysis task. Left: Similarity deviation of fingerprinting compared to the cosine similarity under the vector space model. Right: Similarity deviation of fuzzy-fingerprinting compared to optimum MD5 fingerprinting.
background the plot on the right-hand side in Figure 4 must be interpreted: The rather small deviation between fuzzy-fingerprints and the “optimum” fingerprint, which is a fine-grained MD5 fingerprint of chunk size 3, illustrates the robustness of fuzzy-fingerprinting.
4 Summary
To identify plagiarized versions of a document or of some of its parts, similarity analyses must be performed. In this connection the paper introduced the distinction between local and global similarity measures. Local similarity measures answer the question of what portion of two documents is identical; global similarity measures answer the question of how similar two documents are as a whole. This is a subtle but important difference, which leads to a taxonomy of methods for plagiarism analysis. Local methods for plagiarism analysis are based on fingerprinting, and in this paper we propose a new class of fuzzy-fingerprints that can be considered as an abstraction of the classical vector space model. These fingerprints allow
for chunk sizes that are an order of magnitude larger than the typical MD5 digesting chunk sizes. As a consequence, the identification of plagiarism candidates is accelerated significantly (by more than a factor of five), while the size of the fingerprint database is reduced at the same time. Our experiments also show the robustness of these fingerprints with respect to both large variations in the chunk size and the similarity range. Altogether, these properties make the concept of fuzzy-fingerprinting an ideal tool for plagiarism analysis and near similarity search in large document collections.
References

BAKER, B.S. (1993): On finding duplication in strings and software. http://cm.bell-labs.com/cm/cs/papers.html
BRIN, S., DAVIS, J., and GARCIA-MOLINA, H. (1995): Copy detection mechanisms for digital documents. SIGMOD ’95, 398–409, New York, NY, USA. ACM Press.
ENCYCLOPÆDIA BRITANNICA (2005): New Frontiers in Cheating. http://www.britannica.com/eb/article?tocId=228894
FINKEL, R.A., ZASLAVSKY, A., MONOSTORI, K., and SCHMIDT, H. (2002): Signature Extraction for Overlap Detection in Documents. Proc. 25th Australian Conference on Computer Science, 59–64. Australian Computer Society.
FULLAM, K., and PARK, J. (2002): Improvements for scalable and accurate plagiarism detection in digital documents. http://www.lips.utexas.edu/~kfullam/pdf/DataMiningReport.pdf
GUSFIELD, D. (1997): Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press.
HOAD, T.C., and ZOBEL, J. (2003): Methods for Identifying Versioned and Plagiarised Documents. American Society for Information Science and Technology, 54(3):203–215.
MONOSTORI, K., FINKEL, R., ZASLAVSKY, A., HODÁSZ, G., and PATAKI, M. (2002): Comparison of overlap detection techniques. LNCS, volume 2329.
MONOSTORI, K., ZASLAVSKY, A., and SCHMIDT, H. (2000): Document overlap detection system for distributed digital libraries. DL ’00, 226–227, New York, NY, USA. ACM Press.
RAMAKRISHNA, M.V., and ZOBEL, J. (1997): Performance in Practice of String Hashing Functions. Proc. Intl. Conf. on Database Systems for Advanced Applications, Australia.
RIVEST, R.L. (1992): The MD5 message-digest algorithm. http://theory.lcs.mit.edu/~rivest/rfc1321.txt
SHIVAKUMAR, N., and GARCIA-MOLINA, H. (1996): Building a scalable and accurate copy detection mechanism. DL ’96, 160–168, New York, NY, USA. ACM Press.
SI, A., LEONG, H.V., and LAU, R.W.H. (1997): Check: a document plagiarism detection system. SAC ’97, 70–77, New York, NY, USA. ACM Press.
STEIN, B. (2005): Fuzzy-Fingerprints for Text-based Information Retrieval. In: Tochtermann and Maurer (eds.): 5th Intl. Conf. on Knowledge Management (I-KNOW 05), Graz, Austria, JUCS. Know-Center.
Objective Function-based Discretization

Frank Höppner

Fachhochschule Braunschweig/Wolfenbüttel, 38440 Wolfsburg
Abstract. Decision tree learners inspect marginal class distributions of numerical attributes to infer a predicate that can be used as a decision node in the tree. Since such discretization techniques examine the marginal distribution only, they may fail completely to predict the class correctly even in cases for which a decision tree with a 100% classification rate exists. In this paper, an objective function-based clustering algorithm is modified to yield a discretization of numerical variables that overcomes these problems. The underlying clustering algorithm is the fuzzy c-means algorithm, which is modified (a) to take the class information into account and (b) to organize all cluster prototypes in a regular grid such that the grid rather than the individual clusters is optimized.
1 Introduction
Let us consider an easy problem for a classifier in the sense that a 100% classification rate can be achieved by, e.g., the decision tree in Fig. 1. The two-dimensional data set is depicted schematically in Fig. 2: it has two numerical attributes and a binary label. All data objects are distributed uniformly over the rectangular area [0, 6] × [0, 4], and both classes are equally likely. However, if we apply OneR, a decision tree (DT) learner, or a Naive Bayes Classifier (NBC) to this dataset, we achieve classification rates of only 50%. Why? OneR considers only a single variable to predict the class information, but when projecting the data set onto one axis, both classes are equally likely (cf. Fig. 2). Thus, a rule with a single variable has no chance to achieve more than a 50% classification rate. The NBC assumes that the individual attributes are independent given the class, which is obviously not true here; the 'combination' of one-attribute predictions in the NBC therefore must also fail. A decision tree that captures the underlying model of the data set is shown in Fig. 1, but standard decision tree learners will not come up with this model. This is because decision tree learners discretize numerical attributes first or introduce binary predicates for the decision nodes. A splitting algorithm will find, for either axis, that the average number of instances in both classes is the same at any point along the margin. Thus, the marginal class distribution cannot give any hints on how to discretize the individual axes. A binary split algorithm, as described in (Mitchell, 1997), would choose a random (or at best no) split. Multisplit algorithms (Elomaa and Rousu, 1996) may come up with an equidistant splitting in this case.
[Decision tree: the root splits on x into the ranges x ≤ 2, 2 < x < 5, and 5 ≤ x; each branch then tests y ≤ 1; the leaves predict gray or white (gray/white, white/gray, gray/white for the three x ranges, depending on the outcome of the y test).]
Fig. 1. The underlying model for the simple data set in Fig. 2.
Fig. 2. A situation in which binary splitting as described in (Mitchell, 1997) fails. When projecting the data onto either axis, samples of both classes occur equally often.
This example is, of course, artificially generated, but similar situations indeed occur in practice. The Data Mining Cup 2001 data set had quite a number of attributes, and most of them exhibited an equally distributed class information in their marginal distributions. Such datasets have also been observed in business data, where some attributes were derived from others by business rules. In such cases it remains unclear whether the poor performance is due to the preprocessing of numerical variables or due to a lack of structure in the data. In our example, the failure is due to the poor discretization of the numerical attributes; the binary splitting algorithm does not help here. A simple way to improve the results of the decision tree learner is to (1) perform a, say, k-means clustering and (2) add the associated cluster as an additional attribute to the dataset. The cluster ids now encode information about both the x and y values, and as such overcome the deficiency of all algorithms that look only at a single attribute at a time. In the remainder of this paper, we develop a multisplit algorithm for converting numerical attributes to nominal attributes by means of a modified clustering algorithm. In the next section, we briefly review the clustering algorithm we want to modify (a variant of
k-means). In section 3 we modify the clustering algorithm to make use of the class information and in section 4 we put some constraints on the position of clusters to achieve a partition of individual attributes.
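The simple cluster-id trick mentioned above can be sketched as follows: cluster the joint feature space, use the cluster id as an additional (nominal) attribute, and observe that a single-attribute rule on the cluster id performs far better than a single-attribute rule on any raw attribute. The XOR-style class pattern is a simplified stand-in for the data set of Fig. 2, and the OneR-style rule, the number of clusters, and the use of scikit-learn are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.uniform([0.0, 0.0], [6.0, 4.0], size=(4000, 2))   # uniform over [0,6] x [0,4]
y = ((X[:, 0] > 3) ^ (X[:, 1] > 2)).astype(int)            # class depends on x and y jointly

def one_attribute_accuracy(feature, y):
    """Best single-threshold rule on one attribute (OneR-style)."""
    best = 0.0
    for t in np.quantile(feature, np.linspace(0.05, 0.95, 19)):
        pred = (feature > t).astype(int)
        best = max(best, (pred == y).mean(), (pred != y).mean())
    return best

print(one_attribute_accuracy(X[:, 0], y))   # ~0.5: marginal of x is uninformative
print(one_attribute_accuracy(X[:, 1], y))   # ~0.5: marginal of y is uninformative

# add the k-means cluster id as an additional attribute and predict per cluster
cid = KMeans(n_clusters=12, n_init=10, random_state=0).fit_predict(X)
majority = {c: int(round(y[cid == c].mean())) for c in np.unique(cid)}
pred = np.array([majority[c] for c in cid])
print((pred == y).mean())                   # well above 0.5: cluster ids encode x and y jointly
```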
2 Objective Function-based Clustering
In this section we briefly review the fuzzy c-means (Bezdek, 1981) and related algorithms; for a thorough overview of objective function-based fuzzy clustering see (Höppner et al., 1999), for instance. Let us denote the membership degree of data object x_j ∈ X, j ∈ {1, ..., n}, to cluster p_i ∈ P, i ∈ {1, ..., c}, by u_{i,j} ∈ [0, 1]. The higher u_{i,j}, the more does data object x_j belong to cluster p_i. Denoting the distance of a data object x_j to a cluster prototype p_i by d(x_j, p_i), we minimize the objective function

J_m(P, U; X) = \sum_{j=1}^{n} \sum_{i=1}^{c} u_{i,j}^m \, d^2(x_j, p_i)    (1)

where the so-called "fuzzifier" m is chosen in advance and influences the fuzziness of the final partition (crisp as m → 1 and totally fuzzy as m → ∞; common values for m lie between 1.5 and 4, with 2 most frequently used). The objective function is minimized iteratively subject to the constraints

\forall 1 \le j \le n: \; \sum_{i=1}^{c} u_{i,j} = 1, \qquad \forall 1 \le i \le c: \; \sum_{j=1}^{n} u_{i,j} > 0    (2)

The first constraint makes sure that all data objects contribute in equal parts to the clusters and the second ensures that none of the clusters are empty. In every iteration step, minimization with respect to u_{i,j} and p_i is done separately. The necessary conditions for a minimum yield update equations for both half-steps. Independent of the choice of the distance function and the prototypes, the membership update equation is

u_{i,j} = \frac{1}{\sum_{k=1}^{c} \left( \frac{d^2(x_j, p_i)}{d^2(x_j, p_k)} \right)^{\frac{1}{m-1}}}    (3)

In the most simple case of fuzzy c-means, where the prototypes – to be interpreted as cluster centers – are vectors of the same dimension as the data vectors and the distance function is the Euclidean distance, we obtain

p_i = \frac{\sum_{j=1}^{n} u_{i,j}^m x_j}{\sum_{j=1}^{n} u_{i,j}^m}    (4)
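A minimal NumPy sketch of this alternating optimization, directly following update equations (3) and (4), is given below; the random initialization, the stopping tolerance, and the toy data are illustrative choices.

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    p = X[rng.choice(len(X), c, replace=False)]                # initial prototypes
    for _ in range(iters):
        d2 = ((X[None, :, :] - p[:, None, :]) ** 2).sum(-1) + 1e-12   # c x n squared distances
        w = d2 ** (-1.0 / (m - 1.0))
        u = w / w.sum(axis=0, keepdims=True)                   # membership update, eq. (3)
        um = u ** m
        p_new = (um @ X) / um.sum(axis=1, keepdims=True)       # prototype update, eq. (4)
        if np.abs(p_new - p).max() < tol:
            p = p_new
            break
        p = p_new
    return u, p

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
u, prototypes = fuzzy_c_means(X, c=2)
print(prototypes)
```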
Fig. 3. Another simple example data set with clear decision borders. Without considering the class borders (straight lines) the obtained clusters may lie directly on these borders (left), and their utility for a classifier is limited. The modification of this section yields pure clusters (right).
3 Using Class Information
Figure 3 shows on the left-hand side a data set with class borders drawn as thin lines. If we run a (fuzzy) c-means algorithm on this data set, we may come up with cluster prototypes like those indicated by the larger crosses. It becomes evident that some clusters may lie directly on the border between two classes, making such a cluster less useful for the decision tree learner. We do not only want to obtain a partitioning of the data set; we additionally desire pure clusters, that is, the data objects within one cluster should (if possible) all belong to the same class. One could, of course, split the data set such that each new set contains only data objects of a specific class. Clustering could then be performed on each class-specific subset. If the classes are reasonably well separated from each other, this could indeed help to improve the purity of the clusters. However, our intention is not to use the clusters themselves, but to derive a discretization of each variable (cf. the modifications in the next section). Having multiple clusterings would require a merging step to finally come up with one discretization of each axis. To avoid this re-combination, we stick to the analysis of the whole data set at once. In this section, we modify the fuzzy c-means algorithm to consider the class information. Usually, such information is not present in clustering tasks, but it can be used if clustering is part of the preprocessing phase (as in our case) or if only partially labelled datasets are available. Such modifications can already be found in the literature, e.g. (Pedryzc and Waletzky, 1997), (Stutz, 1998) and (Timm, 2001), but the approach formulated here emphasizes the similarity of the two objectives of obtaining compact and pure clusters. In the following we transform the class label information into a kind of distance matrix δ. First, we replace the class attribute by indicator variables
id   x    y    class            id   δA  δB  δC           id   δA  δB  δC           id   δA  δB  δC
#1   2.4  3.1  C        (1) →   #1   0   0   1     (2) →  #1   1   1   0     (3) →  #1   1   1   0
#2   2.0  1.2  B                #2   0   1   0            #2   1   0   1            #2   3   0   3
...                             ...                       ...                       ...
Fig. 4. Conversion of labels to cost values δ.
for each possible class (for class A we have I_A(x) = 1 iff x belongs to class A, and 0 otherwise; step (1) in Fig. 4). Next, we invert these indicator values (replace 1 by 0 and vice versa) to allow for an interpretation as cost or distance values (the first data object does not belong to class A, therefore there is a positive distance δ_A; step (2)). Finally, we allow cost values other than 0/1 to obtain a cost-sensitive (even data-sensitive) clustering: If it is more important to classify B objects correctly than A or C, we increase the cost value for these data objects (step (3)). If the correct classification of some data object x_j is very important, we may also increase its misclassification cost locally. If no label is given for a specific record, all costs may be set to zero. We then seek a (second) membership matrix \hat{u}_{i,k} that assigns a cluster prototype p_i to a class k. Constraints analogous to (2) hold for \hat{u}_{i,k}: A cluster must be assigned to at least one class, and the sum of membership values must be the same for all clusters. We can now define a kind of cluster-class distance \hat{d}_{i,k} by means of the individual data-class distances δ_{j,k}: The distance \hat{d}_{i,k} of a cluster i to a class k can be understood as the sum of the distances δ_{j,k} of those data objects x_j that are assigned to the cluster i (with degree u_{i,j}):

\hat{d}_{i,k} = \sum_{j=1}^{n} u_{i,j}^m \cdot \delta_{j,k}
Now we are in the position to formalize the final objective function: The first objective of fuzzy c-means is that clusters shall represent data objects, which is expressed by minimizing the cluster-data distance. The second (new) objective is that clusters shall represent classes, which we may express by a minimization of the cluster-class distance as defined above. So we obtain in total

J = \underbrace{\sum_{j=1}^{n} \sum_{i=1}^{c} u_{i,j}^m d_{i,j}}_{\text{data-cluster relationship}} + \underbrace{\sum_{i=1}^{c} \sum_{k=1}^{s} \hat{u}_{i,k}^m \hat{d}_{i,k}}_{\text{cluster-class relationship}}
  = \sum_{j=1}^{n} \sum_{i=1}^{c} u_{i,j}^m \left( d_{i,j} + \sum_{k=1}^{s} \hat{u}_{i,k}^m \delta_{j,k} \right)

subject to \sum_{i=1}^{c} u_{i,j} = 1 and \sum_{k=1}^{s} \hat{u}_{i,k} = 1, where s is the number of classes. The modified objective function J clearly shows the similarity of the two objectives: the first sum corresponds to the condensation of n data objects to
c clusters, the second sum corresponds to the condensation of the c clusters to s classes. The two parts of J can be balanced by means of the δ-values. Minimizing this objective function is done in the same alternating fashion as in the original fuzzy c-means, but we have an additional membership matrix to consider. The necessary conditions for a (local) minimum of the objective function yield the following update steps: the first step optimizes u (assuming \hat{u}_{i,k} and p_i to be constant), the second step optimizes \hat{u} (assuming u_{i,j} and p_i to be constant), and the third step finally optimizes p_i (assuming all memberships to be constant). So the variables are updated as follows:

1. u_{i,j} = \left( \sum_{k=1}^{c} \left( \frac{d_{i,j}}{d_{k,j}} \right)^{\frac{1}{m-1}} \right)^{-1}  where  d_{i,j} = d^2(x_j, p_i) + \sum_{k} \hat{u}_{i,k}^m \delta_{j,k}

2. \hat{u}_{i,k} = \left( \sum_{l=1}^{s} \left( \frac{d_{i,k}}{d_{i,l}} \right)^{\frac{1}{m-1}} \right)^{-1}  where  d_{i,k} = \hat{d}_{i,k}
3. Determine the prototype centers by (4). These update equations are repeated iteratively until the change in the prototypes drops below some predefined threshold. The resulting algorithm yields partitions that respect the decision borders of the classes, as can be seen in the right example of Fig. 3.
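A sketch of these three update steps is given below, building on the plain fuzzy c-means sketch from Section 2 and assuming that the class penalty enters with the exponent m as in the objective function above. The cost matrix delta, the fuzzifier, and the initialization are illustrative assumptions.

```python
import numpy as np

def class_aware_fcm(X, delta, c=4, m=2.0, iters=100, seed=0):
    """Fuzzy c-means with an additional cluster-to-class membership matrix."""
    rng = np.random.default_rng(seed)
    n, s = delta.shape                                    # n data objects, s classes
    p = X[rng.choice(len(X), c, replace=False)]           # prototypes
    u_hat = np.full((c, s), 1.0 / s)                      # cluster-to-class memberships
    for _ in range(iters):
        d2 = ((X[None, :, :] - p[:, None, :]) ** 2).sum(-1) + 1e-12   # c x n
        # step 1: data-to-cluster memberships with class-penalized distances
        d_eff = d2 + (u_hat ** m) @ delta.T               # d_ij = d^2 + sum_k u_hat^m * delta_jk
        w = d_eff ** (-1.0 / (m - 1.0))
        u = w / w.sum(axis=0, keepdims=True)
        # step 2: cluster-to-class memberships from cluster-class distances
        d_hat = (u ** m) @ delta + 1e-12                  # c x s
        w2 = d_hat ** (-1.0 / (m - 1.0))
        u_hat = w2 / w2.sum(axis=1, keepdims=True)
        # step 3: prototype update as in standard fuzzy c-means, equation (4)
        um = u ** m
        p = (um @ X) / um.sum(axis=1, keepdims=True)
    return u, u_hat, p

# toy data: two Gaussian blobs with labels 0 and 1; delta_jk = 0 iff j belongs to class k
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(4, 1, (60, 2))])
labels = np.repeat([0, 1], 60)
delta = np.ones((120, 2)); delta[np.arange(120), labels] = 0.0
u, u_hat, prototypes = class_aware_fcm(X, delta, c=4)
print(np.round(u_hat, 2))    # each cluster should lean towards one class
```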
4 Grid of Prototypes
By means of the modification in the previous section, we obtain clusters that are 'most' useful for a classifier, because the clusters are as pure as possible with respect to their class distribution. But if we used, say, 10 clusters in a 5-dimensional space in a decision tree, how could we interpret the tree? If a distinction on the cluster is used at some node, a user would have to imagine how the clusters are distributed in the 5-dimensional space, because their relative positions to each other determine the cluster borders. Interpretability of high-dimensional clusters can be supported by distributing clusters regularly on a grid, parallel to the main axes. Then, from the position on each individual axis, the space that is occupied by the cluster is clearly defined. From the grid we obtain a partition of each individual axis and may use these partitions to improve the performance of a decision tree learner, since they have been obtained from the joint rather than the marginal class distribution. To achieve this, we must force all cluster prototypes to lie on a grid. Rather than having all prototypes p_i independent of each other, we now reorganize and parameterize them. Given k input variables, suppose we want to divide the domain of variable v_i into N_i representative values p_{i,j}, j ∈ {1, ..., N_i}. Then we have a tessellation of the data space into \prod_{i=1}^{k} N_i regions. We denote any region by a tuple of indices (i_1, i_2, ..., i_k) ∈ I, with I = {(i_1, i_2, ..., i_k) | i_j ∈ {1, ..., N_j}, j ∈ {1, ..., k}}.
The conventional prototypes p_i are now replaced by prototypes (p_{1,i_1}, p_{2,i_2}, ..., p_{k,i_k}). The complete set of c prototypes, c = \prod_{i=1}^{k} N_i, is given by

P = \{ (p_{1,i_1}, p_{2,i_2}, \cdots, p_{k,i_k}) \mid (i_1, i_2, ..., i_k) \in I \}

The standard fuzzy c-means objective function turns into

J = \sum_{j=1}^{n} \sum_{(i_1,i_2,...,i_k) \in I} u_{(i_1,i_2,...,i_k),j}^m \left\| \begin{pmatrix} x_1 - p_{1,i_1} \\ x_2 - p_{2,i_2} \\ \vdots \\ x_k - p_{k,i_k} \end{pmatrix} \right\|^2    (5)

Here, we use the index (i_1, i_2, ..., i_k) ∈ I of a grid cell to index the prototypes. Contrary to standard fuzzy c-means, where each prototype is determined separately, we now have to determine the p_{i,j} values jointly for multiple prototypes. The necessary conditions for a minimum of the objective function (5) under the constraints (2) are given by

p_{l,r} = \frac{\sum_{j=1}^{n} \sum_{(i_1,i_2,...,i_k) \in I,\; i_l = r} u_{(i_1,i_2,...,i_k),j}^m \cdot x_l}{\sum_{j=1}^{n} \sum_{(i_1,i_2,...,i_k) \in I,\; i_l = r} u_{(i_1,i_2,...,i_k),j}^m}    (6)
In the modification of the previous section, the optimal prototypes were obtained in the same way as they are obtained in standard fuzzy c-means. The class information influences the prototypes only indirectly via the membership degrees. To force the prototypes to lie on a regular grid, the third update step in section 2 has to use (6) instead of (4). In the other two steps, the parameterized prototypes have to be used, but the equations remain the same. It is important to notice that, although the number of prototypes on the grid increases exponentially (\prod_{i=1}^{k} N_i), the number of parameters increases only linearly with the number of dimensions (\sum_{i=1}^{k} N_i).
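The grid-constrained prototype update (6) can be sketched for two dimensions as follows; memberships are stored as an N1 × N2 × n array so that the sums in (6) over all grid cells sharing the same axis index become simple axis reductions. The random memberships are only a stand-in for the values obtained in the other two update steps.

```python
import numpy as np

def grid_prototype_update(u, X, m=2.0):
    """Update the per-axis prototype values p[l][r] according to equation (6),
    for a 2-D grid of clusters indexed by (i1, i2)."""
    N1, N2, n = u.shape
    um = u ** m
    p = []
    for axis, N in ((0, N1), (1, N2)):
        other = 1 - axis
        w = um.sum(axis=other)                  # sum over cells sharing the same index, shape (N, n)
        p_axis = (w @ X[:, axis]) / w.sum(axis=1)
        p.append(p_axis)
    return p                                    # p[0]: N1 values for x1, p[1]: N2 values for x2

rng = np.random.default_rng(0)
X = rng.uniform([0, 0], [6, 4], size=(500, 2))
u = rng.random((3, 3, 500))
u /= u.sum(axis=(0, 1), keepdims=True)          # memberships of each point sum to 1
print(grid_prototype_update(u, X))
```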
5 An Example
In Fig. 5 the resulting partitions for N1 = N2 = 3 and N1 = N2 = 6 are shown for the data set of Fig. 2. One can easily recognize the regular grid on which the clusters are forced to lie. For the case of 6 clusters per axis, rows two and three almost coincide and the last two rows are very close to each other. This would never be the case for fuzzy c-means if the class information were ignored. Here, this is a desired result, because otherwise the clusters would occupy a decision border and the derived partition would be less helpful for the decision tree learner. In this preliminary work, the mid-points (p_{i,k} + p_{i,k+1})/2 were used to define the final partition for each variable. A more sophisticated approach is left for future work.
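Deriving the final discretization from the grid then amounts to taking midpoints between neighboring prototype values on each axis, as in the following small helper (the example prototype values are hypothetical):

```python
def midpoint_splits(axis_prototypes):
    """Split points between neighboring prototype values of one variable."""
    p = sorted(axis_prototypes)
    return [(p[i] + p[i + 1]) / 2.0 for i in range(len(p) - 1)]

print(midpoint_splits([0.9, 3.4, 5.6]))   # splits at 2.15 and 4.5
```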
Fig. 5. The regularly distributed clusters in case of the simple data set in Fig. 2.
6 Conclusions
An objective function-based method for a joint discretization of multiple numerical attributes is proposed. It is based on the idea of clustering the data space and using information from the partitioning of the whole data space to define a partition of each individual axis. The iterative algorithm terminated in the experiments after a few iterations; each iteration can be done in O(n), where n is the number of records. The obtained partitions consider information from other attributes and thus help to overcome the limitations of binary splitting or multisplit techniques that consider the marginal class distribution only.
References

BEZDEK, J.C. (1981): Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York.
ELOMAA, T. and ROUSU, J. (1996): Finding Optimal Multi-Splits for Numerical Attributes in Decision Tree Learning. Technical Report of the Dept. of Computer Science, Univ. of Helsinki, Finland, 41.
HÖPPNER, F., KLAWONN, F., KRUSE, R. and RUNKLER, T. (1999): Fuzzy Cluster Analysis. John Wiley & Sons.
MITCHELL, T. (1997): Machine Learning. McGraw Hill.
PEDRYZC, W. and WALETZKY, J. (1997): Fuzzy-Clustering with Partial Supervision. IEEE Trans. on Systems, Man and Cybernetics – Part B, 27(5), 787–795.
STUTZ, C. (1998): Partially Supervised Fuzzy c-Means Clustering with Cluster Merging. In: Proc. of the Europ. Congress on Intelligent Techniques and Soft Computing, 1725–1729.
TIMM, H. (2001): Fuzzy Cluster Analysis of Classified Data. In: Proc. Joint IFSA World Congress and Int. Conf. of the North Am. Inf. Proc. Society, Vancouver, Canada.
Understanding and Controlling the Membership Degrees in Fuzzy Clustering

Frank Klawonn

Department of Computer Science, University of Applied Sciences Braunschweig/Wolfenbuettel, D-38302 Wolfenbuettel, Germany
Abstract. Fuzzy cluster analysis uses membership degrees to assign data objects to clusters in order to better handle ambiguous data that share properties of different clusters. However, the introduction of membership degrees requires a new parameter called fuzzifier. In this paper the good and bad effects of the fuzzifier on the clustering results are analysed and, based on these considerations, a more general approach to fuzzy clustering is proposed, providing better control over the membership degrees and their influence in fuzzy cluster analysis.
1 Introduction
A simple and popular approach to cluster analysis is the so-called k-means clustering algorithm (see for instance Duda and Hart (1973)). In this approach each cluster is represented by a prototypical object that corresponds to the cluster centre. A data object is assigned to the cluster for which the distance of the prototype to the data object is smallest. In order to model partly overlapping clusters, the concept of membership degrees was proposed by Bezdek (1973) and Dunn (1974). When viewing k-means clustering and its fuzzy variant as objective function-based clustering techniques, it is necessary to introduce a new parameter, called fuzzifier. The behaviour and the properties of the resulting clustering scheme differ significantly from those of the classical k-means scheme. A detailed analysis of the fuzzifier, its properties, and its positive and negative effects leads to a more general approach to fuzzy clustering with a better understanding as well as better control of the clustering parameters and properties. The paper is organized as follows. After a brief review of fuzzy clustering in section 2, section 3 provides a more detailed analysis and understanding of the fuzzifier concept, also discussing the question of whether fuzzy clustering is more robust than crisp clustering. In section 4 various approaches are proposed that generalize the fuzzifier concept in order to have better control over the properties of the clustering algorithm. Future perspectives are discussed in the conclusions.
2 Fuzzy Cluster Analysis
The c-means¹ clustering algorithm is designed to partition a data set X = {x_1, ..., x_n} ⊂ R^p into c clusters. From the purely algorithmic point of view, the c-means clustering can be described as follows. Each of the c clusters is represented by a prototype v_i ∈ R^p. These prototypes are chosen randomly in the beginning. Then each data vector is assigned to the nearest prototype (w.r.t. the Euclidean distance). Then each prototype is replaced by the centre of gravity of those data assigned to it. The alternating assignment of data to the nearest prototype and the update of the prototypes as cluster centres is repeated until the algorithm converges, i.e., no more changes happen. This algorithm can also be seen as a strategy for minimizing the objective function

f = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij} d_{ij}    (1)

under the constraints

\sum_{i=1}^{c} u_{ij} = 1 \quad \text{for all } j = 1, \ldots, n    (2)
where u_{ij} ∈ {0, 1} indicates whether data vector x_j is assigned to cluster i (u_{ij} = 1) or not (u_{ij} = 0). d_{ij} = \|x_j - v_i\|^2 is the squared Euclidean distance between data vector x_j and cluster prototype v_i. The parameters to be optimized are the cluster prototypes v_i, hidden in the distances d_{ij}, and the assignments u_{ij} to the clusters. Since there is no direct solution to this optimization problem, the above described strategy tries to minimize the objective function by alternatingly optimizing either the cluster prototypes or the assignments, while the other parameter set is considered to be fixed. It should be noted that by replacing the Euclidean distance by other distance measures and enriching the cluster prototypes by further parameters, other shapes than just the spherical clusters as in standard c-means clustering can be discovered. Clusters might be ellipsoidal, linear manifolds, quadrics, or even differ in volume (Keller and Klawonn (2003)). Since this paper is concerned with the assignment of the data objects to clusters, we refer to the literature (for instance Höppner et al. (1999)) for an overview. All our considerations are more or less independent of the chosen distance function d_{ij}. The generalization from crisp assignments u_{ij} ∈ {0, 1} to membership degrees u_{ij} ∈ [0, 1] seems to be straightforward, by simply considering the latter, relaxed constraint for the objective function. However, even when arbitrary values between zero and one are allowed for the assignment of the data objects to the clusters, it is easy to prove that a minimum of the objective
¹ In fuzzy cluster analysis, c is chosen to denote the number of clusters. In order to be coherent in notation, we therefore write c-means instead of k-means clustering.
function (1) can only be obtained if the membership degrees are chosen in a crisp way, i.e. u_{ij} ∈ {0, 1}. The reason for this is quite obvious. Not assigning the full weight u_{ij} of a data object x_j to the closest cluster i, but instead raising the weight u_{kj} to a cluster with a larger distance, will definitely increase the value of the objective function. Therefore, for fuzzy clustering the objective function was modified in the following form, introducing a so-called fuzzifier m > 1:

f = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^m d_{ij}    (3)
Note that the fuzzifier m does not have any effect when we use hard clustering. The fuzzifier m > 1 is not subject to the optimization process and has to be chosen in advance. A typical choice is m = 2. We will discuss the effects of the fuzzifier in the next section. The fuzzy clustering approach with the objective function (3) under the constraints (2) and the assumption u_{ij} ∈ [0, 1] is also called probabilistic clustering, since due to the constraints (2) the membership degree u_{ij} can be interpreted as the probability that x_j belongs to cluster i. The objective function (3) is also optimized by an alternating optimization scheme. It can be shown that the membership degrees have to be chosen as

u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{d_{ij}}{d_{kj}} \right)^{\frac{1}{m-1}}}    (4)
unless there exists a cluster i with zero distance dij to xj . In this case uij = 1 and ukj = 0 for i = k is chosen. If dij is the squared Euclidean distance, then the cluster centres vi are computed as the weighted mean n m j=1 uij xj vi = n (5) m . j=1 uij
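For illustration, the following minimal numpy sketch implements this alternating optimization with squared Euclidean distances; the function name, the random data-point initialization and the handling of ties are illustrative choices and not taken from the paper.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Minimal sketch of probabilistic fuzzy c-means, equations (3)-(5)."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    V = X[rng.choice(n, size=c, replace=False)].copy()   # random initial prototypes
    for _ in range(n_iter):
        # squared Euclidean distances d_ij, shape (c, n)
        D = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        U = np.empty_like(D)
        for j in range(n):
            zero = D[:, j] == 0.0
            if zero.any():
                U[:, j] = zero / zero.sum()   # crisp assignment if a prototype coincides with x_j
            else:
                ratio = (D[:, j][:, None] / D[:, j][None, :]) ** (1.0 / (m - 1.0))
                U[:, j] = 1.0 / ratio.sum(axis=1)           # equation (4)
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)          # equation (5)
    return U, V
```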
3 Properties of the Fuzzifier
The fuzzifier m controls how much clusters may overlap. For m → 1 the membership degrees tend to the values 0 and 1, i.e. fuzzy clustering is turned into crisp clustering. For m → ∞ the clusters become completely merged, because uij → 1/c. In addition to being able to better handle ambiguous data, fuzzy c-means clustering seems to be more robust than standard crisp c-means. Although there is no proof for this hypothesis, the use of membership degrees might be able to eliminate undesired local minima in the objective function and can therefore prevent fuzzy clustering from converging to counter-intuitive results. Figure 1 shows the objective functions of c-means and fuzzy c-means clustering for a simple one-dimensional data set. The data set consists of
Fig. 1. Objective functions for crisp (left) and fuzzy (right) clustering
two clusters centred around 0 and 5. There is also a cluster with very few data around 10. The clustering was carried out using noise clustering (Davé (1991)). This means that in addition to the two clusters for which the prototypes must be computed there is a third noise cluster that has no specific prototype, but a fixed large distance to all data. The noise cluster is supposed to collect those data that are far away from all other clusters, in our case the few data around 10. In figure 1 the x- and the y-axis refer to the location of the two cluster prototypes. The membership degrees are assumed to be chosen according to (4) for fuzzy clustering; for crisp clustering, the data objects are assigned to the closest cluster (including the noise cluster). The objective function for fuzzy clustering on the right hand side has two local minima, both representing correct clustering results. The difference between them is that the first and second cluster prototype are exchanged. The objective function for crisp clustering has – in addition to the two "correct" local minima – four more undesired local minima. For a detailed discussion of this problem we refer to Klawonn (2004). Another explanation for the higher robustness of fuzzy clustering is that a bad initialisation is more difficult to overcome for crisp clustering. In order to illustrate this effect we consider the artificial data set in figure 2. When (crisp) c-means is initialized by random cluster centres as they are indicated by the three squares, the left prototype will immediately grab the two clusters on the left hand side, while the other two prototypes have to share the third cluster. They will never obtain any information about the existence of the other two clusters. The situation is different for fuzzy c-means. Although all data from the two clusters on the left hand side are closest to the left initial prototype, they will still have a non-zero membership degree to the other clusters according to equation (4). Therefore, there is a higher chance that one of the prototypes on the right hand side will be attracted by one of the two clusters on the left hand side. However, although fuzzy clustering benefits in this case from the non-zero membership degrees, they can have
Fig. 2. A bad initialisation for c-means
bad effects in other cases. First of all, it is counter-intuitive to have non-zero membership degrees, no matter how far away a data object lies from a cluster prototype and how well it might be covered by another prototype. Secondly, when clusters have different data densities, clusters with higher densities tend to influence or even completely attract cluster prototypes other than the one that is closest.
4 Alternatives to the Fuzzifier
Viewing the fuzzifier in fuzzy clustering from a more general point of view, its main effect is a transformation of the membership degrees. Instead of using the terms uij dij as in the objective function (1) for c-means, fuzzy clustering (3) replaces this by g(uij)dij where g(u) = u^m. It is an obvious question, whether this type of transformation g is the only reasonable one or whether there are better alternatives. In order to better understand the role of the transformation g, we follow Klawonn and Höppner (2003a) and consider the objective function

$$f = \sum_{i=1}^{c} \sum_{j=1}^{n} g(u_{ij})\, d_{ij} \qquad (6)$$
under the constraints (2) that we want to minimize w.r.t. the values uij , considering the distances dij to be fixed. The constraints lead to the Lagrange function
$$L = \sum_{i=1}^{c} \sum_{j=1}^{n} g(u_{ij})\, d_{ij} + \sum_{j=1}^{n} \lambda_j \left( 1 - \sum_{i=1}^{c} u_{ij} \right)$$

and the partial derivatives

$$\frac{\partial L}{\partial u_{ij}} = g'(u_{ij})\, d_{ij} - \lambda_j. \qquad (7)$$
At a minimum of the objective function the partial derivatives must be zero, i.e. λj = g'(uij)dij. Since λj is independent of i, we must have

$$g'(u_{ij}) \cdot d_{ij} = g'(u_{kj}) \cdot d_{kj} \qquad (8)$$
for all i, k at a minimum. This actually means that these products must be balanced during the minimization process, unless at least one of the two membership degrees is zero or one. Equation (8) also explains why it is necessary to introduce the fuzzifier and why zero membership degrees (nearly) never occur. When we simply use the identity g(u) = u as the transformation, i.e. we consider the objective function (1), then it is obvious that (8) cannot be achieved, since g'(u) = 1 is constant. On the other hand, when we use g(u) = u^m with m > 1, we have g'(0) = 0 and g'(1) = m > 0. Therefore, in order to balance the two products in (8), no matter how large dkj and how small dij is, ukj must be chosen greater than zero and uij smaller than one. When we replace g(u) = u^m by another transform g with g'(0) > 0, this will definitely yield a zero membership degree for a cluster k, if dij/dkj < g'(0)/g'(1) holds. Of course, it is not possible to choose arbitrary functions g : [0, 1] → [0, 1]. It is obvious that g should be increasing and that we want g(0) = 0 and g(1) = 1. The above argument also requires that g should be differentiable. Equation (8) will only let clusters with a larger distance to a data object than the closest cluster participate in the membership degree if g'(u) < g'(ũ) for u < ũ. This means g' should also be increasing. Another important aspect is that we are still able to derive an analytical solution for the membership degrees, when we want to minimize the objective function (6) while fixing the cluster prototypes (and therefore the distances dij). Without an analytical solution, a numerical solution, i.e. an iterative scheme, would be needed within the alternating optimization, leading to high computational costs. Klawonn and Höppner (2003a/b) proposed a quadratic transform

$$g_\alpha(u) = \alpha u^2 + (1 - \alpha)u \qquad (0 \le \alpha \le 1) \qquad (9)$$

and an exponential transform

$$g_\alpha(u) = \frac{e^{\alpha u} - 1}{e^{\alpha} - 1} \qquad (0 < \alpha). \qquad (10)$$
The objective function using the transformation (9) represents a convex combination of standard crisp c-means and fuzzy c-means clustering with a fuzzifier m = 2. When using these transforms, the update equations for the alternating optimization scheme have to be altered. For the cluster prototypes (5) the terms u_{ij}^m have to be replaced by g_α(u_{ij}). The update equations for the membership degrees are

$$u_{ij} = \frac{1}{1 - \beta} \left( \frac{1 + (\hat{c} - 1)\beta}{\sum_{k:\, u_{kj} \neq 0} \frac{d_{ij}}{d_{kj}}} - \beta \right) \qquad \text{where} \quad \beta = \frac{1 - \alpha}{1 + \alpha} \qquad (11)$$

and

$$u_{ij} = \frac{1}{\alpha \hat{c}} \left( \alpha + \sum_{k:\, u_{kj} \neq 0} \ln \frac{d_{kj}}{d_{ij}} \right), \qquad (12)$$
Fig. 3. A piecewise linear transformation
respectively. Here ĉ is the number of clusters with non-zero membership degree for data object xj. The clusters with zero membership degree are determined in the following way. For a fixed data object xj the distances dij are sorted in decreasing order. Without loss of generality let us assume d1j ≥ . . . ≥ dcj. If there are zero membership degrees at all, we know that for minimizing the objective function the uij-values with larger distances have to be zero. The corresponding update equation does not apply to these uij-values. Therefore, we have to find the smallest index i0 to which (11), respectively (12), is applicable, i.e. for which it yields a positive value. For i < i0 we have uij = 0 and for i ≥ i0 the membership degree uij is computed according to (11), respectively (12), with ĉ = c + 1 − i0. Note that updating the membership degrees requires additional sorting. However, the sorting has to be carried out for a relatively small number of elements, namely as many elements as there are clusters, so that the additional computational costs are acceptable. Klawonn (2004) proposed to drop the differentiability of the transformation g completely and to consider a piecewise linear transformation g as it is shown in figure 3. This gives more freedom to control the behaviour of the membership degrees than just one parameter α. With this kind of transformation, the update equation for the membership degrees can no longer be determined by taking derivatives. Klawonn (2004) describes an efficient update scheme for such piecewise linear transformations with guaranteed convergence. A piecewise linear transformation will lead to discrete membership degrees in the sense that only the membership degrees corresponding to the bends in the curve will occur. Although such a piecewise linear transformation is not differentiable, it still satisfies the condition that its derivative exists at least almost everywhere and is non-decreasing. We can even give up this monotonicity condition for the derivative. This means that membership degrees in parts where the transformation becomes flatter will tend not to be assigned to clusters. In this way, making the curve flatter around 0.5, we can avoid ambiguous membership degrees, forcing them to tend more to either zero or one.
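As an illustration of the update (11) and the sorting scheme just described, the following sketch computes the membership degrees of one data object for the quadratic transform (9); it assumes 0 < α ≤ 1 (so that β < 1) and strictly positive distances, and the function name is purely illustrative.

```python
import numpy as np

def quadratic_memberships(d, alpha):
    """Membership update (11) with the quadratic transform, for one data object."""
    d = np.asarray(d, dtype=float)
    c = len(d)
    beta = (1.0 - alpha) / (1.0 + alpha)
    order = np.argsort(-d)              # clusters sorted by decreasing distance
    u = np.zeros(c)
    for pos in range(c):                # candidate index i0
        included = order[pos:]          # clusters that keep a non-zero membership
        c_hat = len(included)
        ratios = d[included][:, None] / d[included][None, :]
        cand = ((1.0 + (c_hat - 1) * beta) / ratios.sum(axis=1) - beta) / (1.0 - beta)
        if cand.min() > 0.0:            # (11) is applicable for all remaining clusters
            u[included] = cand
            break
    return u
```

For α = 1 (i.e. β = 0) the formula reduces to the standard fuzzy c-means update (4) with m = 2, and no zero membership degrees occur.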
5 Conclusions
Understanding the fuzzifier in fuzzy clustering as a specific type of transformation opens the door to new approaches to fuzzy clustering. The undesired effect of non-zero membership degrees, no matter how far away data objects might be from a cluster, can be avoided in this way. This also enables fuzzy clustering to cope with clusters of different densities. The proposed transforms allow a specific adjustment of the properties of the fuzzy clustering algorithm, such as: When should a data object have a non-zero membership degree? Are completely ambiguous data acceptable? In most cases this can be achieved by a piecewise linear transformation with only three or four segments. Future work will be devoted to algorithms whose membership transformation is changed over time or might even be adaptive. For instance, as we have already mentioned in section 3, standard fuzzy clustering can more easily overcome a bad initialisation, because all data objects have at least a small influence on all clusters, no matter how large the distance is. However, this property is not desired for the final clustering result. Therefore, it seems advisable to start the clustering with a transformation similar to the one used in standard fuzzy clustering and then modify it, so that non-zero membership degrees become less probable.
References

BEZDEK, J.C. (1973): Fuzzy Mathematics in Pattern Classification. Ph.D. thesis, Appl. Math. Center, Cornell University, Ithaca.

DAVÉ, R.N. (1991): Characterization and Detection of Noise in Clustering. Pattern Recognition Letters, 12, 657–664.

DUDA, R. and HART, P. (1973): Pattern Classification and Scene Analysis. Wiley, New York.

DUNN, J.C. (1974): A Fuzzy Relative of the ISODATA Process and its Use in Detecting Compact, Well Separated Clusters. Journal of Cybernetics, 3, 95–104.

HÖPPNER, F., KLAWONN, F., KRUSE, R. and RUNKLER, T. (1999): Fuzzy Cluster Analysis. Wiley, Chichester.

KELLER, A. and KLAWONN, F. (2003): Adaptation of Cluster Sizes in Objective Function Based Fuzzy Clustering. In: C.T. Leondes (ed.): Intelligent Systems: Technology and Applications vol. IV: Database and Learning Systems. CRC Press, Boca Raton, 181–199.

KLAWONN, F. (2004): Fuzzy Clustering: Insights and a New Approach. Mathware and Soft Computing, 11, 125–142.

KLAWONN, F. and HÖPPNER, F. (2003a): What is Fuzzy About Fuzzy Clustering? Understanding and Improving the Concept of the Fuzzifier. In: M.R. Berthold, H.-J. Lenz, E. Bradley, R. Kruse, C. Borgelt (eds.): Advances in Intelligent Data Analysis V. Springer, 254–264.

KLAWONN, F. and HÖPPNER, F. (2003b): An Alternative Approach to the Fuzzifier in Fuzzy Clustering to Obtain Better Clustering Results. In: Proc. 3rd Eusflat Conference. Zittau, 730–734.
Autonomous Sensor-based Landing Systems: Fusion of Vague and Incomplete Information by Application of Fuzzy Clustering Techniques Bernd Korn German Aerospace Center, DLR Institute of Flight Guidance
Abstract. Enhanced Vision Systems (EVS) are currently developed with the goal to alleviate restrictions in airspace and airport capacity in low-visibility conditions. EVS relies on weather penetrating forward-looking sensors that augment the naturally existing visual cues in the environment and provide a real-time image of prominent topographical objects that may be identified by the pilot. In this paper an automatic analysis of millimetre wave radar images for Enhanced Vision Systems is presented. The core part of the system is a fuzzy rule based inference machine which controls the data analysis based on the uncertainty in the actual knowledge in combination with a-priori knowledge. Compared with standard TV or IR images the quality of MMW images is rather poor and the data is highly corrupted with noise and clutter. Therefore, one main task of the inference machine is to handle uncertainties as well as ambiguities and inconsistencies to draw the right conclusions. The outputs of different sensor data analysis processes are fused and evaluated within a fuzzy/possibilistic clustering algorithm whose results serve as input to the inference machine. The only a-priori knowledge used in the presented approach is the same that pilots already know from airport charts, which are available for almost every airport. The performance of the approach is demonstrated with real data acquired during extensive flight tests to several airports in Northern Germany.
1 Introduction
Adverse weather conditions affect flight safety as well as the efficiency of airport operations. The problem is obvious in critical flight phases such as approach, landing, take-off, and taxiing, in which the reduced visual range affects the pilot's situational awareness and increases the separation distance between approaching aircraft due to safety reasons. Consequently, runway capacity decreases and delays increase. Thus, even at well equipped airports (with ILS CAT III systems) runway capacity is dramatically reduced under low visibility conditions. Therefore, a lot of research and development work has been done during the last years in the field of Enhanced Vision Systems (EVS). The basic idea behind such a system is to provide sensor images to the pilot allowing him to assess the situation outside the aircraft, so that he can perform approach, landing and taxiing manoeuvres in the same manner as under good visibility conditions. It easily can be seen that the performance
of the Enhanced Vision System strongly depends on the selection of imaging sensors. The need of obtaining images under all weather conditions may serve as an example for this effect. Infra-red (IR) and millimetre-wave (MMW) sensors are currently envisaged as the most promising EVS support of pilot vision in low visibility. One important benefit of IR-sensors is that these sensors generate a perspective image, from which the human can derive the perceptual cues of depth to generate a three-dimensional interpretation of the outside world. However, the penetration of bad weather (dense fog and light rain) in the infrared spectrum is remarkably poorer than the weather penetration that can be achieved by MMW-radar (Rodloff et al. (1998)). On the other hand, due to the different type of sensing, MMW radar images are difficult to interpret by the human being, and pilots are in general not able to derive the necessary navigation information, such as the aircraft's position relative to the runway, directly from the radar image. Consequently, there is a need to assist pilots in the interpretation of such radar images. In addition, an automatic analysis of radar images could benefit from a-priori knowledge about the runway structure and aircraft navigation data. In the following sections the chosen approach for automatic analysis of millimetre wave (MMW) radar images with regard to the requirements for a sensor based landing is described in more detail. In Section 2 the basic concept of the radar image based navigation for approach and landing is presented. Section 3 focuses on the extraction of runway hypotheses from single radar images whereas section 4 describes how single runway hypotheses are fused together using a fuzzy clustering algorithm. Results from a huge amount of flight trials are presented in section 5.
2 Radar Image Based Navigation
The sensor based navigation system (Korn and Hecker (2002)) follows a two step approach. In the first step, the extraction step, hypotheses about the location of runways are generated from single radar images only. Each hypothesis is attached with a value denoting the possibility that this hypothesis represents the correct runway. Normally, the data analysis generates more than only one runway hypothesis from a single radar image. These hypotheses, together with their possibility values, are sent to the second layer in the process, the inference and fusion process. Within the fusion process all hypotheses are clustered using adapted fuzzy clustering algorithms (Bezdek (1987)). This hierarchical structure makes it easy to integrate additional sensor data analysis processes (e.g. runway hypotheses from the analysis of IR images) (Korn et al. (2000a)). The result of the cluster analysis is evaluated within the inference process. The distribution (e.g. compactness) and size of each cluster (relative size as well as absolute size) in combination with some a-priori knowledge about the topology of the airport are used to determine the state of the system. The result of this evaluation is fed back to the extraction
process to determine whether runway structures should be searched for within the entire sensor data or in certain regions-of-interest (ROI) or even tracked, when there is a high reliability for only a few runway hypotheses. Furthermore, the position of the aircraft relative to the runway threshold is calculated (Korn et al. (2000b)) if the output of the clustering allows a reliable and unique determination of the runway. From this calculated position, information such as the deviation from an optimal 3° glide path can be derived and be displayed head up or head down by ILS-like localizer and glide slope indicators. In general, the sensor data analysis can be simplified the more (precise) information about the airport is available. On the other hand, the more information is necessary, the more the applicability of the system is reduced to those airports for which the information is available. Therefore, the system needs only few and not necessarily precise information about the airports, which can be extracted from Approach and Aerodrome Charts: heading (accuracy about 1°), dimension (width and length), and elevation of the runways. Combining this information with INS-data, the heading (not the position) of the runway in the radar images can be estimated (with an accuracy of about 1°-3°). This estimation is used to generate runway hypotheses by extracting runway features from the radar images.
3 MMW Radar Image Analysis
In our system, the HiVision radar (Hellemann and Zachai (1999)) from EADS, Ulm, a 35 GHz MMW radar with a frame rate of about 16 Hz, is used as the primary EV sensor. In Figure 1 some typical radar images of runways are depicted. The runway can be identified as a dark stripe in a noisy environment. Furthermore, the approach light system is imaged as a chain of "blobs" in front of the runway. In the third image even runway lights can be detected as blobs along the runway. There is only little reflection of radar energy from the concrete area of the runway back to the sensor. Thus, a difference in intensity exists at the borders of the runway in the radar image. The extraction process looks for such intensity differences in every row of each radar image. A region within a row (see Figure 2 (a)) is selected as part of the runway if it fulfils the following requirements at least to a certain degree:

• Intensity differences at the borders
• Homogeneous intensity distribution within the region
• Size of the region corresponds to the width of the runway

These attributes are described with fuzzy sets. The combination of the fuzzy values is realized using possibility theory (Kruse et al. (1993)). Within a radar image a lot of small regions exist which have a high possibility value and consequently might be part of the runway. In a second step these regions are grouped together to runway stripes by applying a fuzzy Hough Transform (Strauss (1999)) to the centre points of the regions. In
Fig. 1. Appearance of typical runway structures in radar images (B-scope format).
general this leads to more than one hypothesis about the location of the runway in the image. Figure 2(b)-(d) shows the result of this operation for a radar image (image 2(b), approach to Hannover 27R). Image 2(c) shows all regions which might belong to the runway. The possibility values of the regions are intensity coded. A dark region represents a region with a high possibility value of being part of the runway. Besides the correct runway, four more hypotheses are generated (2(d)). Every hypothesis is evaluated by combining the possibility values of every small region that belongs to the hypothesis. Again a dark depiction means a high possibility value that this hypothesis represents the correct runway. A similar approach is used to generate runway hypotheses based on the approach light structure and runway lights. More details can be found in the paper of Korn and Hecker (2002).

3.1 Generation of Runway Hypotheses
Using the results from the individual extraction processes, consolidated runway hypotheses are generated and evaluated based on the quality of the features used for their generation. Once it is established that runway hypotheses from e.g. the runway stripe extraction process and the approach light extraction process correspond to each other, a value of quality is calculated for this combined hypothesis. The main parameters of a runway hypothesis xh = (P, φ, w, l) are the threshold position P, the runway heading φ, the width w, and the length l of the runway. The quality µh of the runway hypothesis is calculated by

$$\mu_h = \Big( \mu_{AL} \vee \mu_{RL} \vee (\mu_{RWY} \wedge \mu_{length}) \Big) \wedge \Big( \big(1 - (\mu_{RWY} \wedge \mu_{length})\big) \vee \mu_{\#Obstacle} \Big) \qquad (1)$$
Fig. 2. Region within a row of a radar image which is identified as part of the runway (a) and extraction of runway stripe. (b): radar image (approach to Hannover 27R), (c): feature image. (d): runway hypotheses generated by fuzzy Hough Transform.
where µAL and µRL denote the quality of the features approach lights and runway lights, respectively. µRWY is a measure for the intensity distribution of the image of the runway stripe and µlength is a measure describing how much of this particular runway hypothesis is really visible in the radar image; the more is visible, the better the value for µlength. Under the assumption that this hypothesis is the real runway, the runway stripe is investigated for obstacles. In normal airport operation, more than one obstacle (landing or departing aircraft) on a runway is very unlikely. Thus, µ#Obstacle reflects this. ∨ and ∧ denote the fuzzy OR (max-operator) and fuzzy AND (min-operator) operations, respectively.
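A small sketch of this fuzzy combination using the min/max operators; note that the grouping of the terms follows the bracketing reconstructed in (1) and that the argument names are illustrative, not taken from the paper.

```python
def hypothesis_quality(mu_al, mu_rl, mu_rwy, mu_length, mu_obstacle):
    """Combine feature qualities into mu_h with fuzzy AND = min, OR = max."""
    f_and, f_or = min, max
    stripe = f_and(mu_rwy, mu_length)            # evidence from the visible runway stripe
    evidence = f_or(mu_al, f_or(mu_rl, stripe))  # approach lights, runway lights or stripe
    obstacle_check = f_or(1.0 - stripe, mu_obstacle)
    return f_and(evidence, obstacle_check)
```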
4 Fusion and Inference
Usually, several different runway hypotheses are generated from a single radar image. So, the extraction of navigation information from a single radar image is not reliable enough. Therefore, runway hypotheses from the actual radar image are fused with hypotheses generated from previous radar images. This fusion is done by a clustering algorithm, which is based on the Fuzzy-c-Means algorithm. Each runway hypothesis is described by attributes like width,
length, heading, position of the runway beginning (defined relative to the aircraft coordinate system). Every attribute is attached with a measure of uncertainty which describes how reliably this attribute has been extracted from the radar image. In addition every hypothesis is attached with an overall quality value µh (see eq. 1) which denotes the possibility that this hypothesis represents the correct runway. The clustering is based on the objective function

$$J_m = \sum_{k=1}^{n} \sum_{i=1}^{c} \mu_h(k) \cdot \mu_{age}(k) \cdot u_{ik}^{m} \cdot d_{ik}^{2}, \qquad d_{ik} = d(x_k, v_i) \qquad (2)$$
A prototype v has the same structure as a single data element x. Consequently, the distance function describing the distance between two prototypes is the same as the distance function describing the distance between a prototype and a data element. The distance function does not take into account every element of the structure describing the runway. It is based only on the position P of the runway threshold:

$$d^2(v_i, x_k) = d^2(v_i, x_h(k)) = d_P^2(P_v(i), P_x(k)) \qquad (3)$$

with

$$d_P^2(P_v, P_x) = \min_{\varphi \in \{\varphi_v, \varphi_x\}} \left\{ \big(R(-\varphi)(P_v - P_x)\big)^T \begin{pmatrix} (0.2)^2 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \big(R(-\varphi)(P_v - P_x)\big) \right\} \qquad (4)$$

dP takes into account the respective runway headings of the prototype φv and of the data element φx, which has been determined by the image analysis process, and reduces the distance along the runway centreline up to 1/5 of the Euclidian distance (R(φ) denotes a 3 × 3 rotation matrix). This reflects the fact that the extraction of the runway threshold is much more uncertain along the runway centreline than perpendicular to the centreline. µage is an additional possibility value which considers the age of a hypothesis. A hypothesis which is derived from the actual image has a higher possibility value µage than a hypothesis generated some time before. Besides that, with µage → 0 the number of runway hypotheses is reduced and thus the complexity of the fusion process as well. The number of clusters c is determined with regard to the distances between the cluster prototypes. The number of clusters is increased if a new hypothesis to be inserted in the cluster set does not correspond to one of the existing cluster prototypes (if the distances to the respective prototypes are above a certain threshold). The prototypes are calculated element by element from the data set X and the cluster matrix U using the standard weighted sum formula, taking into account µage as an additional weighting factor. E.g. the threshold position Pv(i) of a cluster prototype vi is determined by

$$P_v(i) = \frac{\sum_{k=1}^{n} \mu_h(k) \cdot \mu_{age}(k) \cdot u_{ik}^{m} \cdot P_x(k)}{\sum_{k=1}^{n} \mu_h(k) \cdot \mu_{age}(k) \cdot u_{ik}^{m}} \qquad (5)$$
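The following minimal sketch evaluates the threshold distance (4); the choice of the vertical axis for the rotation is an assumption (the paper only states that R(φ) is a 3 × 3 rotation matrix), and the function names are illustrative.

```python
import numpy as np

def rot_z(phi):
    """3x3 rotation matrix about the vertical axis (assumed axis)."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

W = np.diag([0.2 ** 2, 1.0, 1.0])   # weighting matrix of equation (4)

def d_p_squared(P_v, phi_v, P_x, phi_x):
    """Squared threshold distance d_P^2 of equation (4)."""
    diff = np.asarray(P_v, float) - np.asarray(P_x, float)
    vals = []
    for phi in (phi_v, phi_x):
        r = rot_z(-phi) @ diff
        vals.append(r @ W @ r)
    return min(vals)
```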
4.1 Analysis of Cluster Results
The clustering performs some kind of information consolidation by reducing the huge number of hypotheses to a small number of prototypes. An absolutely perfect runway extraction algorithm would always generate one runway hypothesis from a single radar image that represents the correct runway. In such a case the clustering results in only one cluster. Consequently, for a noisy, disturbed set of hypotheses, the inference process is looking for a dominant cluster. The dominance of a cluster i is calculated by evaluating the absolute size S(i) and the relative size Sr(i) of the cluster:

$$S(i) = \sum_{k=1}^{n} \mu_h(k) \cdot \mu_{age}(k) \cdot u_{ik} \qquad (6)$$

and

$$S_r(i) = \frac{S(i)}{\sum_{j=1}^{c} S(j)} \qquad (7)$$
In addition, a quality value for the compactness of the cluster is taken into account, too. Once the clustering results in one dominant cluster and there is a high confidence within the system that the cluster prototype represents the correct runway, the position of the aircraft relative to the runway threshold is calculated. From this calculated position, information such as the distance to the threshold and the deviation from an optimal 3° glide path can be derived and be displayed head up or head down, e.g. by ILS-like localizer and glide slope indicators. The distance to the threshold can be calculated with an accuracy of about 15 m within the last 1000 m of an approach. The localizer information is more accurate. For the last 1000 m of an approach, the accuracy was better than 2 m (compared to differential GPS), whereas the calculation of the elevation of the aircraft is less reliable. For a reliable calculation of the vertical component of the aircraft's position relative to the runway, the result of the radar image based navigation has to be fused with the barometric altitude and the radar altimeter (Korn et al. (2000a)).
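A compact sketch of this dominance evaluation, computing S(i) and Sr(i) from the membership matrix and the hypothesis weights (array names are illustrative):

```python
import numpy as np

def cluster_dominance(U, mu_h, mu_age):
    """Absolute size S(i), equation (6), and relative size S_r(i), equation (7).

    U has shape (c, n); mu_h and mu_age have length n.
    """
    S = (mu_h * mu_age * U).sum(axis=1)
    return S, S / S.sum()
```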
5 Experimental Results
During our flight tests we acquired about 90000 radar images within more than 50 radar image sequences from approaches to several runways from different airports in Germany (e.g. Hannover, Cologne-Bonn, Braunschweig, Hamburg, Peine, ...). In every sequence the runway was extracted correctly. The average distance between runway threshold and aircraft at time of extraction was about 1700 m, which is significantly larger than the minimum RVR (Runway Visual Range) for a non-precision approach. Only for two approaches (one to Braunschweig 08 and one to Peine-Edesse 25) this distance was below 1000 m but this smaller figure can be easily explained by the flown
curved approach procedure with a straight final of about 1200 meters. The results clearly demonstrate the robustness and usefulness of the presented approach. The automatic analysis of the radar images enables the combination of Synthetic Vision Systems with Enhanced Vision Systems. The aircraft's navigation solution can be monitored or updated by the results of the sensor image analysis process. With an automatic detection of the runway in the sensor images there is no need anymore to display these images to the pilot. Consequently, a "nice" Synthetic Vision image of the surrounding terrain and the airport can be displayed to the pilot even in the final approach phase close to terrain, knowing that this virtual view has been crosschecked by weather penetrating forward looking imaging sensors. The presented approach, based on the extraction of runway hypotheses from single radar images and the consolidation of the hypotheses by application of fuzzy clustering, can easily be adapted to results from the analysis of other sensors like e.g. IR sensors or even the radar altimeter. Uncertainty or even incompleteness in the determination of some components of a single hypothesis can be dealt with by respective adaptation of the distance functions and the calculation of the cluster prototypes.
References

BEZDEK, J.C. (1987): Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York.

HELLEMANN, K. and ZACHAI, R. (1999): Recent Progress in mm-Wave-Sensor Capabilities for Enhanced Vision. In: J. G. Verly (Ed.): Enhanced and Synthetic Vision 1999, SPIE Vol. 3691, 21–28.

KORN, B., DÖHLER, H.-U. and HECKER, P. (2000a): MMW Radar Based Navigation: Solutions of the "Vertical Position Problem". In: J. G. Verly (Ed.): Enhanced and Synthetic Vision 2000, SPIE Vol. 4023, 29–37.

KORN, B., HECKER, P. and DÖHLER, H.-U. (2000b): Robust Sensor Data Fusion for Board-autonomous Navigation During Approach and Landing. In: International Symposium on Precision Approach and Automatic Landing, ISPA 2000, DGON, Munich, 451–457.

KORN, B. and HECKER, P. (2002): Enhanced and Synthetic Vision: Increasing Pilot's Situation Awareness under Adverse Weather Conditions. In: Proceedings of the 21st Digital Avionics Systems Conference, DASC 2002.

KRUSE, R., GEBHARDT, J. and KLAWONN, F. (1993): Fuzzy-Systeme. B. G. Teubner.

RODLOFF, R., DÖHLER, H.-U. and HECKER, P. (1998): Image Data Fusion for Enhanced Situation Awareness. In: RTO-SCI-Symposium on "The Application of Information Technologies to Mission Systems".

STRAUSS, O. (1999): Use the Fuzzy Hough Transform, towards Reduction of Precision/Uncertainty Duality. Pattern Recognition – Special Issue on Fuzzy Image Processing, Vol. 32, No. 11, 1911–1922.
Outlier Preserving Clustering for Structured Data Through Kernels Marie-Jeanne Lesot LIP6, 8 rue du capitaine Scott, 75 015 Paris, France
Abstract. In this paper, we propose a kernel-based clustering algorithm that highlights both the major trends and the atypical behaviours present in a dataset, so as to provide a complete characterisation of the data; thanks to the kernel framework, the algorithm can be applied independently of the data nature without requiring any adaptation. We apply it to xml data describing student results to several exams: we propose a kernel to handle such data and present the results obtained with a real dataset.
1 Introduction
Clustering provides a simplified description of a dataset by means of a decomposition into subgroups: the latter are defined so that a point is more similar to points belonging to the same cluster than to points assigned to other groups, i.e. clusters respect a compactness and separability aim. It is then possible to summarise the initial dataset by the obtained subgroups. Now a complete and accurate description should contain information on the major trends present in the data, but also on the atypical behaviours, as represented by outliers: at a linguistic level, such points can be associated with labels such as "abnormal". They correspond to minor but significant concepts which are necessary to describe the initial dataset and should be present as any other group in the data characterisation. In this paper, we consider this task in the case of structured data, i.e. data that are not to be considered as vectors, but as sequences, trees or more generally graphs. In particular, we consider xml data representing student results to several exams for which information about the attribute relationships is available. The kernel learning methods (Vapnik (1995); Schölkopf and Smola (2002)) provide a unified framework to handle such non-vectorial data and constitute algorithms that can be applied whatever the distance measure is, without requiring adaptation to the data nature. Therefore we propose a kernel version of the opca clustering algorithm (Lesot and Bouchon-Meunier (2004)), to simultaneously identify both major trends and marginal behaviours in structured datasets. Section 2 of the paper recalls opca principles and section 3 presents its kernel variant. In section 4 we apply it to xml data representing student results, propose a kernel to handle such data and present the obtained results.
2 Outlier Preserving Clustering Algorithm (OPCA)
The Outlier Preserving Clustering Algorithm (Lesot and Bouchon-Meunier (2004)) provides a complete characterisation of a dataset by identifying both the major and the marginal trends present in the data: it identifies subgroups as any clustering algorithm, but also one-point clusters, corresponding to outliers, and lastly intermediate clusters corresponding to small sets of similar outliers, which can be overlooked by both clustering techniques and outlier detection methods. To that aim, it explicitly takes into account the double objective of clustering, compactness and separability: it is based on the combination, in an iterative process, of the single linkage hierarchical clustering algorithm, denoted AHCmin in the following, and the fuzzy c-means algorithm, denoted fcm. This combination exploits the algorithms' respective advantages: fcm builds especially compact clusters, and AHCmin is sensitive to the minimal distance between subgroups and thus identifies well-separated clusters. The iterative process makes it possible to take into account several distance scales and to adapt locally to the data density. Opca can be seen as a divisive method which selects at each step the most appropriate clustering algorithm and modifies the parameters according to local criteria.
3 OPCA Kernel Extension
Kernel methods (e.g. see Schölkopf and Smola (2002)) provide a formal framework to enrich data representation at low computational costs and to extend learning algorithms to non-vectorial data. They are based on an implicit nonlinear data transformation φ : X → F where X denotes the input space and F is called the feature space. Data points are not handled directly in F, but only through their scalar product in this space, which is computed using the initial representation, through the kernel function k : X × X → R, defined as

$$\forall x, y \in \mathcal{X}: \quad \langle \phi(x), \phi(y) \rangle = k(x, y)$$
Kernel algorithms are formulated so as to only depend on scalar products, that are then computed using the kernel function. An opca kernel extension requires a kernel variant of fcm and AHCmin.

3.1 Kernel Fuzzy c-Means
The kernel extension of fuzzy c-means (Wu et al. 2003) consists in transposing the cost function to the feature space, i.e. applying it to the transformed data φ(x): provided the cluster centres are looked for as linear combinations of the transformed data, which is consistent with the centre expression obtained in the non-kernel case, it can be shown that the update equations for the cluster centres and the membership degrees only depend on scalar products.
It is to be noticed that there exist other fcm variants that consider other distances than the euclidian one, such as the fuzzy shell methods (see e.g. Klawonn et al. (1997)). Yet their objective is different: they aim at extracting prototypes which have a different nature than the data points, e.g. ellipsoids or quadrics. Thus they modify the distance definition between points and cluster prototypes. Kernel functions directly express the similarity between point couples and determine more directly which points are to be grouped in the same cluster.

3.2 Hierarchical Clustering with Kernel
For hierarchical clustering algorithms, the kernel extension is straightforward: data points are not involved as such, but only through their distances that are functions of the scalar products:
$$d(x, y) = \sqrt{\langle x - y, x - y \rangle} = \sqrt{\langle x, x \rangle - 2\langle x, y \rangle + \langle y, y \rangle} \qquad (1)$$
Thus, replacing the scalar product with the kernel function, one can implicitly apply hierarchical algorithms in the feature space, using the kernel distance $d_K^2(x, y) = k(x, x) - 2k(x, y) + k(y, y)$.
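A small sketch of this idea, computing the pairwise feature-space distances from a kernel matrix and handing them to an off-the-shelf single-linkage routine; the use of scipy and the function names are illustrative choices, not the author's implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def kernel_distances(K):
    """Pairwise feature-space distances d_K from a kernel matrix K."""
    diag = np.diag(K)
    D2 = diag[:, None] - 2.0 * K + diag[None, :]
    return np.sqrt(np.maximum(D2, 0.0))     # clip tiny negatives from round-off

def kernel_single_linkage(K):
    """Single-linkage (AHC_min) applied implicitly in the feature space."""
    D = kernel_distances(K)
    return linkage(squareform(D, checks=False), method="single")
```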
3.3 Kernel OPCA Definition
Algorithm. The opca kernel variant, denoted kopca, is detailed in Table 1. It consists in the combination, in an iterative process, of the kernel single linkage hierarchical clustering algorithm (kAHCmin) and the kernel fuzzy c-means (kfcm, Wu et al. (2003)). More precisely, it considers a data subgroup G and divides it by the most appropriate algorithm, depending on its separability and compactness measured in the feature space; it then iterates on the obtained subgroups. The use of kAHCmin depends on the presence of gaps in the distribution of G. This is measured by CkAHC = DG/δG where DG is the maximal merging distance observed in G when applying kAHCmin and δG = min_{x,y∈G} dK(x, y) is the minimal distance between distinct points in G: DG corresponds to the cost of considering G as a non-divisible group, δG defines a scale of locally significant distances. The quotient measures a local separability. If kAHCmin is applied, it provides a sequence of nested partitions of the data obtained through progressive merges, and one of the partitions must be selected as the clustering result. To that aim, we defined a cost threshold s*, so that only the merges whose cost is lower than s* are performed. Denoting α a user-defined parameter, we define s* = D̄ + ασ(D) where D is the vector of the progressive merge costs, D̄ and σ(D) its average and standard deviation respectively. This is equivalent to performing a proportion of the proposed merges at each step. Both CkAHC and s* only depend on distances and thus can be computed in the feature space.
Initialization: Define G = X and fix the parameter α value.
  Compute the kernel matrix K = (k(xi, xj))1≤i,j≤n.
Algorithm:
  If G is separable according to CkAHC:
    Compute threshold s*.
    Split G using kAHCmin and s*.
  Else, if G is not compact according to Cdiam:
    Compute the optimal cluster number c* according to Csel.
    If a kfcm split is justified according to Ckfcm:
      Split G using kfcm with c = c*.
  Iterate on each obtained subgroup, extracting the corresponding kernel submatrix.
Table 1. kopca algorithm.
The use of kfcm depends on the compactness of G, measured by the group diameter, Cdiam = max_{x,y∈G} dK(x, y), and on the gain in compactness provided by kfcm. The latter is measured by $C_{kfcm} = \sum_{r} \sigma(C_r)\,/\,\sigma(G)$ where (Cr)r=1..c denote the candidate subgroups of G: it compares the average standard deviation of the candidate subgroups to the global standard deviation of G. Note that this quantity can be computed in the feature space: for any dataset C, denoting φ̄C the dataset average in the feature space, one has

$$\sigma_\phi^2(C) = \frac{1}{|C|} \sum_{x \in C} \|\phi(x) - \bar{\phi}_C\|^2 = \frac{1}{|C|} \sum_{x \in C} k(x, x) - \frac{1}{|C|^2} \sum_{x, y \in C} k(x, y)$$
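For instance, the feature-space variance of a candidate subgroup can be evaluated directly from the corresponding kernel submatrix, as in this short sketch (function name illustrative):

```python
import numpy as np

def feature_space_variance(K_C):
    """Variance of a dataset C in the feature space, from its kernel submatrix K_C."""
    n = K_C.shape[0]
    return np.trace(K_C) / n - K_C.sum() / n ** 2
```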
Lastly the optimal cluster number c is selected according to a stability criterion: it can be observed that kfcm converge to the same solution independently of the random initialisation of the cluster centres, provided the c value is lower than the actual number of subgroups present in the dataset; otherwise, the algorithm exploits the available degrees of freedom and the obtained partition varies, as well as the final value of the cost function. Therefore, we measure stability for several c values as Csel = σ(J φ )/J¯φ where J φ denotes the kfcm cost function (see Wu et al., 2003), J¯φ and σ(J φ ) its average and standard deviation with respect to random initialisation of the centres. The last c value before destabilisation is chosen. kOPCA Parameters The non-kernel algorithm opca involves parameters which enable the user to include semantic information, as the minimal significant distance or the maximal desired cluster diameter. In the kernel case, these values are defined in the feature space, thus may be more difficult to interpret, they are not considered as parameters. Thus kopca only depends on the parameter α which determines the proportion of kAHCmin proposed merges performed at each step; it indirectly rules the final number of clusters and should be defined according to the expected outlier proportion. Note that kopca results also depend on the kernel function and its parameters.
Fig. 1. Clustering results using several clustering algorithms: (a) Considered dataset; (b) kopca (5 clusters); (c) opca (7 clusters); (d) kernel fcm (2 clusters).
3.4 Experimental Results on an Artificial Dataset
We tested kopca on a two-dimensional dataset, illustrated in Figure 1a, composed of a noisy ring surrounding data generated by a uniform distribution, one outlier located at (1.9, -1.5), and lastly a small outlying group, located around (-1.4, 2.1), generated following a uniform distribution. We used α = 3 and a gaussian kernel with standard deviation 0.5. Figure 1b shows that kopca identifies 5 clusters, among which the ring and the internal uniform distribution; it isolates the outlier and a locally isolated point located at (-0.5, -1.9). Lastly it identifies the small outlying group, providing a result compatible with the expectations. Figure 1c shows the opca results and highlights the kernel advantage: according to the euclidian distance, the ring cluster cannot be considered as compact, thus it is split into four subgroups, which misrepresents the data distribution. Still, the outlying data are successfully identified. Figure 1d illustrates the kernel fcm results with the stability-based selection criterion for the optimal c value: as it is also based on the gaussian kernel, it identifies the ring cluster, but it cannot isolate the outliers, be it the small outlying group or the outlier point. It assigns them to the noisy ring, losing information about the specificity of these data points.
Examples of λ values (tree leaves, corresponding to fields, are numbered from left to right): p(1, 8) = 0 → λ1,8 = l/P = l/3; p(1, 2) = 2 → λ1,2 = l/(P − 2) = l > λ1,8.
Fig. 2. Left: example of a considered data point: xml structure and field values; right: examples of λ values.
4 Application to Structured Data

4.1 Considered Data
We consider the application of kopca to structured data in the form of xml data describing student results to several exams: the xml structure represents an ontology on the attributes which expresses their relationships, e.g. distinguishing between scientific or literary subjects, or opposing theoretical and practical subjects. An example is provided in fig. 2, where internship results are opposed to the course ones, which are divided into theoretical subjects (5 components) and practical ones (2 components). These structured data can be represented as trees, whose leaves contain the student results and whose internal nodes represent the xml structure. They are specific insofar as the structure is identical for all data points: it is not to be seen as a discriminative feature that must be taken into account to compare data points, as is usually the case (cf. e.g. Collins and Duffy (2002), Kashima and Koyanagi (2002)); yet the available information about the attributes should be taken into account to compare points.

4.2 Proposed Kernel
The xml structure indicates that attributes are not independent one from another, which suggests to enrich the attribute by attribute euclidian metric so as to take into account their relationships: two attributes belonging to the same branch convey similar meaning and their values should also be compared. Therefore, we propose a kernel based on attribute cross comparisons, that are weighted according to their proximity indicated in the xml structure: for two data points x and y, it is defined as

$$k(x, y) = \sum_{i,j=1}^{d} \lambda_{ij}\, x_i y_j \qquad \text{with} \qquad \lambda_{ij} = \delta_{ij} + \frac{l}{P - p(i, j)}$$
Fig. 3. Clustering results obtained with a real dataset representing student results in an xml form. Data are represented as vectors of their field values.
d denotes the number of fields, xi the value of field i for point x, (λij)i,j=1..d the weighting coefficients; l is a user-defined parameter, P the tree depth and p(i, j) the depth of the deepest common node of fields i and j (see examples on the right part of fig. 2): if l = 0, λij = δij and the kernel equals the euclidian scalar product; higher values of l give influence to the cross combination and express the fact that attributes belonging to the same branch play a more important role in the similarity than attributes that have little in common (cf. fig. 2). It is to be noticed that the kernel matrix can be computed as $K = X^T \Lambda X$ where X is the matrix representing the student results as vectors of their field values, and Λ the weighting coefficient matrix. Thus, provided Λ is positive semidefinite, which depends on the considered structure and applies in the considered case, k is indeed a kernel. This formulation highlights the fact that the proposed kernel is associated to a linear transformation of the data, the transformation being derived from the available xml structure. The kernel advantage as compared to a vectorial representation of the data is that the additional information is easier to encode in a similarity measure than in a transformation function: it is simpler to define the weights of the cross comparison than to express the corresponding data transformation. Moreover the kernel framework implies it is not necessary to recompute the update equations or to modify the algorithm, even if the considered scalar product has changed (e.g. as compared to the euclidian scalar product or the gaussian kernel used in section 3.4).
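A possible sketch of this construction, building Λ from a description of the tree and then the kernel matrix; the encoding of the tree as root-to-leaf paths of internal nodes and the handling of the diagonal entries (using the depth of the leaf's parent for p(i, i)) are assumptions made for the illustration only.

```python
import numpy as np

def lambda_matrix(paths, l=1.0):
    """Weighting matrix Lambda with lambda_ij = delta_ij + l / (P - p(i, j)).

    `paths` lists, for each field (tree leaf), the internal nodes from the
    root down to the leaf, e.g. ["courses", "theoretical"].
    """
    d = len(paths)
    P = max(len(p) for p in paths) + 1          # tree depth (leaves at depth P)
    Lam = np.eye(d)
    for i in range(d):
        for j in range(d):
            p_ij = 0                            # depth of deepest common node
            for a, b in zip(paths[i], paths[j]):
                if a != b:
                    break
                p_ij += 1
            Lam[i, j] += l / (P - p_ij)
    return Lam

def structure_kernel(X, Lam):
    """Kernel matrix for data given row-wise in X (shape n x d)."""
    return X @ Lam @ X.T
```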
4.3 Experimental Results
We applied the previous methodology to a real dataset describing 42 students according to the xml structure represented in Figure 2, corresponding to 8 fields. The algorithm built 6 clusters, represented in Figure 3 as vectors of their field values with parallel coordinates (although this representation may lead to a visual bias, as it suggests an attribute by attribute interpretation).
It can be seen that cluster 3 groups students who obtained really good results on the second or the third theoretical subjects, which compensate one for another; this shows that the obtained clusters take into account a correlation between the attributes. Cluster 2 is to be seen as students having slightly lower results for the theoretical parts and cluster 1 corresponds to those having difficulty with the first subject. The algorithm also identifies three outlying groups, which may require a specific treatment: cluster 5 corresponds to students having especially good results for all subjects, including the practical ones, and cluster 6 to students having failed on the theoretical parts. Lastly, cluster 4 corresponds to the only student who obtained a bad result for the internship, although his marks were acceptable or high for the course part.
5 Conclusion
We proposed a method to detect both major and marginal behaviours present in a dataset, so as to provide a complete and accurate simplified description of it, with the possibility of considering more flexible metrics than the euclidian one. We applied it to xml data, taking into account the available structure information to enrich data comparisons according to attribute correlations. An important problem not handled here is that of parameter selection, which is here performed empirically. One perspective of this work concerns automatic parameter selection methods, for the clustering algorithm as well as for the kernel itself. This probably requires the definition of a validation criterion, which must be independent of the metrics so as to enable their comparison. Such a criterion is all the more important as it is necessary to justify the clusters obtained in the feature space and to guarantee that they are relevant and not only a consequence of the metric change. Another perspective concerns the definition of cluster representatives so as to increase the interpretability of the clustering result, which may also have a role for cluster validation.
References

COLLINS, M. and DUFFY, N. (2002): Convolution Kernels for Natural Language. In: Advances in Neural Information Processing Systems, NIPS 14, 625–632.

KASHIMA, H. and KOYANAGI, T. (2002): Kernels for Semi-structured Data. In: Proc. of ICML'02, 291–298.

KLAWONN, F., KRUSE, R. and TIMM, H. (1997): Fuzzy Shell Cluster Analysis. In: G. della Riccia, H. Lenz, and R. Kruse (eds.): Learning, Networks and Statistics. Springer, 105–120.

LESOT, M.-J. and BOUCHON-MEUNIER, B. (2004): Descriptive Concept Extraction with Exceptions by Hybrid Clustering. In: Proc. of FUZZ-IEEE'04, 389–394.

SCHÖLKOPF, B. and SMOLA, A. (2002): Learning with Kernels. MIT Press.

VAPNIK, V. (1995): The Nature of Statistical Learning Theory. Springer, New York.

WU, Z., XIE, W. and YU, J. (2003): Fuzzy c-Means Clustering Algorithm Based on Kernel Method. In: Proc. of ICCIMA'03, 1–6.
Classification-relevant Importance Measures for the West German Business Cycle Daniel Enache, Claus Weihs, and Ursula Garczarek Department of Statistics, University of Dortmund , 44221 Dortmund, Germany
Abstract. When analyzing business cycle data, one observes that the relevant predictor variables are often highly correlated. This paper presents a method to obtain measures of importance for the classification of data in which such multicollinearity is present. In systems with highly correlated variables it is interesting to know what changes are inflicted when a certain predictor is changed by one unit and all other predictors according to their correlation to the first instead of a ceteris paribus analysis. The approach described in this paper uses directional derivatives to obtain such importance measures. It is shown how the interesting directions can be estimated and different evaluation strategies for characteristics of classification models are presented. The method is then applied to linear discriminant analysis and multinomial logit for the classification of west German business cycle phases.
1 Importance Measures in Classification

1.1 Economic Multipliers
Usually, the influence of one variable xi out of a set of variables xj, j = 1, . . . , p, on a function f at a given point x0 can be measured by the partial derivative ∂f(x0)/∂xi. This measure can be interpreted as the change of f, if xi is changed by one unit, given that the other variables xj, j = 1, . . . , p, i ≠ j, are held constant (ceteris paribus). This information is only of value if the variables are uncorrelated. An interesting approach to obtain more reliable measures that can be interpreted like economic multipliers is based upon averaging over orderings; see Lindeman et al. (1980) and Kruskal (1987) for linear regression and Enache and Weihs (2005) for classification. If multicollinearity is present, the economic multipliers obtained by these methods only reflect the relative importance of the predictors. The effect of changing one predictor variable, e.g. by fiscal policy, on the result is not measured realistically, since the change in one variable inflicts changes in the other variables as well.
This work has been supported by the Collaborative Research Center ‘Reduction of Complexity in Multivariate Data Structures’ (SFB 475) of the German Research Foundation (DFG).
1.2 Directional Derivatives
It is interesting to analyze the effects of changing one predictor along with the change of all the correlated variables on the result, for example, if one is interested in the effects of a certain fiscal policy action. To incorporate the relationships to all other variables correctly, directional derivatives instead of partial derivatives have to be used. The directional derivative of function f with respect to x ∈ IR^p at point x0 in direction d ∈ IR^p, ||d|| = 1, is defined as:

$$\frac{\partial_d f(x_0)}{\partial_d x} = \lim_{t \to 0} \frac{f(x_0 + t d) - f(x_0)}{t}, \qquad (1)$$
if the limit exists. For practical purposes, the relation between directional derivatives and the gradient can be used¹:

$$\frac{\partial_d f(x_0)}{\partial_d x} = d' \frac{\partial f(x_0)}{\partial x}. \qquad (2)$$

1.3 Interpretation of Directional Derivatives
Obviously, partial derivatives are a special case of directional derivatives, that is, derivatives in direction of the unit vectors. With directional derivatives, not only the change of one variable xi is analyzed, but the simultaneous change of all variables x1, . . . , xp, described by the direction in d. It can be interpreted geometrically as the slope of the tangential hyperplane at point x0, but unlike partial derivatives, the direction in which the slope is measured is not the coordinate axis of xi but rather the direction described by d.

1.4 Estimation of the Direction Vectors
In order to develop importance measures based on directional derivatives, the correct direction vectors have to be chosen. The direction vector corresponding to a change of xj by one unit and the corresponding changes of the other variables according to their correlation with xj is denoted by dj. The single elements of the direction vectors can be estimated using simple linear regressions of all xi on xj: xi = aij + bij xj + εi, i = 1, . . . , p. The coefficients bij can be estimated by sij/sjj and combined to a vector d̃j of estimated slope coefficients, which then has to be normalized to unit length, which gives

$$\hat{d}_j = \frac{\tilde{d}_j}{\|\tilde{d}_j\|} = \left( \frac{s_{1j}}{\sqrt{\sum_{i=1}^{p} s_{ij}^2}}, \; \ldots, \; \frac{s_{pj}}{\sqrt{\sum_{i=1}^{p} s_{ij}^2}} \right)' = \frac{S_{\bullet j}}{\|S_{\bullet j}\|}, \qquad (3)$$
that is, the jth normalized column of the empirical covariance matrix. 1
¹ The prime denotes the matrix transpose and ∂f(x_0)/∂x the gradient.
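To make equation (3) concrete, the following is a minimal numpy sketch (not part of the original paper): it computes the empirical covariance matrix of the predictors and scales each column to unit length, so that column j is the estimated direction vector for a change of x_j. The paper uses class-specific direction vectors; applying the same function to each class subsample would give those. All data values below are made up for illustration.

```python
import numpy as np

def direction_vectors(X):
    """Direction vectors of equation (3): column j of the empirical
    covariance matrix, scaled to unit length (sketch)."""
    S = np.cov(X, rowvar=False)            # empirical covariance matrix (p x p)
    return S / np.linalg.norm(S, axis=0)   # column j is the estimated d_j

# Illustration with synthetic, correlated data (values are made up)
rng = np.random.default_rng(0)
cov = [[1.0, 0.8, 0.2], [0.8, 1.0, 0.1], [0.2, 0.1, 1.0]]
X = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=200)
D = direction_vectors(X)
print(D[:, 0])   # direction associated with a change of the first predictor
```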
1.5 Importance Criteria in Classification
In classification models with K classes the influence of the p predictor variables x on p(k|x), the posterior class probability of class k, is of interest. Since there are K classes, the data has to be split into classes not only to estimate class means and covariance matrices, but also individual class-specific direction vectors d_{ki} for each variable i. The partial derivatives of selected classification models are given in section 2. The directional derivatives can then easily be computed using equation (2). High positive values mean a strong increase of the posterior class probability due to the change of the predictor, high negative values a strong decrease. The posterior probability function is usually not linear, and therefore the points at which the derivatives are evaluated have to be chosen carefully. In classification it is necessary to distinguish different importance criteria, which help in choosing the corresponding evaluation strategy².

Importance for class characterization: This importance measure should give an impression of the influence of the respective variable for staying inside the class, if the class membership is already quite obvious. For this purpose, the directional derivative can be evaluated at the respective class mean.

Importance for class separation:
• evaluate at the borders between classes and average
• evaluate at uniformly distributed randomly chosen points which are closest to the class borders and average
Such measures should give an impression of which variable has a strong importance for staying inside the class or entering another class. The first strategy would be more precise, but for some classification methods the class borders cannot be obtained analytically and have to be searched using grid search algorithms, which are computationally demanding in higher dimensions. The second strategy can then be used as an approximation.

Overall importance:
• evaluate at all data points in the sample and average
• evaluate at equally spaced grid points and average
• evaluate at uniformly distributed random points and average
² In the classification of time series data, especially cycle data, it may also be interesting which variables have the most influence on a class change, and one could evaluate the derivatives at the data points where these changes occur.
The first strategy takes the characteristics of the data into account, whereas the second strategy tries to cover the space as well as possible. For high dimensional problems, this is probably not computationally feasible, and therefore the third strategy is more convenient as an approximation of the second.
2 Derivatives for the Selected Classification Models
Since classical LDA without dimension reduction is used here, both LDA and Logit have K linear discriminant functions and can therefore be written as

p(k|x) = \frac{\exp(\alpha_k + x' \beta_k)}{\sum_{i=1}^K \exp(\alpha_i + x' \beta_i)}, \quad k = 1, \ldots, K,   (4)

with gradients

\frac{\partial p(k|x_0)}{\partial x} = p(k|x_0) \cdot \sum_{i=1}^K p(i|x_0)\,(\beta_k - \beta_i), \quad k = 1, \ldots, K.   (5)
For LDA the coefficients are \alpha_k = -\frac{1}{2}\mu_k' \Sigma^{-1} \mu_k + \ln(p(k)) and \beta_k = \Sigma^{-1} \mu_k. For the Logit model, \alpha_1 and \beta_1 are set to 0 (reference class). The unknown parameters for each model can be estimated from the sample.
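As an illustration of equations (2), (4) and (5), the following Python sketch (an assumed implementation, not the authors' code) computes the posterior probabilities, the gradient, and the directional derivative for a class k; the lda_coefficients helper plugs in the LDA coefficients stated above (assumed shapes: means (K, p), Sigma (p, p), priors (K,)).

```python
import numpy as np

def lda_coefficients(means, Sigma, priors):
    """Plug-in LDA coefficients as stated above (sketch)."""
    Sinv = np.linalg.inv(Sigma)
    beta = means @ Sinv                                   # beta_k = Sigma^-1 mu_k
    alpha = -0.5 * np.einsum('kp,kp->k', beta, means) + np.log(priors)
    return alpha, beta

def posterior(x, alpha, beta):
    """Posterior class probabilities of equation (4)."""
    z = alpha + beta @ x
    z -= z.max()                                          # numerical stabilisation
    e = np.exp(z)
    return e / e.sum()

def gradient(x0, alpha, beta, k):
    """Gradient of p(k|x) at x0, equation (5)."""
    p = posterior(x0, alpha, beta)
    return p[k] * np.sum(p[:, None] * (beta[k] - beta), axis=0)

def directional_derivative(x0, alpha, beta, k, d):
    """Directional derivative d' grad p(k|x0), cf. equation (2)."""
    return d @ gradient(x0, alpha, beta, k)
```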
3 Classification of the West-German Business Cycle

3.1 Data
The application of the importance measures is the classification of business cycle phases. Quarterly West German macroeconomic data are analyzed, consisting of 157 observations from 4/1955 to 4/1994. There are four classes corresponding to the four phases upswing (“up”), upper turning point (“utp”), downswing (“down”), and lower turning point (“ltp”). Heilemann and Münch (1996) reduced the predictors to a set of 13 important variables:
Y yearly growth rate of the gross national product (GNP)
C yearly growth rate of private consumption
GD government deficit as a proportion of the GNP
L yearly growth rate of the number of wage and salary earners
X net exports as a proportion of the GNP
M1 yearly growth rate of money supply
IE yearly growth rate of equipment investments
IC yearly growth rate of construction investments
LC yearly growth rate of labour unit costs
PY yearly growth rate of the GNP price deflator
PC yearly growth rate of the consumer price index
RS nominal short term interest rate
RL real long term interest rate
There is multicollinearity present in the data.
Var    up LDA    up Logit   utp LDA    utp Logit   down LDA   down Logit   ltp LDA    ltp Logit
Y       0.0003   -0.0007     0.0350     0.0210     -0.0086    -0.0072      -0.0129    -0.0010
C      -0.0046   -0.0026     0.0332     0.0162     -0.0063    -0.0046      -0.0076    -0.0004
GD     -0.0057   -0.0034    -0.0508    -0.0204     -0.0061    -0.0055      -0.0313    -0.0022
L       0.0012   -0.0021     0.0543     0.0290     -0.0077    -0.0072      -0.0218    -0.0016
X       0.0141    0.0050     0.0399     0.0180      0.0050     0.0093      -0.0036    -0.0008
M1     -0.0031   -0.0014     0.0425     0.0222     -0.0110    -0.0024      -0.0004     0.0000
IE      0.0028    0.0003     0.0267     0.0212     -0.0064    -0.0064      -0.0079    -0.0006
IC     -0.0012   -0.0008     0.0069     0.0019     -0.0069    -0.0028      -0.0109    -0.0008
LC     -0.0120   -0.0066    -0.0681    -0.0364      0.0060    -0.0017       0.0079     0.0009
PY     -0.0112   -0.0056    -0.0527    -0.0260      0.0056    -0.0030       0.0125     0.0011
PC     -0.0197   -0.0049    -0.0158    -0.0120      0.0100    -0.0018       0.0157     0.0012
RS      0.0003    0.0005     0.0374     0.0142      0.0176     0.0037       0.0068     0.0006
RL      0.0058    0.0049     0.0582     0.0272      0.0021     0.0011      -0.0038    -0.0006

Table 1. Variable importances (directional derivatives) for class characterization. The results of classical linear discriminant analysis and multinomial logit.
The highest positive correlations are 0.868 for LC and PY and 0.776 for C and Y. The highest negative correlations are -0.656 for RL and LC and -0.656 for RL and PY. The variable PY has a variance inflation factor³ (VIF, e.g. Neter et al. (1990)) of 16.254, which is definitely above 10. There are also three variables (Y, LC, and RL) with variance inflation factors over 7. VIFs are computed for the whole sample. Within-class multicollinearity is even stronger (e.g. VIF = 275.467 for PY).
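For readers who want to reproduce such diagnostics, here is a small sketch of the textbook VIF computation (my own illustration, following the standard definition cited above, not the authors' code):

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor of predictor j: 1 / (1 - R^2) from the
    regression of x_j on all remaining predictors (textbook definition)."""
    y = X[:, j]
    Z = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(Z)), Z])       # add an intercept
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

# Within-class VIFs would be obtained by applying vif() to each class subsample.
```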
3.2 Results
Class Characterization
Table 1 shows the results for both the linear discriminant analysis and the multinomial logit. The three most important variables for class “up” using LDA are PC, X, and LC. The directional derivative for PC is -0.0197. This means that if PC is increased by one unit (and all other variables according to their relationship with this variable) and the system is really deep in an upswing phase, then the probability of remaining in that phase decreases by 1.97 percentage points. Using the logit model, the variable PY appears to be more important than PC. For the “utp” class the three most important variables are LC, RL, and L in both models. In the “down” class the two models identify different sets of important variables. Whereas in LDA the three most important are
³ VIF is the factor by which the variance of the estimated linear regression coefficient is increased due to multicollinearity. Typically, 10 is considered a critical value, in some references 7.
Var    up LDA    up Logit   utp LDA    utp Logit   down LDA   down Logit   ltp LDA    ltp Logit
Y      -0.0002    0.0048     0.0269     0.0231     -0.0126    -0.0010      -0.0134    -0.0097
C      -0.0093   -0.0117     0.0244     0.0170     -0.0097    -0.0015      -0.0092    -0.0066
GD     -0.0097   -0.0182    -0.0331    -0.0151     -0.0087    -0.0020      -0.0267    -0.0300
L       0.0032    0.0059     0.0440     0.0248     -0.0116     0.0004      -0.0211    -0.0252
X       0.0261    0.0404     0.0251     0.0170      0.0076     0.0036       0.0001     0.0038
M1     -0.0073   -0.0172     0.0254     0.0199     -0.0134    -0.0037      -0.0036    -0.0020
IE      0.0041    0.0129     0.0241     0.0128     -0.0091    -0.0014      -0.0097    -0.0048
IC     -0.0022   -0.0043     0.0077     0.0019     -0.0095    -0.0004      -0.0113    -0.0075
LC     -0.0195   -0.0462    -0.0459    -0.0218      0.0057     0.0023       0.0077    -0.0056
PY     -0.0221   -0.0430    -0.0411    -0.0122      0.0051     0.0021       0.0108    -0.0020
PC     -0.0371   -0.0331    -0.0149    -0.0045      0.0117     0.0024       0.0141     0.0025
RS      0.0002    0.0132     0.0190     0.0208      0.0196     0.0060       0.0075    -0.0042
RL      0.0091    0.0304     0.0369     0.0189      0.0030    -0.0004      -0.0022     0.0111

Table 2. Variable importances (directional derivatives) for class separation. The results of classical linear discriminant analysis and multinomial logit.
RS, M1, and PC, the logit identifies X, Y, and L. Important for the class characterization of the “ltp” class in both models are GD, L, and PC. Note that the turning point classes in LDA have higher derivatives than the others, meaning that their class probabilities change more easily than those of the upswing and downswing classes. For the logit, on the other hand, the class “ltp” has low derivatives at the class centers.
Separation
Since a grid search for the class borders is computationally very demanding, points at the borders are searched using a random simulation (1,000 uniformly distributed data points have been generated for each class and the 10% of points closest to the borders have been chosen). Table 2 shows the resulting importance measures for class separation. For class “up” the three most important variables are PC, X, and PY. The value of -0.0371 for PC means that increasing PC by one unit will decrease the probability of staying in the upswing class by 3.7 percentage points. In the logit model LC is more important than PC. LC, L, and PY are the three most important variables for class “utp” in the LDA model, whereas in the logit model Y is more important than PY. In the “down” class LDA identifies RS, M1, and Y as the three most important variables, in contrast to the logit where X is more important than Y. For the “ltp” class the most important variables for LDA are GD, L, and PC, with RL instead of PC for the logit.
Overall Importance
Var    up LDA    up Logit   utp LDA    utp Logit   down LDA   down Logit   ltp LDA    ltp Logit
Y       0.0025    0.0020     0.0110     0.0124     -0.0049    -0.0043      -0.0077    -0.0076
C      -0.0065   -0.0087     0.0105     0.0092     -0.0034    -0.0029      -0.0049    -0.0040
GD     -0.0097   -0.0138    -0.0158    -0.0093     -0.0048    -0.0050      -0.0200    -0.0239
L       0.0055   -0.0002     0.0168     0.0152     -0.0028    -0.0004      -0.0141    -0.0183
X       0.0274    0.0260     0.0117     0.0092      0.0028     0.0110      -0.0006    -0.0026
M1     -0.0054   -0.0098     0.0127     0.0110     -0.0128    -0.0120      -0.0004     0.0003
IE      0.0075    0.0077     0.0077     0.0092     -0.0044    -0.0047      -0.0044    -0.0035
IC     -0.0018   -0.0033     0.0027     0.0016     -0.0050    -0.0022      -0.0065    -0.0061
LC     -0.0235   -0.0332    -0.0204    -0.0150      0.0096     0.0133       0.0032    -0.0004
PY     -0.0204   -0.0279    -0.0159    -0.0101      0.0091     0.0125       0.0061     0.0029
PC     -0.0372   -0.0220    -0.0042    -0.0041      0.0119     0.0083       0.0084     0.0053
RS      0.0022    0.0076     0.0120     0.0095      0.0232     0.0247       0.0030    -0.0006
RL      0.0106    0.0226     0.0176     0.0119      0.0017    -0.0034      -0.0005     0.0035

Table 3. Overall variable importances (directional derivatives) for each class. The results of classical linear discriminant analysis and multinomial logit.
Here, the directional derivatives have been evaluated at each point in the sample and the resulting values have been averaged. Table 3 shows the overall importance measures. In the “up” class the three most important variables for LDA are PC, X, and LC. The PC value of -0.0372 means that increasing PC by one unit will result in a decrease of the probability of staying in the upswing class. For the logit model RL appears to be more important than PC. The three most important variables for the “utp” class are LC, RL, and L for LDA and L, LC, and Y for the logit. In the “down” class RS, M1, and PC are important for LDA and RS, LC, and PY for the logit. And finally, for the “ltp” class the most important variables are GD, L, and Y in both models.
4 Summary and Outlook
A method has been introduced for obtaining measures of the relative importance of correlated predictors on the value change of a dependent variable using directional derivatives. It can be used with every differentiable output function, like linear and non-linear regression or the probability functions used in supervised classification. For classification problems, several strategies for choosing evaluation points have been presented and have been applied to the classification of west German business cycle phases. When comparing the importance measures for all criteria (characterization, separation, and overall importance), one can see that several variables are important for more than one criterion. These important variables are listed in Table 4. Apparently, L and LC have the highest importance overall, a result confirming earlier studies (e.g. Weihs and Garczarek (2002)). When using LDA, PC also is considered important and X likewise when using the
          LDA             Logit
up        X, LC, PY, PC   X, LC, PY, PC
utp       L, LC, RL       L, LC, RL, Y
down      M1, RS, PC      M1, RS, X
ltp       Y, GD, L        Y, GD, L
overall   L, LC, PC       L, LC, X

Table 4. Summary of important variables over all criteria. If one variable is important in more than one class, it is listed in the “overall” row.
logit model. It can also be seen that each class appears to have a different set of important variables. Whereas most of the classes are dominated by production and labor market variables, for the “utp” and “down” classes monetary variables (M1, RS, and RL) also have an important influence. In future work, directional derivatives of other classification methods, like Fisher’s LDA with dimension reduction, Quadratic Discriminant Analysis, and Support Vector Machines, still have to be derived. Other real-world applications require quantitative and qualitative variables as predictors. Obtaining the direction vectors for dummy variables used to represent qualitative predictors is a problem which has to be addressed first.
References
ENACHE, D. and WEIHS, C. (2005): Importance Assessment of Correlated Predictors in Business Cycles Classification. In: C. Weihs and W. Gaul (Eds.): Classification: The Ubiquitous Challenge, Springer, Heidelberg, 545–552.
HEILEMANN, U. and MÜNCH, H. J. (1996): West German business cycles 1963–1994: A multivariate discriminant analysis. In: CIRET-Conference in Singapore, CIRET-Studien 50.
KRUSKAL, W. (1987): Relative importance by averaging over orderings. The American Statistician, 41, 6–10.
LINDEMAN, R. H., MERENDA, P. F., and GOLD, R. Z. (1980): Introduction to Bivariate and Multivariate Analysis. Scott Foresman, Glenview, IL.
NETER, J., WASSERMANN, W., and KUTNER, M. H. (1990): Applied Linear Statistical Models: Regression, Analysis of Variance, and Experimental Designs, 3rd ed., Richard D. Irwin.
WEIHS, C. and GARCZAREK, U. (2002): Stability of multivariate representation of business cycles over time. Sonderforschungsbereich 475, Technical Report 20/2002, University of Dortmund.
The Classification of Local and Branch Labour Markets in the Upper Silesia
Witold Hantke
Higher School of Management and Social Sciences, Aleja Niepodleglosci 32, 42-100 Tychy, Poland
Abstract. The paper focuses on the differentiation of unemployment in Upper Silesia. All analyses have been carried out referring to both particular professions and subregions of Silesia. It shows how the internal diversity of the province influences the situation of the labour markets. It could be interesting to compare the present and future classifications, because the time scope of the research covers the year preceding the European integration.
1 Introduction
The year 1997 reversed the declining trend of the unemployment rate in Poland. This was caused by many factors: the worldwide recession, the demographic explosion, the necessity to restructure the heavy industry and many others. As a result, since the beginning of the 21st century about 20% of Poles of working age have remained unemployed, and this has again become a serious social problem. Before discussing the phenomenon of unemployment in Silesia, it is worth presenting the territorial scope of the research. In Poland, the province is the biggest unit of territorial division. The Silesian province consists of 15 terrestrial districts (containing many localities) and 21 town districts (independent towns with district laws). The strong internal diversity of Silesia is very important for its labour market. It is influenced by historical, geographical and economic factors. Because of these, Upper Silesia contains regions that are highly urbanized (for example the Upper Silesian Industrial Zone and the Rybnik Coal Zone), typically agricultural (the Cracow-Częstochowa Highland), as well as touristic (the Silesian and Żywiec Beskids). The district unemployment rate is highly differentiated; in June 2004 it varied from 7,9% in Katowice up to 30,9% in the neighbouring Siemianowice. We may assume that both the causes of the issue and the prospects for changes in the Silesian labour market are also differentiated, which makes it essential to investigate them in detail. The time scope of the research covers the period from 30.06.2003 to 30.06.2004. All analyses have been carried out in a dynamic perspective: not the actual state of the labour markets was taken into account, but the changes during the research period. Thus, the classifications do not refer to the actual situation but to the development tendencies of the Silesian labour market and the prospects for changes in the future.
G12 Managers of large and medium organisations
G13 Managers of small companies
G21 Experts in physics, mathematics and technical sciences
G22 Experts in natural sciences and health protection
G23 Experts in educational system
G24 Other experts
G31 Average technical staff
G32 Average staff in the field of biology and health protection
G33 Teachers of practical study of profession and instructors
G34 Workers of other specialities
G41 Office employees
G42 Workers of money circulation and client attendance
G51 Workers of personal services and security
G52 Models and salesmen
G61 Farmers
G62 Gardeners
G63 Foresters and fishermen
G64 Farmers and fishermen working for their own needs
G71 Miners and construction workers
G72 Workers of metal processing and mechanics of machinery
G73 Workers performing precise tasks, pottery makers, manufacturers of haberdashery, workers of the printing industry
G74 Remaining industry workers and craftsmen
G81 Operators of excavating and processing machinery and appliances
G82 Operators of machinery
G83 Drivers and vehicle operators
G91 Workers performing simple tasks, trade and services
G92 Helpers in agriculture, fishing and related branches
G93 Helpers in mining, industry, civil engineering and transport

Table 1. The branch labour markets
2 The Classification of the Branch Labour Markets
The first part of the paper refers to the classification of the markets of particular professions. They have been divided into clusters and later ranked. The branch labour markets have been assumed to cover 28 profession groups, which can be seen in Table 1. In order to describe the work supply and demand, the following indicators have been defined: NIU - the change of the number of unemployed during the research period per unemployed person; LU - the share of long-term unemployed (for over 1 year) in the overall number of unemployed; JO - the number of job offers per unemployed person; UJO - the share of job offers which have remained unused until the end of the research period in the overall number of job offers. Before the classification, the atypical groups have been eliminated from further research: G34 (the number of unemployed has increased here by 83%, in the other groups this change amounted from -24% to 25%); G61 (54% of unused job offers, in the other groups at most 11%); G22 and G92 (more than 1 job offer per unemployed person, in the other groups at most 0,63); G64 (in this group no job offers have been reported, which makes it impossible
        CL.1      CL.2      CL.3      CL.4
G31     6.045    -1.423    -3.323     3.629
G63   -14.940   -10.860     3.799    11.460
G81    -0.569     4.099    11.480    11.740
G91    -2.426     0.851     8.181    13.200
G93    -7.925    -2.907     4.310    11.650

Table 2. The values of the classification functions
Fig. 1. Dendrite of the Ward Method
to determine the UJO index). All diagnostic features have been normalised according to the formula

z_{ij} = \frac{x_{ij} - \min_i\{x_{ij}\}}{\max_i\{x_{ij}\} - \min_i\{x_{ij}\}},   (1)

where i is the number of the object and j the number of the feature.
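A one-line numpy version of this normalisation (an illustrative sketch; the columns are assumed to be the indicators NIU, LU, JO and UJO) could look as follows:

```python
import numpy as np

def minmax_normalise(X):
    """Normalisation of formula (1): every column is mapped to [0, 1]."""
    mins = X.min(axis=0)
    return (X - mins) / (X.max(axis=0) - mins)

# X is assumed to hold one row per profession group and the columns NIU, LU, JO, UJO.
```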
In order to distinguish classes similar in the field of job demand and supply, two taxonomic methods have first been applied: Ward and k-means. The dendrite of the Ward Method is presented on Figure 1. It shows that there are 4 clusters. The k-means method produced a similar division; only 5 groups have been qualified differently: G31, G63, G81, G83, G93. In order to perform their final classification, a discriminant analysis has been carried out. The remaining labour markets have been used as a training set. The values of the classification functions for every cluster are presented on Table 2. The objects have been assigned to the class for which the highest function value has been achieved (G31 - cl. 1; G63, G81, G83, G93 - cl. 4). Thus, the final classification is formed in this way:
CL. 1 = G12, G13, G21, G23, G24, G32, G51 covers mainly professions requiring higher education, for example managers or experts in the majority of fields. As Figure 2 shows, the number of the unemployed has risen here in the course of the last year. However, these professions are connected with the lowest fraction of long-term unemployed.
CL. 2 = G33, G71, G72, G74, G83 consists of miners and construction workers, steelworkers, drivers. 8-11% of job offers have remained unused here (the highest percentage out of all clusters). On the other hand, we are dealing here with a large share of long-term unemployed. This shows that a certain stagnation is taking place on this labour market.
CL. 3 = G41, G42, G82 covers office workers, workers of money circulation and equipment operators. They can be characterized mainly by the highest number of job offers per unemployed person.
CL. 4 = G52, G62, G63, G73, G81, G91, G93 is the most differentiated class: it consists of salesmen, farmers, operators of excavating equipment, and of workers performing the simplest tasks. It is characterized by the lowest number of job offers
Fig. 2. Descriptive statistics of clusters (mean, std. deviation, min, max) for the indicators NIU, LU, JO and UJO by cluster
per unemployed person. Also when it comes to the other criteria, this cluster's position is rather unfavourable. Thus the prospects for the unemployed of this group seem the worst. On the basis of the above classification, we can indicate which professions are in a better and which in a worse situation. In order to analyse this problem more deeply, a ranking has been created by applying the synthetic Hellwig measure, represented by the formula

d_i = 1 - \frac{d_{i0}}{\bar{d}_{i0} + 2\,S(d_{i0})},   (2)

where d_{i0} is the Euclidean distance from the pattern object, \bar{d}_{i0} the average of the d_{i0}, and S(d_{i0}) their standard deviation. The diagnostic features have been divided into stimulants (JO, UJO) and destimulants (NIU, LU). Because of the normalization method, formula (1), the coordinates of the pattern for the stimulants are equal to 1, and for the destimulants to 0. The ranked values of the Hellwig measure have created the ranking shown on Table 3. It confirms that the employees previously included in cluster 4 are in the worst situation. Teachers, on the other hand, have occupied the top of the ranking. They are characterized by a relatively low percentage of long-term unemployed (<30%). Approximately 0,4 job offers fall per unemployed person (above average). This confirms the opinion that teachers, despite low salaries, constitute a stable group when it comes to employment. However, it is worth noticing that all Hellwig measure values are significantly lower than 1. This means that the situation of all employee groups that underwent classification is far from perfect.
Pos.  Group     d_i      Pos.  Group     d_i
 1    G 23 (1)  0.518     13   G 74 (4)  0.271
 2    G 83 (2)  0.41      14   G 12 (1)  0.219
 3    G 71 (2)  0.398     15   G 81 (4)  0.194
 4    G 51 (1)  0.38      16   G 52 (4)  0.193
 5    G 72 (2)  0.341     17   G 31 (4)  0.185
 6    G 82 (3)  0.341     18   G 73 (4)  0.148
 7    G 21 (1)  0.338     19   G 13 (1)  0.115
 8    G 32 (1)  0.33      20   G 91 (4)  0.107
 9    G 42 (3)  0.325     21   G 62 (4)  0.091
10    G 33 (2)  0.307     22   G 93 (4)  0.081
11    G 41 (3)  0.302     23   G 63 (4)  -0.03
12    G 24 (1)  0.297

Table 3. The ranking of professions.
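The ranking can be reproduced with a short sketch of the Hellwig measure of formula (2) applied to the data normalised with formula (1); the function and variable names below are my own illustrative assumptions, not taken from the paper:

```python
import numpy as np

def hellwig_measure(Z, stimulants):
    """Synthetic Hellwig measure of formula (2) on min-max normalised data Z
    (rows: objects, columns: features); 'stimulants' is a boolean mask, so the
    pattern object is 1 for stimulants and 0 for destimulants."""
    pattern = np.where(stimulants, 1.0, 0.0)
    d0 = np.linalg.norm(Z - pattern, axis=1)        # distances to the pattern
    return 1.0 - d0 / (d0.mean() + 2.0 * d0.std())  # Hellwig measure d_i

# With columns (NIU, LU, JO, UJO) the mask would be [False, False, True, True];
# the ranking is then np.argsort(-hellwig_measure(Z, mask)).
```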
                CL.1    CL.2    CL.3    CL.4    CL.5
Wodzisław D.   22.05   12.04    5.03   21.51   -2.96
GLIWICE        13.04    9.74    3.39   13.31   -2.49
RUDA ŚL.       16.48    6.34   -5.21    8.87    7.34
RYBNIK          3.31    5.98   -4.15    2.42   -6.04
ŻORY            4.75    7.73   -6.42   -4.64   -2.63

Table 4. The values of the classification functions.
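Assigning each of the five ambiguous districts to the class with the highest classification function value amounts to a row-wise argmax; the sketch below simply re-uses the values of the table above (illustration only, not code from the paper):

```python
import numpy as np

# Classification function values of the five ambiguous districts (rows) for CL.1-CL.5
F = np.array([
    [22.05, 12.04,  5.03, 21.51, -2.96],   # Wodzislaw D.
    [13.04,  9.74,  3.39, 13.31, -2.49],   # GLIWICE
    [16.48,  6.34, -5.21,  8.87,  7.34],   # RUDA SL.
    [ 3.31,  5.98, -4.15,  2.42, -6.04],   # RYBNIK
    [ 4.75,  7.73, -6.42, -4.64, -2.63],   # ZORY
])
print(F.argmax(axis=1) + 1)   # -> [1 4 1 2 2]: class with the highest value
```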
3 The Classification of Local Labour Markets
In the previous chapters, the differentiated situation on the labour markets for particular professions has been discussed. In the next section, a parallel classification, referring to local labour markets, is carried out. It shows how the Silesian differentiation mentioned above influences the job supply and demand in particular districts. To describe the supply and demand, the same variables as in the previous research have been applied: NIU, LU, JO and UJO; they have also been normalised according to formula (1). Three atypical observations have been eliminated from further analysis: Mikołów District (over 28% of unused job offers, in other districts 0-15%), SIEMIANOWICE and Lubliniec District (0,72 and 0,6 job offers per unemployed person, in the other districts 0,01-0,45). As in the case of the branch labour markets, the classification of districts using the taxonomic methods was the first stage of the research. The results of the Ward Method are presented on Figure 3 (town districts have been written with capital letters), where we can observe that only one cluster on the left clearly stands out. The division of the remaining districts is ambiguous, because the Euclidean metric between clustered objects increased similarly for three steps of the algorithm. It seems most proper either to assume the existence of two classes, or to divide the second class, which is internally diverse, into four subclusters. Eventually, the second option has been chosen, which resulted in obtaining 5 classes. Under this assumption, the k-means method has afterwards been applied and its outcome has again been similar; the discrepancy refers to 5 districts: Wodzisław D., GLIWICE, RUDA ŚL., RYBNIK, ŻORY.

Fig. 3. Dendrite of the Ward Method.
In order to classify them, a discriminant analysis has been carried out. Table 4 contains the values of the classification functions. The final division of districts is shown on Figure 4 and the distributions of the features in every class on Figure 5. It shows that:
CL. 4 = {Racibórz D., GLIWICE, KATOWICE, TYCHY} - these are districts where the number of the unemployed has decreased in the course of the research period. There is also a relatively high job demand reported by the employers; the number of job offers per unemployed person is higher than in most of the remaining clusters. The unemployment rate at the end of the research period was relatively low here (7,9%-15,1%).
The next class with quite positive tendencies on the labour market is CL. 1 = {Rybnik D., DĄBROWA G., JASTRZĘBIE, MYSŁOWICE, RUDA ŚL., ŚWIĘTOCHŁOWICE}. Similarly as in cluster 4, there is a strong job demand. The difference between the two classes is that in cluster 1 the demand is not sufficiently used by the unemployed (5-11% of unused job offers in cl. 1 and 3-6% in cl. 4). On the one hand, the reserve of job offers constitutes a kind of advantage. On the other hand, it shows that cluster 1 is a more static market. It does not influence long-term unemployment, although it can be seen that in some districts of this class the number of unemployed has decreased, while in the other districts it has risen. At the end of the research period, the unemployment rate here was also relatively low (except for Świętochłowice and Rybnik District).
The characteristics of CL. 5 = {Gliwice D., Cieszyn D., SOSNOWIEC} also lead to interesting conclusions. On the one hand, this class has the lowest index of job offers per unemployed person (0,05-0,15), and on the other hand, most of these offers have remained unused (8-15%). Thus, the phenomenon of unemployment may result here from two factors: low job demand and its misfit to the existing supply. Despite this, the number of the unemployed in the districts belonging to this class has decreased in the course of the research period.
CL. 2 = {Będzin D., Bieruń D., Częstochowa D., Kłobuck D., Pszczyna D., Tarnowskie Góry D., Zawiercie D., RYBNIK, ŻORY} contains mainly terrestrial districts, in other words smaller localities. Here, we can observe the lowest index of unused job offers (0-4%). This shows that the job demand reported by the economic entities is fully exploited, which means that it is too small to balance the supply. It is worth mentioning that, except for Rybnik and Pszczyna District, these are all districts which have a high unemployment rate (for example in the northeast of the province). Therefore, the main source of problems on these labour markets is the disproportion between supply and demand.
The labour markets included in CL. 3 = {Bielsko D., Myszków D., Żywiec D., BIELSKO-BIAŁA, BYTOM, CHORZÓW, CZĘSTOCHOWA, JAWORZNO, ZABRZE} seem to have the worst prospects for changes. This class is also characterized by a very low job demand. Unlike in cluster 2, this fact influences long-term unemployment to a greater degree: as much as 41-45% of the unemployed have remained without a job for over a year (in the other clusters 32-39%). The localities of the Beskid Mountains and Częstochowa are nowadays
Fig. 4. The final division
Fig. 5. Descriptive statistics (NIU, LU, JO and UJO by cluster)
characterized by a relatively low unemployment rate. Therefore, we can only say that these labour markets face an unsteady and threatened future. However, for the typically working-class towns of the Upper Silesian Industrial Zone (like Chorzów, Bytom, Zabrze) and Myszków District, where the unemployment rate is already very high, the threats mentioned above may mean a long-term maintenance of the present situation. Thus they may face the most serious problems in their fight against unemployment. We already know which subregions of the province are characterized by the most favourable prospects on the labour market. A further step in this research has been the ranking of the districts, presented on Table 5. It confirms the conclusions made on the basis of the cluster analysis. The top of the ranking is occupied by districts previously included in the clusters with a situation considered most favourable (4 and 1). Ruda Śląska turned out to be the leader, as the number of the unemployed has decreased there by 4,5%. Long-term unemployed constitute 31% (the lowest percentage compared to the remaining districts). The opposite is represented by districts included in clusters 2 and 3, for example Myszków District with a 3,2% increase in the number of unemployed, a 45,1% fraction of long-term unemployed (the highest percentage) and almost no job offer reserve.
4 Conclusions
As a result of the performed analyses, we can say that the historical, geographical and economic variety of Silesia can also be seen in the differentiated causes of unemployment and the prospects for the particular labour markets.
P.   District           d_i      P.   District           d_i      P.   District           d_i
 1   RUDA ŚL. (1)       0.658    12   JASTRZĘBIE (1)     0.38     23   Będzin D. (2)      0.2
 2   DĄBR. G. (1)       0.598    13   ŚWIĘTOCH. (1)      0.349    24   JAWORZNO (3)       0.191
 3   Wodzisław D. (1)   0.574    14   Gliwice D. (5)     0.332    25   Bielsko-B. D. (3)  0.19
 4   GLIWICE (4)        0.518    15   MYSŁOWICE (1)      0.316    26   Kłobuck D. (2)     0.177
 5   KATOWICE (4)       0.481    16   ŻORY (2)           0.306    27   CHORZÓW (3)        0.176
 6   TYCHY (4)          0.457    17   BIELSKO-B. (3)     0.306    28   Częst. D. (2)      0.159
 7   Racibórz D. (4)    0.454    18   Bieruń D. (2)      0.301    29   BYTOM (3)          0.158
 8   RYBNIK (2)         0.433    19   Pszczyna D. (2)    0.266    30   CZĘSTOCH. (3)      0.157
 9   Rybnik D. (1)      0.42     20   PIEKARY ŚL. (2)    0.241    31   ZABRZE (3)         0.142
10   SOSNOWIEC (5)      0.402    21   Tarn. Góry D. (2)  0.234    32   Żywiec D. (3)      0.067
11   Cieszyn D. (5)     0.388    22   Zawiercie D. (2)   0.219    33   Myszków D. (3)     -0.01

Table 5. The ranking of districts
When it comes to the branch labour markets, there are 4 classes to be distinguished. The employees with higher education have the most favourable prospects for the future, which is greatly influenced by their mobility. It results in the lowest long-term unemployment. The class containing unemployed farmers and workers performing the simplest tasks shows the worst tendencies, which is mostly connected with the lowest demand for their work. Taking into account the territorial aspect, 5 subregions of Silesia have been distinguished. The important academic centres (Katowice, Gliwice), some towns belonging to the Upper Silesian Industrial Zone and the western part of the Rybnik Coal Zone are the labour markets with the most prospective future. They have a strong job demand and mostly a decreasing number of unemployed. The working-class towns of the Upper Silesian Industrial Zone, such as Chorzów, Bytom, Zabrze, and the whole northern part of Silesia form a region threatened by high long-term unemployment. Moreover, the Beskid Mountains may face similar problems in the future. It is worth mentioning that the discussed tendencies and prospects may undergo changes in the nearest future resulting from the European integration. The comparison of the present and future analyses may partially answer the question what influence the EU accession will have on the situation of the representatives of particular professions and the inhabitants of particular regions.
References
GATNAR, E. (1998): The symbolic methods of data classification. PWN, Warszawa.
HELLWIG, Z. (1968): The application of a taxonomic method for the classification of countries in account of their development, resources and the structure of the qualified teams. Statistical Review, 4.
WALESIAK, M. (1996): The analysis methods of marketing data. PWN, Warszawa.
An Overview of Artificial Life Approaches for Clustering
David Kämpf and Alfred Ultsch
Databionics Research Group, Institute for Mathematics and Computer Science, Philipps Universität Marburg, 35032 Marburg/Lahn, Germany
Abstract. Recently, artificial life approaches for clustering have been proposed. However, research on artificial life has mainly concerned the simulation of systems based on models of real life. In addition, artificial life methods have been utilized to solve optimization problems. This paper gives a short overview of artificial life and its applications in general. From this starting point we will focus on artificial life approaches used for clustering. These approaches are characterized by the fact that solutions are emergent rather than predefined and preprogrammed. The data are seen as active rather than passive objects. New data can be added incrementally to the system. We will present existing concepts for clustering with artificial life and highlight their differences and strengths.
1 Introduction
Artificial life research has many different fields. These include the simulation of real life to study its principles and dynamics as well as the utilization of artificial life principles for engineering tasks. The way ants collectively find the shortest way from their nest to a food source, for example, has led to ant-inspired solutions for the traveling salesman problem (TSP). Artificial life based approaches are characterized by the fact that a solution is emergent rather than preprogrammed. Thus the main question is how to design an artificial life system for a specific task. This paper starts with a short general overview of artificial life models and their applications in Chapters 2 and 3. Based on this we will focus on artificial life systems for clustering in Chapter 4 and develop a taxonomy for this field in Chapter 5.
2 Characteristics of Artificial Life Systems and Models
Artificial life systems are composed of a great number of similar and rather simple entities. These entities can communicate with each other either directly or indirectly through manipulations of the environment. The indirect communication is called stigmergic. The concept of stigmergy (from French: stigmergie) was discovered by Grassé (1959). Important for the process of
self organization is the positive feedback which drives the autocatalytic process and some kind of competition among the entities to balance it. In the following the basic properties of artificial life systems are outlined:
• Population based approach
  – Composed of simple entities (e.g. each entity does not necessarily have a local memory)
  – Distributed computation / solution finding
  – Avoids premature convergence
• Robust
• Positive feedback / autocatalytic process
  – Rapid discovery of good solutions
• Constructive greedy heuristic
  – Helps to find good solutions early in the search process
• Stigmergic approach
  – Entities communicate indirectly through environmental changes
The constructive greedy heuristic applies mainly to optimization problems. In the case of the TSP the greedy heuristic would be to select the nearest node. In the case of clustering problems one has to think about meaningful distances or similarity measures. We will describe the basic mechanisms and characteristics of these algorithms in Chapters 3.2 and 4. The main benefits of artificial life systems are their robustness and simplicity. Robustness means the system can adapt easily to changing boundary conditions or missing data. The stochastic nature of these systems prevents the convergence to local minima. In addition to that, artificial life systems are easy to implement because they consist of simple basic entities. However, designing these entities and the system dynamics for a specific task can be complex.
3 Fields of Artificial Life Research

3.1 Simulations of Real Life
The well known cellular automaton (Conway (1970)) is an example of the simulation of life. The cells live in a two dimensional universe and go through the whole cycle of birth, survival and death according to rules. The simulation of human behavior can also be categorized as a simulation of real life. The research on the dynamics of crowds (see Helbing et al. (2002)) is interesting in two ways. On the one hand the simulations reveal general principles of emergence and self organization. On the other hand these simulations give valuable insights into the dynamics of panic situations (Helbing et al. (2000)), which are especially important for architecture.
Fig. 1. Worker ants organize a cemetery (Bonabeau, Dorigo, Théraulaz (1999))
3.2 Artificial Life Solutions for Optimization Problems
Ant Colony Optimization (ACO) (Dorigo and Di Caro (1999)) is a metaheuristic which can be adapted to various optimization problems. ACO is the foundation of ant based algorithms in this field. Equations 1 and 2 are taken from the ant algorithm for the TSP, but they are a good example for the basic dynamics of this class of algorithms.

a_{ij}(t) = \frac{\tau_{ij}(t)^{\alpha}\,\eta_{ij}^{\beta}}{\sum_{l \in N_i} \tau_{il}(t)^{\alpha}\,\eta_{il}^{\beta}}   (1)

P_{ij}^{k}(t) = \frac{a_{ij}(t)}{\sum_{l \in N_i^k} a_{il}(t)}   (2)

In Equation 1, τ_{ij} is the pheromone intensity for arc (i, j) and η_{ij} is the heuristic value, which is problem specific and in the case of the TSP can be chosen as η_{ij} = 1/d_{ij}. N_i is the neighborhood of node i, N_i^k is the same neighborhood but without the nodes already visited by ant k. P_{ij}^k(t) is the probability for ant k to move from node i to j at time step t. The balance between the self-catalytic process driven by the pheromone attraction and the greedy heuristic which is derived from the problem characteristics is crucial for these algorithms. This balance means adjusting α and β in an appropriate way. Another important element of ACO, or of ant algorithms which utilize pheromone trails in general, is the evaporation of pheromones. This avoids stagnation and enables new solutions to emerge. There are ACO based approaches for many other NP-hard problems like the Job Scheduling Problem (JSP), the Vehicle Routing Problem (VRP) and the Quadratic Assignment Problem (QAP) (see Dorigo and Di Caro (1999) for an overview). ACO by its nature is easier to adapt to problems with a graph representation.
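As a sketch of how Equations 1 and 2 are used in practice (my own illustration, not code from the cited papers), note that the denominator of Equation 1 cancels in Equation 2, so a single normalisation over the nodes still allowed for ant k suffices; the parameter values below are illustrative assumptions:

```python
import numpy as np

def transition_probabilities(i, tau, eta, allowed, alpha=1.0, beta=2.0):
    """Move probabilities for an ant at node i over the not yet visited
    neighbours 'allowed'; tau is the pheromone matrix, eta the heuristic
    values (1/d_ij for the TSP). Because the denominator of equation (1)
    cancels in equation (2), one normalisation is enough."""
    a = tau[i, allowed] ** alpha * eta[i, allowed] ** beta
    return a / a.sum()

def evaporate(tau, rho=0.1):
    """Pheromone evaporation with an (illustrative) rate rho."""
    return (1.0 - rho) * tau
```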
4 Artificial Life for Clustering
Figure 1 (Bonabeau, Dorigo, Théraulaz (1999)) shows worker ants self organizing a cemetery. The worker ants sort the dead corpses. This behavior was the inspiration for the development of clustering algorithms based on artificial life dynamics. The basis for clustering with artificial life was established
by Deneubourg et al. (1991) with their work on the self organization of a sorting behavior in a group of autonomous robots. This was accomplished by introducing probabilities for each robot to drop or to pick up an object. Robots would pick up an object with high probability if the object does not have any neighbors at all or if the object is surrounded by dissimilar ones. The robots would drop an item in a neighborhood with items that fitted. Equations 3 and 4 describe the system dynamics. In these equations f(i) is the density of similar items in the neighborhood N_i of robot i. There is a constant k^+ for picking and k^- for dropping operations.

P_{pick}(i) = \left(\frac{k^+}{k^+ + f(i)}\right)^2   (3)

P_{drop}(i) = \left(\frac{f(i)}{k^- + f(i)}\right)^2   (4)
This has been extended for clustering tasks by Lumer and Faieta (1994) by using \hat{f}(i) and \hat{P}_{drop} as in Equations 5 and 6. Lumer and Faieta's work has been the basis for ant based clustering algorithms. Thus the entities are called ants instead of robots.

\hat{P}_{drop}(i) = \begin{cases} 2\hat{f}(i) & \text{if } \hat{f}(i) < \hat{k}^- \\ 1 & \text{otherwise} \end{cases}   (5)

\hat{f}(i) = \begin{cases} \frac{1}{\sigma^2} \sum_{j \in N_i} \left(1 - \frac{d(i,j)}{\alpha}\right) & \text{if } \hat{f}(i) > 0 \\ 0 & \text{otherwise} \end{cases}   (6)
In Equation 6 the term 1/σ² normalizes \hat{f}, σ being the size of the neighborhood N_i. The term d(i,j)/α implements a fixed threshold on the dissimilarities d ∈ [0, 1]. The problem with these approaches was that algorithms based on Equations 3, 5 and 6 tend to produce too many too small clusters. For this reason Monmarché et al. (1999) introduced a three-stage process with two ant based clustering phases and a final one with the k-means algorithm. Ultsch (2000) presented a non-ant based approach which solved the problem with a decreasing neighborhood radius (see also Chapter 4.1). The idea of the decreasing radius is to first self organize a global structure and then fine tune the clustering result. Recently Handl et al. (2003a) introduced an improved ant based clustering algorithm with an increasing neighborhood radius. The increasing radius has advantages with respect to runtime. In addition to that, Handl et al. solved the problem of finding the right set of system parameters by introducing rules for their dynamic adaptation.
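A compact Python sketch of the pick and drop dynamics of Equations 3-6 might look as follows; the constants k⁺ and k⁻ and the function names are illustrative assumptions, not taken from the text:

```python
import numpy as np

def p_pick(f, k_plus=0.1):
    """Pick-up probability of equation (3); k_plus is an illustrative value."""
    return (k_plus / (k_plus + f)) ** 2

def p_drop(f, k_minus=0.15):
    """Drop probability of equation (4); k_minus is an illustrative value."""
    return (f / (k_minus + f)) ** 2

def density(i, data, neighbours, sigma, alpha):
    """Neighbourhood density f^(i) in the spirit of equation (6): average
    perceived similarity of item i to the items in its neighbourhood."""
    if len(neighbours) == 0:
        return 0.0
    d = np.linalg.norm(data[neighbours] - data[i], axis=1)
    return max(0.0, np.sum(1.0 - d / alpha) / sigma ** 2)
```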
4.1 The ALF Approach
The ALF approach (Ultsch 2000) identifies artificial life entities with data points. This distinguishes the approach from most other artificial life systems
for clustering, where the artificial life entities move the data points in the universe. The three components of the model are:
• a two dimensional, discrete universe
• strategies that control the entities' movements based on sensory input
• entities that represent data points
The movement of each entity is computed on the basis of a movement table. The movement table holds the probabilities for the movement into each direction (north, east, south, west) and a probability for standing still. Every entity has its own movement table, which is updated with every time step:

P_k = (P_{South}, P_{East}, P_{North}, P_{West}, P_{Origin})^T, with k the entity number.   (7)
Movement tables are generated by strategies. A very simple strategy is persistency: v_k(t) = v_k(t − 1), i.e. the entity keeps moving in the same direction. More sophisticated strategies also take the sensory input s into account. The sensory input can be olfactory, like pheromone trails or the odor of other entities. We therefore define a sensory radius r_sensory. Entities can sense their neighbors on the map within r_sensory. The radius can be constant or decreasing during the simulation. Within that sight radius we divide the entities into similar and dissimilar ones. Dissimilarity of entities is defined as the Euclidean distance between the entities' data points. We call the n% most similar entities friends (within sight radius r_sensory) and the remaining ones foes. The movement table is then computed in such a way that the entity will move away from its foes and towards its friends. The strength of this mechanism is the adaptiveness of the threshold. We will now give a short outline of the main characteristics of the ALF approach. It is a self organizing system that shows emergent solutions. It is an unsupervised approach. In particular, no assumptions are made about the number of clusters or their size. This enables the system to discover new (so far unknown) clusters. This is important for knowledge discovery. In addition to clustering it delivers a projection of the data set onto the two dimensional grid. The ALF system has been successfully applied to numerous artificial and real world data sets (Ultsch 2001, 2002, 2004).
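The movement table of Equation 7 could be sketched as follows; this is a hypothetical illustration of the friends/foes idea described above (the coordinate convention, the constant weight for standing still and all parameter values are my own assumptions, not taken from the ALF publications):

```python
import numpy as np

MOVES = ["south", "east", "north", "west", "origin"]

def movement_table(entity, positions, data, r_sensory=5, n_friends=0.2):
    """Hypothetical ALF-style movement table: neighbours within the sensory
    radius are split into friends (the n_friends share most similar in data
    space) and foes, and the probabilities favour axis moves towards the
    friends' mean position (east = +x, north = +y)."""
    me = positions[entity].astype(float)
    dist_map = np.linalg.norm(positions - me, axis=1)
    neigh = np.where((dist_map > 0) & (dist_map <= r_sensory))[0]
    probs = np.full(5, 0.2)                       # uniform if nothing is sensed
    if len(neigh) > 0:
        sim = np.linalg.norm(data[neigh] - data[entity], axis=1)
        k = max(1, int(n_friends * len(neigh)))
        friends = neigh[np.argsort(sim)[:k]]
        target = positions[friends].mean(axis=0) - me   # direction to friends
        weights = np.array([max(0.0, -target[1]),       # south
                            max(0.0,  target[0]),       # east
                            max(0.0,  target[1]),       # north
                            max(0.0, -target[0]),       # west
                            0.5])                       # standing still
        probs = weights / weights.sum()
    return dict(zip(MOVES, probs))
```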
5 Taxonomy of Existing Approaches
See Figure 2 for the taxonomy tree of artificial life approaches for clustering. Clustering algorithms can be divided into two subgroups: symbolic and subsymbolic approaches. While the sub-symbolic approaches operate directly in the problem’s data space, the symbolic approaches operate on symbolic representations of the clusters. If the generation of knowledge is the goal and knowledge should be understandable for a human expert, symbolic approaches are preferable. Symbolic approaches would be operating on rules
Fig. 2. Taxonomy of Artificial Life Approaches for Clustering
that define the clusters rather than on the data itself. But there are no artificial life based symbolic approaches for clustering yet. There is an interesting symbolic approach for classification with artificial life systems called AntMiner by Parpinelli et al. (2002). The next stage in the taxonomy diagram reflects the system architecture. There are three philosophies for the representation of data points in the artificial life system. One is to represent data points as objects in the universe. These data points are then moved by entities (e.g. ants) in order to form clusters. Examples for such approaches are AntClass (Monmarché et al. (1999)), ACluster (Ramos et al. (2002)) and ant-based clustering (Handl and Meyer (2002), Handl et al. (2003b)), among others. The opposite strategy to this is to identify each artificial life entity with one data point. The entities move in the universe to form groups of similar entities. Examples for this design are ALF (Ultsch (2000)) and Visual AntClust (Monmarché et al. (2003)). The third design approach in this area is to let the entities move in the solution space, which is in general n-dimensional. For clustering purposes one might just think of the space of all k-means solutions. An interesting approach of this type is Particle Swarm Optimization (PSO). PSO is a swarm based optimization algorithm and it has been applied to clustering problems by Van der Merwe and Engelbrecht (2003).
6 Conclusion and Summary
Artificial life systems have many applications, from the study of real life's dynamics through optimization problems to clustering. The simulation of real life has a unique position here. For optimization problems, artificial life metaheuristics have advantages over existing algorithms mainly with respect to flexibility, and these benefits are already used in real world applications. Artificial life approaches for clustering have some interesting properties compared to conventional approaches that make them attractive for further studies. Features like unsupervisedness, self organization and emergence are important. Together with the fact that new data sets can be added quickly and without recomputing the whole data set, this is a unique combination of features. The Databionics Research Group has an approach (see Chapter 4.1) that has proved its performance on artificial and real world data sets. This will be the basis for our further work, especially in the field of symbolic approaches.
References DENEUBOURG, J.-L., GOSS, S., FRANKS, N., SENDOVA-FRANKS, A., DE´ TRAIN, C., and CHETIEN, L. (1991): The dynamics of collective sorting: Robot-like ants and ant-like robots. In: J.-A. Meyer and S. W. Wilson (Eds.): Proceedings of the First International Conference on Simulation of Adaptive Behaviour: From Animals to Animats 1, MIT Press, Cambridge, MA, 356–363. DORIGO, M. and DI CARO, G. (1999): Ant Algorithms for Discrete Optimization. Artificial Life, 5, 137–172 HANDL, J. and MEYER, B. (2002): Improved ant-based clustering and sorting in a document retrieval interface. In: Proceedings of the Seventh International Conference on Parallel Problem Solving from Nature (PPSN VII), volume 2439 of LNCS. Springer, Berlin, 913–923. GAMBARDELLA, L.M., RIZZOLI, A.E., OLIVERIO, F., CASAGRANDE, N., DONATI, A.V., MONTEMANNI, R. and LUCIBELLO, E. (2003): Ant Colony Optimization for vehicle routing in advanced logistic systems. In: Proceedings of MAS 2003 — International Workshop on Modelling and Applied Simulation, Bergeggi, Italy, 3–9. ´ P. P. (1959): La Reconstruction du nid et les Coordinations InterGRASSE, Individuelles chez Bellicositermes Natalensis et Cubitermes sp. La th´eorie de la Stigmergie: Essai d’interpr´etation du Comportement des Termites Constructeurs. In: Insect Soc., 6, 41–80. HANDL, J., KNOWLES, J., and DORIGO, M. (2003b): On the performance of ant-based clustering. In: Design and Application of Hybrid Intelligent Systems, Vol. 104 of Frontiers in Artificial Intelligence and Applications, IOS Press, 204–213. HANDL, J., KNOWLES, J. and DORIGO,M (2003a): Strategies for the increased robustness of ant-based clustering. In: Self-Organising Applications: Issues, challenges and trends, LNCS 2977, Springer-Verlag, 90–104.
HELBING, D., FARKAS, I. and VICSEK, T. (2000): Simulating dynamical features of escape panic. Nature, 407, 487–490. ´ HELBING, D., FARKAS, I., MOLNAR, P. and VICSEK, T. (2002): Simulation of pedestrian crowds in normal and evacuation situations. In: M. Schreckenberg and S. D. Sharma (eds.) Pedestrian and Evacuation Dynamics. Springer, Berlin, 21–58. ´ N. and VENTURINI, G. (2003): Visual clusLABROCHE, N., MONMARCHE, tering with artificial ants colonies. Seventh International Conference on Knowledge-Based Intelligent Information & Engineering Systems (KES 2003) LUMER, E. and FAIETA, B. (1994): Diversity and adaption in populations of clustering ants. In: Proceedings of the Third International Conference on Simulation of Adaptive Behaviour: From Animals to Animats 3, MIT Press, Cambridge, MA, 501–508. ´ N., SLIMANE, M, and VENTURINI, G. (1999): Antclass: disMONMARCHE, covery of clusters in numeric data by an hybridization of an ant colony with the kmeans algorithm. Rapport interne 213, Laboratoire d’Informatique de l’Universit´e de Tours PARPINELLI, R.S., LOPES, H.S. and FREITA, A.A. (2002): An Ant Colony Algorithm for Classification Rule Discovery. In: H. Abbass, R. Sarker, C. Newton. (Eds.) Data Mining: a Heuristic Approach, pp. 191-208. London: Idea Group Publishing. RAMOS, V., MUGE, F. and PINA, P. (2002): Self-Organized Data and Image Retrieval as a Consequence of Inter-Dynamic Synergistic Relationships in Artificial Ant Colonies. In: Javier Ruiz-del-Solar, Ajith Abraham and Mario K¨ oppen (Eds.), Frontiers in Artificial Intelligence and Applications, Soft Computing Systems - Design, Management and Applications, 2nd Int. Conf. on Hybrid Intelligent Systems, IOS Press, Vol. 87, 500–509. ULTSCH, A. (2004): Strategies for an Artificial Life System to cluster high dimensional Data. In: Abstracting and Synthesizing the Principles of Living Systems, GWAL-6, Bamberg, pp. 128–137. ULTSCH, A. (2002): Data Mining as an Application for Artificial Life. In: Abstracting and Synthesizing the Principles of Living Systems, GWAL-5, L¨ ubeck, pp. 191–199. ULTSCH, A. (2001): DataBots: Data Mining as an Application for Autonomous Minirobots., In Proc. 1st International Conference on Autonomous Minirobots for Research and Edutainment - AMiRE, Paderborn, pp. 59 - 73. ULTSCH, A. (2000): Visualisation and Classification with Artificial Life. In: Proc. Conf. Int. Fed. of Classification Societies ifcs 2000 Namur, Belgium. VAN DER MERWE, D. W. and ENGELBRECHT, A. P. (2003): Data clustering using particle swarm optimization. In: Proceedings of IEEE Congress on Evolutionary Computation 2003 (CEC 2003), Canbella, Australia, 215–220.
Design Problems of Complex Economic Experiments
Jonas Kunze
Institut für Informationswirtschaft und -management, Universität Karlsruhe (TH), 76128 Karlsruhe, Germany
Abstract. Economic experiments are a source of valuable data about economic decision making. Although a lot of experiments exist, there are few in which the subjects face complex decision tasks. Inventory management problems in supply chain management represent such complex decision tasks because of their time delays and nonlinearities. The published experiments on inventory management in supply chains are reviewed and some of the design problems of these experiments are discussed. The paper especially focuses on incentives, presentation effects and the concreteness of the experiment.
1 Motivation
Experimental economics presents an interesting field for data collection, as experiments allow a high degree of control. Hence, the data collected in an experiment can be judged as valuable and reproducible. It can be used to verify or to disprove theories. This may explain the great success that experimental approaches have had during the last decades within the field of economics. On the other hand, each observation in a controlled experiment is not free of costs. The experimenter pays the subject a decision-dependent amount of money that usually is proportional to the time the subject spent in the experiment. Economic experiments, such as market, game and individual-choice experiments, generally are not very complex. Due to limited time and money, the compromise between design quality and complexity of the experiment is generally made in favor of the first. Therefore, basic economic assumptions about decision behavior in simple situations are well investigated, while there is still a lack of knowledge about decision behavior in more complex scenarios. In economic experiments there exists a variety of instruments to assure control. Davis and Holt (1993) classify these instruments into procedural regularity, motivation, unbiasedness, calibration and design parallelism. Procedural regularity mainly focuses on allowing reproducibility of the experiment, and the practices should be reported when publishing an experiment. The practices include, amongst others: instructions, illustrative examples, tests of understanding, the existence of training rounds, and computerization.
Motivation depends on the incentives that are presented in the experiment. Based on the nonsatiation assumption that the subjects' utility is a monotone increasing function of the monetary reward (Smith (1976)), a payoff function based on total costs serves to minimize these costs. However, the payoff must be salient (Davis and Holt (1993)). A payoff is defined as salient if the subject perceives a relationship between his decision and the reward paid and if the financial reward is higher than the subject's decision making costs. Furthermore, the field of motivation includes the use of laboratory currency, a show-up fee and privacy about the own payoff function. Unbiasedness includes the terminology used in instructions and questionnaires and double-blind settings. If everyday words are used in the experiment, a subject's decision may be affected by her own interpretation of these words. A double-blind setting refers to a situation where the subjects as well as the persons conducting the experiment do not know about the theoretical prediction of the experiment. Calibration refers to the existence of a base line treatment. This assures that the experiment's data is comparable with a reference point. Finally, design parallelism treats the level of realism in the experiment, i.e. the closeness to realistic situations. As noted in the beginning, economic experiments used to be quite simple in contrast to real economic decisions. In more complex scenarios the main question is whether the presented instruments can be applied in a similar way. The paper is organized as follows. In section 2, a complex scenario that has been adapted to economic experiments is presented and the conducted experiments are shortly described. In section 3, three problems regarding the setup of these complex experiments are discussed and open issues are specified. The paper closes with a short summary.
2 A Complex Experiment
A well known scenario for a non-trivial economic decision task is the Beer Distribution Game (BDG). The BDG is described e.g. in Sterman (1989); for its origins see Forrester (1961). It is a multiplayer and multiround game with many variables that are influenced by the players' past decisions. The BDG simulates a linear supply chain in a simplified production and distribution system. In the classical version four players represent the retailer, the wholesaler, the distributor and the factory of a unique commodity (here: beer kegs). All players have a stock where units of the good are stored. The experiment consists of a number of rounds. The decision variable of each player is the number of units to order from the next upstream position in each round. In the case of the factory, the player chooses the number of units to produce. Placed orders are not shipped immediately but with a time delay. Usually, after 2 rounds the order reaches the next upstream position and the chosen number of units is taken out of that stock. After
2 additional rounds, the shipment reaches the original player. An external demand function exists that represents the demand of the customer. Each round the customer's demand is taken out of the retailer's stock. The payoff function is influenced by the transaction costs generated by the players. The transaction costs consist of storage costs for stored goods and backlog costs in the case that orders exceed stock (the backlog is preserved until it can be accommodated by incoming shipments). Generally, the backlog cost per unit is twice the storage cost per unit. The complexity of the BDG is characterized by a high-order nonlinear difference equation; the order is 23 in the case of the classical BDG. The order of the difference equation is highly influenced by the time delays of orders and shipments. At the same time, the delays in the system make feedback learning difficult. Approaches in machine learning to solve this type of problem exist. One example is the bucket brigade mechanism for the evaluation of decision rules used in adaptive, general purpose machine learning systems like the Holland Classifier Systems (Holland et al. (1989)). The bucket brigade algorithm values interconnected decision rules. Theoretically, this should enable it to recognize good decision rules that are acting in a feedback environment with a significant time delay between action and payoff, because the final payoff is given to all of the temporally interconnected decision rules preceding it. In practice, studies indicate (cf. Geyer-Schulz (1995)) that there are problems in reaching optimal decision behavior. Therefore, the decision task the players of the BDG face can be judged as complex. The first publication in this field was by Sterman (1989). Sterman uses classroom experiments of the classical BDG and applies an ordering heuristic which is fitted to the subjects' orders. Gupta et al. (2002) use a variant of the BDG with 3 subjects forming a group. They test for the effect of changing lead times and public availability of point-of-sale (POS) information (cf. also Steckel et al. (2004)). Kaminsky and Lovallo (1998) use quite a different approach: a group consists of only one human player, while the other decision makers are computer controlled. They test for the influence of shortened lead times. Croson and Donohue (2003) use the classical BDG with 4 subjects in one group. The paper investigates the influence of public POS data on the bullwhip effect. Two experiments based on the BDG have been conducted by the author. The first experiment (experiment A) took place in February 2004 and included 60 subjects (volunteers from lectures, mailing list members). Each supply chain consisted of two subjects and the external demand function was a constant function with one jump in round 5 to a higher level. This demand function was the same as that used by Sterman (1989) and it was not known to the subjects. The payoff scheme used was a tournament scheme, i.e. the best groups received payoffs. The experiment was done within the scope of a diploma thesis and financial rewards were not salient. The second experiment (experiment B) was conducted in March 2005 and 87 subjects participated.
Here, subjects were selected randomly from a database of participants in economic experiments at the University of Karlsruhe. Three subjects formed one group. The demand function was a uniform distribution from 0 to 10 and stochastic information about the demand function was provided to the subjects. The payoff of a subject was 30 euro minus twice the group's transaction costs in cents. The experiments used two treatments that differed in the system status information available to the subjects. Subjects were randomly assigned to the two treatments. In the first treatment, in addition to local information about the own stock and backlog and the own order history, all shipment information and stock information of the other subjects in the group were provided. Also, actual customer demand was known to the subjects. In the second treatment, only local information was available. The treatment was the same for all groups in one session. All groups were synchronized, i.e. a round was finished when all subjects had made and confirmed their decisions.
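To make the state transition of the BDG concrete, the following minimal Python sketch simulates a single supply-chain position over a sequence of rounds. It is only an illustration of the mechanics described above: the rest of the chain is collapsed into a perfectly reliable upstream supplier, and the initial stock, pipeline contents and ordering policy are assumptions, not parameters reported by the cited experiments.

    from collections import deque

    def play_bdg(orders, demand, initial_stock=12, order_delay=2, ship_delay=2,
                 storage_cost=1.0, backlog_cost=2.0):
        # Single-position sketch: the upstream side is assumed to ship exactly
        # what was ordered, arriving order_delay + ship_delay rounds later.
        stock, backlog, total_cost = initial_stock, 0, 0.0
        pipeline = deque([4] * (order_delay + ship_delay))  # units already under way (assumption)
        for order, d in zip(orders, demand):
            stock += pipeline.popleft()      # incoming shipment arrives
            pipeline.append(order)           # this round's order arrives later
            wanted = d + backlog             # serve current demand plus old backlog
            shipped = min(stock, wanted)
            stock -= shipped
            backlog = wanted - shipped       # unmet demand is preserved as backlog
            total_cost += storage_cost * stock + backlog_cost * backlog
        return total_cost

    # Sterman-type step demand: 4 units per round, jumping to 8 in round 5
    demand = [4] * 4 + [8] * 16
    print(play_bdg(orders=[4] * 20, demand=demand))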
3 Experimental Design Problems
Instruments for experimental design can be classified into procedural regularity, motivation, unbiasedness, calibration and design parallelism, as discussed in the first section. In this section three aspects of the design of BDG-based experiments are discussed. First, the incentive structure as an item of motivation and unbiasedness is discussed. Then, problem presentation issues are treated and it is shown how they influence decision quality. Finally, the design parallelism point is addressed, showing the conflicting demands when using the BDG in an experimental economic setup.

3.1 Incentives
As described in the first section, financial rewards are generally used in economic experiments to induce a goal orientation in the subjects' behavior. In the case of the BDG, the overall goal is to minimize total transaction costs (total costs). Hence, the most general form for an intuitive payoff calculation is a function that is non-increasing in the total costs. Sterman (1989) uses a simple incentive scheme: all subjects pay $1 to a common pool and the winner team, i.e. the team with the lowest total costs, takes all. The other experiments mentioned in section 2 use two different kinds of incentive systems. The relative incentive system used by Croson and Donohue (2003) and Croson et al. (2004) calculates a subject's group payoff relative to the other groups' performance. Each subject of the best team with total cost $c_b$ receives a maximal payoff $\bar{p}$. Subjects of the worst team with total cost $c_w$ receive a minimal payoff $\underline{p}$. All other subjects receive a linearly interpolated payoff $p$ depending on their total group costs $c$:

$$p = \bar{p} - (\bar{p} - \underline{p}) \cdot \frac{c - c_b}{c_w - c_b}$$
Fig. 1. Two absolute payoff systems with different parameters.
Example values are $\underline{p} = \$5$ and $\bar{p} = \$25$ (cf. Croson et al. (2004)). A nonlinear variant of this payoff calculation is the tournament setting, where the groups are ranked and paid according to their relative performance. The "winner takes all" scheme of Sterman (1989) is somewhat similar to this tournament setting and represents an extreme variant of it. The second kind of payoff calculation applied to BDG experiments are absolute incentive systems: for every increase in total costs, the payoff decreases linearly. This type of incentive system is used in Gupta et al. (2002) and in experiment B conducted by the author. The absolute payoff systems are described by three parameters: the initial amount $\bar{p}$, the loss rate per cost unit $r$ and a minimal payoff $\underline{p}$. Depending on the group costs $c$, the payoff $p$ is given by $p = \max\{\bar{p} - r \cdot c;\ \underline{p}\}$. In Figure 1 two absolute payoff functions are depicted. The x-axis describes the transaction costs of the group and the y-axis the payoff to each of the group's subjects. Parameter values are $\bar{p} = 30$, $r = 0.01$ and $\underline{p} = 0$ for the first payoff system and $\bar{p} = 40$, $r = 0.02$ and $\underline{p} = 0$ for the second payoff system. The strength of the incentives is given by the loss rate $r$, and incentives are present while $c < (\bar{p} - \underline{p})/r$. For higher costs, incentives are lost as the payoff is always the minimal payoff $\underline{p}$. In comparison, both of the approaches have their drawbacks. The second approach with absolute payoff calculation runs the risk that incentives are lost. Especially, choosing the parameter value for the loss rate $r$ is a trade-off between incentive strength and incentive range. Examples of lost incentives are present in the data of experiment B: the standard deviation of all order decisions taken with existing incentives is 4.9 units (n=2635), in the case of lost incentives 28.35 units (n=203); the distributions are significantly different. The problem of losing incentives with an absolute payoff system is aggravated by the fact that the total costs of a BDG session are usually hard to predict, as the system's feedback loops induce nonlinear behavior. The first approach, i.e. relative payoff systems including the "winner takes all" scheme, does not have the problem of lost incentives, but another problem occurs. Imagine the "winner takes all" scheme: this payoff system leads to highly risky decision making as only the best group is going to earn money.
Furthermore, the problem is complicated by the fact that during the game, total costs are not public information. Hence, each subject has to guess whether he is in the best performing team or not. This (unobservable) guess also influences the risk of the decisions taken. As the system is hard to control, subjects tend to estimate their group performance as worse than it really is, thereby making more risky decisions. In similar ways, all relative payoff systems induce risk-loving behavior. This is extremely critical for the analysis of experiments, as the guess about the actual and possible payoff is not observable. Especially, a person ranking himself in the worst team with no chance to improve actually has no incentive to make reasonable decisions.
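The two payoff calculations can be summarized in two small functions. The sketch below is only illustrative; the parameter values in the usage lines are those quoted above for Fig. 1, while the cost values themselves are invented.

    def relative_payoff(c, c_best, c_worst, p_max, p_min):
        # Linear interpolation between the best and the worst group.
        return p_max - (p_max - p_min) * (c - c_best) / (c_worst - c_best)

    def absolute_payoff(c, p_max, r, p_min=0.0):
        # Payoff falls linearly with total cost and is floored at p_min.
        return max(p_max - r * c, p_min)

    # the two parameter settings of Fig. 1 and a made-up cost of 1500 units
    print(absolute_payoff(1500, p_max=30, r=0.01))   # payoff system 1
    print(absolute_payoff(1500, p_max=40, r=0.02))   # payoff system 2
    print(relative_payoff(800, c_best=500, c_worst=3000, p_max=25, p_min=5))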
3.2 Problem Presentation
The classical BDG is played by 4 players sitting around a table. All experimental studies, excluding the study of Sterman (1989), use computerized versions of the BDG as they allow a higher degree of control and avoid calculation errors. For the computerized version, different options exist to present the decision problem to the subjects. As the decision task consists of choosing one number each round, a round could be presented in two screens: in the first screen the actual decision task is presented. When all subjects have made a decision, the computer calculates the costs and the system state of the next round. An optional evaluation screen (second screen) might be presented to explain the calculations that have been done. In experiment A, just one screen was used for presenting the decision task. This was done because the instructions contained a complete example of the system state transition. In the data, the round time was recorded. The round time is the maximum of all decision times of the subjects and describes the time the subjects were able to see the screen. The average round time was 38.3 seconds with a standard deviation of 24.6 seconds. In experiment B, a preliminary test was made using a two-screen version. In the first screen the decision task was presented, while the second screen showed the system status of two consecutive rounds. Each of the screens had to be acknowledged by all players. From the experience of experiment A it was supposed that decision time per round would not change significantly. But the average round time for the decision screen was 56.4 seconds (σ = 23.7), while the average round time for the evaluation screen was 47.2 seconds (σ = 30.5). This extreme effect led to a change in the experimental setup, and experiment B was finally conducted with just one screen. The round times of experiment B are shown in Table 1. It is notable that round time differs significantly between treatment 2 (local information only) and treatment 1 (added shipment and stock information). Without an assumption about the underlying distribution, a Mann-Whitney U test is used to compare the distribution functions of the two treatments. The null hypothesis, stating that the distribution functions of the round times are the same for both treatments, could be rejected at a significance level of α = 1%.
Treatment               1      1      1      2      2      2
Session                 1      3      5      2      4      6
Avg. round time [sec]   54.1   52.7   53.4   33.1   38.1   39.8
Std. deviation          21.8   15.7   15.7   10.2   9.4    11.1

Table 1. Round times for experiment B
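The distributional comparison mentioned above can be carried out with scipy; the sketch below uses invented per-round decision times only to show the call, not the actual experiment B data.

    from scipy.stats import mannwhitneyu

    # illustrative per-round decision times in seconds (made-up values)
    times_treatment_1 = [54.1, 52.7, 53.4, 49.0, 58.2, 61.3]
    times_treatment_2 = [33.1, 38.1, 39.8, 35.0, 41.2, 30.6]
    u, p = mannwhitneyu(times_treatment_1, times_treatment_2,
                        alternative='two-sided')
    print(u, p)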
Several claims could be made from the presented observations. First, the problem presentation affects the decision time. Second, increasing financial rewards support longer decision times, as the comparison between experiments A and B suggests. Third, the best results in terms of total costs were obtained in the preliminary test of experiment B. Although this observation might also be explained by a change in the treatment that did not concern the presentation alone, an effect of the problem presentation on decision quality might exist.
3.3 Abstract vs. Concrete Situation
In economic experiments the use of concrete scenario descriptions is a critical issue. Citing Smith (1976), the experimenter may want to add "realism" by giving the abstract experimental commodity a name such as "wheat" [...]. This runs the danger of so enriching induced values that control over valuation is lost. How important is this aspect for complex decision problems? From an experimental economist's point of view, the importance of this statement does not change for complex settings. But research in business studies is very interested in gaining insights into decision making in complex, realistic situations. The external demand function of the BDG and the information provided about it serve very well to illustrate this twofold requirement. An economist would ask for a distribution function that could easily be explained to the subjects in an experiment. Obviously, if the parameter values of the external demand function are not communicated to the subjects, at least stochastic information should be provided. On the other hand, a business economist would not give this information to subjects. He would be interested in how subjects behave in a realistic, uncertain environment. In some BDG versions, scenario information is given beforehand and descriptions like "football world championship" are used to induce some expectations about the customers' beer demand. As Geyer-Schulz (1998) shows, the optimal decision policy of a nonlinear system such as the BDG depends heavily on the characteristics of the environment in which it is initialized. The external demand function is an important aspect of this characteristic. Concluding this section, the gain in complexity opens up a field of new interesting questions for experimental research. But these questions, like the classification of a given supply chain state into a scenario and the subsequent selection of an adequate order policy, today still lead to an experimental design that conflicts with the requirements of experimental economics.
4 Conclusion
Using complex decision problems such as the BDG in experimental economics is a promising approach to obtaining high quality data for analysis. On the other hand, it is difficult to apply standard experimental design procedures to this kind of experiment, as described in section 3.3. In the case of the BDG, incentives, problem presentation and abstractness are just some of the critical points. Results from existing experiments call for a profound analysis of these points. Acknowledgment. I gratefully thank the DFG funded Graduiertenkolleg GK 895 "Informationswirtschaft und Market Engineering" for financial support of my research.
References
CROSON, R. and DONOHUE, K. (2003): Impact of point of sale (POS) data sharing on supply-chain management: An experimental approach. Production and Operations Management, 12 (1), 1-11.
CROSON, R., DONOHUE, K., KATOK, E. and STERMAN, J. (2004): Order Stability in Supply Chains: Coordination Risk and the Role of Coordination Stock. Tech. rep., MIT, Engineering Systems Division, MA.
DAVIS, D. and HOLT, C. (1993): Experimental Economics. Princeton University Press, Princeton, New Jersey.
FORRESTER, J. W. (1961): Industrial Dynamics. MIT Press, Cambridge, MA.
GEYER-SCHULZ, A. (1995): Holland Classifier Systems. APL Quote Quad, 25 (4), 43-55.
GEYER-SCHULZ, A. (1998): Fuzzy Genetic Algorithms. In: H. T. Nguyen and M. Sugeno (eds.), Fuzzy Systems: Modeling and Control, The Handbooks of Fuzzy Sets Series, 403-460, Boston. Kluwer Academic Publ.
GUPTA, S., STECKEL, J. H. and BANERJI, A. (2002): Dynamic Decision Making in Marketing Channels: An Experimental Study of Cycle Time, Shared Information and Customer Demand Patterns. In: R. Zwick and A. Rapoport (eds.), Experimental Business Research, 21-47, Boston. Kluwer Academic Publ.
HOLLAND, J. H., HOLYOAK, K. J. and NISBETT, R. E. (1989): Induction: Processes of Inference, Learning and Discovery (Computational Models of Cognition and Perception). MIT Press, Cambridge, MA.
KAMINSKY, P. and LOVALLO, D. (1998): A new computerized beer game: A tool for teaching the value of integrated supply chain management. In: H. Lee and S. M. Ng (eds.), Global Supply Chain and Technology Management, POMS, 216-225.
SMITH, V. L. (1976): Experimental Economics: Induced Value Theory. The American Economic Review, 66 (2), 274-279.
STECKEL, J. H., GUPTA, S. and BANERJI, A. (2004): Supply Chain Decision Making: Will Shorter Cycle Times and Shared Point-of-Sale Information Necessarily Help? Man. Sci., 50 (4), 458-464.
STERMAN, J. D. (1989): Modeling Managerial Behavior: Misperceptions of Feedback in a Dynamic Decision Making Experiment. Man. Sci., 35 (3), 321-339.
Traffic Sensitivity of Long-term Regional Growth Forecasts
Wolfgang Polasek and Helmut Berrer
IHS, Institute for Advanced Studies, 1060 Vienna, Austria
Abstract. We estimate the sensitivity of the regional growth forecast in the year 2002 to expected changes in the travel time (TT) matrix. We use a dynamic panel model with spatial effects where the spatial dimension enters the explanatory variables in different ways. The spatial dimension is based on the geographical distance between 227 cells in central Europe and a travel time matrix based on average train travel times. The regressor variables are constructed from a) the average past growth rates, where the travel times are used as weights, b) the average travel times across all cells (made comparable by index construction), c) the gravity potential variables based on GDP per capita, employment, productivity and population, and d) dummy variables and other socio-demographic variables. We find that for the majority of the cells the relative differences in growth for the year 2020 are rather small. But there are differences as to how many regions will benefit from improved train networks: GDP, employment, and population forecasts respond differently.
1 Introduction
Long-term forecasting is a big challenge for regional modelling, since only a few years of panel data are available on a regional basis. Furthermore, traffic dependent models must be developed to explore the sensitivity of the socio-demographic variables of a region to travelling times. Based on a very sophisticated model choice procedure (BMA: Bayesian model averaging) for the entire regional model, we have additionally shortened the pool of variables to concentrate on demo-economic variables with traffic related backgrounds. We consider 2 types of forecasts and 2 railway travel time (TT) scenarios: scenario 1 assumes that all presently planned projects (i.e. for the decade 2000-2010) will be realized according to the national traffic plans. Scenario 2 assumes railway investments that will remove all currently known bottlenecks in the decade 2010 to 2020. Figure 1 shows the travel time reductions based on railway investment programs in 6 countries (Austria, Germany, Poland, Czech Republic, Slovakia and Hungary). They are based on the research work of an Interreg 3b project (SIC!) and are made available by the company BVU (www.bvu.de). From Figure 1 we see that the largest travel time reduction can be expected for the Czech regions (Liberec and Jihorosky), the Hungarian regions and for the Polish region Lodzkie. (Note that the minimum value in Figure 1 is 0.92, which indicates up to 8 % faster travel times.)
Fig. 1. The percent of travel time reduction between the two train TT Scenarios 1 (current state) and 2 (improved railway connections: “free train”).
Two types of forecasting methods were used: a) adjusted forecasts: growth in all regions of a country was restricted so that the average predicted growth was maintained in each country, and b) unadjusted forecasts: growth prediction without restrictions.
1.1 The Growth Model
The econometric model uses a dynamic data panel for the period 1995-2001 in 227 regions of 6 countries, where the focus region between Berlin and Budapest consists of NUTS-3 regions, while the regions outside are measured at NUTS-2 level. We use a Barro and Sala-i-Martin (1995) type growth regression model, where the convergence term is measured by the level of the dependent variable in the year 1995 (i.e. the first year of the present study). The dependent variables are the growth rates of the 3 focus variables: (real) regional GDP growth, the employment rate and the population growth rate. We started with a traditional spatial model with up to 6 nearest neighbours, but we soon found out that for traffic purposes the transformation to special regression variables has more explanatory power. These linear and non-linear transformations are possible in our case since we obtained travel time (TT) matrices for train and road networks between all 227 regions. In the BMA analysis all the newly created TT and traffic variables were selected more often than traditional spatial variables based on neighbourhood (contiguity) or distance (nearest neighbours). The following groups of explanatory variables were used in the forecasting model and in the preceding model choice procedure (BMA, see Raftery et al. 1997):
• Travelling times (TT) between the 227 regions for the year 2000 (in the matrix $TT_1$) and the year 2020 (in the matrix $TT_2$).
• Average travel times: a) average TT, b) weighted TT: with distance (Far index) and with inverse distance (Near index), c) harmonic means, d) speed averages.
• Accessibility indices: based on the TT on road and on train we calculated an index with minimum 0 and maximum 1. This was done either for the whole area (all) or with normalization within each country.
• Potential indices: based on the gravity formula of Newton, A·B/D, where A and B denote the variables for the region and destination cells, and D is a distance measure. The following variables were used: GDP, GDP per capita (pc), employment, population, productivity (GDP per worker).
• Infrastructure variables: a) the number of highway entrances per highway (Autobahn) km, b) the number of railway stations per rail km, c) the length of the highway net per square km and the length of the railway net per square km.
• TT adjusted growth rates: only past average weighted growth rates were calculated, where we used the train TT or the road TT as weights (a small construction sketch follows this list).
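As one way of making the construction of these regressors concrete, the sketch below builds an inverse-TT-weighted average of past growth rates and a Newton-type potential from a travel time matrix. The exact weighting and normalization used in the paper are not fully specified, so the functional forms and the toy data here are assumptions for illustration only.

    import numpy as np

    def inverse_tt_weighted_growth(growth, TT):
        # Average past growth of all other regions, weighted by inverse travel time.
        D = TT.astype(float).copy()
        np.fill_diagonal(D, np.inf)        # a region gets no weight on itself
        W = 1.0 / D
        W = W / W.sum(axis=1, keepdims=True)
        return W @ growth

    def gravity_potential(A, B, TT):
        # Newton-type potential: sum over destination cells of A_i * B_j / D_ij.
        D = TT.astype(float).copy()
        np.fill_diagonal(D, np.inf)        # exclude the own cell (assumption)
        return A * ((1.0 / D) @ B)

    # toy example with 3 regions
    TT = np.array([[0., 2., 4.], [2., 0., 3.], [4., 3., 0.]])
    past_growth = np.array([0.02, 0.01, 0.03])
    gdp = np.array([100., 80., 120.])
    print(inverse_tt_weighted_growth(past_growth, TT))
    print(gravity_potential(gdp, gdp, TT))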
1.2 The Sensitivity Analysis
For the sensitivity analysis we use the BMA method to select the best regressor variables using the Scenario 1 rail TT. With this model we iteratively calculate the future growth rates and the level of the variable until the year 2020. (Note that the model is specified in a causal way, i.e. no contemporaneous regressor variables are allowed.) The alternative forecasts for Scenario 2 are calculated in the same way. Finally, we compare both forecasts for the year 2020 and calculate the difference as a percentage of the Scenario 1 forecast.
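A minimal sketch of this scenario comparison is given below. The growth equation used here is an invented stand-in (a convergence term plus an inverse-TT-weighted neighbourhood term) with made-up coefficients and data; it only illustrates how the two TT matrices are pushed through the same fitted model year by year and how the relative difference of the 2020 levels is formed.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5                                          # tiny illustrative set of regions
    gdp_2001 = rng.uniform(10.0, 20.0, n)
    TT1 = rng.uniform(1.0, 5.0, (n, n)); TT1 = (TT1 + TT1.T) / 2
    np.fill_diagonal(TT1, np.inf)
    TT2 = 0.92 * TT1                               # up to 8 % faster travel times

    def growth(level, TT):
        # Stand-in for the fitted BMA growth equation; coefficients are made up.
        W = 1.0 / TT
        W = W / W.sum(axis=1, keepdims=True)
        return 0.02 - 0.005 * np.log(level) + 0.3 * (W @ np.log(level) - np.log(level))

    def forecast(level, TT, horizon=19):
        for _ in range(horizon):                   # iterate year by year until 2020
            level = level * (1.0 + growth(level, TT))
        return level

    f1, f2 = forecast(gdp_2001, TT1), forecast(gdp_2001, TT2)
    diff = (f1 - f2) / f1                          # relative difference as in section 2.1
    print(diff)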
1.3 Caveats
To make the results of the sensitivity analysis visible we have employed a graphical visualisation technique for the 227 regions. The advantage is that a large amount of data information can be understood faster than by studying tables, but the disadvantage is that graphics stir up many more questions of the type: Why do we see these differences? Thus, we have to warn the reader that not all of these questions can be answered satisfactorily. Some will be due to bad observations or data quality, some due to misfits of the model and some will be just unexplainable. We have followed the rule that the graph in total gives a sensible picture to justify our modelling approach. Furthermore, we want to emphasize that we focus on a regional model where the regressor selection was done in such a way as to maximize the influence of train TT. This approach was chosen since it was clear that traffic impacts on growth, especially for train TT, will generally be small. Thus, an "optimal regional growth model" will probably give slightly different results; so will a model that is based solely on road TT travel times or on both. (Note that the interaction between the road TT and train TT times also needs some special studies.) Therefore, we recommend regarding our study as a magnifying glass of train TT on regional growth patterns, while the other (observed and non-observed) factors are more or less kept constant.

Fig. 2. Sensitivity analysis for the adjusted model: the differences between GDP levels for 2020 are computed in percent. The majority of regions will only see a slight positive train TT effect.
2 The Traffic Dependent GDP Growth Model
2.1 The Travel Time Induced GDP Forecasts
The results of the sensitivity analysis are shown in Figure 2 for the scenario "free trains" (i.e. no major railway bottlenecks), given by the matrix $TT_2$, in comparison with the present (planned and realized 2000-2010) rail travel times, given by the matrix $TT_1$. We have plotted the relative change of the GDP levels for 2020 based on the 2 train travel time matrices, i.e. according to the formula: $\mathrm{Diff}_{GDP} = (GDP2020(TT_1) - GDP2020(TT_2))/GDP2020(TT_1)$. The most positive changes in regional GDP can be seen for Jena (Eastern Germany) and the regions of the Czech Republic bordering Germany (e.g. Karlovarsky), but also the Moravian regions (Moravskoslezsky) bordering Poland. The largest negative growth impulse can be seen for the SW Hungarian region Zala, which is peripheral within Hungary and may lose growth to regions closer to Budapest. Also some peripheral regions in
Table 1. The top and low GDP growth differences

a) From the adjusted model
  Zala                -0.036     Jena              0.022
  Praha               -0.016     Lodzkie           0.022
  Szczecinski         -0.014     Zlinsky           0.026
  Nowosadecki         -0.014     Karlovarsky       0.027
  Podkarpackie        -0.013     Moravskoslezsky   0.028
  Kujawsko-Pomorskie  -0.013     Liberecky         0.046

b) From the unadjusted model
  Zala                -0.059     Oberwart         -0.003
  Praha               -0.037     Vysocina          0
  Stredocesky         -0.032     Jena              0
  Pest                -0.027     Zlinsky           0.004
  Dresden             -0.027     Wielkopolskie     0.005
  Vas                 -0.027     Karlovarsky       0.005
  Cottbus             -0.027     Moravskoslezsky   0.006
  Gyor-Moson-Sopron   -0.027     Mazowieckie       0.013
  Del-Dunantul        -0.027     Lodzkie           0.024
  Frankfurt (Oder)    -0.027     Liberecky         0.024
Poland (Szczecinski, Nowosadecki) might slightly suffer due to the lack of train TT improvements. Most German regions are not affected, and in Austria only those regions that border Germany are above zero growth. Of all 227 regions there were 86 regions with negative growth, 23 with zero growth and 118 with positive growth effects. From Table 1 we see the top and low ten regions with traffic related growth differences from the unadjusted model. Surprisingly, we see well known larger cities like Prague, Dresden, Frankfurt (Oder), Pest and Győr. Note that we see from the top 10 list that only 7 regions have a positive traffic impact: 3 from Poland and 4 from the Czech Republic. Of all 227 regions there were 218 regions with negative growth, 2 with zero growth and 7 with positive growth effects. Note that the results of Figure 3 are rather pessimistic with respect to train TT changes. This might be a consequence of the declining GDP growth rates during the observation period, which leads to depressed long-term forecasts.

Fig. 3. Sensitivity analysis for the unadjusted model: GDP growth sensitivities. Only a few regions will benefit from improved TT.

From Table 2 we see that the BMA estimate for the constant is not significant, and the Slovakia dummy variable is the only fixed effect, which is negative (-2.1%). That means Slovakia started from a lower base of GDP growth than the other countries. The convergence effect for the log GDP level is negative, but the level effect of (log) population is positive. The coefficients on the POP and EMPL growth rates are both positive and between 0.29 and 0.39: this implies that a 3 % growth rate in either employment or population will result in a 1 % larger GDP growth rate. Three out of the 5 inverse TT weighted past GDP growth rates are negative, all of them rail TT effects. The sum of these effects is -2.2, which shows a strong negative time-dynamic component that was observed for GDP growth in the late 1990s. The variable TT.far.train, the long distance weighted TT for railways, and the accessibility index based on train TT have a positive influence and might be interpreted as good transportation proxy variables. All potential variables have a positive effect, and all are based on rail TT. A significant potential effect is found for the change of GDP per capita, for productivity changes (GDP/employment), and for the employment potential.
3 Conclusions
We have shown in this paper that the regional growth rates of GDP, EMPL and population can be explained to a large degree by traffic dependent spatial or time series variables. The dynamic panel model estimated by BMA allows sensible long-term predictions of these variables and also a traffic related sensitivity analysis. We see that the traffic scenario "free train", i.e. a removal of all bottlenecks of the current (year 2000) rail system in central Europe, will benefit more regions than it harms. This growth scenario will also hold if we impose the restriction that the growth rates are regional reallocations within each of the 6 countries. These growth rates are mostly in the range of +/- 2% of the GDP level in the year 2020. This sensitivity analysis is also valid for the adjusted (i.e. country restricted regional growth) and unadjusted (i.e. unrestricted regional growth) model. Rail TT improvements will only benefit a few regions in the new accession states. From Table 3 we see that in the
Dependent Variable = Lgdp.01.95.6    R-squared = 0.886    sigma2 = 0
Nobs = 227    Nvars = 20    ndraws = 25000    nu = 4.000    lam = 0.250    phi = 3.000
# of models = 2249    time (seconds) = 438.891

Posterior Estimates
Variable                           Coefficient   t-statistic   t-probability
const                              -0.017        -0.9          0.35
Lgdp.1995                          -0.011        -8.4          0
Lgdp.giTT.rail.96                  -2.289        -5.5          0
Lgdp.giTT.rail.97                  -0.024         0            0.98
Lgdp.giTT.rail.98                   0.059         0.3          0.74
Lgdp.giTT.rail.99                  -0.003         0            1
Lgdp.giTT.rail.00                   0.086         0.3          0.76
Lpop.95                             0.009         7.6          0
Lempl.00.95                         0.388         7.7          0
Lpop.00.95                          0.289         4.2          0
nodes.per.highway.km                0.015         2.9          0
TT.train.far                        0.176/1000   11.7          0
err.all.bahn.dist.avg               0.048        12.2          0
potential.gdp.empl.00.95.rail       0.123         9            0
potential.all.empl.95.rail          0.015         5.4          0
potential.all.gdp.cap.00.95.rail    0.153        11.3          0
d.aut                               0             0            0.96
d.sk                               -0.021        -7.2          0
d.hu                                0             0            0.97
d.ger                               0            -0.2          0.81
d.p                                -0.001        -0.4          0.71

Table 2. BMA estimates for the traffic related GDP% model
adjusted model we can expect positive GDP effects from train TT for more than 50 % of the regions. Positive employment effects can be expected a little less often (i.e. for 47 % of the regions), and the lowest train TT effects can be expected for population growth: just every 4th region, or 28 % of the regions, will benefit.
adjusted model      GDP    EMPL   POP
  negative          0.38   0.42   0.62
  zero              0.1    0.11   0.1
  positive          0.52   0.47   0.28

unadjusted model    GDP    EMPL   POP
  negative          0.96   0.13   0.43
  zero              0.01   0.04   0.12
  positive          0.03   0.84   0.45

Table 3. Summary of TT scenario 2
References
BARRO, R. and SALA-I-MARTIN, X. (1992): Convergence. Journal of Political Economy, 100, 223-251.
BARRO, R. and SALA-I-MARTIN, X. (1995): Economic Growth. McGraw Hill, New York.
BRUNOW, ST. and HIRTE, G. (2005): Age Structure and Regional Income Growth. TU Dresden, Discussion paper Verkehr 1/2005.
GEWEKE, J. (1993): Bayesian Treatment of the Independent Student-t Linear Model. J. of Applied Econometrics, 8 Suppl., S19-S40.
LESAGE, J. P. (1997): Bayesian Estimation of Spatial Autoregressive Models. International Regional Science Review, 20, 113-129.
LESAGE, J. (1998): Spatial Econometrics. Manuscript and Function Library, http://www.spatial-econometrics.com/html/wbook.pdf
LESAGE, J. and KELLEY PACE, R. (2002): Using Matrix Exponentials to Explore Spatial Structure in Regression Relationships. Mimeo, Univ. of Toledo.
LESAGE, J. P. and KRIVELYOVA, A. (1999): A Spatial Prior for Bayesian Vector Autoregressive Models. Journal of Regional Science, 39 (2), 297-317.
POLASEK, W. and BERRER, H. (2005): Infrastructure and GDP Growth in Central European Regions. IHS Vienna, mimeo.
RAFTERY, A. E., MADIGAN, D., and HOETING, J. A. (1997): Bayesian Model Averaging for Linear Regression Models. Journal of the American Statistical Association, 92, 179-191.
Spiralling in BTA Deep-hole Drilling: Models of Varying Frequencies
Nils Raabe, Oliver Webber, Winfried Theis, and Claus Weihs
University of Dortmund, Department of Statistics, 44221 Dortmund, Germany
Abstract. One serious problem in deep-hole drilling is the formation of a dynamic disturbance called spiralling, which causes holes with several lobes. Since such lobes are a severe impairment of the bore hole, the formation of spiralling has to be prevented. Gessesse et al. (1994) explain spiralling by the coincidence of bending modes and multiples of the rotary frequency. This they derive from an elaborate finite elements model of the process. In online measurements we detected slowly changing frequency patterns similar to those calculated by Gessesse et al. We therefore propose a method to estimate the parameters determining the change of frequencies over time from spectrogram data. This significantly simplifies the practical use of the explanation of spiralling, because the finite elements model has to be correctly modified for each machine and tool assembly, while the statistical method uses observable measurements. Estimating the variation of the frequencies as well as possible opens up the opportunity to prevent spiralling by, e.g., changing the rotary frequency.
1 Introduction
The work presented in this paper has been carried out as part of a project aimed at modelling the BTA deep hole drilling process, with special emphasis on dynamic aspects. The long-term goal is online prediction of dynamic disturbances, which in future may be used as a basis for intelligent control of the process. Deep hole drilling methods are used for producing holes with a high length-to-diameter ratio, good surface finish and straightness. For drilling holes with a diameter of 20 mm and above, the BTA (Boring and Trepanning Association) deep hole machining principle is usually employed (see VDI, 1974). The working principle is illustrated in Fig. 1. To obtain a low deviation of the bore hole center axis from the ideal straight line, which is an important objective when machining holes with a high length-to-diameter ratio, deep hole drilling tools use the bore hole wall section machined in the immediate past as a guiding surface. This is achieved by an asymmetric cutting edge arrangement in combination with guiding pads on
This work has been supported by the Deutsche Forschungsgemeinschaft, Sonderforschungsbereich 475.
Fig. 1. BTA deep hole drilling, working principle (VDI, 1974).
Fig. 2. Radial chatter marks on the bottom of the bore hole (left) and effects of spiralling on the bore hole wall (right).
the circumference of the tool. The high surface finish of the bore hole wall is a side effect of the guiding action. When drilling with standard twist drills, chip removal becomes more and more unreliable with increasing drilling depth. This sooner or later leads to process failure. To solve this problem, deep hole drilling tools feature forced chip removal through high cooling lubricant flow rates via low restriction passages. In the case of BTA deep hole drilling, oil is supplied around the outside of the boring bar and the chips are transported away through the internal volume of the tube. Machining of bore holes with a high length to diameter ratio necessitates slender tool-boring bar assemblies. These components therefore have low dynamic stiffness properties which in turn can be the cause of dynamic disturbances such as chatter vibration and spiralling. Whereas chatter mainly leads to increased tool wear along with marks on the generally discarded bottom of the bore hole, spiralling causes a multi-lobe shaped deviation of the cross section of the hole from absolute roundness often constituting a significant impairment of the workpiece. The effects of these disturbances on the workpiece can be seen in Fig. 2. As the deep drilling process is often used during the last production phases of expensive workpieces, process reliability is of prime importance. To achieve an optimal process design with the aim of reducing the risk of workpiece damage, a detailed analysis of the process dynamics is necessary. In this paper we focus on spiralling which can be observed to occur either reproducibly at a certain drilling depth and fixed machining parameters or
Fig. 3. Components of the discretized analogous model.
at random drilling depths. Gessesse et al. (1994) have modelled the process with finite elements and derived from this model that a reason for the reproducible occurrence of spiralling is the intersection of changing bending modes and odd multiples of the rotational frequency. They have shown in some experiments that this actually is a good predictor of spiralling. We observed the movement of the bending modes in online measurements of the bending moment of the boring bar and in measurements of the lateral acceleration of the boring bar. In Raabe et al. (2004) we estimated the time-variation of bending eigenfrequencies by quadratic regression of spectrogram data on time. Here we propose a method to estimate the course of the frequencies based on the measurements in the framework of a mechanical model. This paper continues with a description of the mechanical model in section 2. Section 3 then introduces our criterion for parameter selection, before some results are presented in section 4. The paper concludes with a summary in section 5.
2 Mechanical Model
To express the connection between the machine parameters and the time-variation of the bending eigenfrequencies (abbreviated BEF) from a mechanical point of view, we propose a discretized analogous model (see Gross et al. (2002a)). For this purpose we reduce the BTA system to its most important components (see Fig. 3). The black dots in Fig. 3 indicate that for our model we subdivide the bar into n segments – called elements – of equal length l = L/n, mass m = M/n and stiffness k = K · l², where L, M and K denote the corresponding values of the whole bar. The number n is called the number of degrees of freedom. The stiffness influences of the damper, the oil supply and the workpiece on the boring bar are called $k_{supp}$, $k_{seal}$ and $k_{end}$. In contrast to l, m and k, the latter three parameters are not known. With these terms the equation of motion of the system can be expressed by: $[M]\{\ddot{x}\} + [K]\{x\} = \{0\}$
Fig. 4. (a) Example of a bending eigenmode. (b) Example of a changing BEM during the drilling process. The position s of the seal changes from value 9 at the beginning (cp. (a)) to 6 while the positions d and n of the influences of damper and workpiece stay the same.
with the mass matrix $[M]_{n\times n} = m \cdot I_{n\times n}$ and the stiffness matrix

$$[K]_{n\times n} = \frac{k}{l^2}
\begin{bmatrix}
 6 & -4 &  1 &    &        &        &    &    \\
-4 &  6 & -4 &  1 &        &        &    &    \\
 1 & -4 &  6 & -4 &  1     &        &    &    \\
   &  1 & -4 &  6 & -4     &  1     &    &    \\
   &    & \ddots & \ddots & \ddots & \ddots & \ddots &    \\
   &    &    &  1 & -4     &  6     & -4 &  1 \\
   &    &    &    &  1     & -4     &  5 & -2 \\
   &    &    &    &        &  1     & -2 &  1
\end{bmatrix}
+ k_{supp}\{e_d\}\{e_d\}' + k_{seal}\{e_s\}\{e_s\}' + k_{end}\{e_n\}\{e_n\}', \qquad (1)$$
where d and s represent the numbers of the elements the damper and the oil supply device, respectively, rest on, counted from the left. Determining the BEFs $\omega$ and bending eigenmodes (BEMs) $\hat{x}$ of the system now means solving the following eigenvalue problem (compare Gross et al. (2002b)):
$$\left([K] - \omega^2 [M]\right)\{\hat{x}\}\, e^{i\omega t} = \{0\}. \qquad (2)$$
A BEM is the shape with which the bar oscillates at the corresponding BEF. Each BEM is represented by the vector $\hat{x}$ containing the deviations from the baseline in x-direction for each segment end (compare Fig. 4 (a)). The time-variation of the BEMs and BEFs now becomes clear when looking at what happens during the drilling process. The boring bar is fixed on the left side, and when the process starts the workpiece is rotated and moved towards the bar. While the damper always stays at the same position d, the seal of the oil supply moves in front of the workpiece at the same speed (see Fig. 4 (b)). So s decreases and the stiffness matrix [K] changes. Note that even though the workpiece also moves, $k_{end}$ is always added to the nth element of the "base" matrix in the first row of the definition in Eq. 1. This is because the workpiece always affects the end of the bar. As mentioned above, the stiffness parameters $k_{supp}$, $k_{seal}$ and $k_{end}$ are not known. But in early experiments it turned out that the higher the value chosen for $k_{end}$ – i.e. the end of the bar can hardly oscillate – the better is the model fit. So $k_{end}$ is from now on fixed to a high value ($10^{17}$ N/m) and the remaining free parameters are $k_{supp}$ and $k_{seal}$.

Fig. 5. (a) Spectrogram of the acceleration signal. (b) Computed course of bending eigenfrequencies.
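Under the assumption that the element positions d, s and n are 1-based indices counted from the clamped end, Eq. 1 and Eq. 2 can be solved numerically as a generalized eigenvalue problem; the sketch below does this with scipy. The bar stiffness K in the usage line is a placeholder, since its value is not reported in the text.

    import numpy as np
    from scipy.linalg import eigh

    def bending_eigenfrequencies(n, L, M_total, K, s, d, k_supp, k_seal, k_end):
        # Build [M] and [K] of Eq. 1 and solve the eigenvalue problem of Eq. 2.
        l = L / n
        m = M_total / n
        k = K * l**2
        Kmat = np.zeros((n, n))
        for i in range(n):                       # banded base matrix
            for j, val in zip(range(i - 2, i + 3), (1.0, -4.0, 6.0, -4.0, 1.0)):
                if 0 <= j < n:
                    Kmat[i, j] = val
        Kmat[n - 2, n - 2] = 5.0                 # free-end rows as in Eq. 1
        Kmat[n - 2, n - 1] = -2.0
        Kmat[n - 1, n - 2] = -2.0
        Kmat[n - 1, n - 1] = 1.0
        Kmat *= k / l**2
        for pos, kk in ((d, k_supp), (s, k_seal), (n, k_end)):
            Kmat[pos - 1, pos - 1] += kk         # damper, seal and workpiece influence
        Mmat = m * np.eye(n)
        w2, modes = eigh(Kmat, Mmat)             # generalized eigenvalue problem
        bef = np.sqrt(np.abs(w2)) / (2 * np.pi)  # convert angular frequency to Hz
        return bef, modes

    # illustrative call; K = 1e6 is a placeholder value, not taken from the paper
    bef, modes = bending_eigenfrequencies(n=100, L=3.34, M_total=26.0, K=1e6,
                                          s=60, d=20, k_supp=3.51e9,
                                          k_seal=1.053e7, k_end=1e17)
    print(bef[:4])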
3 Criterion
In this section we describe how the two free parameters are estimated. For the fitting we use the data obtained from the acceleration sensor. This sensor is placed between the damper and the final position of the seal. For this reason we expect only those BEFs to be prominent in this signal whose corresponding BEMs have a bulge at this position. Computing BEFs and BEMs for a variety of plausible stiffness parameters shows that these are the second and the third BEF, where by definition the lowest BEF is the first one. Higher BEFs are neglected since the corresponding amplitudes decrease with higher frequencies. Figure 5 (a) shows the spectrogram of the acceleration signal of one of our experiments. Comparing it to the course of the first four BEFs (Fig. 5 (b)), computed by solving Eq. 2 with L = 334 cm and M = 26 kg (known from our settings), some example parameters $k_{supp} = 3.51 \cdot 10^9$ N/m and $k_{seal} = 1.053 \cdot 10^7$ N/m, and n = 334, some similarities can be seen. We recognize the (mirrored) U-shape of the third (second) BEF. We also see that the first and fourth computed BEFs hardly change. Looking at the corresponding BEMs, one would see that they have a bulge between the damper and the clamped end. Because the damper is expected to have a quite high stiffness, moving the seal does not affect this mode. On the other hand, the bar hardly oscillates with these two frequencies in the area where the signal is measured. So they are not reflected in the spectrogram. However, in the spectrogram there seem to exist some time-constant frequencies with high amplitudes, for example one at about 60 Hz. But for the
Fig. 6. (a) Spectrogram with computed bending eigenfrequencies. (b) Weighting scheme; notice the bandwidth v.
given reasons we do not assume them to reflect BEFs. It is more plausible that they are due to the machine drive and so are negligible for our investigations. So our choice of a criterion for any given $\hat{k}_{supp}$ and $\hat{k}_{seal}$ is based on the concordance of the course of the resulting second and third BEF with the spectrogram. Its construction is described in the following. Let $\hat{\omega}_j(t) := \omega_j(t; \hat{k}_{supp}, \hat{k}_{seal})$, $j = 2, 3$, be the two interesting BEF courses. Remember that these courses are step functions (cp. Fig. 6 (a), for better illustration with a smaller value of n). Let now $f_{c_j(t)}$, $c_j(t) := \operatorname{argmin}_i\left(|f_i - \hat{\omega}_j(t)|\right)$, $j = 2, 3$, be the Fourier frequencies next to the computed BEFs at each time. Then the criterion to be maximized is

$$m\left(\hat{k}_{supp}, \hat{k}_{seal}\right) := \frac{1}{2}\sum_{j=2}^{3}\frac{1}{\#T}\sum_{t\in T}\sum_{i=-\frac{v}{2}}^{\frac{v}{2}} w_i\,\sqrt[4]{a_{t;c_j+i}}\,,$$

where $a_{t;k}$ denotes the amplitude of Fourier frequency k at time t, T is the set of all #T time points periodograms are computed for, v is a pre-defined even bandwidth parameter and $w_i := \frac{2v-4|i|}{v^2}$ are linear weights as illustrated in Fig. 6 (b). The aim underlying the construction of m is to prefer a choice of $k_{supp}$ and $k_{seal}$ which leads to BEFs meeting the area of high amplitudes as well as possible. The amplitudes are transformed as a consequence of the fact that the periodogram ordinates of white noise are $\chi^2$-distributed (see Theis (2003)). In this case, by taking the 4th root a symmetric distribution is obtained.
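The criterion can be evaluated directly on a spectrogram array. The sketch below assumes that the BEF courses for j = 2, 3 have already been computed (e.g. with the eigenvalue sketch of section 2, using the seal position of each time point). Clipping the frequency window at the spectrum edges is our assumption rather than a detail given in the text, and the toy spectrogram at the end is random data.

    import numpy as np

    def criterion_m(amplitudes, freqs, omega_hat, v):
        # amplitudes: (#T, #freqs) array of periodogram amplitudes a_{t;k}
        # freqs:      Fourier frequencies f_i
        # omega_hat:  dict {2: ..., 3: ...} with the computed BEF course per time point
        # v:          even bandwidth parameter
        half = v // 2
        i = np.arange(-half, half + 1)
        w = (2 * v - 4 * np.abs(i)) / v**2            # linear weights w_i
        a4 = amplitudes ** 0.25                       # 4th-root transform
        n_t, n_f = a4.shape
        total = 0.0
        for j in (2, 3):
            for t in range(n_t):
                c = int(np.argmin(np.abs(freqs - omega_hat[j][t])))
                idx = np.clip(c + i, 0, n_f - 1)      # clip at the spectrum edges (assumption)
                total += np.sum(w * a4[t, idx]) / n_t
        return total / 2.0

    # toy spectrogram: 4 time points, 64 Fourier frequencies
    rng = np.random.default_rng(1)
    freqs = np.linspace(0.0, 500.0, 64)
    amps = rng.chisquare(2, size=(4, 64))
    omega_hat = {2: np.full(4, 120.0), 3: np.full(4, 310.0)}
    print(criterion_m(amps, freqs, omega_hat, v=10))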
Fig. 7. Fitted bending eigenfrequency courses.
4 Results
Figure 7 shows an example fit after maximizing m with the optimization method of Nelder and Mead (1965). The settings are the same as in the previous section; v = 50 was chosen for the bandwidth. Some experiments with different choices of v between 15 and 100 showed that this parameter does not seem to affect the results. The optimal parameters are $\hat{k}_{supp} = 2.252 \cdot 10^8$ N/m and $\hat{k}_{seal} = 1.037 \cdot 10^7$ N/m. The levels of these values are plausible from a technical point of view. Also, their relation to each other is as expected, since the damper is known to have a much higher stiffness influence than the seal. Especially the higher, third BEF fits the spectrogram quite well. The situation is a little different for the second one. Here the computed course seems to border the lower area of high amplitudes from above. It turned out that varying the parameters towards a just slightly better fit of the second BEF leads to a very high loss of fit of the third BEF. The reason for the apparently unfavorable fit of the lower course of BEFs could be the non-consideration of an important parameter. For example, in some of our experiments we observed that the damper left its initial position d during the process. Since d is a very sensitive parameter in our model, its time-variation should be taken into account in further investigations. Anyway, we consider our assumption of seeing the bending eigenfrequencies in the acceleration signal confirmed.
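The maximization of m can be done with scipy's Nelder-Mead implementation. Because the two stiffnesses span several orders of magnitude, the sketch below searches on a log10 scale, which is our choice and not necessarily the authors'; the quadratic toy objective merely stands in for -m, whose evaluation would in practice recompute the BEF courses from Eq. 2 and call the criterion sketch above for every trial pair.

    import numpy as np
    from scipy.optimize import minimize

    def neg_objective(logx):
        # stand-in for -m(10**logx[0], 10**logx[1]); the minimum is placed near
        # the magnitudes reported in the text purely for illustration
        ls, lk = logx
        return (ls - 8.35) ** 2 + (lk - 7.0) ** 2

    res = minimize(neg_objective, x0=[9.0, 7.0], method='Nelder-Mead')
    k_supp_hat, k_seal_hat = 10 ** res.x
    print(k_supp_hat, k_seal_hat)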
5 Summary and Outlook
With the discretized analogous model we established a connection between measured data and the mechanics of the BTA system. The stiffness influences of the damper and the oil supply device have been identified as the most important factors. Although further improvements are possible, we think the
signal of the acceleration sensor is appropriate to estimate the time-variation of the BEFs. One item of future work will be the consideration of further parameters like a variable damper or the implementation of a frequency response function. Improvements are also possible for the estimation procedure by taking the distribution of the periodograms into account. The main goal of our investigations is the prevention of spiralling. So once the estimation of the BEFs is established, the next step will be online estimation and, connected to it, the implementation of control charts. Then, when one BEF is in danger of meeting the rotational frequency or one of its odd multiples, the rotational frequency may be varied to avoid the intersection.
References
GESSESSE, Y.B., LATINOVIC, V.N., and OSMAN, M.O.M. (1994): On the problem of spiralling in BTA deep-hole machining. Transactions of the ASME, Journal of Engineering for Industry, 116, 161-165.
GROSS, D., HAUGER, W., and SCHNELL, W. (2002a): Technische Mechanik Band 3: Kinetik, 7. ed., Springer, Berlin.
GROSS, D., HAUGER, W., SCHNELL, W., and WRIGGERS, P. (2002b): Technische Mechanik Band 4: Hydromechanik, Elemente der höheren Mechanik, Numerische Methoden, 4. ed., Springer, Berlin.
NELDER, J.A. and MEAD, R. (1965): A Simplex Method for Function Minimization. Computer Journal, 7, 308-313.
RAABE, N., THEIS, W., and WEBBER, O. (2004): Spiralling in BTA Deep-Hole Drilling - How to Model Varying Frequencies. Conference CD of the Fourth Annual Meeting of ENBIS 2004, Copenhagen.
THEIS, W. (2004): Modelling Varying Amplitudes. Dissertation, Fachbereich Statistik, Universität Dortmund, http://eldorado.unidortmund.de:8080/FB5/ls7/forschung/2004/Theis
VDI (1974): Tiefbohrverfahren. VDI, Düsseldorf.
Analysis of the Economic Development of Districts in Poland as a Basis for the Framing of Regional Policies
Monika Rozkrut and Dominik Rozkrut
Department of Statistics and Econometrics, University of Szczecin, ul. Mickiewicza 64, 71-101 Szczecin, Poland
e-mail: [email protected], [email protected]
Abstract. In 2004 six major socio-economic regions according to the Nomenclature of Territorial Units for Statistics (NUTS) were established. The so-called NUTS1 regions were formed as groups of voivodships (NUTS2 regions). Some of the authorities expressed their concern that joining better developed regions with less developed ones in one region might raise the GDP per capita statistic of the region, thus reducing the expected flows of structural funds. In the paper we try to verify these concerns. We study the level of economic development of all districts in Poland, and its spatial diversification, using the methods of synthetic taxonomic indexes of development and clustering methods.
1 Introduction
In 2004 a governmental commission decided how to divide Poland into six major socio-economic regions according to the Nomenclature of Territorial Units for Statistics (NUTS). The main administrative regions in Poland are called voivodships. The country is sub-divided into 16 such regions. Because NUTS1 regions are required to range in population from 3 to 7 million, voivodships are too small to become first level regions, so the NUTS1 regions were formed as groups of voivodships (which in turn became NUTS2 regions). This is illustrated in Figure 1, where NUTS1 regions are indicated by different textures, and the smaller NUTS2 regions by borders. In Figure 3 the NUTS3 regions, i.e. districts, are presented. NUTS is a multi-level hierarchical classification, created for the framing of Community regional policies. NUTS1 regions are used for analyzing regional Community problems and for the purposes of appraisal of eligibility for aid from the Structural Funds. Poland expects that all of its regions will be eligible for structural funds, which are directed to regions in which GDP per capita does not exceed the level of 75% of the average for the whole European Union (so called "Objective 1" funds). During the consultations, voivodships whose development was lagging behind expressed their concerns about being joined with wealthier voivodships in one region, which might statistically improve their socio-economic situation, raising the GDP per capita statistic of the region. This in turn would reduce the expected flows of structural funds. In the paper we try to verify these concerns. We study the level of economic development of
Fig. 1. NUTS1 (and NUTS2) regions in Poland.
all districts in Poland in the year 2003, and its spatial diversification, using the methods of synthetic taxonomic indexes of development and clustering methods.
2 Taxonomic Measure of Development
The method of the taxonomic measure allows linear ordering of districts, measurement of their relative development and clustering into groups with a similar level of development. The method used here is a modification of the one proposed by Hellwig (1968). The idea of the method may be described shortly as follows. First, a set of variables is collected which describes different aspects of the economic development of the analyzed objects. Some expertise is needed in that part to make proper choices. The set may sometimes be partly determined by some specific aims of the study; for example, the researcher may intentionally put more stress on some aspects by choosing particular aggregates. After that, the variables are divided into three major groups. The first group consists of variables for which higher values are connected with a higher level of development (the higher the values, the better the situation). The second group is built up of variables with the opposite interpretation. The third group is composed of variables with certain preferable values (or ranges), neither higher nor lower. Variables from the second and the third group have to be transformed so as to have the same property as variables from the first group (called stimulants), i.e. to be positively related to the level of development. This is quite obvious in the case of the second group, and there are also many proposals for how to transform the variables from the third group. After transforming all variables to stimulants, their values are normalized. Next, a paragon virtual object is constructed as an object taking the best value of each variable, and the distances between this paragon object and all other objects are calculated. The bigger the distance from the paragon object, the worse the development of the analyzed one. Usually the differences themselves are also normalized so as to take values from the range [0, 1], or by some other formulas, in such a way that a higher value means a higher level of development. In the paper we also use some other methods which are well known, so we leave them without description. These are the methods of tree clustering (with Ward's amalgamation method) and k-means clustering.

Fig. 2. Taxonomic measure of development in districts.
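A minimal sketch of the simplified measure described above (destimulants flipped by subtracting from the column maximum, normalization by column maxima, aggregation by the arithmetic mean) could look as follows; the small data matrix is invented and only illustrates the call.

    import numpy as np

    def taxonomic_measure(X, stimulant):
        # X: (objects x variables) array; stimulant: True where higher is better.
        X = np.asarray(X, dtype=float).copy()
        for j in range(X.shape[1]):
            if not stimulant[j]:
                X[:, j] = X[:, j].max() - X[:, j]   # turn destimulant into stimulant
        Z = X / X.max(axis=0)                       # normalize by column maxima
        return Z.mean(axis=1)                       # arithmetic mean = closeness to the paragon

    X = np.array([[350., 45., 12.],
                  [420., 60., 30.],
                  [300., 52.,  8.]])                # made-up district indicators
    print(taxonomic_measure(X, stimulant=[True, True, True]))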
3 Empirical Study
The method described above was used to assess the relative level of development in all districts of Poland (NUTS3 regions). These basic results were used to analyze the economic development of NUTS1 and NUTS2 regions, mainly to evaluate their diversification. The taxonomic measure of development was based on the following variables:
• number of dwellings per 1000 inhabitants,
• number of firms per 1000 inhabitants,
• number of beds in hotels (and similar) per 1000 inhabitants,
• level of employment (number of employed divided by population),
• ratio of employed in industry,
• ratio of employed in market services,
• ratio of labor force working in the private sector,
• budget revenues of districts per capita.
All variables, except unemployment, are positively related to development; hence unemployment was transformed into a stimulant variable by subtracting each observation's value from the maximum.
After that, all variables were normalized by dividing their values by their maximum values, so the district with the value of one was the best one with regard to the analyzed variable (in the given year). As an aggregation formula, the arithmetic mean was used to calculate the (in such a way already normalized) measure, which takes the value of 1 for the paragon object and the value of 0 for some nonexistent object taking the value of 0 in all variables. This is a simplified method in which our implicit distance from the paragon object (which is not calculated directly) is defined as the average difference (distance) over all variables between the analyzed object and the paragon object. Doing it this way resulted in no need for additional normalization of the obtained measure. Figure 2 illustrates the level of development of all districts in Poland in 2003. Higher values of the taxonomic measure are represented by darker shading, so the darker a district appears on the map, the better developed it is. The best districts are those of big cities (Warsaw, Wroclaw, Poznan, Lodz, Krakow, Gdansk, Szczecin), with some others joining this group, e.g. seaside districts (Swinoujscie, Kamien Pom., Kolobrzeg). The distribution of the measure is presented in Figure 3 by means of a histogram and a box-and-whisker plot. As for many other economic variables, this distribution turns out to be right-skewed, with some outlying observations.

Fig. 3. Histogram and box-and-whisker plot of taxonomic measure of development.

A first insight into the similarity of development in each of the established NUTS1 regions is given by Figure 4, box-and-whisker plots of the measure of development. On the left side of the graph there is a region (1) whose districts cover almost the whole range of values of the measure. The only differences between regions result from the membership of the best districts, causing maximal values to fluctuate more than minimal values. Bigger diversification may be observed between voivodships (NUTS2, see Figure 5). It is very interesting to use tree clustering to reveal similar groups according to the diversity and the average level of development in each region. This was done for voivodships, using Ward's method.
Fig. 4. Box-and-whisker plot of taxonomic measure of development in NUTS1 regions.
Fig. 5. Box-and-whisker plot of taxonomic measure of development in NUTS2 regions.
measure for each region. This is shown in Figure 6. Figure 7 results from tree clustering based on means and standard deviations calculated for all analyzed variables, so each of the initial 8 variables was represented by two variables: one containing the means and the other the standard deviations calculated for each voivodship (over the districts it covers). The latter of the two figures seems to be easier to interpret. All voivodships could be divided into two or four major groups according to their internal diversity and their level of development.
Fig. 6. Clustering of NUTS2 regions by means and st. dev. of taxonomic measure.
Fig. 7. Clustering of NUTS2 regions by means and st. dev. of variables.
Next, the k-means algorithm was used to divide the districts into five groups. The diversification within each group is illustrated in Figure 8. The first group, on the left side of the graph, consists of only six districts, among them four districts with the best values. The members of all groups are presented in Figure 9.
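A minimal sketch of this grouping step is given below; it assumes the taxonomic measure values are available as a one-dimensional array and uses scikit-learn's KMeans with five clusters, so the array size, variable names and random seed are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical vector of taxonomic measure values, one entry per district
measure = np.random.default_rng(0).uniform(0.35, 0.75, size=400).reshape(-1, 1)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(measure)
labels = kmeans.labels_                      # group membership of each district
group_means = [measure[labels == g].mean() for g in range(5)]
print(sorted(group_means, reverse=True))     # groups ordered by average development
```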
4 Conclusions
In analyzing the results, one may point out the problem of correlations between the variables used to construct the taxonomic measure. No weighting scheme was used during the calculation of the measure of development, but in
Fig. 8. Box-and-whisker plot of taxonomic measure of development in groups (K-means).
Fig. 9. Members of groups in K-means method.
case of significant correlations some scheme is certainly applied implicitly. If we pick two variables that are correlated, we give more weight to their common factor, which then becomes more influential with respect to the aim of the analysis. However, the authors take the position that the set of variables is chosen at the expert's discretion (the so-called "expert method") and thus reflects the expert's views on what is important for the level of development. But there is also another aspect of this situation. The correlations may be considered to be stochastic, so even
though dependence is revealed in a large number of observations, in each individual case the variables may take values "far from" the overall regularity, which should also find its way into the analysis, whose objective is to assess the relative level of development of each district. Nevertheless, it is of course always worth analyzing the correlation matrix. We could then use factor analysis, but it should not change the relative positions of districts in the ranking constructed this way. The correlations mainly affect the range of obtained values of the taxonomic measure (a flattening of results). The taxonomic measure of development defined here may itself be considered a common factor for all analyzed variables, but one with an easy economic interpretation. Summing up the results, there is a need for substantial structural adjustments in Poland. The results of the study show that the level of development is strongly diversified and the established regions are far from homogeneous, with big differences in the level of economic development. A lot of hope is connected with the expected flow of structural funds from the European Union, which may help regions lagging behind to develop more quickly. We recommend the use of synthetic measures and classification methods not only in analyses but also directly in framing regional policies, as an addition to the simple comparison of GDP per capita. These methods give a lot of additional insight and may be a perfect complement to simple comparisons of basic statistics. It is very important to realize how significant the differences may be not only between regions but also within them, because there is a risk that structural funds might be used for the further development of wealthier districts, leaving the rest lagging even further behind.
The Classification of Candlestick Charts: Laying the Foundation for Further Empirical Research

Stefan Etschberger¹, Henning Fock², Christian Klein¹, and Bernhard Zwergel¹

¹ Institut für Statistik und mathematische Wirtschaftstheorie, Universität Augsburg, D-86135 Augsburg, Germany
² Lehrstuhl für Finanz- und Bankwirtschaft, Universität Augsburg, D-86135 Augsburg, Germany
Abstract. The academic discussion about technical analysis has a long tradition, in the American literature as well as in the German scientific community. Lo et al. (2000) laid the foundation for empirical research on the "classical" technical indicators (like "head-and-shoulders" formations) with their paper "Foundations of Technical Analysis". The candlestick technique is based on the visual recognition of patterns called "candlesticks", a special method of visualizing the behavior of asset prices. Candlesticks are very popular in Asia and their popularity is growing in western countries. Until now, little empirical research has been done concerning the performance of technical analysis with candlesticks. This is probably due to the fact that no automatic and deterministic way to classify candlestick patterns has been developed thus far. The purpose of this work is to lay the basis for future empirical investigations and to develop a systematic approach by which candlestick charts can be classified.
1 Introduction
Technical analysis uses past prices in order to forecast future prices. Although technical analysis is widespread in industry practice, academic finance has still not found much reasoning why technical analysis should work. Consequently, technical analysis has up to now been a point of great controversy between academic and practical finance. However, a lot of empirical work has been done on the performance of technical analysis.¹ There are literally hundreds of different methods, patterns and indicators which investors claim to be successful. A possible differentiation for these techniques is to distinguish between
Charting: the visual recognition of patterns in charts, for example the head-and-shoulders pattern;

¹ For an overview see Fock et al. (2005).
Market Technique: the calculation of buy and sell signals from mathematical definitions, for example moving averages.
The methodological approach to empirical research on technical analysis patterns belonging to market technique is obvious: the patterns are based on mathematical and therefore computable definitions. In order to analyze the predictive power of this kind of pattern, you simply need a database, a PC and a little bit of programming know-how. Empirical research on patterns based on charting is not as simple. The identification of the patterns is a very subjective procedure, because you need the expertise of a specialist, who is normally a human being and thus by definition fallible. This problem has partially been solved by Lo et al. (2000), who developed a systematic and automatic approach to pattern recognition using nonparametric kernel regression. However, they only give attention to popular charting patterns based on stock charts.
2 Candlestick Charting
Candlesticks are a special method of visualizing the behavior of share prices. The candlestick technique is very old: it was developed during the 17th century in Japan. At that time a tax existed which was paid in rice. It seems that there was a prospering rice market, because it was possible to buy and sell future tax income; that means a kind of "rice forward" existed. There was a lot of trading in these contracts, which led to the foundation of one of the world's first futures exchanges. Some of the traders used technical analysis for their speculation on rice. This was the hour of birth of technical analysis with candlestick charts (Nison (1991)). Since that time technical analysis with candlestick patterns has been very popular in Asia. Its popularity in western countries has been increasing enormously during the last few years. One piece of evidence for this growth is the great number of publications in the popular scientific literature in recent years. In contrast to the age and the popularity of candlesticks, there seem to be no scientific publications dealing with the predictive power of candlesticks. The reason for this phenomenon may be that the candlestick technique is based on the visual recognition of patterns – the candlestick technique belongs to the type of technical analysis that we call "charting". The recognition of patterns in price charts is consequently a very subjective proceeding. Because the literature dealing with candlesticks exists only in the popular scientific field, candlestick patterns are presented by picturing them and/or by a verbal description only. Our aim is to lay the foundations for future research: we develop a systematic and automatic approach by which candlestick charts can be classified. With the help of our definitions it is possible to investigate the performance of candlestick charting in a standardized way.
Fig. 1. Definition of a Candle
3 The Classification of Candlestick Patterns

3.1 Candlesticks
A candle describes the behavior of share prices during a time interval. The trader is free to decide how long this interval may be. Many technical analysts work with candles based on daily open/high/low/close prices, while day traders mostly use candles based on a five-minute time interval. The candle is defined by the open/high/low/close prices of the specified time interval. Figure 1 shows how a candlestick is formed. The body of the candle, i.e. the difference between the open and close price, is called the "real body" of the candle. The high and the low price define the upper and the lower "shadow" of the candle. A white real body indicates that the close price is higher than the open price; consequently we have a positive return during that time interval. A black real body indicates a close lower than the open price (i.e. a negative return). A candlestick pattern is built from one up to five candles. Many of the candlestick patterns are based on a few basic patterns from which a new pattern is constructed by mirroring, inverting, or adding another candle. This principle is demonstrated in Figure 2: A "Hanging Man" is a small white candle with a long lower shadow and no (or an extremely short) upper shadow. According to popular literature this pattern is a sell signal when preceded by an uptrend. By mirroring the "Hanging Man" we construct a pattern called "Inverted Hammer", which is a buy signal when preceded by a downtrend. A "Doji" is a candle with a very small real body. Whether it is a buy or sell signal depends on other indicators. If the doji is followed by a tall black candle, then this pattern is called "Doji engulfing pattern (negative)" and it is a clear sell signal. This property may lead to the problem of overlapping subsamples when doing empirical research.
Fig. 2. Example for candlestick patterns

Real Body        Quantile
Doji             [0 − 0.1)
Small Candle     [0.1 − 0.3)
Medium Candle    [0.3 − 0.7)
Tall Candle      [0.7 − 1]

Table 1. Quantiles of the size of the real bodies
For the identification of the patterns it is necessary to define the size of each candle class and the proportions of the candles to each other.

3.2 Size of the Candles
We decided to differentiate between four different sizes of candles according to popular literature and practitioners:
• Tall Candles,
• Medium Candles,
• Small Candles,
• Very small candles, the so-called "doji".
The size of a candle thereby refers to the size of its real body. The first step of the classification process is to calculate an indicator rb for the size of the real body:
$$rb = \left|\frac{\text{open}}{\text{close}} - 1\right|.$$
Certainly this size rb depends on the time interval of the candles. We strongly recommend the use of an out-of-sample dataset in order to calculate the deciles of the real bodies. Using the master sample may lead to dependencies in the results – this is a real problem when statistical tests are used. The categorization of the candles is done by the sample's quantiles of the real bodies (see Table 1). This classification corresponds with the (verbal) description in popular literature. In the course of an empirical investigation (Fock et al. (2005))
we interviewed practitioners and showed them examples of these classified candles. This study resulted in an approval of these definitions. Two remarks have to be added to this procedure:
1. Using the same definitions for black and white candles is only correct if the color of the candle is statistically independent of the size of the candle. This has to be checked, for example with a Kolmogorov-Smirnov test. The hypotheses are
$$H_0: F = G \qquad \text{vs.} \qquad H_1: F \neq G,$$
where F is the distribution of the negative returns and G is the distribution of the positive returns. If color and size are not independent, the classification of the size has to be done in two steps, for the black and the white candles separately.
2. One problem which may appear is the so-called "discreteness of stock prices". Especially when handling extremely short time intervals it may happen that two or more deciles have the same values. The reason is that the price process is not continuous but discrete, which leads to a minimum tick size and therefore to "jumps" in the charts. As an example, it may be possible that "doji" candles have the same definition as "small" candles when defining classes for five-minute candles. If that happens, the definitions should only be adjusted if more than 10% of the candles have an open price which equals the close price. In this case the definition for dojis in Table 1 should be changed and all candles with no real body are classified as dojis. Small candles are then candles which belong to their specified quantile and whose rb is greater than zero.
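The following sketch illustrates both steps under stated assumptions: candle sizes are assigned from out-of-sample deciles of rb, and the Kolmogorov-Smirnov check of remark 1 is carried out with scipy; the array names, thresholds and simulated prices are illustrative, not the authors' code.

```python
import numpy as np
from scipy.stats import ks_2samp

def real_body(open_, close):
    """Relative size of the real body, rb = |open/close - 1|."""
    return np.abs(open_ / close - 1.0)

def classify_size(rb, reference_rb):
    """Assign doji/small/medium/tall using the 10%/30%/70% quantiles of an
    out-of-sample reference series of real bodies (Table 1)."""
    q10, q30, q70 = np.quantile(reference_rb, [0.1, 0.3, 0.7])
    bins = np.digitize(rb, [q10, q30, q70])
    return np.array(["doji", "small", "medium", "tall"])[bins]

# Remark 1: test whether the rb distribution differs between white and black candles
rng = np.random.default_rng(1)
open_, close = rng.uniform(95, 105, 1000), rng.uniform(95, 105, 1000)
rb = real_body(open_, close)
white, black = rb[close > open_], rb[close < open_]
stat, p_value = ks_2samp(white, black)   # H0: F = G vs. H1: F != G
print(classify_size(rb[:5], rb), p_value)
```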
3.3 Relations of the Candles
In the next step the relations of the candles are defined. The candles are numbered from left to right. The parameters we use to define the relations are the proportions and relations of the open, high, low and close prices. We use mathematical relations (<, >, ≥, ≤) for these definitions; the variables are listed in Table 2. Table 3 shows our definitions for 33 important candlestick chart patterns. In addition to the mathematical definition of each pattern we state whether the pattern generally is a buy or a sell signal. Each definition specifies the size of the candles as well as the relations between the candles. For example, c1 < a(c2, o2) means that the close price of the first candle has to be smaller than both the close price and the open price of the second candle, and ls2 > 3 · rb2 must be interpreted as "the lower shadow of the second candle is at least three times as tall as its real body". Again we showed our definitions and examples of recognized patterns to practitioners, who confirmed that our classifications seem to be correct; a minimal programmatic check of such a definition is sketched after Table 2.
Variable                 Description
Type of signal (L/S):    buy signal / long (L); sell signal / short (S)
Color (Co):              white (w); black (b); not defined (n.d.)
Real body (rb):          doji (d); small (s); medium (m); tall (t)
Price:                   open (o), high (h), low (l), close (c), upper shadow (us), lower shadow (ls), middle of real body (mrb)
a(price1,price2):        AND-constraint

Table 2. Definition of the variables
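As an illustration of how such mathematical pattern definitions can be evaluated automatically, the sketch below checks the two example constraints quoted above (c1 < a(c2, o2) and ls2 > 3 · rb2) on two candles; this is not one of the paper's 33 official pattern definitions, and the candle values are invented.

```python
from dataclasses import dataclass

@dataclass
class Candle:
    open: float
    high: float
    low: float
    close: float

    @property
    def rb(self) -> float:          # size of the real body
        return abs(self.open - self.close)

    @property
    def ls(self) -> float:          # lower shadow
        return min(self.open, self.close) - self.low

def example_pattern(c1: Candle, c2: Candle) -> bool:
    """c1 < a(c2, o2): close of candle 1 below close AND open of candle 2;
    ls2 > 3*rb2: lower shadow of candle 2 at least three times its real body."""
    return c1.close < min(c2.close, c2.open) and c2.ls > 3 * c2.rb

print(example_pattern(Candle(100, 101, 98, 99), Candle(100, 101.5, 95, 100.5)))
```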
4 Empirical Example
In the paper of Fock et al. (2005) some of the introduced definitions were used to set up empirical research on the performance of candlestick charting on intraday DAX and FGBL futures data starting on January 1st, 2002 and ending December 31st, 2003. The forecasting performance of candlestick patterns was tested for significance. Subsequently, buy and sell signals derived from combinations of candles with other indicators were also tested for significance. These indicators were the moving average (MA), the momentum (M), the relative strength index (RSI) and the moving average convergence/divergence (MACD) indicator. As a benchmark we used randomized buy signals in the underlying futures. Although we did not take transaction costs into account, the results are quite poor. In most cases we were not able to find significantly better results than a benchmark with randomized transactions. Based on the insights and findings gained from Fock et al. (2005), the classification approach described in the present paper was developed. Additionally, the definitions of the candlestick patterns were completed.
5 Summary
We introduced mathematical definitions to lay the foundations for future research on candlesticks, much as Lo et al.’s (2000) article did for the analysis of “charting” through its systematic definition of “charting” patterns. With the help of these definitions it is quite easy to program algorithms to detect candlestick patterns in price databases in an objective and traceable way. We hope that subsequent studies will employ and profit from our mathematical definition schemes of the candlestick patterns.
Table 3. Definition of Candlestick Charts
References
FOCK, H.J., KLEIN, C. and ZWERGEL, B. (2005): The Performance of Candlestick Analysis on Intraday Future Data. Journal of Derivatives, forthcoming.
LO, A.W., MAMAYSKY, H. and WANG, J. (2000): Foundations of Technical Analysis: Computational Algorithms, Statistical Inference, and Empirical Implementation. Journal of Finance, 55, 1705–1770.
NISON, S. (1991): Japanese Candlestick Charting Techniques. New York Institute of Finance, New York.
Modeling and Estimating the Credit Cycle by a Probit-AR(1)-Process

Steffi Höse and Konstantin Vogl

Lehrstuhl für Quantitative Verfahren, insbesondere Statistik, Fakultät Wirtschaftswissenschaften, TU Dresden, 01062 Dresden
Abstract. The loss distribution of a credit portfolio is considered within the framework of a Bernoulli-mixture model where in each rating grade the stochastic Bernoulli-parameter follows an autoregressive stationary process. Changes in the loss distribution are discussed when the unconditional view is replaced by a conditional view where information from the last period is taken into account. This relates to the lively debate among practitioners whether regulatory capital should incorporate point-in-time or through-the-cycle aspects. Calculations are carried out in a model estimated with real data from a large retail portfolio.
1 Motivation
For a better calibration of model parameters as well as for a wide range of prediction purposes, it is of key interest to incorporate the idea of a business cycle into portfolio models. In the field of credit risk, practitioners often pay tribute to the fact that their data history is so weak that they confine their estimation of model parameters to the assumption of independence over time, whereas it is apparent even from short time series that default rates carry serial correlation. Therefore, a model with parsimonious parameterization is needed, due to the scarce data situation one is typically confronted with when applying credit risk models. This paper examines what can be accomplished using a stationary first-order autoregressive process to model the credit cycle.
2 Model Assumptions
In the following, a homogeneous portfolio with $n_t$ obligors belonging to the same rating grade is considered over time periods $t = 1, \ldots, T$. Let $A_{t,i}$ denote the default-indicator variable of the $i$-th obligor in time period $t$, which takes the value one in the case of default and zero otherwise. The defaults are modeled within a Bernoulli mixture model¹, where at the beginning of time period $t$ the stochastic default probability $\tilde{\pi}_t$ takes a value $\pi_t \in\, ]0, 1[$. The variables $A_{t,i}$ for $i = 1, \ldots, n_t$ are assumed to be conditionally independent for given realizations of the stochastic default probabilities $\tilde{\pi}_t, \tilde{\pi}_{t-1}, \ldots$

¹ As an example see Joe (1997, p. 211).
and for given realizations of macroeconomic impact variables $V_{t-1}, V_{t-2}, \ldots$. Thus, their conditional distribution is a Bernoulli distribution,
$$A_{t,i} \mid \tilde{\pi}_t = \pi_t,\, V_{t-1} = v_{t-1},\, \tilde{\pi}_{t-1} = \pi_{t-1}, \ldots \;\sim\; \mathrm{Ber}(\pi_t), \qquad (1)$$
with parameter $\pi_t$. Given these realizations, the default of each of the obligors during time period $t$ occurs independently with probability $\pi_t$. Therefore, $P[A_{t,i} = 1 \mid \tilde{\pi}_t = \pi_t] = \pi_t$ is the conditional default probability during period $t$, and
$$P[A_{t,i} = 1] = E[\tilde{\pi}_t] \qquad (2)$$
is the unconditional default probability. The transformed² variables $\Phi^{-1}(\tilde{\pi}_t)$, in the sequel addressed as probits, are assumed to follow a strictly stationary first-order autoregressive process,
$$\Phi^{-1}(\tilde{\pi}_t) = \alpha + \beta\,\Phi^{-1}(\tilde{\pi}_{t-1}) + V_{t-1} + U_t,$$
where $\alpha \in \mathbb{R}$ and $-1 < \beta < 1$. The random variable $V_{t-1}$ plays the role of an observable macroeconomic impact, whereas $U_t$ includes the remaining irregular component. All the random variables $V_t$ and $U_t$ are assumed to be mutually independent and Gaussian distributed,
$$U_t \overset{\text{i.i.d.}}{\sim} N(0, \sigma_U^2), \qquad V_t \overset{\text{i.i.d.}}{\sim} N(\mu_V, \sigma_V^2).$$
The previously made assumptions lead to Gaussian distributed probits,
$$\Phi^{-1}(\tilde{\pi}_t) \sim N\!\left(\frac{\alpha + \mu_V}{1-\beta},\; \frac{\sigma_U^2 + \sigma_V^2}{1-\beta^2}\right),$$
which are stationary and dependent over time. Aside from the exogenous macroeconomic impact variable, the model presented here is equivalent to a Basel II single-factor model³ with a stationary first-order autoregressive process for the systematic factor.
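To make the dynamics concrete, the following sketch simulates the stochastic default probability from the probit-AR(1) recursion above; the parameter values are placeholders chosen for illustration, not estimates from the paper.

```python
import numpy as np
from scipy.stats import norm

def simulate_default_probabilities(alpha, beta, sigma_u, mu_v, sigma_v, T, seed=0):
    """Simulate pi_t via Phi^{-1}(pi_t) = alpha + beta*Phi^{-1}(pi_{t-1}) + V_{t-1} + U_t."""
    rng = np.random.default_rng(seed)
    # start the probit at its stationary mean (alpha + mu_v) / (1 - beta)
    probit = (alpha + mu_v) / (1.0 - beta)
    pis = []
    for _ in range(T):
        v = rng.normal(mu_v, sigma_v)      # observable macroeconomic impact
        u = rng.normal(0.0, sigma_u)       # irregular component
        probit = alpha + beta * probit + v + u
        pis.append(norm.cdf(probit))       # pi_t = Phi(probit)
    return np.array(pis)

print(simulate_default_probabilities(alpha=-1.3, beta=0.55, sigma_u=0.05,
                                     mu_v=-0.01, sigma_v=0.02, T=8))
```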
3 Credit Portfolio Loss Distribution
In this framework the model is parameterized by $\alpha$, $\beta$, $\sigma_U^2$, $\mu_V$ and $\sigma_V^2$. The expectation and the variance of the stationary distribution of the stochastic

² The notation $\Phi^{-1}$ is used for the inverse of the cumulative distribution function of the standardized Gaussian distribution.
³ See Basel Committee on Banking Supervision (2004) and Höse and Vogl (2005).
default probability $\tilde{\pi}_t$ can be written as functions of the model parameters. The expectation is determined by
$$\pi := E[\tilde{\pi}_t] = \Phi\!\left(\frac{(\alpha+\mu_V)\sqrt{1-\beta^2}}{(1-\beta)\sqrt{1-\beta^2+\sigma_U^2+\sigma_V^2}}\right),$$
which plays the role of the unconditional default probability in the rating class, see (2). The variance is given by
$$V[\tilde{\pi}_t] = \Phi_2\!\left(\Phi^{-1}(\pi), \Phi^{-1}(\pi); \varrho\right) - \pi^2,$$
where $\Phi_2(\cdot,\cdot;\varrho)$ denotes the cumulative distribution function of the standardized bivariate Gaussian distribution with the correlation parameter
$$\varrho := \frac{\sigma_U^2 + \sigma_V^2}{1 - \beta^2 + \sigma_U^2 + \sigma_V^2}.$$
The Bernoulli mixture model given in (1) implies that the default indicators are Bernoulli distributed random variables with default probability $\pi$, where the dependence structure is determined by
$$\mathrm{Cov}[A_{t,i}, A_{t,j}] = V[\tilde{\pi}_t], \qquad \forall i \neq j.$$
So, the default indicators are equicorrelated with a non-negative correlation. With lagged variables as the condition, the default indicators
$$A_{t,i} \mid \tilde{\pi}_{t-1} = \pi_{t-1},\, V_{t-1} = v_{t-1} \;\sim\; \mathrm{Ber}(p_t)$$
are Bernoulli distributed, where the Bernoulli parameter
$$p_t := \Phi\!\left(\frac{\alpha + \beta\,\Phi^{-1}(\pi_{t-1}) + v_{t-1}}{\sqrt{1+\sigma_U^2}}\right) \qquad (3)$$
additionally depends on the realizations of the default probability and the macroeconomic impact in period $t-1$. The conditional covariance of the default indicators is equal to the conditional variance of the stochastic default probability which is determined by
$$\mathrm{Cov}[A_{t,i}, A_{t,j} \mid \tilde{\pi}_{t-1}, V_{t-1}] = V[\tilde{\pi}_t \mid \tilde{\pi}_{t-1}, V_{t-1}], \qquad \forall i \neq j,$$
$$V[\tilde{\pi}_t \mid \tilde{\pi}_{t-1} = \pi_{t-1}, V_{t-1} = v_{t-1}] = \Phi_2\!\left(\Phi^{-1}(p_t), \Phi^{-1}(p_t); \tfrac{\sigma_U^2}{1+\sigma_U^2}\right) - p_t^2,$$
with $p_t$ from (3). Using these preparations, the first two moments of the portfolio loss distribution can be assessed. The portfolio loss in period $t$ is defined as⁴
$$L_t := \sum_{i=1}^{n_t} w_{t,i} A_{t,i},$$

⁴ See Basel Committee on Banking Supervision (2004).
where $w_{t,i} \geq 0$ is the product of the exposure at default and the loss given default caused by the $i$-th obligor during period $t$. Therefore, the unconditional expected loss EL and the conditional expected loss $\mathrm{EL}_c$ in $t$ are defined by
$$\mathrm{EL} := E[L_t] = \pi \sum_{i=1}^{n_t} w_{t,i} \qquad \text{and} \qquad \mathrm{EL}_c := E[L_t \mid \tilde{\pi}_{t-1} = \pi_{t-1}, V_{t-1} = v_{t-1}] = p_t \sum_{i=1}^{n_t} w_{t,i}$$
using $p_t$ from (3). In order to measure the uncertainty of the portfolio loss, the unexpected loss⁵ UL can be defined by the square root of
$$\mathrm{UL}^2 := V[L_t] = \pi(1-\pi) \sum_{i=1}^{n_t} w_{t,i}^2 + V[\tilde{\pi}_t] \sum_{\substack{i,j=1 \\ i \neq j}}^{n_t} w_{t,i} w_{t,j}$$
in the unconditional case and by the square root of
$$\mathrm{UL}_c^2 := V[L_t \mid \tilde{\pi}_{t-1} = \pi_{t-1}, V_{t-1} = v_{t-1}] = p_t(1-p_t) \sum_{i=1}^{n_t} w_{t,i}^2 + V[\tilde{\pi}_t \mid \tilde{\pi}_{t-1} = \pi_{t-1}, V_{t-1} = v_{t-1}] \sum_{\substack{i,j=1 \\ i \neq j}}^{n_t} w_{t,i} w_{t,j}$$
with $p_t$ from (3) when the conditional distribution is considered. Whereas the random variable $\mathrm{EL}_c$ scatters around the unconditional expected loss EL, the case is somewhat different if unexpected losses are considered. Using the law of total variance and Jensen's inequality one can show that the expectation of the conditional unexpected loss is less than the unconditional unexpected loss,
$$E[\mathrm{UL}_c] \leq \sqrt{E[\mathrm{UL}_c^2]} = \sqrt{E\bigl[V[L_t \mid \tilde{\pi}_{t-1}, V_{t-1}]\bigr]} \leq \sqrt{V[L_t]} = \mathrm{UL}. \qquad (4)$$
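A small numerical sketch of these moment formulas is given below; it evaluates π, V[π̃_t], EL and UL for a homogeneous portfolio with unit weights, using scipy's bivariate normal CDF. The portfolio size and the rating grade A parameter estimates are taken from the example in the next section, while the implementation details are our own illustration, not the authors' code.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def unconditional_moments(alpha, beta, var_u, mu_v, var_v, n, w=1.0):
    """Unconditional default probability pi, V[pi_t], EL and UL for n obligors
    with equal weights w (homogeneous portfolio)."""
    pi = norm.cdf((alpha + mu_v) * np.sqrt(1 - beta**2)
                  / ((1 - beta) * np.sqrt(1 - beta**2 + var_u + var_v)))
    rho = (var_u + var_v) / (1 - beta**2 + var_u + var_v)
    q = norm.ppf(pi)
    # Phi_2(q, q; rho): standardized bivariate Gaussian cdf
    phi2 = multivariate_normal(mean=[0.0, 0.0],
                               cov=[[1.0, rho], [rho, 1.0]]).cdf([q, q])
    var_pi = phi2 - pi**2
    el = pi * n * w
    ul = np.sqrt(pi * (1 - pi) * n * w**2 + var_pi * n * (n - 1) * w**2)
    return pi, var_pi, el, ul

print(unconditional_moments(alpha=-1.2989, beta=0.55648, var_u=0.0023923,
                            mu_v=-0.010806, var_v=0.00025082, n=10_000))
```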
4 An Example
The model is applied to a hypothetical portfolio of $n_t = 10\,000$ retail clients in a single rating class with equal weights $w_{t,i} = 1$. In order to estimate the Probit-AR(1)-process, an authentic data set from the SCHUFA⁶ containing the default history with quarterly observations from I/2000 to IV/2003 of about 800 000 German retail clients is used. The macroeconomic impact variable is chosen to be
$$V_t = \gamma_1 X_t^{(1)} + \gamma_2 X_t^{(2)}, \qquad (5)$$

⁵ The meaning of unexpected loss differs among authors.
⁶ The SCHUFA AG is one of the major suppliers of consumer credit scores in Germany, comparable to EXPERIAN, EQUIFAX or TRANS UNION in the U.S.
where the exogenous variables $X_t^{(1)}$ and $X_t^{(2)}$ denote the change of the logarithm of the disposable income of German households and of the German unemployment rate.⁷ As an example, the estimated model parameters of rating grade A (highest creditworthiness) and rating grade M (worst non-default grade) are given as follows:

estimate for    rating grade A    rating grade M
α               -1.2989           -0.20176
β                0.55648           0.84434
σ_U²             0.0023923         0.0026705
σ_V²             0.00025082        0.0018301
µ_V             -0.010806         -0.016491
ϱ                0.0038141         0.015434
π                0.0016025         0.082076
The coefficients $\alpha$, $\beta$, $\gamma_1$, $\gamma_2$ and $\sigma_U^2$ were estimated by the ordinary least squares method, and the estimates for $\mu_V$ and $\sigma_V^2$ follow from equation (5). For this data set the coefficient of determination reaches about 44 % for rating grade A and 81 % for rating grade M. The conditional and unconditional portfolio loss distributions are mixtures of Binomial distributions. Using the parameter estimates, the loss distribution can be calculated by means of Monte Carlo methods. In Figures 1 and 2 the loss distributions of rating grades A and M are plotted. The solid curves represent the unconditional distributions. The conditional distributions for good, medium and bad macroeconomic scenarios are plotted with dashes. Additionally, the vertical lines mark the value of the expected loss for each distribution, dashed in the conditional and solid in the unconditional cases. Table 1 contains the estimates of the unexpected loss. The unconditional values and conditional values for good, medium and bad macroeconomic impacts are shown. From Figures 1 and 2 it is apparent that the effect of condi-
Table 1. Estimated unexpected loss for a portfolio of $n_t = 10\,000$ obligors of rating grade A and M with $w_{t,i} = 1$ for $i = 1, 2, \ldots, n_t$

estimate for the unexpected loss         rating grade A   rating grade M
unconditional                            5.14             191.67
cond. on good macroeconomic impact       4.15              76.11
cond. on medium macroeconomic impact     4.69              91.94
cond. on bad macroeconomic impact        5.29             110.55
tioning reduces the variance of the loss distribution. For some single scenarios however, the unexpected loss can also be greater in the conditional than in the unconditional case, because equation (4) refers only to the expectation.

⁷ $X_t^{(1)} = \ln Y_t^{(1)} - \ln Y_{t-1}^{(1)}$ and $X_t^{(2)} = \ln Y_t^{(2)} - \ln Y_{t-1}^{(2)}$, where for $Y^{(1)}$ and $Y^{(2)}$ the time series BDJA9405B and BDOUN013R from DATASTREAM are used.
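As a hedged sketch of the Monte Carlo calculation mentioned above, the code below simulates the conditional one-period portfolio loss: it draws the conditional default probability from the probit relation and then Binomial defaults for the homogeneous portfolio. The scenario inputs (π_{t−1}, v_{t−1}) and the simulation size are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.stats import norm

def simulate_conditional_losses(alpha, beta, sigma_u, pi_prev, v_prev,
                                n_obligors=10_000, n_sims=50_000, seed=0):
    """Monte Carlo draws of the one-period portfolio loss L_t (unit weights),
    conditional on pi_{t-1} and the macroeconomic impact v_{t-1}."""
    rng = np.random.default_rng(seed)
    mean_probit = alpha + beta * norm.ppf(pi_prev) + v_prev
    probits = rng.normal(mean_probit, sigma_u, size=n_sims)   # Phi^{-1}(pi_t) given info
    pi_t = norm.cdf(probits)
    losses = rng.binomial(n_obligors, pi_t)                   # mixture of Binomials
    return losses

losses = simulate_conditional_losses(alpha=-1.2989, beta=0.55648,
                                     sigma_u=np.sqrt(0.0023923),
                                     pi_prev=0.0016, v_prev=-0.01)
print(losses.mean(), losses.std())       # estimates of EL_c and UL_c
```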
Fig. 1. Unconditional (solid) and conditional (dashed) portfolio loss distributions and expected losses (vertical lines) of rating grade A
Fig. 2. Unconditional (solid) and conditional (dashed) portfolio loss distributions and expected losses (vertical lines) of rating grade M
The autoregressive model of dependence over time allows one to predict the stochastic default probability on the basis of past observations. The $1-\alpha^*$ probability interval for $\tilde{\pi}_t$ is given by $[z_l, z_u]$, where
$$z_l := \Phi\!\left(\alpha + \beta\,\Phi^{-1}(\pi_{t-1}) + v_{t-1} - \sigma_U\,\Phi^{-1}\!\left(1 - \tfrac{\alpha^*}{2}\right)\right),$$
$$z_u := \Phi\!\left(\alpha + \beta\,\Phi^{-1}(\pi_{t-1}) + v_{t-1} + \sigma_U\,\Phi^{-1}\!\left(1 - \tfrac{\alpha^*}{2}\right)\right).$$
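A short numerical version of this interval, under the same assumptions and with illustrative inputs, might look as follows.

```python
from scipy.stats import norm

def probability_interval(alpha, beta, sigma_u, pi_prev, v_prev, alpha_star=0.10):
    """(1 - alpha_star) probability interval [z_l, z_u] for the next-period
    stochastic default probability."""
    center = alpha + beta * norm.ppf(pi_prev) + v_prev
    half_width = sigma_u * norm.ppf(1 - alpha_star / 2)
    return norm.cdf(center - half_width), norm.cdf(center + half_width)

print(probability_interval(alpha=-1.2989, beta=0.55648,
                           sigma_u=0.0023923 ** 0.5,
                           pi_prev=0.0016, v_prev=-0.01))
```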
Fig. 3. Predicted default probabilities and default rates of rating grade A
Fig. 4. Predicted default probabilities and default rates of rating grade M
The quality of predictions is demonstrated in Figures 3 and 4. The observed default rates are displayed as solid lines whereas the predicted default probabilities are shown with long dashed lines. The additional short dashed lines represent the bounds of the 90% probability intervals around the point estimator. The overall quality of prediction is fairly good despite the fact that the model is only capable of reacting to recent changes with a delay of one period. Incorporating more than one time lag into the model would increase the number of model parameters to be estimated. Here, five model parameters have already been estimated from 15 observations.
5 Conclusion
In the previous sections nothing is said about how to deal with granularity effects and multiple rating grades in the portfolio. Whereas the first could still be handled at the cost of considerably increasing the simulation effort, the latter would require estimating additional model parameters again, such as the between-class correlations and so forth. Before struggling with these problems, one should however realize what can be seen from the feasible simple case shown in this paper. Firstly, it is evident, as shown in Figures 1 and 2, that obligors with a low probability of default are far less affected by the credit cycle than those with a high default probability. Secondly, replacing the unconditional view by a conditional one leads to a significant reduction of the variance of the portfolio loss distribution only if the autoregressive model yields a high coefficient of determination. If this is not the case, the conditional variance can even rise when the state of the economy is bad. In this context, the regulatory capital could be reduced by conditioning the loss distribution within an autoregressive model with high explanatory power. One should keep in mind, however, that the regulatory capital is defined according to Basel II by the 99.9%-quantile from which the expected losses are subtracted. This might cause difficulties in the conditional case, since the standard risk costs imposed on the obligors typically can only reflect the long-run unconditional expected losses and often cannot be adjusted in every period. Finally, the fact that the conditional distributions scatter around the unconditional loss distribution clearly indicates that the SCHUFA rating system shares characteristics of so-called through-the-cycle rating systems, which use static and dynamic obligor characteristics but tend not to adjust ratings in response to general changes in overall macroeconomic conditions.⁸ On the contrary, from a perfect point-in-time rating system one expects that conditioning on all available information will not change the portfolio loss distribution.
References
BASEL COMMITTEE ON BANKING SUPERVISION (2004): International Convergence of Capital Measurement and Capital Standards: A Revised Framework. Basel, June 2004.
BASEL COMMITTEE ON BANKING SUPERVISION (2005): Studies on the Validation of Internal Rating Systems. Basel, February 2005.
HÖSE, S. and VOGL, K. (2005): Predicting the Credit Cycle with an Autoregressive Model. Dresdner Beiträge zu Quantitativen Verfahren, 45/05.
JOE, H. (1997): Multivariate Models and Dependence Concepts. Chapman & Hall, London.
⁸ See Basel Committee on Banking Supervision (2005).
Comparing and Selecting SVM-Kernels for Credit Scoring

Ralf Stecking and Klaus B. Schebesch

Faculty of Economics, University of Bremen, D-28359 Bremen, Germany

Abstract. Kernel methods for classification problems map data points into feature spaces where linear separation is performed. Detecting linear relations has been the focus of much research in statistics and machine learning, resulting in efficient algorithms that are well understood, with many applications including credit scoring problems. However, the choice of more appropriate kernel functions using nonlinear feature mapping may still improve this classification performance. We show how different kernel functions contribute to the solution of a credit scoring problem and we also show how to select and compare such kernels.
1 Introduction
Credit scoring based on classification models providing automated "rules" as opposed to crediting mediated by humans and other procedures not disclosed by banks ("discretion banking") is of ongoing interest, especially in view of conquering non-local markets with many potential small credit applicants (Berger et al. (2002)). Past work on building and evaluating classification models for credit scoring (Baesens et al. (2002), Friedman (2002), Huang et al. (2004)) includes simple basic methods like Linear Discriminant Analysis (LDA), Logistic Regression and Decision Trees, but also computationally more expensive and flexible methods like linear and non-linear Support Vector Machines (SVM). In previous work using real life credit data (Stecking and Schebesch 2003, Schebesch and Stecking 2005) we observe some out-of-sample classification improvement over simpler models when using SVM with RBF-kernels. Owing to data sparseness (relatively few cases for many input features per case), possible non-linearities in the true class separation are difficult to detect, but even in linear SVM we observe many support vectors (Schebesch and Stecking 2003), suggesting possible non-linearity in our credit scoring data. Even a modest gain in out-of-sample classification performance may be profitable for banks with many clients. Hence a closer look at some non-linear kernels is worthwhile. In this paper we propose including the Coulomb kernel of Hochreiter et al. (2003), a non-linear kernel which is quite similar to the popular RBF kernel but which has some desirable properties when applied to our credit scoring problem. We also compare the performance of other non-linear kernels for our credit scoring data and we report some comparative advantages of the kernels.
Primal SVM (max. margin):
$$\min_{w,b,\zeta}\; C\sum_{i=1}^{N}\zeta_i + \tfrac{1}{2}\langle w, w\rangle \quad \text{s.t.} \quad y_i\bigl[\langle \varphi(x_i), w\rangle + b\bigr] \geq 1 - \zeta_i \text{ for all } i, \text{ with slacks } \zeta_i \geq 0 \text{ and } C > 0.$$

Dual SVM:
$$\max_{\alpha}\; \sum_{i=1}^{N}\alpha_i - \tfrac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j k(x_i, x_j) \quad \text{s.t.} \quad \sum_{i=1}^{N} y_i\alpha_i = 0, \quad C \geq \alpha_i \geq 0.$$

Support vectors: unbounded SV (UBSV) with $0 < \alpha_i^* < C$; bounded SV (BSV) with $\alpha_i^* = C$.

SVM classification rule:
$$y^*(x) = \mathrm{Sign}\Bigl\{\sum_{i=1}^{N}\alpha_i^* y_i k(x, x_i) + b^*\Bigr\}.$$

Fig. 1. Stylized action of the SVM in feature space (upper left) and the SVM optimization problems (primal, top right and dual, middle right). Note the definition of support vectors UBSV and BSV. The action of two kernels over a stylized one-dimensional input space, the RBF and the Coulomb kernel (see main text), are also shown (lower left). In general, their choice influences the number and the position of support vectors and hence the classification rule (lower right).
2 Credit Client Classification and Kernel Choice
With a record of $N > 0$ past cases of credit applicants labeled according to whether credit applicant $i$ was "bad" (defaulting, $y_i = +1$) or "good" (non-defaulting, $y_i = -1$) during a contract period, a classification model $y(x)$ for credit scoring can be built using labeled training examples $\{x_i, y_i\}_{i=1,\ldots,N}$, where input vector $x_i \in \mathbb{R}^m$ describes the $m$ characteristics of the $i$th credit applicant. SVM classification models can use kernels with quite different properties in an associated dual optimization problem (details in Schölkopf and Smola (2002), Stecking and Schebesch (2003), Schebesch and Stecking (2005)) as is also depicted in figure 1. Two key issues of SVM from figure 1 are used in the sequel. First, a kernel $k(x_i, x_j)$ which e.g. acts similarly to a reverse distance function between $x_i$ and $x_j$ suffices to replace both the generally unknown $\varphi(x_i)$ and $w$ from the primal (derivation is not shown here). The choice of $k(x_i, x_j)$ is problem dependent. The second issue concerns the types of support vectors, which
are data points at the boundary of or within the margin between the classes in feature space (the shaded box in figure 1). BSV, the bounded support vectors, describe a region within the margin where surrounding (new) data points cannot be separated with high confidence. Hence a large number of non-support vectors (i.e. $\alpha_i = 0$) plus UBSV would cet. par. increase the confidence in the validity of the SVM model for new data from the same distribution. The location and number of support vectors in turn depend on the kernels. First we propose to compare the effects of two quite similar rotation symmetric kernels with $k(u, u) = c$, $c > 0$, $k(u, v) \geq 0$ and $k(u, v') > k(u, v'')$ for $\|u - v'\| < \|u - v''\|$ for all $u, v', v'' \in \mathbb{R}^m$, which act much like a reverse distance. The RBF kernel $\exp(-s\|u-v\|^2)$, $s > 0$, was used extensively on our credit scoring data set in previous work. Despite similarities with the RBF kernel, the Coulomb kernel $(\|u - v\|^2 + E)^{-\delta}$, with $E, \delta > 0$, introduced by Hochreiter et al. (2003), may still lead to different SVM models, especially for sparse / high dimensional inputs.
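For illustration, here is a small sketch of the two kernel functions as just defined (RBF with parameter s, Coulomb with parameters E and δ); the parameter values and the two example points are placeholders, not the tuned values reported later.

```python
import numpy as np

def rbf_kernel(u, v, s=1.0):
    """RBF kernel exp(-s * ||u - v||^2)."""
    return np.exp(-s * np.sum((u - v) ** 2))

def coulomb_kernel(u, v, E=1.0, delta=1.0):
    """Coulomb-type kernel (||u - v||^2 + E)^(-delta): similar shape to the RBF
    kernel but with heavier tails (see Hochreiter et al. (2003))."""
    return (np.sum((u - v) ** 2) + E) ** (-delta)

u, v = np.zeros(40), np.full(40, 0.1)   # two points in a 40-dimensional input space
print(rbf_kernel(u, v), coulomb_kernel(u, v))
```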
3 Similarity of SVM Models with Different Kernels
In order to compare problem dependent similarities of models using the RBF and the Coulomb kernel, we use a real life credit scoring data set with $N = 658$ cases and $m = 40$ features (Stecking and Schebesch (2003)). Combining grid search and stochastic search over the parameters $C$ and $\sigma$ (for RBF) and over $C$, $E$, $\delta$ (for Coulomb), we compare the errors and the support vectors for SVM models with both kernels. Figure 2 (lhs) plots the training error against the leave-one-out error of these models. Note that the Coulomb kernel easily admits models with fairly good leave-one-out error in the range of 25%–28% (by comparison, best linear models obtain around 27%–28%) and with training error near to or exactly equal to zero, differing clearly from the more regular U-shaped error pattern of the RBF kernel, which is more commonly produced by flexible data models. Furthermore, on our data, the Coulomb kernel uses less cpu-time (about half of that used by the RBF kernel) for training with leave-one-out error computation. The Coulomb models from the block with very low training error are obviously different from all the RBF models. But what about Coulomb models which are located near to very good RBF models (in the dashed window, figure 2 lhs) in this error plot? In figure 2 (rhs) we compare the two best models for both kernels with respect to leave-one-out error by sorting all $i = 1, \ldots, N$ data points according to the values $y_i \alpha_i^{c*}$ and $y_i \alpha_i^{r*}$, with indices $c*$ and $r*$ referring to the solutions of the Coulomb and RBF models. The sorted points show that the Coulomb SVM has more support vectors $\{x_i \mid y_i \alpha_i^{c*} \neq 0\}$ but fewer bounded support vectors $\{x_i \mid y_i \alpha_i^{c*} = \pm C^{c*}\}$ than the RBF kernel models, which also points to different functional models. Computation of the alignment (Shawe-Taylor et al. 2004) of the data matrices $K_{ij} = k(x_i, x_j)$ of the two kernels (see also next section), however, suggests extreme similarity
Fig. 2. Lhs plot: Training error vs. leave-one-out error (in percent misclassified) for differently parameterized SVM using Coulomb and RBF kernels. Rhs plot: yi αic∗ and yi αir∗ sorted over i = 1, ..., N by magnitude (see main text).
with $A(K^c, K^r) = 0.985$. The comparison of the full models, on the other hand, suggests functional dissimilarity of both models, by simply counting the sign contradictions of the model outputs, namely the number of instances $y^{c*}(x_i)\,y^{r*}(x_i) < 0$, which equals 50, i.e. the models misclassify different cases. Combining dissimilar models (even when using the same kernel) can lead to gains in out-of-sample prediction despite a very high kernel data alignment.
4 Empirical Approach
The data set used is a biased sample of 658 clients for a building and loan credit from an application period between January 1997 and April 1998. After a time span of three years, in April 2001, the state of credit was evaluated: 49.1% of the credit clients were defaulting, 50.9% were non-defaulting. The true defaulting rate of the population, however, was only 6.7%. For a detailed description of using SVM in non-standard situations of classification see Schebesch and Stecking (2005). The binary variable state of credit serves as the target variable that has to be predicted by the classification function. For each credit client there is a 40-dimensional input pattern, consisting of individual information like age, income, amount of credit etc. SVM with different kernels will be used as binary classification functions. Empirical results of credit scoring and credit rating approaches with SVM can be found in Baesens et al. (2003), Huang et al. (2003) and Stecking and Schebesch (2003), where at least a slightly superior performance of SVM over more traditional methods was observed. A detailed comparison of SVM kernel performance for credit scoring has not been reported yet. Five different SVM-kernels will be evaluated in the sequel: (1) the linear kernel $K(u, v) = \langle u, v\rangle$, (2) the polynomial kernel $K(u, v) = (\langle u, v\rangle + c)^d$ with hyperparameters $c$ and $d$ to be set, (3) the sigmoid kernel $K(u, v) = \tanh(\kappa\langle u, v\rangle + \vartheta)$ with hyperparameters $\kappa$ and $\vartheta$ to be set, (4) the Radial Basis Function (RBF) kernel $K(u, v) = \exp\!\left(-\frac{\|u - v\|^2}{\sigma^2}\right)$ with hyperparameter $s = 1/\sigma^2$ to be set, and (5) the Coulomb kernel $K(u, v) = \left(1 + \frac{\|u - v\|^2}{E}\right)^{-\delta}$ with hyperparameters $E$ and $\delta$ to be set. The SVM with the linear kernel will serve as a benchmark, which has already been proved to be superior to the more traditional methods like Linear Discriminant Analysis and Logistic Regression. A crucial point for setting up SVM is the choice of appropriate hyperparameters. Two different approaches to hyperparameter selection will be used: (1) Kernel Target Alignment Tuning, using
$$A(K, yy^T) = \frac{\langle K, yy^T\rangle}{\sqrt{\langle K, K\rangle\,\langle yy^T, yy^T\rangle}} = \frac{y^T K y}{N\,\|K\|}$$
for $y \in \{-1, +1\}^N$, with $\langle K_1, K_2\rangle = \mathrm{tr}(K_1^T K_2)$ as the Frobenius inner product. $A(K, yy^T)$ is the alignment between $K(u, v)$ and the ideal kernel $yy^T$, where $y$ is the vector of the target variable. Kernel alignment $A(K_1, K_2)$ in general can be interpreted as a Pearson correlation coefficient between two random variables $K_1(u, v)$ and $K_2(u, v)$ (Shawe-Taylor and Cristianini (2004)). By replacing $K_2$ with $yy^T$ one gets a measure for the similarity of the input variables (already mapped into the feature space induced by kernel $K$) and the ideal kernel (the "target space" $yy^T$), without solving an associated SVM! A close linear relationship between kernel data and target space would imply that the optimization algorithm of the SVM can exploit this easily, leading to very good classification results. The selection rule for kernel target alignment tuning is then to choose the hyperparameter set that maximizes the alignment. (2) Cross Validation Tuning is done by directly computing the cross validation error $e$ for the SVM with kernel $K(u, v)$ with varying hyperparameters. The selection rule is to choose the hyperparameter set that minimizes $e$. Compared to kernel target alignment, the computational costs for this method are high: the full SVM must be computed, a validation set is needed and an upper bound $C$ for $\alpha$ must be found.
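The kernel-target alignment itself is cheap to compute from the kernel matrix alone, as the following sketch shows; the Gram matrix K and the label vector y are random placeholders here, not the credit data.

```python
import numpy as np

def kernel_target_alignment(K: np.ndarray, y: np.ndarray) -> float:
    """A(K, yy^T) = <K, yy^T>_F / sqrt(<K, K>_F <yy^T, yy^T>_F) = y^T K y / (N ||K||_F)."""
    n = len(y)
    return float(y @ K @ y) / (n * np.linalg.norm(K, "fro"))

# Placeholder data: random inputs, linear kernel matrix and labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
K = X @ X.T
y = np.where(rng.normal(size=100) > 0, 1.0, -1.0)
print(kernel_target_alignment(K, y))
```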
5 Results
Table 1 shows the classification results of the different SVM-kernels after kernel target alignment tuning. The leave-one-out (l-o-o) error is known as an excellent estimator of the true generalization error. It is computed for each SVM-kernel. The kernel ranking is in ascending order of l-o-o error. The smallest error is found for the SVM with the Coulomb kernel. Kernel target alignment tuning in this case leads to parameters of $E = 1.5$ and $\delta = 0.9$. For these parameters the alignment takes its maximum of 0.0456. The upper bound $C$ for $\alpha$ has to be tuned via cross validation. For $C = 1$ an l-o-o error of
Rank   SVM-Kernel           Parameter (E, δ, κ, ϑ, c, s)   C       Alignment   Leave one out Error
1      Coulomb              1.5 / 0.9                      1       0.0456      25.53 %
2      Sigmoid              0.1 / -0.4                     0.9     0.0654      27.05 %
(3)    Linear               -                              100     0.0415      27.20 %
4      Polynomial (d = 3)   3                              0.002   0.0458      27.51 %
5      RBF                  0.4                            1       0.0487      31.31 %
6      Polynomial (d = 2)   3                              0.004   0.0500      33.13 %

Table 1. First ranking of SVM classification performance. Kernel target alignment tuning was used for kernel hyperparameter selection.

Rank   SVM-Kernel           Parameter (E, δ, s, c, κ, ϑ)   C       Alignment   Leave one out Error
1      Coulomb              1.9 / 0.5                      1       0.0102      24.92 %
2      RBF                  0.05                           5       0.0118      25.08 %
3      Polynomial (d = 2)   5                              0.003   0.0039      26.29 %
4      Polynomial (d = 3)   4                              0.001   0.0020      26.44 %
5      Sigmoid              0.1 / -0.4                     0.9     0.0654      27.05 %
(6)    Linear               -                              100     0.0415      27.20 %

Table 2. Second ranking of SVM classification performance. Cross validation tuning was used for kernel hyperparameter selection.
25.53% was reached. Both the Coulomb and the Sigmoid kernel are superior to the linear kernel, which has an l-o-o error of 27.20% and no hyperparameters to be tuned. The polynomial kernels as well as the RBF kernel do not outperform the linear benchmark. Table 2 shows the classification results after cross validation tuning. The Coulomb kernel again has the lowest l-o-o error, followed by the RBF kernel. With cross validation tuning every non-linear kernel outperforms the linear benchmark. The low kernel target alignment (except for the sigmoid kernel) indicates only a small dependency between alignment and generalization error. Table 3 gives an overview of the performance and the structure of the SVM with different kernels. In terms of l-o-o error the Coulomb kernel leads to the best results for both hyperparameter selection methods. The RBF kernel has a very low classification error after cross validation tuning and an unacceptably high error after kernel target alignment tuning. The polynomial kernel is slightly better than the linear benchmark with cross validation tuning but worse with kernel target alignment tuning. The sigmoid kernel is the only one where both tuning methods lead to the same parameters. The kernel itself slightly outperforms the linear one. Overall, cross valida-
SVM-Kernel           Hyperp.-Selection   No. of SVs   No. of UBSVs   In Sample Error     Leave one out Error
Linear               -                   357          41             22.64 %             27.20 %
Polynomial (d = 2)   K¹ / C²             561 / 455    62 / 63        24.62 % / 19.60 %   33.13 % / 26.29 %
Polynomial (d = 3)   K / C               519 / 427    216 / 216      10.03 % / 8.81 %    27.51 % / 26.44 %
Sigmoid              K/C                 561          17             25.84 %             27.05 %
RBF                  K / C               625 / 431    371 / 179      1.82 % / 10.94 %    31.31 % / 25.08 %
Coulomb              K / C               608 / 554    327 / 186      2.28 % / 7.90 %     25.53 % / 24.92 %

¹ K: Kernel target alignment tuning, ² C: Cross validation tuning

Table 3. SVM classification performance using different kernels and two different hyperparameter selection methods. No. of SVs is the total number of support vectors (bounded and unbounded), No. of UBSVs is the number of unbounded support vectors.
tion tuning is more successful than kernel target alignment tuning. But does the higher classification performance pay for the higher effort? For the polynomial and RBF kernels the answer has to be yes. The Coulomb and sigmoid kernels, on the other side, show acceptable classification results also after kernel target alignment tuning. A closer look at the structure of the different models is also worthwhile. The number of all support vectors (SVs) and the number of all unbounded support vectors (UBSVs) is given in Table 3. It can be seen that a high number of UBSVs leads to a low in-sample error. But the training data can always be fitted perfectly by the SVM. So what about the generalization error? Table 3 shows that a smaller total number of SVs leads to a better generalization performance for all kernels. This is especially true for the RBF kernel, where 625 SVs lead to an l-o-o error of 31.31% and 431 SVs to 25.08%. Similar results can be found for the polynomial kernel of the second degree. A comparatively high number of support vectors therefore seems to indicate data overfitting by the model.
6 Conclusion and Outlook
In this paper we present a comparison of the credit scoring performance of different SVM-kernels. The less well known Coulomb kernel was introduced as a localized kernel function similar to RBF kernels but with more robust classification performance, which is also equal or slightly superior in terms of expected out-of-sample error. Two hyperparameter selection methods are used
to find a set of competitive credit classification kernel models. The models are compared via their kernel data matrices by computing kernel data and kernel target alignment. It turns out that in general the more expensive cross validation tuning outperforms kernel target alignment tuning. However, future work will consider combining differently parameterized or functionally different kernels with dissimilar kernel data matrices, which may lead to a still better classification performance in credit scoring.
References
BAESENS, B., VAN GESTEL, T., VIAENE, S., STEPANOVA, M., SUYKENS, J. and VANTHIENEN, J. (2003): Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society, 54, 627–635.
BERGER, A.N., FRAME, W.S. and MILLER, N.H. (2002): Credit Scoring and the Availability, Price and Risk of Small Business Credits. Federal Reserve Working Paper 2002-26, www.federalreserve.gov/pubs/feds/2002/200226/200226pap.pdf
FRIEDMAN, C. (2002): CreditModel Technical White Paper. Standard & Poor's Risk Solutions, New York.
HOCHREITER, S., MOZER, M.C. and OBERMAYER, K. (2003): Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems. Advances in Neural Information Processing Systems 15, MIT Press, 561–568.
HUANG, Z., CHEN, H., HSU, C.-J., CHEN, W.-H. and WU, S. (2004): Credit Rating Analysis with Support Vector Machines and Neural Networks: A Market Comparative Study. Decision Support Systems (DSS), 37(4), 543–558.
SCHEBESCH, K.B. and STECKING, R. (2005): Support Vector Machines for Credit Scoring: Extension to Non Standard Cases. In: D. Baier and K.-D. Wernecke (Eds.): Innovations in Classification, Data Science and Information Systems. Springer, Berlin, 498–505.
SCHÖLKOPF, B. and SMOLA, A. (2002): Learning with Kernels. The MIT Press, Cambridge.
SHAWE-TAYLOR, J. and CRISTIANINI, N. (2004): Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge.
STECKING, R. and SCHEBESCH, K.B. (2003): Support Vector Machines for Credit Scoring: Comparing to and Combining with some Traditional Classification Methods. In: M. Schader, W. Gaul and M. Vichi (Eds.): Between Data Science and Applied Data Analysis. Springer, Berlin, 604–612.
Value at Risk Using the Principal Components Analysis on the Polish Power Exchange

Grażyna Trzpiot and Alicja Ganczarek

Department of Statistics, Karol Adamiecki University of Economics, Katowice, Poland
Abstract. In this article we present downside risk measures such as: Value-at-Risk - VaR and Conditional Value-at-Risk - CVaR. We established these measures based on the principal components analysis. The principal components analysis is usually applied to complex systems that depend on a large number of factors where one wishes to identify the smallest number of new variables that explain as much of the variability in the system as possible. The first few principal components usually explain the most of historical variability. In our research we used the prices of electric energy from the Day Ahead Market (DAM) of the Polish Power Exchange from 30.03.03 to 27.03.04. We conclude by discussing practical applications of the results of our research in risk management on the Polish Power Exchange.
1 Introduction
The Polish energy market has been developing for the last few years. The Polish Power Exchange was established in July 2000, and the Day Ahead Market (DAM) was the first market established on it. This whole-day market consists of twenty-four separate, independent markets where participants may freely buy and sell electricity. The advantage of the Exchange is that all participants of the market can buy and sell electric energy, irrespective of whether they are producers or receivers of electric energy. In this article we use downside risk measures such as Value-at-Risk (VaR) and Conditional Value-at-Risk (CVaR) to describe the risk of price changes on the DAM. Downside risk measures are more effective than measures of volatility for estimating risk on the electric energy market, because the biggest forward losses are more important than average forward losses for all the participants of the market. Electric energy volumes and prices are characterized by daily, weekly and yearly seasonal peaks and lows. In this article we describe a seasonal multi-factor model for the forward price curve on the DAM by means of Principal Components Analysis (PCA). The aim of this article is the practical application of the results of our research in risk management on the Polish Power Exchange.
2 Measures of Risk
We face many sources of risk: changes in prices, uncertainty about whether the conditions of a contract will be kept, difficulties in closing a position on the financial market, changes in law, and the risk of strategy. If we want to estimate future risk, we must measure it. There are many different measures of risk. We can divide them into three groups: measures of volatility, measures of sensitivity and measures of downside risk. In this paper we present quantile downside risk measures, namely Value-at-Risk (VaR) and Conditional Value-at-Risk (CVaR).
Value-at-Risk. Downside risk measures capture unwanted deviations from the expected rate of return. VaR is the loss in value that is reached or exceeded only with a given probability over a given time period:

$$ P\bigl(F(t+\Delta t) \le F(t) - VaR\bigr) = \alpha \qquad (1) $$
where α ∈ (0, 1) is a given probability, ∆t is a given time period, F(t) is the present value, and F(t + ∆t) is a random variable, the value at the end of the investment horizon.
Conditional Value-at-Risk. The next downside measure is CVaR, also called Expected Shortfall (ES):

$$ ES_\alpha(F(t+\Delta t)) = E\{F(t+\Delta t) \mid F(t+\Delta t) \le VaR_\alpha(F(t+\Delta t))\}. \qquad (2) $$
The VaR quantity represents the loss level that is exceeded only with probability α. The CVaR quantity is the conditional expected loss given that the loss exceeds the VaR; it is the mean of the random variable over the worst realizations beyond this quantile. These definitions ensure that VaR is never higher than CVaR, so portfolios with low CVaR must have low VaR as well. CVaR is an alternative measure of risk with better properties than VaR: Pflug (2000) proved that CVaR is a coherent risk measure with the following properties: translation equivariance, positive homogeneity, convexity, monotonicity, and consistency with stochastic dominance of first and second order. The most commonly used methodologies to calculate VaR are variance-covariance, historical simulation and Monte Carlo simulation. Looking at the hypothetical profits and losses under each scenario, it is possible to construct a histogram of expected profits and losses from which VaR can be calculated (Trzpiot, Ganczarek (2003)). In this paper we calculate VaR using seasonal principal component analysis in combination with Monte Carlo simulation.
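As a concrete illustration of these two quantile measures, the following minimal Python sketch (our own illustration, not part of the original study; the lognormal scenario generator and all variable names are assumptions) estimates VaR and CVaR from any set of simulated end-of-period values:

```python
import numpy as np

def var_cvar(f_now, f_future, alpha=0.05):
    """VaR and CVaR of the loss F(t) - F(t + dt) at level alpha,
    estimated from simulated end-of-period values f_future."""
    losses = f_now - np.asarray(f_future)     # positive value = loss in value
    var = np.quantile(losses, 1.0 - alpha)    # loss reached or exceeded with probability alpha
    cvar = losses[losses >= var].mean()       # mean loss beyond the VaR threshold
    return var, cvar

# toy example: 10,000 hypothetical scenarios around a present value of 100
rng = np.random.default_rng(0)
scenarios = 100.0 * np.exp(rng.normal(0.0, 0.05, size=10_000))
print(var_cvar(100.0, scenarios, alpha=0.05))
```

Expressed as price thresholds, as in Tables 2 and 3 below, the same quantities are simply F(t) − VaR and F(t) − CVaR.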
3 Principal Component Analysis
Principal Component Analysis (PCA) is one of the multivariate statistical methods¹. We use it to study the dependence between the variables that describe multivariate objects. It consists of an orthogonal transformation of the k-dimensional space into a new space of uncorrelated variables. Let F = [F_1, F_2, ..., F_k]^T denote the vector of observational variables, X = [X_1, X_2, ..., X_k]^T the vector of principal components, and A = [a_1, a_2, ..., a_k] an orthonormal matrix. Then we can express X as the transformation

$$ X = A^T F. \qquad (3) $$
The PCA consists of calculating the orthonormal matrix A. In the first step we calculate the vector a_1 of the matrix A such that X_1 has the largest variance of all principal components. Next we calculate the vector a_2 such that X_2 has the largest variance of the remaining principal components and X_1, X_2 are uncorrelated, and so on. We can obtain the matrix A from the eigenvalues and eigenvectors of the covariance matrix C of the variables F:

$$ a_j = \frac{1}{\lambda_j} U_j, \qquad j = 1, \ldots, k, \qquad (4) $$
where U_j is an eigenvector of the covariance matrix C and λ_j the eigenvalue corresponding to the eigenvector U_j. The principal components X_j have the following properties:

$$ D^2(X_1) > D^2(X_2) > \ldots > D^2(X_k), \qquad (5) $$

$$ \sum_{j=1}^{k} D^2(F_j) = \sum_{j=1}^{k} D^2(X_j), \qquad (6) $$

$$ D^2(X_j) = \lambda_j, \qquad j = 1, \ldots, k. \qquad (7) $$
Equation (6) means that the total variability of all the observational variables is reproduced by the principal components, whose variances are the eigenvalues (7). So we can use PCA to describe the volatility of electricity forward prices. If w_j denotes the contribution of X_j to the explanation of the observational variables, we can write:

$$ w_j = \frac{\lambda_j}{\sum_{i=1}^{k} \lambda_i}, \qquad j = 1, \ldots, k. \qquad (8) $$

¹ This method was proposed in 1901 by K. Pearson and was used in 1933 by H. Hotelling.
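A minimal numerical sketch of the decomposition in equations (3)–(8) (our own illustration; the price matrix below is random placeholder data, not the DAM series) obtains the eigenvalues λ_j, the loading matrix A and the contributions w_j directly from the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
F = rng.normal(size=(210, 24))            # 210 days x 24 hourly price series (placeholder data)

C = np.cov(F, rowvar=False)               # covariance matrix of the observational variables
eigvals, eigvecs = np.linalg.eigh(C)      # eigh: symmetric matrix, eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]         # sort components by decreasing variance, Eq. (5)
eigvals, A = eigvals[order], eigvecs[:, order]

X = (F - F.mean(axis=0)) @ A              # principal component scores, X = A^T F in matrix form
w = eigvals / eigvals.sum()               # contribution w_j of each component, Eq. (8)
print("first two components explain {:.1%} of the variance".format(w[:2].sum()))
```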
In the model we use only those principal components which have the biggest part in explaining the variance of the observational variables. Each principal component can be interpreted as a source of risk, and the importance of a component expresses the volatility of that risk source. The factor loadings, i.e. the elements of the matrix A, can be interpreted as the sensitivity of the original data to each source of risk. For energy forward price curves, as in financial markets, these uncorrelated sources of risk are highly abstract and usually take the following form:
– the first factor, called parallel shift, governs changes in the overall level of prices;
– the second factor, called slope, governs the steepness of the curve and can be interpreted as a change in the overall level of the term structure of convenience yields;
– the third factor, called curvature, relates to the possibility of introducing a bend in the curve, that is, the front and back go up and the middle goes down, or vice versa (Blanco, Soronow and Stefiszyn (2002)).
4 Value at Risk Using the Principal Components Analysis

In this paper we used the seasonal principal component analysis to calculate VaR. We calculated the factor scores and factor loadings, and we used the results to simulate new hypothetical evolutions of the forward curve by (Blanco, Soronow and Stefiszyn (2002)):

$$ F_i(t+\Delta t) = F_i(t)\,\exp\Bigl\{ -\frac{1}{2} \sum_{j=1}^{m} \bigl(a_{ij}\sqrt{\lambda_j}\bigr)^2 \Delta t + \sum_{j=1}^{m} a_{ij}\sqrt{\lambda_j}\sqrt{\Delta t}\,\varepsilon_j \Bigr\}, \qquad (9) $$

where F_i(t) is the forward price at time t, ε_j is a drawing from a standard normal distribution N(0, 1), λ_j is the eigenvalue of the eigenvector U_j of the covariance matrix C, and a_ij is a factor loading, which defines how the price will change in response to a shock to the component.
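The simulation step of equation (9) can be sketched as follows (an illustration only; in practice the loadings a_ij and eigenvalues λ_j come from the PCA above, and the numbers in the example call are the hour-22 summer loadings and eigenvalues from Table 1 and the 25.10.03 price from Table 2):

```python
import numpy as np

def simulate_forward(f_t, loadings, eigvals, dt=1.0, n_scen=10_000, rng=None):
    """Simulate F_i(t + dt) for one hour i according to Eq. (9)."""
    rng = rng or np.random.default_rng()
    a = np.asarray(loadings, dtype=float)          # a_ij for the m retained components
    lam = np.asarray(eigvals, dtype=float)         # lambda_j
    drift = -0.5 * np.sum((a * np.sqrt(lam)) ** 2) * dt
    eps = rng.standard_normal((n_scen, a.size))    # one N(0,1) draw per component and scenario
    shocks = eps @ (a * np.sqrt(lam) * np.sqrt(dt))
    return f_t * np.exp(drift + shocks)

# hour 22, summer data set: loadings (-0.07, -0.02), eigenvalues (11.49, 4.14)
scenarios = simulate_forward(f_t=114.86, loadings=[-0.07, -0.02], eigvals=[11.49, 4.14])
```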
5 Failure Test
We used a failure test to assess the effectiveness of VaR. It was proposed by Kupiec (1995). We test the hypothesis H0: ω = α against H1: ω ≠ α,
where ω is the proportion of the number of observations exceeding VaR_α to the number of all observations. The number of exceedances of VaR_α has a binomial distribution for a given sample size. The test statistic is

$$ LR_{uc} = -2\ln\bigl[(1-\alpha)^{T-N}\alpha^{N}\bigr] + 2\ln\Bigl[\Bigl(1-\frac{N}{T}\Bigr)^{T-N}\Bigl(\frac{N}{T}\Bigr)^{N}\Bigr], \qquad (10) $$

where N is the number of exceedances of VaR, T is the length of the time series, and α is the given probability level at which VaR should not be exceeded. The statistic LR_uc has an asymptotic χ² distribution with 1 degree of freedom.
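A small sketch of this test (our own implementation of equation (10); scipy is used only for the χ² critical value) reproduces the value LR_uc = 0.865 reported in Section 6 when one exceedance is observed in a seven-day forecast week:

```python
import numpy as np
from scipy.stats import chi2

def kupiec_lr(n_exceed, t_length, alpha):
    """Kupiec unconditional coverage statistic LR_uc, Eq. (10); needs 0 < n_exceed < t_length."""
    n, t = n_exceed, t_length
    p_hat = n / t
    loglik_null = (t - n) * np.log(1.0 - alpha) + n * np.log(alpha)
    loglik_alt = (t - n) * np.log(1.0 - p_hat) + n * np.log(p_hat)
    return -2.0 * loglik_null + 2.0 * loglik_alt

lr = kupiec_lr(n_exceed=1, t_length=7, alpha=0.05)
print(round(lr, 3), lr > chi2.ppf(0.95, df=1))   # reject H0 if LR_uc exceeds chi^2_1(0.95) = 3.84
```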
6 Empirical Analysis
For the estimation of risk on the Day Ahead Market (DAM) of the Polish Power Exchange we took into consideration the electric energy prices from 30.03.03 to 25.10.03 and from 26.10.03 to 27.03.04. In this part of the paper we present the results of the evaluation of VaR and CVaR using PCA. We described each period independently by 24 time series of electricity prices. Next we used PCA to reduce the number of variables. In Table 1 we present the PCA results for the two periods studied. To describe risk on the DAM we used two principal components. Two factors explain 65,10% of the variance of the first data set (summer). The factor loadings, the elements of the matrix A, can be interpreted as the sensitivity of the original data to each source of risk. We can therefore say that the first factor is correlated with hours 7 to 19 more strongly than with the other hours, while the second one is correlated with hours 1 to 5 more strongly than with the other hours. Two factors explain 72,04% of the variance of the second data set (winter). Here the first factor is correlated with hours 7 to 16 and 22 to 24 more strongly than with the other hours, and the second one with hours 1 to 6 and 17 to 21 more strongly than with the other hours. As noted above, for energy forward price curves, as in financial markets, these uncorrelated sources of risk are highly abstract. Interpreting the first factor as a parallel shift, we conclude that in this period the prices of electric energy went down. Interpreting the second factor as a slope, we conclude that the prices of electric energy have two peaks during a day (Figure 1). Based on the PCA results we simulated new hypothetical evolutions of the forward curve one week ahead by formula (9), using ten thousand simulated scenarios and building a hypothetical distribution of the future prices.
          Factor loadings for time series       Factor loadings for time series
          in summer, 30.03.03 to 25.10.03       in winter, 26.10.03 to 27.03.04
Hour i         α1         α2                         α1         α2
 1           -0,08      -0,40                      -0,14       0,34
 2           -0,12      -0,43                      -0,16       0,37
 3           -0,13      -0,42                      -0,19       0,33
 4           -0,14      -0,40                      -0,20       0,31
 5           -0,15      -0,30                      -0,21       0,29
 6           -0,19      -0,16                      -0,22       0,24
 7           -0,23      -0,03                      -0,23       0,10
 8           -0,25       0,11                      -0,23       0,03
 9           -0,25       0,14                      -0,23      -0,04
10           -0,25       0,14                      -0,24      -0,07
11           -0,25       0,14                      -0,23      -0,09
12           -0,26       0,14                      -0,22      -0,10
13           -0,26       0,14                      -0,23      -0,11
14           -0,26       0,06                      -0,24      -0,03
15           -0,27       0,04                      -0,23      -0,03
16           -0,26       0,05                      -0,21      -0,12
17           -0,26       0,07                      -0,14      -0,29
18           -0,25       0,06                      -0,18      -0,26
19           -0,23       0,05                      -0,20      -0,25
20           -0,12       0,12                      -0,19      -0,25
21           -0,06       0,16                      -0,19      -0,21
22           -0,07      -0,02                      -0,18      -0,05
23           -0,12      -0,12                      -0,15      -0,08
24           -0,13      -0,17                      -0,20      -0,01
λj           11,49       4,14                      13,44       3,84
wj %         47,87      17,23                      56,02      16,02
Σ wj %       47,87      65,10                      56,02      72,04

Table 1. Loadings, eigenvalues and variances of factors
We obtained 24 one-week forward curves calculated from the data set from 30.03.03 to 25.10.03 and 24 one-week forward curves calculated from the data set from 26.10.03 to 27.03.04. Next we calculated VaR and CVaR for these 48 hypothetical distributions. We took into consideration short and long positions on the energy market. Based on the first factor loadings we stated that in this period the prices of electric energy went down. We did not note any exceedances for the long position and only a few for the short position. In Table 2 we present the downside risk measures for the hypothetical evolutions of the forward curve one week ahead for hour 22 (one exceedance of VaR 0,05, on 28.10.2003).
Fig. 1. Two factor loadings for time series noted in summer from 30.03.03 to 25.10.03

Date                     VaR 0,01   VaR 0,05   CVaR 0,01   CVaR 0,05   F22(t+∆t)   real price
2003-10-25  Saturday         –          –          –           –           –         114,86
2003-10-26  Sunday        104,44     107,40     103,10      105,65      114,87       115,03
2003-10-27  Monday        100,81     104,49      99,05      102,26      114,87       113,50
2003-10-28  Tuesday        97,65     102,32      95,25       99,43      114,80        99,50
2003-10-29  Wednesday      95,10     100,39      92,59       97,16      114,89       120,00
2003-10-30  Thursday       92,80      98,79      90,04       95,06      114,98       127,00
2003-10-31  Friday         91,29      97,38      88,18       93,72      114,97       100,00
2003-11-01  Saturday       88,99      95,91      85,84       91,80      114,92        99,50

Table 2. The downside risk measures for hypothetical evolutions for short position and real price
The question "Are the downside risk measures effective?" may be answered by testing the hypothesis H0: ω = 0,05 against the alternative H1: ω ≠ 0,05. The test statistic takes the value LR_uc = 0,865. At the significance level of 0,05 the critical value equals χ²(1) = 3,84, so we do not reject the null hypothesis. In Table 3 we present the downside risk measures for the hypothetical evolutions of the forward curve one week ahead for hour 19. We noted two exceedances of VaR 0,05 and one of CVaR 0,05. We again used the Kupiec test, testing the hypothesis H0: ω = 0,05 against the alternative H1: ω ≠ 0,05. The test statistic takes the value LR_uc = 4,12, so we should reject the null hypothesis. In this case we may say that VaR 0,05 does not effectively measure risk on the energy market. We do not reject the null hypothesis in the case of CVaR 0,05. Taking into consideration quantile downside risk measures for market participants, we can say that VaR is a safer measure than the expected value of the random variable F_i(t+∆t), and CVaR is in turn a safer measure than VaR. Based on the PCA methodology we can describe daily seasonal prices of
Date                     VaR 0,01   VaR 0,05   CVaR 0,01   CVaR 0,05   F19(t+∆t)   real price
2004-03-27  Saturday         –          –          –           –           –         117,99
2004-03-28  Sunday         82,87      91,84      78,96       86,42      117,97        85,00
2004-03-29  Monday         71,17      81,90      65,52       75,02      117,78        98,52
2004-03-30  Tuesday        63,23      75,13      57,97       67,91      118,12        74,00
2004-03-31  Wednesday      57,22      69,94      52,22       62,04      117,56        74,00
2004-04-01  Thursday       52,15      64,65      46,90       56,90      117,76       100,75
2004-04-02  Friday         48,39      61,60      43,12       53,45      117,57        95,00
2004-04-03  Saturday       44,20      58,00      39,11       49,82      118,30       112,25

Table 3. The downside risk measures for hypothetical evolutions for short position and real price
electric energy on DAM. Using statistical tests we can verify the effectiveness of the downside risk measures on the Polish Power Exchange.
References BLANCO, C., SORONOW, D. and STEFISZYN, P. (2002): Multi-factor models of the forward price curve. Commodities-Now, September, 80-83. HOTELLING, H. (1933): Analysis of a Complex of Statistical Variables into Principal Components. Journal of Educational Psychology, 24. KUPIEC, P. (1995): Techniques for verifying the accuracy of risk management models. Journal of Derivatives, 2, 173-184. PFLUG, G.Ch. (2000): Some remarks on the value-at-risk and the conditional value-at-risk. In: S. Uryasev (Ed.): Probabilistic Constrained Optimization: Methodology and Applications. Kluwer Academic Publishers. ROCKAFELLAR, R.T. and URYASEV, S. (2000): Optimization of Conditional Value-at-Risk. Journal of Risk, 2, 21-41. SINGH, M.K. (1997): Value at risk using principal components analysis. Journal of Portfolio Management, 24, 1, 101-112. TRZPIOT, G. and GANCZAREK, A. (2003): Risk on Polish Energy Market. In: Dynamic Econometric Models, Nicolaus Copernicus University, Toruń, 175-182.
A Market Basket Analysis Conducted with a Multivariate Logit Model Yasemin Boztuğ and Lutz Hildebrandt Institute of Marketing, Humboldt-University Berlin, Spandauer Str. 1, D-10178 Berlin, Germany
Abstract. The following research is guided by the hypothesis that products chosen on a shopping trip in a supermarket can indicate the preference interdependencies between different products or brands. The bundle chosen on the trip can be regarded as the result of a global utility function. More specifically: the existence of such a function implies a cross-category dependence of brand choice behavior. It is hypothesized that the global utility function related to a product bundle results from the marketing-mix of the underlying brands. Several approaches exist to describe the choice of specific categories from a set of many alternatives. The models are discussed in brief; the multivariate logit approach is used to estimate a model with a German data set.
1 Introduction
One of the major tasks of retailers is managing their product categories to maximize the overall profit of the store or chain. Using marketing-mix strategies to stimulate purchases of a specific product usually has an effect both on the advertised category and on related categories. Additionally, a retailer decides not only to advertise one category, but many simultaneously. Thus the retailer must consider cross effects between linked or related categories in their marketing measures. Ignoring dependency structures could lead to wrong decisions or at least to suboptimal marketing-mix activities. Analyzing multi-item purchases is therefore of interest not only for the researcher, but also from a managerial point of view. In the following, we will focus on the analysis of bundle purchases. It is a "pick-any" choice problem (Levine (1979)), because the consumer can choose no item, one item or any possible number of items for his shopping bag. Common brand choice models, like the well known multinomial logit approach (MNL) (Guadagni and Little (1983)), consider only single category purchases and ignore cross-category relationships and influences. This can lead to wrong parameter estimates and therefore to wrong decisions about marketing-mix activities.
Financial support by the German Research Foundation (DFG) through the research project #BO 1952/1, and through the Sonderforschungsbereich 649 is gratefully acknowledged.
The model used in this article is based on an approach by Russell and Petersen (2000). It predicts category incidence and examines how a purchase in one category is affected by purchases in other categories. We assume a global utility function, which implies that cross-category choice dependence is present within each choice process of each consumer. The modeling should therefore include purchases conditional on purchases in other categories during the same shopping trip. Assuming such a dependence structure means that common estimation techniques cannot be used anymore, because they are not able to cope with dependent observations. Instead, techniques from spatial statistics are needed to estimate the market basket model in a proper way. The article is structured as follows. In the next section, we describe market basket models in general and explain our model in more detail. Afterwards, a data set is presented along with the subsequent estimation results. The article concludes with a summary and an outlook.
2 Market Basket Models
Market baskets arise from the shopping behavior of customers. During a shopping trip, the customers are in a "pick-any" situation because they have the possibility to choose no item, one or any other number of items in each category. Standard brand choice models, such as the MNL, focus on purchases made in one specific category, ignore cross-effects with other categories, and may therefore produce biased parameter estimates. A number of research articles have started to incorporate cross-category relationships in their purchase models (see, e.g., Russell et al. (1997, 1999), Seetharaman et al. (2004)). Two main research approaches can be distinguished. One is more data-driven, using data mining. It is dominated by techniques like pairwise association (e.g., Hruschka (1985)), association rules (e.g., Agrawal and Srikant (1994)), vector quantisation (e.g., Schnedlitz et al. (2001)), neural networks (e.g., Decker and Monien (2003)) and collaborative filtering (e.g., Mild and Reutterer (2001, 2003)). Pairwise associations use simple association measures to indicate the coincidence or affinity of items in market baskets in order to identify product category relationships; a small example is sketched below. For association rules, techniques of multidimensional scaling or cluster analysis are often applied first to reduce the large number of categories; the rules are then used to group subsets of product categories together. Vector quantisation is a more sophisticated method, which enriches the data with an additional basket vector; this vector contains information about the membership of a specific category in a subbasket class. Using neural networks for market basket analysis is related to vector quantisation: first, an affiliation to a subgroup is identified. Collaborative filtering uses databases to identify those customers who behave similarly to the target customer and makes predictions using these similarities.
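As a simple illustration of the data-driven end of this spectrum (a generic sketch of our own, not taken from any of the cited studies), pairwise associations can be read off a binary basket matrix via support and lift:

```python
import numpy as np

# rows = baskets, columns = product categories; 1 = category present in the basket
baskets = np.array([[1, 1, 0, 0],
                    [1, 0, 1, 0],
                    [1, 1, 0, 1],
                    [0, 0, 1, 1],
                    [1, 1, 1, 0]], dtype=float)

support = baskets.mean(axis=0)                  # P(category i in a basket)
joint = (baskets.T @ baskets) / len(baskets)    # P(categories i and j in the same basket)
lift = joint / np.outer(support, support)       # > 1: bought together more often than by chance
np.fill_diagonal(lift, np.nan)
print(np.round(lift, 2))
```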
The second research approach is more explanatory driven. It tries to identify and quantify cross-category choice effects of marketing-mix variables. Here, two general methods can be identified. The multivariate probit approach (e.g., Ainslie and Rossi (1998), Manchanda et al. (1999), Seetharaman et al. (1999), Chib et al. (2002), Deepak et al. (2004)) is an extension of the standard probit approach (e.g., Hausman and Wise (1978), Daganzo (1979), Train (2003)) for one category. It is based on Random Utility Theory and is built on a disaggregate level. The error distribution is assumed to be normal. Alternatively, the multivariate logit approach (e.g., Hruschka et al. (1999), Russell and Petersen (2000), Hansen et al. (2003), Singh et al. (2004)) can be used, which is an extension of the multinomial logit model (e.g., Guadagni and Little (1983)). It is also based on Random Utility Theory. The error term of the multivariate logit approach is assumed to be Gumbel distributed. In our approach, adapted from Russell and Petersen (2000), we use a multivariate logit model to analyse multi-item purchases. The approach models purchase incidence and is related to the well established MNL models. It is much easier to estimate than the multivariate probit approach. The estimation routine can be programmed with standard software modules, and the approach allows the inclusion of several marketing-mix variables. Complementarity, independence and substitution of product categories can be modeled. In our model, we assume that consumers make their category choices in some fixed order, which is not observed by the researcher. Due to this lack of information, the choice in each category is modeled conditional upon the known choices in all other categories. It is assumed that the choices are made in a certain order, but it is not necessary to know this order for model construction. To estimate such a model in an unbiased way, we need to apply techniques from spatial statistics to account for the dependence between the categories. With these methods, we are able to describe the conditioned observations without having any information about the concrete purchase sequence. The complete set of full conditional distributions uniquely determines the joint distribution (Besag (1974), Cressie (1993)). Our market basket model accounts for purchases at the category level. The whole bundle description consists of zeros and ones for the presence or absence of category items in the basket. The joint distribution describing the whole basket is inferred from the full conditional distributions of the single category models, which have the following form:

$$ \Pr\bigl(C(i,k,t) = 1 \mid C(j,k,t) \text{ for } j \ne i\bigr) = \frac{1}{1 + \exp(-V(i,k,t))}. \qquad (1) $$
The utility in Equation (1) is specified as follows:

$$ U(i,k,t) = \beta_i + HH_{ikt} + MIX_{ikt} + \sum_{j \ne i} \theta_{ijk}\, C(j,k,t) + \epsilon_{ikt} = V(i,k,t) + \epsilon_{ikt} \qquad (2) $$
with C(i,k,t) = 1 if consumer k purchases category i at time t. The household-specific variable HH is specified as

$$ HH_{ikt} = \delta_{1i} \ln[TIME_{ikt} + 1] + \delta_{2i}\, LOYAL_{ik}, \qquad (3) $$

where TIME is the time in weeks since the last purchase of consumer k in category i and LOYAL is the consumer's long-run propensity to buy in the category. The marketing-mix variable MIX is defined as

$$ MIX_{ikt} = \gamma_i \ln[PRICE_{ikt}] + \varphi_i\, DISPLAY_{ikt} \qquad (4) $$
with PRICE a weighted price index across all purchased items in category i and DISPLAY a display index across all items in the category. The cross-category parameter θ_ijk implies a positive association between the product categories i and j for values greater than zero, and a negative relationship for values smaller than zero. The cross-category parameter consists of two parts,

$$ \theta_{ijk} = \kappa_{ij} + \phi\, SIZE_k, \qquad (5) $$

with SIZE the mean number of categories chosen by consumer k during the initial period. Based on the full conditional model in Equation (1) with its utility specification in Equation (2), the joint distribution follows, using the theorem of Besag (1974), as the final market basket model (Russell and Petersen (2000)):

$$ \Pr\bigl(B(k,t) = b\bigr) = \frac{\exp(\mu(b,k,t))}{\sum_{b^*} \exp(\mu(b^*,k,t))} \qquad (6) $$
and the utility specification as

$$ \mu(b,k,t) = \sum_i \beta_i X(i,b) + \sum_i HH_{ikt} X(i,b) + \sum_i MIX_{ikt} X(i,b) + \sum_{i<j} \theta_{ijk} X(i,b) X(j,b) \qquad (7) $$

with B(k,t) = {C(1,k,t), ..., C(N,k,t)} a vector of zeros and ones, X(i,b) = 1 if category i is in basket b and zero otherwise, and b* ranging over all possible baskets excluding the null basket. Overall, 2^N − 1 baskets are possible for N categories. The interpretation of the model in Equation (6) is that the approach is a logit choice model defined over a set of alternatives with a particular utility specification μ(b,k,t) as in Equation (7).
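To make the structure of equations (6) and (7) explicit, the following sketch (our own illustration with made-up parameter values, not the estimates reported in Section 3) enumerates all 2^N − 1 non-empty baskets for one consumer and shopping trip and computes their choice probabilities; the household and marketing-mix terms are folded into the direct utilities for brevity:

```python
import itertools
import numpy as np

N = 3                                            # number of categories (toy example)
v_direct = np.array([0.5, -0.2, 0.1])            # beta_i + HH_ikt + MIX_ikt per category
theta = np.array([[ 0.0, -0.8,  0.3],
                  [-0.8,  0.0, -0.4],
                  [ 0.3, -0.4,  0.0]])           # symmetric cross-category parameters theta_ij

baskets = [b for b in itertools.product([0, 1], repeat=N) if any(b)]   # 2^N - 1 baskets

def mu(b):
    x = np.array(b, dtype=float)
    direct = v_direct @ x                        # sum_i (beta_i + HH + MIX) X(i,b)
    cross = 0.5 * x @ theta @ x                  # sum_{i<j} theta_ij X(i,b) X(j,b)
    return direct + cross

weights = np.exp([mu(b) for b in baskets])
probs = weights / weights.sum()                  # Eq. (6)
for b, p in zip(baskets, probs):
    print(b, round(float(p), 3))
```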
3 Data Analysis and Results
The data set used in our analysis covers a one-year period of consumer choices and was made available by the "Zentrum für Umfragen und Methoden, Mannheim
(ZUMA)"¹. It contains data on breakfast beverages (e.g., coffee, instant coffee, tea, canned milk and filter paper) and covers 4177 consumers purchasing 40682 baskets during 26 weeks. As explanatory variables in our model, loyalty, time, price and display are used. Given the five categories, 31 different baskets can be bought by a consumer. The total value of the cross-effect parameter is negative for substitutional and positive for complementary relationships between the categories. Regarding the parameter estimates, we pose the following hypotheses:
• The parameter for loyalty should be positive, as a higher loyalty to a category increases the purchase probability in that category.
• The time parameter is assumed to be positive, because the longer a consumer has not purchased in a category, the higher the probability that he will buy a product of that category.
• The price coefficient should be negative, because higher prices are assumed to lower the probability of purchasing in a category.
• The display parameter should be positive, because the presence of a display should increase the probability of buying in a specific category.
• The size effect should be positive: a larger basket size should lead to a higher purchase incidence probability.
We estimate the fit of several stepwise extended models. The first specified model is the simplest one without any cross-effects, denoted as M1. Second, we include only the "SIZE" effect to capture a simple cross-category relationship; this model is called M2. M3 is the model which contains the full cross-category effects. M4 is the most comprehensive one, where, in addition to the model in Russell and Petersen (2000), "DISPLAY" is included in the model equation. The fit values for all model types are given in Table 1. It is obvious that ignoring cross-category effects, as in model M1, leads to a worse model fit. As more cross-category relationships are included in the model (M2 to M3), the fit improves substantially. Adding an additional explanatory marketing-mix variable (display) results in an even better model fit, not only in the log-likelihood value, but also in the AIC value. The parameter estimates of model M4 (best fit) are given in Table 2. First, we examine the parameter values for the direct effects. The parameter estimates for loyalty and price are as expected. Each loyalty parameter is positive and significant at the 5% level. For price, three out of five parameters
¹ The data used for this analysis are part of a subsample of the 1995 GfK ConsumerScan Household panel data and were made accessible by ZUMA. The ZUMA data set includes all households having continuously reported product purchases during the entire year 1995. For a description of this data set cf. PAPASTEFANOU, G. (2001): The ZUMA data file version of the GfK ConsumerScan Household Panel. In: G. Papastefanou, P. Schmidt, A. Börsch-Supan, H. Lüdkte and U. Oltersdorf (Eds.): Social and Economic Analysis of Consumer Panel Data. Zentrum für Umfragen, Meinungen und Analysen (ZUMA), Mannheim.
Model   Number of parameters        LL          AIC
M1               20             −81087.1     162214.2
M2               30             −70507.2     141074.4
M3               31             −69147.2     138356.4
M4               36             −69138.4     138348.7

Table 1. Model fits for breakfast beverages
Variable           Coffee    Instant coffee    Tea       Canned milk    Filter paper

Direct effects
Intercept           0.93**        2.17**        1.97**        0.55**        −2.81**
Loyalty             0.69**        0.95**        1.08**        0.72**         0.54**
Time               −0.10*        −0.24**       −0.30**       −0.22**         0.66**
Price              −0.95**       −1.35**       −1.02**       −0.07           0.44
Display             3.52**        0.09         −0.54          0.46           0.15

Cross-category effects
Size                1.13**        1.13**        1.13**        1.13**         1.13**
Coffee               –           −2.24**       −2.29**       −2.20**        −1.41**
Instant coffee     −2.24**         –           −2.45**       −1.94**        −1.60
Tea                −2.29**       −2.45**         –           −2.21**        −1.48**
Canned milk        −2.20**       −1.94**       −2.21**         –            −1.28**
Filter paper       −1.41**       −1.60         −1.48**       −1.28**          –

Table 2. Estimation results for breakfast beverages
are negative and significant; the remaining two (for canned milk and filter paper) are not significant. The estimated parameter values for time are all significant and negative, with the exception of filter paper, which is the opposite of what we expected. This finding might indicate an irregular shopping behavior in the inspected categories. The display parameter is positive and significant only for coffee, a highly promoted category; this result seems quite plausible. In all other categories, display is not used very often, so its influence on purchase incidence is not important. Now we inspect the results for the cross-category parameters. The size effect is positive and significant, as we expected. Its magnitude multiplied with the average basket size (1.39) is 1.75, a value which is larger than several cross-effects (e.g., for filter paper and coffee or for filter paper and canned milk). All cross-category effects are negative and significant. If combined with the average size times the size effect, some relationships still remain negative, e.g., the one for coffee and instant coffee or coffee and tea. This negative
value leads to considering these two categories as substitutes, while others (e.g., filter paper and coffee) have a complementary relationship. Regarding the categories, the results seem to be quite reasonable.
4 Summary and Outlook
Managers make decisions for many categories simultaneously. Since ignoring relationships and interdependencies could lead to biased parameter estimates, models should include cross-category effects. We presented a model based on a multivariate logit approach, which is estimated with techniques adopted from spatial statistics. We find significant cross-category parameters, some of which imply a substitutive relationship between the inspected categories of several breakfast beverages, while others imply a complementary relationship. Ignoring these effects results not only in a worse model fit, but also in biased parameters for the direct effects. In an extended analysis, other methods to model and estimate the utility function, e.g. a generalized additive model approach (Hastie and Tibshirani (1990)), might yield more detailed results. Also, considering consumer heterogeneity remains an open issue.
References AGRAWAL, R. and SRIKANT, R. (1994): Fast algorithms for mining association rules. Working Paper. IBM Almaden Research Center. AINSLIE, A. and ROSSI, P.E. (1998): Similarities in Choice Behavior Across Product Categories. Marketing Science, 17 (2), 91–106. BESAG, J. (1974): Spatial Interaction and the Statistical Analysis of Lattice Systems. Journal of the Royal Statistical Society, Series B, 36 (2), 192–236. CHIB, S. and SEETHARAMAN, P.B. (2002): Analysis of multi–category purchase incidence decisions using IRI market basket data. In: P.H. Franses and A.L. Montgomery (Eds.): Econometric Models in Marketing. Elsevier Science, 57– 92. CRESSIE, N.A.C. (1991): Statistics for spatial data. John Wiley & Sons. DAGANZO, C. (1979): Multinomial Probit. Academic Press, New York. DECKER, R. and MONIEN, K. (2003): Market basket analysis with neural gas networks and self–organising maps. Journal of Targeting, Measurement and Analysis for Marketing, 11 (4), 373–386. DEEPAK, S.D., ANSARI, A. and GUPTA, S. (2004): Investigating consumer price sensitivities across categories. Working Paper. University of Iowa. GUADAGNI, P.M. and LITTLE, D.C. (1983): A Logit Model of Brand Choice Calibrated on Scanner Data. Marketing Science, 2 (3), 203–238. HANSEN, K., SINGH, V.P. and CHINTAGUNTA, P. (2003): Understanding store brand purchase behavior across categories. Working Paper. Kellog School of Management, Northwestern University. HASTIE, T.J. and TIBSHIRANI, R.J. (1990): Generalized Additive Models. Chapman & Hall, London.
HAUSMAN, J.A. and WISE, D.A. (1978): A Conditional Probit Model for Qualitative Choice: Discrete Decisions Recognizing Interdependence and Heterogeneous Preferences. Econometrica, 46 (2), 403–426. HRUSCHKA, H. (1985): Der Zusammenhang zwischen Verbundbeziehungen und Kaufakt– bzw. Käuferstrukturmerkmalen. Zeitschrift für betriebswirtschaftliche Forschung, 37, 218–131. HRUSCHKA, H., LUKANOWICZ, M. and BUCHTA, C. (1999): Cross-category sales promotion effects. Journal of Retailing and Consumer Services, 6, 99–105. LEVINE, J.H. (1979): Joint–Space analysis of pick–any data: Analysis of choices from an unconstrained set of alternatives. Psychometrika, 44 (1), 85–92. MANCHANDA, P., ANSARI, A. and GUPTA, S. (1999): The "Shopping Basket": A Model for Multicategory Purchase Incidence Decisions. Marketing Science, 18 (2), 95–114. MILD, A. and REUTTERER, T. (2001): Collaborative Filtering Methods for Binary Market Basket Analysis. In: J. Liu, P.C. Yuen, C.H. Li, J. Ng and T. Ishada (Eds.): Active Media Technology. Springer, Berlin, 302–313. MILD, A. and REUTTERER, T. (2003): An improved collaborative filtering approach for predicting cross–category purchases based on binary market data. Journal of Retailing and Consumer Services, 6 (4), 123–133. PAPASTEFANOU, G. (2001): The ZUMA data file version of the GfK ConsumerScan Household Panel. In: G. Papastefanou, P. Schmidt, A. Börsch-Supan, H. Lüdkte and U. Oltersdorf (Eds.): Social and Economic Analysis of Consumer Panel Data. Zentrum für Umfragen, Meinungen und Analysen (ZUMA), Mannheim. RUSSELL, G.J., BELL, D., BODAPATI, A., BROWN, C.L., CHIANG, J., GAETH, G., GUPTA, S. and MANCHANDA, P. (1997): Perspectives on Multiple Category Choice. Marketing Letters, 8 (3), 297–305. RUSSELL, G.J., RATNESHWAR, S., SHOCKER, A.D., BELL, D., BODAPATI, A., DEGERATU, A., HILDEBRANDT, L., KIM, N., RAMASWAMI, S. and SHANKAR, V.H. (1999): Multiple–Category Decision–Making: Review and Synthesis. Marketing Letters, 10 (3), 319–332. RUSSELL, G.J. and PETERSEN, A. (2000): Analysis of Cross Category Dependence in Market Basket Selection. Journal of Retailing, 76 (3), 367–392. SCHNEDLITZ, P., REUTTERER, T. and JOOS, W. (2001): Data–Mining und Sortimentsverbundanalyse im Einzelhandel. In: H. Hippner, U. Küsters, M. Meyer and K. Wilde (Eds.): Handbuch Data Mining im Marketing. Vieweg, Wiesbaden, 951–970. SEETHARAMAN, P.B., AINSLIE, A. and CHINTAGUNTA, P.K. (1999): Investigating Household State Dependence Effect Across Categories. Journal of Marketing Research, 36, 488–500. SEETHARAMAN, P.B., CHIB, S., AINSLIE, A., BOATWRIGHT, P., CHAN, T., GUPTA, S., MEHTA, N., RAO, V. and STRIJNEV, A. (2004): Models of multi–category choice behavior. Working Paper. Rice University. SINGH, V.P., HANSEN, K. and GUPTA, S. (2004): Modeling preferences for common attributes in multi–category brand choice. Working Paper. Carnegie Mellon University. TRAIN, K.E. (2003): Discrete Choice Methods with Simulation. Cambridge University Press.
Solving and Interpreting Binary Classification Problems in Marketing with SVMs
Georgi Nalbantov¹, Jan C. Bioch², and Patrick J. F. Groenen²
¹ Erasmus Research Institute of Management, Erasmus University Rotterdam, Postbus 1738, 3000 DR Rotterdam, The Netherlands
² Econometric Institute, Faculty of Economics, Erasmus University Rotterdam, Postbus 1738, 3000 DR Rotterdam, The Netherlands
Abstract. Marketing problems often involve binary classification of customers into “buyers” versus “non-buyers” or “prefers brand A” versus “prefers brand B”. These cases require binary classification models such as logistic regression, linear, and quadratic discriminant analysis. A promising recent technique for the binary classification problem is the Support Vector Machine (Vapnik (1995)), which has achieved outstanding results in areas ranging from Bioinformatics to Finance. In this paper, we compare the performance of the Support Vector Machine against standard binary classification techniques on a marketing data set and elaborate on the interpretation of the obtained results.
1 Introduction
In marketing, quite often the variable of interest is dichotomous in nature. For example, a customer either buys or does not buy a product, visits or does not visit a certain shop. Researchers and practitioners often approach such binary classification problems with traditional parametric statistical techniques, such as discriminant analysis and logistic regression (Lattin et al. (2003), Franses and Paap (2001)), while others employ semiparametric and nonparametric statistical tools, like kernel regression (Van Heerde et al. (2001), Abe (1991, 1995)) and neural networks (West (1997)). Nonparametric models differ from parametric ones in that they make no or fewer assumptions about the distribution of the data. A disadvantage of nonparametric tools in general is that they are considered to be "black boxes". In many such cases, the model parameters are hard to interpret and often no direct probability estimates are available for the binary output variable. A discussion of the relative merits of both kinds of techniques can be found, for instance, in Van Heerde et al. (2001) and West (1997). In this paper, we employ the nonparametric technique of the Support Vector Machine (SVM) (Vapnik (1995), Burges (1998), Müller et al. (2001)). Some desirable features of SVMs that are relevant for marketing include good generalization ability, robustness of the results, and avoidance of overfitting.
                         Holiday length in days
Variable                     ≤ 14      > 14
Transport
  Car                        39.8      34.2
  Airplane                   48.0      58.2
  Other                      12.2       7.6
Full board
  Yes                        25.7      18.3
  No                         74.3      81.7
Sunshine
  Important                  83.9      88.5
  Not important              16.1      11.5
Big expenses
  Made                       26.0      26.5
  Not made                   74.0      73.5
Mean no. of children          0.35      0.49
Mean age group                3.95      4.52
Destination
  Inside Europe              87.7      66.7
  Outside Europe             12.3      33.3
Accommodation
  Camping                    17.5      27.9
  Apartment                  29.5      24.0
  Hotel                      33.6      27.6
  Other                      19.4      20.5
Season
  High                       38.6      43.2
  Low                        61.4      56.8
Having children
  Yes                        31.6      40.2
  No                         68.4      59.8
Mean income group             2.23      2.67

Table 1. Descriptive statistics of the predictor variables for the holiday data set split by holiday length. For the categorical variables, the relative frequency is given (in %) and for numerical variables, the mean.
One drawback of SVMs is the inability to interpret the obtained results easily. In marketing, SVMs have been used by, for example, Bennett (1999), Cui (2003), and Evgeniou (2004). Our aim is to assess the applicability of SVMs for solving binary marketing problems and, even more importantly, to provide an interpretation of the results. We compare the SVM with the standard marketing modelling tools of linear and quadratic discriminant analysis and the logit choice model on one empirical data set. In addition, we interpret the results of the SVM models in two ways. First, we report probability estimates for the realizations of the (binary) dependent variable, as proposed by Platt (1999) and implemented by Chang and Lin (2004). Second, we use these estimates to evaluate the (possibly nonlinear) effects of some independent variables on the dependent variable of interest. In this way, we can assess the effect of manipulating some marketing instruments on the probability of a certain choice between two alternatives. The remainder of the paper is organized as follows. First, we describe the data used in this research. Next, we provide a brief overview of the construction of SVMs for classification tasks. Sections 4 and 5 give an account of the obtained results and their interpretation, and Section 6 concludes.
2 Data
We focus on a straightforward marketing problem: how to forecast holiday length on the basis of some general travelling and customer characteristics. These data were collected by Erasmus University Rotterdam in 2003. Table 1 provides descriptive statistics for the data set. The dependent variable, holiday length, has been dichotomized into "not more than 14 days" and "more than 14 days". In total, there are 708 respondents. The outcome alternatives are quite balanced: 51.7% of the respondents have spent more than two weeks and 48.3% not more than two weeks on holidays. Eleven explanatory variables were available, some of which are categorical: destination, mode of transport, accommodation, full/nonfull board and lodging, sunshine availability, (other) big expenses, in/out of season, having/not having children, number of children, income group and age group.
3 Support Vector Machines for Classification
Support Vector Machines (SVMs) are rooted in statistical learning theory (Vapnik (1995)) and can be applied to both classification and regression problems. We consider here the supervised learning task of separating examples that belong to two classes. Consider a data set of n explanatory vectors {x_i}_{i=1}^n from R^m and corresponding classification labels {y_i}_{i=1}^n, where y_i ∈ {−1, 1}. Thus, in the marketing data set, −1 identifies short holiday length (≤ 14 days) and 1 identifies long holiday length (> 14 days). The SVM method finds the oriented hyperplane that maximizes the distance to the closest observations from the two classes (the so-called "margin"), while at the same time minimizing the amount of training errors (Vapnik (1995), Cristianini and Shawe-Taylor (2000), Burges (1998)). In this way, a good generalization ability of the resulting function is achieved, and the problem of overfitting is mitigated. The explanatory vectors x from the original space R^m are usually mapped into a higher-dimensional space, where their coordinates are given by Φ(x). In this case, the optimal SVM hyperplane is found as the solution of the following optimization problem:

$$ \max_\alpha \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \qquad (1) $$

subject to 0 ≤ α_i ≤ C, i = 1, 2, ..., n, and $\sum_{i=1}^{n} y_i \alpha_i = 0$,
where k(x_i, x_j) = Φ(x_i)'Φ(x_j) is a kernel function that calculates dot products of the explanatory vectors x_i and x_j in feature space. Intuitively, the kernel determines the level of proximity between any two points in the feature space. Common kernels in SVMs are the linear kernel k(x_i, x_j) = x_i'x_j, the polynomial kernel k(x_i, x_j) = (x_i'x_j + 1)^d and the Radial Basis Function (RBF) kernel k(x_i, x_j) = exp(−γ||x_i − x_j||²), where d and γ are manually adjustable parameters. The feature space implied by the RBF kernel is infinite-dimensional,
while the linear kernel preserves the data in the original space. Maximizing the term $-\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$ corresponds to maximizing the margin between the two classes, which is equal to the distance between the hyperplanes $\sum_{i=1}^{n} y_i \alpha_i k(x_i, x) + b = -1$ and $\sum_{i=1}^{n} y_i \alpha_i k(x_i, x) + b = 1$. The manually adjustable constant C determines the trade-off between the margin and the amount of training errors. The α's are the weights associated with the observations. All observations with nonzero weights are called "support vectors", as they are the only ones that determine the position of the optimal SVM hyperplane. This hyperplane consists of all points x which satisfy $\sum_{i=1}^{n} y_i \alpha_i k(x_i, x) + b = 0$. The parameter b is found from the so-called Kuhn-Tucker conditions associated with (1). The importance of binary classification methods lies in how well they are able to predict the class of a new observation x. To do so with an SVM, the optimal separating hyperplane $\sum_{i=1}^{n} y_i \alpha_i k(x_i, x) + b = 0$ that is derived from the solution ({α_i}_{i=1}^n, b) of (1) is used:

$$ f(x) = \mathrm{sign}(g(x)) = \mathrm{sign}\Bigl( \sum_{i=1}^{n} y_i \alpha_i k(x_i, x) + b \Bigr), $$
where sign(a) = −1 if a < 0 and sign(a) = 1 if a ≥ 0. For interpretation, it is often important to know not only the predicted binary outcome, but also its probability. One way to derive posterior probabilities for the estimated class membership f(x_i) of observation x_i has been proposed by Platt (1999). His approach is to fit a sigmoid function to all estimated g(x_i) to derive probabilities of the form P(y = 1 | g(x_i)) = p_i = (1 + exp(a_1 g(x_i) + a_2))^{-1}, where a_1 and a_2 are estimated by minimizing the negative log-likelihood of the training data:

$$ \min_{a_1, a_2} \; -\sum_{i=1}^{n} \Bigl[ \frac{y_i + 1}{2} \log(p_i) + \Bigl(1 - \frac{y_i + 1}{2}\Bigr) \log(1 - p_i) \Bigr]. $$

4 Experiments and Results
We define a training and a test sample, corresponding to 85% and 15% of the original data set, respectively. Our experiments have been carried out with the LIBSVM 2.6 software (Chang and Lin (2004)). We have constructed three SVM models, which differ in the transformation of the original data space, that is, using the linear kernel, the polynomial kernel of degree 2 (d = 2) and the RBF kernel. Table 2 shows detailed results of the SVM models as well as of competing classification techniques in marketing such as linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and the logit choice
Sample                LDA    QDA    logit   lin SVM   poly SVM   RBF SVM
Training  ≤ 14 days   68.2   69.2   63.3     73.0       78.9       77.5
          > 14 days   63.3   67.5   66.2     60.5       59.2       61.4
          Overall     65.7   68.3   64.8     66.5       68.7       69.8
Test      ≤ 14 days   64.2   54.7   60.4     58.5       75.5       71.7
          > 14 days   56.4   54.6   65.5     49.1       45.5       52.7
          Overall     60.2   54.6   63.0     53.7       60.2       62.0

Table 2. Hit rates (in %) of different learning methods for the vacation data set. Approximately 85% and 15% of each data set are used for training and testing, respectively. LDA, QDA and logit stand for Linear Discriminant Analysis, Quadratic Discriminant Analysis and logit choice model.
model. The manually adjustable parameters C and γ have been estimated via a five-fold cross-validation procedure. As a result, the parameters for the linear, polynomial and RBF SVM models have been set as follows: C = 2.5 for the linear kernel; C = 0.004 and d = 2 for the polynomial kernel; and C = 3500 and γ = 0.0013 for the RBF kernel. The overall performance of the SVM on the test set is comparable to that of the standard marketing techniques. Among the SVM models, the most flexible one (RBF-SVM) is also the most successful at generalizing the data. The average hit rate on the test set of all techniques considered centers at around 59%. There is no substantial distinction among the performances of the models, except for the QDA and linear SVM models, which relatively underperform. In such a setting we generally favor those models that can be better interpreted.
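A comparable pipeline can be sketched with scikit-learn, whose SVC class wraps LIBSVM; this is only an illustration of the procedure described above (the randomly generated data, the column count and the grid values are placeholders, not the authors' data or exact settings), and probability=True invokes Platt's (1999) sigmoid fitting internally:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# X: 11 explanatory variables, y in {-1, +1}: holiday length class (placeholder data)
rng = np.random.default_rng(2)
X = rng.normal(size=(708, 11))
y = np.where(rng.random(708) < 0.517, 1, -1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

# five-fold cross-validation over C and gamma for the RBF kernel
grid = GridSearchCV(SVC(kernel="rbf", probability=True),
                    param_grid={"C": [1, 100, 3500], "gamma": [0.0013, 0.01, 0.1]},
                    cv=5)
grid.fit(X_train, y_train)

print("test hit rate:", grid.score(X_test, y_test))
pos = list(grid.classes_).index(1)                   # column of P(y = +1, i.e. > 14 days)
print("P(> 14 days):", grid.predict_proba(X_test[:3])[:, pos])
```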
5 Interpreting the Influence of the Explanatory Variables
The classical SVM appears to lack two main interpretation aspects shared by the standard models of LDA, QDA, and the logit choice model. First, for the standard models, coefficient estimates for each explanatory variable are available and can be interpreted as the direct effect of a change in one of the independent variables on the dependent variable, while keeping all other independent variables fixed. The same interpretation is possible for the linear SVM model, since the original data space is preserved and thus individual coefficient estimates are available. For all the other types of SVM this direct variable effect can be highly nonlinear and is not directly observable. The SVM with RBF kernel, for example, implies infinitely many transformed explanatory variables, and thus infinitely many coefficients, which makes interpretation impossible at first sight. Second, the coefficient estimates obtained from the standard models can be used to derive the effect of each explanatory variable on the probability of
Fig. 1. Influences of individual explanatory variables (age group, income group, sunshine, number of children, accommodation, destination) on the probability to spend more than two weeks on a vacation for the logit model.
Fig. 2. Influences of individual explanatory variables (age group, income group, sunshine, number of children, accommodation, destination) on the probability to spend more than two weeks on a vacation for the RBF-SVM model.
a certain binary outcome. Although the classical SVM does not output outcome probabilities, one can use here the probability estimates proposed by Platt (1999), discussed in Section 3. Interestingly, these probability estimates can help to derive individual variable effects also for the nonlinear SVM. For interpretation purposes, all that is needed is to visualize the relationship between a given explanatory variable and the probability to observe one of the two possible binary outcomes, while keeping the rest of the explanatory variables fixed. Thus, even for the SVM with RBF kernel it is not necessary to know the coefficients for each data dimension in order to infer the influence of individual variables. Next, we interpret the results of the SVM model with RBF kernel on the vacation data set and compare them with those from the logit model. Consider Figures 1 and 2, which show the relationships between some of the independent variables and the probability to go on a vacation for more than two weeks, for the logit and RBF-SVM models respectively. In each of the panels, the remaining explanatory variables are kept fixed at their average levels. The dashed lines denote the probability of the "average" person to go on a vacation for more than two weeks. The first striking feature to observe is the great degree of similarity between both models. Although the RBF-SVM model is very flexible, the estimated effects for variables such as "Having children", "Big expenses", and "In season" are close to linear, just as the logit model predicts. The main difference between both techniques is best illustrated by the predicted effect of the "Age group" variable. The SVM model suggests that both relatively younger and relatively older holiday makers tend to have (on average) a higher probability to choose the longer vacation option than the middle-aged ones, which makes sense intuitively. The logit model cannot capture such an effect by definition, as it imposes a monotonically increasing (or decreasing)
relationship between the explanatory variables and the probability of a certain outcome. The RBF-SVM model, on the other hand, is free to capture a highly nonlinear relationship via the mapping of the original data into a higher-dimensional space. Moreover, since the SVM model does not suffer from monotonicity restrictions, it reports nonmonotonically ordered outcome probabilities for the categories of the "Accommodation" variable (see Figure 2). Although one cannot conclude here that the SVM is immune to the need to optimally scale the variables prior to model estimation, it is clear that it offers better protection against arbitrary coding of unordered categorical variables than the logit model does. The marketing implications of the results obtained by the SVM can be derived directly from Figure 2. By considering the effects of changes in individual variables, marketeers can infer which ones are most effective and, as a result, streamline their advertising efforts accordingly. Thus, it seems most effective to offer longer-than-two-week vacations to customers with the following profile: relatively older, with a high income, a small number of children or no children at all, preferring to have sunshine available most of the time, and travelling to a destination outside Europe.
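The curves in Figures 1 and 2 can be produced with a simple partial-dependence-style loop: vary one explanatory variable over its observed range, hold the remaining variables fixed at their means, and record the predicted probability. A sketch under the same placeholder setup as the previous block, treating column 0 as a hypothetical "age group" variable:

```python
import numpy as np
# continues from the fitted `grid` object and X_train of the previous sketch

def effect_curve(model, X, col, values):
    """Predicted P(y = +1) as a function of one variable, all others held at their means."""
    base = X.mean(axis=0)
    pos = list(model.classes_).index(1)
    probs = []
    for v in values:
        x = base.copy()
        x[col] = v
        probs.append(model.predict_proba(x.reshape(1, -1))[0, pos])
    return np.array(probs)

age_grid = np.linspace(X_train[:, 0].min(), X_train[:, 0].max(), 20)
curve = effect_curve(grid, X_train, col=0, values=age_grid)   # values to plot against age_grid
```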
6 Conclusion
We have analyzed a marketing classification problem with SVM for binary classification. We have also compared our results with those of standard marketing tools. Although the classical SVM exhibits superior performance, a general deficiency is that the results are hard to interpret, especially in the nonlinear case. To facilitate such an interpretation, we have constructed relationships between the explanatory and (binary) outcome variable by making use of probabilities for the SVM output estimates obtained from an approach proposed by Platt (1999). Ultimately, this allows for the possibility to evaluate the effectiveness of different marketing strategies under different scenarios. In terms of interpretation of the results, it appears that SVM models can give two advantages over standard techniques. First, highly nonmonotonic effects of the explanatory variables can be detected and visualized. And second, which comes as a by-product of the first, the SVM appears to model adequately the effects of arbitrarily coded unordered categorical variables.
References ABE, M. (1991): A Moving Ellipsoid Method for Nonparametric Regression and Its Application to Logit Diagnostics With Scanner Data. Journal of Marketing Research, 28, 339–346. ABE, M. (1995): A Nonparametric Density Estimation Method for Brand Choice Using Scanner Data. Marketing Science, 14, 300–325.
BENNETT, K.P., WU, S. and AUSLENDER, L. (1999): On Support Vector Decision Trees For Database Marketing. IEEE International Joint Conference on Neural Networks (IJCNN '99), 2, 904–909. BURGES, C.J.C. (1998): A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2, 121–167. CHANG, C.C. and LIN, C.J. (2004): LIBSVM: a Library for Support Vector Machines. Software available at: http://www.csie.ntu.edu.tw/∼cjlin/libsvm CRISTIANINI, N. and SHAWE-TAYLOR, J. (2000): An Introduction to Support Vector Machines. Cambridge University Press, Cambridge. CUI, D. (2003): Product Selection Agents: A Development Framework and Preliminary Application. Unpublished doctoral dissertation. University of Cincinnati, Business Administration: Marketing, Ohio. Retrieved April 5, 2005, from http://www.ohiolink.edu/etd/send-pdf.cgi?ucin1054824718 EVGENIOU, T. and PONTIL, M. (2004): Optimization Conjoint Models for Consumer Heterogeneity. INSEAD Working Paper, Serie No. 2004/10/TM, Fontainebleau: INSEAD. FRANSES, P.H. and PAAP, R. (2001): Quantitative Models in Marketing Research. Cambridge University Press, Cambridge. LATTIN, J., CARROLL, J. and GREEN, P. (2003): Analyzing Multivariate Data. Duxbury Press, Belmont, CA. MÜLLER, K.-R., MIKA, S., RÄTSCH, G., TSUDA, K. and SCHÖLKOPF, B. (2001): An Introduction to Kernel-Based Learning Algorithms. IEEE Transactions on Neural Networks, 12(2), 181–201. PLATT, J. (1999): Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In: A. Smola, P. Bartlett, B. Schölkopf, D. Schuurmans (Eds.): Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 61–74. VAN HEERDE, H., LEEFLANG, P., and WITTINK, D. (2001): Semiparametric Analysis to Estimate the Deal Effect Curve. Journal of Marketing Research, 38, 197–215. VAPNIK, V.N. (1995): The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., (2nd edition, 2000). WEST, P.M., BROCKETT, P.L. and GOLDEN, L.L. (1997): A Comparative Analysis of Neural Networks and Statistical Methods for Predicting Consumer Choice. Marketing Science, 16, 370–391.
Modeling the Nonlinear Relationship Between Satisfaction and Loyalty with Structural Equation Models Marcel Paulssen and Angela Sommerfeld Institut für Marketing, Humboldt-Universität zu Berlin, 10178 Berlin
1 Introduction
Despite high rates of customer satisfaction, firms experience high rates of customer defection (Reichheld, 1996). In general, the relationship between satisfaction and loyalty has been assumed to be linear and symmetric (Yi, 1990). However, this linearity assumption has recently been questioned in studies by Mittal, Ross and Baldasare (1998) and Matzler et al. (2004). From a managerial standpoint, a thorough understanding of the nature of the relationship between customer satisfaction and loyalty is extremely important. A too rigid assumption of linearity is likely to produce incorrect results and can thereby lead to suboptimal decisions. Given that both satisfaction and loyalty are reflective constructs, we have to model their potentially nonlinear relationship with structural equations, which is far from being straightforward, especially with multiple nonlinear terms (e.g. Rigdon, Schumacker & Wothke 1998). Thus the goal of this paper is to model the nonlinear relationship of satisfaction and loyalty within an SEM framework and to introduce the Quasi-Maximum Likelihood approach by Klein to both marketing researchers and practitioners (Klein & Muthén, 2004).
2 Theoretical Background
Satisfaction studies are common in many companies. Typically, customers evaluate a product or service on a number of attributes. Normally these satisfaction ratings are related to loyalty in order to understand which aspects of a product or service are crucial for retaining customers. Results from these types of analyses are often used for resource allocation decisions, i.e. investing in improvements of the attributes with the highest yield in terms of customer loyalty. In this context the linearity assumption implies that an increase of satisfaction from a high satisfaction level should lead to the same increase in loyalty as a similar increase of satisfaction from a low satisfaction level on a given product attribute. However, since Kahneman & Tversky's (1979) prospect theory it seems obvious that we have to question this linearity assumption. An important result of Kahneman and Tversky's work is that people do not look at the levels of final wealth they can attain but at
gains and losses relative to some reference point and display loss aversion, i.e., a loss function that is steeper than the gain function. This implies that equal-magnitude gains and losses do not have symmetric impacts on the decision: losses hurt more than gains satisfy. Translating the loss aversion phenomenon to the marketing context would imply that negative attribute performance should carry more weight in a customer's repurchase decision than equal amounts of positive attribute performance. Kano (1984) took a somewhat more differentiated perspective on the relation between attribute performance and consumer decision making. His model assumes three factors influencing overall satisfaction, labeled performance, excitement and basic factors. Performance factors possess a linear relationship between perceived attribute performance and overall satisfaction, whereas both basic and excitement factors are hypothesized to possess nonlinear relationships. Basic factors are attributes of a product or service expected by the customer; they are not supposed to impact overall satisfaction in case they are fulfilled, but they have a strong impact if they do not meet the customer's expectations. On the other hand, unexpected attributes can be quite delightful and therefore increase satisfaction (excitement factors). Here a negative performance is hypothesized to have no impact on overall satisfaction, whereas a positive performance has a positive impact on overall satisfaction.
3 Literature Review
Mittal et al. (1998) focused on prospect theory as the theoretical foundation for their study. Accordingly, Mittal et al. assumed that negative attribute evaluations have a higher impact on overall satisfaction than positive attribute evaluations. In order to test this proposition they used dummy-coded attributes in their regression analysis, one category "better than expected", the other "worse than expected". Their results show that negative performance on an attribute has a stronger impact on overall satisfaction than a corresponding positive performance. However, their findings did not provide unanimous support for their proposition of a stronger effect of negative attribute performance: for one attribute a positive performance had a stronger impact than a negative performance. In a follow-up study, Matzler et al. (2004) argued, based on Kano's model, that it is not only overly restrictive to assume just linear relationships, but that it is also problematic to assume only nonlinear relationships with negative performance always weighing more than positive performance. Based on the Kano model they argued for the existence of three types of relationships (see above). In their study they could demonstrate the existence of linear relations between attribute performance and satisfaction, as well as nonlinear relationships corresponding to basic and excitement factors. Similar to Mittal et al. (1998), they also used a multivariate regression analysis with dummy coding of the attribute performance. Unfortunately, the use of multivariate regression analysis exhibits some problems. Loyalty and satisfaction judgments are clearly reflective latent
constructs. Not correcting latent constructs for measurement error leads to inconsistent and attenuated parameter estimates. Furthermore, the dummy coding of attribute performance leads to a loss of information. Therefore, it is advisable to use modeling approaches that are free of the mentioned problems. Since our constructs of interest are latent, an SEM-framework is an appropriate choice to model their potentially nonlinear relationship. In the following section we therefore give a brief overview of approaches to model nonlinear relations within an SEM-framework.
4 Nonlinear Relationships within a SEM-framework
A popular approach to model nonlinear relationships with SEM is the multigroup approach. In a first step, sub-samples are defined by the level of the variable for which nonlinear (quadratic, interactive) effects are hypothesized (e.g. by a median split). First a hierarchy of tests is conducted to ensure measurement equality (tau-equivalence) across the sub-samples. Then a model with gamma-parameters constrained to be equal across groups is tested against a model where the gamma-parameters are allowed to differ. A question in the multigroup approach is of course where to split the sample. A naive median split may obscure a nonlinear relationship, but quartile splits require substantial sample sizes. If the grouping variable is measured with error, assignment to groups is problematic and can lead to biased parameter estimates. Nevertheless, the multiple group approach is a practical and popular approach to model nonlinear relationships. Kenny and Judd (1984) describe a procedure to estimate nonlinear and interactive effects under the assumption that the latent variables are normally distributed. As shown in Figure 1, the variances and covariances of the nonlinear factors are functions of the variances and covariances of the linear latent variables. Even if the measurement indicators are multivariate normal, their product terms will be nonnormal, and any variable that is a function of nonlinear factors (XZ; XX) will also be nonnormally distributed. Therefore the maximum likelihood estimation procedure of LISREL is inappropriate (Kenny & Judd, 1984). Another complication of this procedure is the fact that nonlinear constraints have to be specified. Nonlinear constraints are awkward to specify and can change dramatically when relatively minor modifications to the linear model part are made. "Utmost care must be taken to specify the constraints correctly – a single mistake has severe consequences" (Ping, 1994). Ping (1994, 1996) therefore proposes a somewhat easier to implement two-step procedure in which the measurement model of the linear latent variables is estimated first. Loadings and error variances of the product indicators are calculated using the first-step measurement model estimates. Then the structural model is estimated with the calculated loadings and error variances of the product terms set as constants. Another problem of the described Kenny and
Fig. 1. A Kenny & Judd (1984) Elementary Nonlinear Model (path diagram: indicators x1, x2 and product indicators x1x1, x1x2, x2x2 measuring the latent variables ξ1 and ξ1ξ1, which predict η with indicators y1, y2)
Judd model is that the multiplicative terms can lead to substantial multicollinearity that impedes parameter estimation, since quadratic or interaction measures are functions of the main effect constructs. This can also be problematic for measurement models. Therefore, asymptotically distribution-free estimators that do not rely on the assumption of multivariate normality were developed for nonlinear models (Jöreskog and Yang, 1996). However, the WLS estimator uses the inverse of a fourth-order moments matrix as a weight matrix, which in the presence of product terms of indicators is not of full rank. This problem is aggravated the more product terms are used, since they are a function of the other observed variables. Furthermore, the sample size has to be substantial (Yang-Wallentin & Jöreskog, 2001). Finally, and most importantly, it has to be said that all Kenny and Judd type models work only for elementary interaction and nonlinear models. We summarize this section with a quote by Rigdon, Schumacker and Wothke (1998), who stated: "Obviously, the lack of testing interaction and nonlinear effects in latent variable models in the research literature is not due to the failure of substantive arguments that suggest the presence of interaction or nonlinearity, rather the techniques are technically demanding and not well understood."
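To make the product-indicator idea concrete, the following sketch builds Kenny-Judd product indicators from two simulated normal indicators and shows that the products are nonnormal (an illustration only, with hypothetical variable names; it is not the estimation routine discussed in this paper):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 1000
xi = rng.normal(size=n)                   # latent predictor, standard normal
x1 = xi + rng.normal(scale=0.5, size=n)   # indicator 1 of xi
x2 = xi + rng.normal(scale=0.5, size=n)   # indicator 2 of xi

# Kenny-Judd product indicators for the quadratic factor xi*xi
x1x1, x1x2, x2x2 = x1 * x1, x1 * x2, x2 * x2

# Even though x1 and x2 are (multivariate) normal, their products are not
for name, v in [("x1", x1), ("x1*x1", x1x1), ("x1*x2", x1x2)]:
    print(name, "skewness:", round(stats.skew(v), 2),
          "excess kurtosis:", round(stats.kurtosis(v), 2))
```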
5 The Quasi Maximum Likelihood Approach by Klein
The model we would like to introduce here is the Quasi Maximum Likelihood approach (Quasi-ML) by Klein (Klein & Muthén, 2004). Klein introduces a structural equation model with a general quadratic form of the latent independent predictor variables. The elementary interaction models proposed by Kenny and Judd (1984), with interaction as well as quadratic effects, are special cases of Klein's model. The proposed model covers structural equations with polynomials of degree two and is itself a special case of the general polynomial structural equation model described by Wall and Amemiya (2000). The structural model with a quadratic form can be described by the following equation:

ηt = α + Γξt + ξt′Ωξt + ζt,   t = 1, . . . , N,

where
ηt is a latent dependent variable (criterion variable),
α is an intercept term,
ξt is a (n × 1) vector of latent predictor variables,
Γ is a (1 × n) coefficient matrix,
Ω is a symmetric (n × n) coefficient matrix with entries ω11, . . . , ω1n, . . . , ωnn,
ζt is a disturbance variable.
The quadratic form ξt′Ωξt distinguishes the model from an ordinary linear SEM. Assumptions and notation are equivalent to those of linear SEMs. The problem of nonnormally distributed quadratic or polynomial indicator variables is solved by a transformation which reduces the number of nonnormally distributed components of the original indicator vector to one nonnormally distributed component of the transformed indicator vector. After this transformation, the model is treated as a variance function model, and mean and variance functions for the nonlinear model are calculated. A quasi-likelihood estimation principle is applied, and the nonnormal density function of the indicator vector is approximated by the product of an unconditionally normal and a conditionally normal density function. A Quasi-ML estimator is obtained by maximizing the loglikelihood function based on this approximating density function (Klein & Muthén, 2004). Simulation studies indicate that the efficiency of Quasi-ML estimation is similar to that of ML estimators. Quasi-ML shows high statistical power to detect latent interactions and no substantial bias in the estimation of standard errors. Furthermore, complex models with multiple nonlinear effects can be analyzed without excessive sample size requirements.
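To make the structural part of such a model tangible, the following small simulation generates data from a quadratic latent regression (a sketch only; the parameter values are hypothetical and this is not the Quasi-ML estimator itself):

```python
import numpy as np

rng = np.random.default_rng(42)
N, n = 2000, 2                          # observations, number of latent predictors

alpha = 0.2
Gamma = np.array([0.5, 0.3])            # (1 x n) linear coefficients
Omega = np.array([[-0.20, 0.10],        # symmetric (n x n) quadratic coefficients
                  [ 0.10, -0.05]])

xi = rng.multivariate_normal(np.zeros(n), np.eye(n), size=N)   # latent predictors
zeta = rng.normal(scale=0.5, size=N)                           # disturbance

# eta_t = alpha + Gamma xi_t + xi_t' Omega xi_t + zeta_t
eta = alpha + xi @ Gamma + np.einsum("ti,ij,tj->t", xi, Omega, xi) + zeta

# With Cov(xi) = I, E[eta] = alpha + trace(Omega): the quadratic term shifts the mean
print("sample mean of eta:", round(eta.mean(), 3),
      "expected:", round(alpha + np.trace(Omega), 3))
```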
6 Empirical Application of the Quasi-ML Method
We apply the Quasi-ML method to the substantive research question of nonlinear relationships between satisfaction and loyalty as developed in the theoretical part of the paper. The results of the Quasi-ML method are compared with the results of a multivariate regression with dummy coding of the satisfaction judgments (Mittal et al., 1998; Matzler et al., 2004). We conducted two studies, one in the automotive and another in the telecommunication industry. Due to space limitations we report the results of the automotive study in more detail than those of the telecommunications study. The study in the automotive industry comprises 1477 customers, who were questioned about their intentions to repurchase and to recommend and about their satisfaction with brand, sales and after-sales. Satisfaction as well as loyalty intentions were measured on five-point Likert scales. The constructs and items are shown in Figure 2. In order to conduct a regression analysis analogous to Mittal et al. (1998), we dummy coded the satisfaction variables. Customers scoring from 1 to 2.5 on
Study 1, customers of automotive companies (N = 1477)

satisfaction with brand (scale: 1 = poor, 2 = less good, 3 = good, 4 = very good, 5 = excellent):
  A prestigious brand; A brand about which I hear good things from friends and relatives; A desirable brand; A reputation that has been built over time
satisfaction with sales (same scale):
  The sales dept. respects me as a customer; The sales dept. understands my requirements; The sales dept. staff are helpful and courteous
satisfaction with after-sales (same scale):
  The service dept. respects me as a customer; The service dept. understands my requirements; The service dept. fixes faults at the first attempt
loyalty (scale: 1 = certainly not, 2 = probably not, 3 = maybe, maybe not, 4 = probably, 5 = certainly):
  Based on your present experience would you repurchase a car of the same brand?; Would you recommend the brand of your car to friends and acquaintances?

Fig. 2. Constructs and Respective Items
the satisfaction scales (mean of the items of each scale) were coded as "negative". Customers scoring from 3.5 to 5 on the satisfaction scales were coded as "positive". As shown in Figure 3, we obtained results as predicted by Mittal et al., confirming that negative satisfaction has a stronger impact on loyalty than positive satisfaction. Thus the nonlinear nature of the relation between satisfaction and loyalty is also confirmed in this initial step of our study. As mentioned before, not correcting latent constructs for measurement error leads to inconsistent and attenuated parameter estimates in the dummy regression. Furthermore, the dummy coding of attribute performance leads to a loss of information. In a second step, we use the Quasi-ML approach to model the nonlinear relation between satisfaction and loyalty, with the following structural model:

ηt = α + (γ1 γ2 γ3) (ξ1t, ξ2t, ξ3t)′ + (ξ1t ξ2t ξ3t) diag(ω11, ω22, ω33) (ξ1t, ξ2t, ξ3t)′ + ζt.

The standardized estimates show the linear and quadratic impact of the satisfaction judgments on loyalty. Satisfaction with sales has no significant impact on loyalty, whereas satisfaction with brand and satisfaction with after-sales have a negative nonlinear relationship with loyalty. Both quadratic terms ω11 and ω33 are negative and significant. Thus the higher the actual satisfaction with both brand and after-sales, the lower is the impact of a satisfaction increase
Dummy-Variable Regression Coefficients (standardized)
latent variable                  negative       positive
satisfaction with brand          −0.233***      0.158***
satisfaction with sales          −0.012 n.s.    0.086*
satisfaction with after-sales    −0.151***      0.099*
R² = 0.229, F(6,1471) = 74.074, p < 0.000
*** p < 0.0001; ** p < 0.01; * p < 0.05; n.s. p > 0.05

Fig. 3. Results of the Dummy Regression
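For illustration, the dummy coding and the benchmark regression could be set up as follows (a sketch on simulated data with hypothetical column names; the original customer data are not reproduced here):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({c: rng.uniform(1, 5, n)
                   for c in ["sat_brand", "sat_sales", "sat_after"]})
df["loyalty"] = (2 + 0.4 * df["sat_brand"] + 0.2 * df["sat_after"]
                 + rng.normal(scale=0.5, size=n))

X = pd.DataFrame(index=df.index)
for col in ["sat_brand", "sat_sales", "sat_after"]:
    X[col + "_neg"] = (df[col] <= 2.5).astype(int)   # scores 1 to 2.5 -> "negative"
    X[col + "_pos"] = (df[col] >= 3.5).astype(int)   # scores 3.5 to 5 -> "positive"
    # scores between 2.5 and 3.5 form the omitted (neutral) reference category

fit = sm.OLS(df["loyalty"], sm.add_constant(X)).fit()
print(fit.params)   # dummy-variable regression coefficients
```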
Study 1                                t-value    stand. estimate
γ1   satisfaction with brand            9.012       0.356***
γ2   satisfaction with sales            1.632       0.070
γ3   satisfaction with after-sales      4.244       0.214***
ω11  satisfaction with brand           −5.662      −0.220***
ω22  satisfaction with sales            1.824       0.067
ω33  satisfaction with after-sales     −2.244      −0.098*
*** p < 0.0001; ** p < 0.01; * p < 0.05; n.s. p > 0.05

Fig. 4. Results of the Nonlinear SEM-Model
on loyalty. Again, these results support the proposition from prospect theory and are in line with Mittal et al. (1998) (see Figure 4). The second study was carried out in the telecommunication industry. The sample consists of 926 customers. As in the first study, loyalty and satisfaction judgments with various aspects of the telecommunication service (network, tariff and customer service) were measured on five-point Likert scales. The results of the second study are essentially equivalent in that we again find significant nonlinear effects; they corroborate the findings of study one in a different context.
7 Discussion and Conclusion
As has been shown in this brief paper, the Quasi-ML method provides a relatively manageable approach to model nonlinear relationships in an SEM-framework. Especially for the substantive research question at hand, where multiple nonlinear effects had to be estimated simultaneously, the discussed multiple-group and Kenny and Judd type models offer no alternative. The two-step method of moments approach by Wall and Amemiya (2000) could in principle offer an even more flexible way to model nonlinear relationships with latent variables, but applications of and experience with this method are still scarce. Thus the Quasi-ML method by Klein (Klein & Muthén, 2004)
represents an interesting approach to model nonlinearities. In future research it should be applied to further substantive research questions such as moderator effects. As for quadratic effects, only Klein's approach allows testing multiple interaction effects simultaneously.
References
JÖRESKOG, K.G. and YANG, F. (1996): Nonlinear structural equation models: The Kenny and Judd model with interaction effects. In: G.A. Marcoulides and R.E. Schumacker (Eds.): Advanced Structural Equation Modeling (pp. 57–88). Mahwah, NJ: Lawrence Erlbaum.
KAHNEMAN, D. and TVERSKY, A. (1979): Prospect theory: An analysis of decision under risk. Econometrica, 47(2), 263–292.
KANO, N. (1984): Attractive quality and must-be quality. Journal of the Japanese Society for Quality Control, April, 39–48.
KLEIN, A.G. and MUTHÉN, B.O. (2004): Quasi maximum likelihood estimation of structural equation models with multiple interaction and quadratic effects. Journal of the American Statistical Association (under review).
KENNY, D.A. and JUDD, C.M. (1984): Estimating the nonlinear and interactive effects of latent variables. Psychological Bulletin, 96, 201–210.
MITTAL, V., ROSS, W.T. and BALDASARE, P.M. (1998): The asymmetric impact of negative and positive attribute-level performance on overall satisfaction and repurchase intentions. Journal of Marketing, 62 (January), 33–47.
MATZLER, K., BAILOM, F., HINTERHUBER, H.H., RENZL, B. and PICHLER, J. (2004): The asymmetric relationship between attribute-level performance and overall customer satisfaction: A reconsideration of the importance-performance analysis. Industrial Marketing Management, 33(4), 271–277.
PING, R.A. (1994): Does satisfaction moderate the association between alternative attractiveness and exit intention in a marketing channel? Journal of the Academy of Marketing Science, 22, 364–371.
PING, R.A. (1996): Latent variable interaction and quadratic effect estimation: A two-step technique using structural equation analysis. Psychological Bulletin, 119, 166–175.
REICHHELD, F. (1996): The loyalty effect: The hidden force behind growth, profits and lasting value. Boston: Harvard Business School Press.
RIGDON, E., SCHUMACKER, R. and WOTHKE, W. (1998): A comparative review of interaction and nonlinear modeling. In: R.E. Schumacker and G.A. Marcoulides (Eds.): Interaction and Nonlinear Effects in Structural Equation Modeling (pp. 251–294). Mahwah, NJ: Lawrence Erlbaum.
WALL, M.M. and AMEMIYA, Y. (2000): Estimation for polynomial structural equation models. Journal of the American Statistical Association, 95(451), 929–940.
YANG-WALLENTIN, F. (2001): Comparisons of the ML and TSLS estimators for the Kenny-Judd model. In: R. Cudeck, S. Du Toit and D. Sörbom (Eds.): Structural Equation Modeling: Present and Future. A Festschrift in Honor of Karl Jöreskog. Lincolnwood: Scientific Software International.
YI, Y. (1990): A critical review of consumer satisfaction. In: V.A. Zeithaml (Ed.): Review of Marketing (pp. 68–123). Chicago: American Marketing Association.
Job Choice Model to Measure Behavior in a Multi-stage Decision Process
Thomas Spengler and Jan Malmendier
Fakultät für Wirtschaftswissenschaft, Universität Magdeburg, 39106 Magdeburg, Germany
Abstract. The article presents a job choice model that allows measuring the importance of items of employer images in a multi-stage decision process. Based on scientific research, a model of the multi-stage decision process is presented that details how decisions are made at each stage. A method using logistic regression to empirically validate the model is described and compared to an alternative method. The results of applying the method are presented and discussed.
1 Introduction
Job choice has gained interest in research since the middle of the 20th century. Scientists from different fields of research - motivation theory as well as organizational behavior - have provided models to explain the behavior of job seekers. Even though recent research is available, the various approaches have not been synthesized into a coherent model (Wanous et al., 1983; Schwab et al., 1987; Beach, 1993; Highhouse and Hoffman, 2001).
2 Model of Job Choice
The most important theory used to understand job choice behavior has been expectancy theory (Wanous et al., 1983). The most relevant concept within expectancy theory for job choice is the valence-instrumentality-expectancy theory of Vroom, which consists of several models. The first and most important model, based on the Rosenberg approach, explains the attractiveness of a job or an organization.¹ This can be presented algebraically as follows:²

Vj = f [ Σ_{k=1}^{m} (Vk ∗ Ijk) ]

¹ The job choice always contains the organizational choice. Even though the student creates images on the level of employers and organizations, job choice describes the actual process more accurately.
² where Vj is the valence (attractiveness) of job j, Vk the valence (importance) of job attribute k, and Ijk the instrumentality of job j to provide attribute k.
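A minimal sketch of this valence computation (purely illustrative numbers; the identity function is assumed for f):

```python
# Attractiveness of a job as the importance-weighted sum of instrumentalities
def job_valence(importance, instrumentality):
    """V_j = sum_k (V_k * I_jk), with f taken as the identity."""
    return sum(v_k * i_jk for v_k, i_jk in zip(importance, instrumentality))

importance = [0.8, 0.5, 0.3]     # V_k: importance of job attributes k = 1..m
job_a      = [0.9, 0.4, 0.7]     # I_Ak: how well job A provides attribute k
job_b      = [0.3, 0.9, 0.8]     # I_Bk: how well job B provides attribute k

print("V(A) =", job_valence(importance, job_a))   # 0.8*0.9 + 0.5*0.4 + 0.3*0.7
print("V(B) =", job_valence(importance, job_b))
```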
Consistent with all other expectancy theories, Vroom assumes a simultaneous decision with all required information available and a compensatory decision rule. Since the late 1960s and early 1970s, expectancy theory has been extensively applied, especially when focusing on the job search and decision behavior of graduates, with very convincing results. As the popularity of expectancy theory grew, so did the criticism of these assumptions and of its usefulness as a predictive tool representing reality (Wanous et al., 1983; Breaugh, 1992). Expectancy theory neglects the extensive process of organizational and job choice, assuming just one final decision. Soelberg therefore developed his own "generalizable decision process model", which gained most attention in connection with job choice. Lacking a theoretical framework and being only loosely structured, it is still widely recognized in organizational choice research for the described decision process (Power and Aldag, 1985; Schwab et al., 1987). The model consists of four stages. While the first and second stages cover preparation only, stage 3 includes the acquisition of information and job alternatives and is completed with the formal decision to stop searching (also called screening) and the formation of an "active roster". Soelberg assumes that the individual makes an implicit choice that he confirms during stage 4 with "a goal weighting function" (Soelberg, 1967). A similar process can be found in the buying process of Howard and Sheth, where the buyer forms a consideration set before selecting the product out of this set (Shocker et al., 1991). The formation of the consideration set in general evolves by a selection against decision criteria, while the individuals finally choose the most preferred or - stated differently - the most attractive alternative. Osborn empirically demonstrated this two-stage decision process with different choice criteria for the screening and the final decision (Osborn, 1990). In recent literature the application for a job is seen as a separate decision and should be integrated as an additional decision step (Gatewood et al., 1993).
2.1 Multi-stage Decision Process (Job Choice Funnel)
Widely adopted in the marketing field, the decision funnel with different sets of alternatives is new to job choice research. Trommsdorff defines the purchasing funnel with four specific sets of alternatives (Trommsdorff, 2002). The "available set" forms the basis with all potential alternatives; the "awareness set" contains all known alternatives; the "processed set" includes all alternatives the individual has processed information about. Out of those familiar alternatives the consideration or evoked set is formed (Shocker et al., 1991). The finally purchased alternative is selected out of this set. When transferring the funnel to the job choice environment, some adaptations are necessary. Based on the consideration set, the job seekers decide where to apply. The potential employers review the applications. In this step the set of applications is transformed into the set of offers, which is the relevant set for the final job choice. Therefore, three decisions are involved in
the funnel: Selection for the consideration set, selection for application and selection of the future employer.

2.2 Decision Making on each Funnel Stage in Detail
For each of these selections, two challenges have to be addressed: firstly, a model has to be identified of how alternatives are evaluated at each funnel stage; secondly, it has to be clarified how the alternatives are selected based on this evaluation. Addressing the first question, the successful application of expectancy theory to the job choice of graduate students and the work of Beach lead to the hypothesis that the Vroom model can be used for all three decision stages (Beach, 1993). To evaluate the alternative jobs, the individuals determine their attractiveness. Research indicates that they probably use different weights at each stage (Osborn, 1990). Along the funnel we find two different types of selection: according to the research described above, the first two screening decisions lead to choice sets, whereas in the final decision the preferred employer is selected. For the screening decisions, all alternatives with an attractiveness above a decision standard will be selected (Beach, 1993). For both types of decisions there are two different selection models: either the decision is made deterministically or probabilistically (Louviere and Woodworth, 1983). The probabilistic model is based on a random utility maximization model and extends the deterministic one by assuming that the attractiveness or utility has a stochastic component:³

Ṽij = Vij + εij

This probabilistic model is a far better representation of reality, because neither the actual decision nor the measurement of the image items is free of error.
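The difference between the deterministic and the probabilistic selection rule can be sketched as follows (illustrative attractiveness values; the normal error scale is an assumption):

```python
import numpy as np

rng = np.random.default_rng(7)
V = np.array([2.1, 1.8, 1.2, 0.4])     # deterministic attractiveness of 4 employers
standard = 1.5                          # decision standard for a screening decision

# Deterministic screening: select every alternative above the standard
print("deterministic choice set:", np.flatnonzero(V > standard))

# Probabilistic screening: add a stochastic utility component before screening
V_tilde = V + rng.normal(scale=0.5, size=V.size)
print("probabilistic choice set:", np.flatnonzero(V_tilde > standard))
```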
3 Method
To empirically analyze job choice behavior, two methods have received most attention in the past. Compositional measurement directly surveys the value of the instrumentality or - stated differently - the items of the employer image for each alternative and the importance of these items for the job decision. The sum of the products of instrumentality and importance mirrors the attractiveness of the alternative. Data collection for this method, also called the explicated model or direct estimate, is very simple and allows surveying long lists of items without overextending the survey participant. Still, there are various concerns regarding this method. First, this model contains no information on whether an alternative will be chosen or not. Second, the separate evaluation
³ Here Ṽij denotes the subjective utility/attractiveness of alternative j for individual i, Vij the deterministic utility/attractiveness of alternative j for individual i, and εij an error term.
of instrumentality and importance is unrealistic and artificial, mainly due to social desirability (Breaugh, 1992; Wanous et al., 1983). The most prominent alternatives are decompositional measurements, especially conjoint analysis. While the traditional type of conjoint analysis still misses the link to the selection, choice-based conjoint analysis closes this gap. Even though choice-based conjoint analysis provides a more realistic model, efficient measurement is limited to only a few items. Moreover, choice-based conjoint analysis focuses on analyzing the last decision - picking the favorite employer.

3.1 Proposed Method
As both methods have significant shortcomings, the brand driver analysis suggested by marketing research is evaluated (Echterling et al., 2002). This approach also uses the choice decision as the dependent variable, but asks respondents to assign each alternative to one out of four possible sets of alternatives: "familiarity/processed set", "consideration set/short list", "application" and "favorite employer". The method uses items of employer images as independent variables. Conducting a separate analysis for each decision step, binary logistic regression offers a robust methodology to analyze the importance of the image items for each decision. This method is especially valid for screening decisions against a level of required attractiveness. Each of the decisions to transfer to the next set of alternatives provides a binary decision variable.⁴ Another advantage of the logistic regression is the integration of the probabilistic choice model. It assumes a residual error term that is approximately normally distributed (Lilien et al., 1992).⁵ The logistic regression consists of a linear core, the strength of influence z:⁶
zi = β0 + Σ_{k=1}^{K} βk ∗ xik

⁴ The three binary variables are consideration set (y/n), intent to apply (y/n) and preferred employer (y/n). This neglects the fact that the individual does not use a decision standard for the final decision, but just chooses the most preferred alternative. Comparing the three decisions, the prediction of the final one has very limited relevance, as the set of offers received is unknown and depends on the selection by the employers. Therefore, the model is focused on the screening decisions against a decision standard.
⁵ Assuming a normally distributed error term, the probit model would be appropriate. But this model lacks a usable algebraic form and its results are extremely difficult to interpret (Lilien et al., 1992). Thus, the logit model is used, assuming nearly normally distributed residual values.
⁶ Here xi1, . . . , xiK are the values of the independent variables xk for individual i, β1, . . . , βK the weights of the independent variables, and β0 a constant.
The likelihood for the binary decision y = 0 or y = 1 is calculated as follows:

pi(y = 1) = 1 / (1 + e^(−zi))   and   pi(y = 0) = 1 − 1 / (1 + e^(−zi))
To optimize the model, the likelihood function, i.e. the product over all observed cases, is maximized. The maximization is achieved by adapting the βk values. The optimized βk values can therefore be interpreted, via the odds ratios, as the derived importance of the image items for the specific decision.
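A compact sketch of one such screening regression (simulated data with hypothetical item names; statsmodels is used here for illustration, whereas the study itself used SPSS):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 1000
X = pd.DataFrame({                        # image items rated on a 6-point scale
    "trust": rng.integers(1, 7, n),
    "challenging_tasks": rng.integers(1, 7, n),
    "salary": rng.integers(1, 7, n),
})
z = -4 + 0.6 * X["trust"] + 0.4 * X["challenging_tasks"]      # linear core z_i
y = (rng.random(n) < 1 / (1 + np.exp(-z))).astype(int)        # 1 = short-listed

logit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
odds_ratios = np.exp(logit.params)        # derived importance of each image item
print(odds_ratios)
```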
3.2 Procedure
Besides analyzing the decision process of students, the survey aimed to understand the importance of student-employer relationships for job choice decisions. Therefore, items representing the quality of relationships were integrated into the survey in addition to classical employer image items. Relationship marketing research has identified trust and familiarity as key indicators of the quality of a relationship (Bruhn, 2001). To collect the data, an online survey was created and used in association with e-fellows.net, the largest high-potential network and scholarship program in Germany with more than 10,000 scholars. All scholars received an email, resulting in 2,495 completed online questionnaires. First, the participants had to sort 18 major employers of high-potentials into the sets of alternatives. In a second step, they were asked to assess four employer images in detail, which were randomly selected as long as they were at least familiar to the individual. Besides this, participants were asked for their stated importance of the image items and for biographical information. In the survey a 6-point rating scale was used for the image items, verbally anchored at both ends.⁷
4 Results
The results of the three logistic regressions are displayed in two tables. The first shows the model fit and the improvements regarding correct classifications.⁸ The second table comprises the odds ratios representing the derived importance of the image items.⁹ Most importantly, all models are significant. The Pseudo-R² measures the ability of the model to clearly distinguish between the two options of the binary decision based on the provided image item values. Nagelkerke-R² is chosen as the quality criterion in this research context. For the first two
8 9
After closing the online survey, all data have been transferred to SPSS and analyzed using the logistic regression function, type ”Enter”. (+): requirements met; (-): requirements not met Level of significance: **0.01; *0.1 - n.s. not significant on level 0.1; Bold: the five most important items
Decision to select an alternative for:                Consideration set   Intention to apply   Preferred employer
Significance of model (LR-Test)                       0.01 (+)            0.01 (+)             0.01 (+)
Nagelkerke-R²                                         0.326 (+)           0.243 (+)            0.120 (−)
Share of correct classifications                      74.4%               70.7%                63.1%
Share of correct classifications in base model        63.8%               60.0%                54.8%
Relative decrease of share of wrong classifications   29.3%               26.8%                18.4%

Table 1. Analysis of model fit
decisions the model delivers very acceptable values for Nagelkerke-R²; only for the last decision does the model not separate well. This result is not a surprise, as the chosen model does not represent the final selection of the most preferred employer well. Looking at the importance of the items for the three decisions, the clearest development is shown by the "prospect of success for application" item, which is very important for the first decision and irrelevant for the last: students only consider employers that will probably respond positively to an application. Trust and familiarity are also important items for the decision to transfer an employer to the short-list or to apply. For the final decision, career opportunities and location seem to be highly relevant. Besides this analysis, the differences between the stated and the derived importance are of interest. For some items the derived and stated importance show similar results: "challenging tasks" and "trust in employer" are relevant for all decisions and stated to be highly important. Especially the relevance of the likelihood to successfully apply for a job and the importance of familiarity with the tasks for the first decision are highly underestimated. A similar situation can be found for career opportunities and location with respect to the final decision. In contrast, candidates overrate the relevance of job security, responsibility, work-life balance and creativity; these items do not have a significant positive influence on any of the three decisions. Besides the odds ratios, the level of significance should be analyzed further. First of all, there are quite a number of items that do not influence the decisions significantly. This might be an indicator of multicollinearity. Multicollinearity appears when several items are fully or mostly a linear combination of others; in this case, the coefficients cannot be interpreted correctly. Testing the current model leads to a variance inflation factor of 3.1, which is clearly below the critical value of 10 (Green, 2000). Using the condition index, the model delivers acceptable values below the critical point of 30. In a similar logistic regression analyzing job choice behavior, an even lower number of items was significant (Schmidtke, 2002).
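The multicollinearity check mentioned here can be reproduced along the following lines (a sketch with a simulated, hypothetical design matrix; the reported value of 3.1 refers to the authors' own data):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
X = pd.DataFrame(rng.normal(size=(500, 4)),
                 columns=["trust", "familiarity", "salary", "career"])
X["career"] = X["career"] + 0.8 * X["salary"]     # induce moderate collinearity

X_const = sm.add_constant(X)
vifs = {col: variance_inflation_factor(X_const.values, i)
        for i, col in enumerate(X_const.columns) if col != "const"}
print(vifs)        # rule of thumb: values above 10 indicate critical collinearity
```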
(Derived importance as odds ratios for the decision to transfer to each set; stated importance as average rating)

item                                   Short-list    Intended application    Preferred employer    Stated importance
Job security                           0.97 n.s.     0.9**                   1.03 n.s.             2.35
Attractive location                    1.11**        1.17**                  1.15**                2.3
High Salary                            0.99 n.s.     1.07*                   0.93 n.s.             2.81
Career opportunities                   1.01 n.s.     1.13**                  1.21**                1.97
Taking responsibility                  0.85**        0.89**                  0.95 n.s.             2.18
Trainings                              1.06*         1.2**                   1.08 n.s.             1.88
Work-Life-Balance                      0.97 n.s.     0.92                    0.89**                2.14
Internationality                       1.1**         1.1**                   1.07 n.s.             2.23
Creative working environment           1.04 n.s.     0.97 n.s.               1.05 n.s.             1.91
Challenging tasks                      1.46**        1.36**                  1.3**                 1.65
Innovative company                     1.2**         1.12**                  1.1*                  2.5
Good working relations                 1.13**        1.13**                  1.19**                1.45
Trust in employer                      1.22**        1.09*                   1.09*                 1.51
Familiarity with tasks                 1.14**        1.22**                  1.07 n.s.             2.73
Prospect of success for application    1.75**        1.29**                  0.96 n.s.             3.06
Familiarity with employees             0.96*         1.12**                  1.07*                 4.62

Table 2. Comparison of derived and stated importance
5 Discussion
First of all, the analysis clearly indicates that the derived importance of the image attributes changes along the funnel. This is especially true for the relevance of the expected success of an application. Secondly, the results clearly indicate that the stated importance diverges from the derived importance. This underlines that the direct measurement of importance should be applied very carefully. The logistic regression overall provides significant results; still, the goodness of fit could be improved. As stated before, it is no surprise that the logistic regression does not deliver a sufficient Pseudo-R² for the last decision. From a content point of view, the extension of the classic image items with relationship attributes has significantly contributed to the explanation of the decisions along the funnel. Especially the first decisions, to short-list an employer and to apply, are influenced by the relationship attributes familiarity and trust. Therefore, companies may seek to create such relations through personal interactions (e.g. campus presentations, workshops, internships).
References
BEACH, L.R. (1993): Decision Making in the Workplace. Mahwah, New Jersey.
BREAUGH, J.A. (1992): Recruitment: Science and Practice. Boston.
BRUHN, M. (2001): Relationship Marketing. München.
ECHTERLING, J., FISCHER, M., and KRANZ, M. (2002): Die Erfassung der Markenstärke und des Markenpotenzials als Grundlage der Markenführung. Arbeitspapier Nr. 2, Marketing Centrum Münster. Düsseldorf, Münster.
GATEWOOD, R.D., GOWAN, M.A., and LAUTENSCHLAGER, G.J. (1993): Corporate Image, Recruitment Image, and Initial Job Choice Decisions. Academy of Management Journal, Vol. 36, No. 2, S. 414-427.
GREEN, W.H. (2000): Econometric Analysis, 4. Ed. Englewood Cliffs.
HIGHHOUSE, S. and HOFFMAN, J.R. (2001): Organizational Attraction and Job Choice. International Review of Industrial and Organizational Psychology, Vol. 16, S. 37-64.
LILIEN, G.L., KOTLER, P., and MOORTHY, K.S. (1992): Marketing Models. Englewood Cliffs.
LOUVIERE, J.J. and WOODWORTH, G. (1983): Design and Analysis of Simulated Consumer Choice or Allocation Experiments: An Approach Based on Aggregate Data. Journal of Marketing Research, Vol. 20, November 1983, S. 350-367.
OSBORN, D.P. (1990): A Reexamination of the Organizational Choice Process. Journal of Vocational Behavior, Vol. 36, S. 45-60.
POWER, D.J. and ALDAG, R.J. (1985): Soelberg's job search and choice model: A clarification, review, and critique. Academy of Management Review, Vol. 10, S. 48-58.
SCHMIDTKE, C. (2002): Signaling im Personalmarketing: Eine theoretische und empirische Analyse des betrieblichen Rekruitingerfolges. München.
SCHWAB, D.P., RYNES, S.L., and ALDAG, R.J. (1987): Theories and research on job search and choice. Research in Personnel and Human Resources Management, Vol. 5, S. 129-166.
SHOCKER, A.D., BEN-AKIVA, M., BOCCARA, B., and NEDUNGADI, P. (1991): Consideration Set Influences on Consumer Decision-Making and Choice: Issues, Models and Suggestions. Marketing Letters, 2:3 (1991), S. 181-197.
SOELBERG, P.O. (1967): Unprogrammed decision making. Industrial Management Review, Vol. 8, No. 8, S. 19-29.
TROMMSDORFF, V. (2002): Konsumentenverhalten, 4. Ed. Stuttgart.
WANOUS, J.P., KEON, T.L., and LATACK, J.C. (1983): Expectancy Theory and Occupational/Organizational Choices: A Review and Test. Organizational Behavior and Human Performance, Vol. 32, S. 66-86.
Semiparametric Stepwise Regression to Estimate Sales Promotion Effects
Winfried J. Steiner¹, Christiane Belitz², and Stefan Lang³
¹ Department of Marketing, University of Regensburg, 93040 Regensburg, Germany
² Department of Statistics, University of Munich, 80539 Munich, Germany
³ Institute of Empirical Economic Research, University of Leipzig, 04109 Leipzig, Germany
Abstract. Kalyanam and Shively (1998) and van Heerde et al. (2001) have proposed semiparametric models to estimate the influence of price promotions on brand sales, and both obtained superior performance for their models compared to strictly parametric modeling. Following these researchers, we suggest another semiparametric framework which is based on penalized B-splines to analyze sales promotion effects flexibly. Unlike these researchers, we introduce a stepwise procedure with simultaneous smoothing parameter choice for variable selection. Applying this stepwise routine enables us to deal with product categories with many competitive items without imposing restrictions on the competitive market structure in advance. We illustrate the new methodology in an empirical application using weekly store-level scanner data.
1 Introduction
Kalyanam and Shively (1998) and van Heerde et al. (2001) have proposed nonparametric techniques (a kernel-based and a stochastic spline regression approach, respectively) to estimate promotional price effects. In both studies, the authors obtained superior performance for their semiparametric models compared to strictly parametric modeling. The empirical results of these two studies indicate that own- and cross-promotional price effects may show complex nonlinearities which are difficult or even impossible to capture with parametric models. Moreover, no unique patterns for own- and cross-promotional price response curves generalizable across or even within product categories could be identified. These findings strongly support the use of nonparametric techniques to let the data determine the shape of promotional price response functions. A recent empirical comparison of parametric and seminonparametric sales response models (the latter specified as multilayer perceptrons) conducted by Hruschka (2004) also provides superior results for the more flexible neural net approach. We follow Kalyanam and Shively (1998) and van Heerde et al. (2001) and propose a semiparametric model based on penalized B-splines to estimate
sales promotion effects flexibly. We add to the body of knowledge by suggesting a stepwise regression procedure with simultaneous smoothing parameter choice for variable selection. Applying this stepwise routine enables us to deal with product categories with many competing brands and to resolve the problem of identifying relevant cross-promotional effects between brands without imposing restrictions on the competitive market structure in advance. Since cross-item price effects are usually much lower in magnitude than own-item price effects (e.g., Hanssens et al. (2001)), and frequently not all competing brands in a product category are close substitutes to each other (e.g., Foekens (1995)), a stepwise selection to reduce the number of predictors in a sales response model seems very promising. Many previous approaches to analyze sales response to promotional activities have tackled this problem by imposing restrictions on the competitive market structure, e.g., by capturing competitive promotional effects in a highly parsimonious way through the use of a single competitive variable (e.g., Blattberg and George (1991)) or by focusing only on a limited number of major brands in a product category (e.g., Kalyanam and Shively (1998), van Heerde et al. (2001)). The paper is organized as follows: in section 2, we propose the semiparametric model to estimate promotional effects and provide details about the P-splines approach we use to model the unknown smooth functions for ownand cross-promotional price effects; in section 3, we introduce the stepwise routine which includes a simultaneous smoothing parameter selection for the continuous price variables; in section 4, we illustrate the new methodology in an empirical application using weekly store-level scanner data for coffee brands; section 5 summarizes the contents of the paper.
2 A Semiparametric Approach to Analyze Promotional Data
To estimate sales promotion effects, we model a brand's unit sales as (1) a nonparametric function of own- and cross-item price variables using penalized B-splines (e.g., Eilers and Marx (1996), Lang and Brezger (2004)) and (2) a parametric function of other promotional instruments:

ln(Qis,t) = Σ_s αis Os + Σ_j fij(Pjs,t) + fii(Pis,t−1) + Σ_j Σ_k γijk Djks,t + Σ_q δiq Wq,t + εis,t;   ε ∼ N(0, σ²),   (1)
where
Qis,t: unit sales of item i (brand i) in store s and week t;
Os: store dummy to capture heterogeneity in baseline sales of brand i across different stores;
fij(Pjs,t): unknown smooth functions for the effect of the own-item price (j = i) and the prices of competing items (j ≠ i) on unit sales of brand i;
Pjs,t: actual price of item j in store s and week t;
Pis,t−1: lagged price of item i in store s, i.e., its price in week t−1;
Djks,t: indicator variables capturing usage (= 1) or nonusage (= 0) of non-price promotional instrument k (e.g., display, feature) for brand j in store s and week t;
Wq,t: seasonal dummy indicating whether public holiday q falls in week t (= 1) or not (= 0);
αis, δiq: store intercept for item i and store s, and effect of holiday q on unit sales of brand i;
γijk: effect of non-price promotional instrument k of item j on unit sales of brand i, representing own (j = i) and cross (j ≠ i) promotional effects.

As common in commercially applied sales response models, we pool the data across stores and focus on one brand at a time (e.g., van Heerde et al. (2002)). We use log unit sales (ln(Qis,t)) instead of unit sales to normalize the distribution of the criterion variable, which is typically markedly skewed with promotional data. We further include indicator variables (Wq,t) to account for "seasonal" fluctuations in a brand's unit sales due to holidays (e.g., Christmas, Easter). We also include a lagged variable for the own price (Pis,t−1) to accommodate the fact that promotions often accelerate sales of a brand during the promotional period, leading to a trough after the promotional period (e.g., Blattberg and George (1991)). To model the unknown smooth functions for own- and cross-price effects, we adopt the P-splines approach proposed by Eilers and Marx (1996). This approach can be characterized by three properties: (a) It is assumed that the unknown functions fij (or fii) can be approximated by a spline of degree l with equally spaced knots within the range of the respective price Pj. We use cubic splines and, hence, assume degree 3. Suppressing brand index i, store index s and time index t, we can write such a spline as a linear combination of Mj cubic B-spline basis functions Bjm, m = 1, . . . , Mj:

fj(Pj) = Σ_{m=1}^{Mj} βjm Bjm(Pj),   (2)
where Bjm is the m-th B-spline basis function and βjm the regression coefficient for the m-th B-spline basis function. It would be beyond the scope of this paper to go into the details of B-splines; we refer to De Boor (1978) as a key reference. (b) Eilers and Marx (1996) suggest using a moderately large number of knots to ensure enough flexibility for the unknown functions. For simplicity, we use 20 knots for every price
response curve, i.e., Mj = M = 20. (c) To guarantee sufficient smoothness of the fitted curves, a roughness penalty based on squared differences (of order k) of adjacent B-spline coefficients is specified. Let vn denote the vector of all parametric effects of the model for the n-th observation and let index j, j = 1, . . . , J + 1, cover all smooth functions for own- and competitive price effects (including the lagged own-price effect as the (J + 1)-th price effect); this leads to the penalized least-squares criterion

Σ_{n=1}^{N} ( yn − Σ_{j=1}^{J+1} fj(Pjn) − vn′ζ )² + Σ_{j=1}^{J+1} λj Σ_{l=k+1}^{M} (∆k βj,l)²,   (3)
where N denotes the sample size (the product of the number of stores and the number of weeks), ∆k denotes differences of order k between adjacent regression coefficients, and λj is the smoothing parameter for function fj. In the following, we restrict ourselves to penalties based on second-order differences, i.e., ∆k βj,l = βj,l − 2βj,l−1 + βj,l−2. The penalized sum of squared residuals (3) is minimized with respect to the unknown regression coefficients βjm (compare equation (2)) and ζ. The trade-off between flexibility and smoothness is controlled by the smoothing parameters λj, j = 1, . . . , J + 1, which are determined within the stepwise routine (see section 3). Estimation of the semiparametric model (1) given the smoothing parameters is carried out with backfitting (Hastie and Tibshirani (1990)). To give a benchmark for the performance of the semiparametric model (1), we compare it in our empirical application presented in section 4 to the exponential model (4), which is one of the most widely used parametric models to analyze sales response (e.g., Montgomery (1997), Kalyanam and Shively (1999)):

ln(Qis,t) = Σ_s αis Os + Σ_j βij Pjs,t + νi Pis,t−1 + Σ_j Σ_k γijk Djks,t + Σ_q δiq Wq,t + εis,t;   ε ∼ N(0, σ²).   (4)
Model (4) differs from model (1) only with respect to own- and cross-price effects which are specified linearly (parametrically).
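The following sketch illustrates the penalized B-spline building blocks for a single price response curve (numpy/scipy only, requiring SciPy >= 1.8 for BSpline.design_matrix; the simulated data, the smoothing parameter value and the single-curve setting are assumptions, and the full model (1) would additionally be estimated by backfitting over all terms):

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(0)
price = np.sort(rng.uniform(5.99, 9.49, 300))           # observed prices
log_sales = 8 - 1.5 * np.log(price) + rng.normal(0, 0.2, 300)

# Cubic B-spline design matrix on 20 equally spaced (clamped) knots
degree = 3
inner = np.linspace(price.min(), price.max(), 20)
knots = np.r_[[inner[0]] * degree, inner, [inner[-1]] * degree]
B = BSpline.design_matrix(price, knots, degree).toarray()

# Second-order difference penalty on adjacent B-spline coefficients
D2 = np.diff(np.eye(B.shape[1]), n=2, axis=0)
lam = 10.0                                              # smoothing parameter lambda_j

# Penalized least squares: minimize ||y - B beta||^2 + lam * ||D2 beta||^2
beta = np.linalg.solve(B.T @ B + lam * D2.T @ D2, B.T @ log_sales)
fitted = B @ beta                                       # estimated f_j(price)
```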
3 Stepwise Routine with Simultaneous Smoothing Parameter Selection
Based on the P-splines approach outlined above, we suggest a stepwise regression procedure for markets with many competing brands and promotional instruments. This procedure not only allows for variable selection
but also enables determining the degree of smoothness of the effects which can be modeled nonparametrically. The objective of using the stepwise routine is to select relevant predictors (and especially relevant cross-promotional effects) for the unit sales of the brand under consideration, while at the same time not losing much explanatory power by excluding other variables from the model. Importantly, by obtaining a parsimonious sales response model that way, overspecification effects arising from the inclusion of all possible but not necessarily important cross effects (typically reflected by unreliable coefficients resulting from overparametrization, wrong signs and unexpected magnitudes of coefficients due to multicollinearity) can be avoided. The stepwise procedure works as follows: For each independent variable, we consider a hierarchy of specification alternatives defined in terms of equivalent degrees of freedom df. It is well known that the equivalent degrees of freedom df of a smooth function can be calculated from the trace of the corresponding smoother matrix (which in turn depends on the smoothing parameter value), and it is common practice to choose the value of a smoothing parameter simply by specifying the df for the smooth (Hastie and Tibshirani 1990). Clearly, there are only two possible specifications for indicator variables (like display, feature or seasonal dummy variables): excluded from the model (df = 0) or included in the model (df = 1). For the continuous price variables, however, we allow for a much broader interval of possible degrees of freedom, ranging in integer increments over [0; 10]. Setting df = 0 implies that the respective price variable is excluded from the model. For df = 1, the effect is included linearly. With increasing df (i.e., decreasing smoothing parameter), the penalty term in expression (3) becomes less important and the estimated function becomes rougher. Variable selection starts from the linear model, which includes all independent variables at df = 1 (i.e., parametrically). In each iteration, a set of new models is estimated by passing through the independent variables successively: (a) For each independent variable, the number of df is increased and decreased by one (where feasible) and the respective models are estimated, leaving the number of df for all other independent variables unchanged; (b) From the pool of new models estimated, the best model is then determined according to the BIC criterion:

BIC = N · ln(σ̂²) + ln(N) · dftotal,   (5)

where σ̂² is the estimated variance of the error term εis,t and dftotal the overall degrees of freedom. It is convenient to approximate dftotal by adding up the degrees of freedom used for the individual functions/terms included in the model (Hastie and Tibshirani (1990)). (c) If the BIC of the best model selected is less (i.e., better) than the BIC of the start model, GO TO (b) and use the selected model as the new start model; otherwise STOP.
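A schematic, runnable version of this search loop is sketched below (the backfitting estimator is replaced by a synthetic placeholder fit_sigma2, and the variable list and df limits are hypothetical):

```python
import numpy as np

N = 5 * 104                                  # stores x weeks
MAX_DF = {"own_price": 10, "lagged_own_price": 10, "price_brand_2": 10,
          "display": 1, "feature": 1}        # df = 0 means the variable is excluded

def fit_sigma2(dfs):
    """Stand-in for estimating model (1) by backfitting at the given df values
    and returning the estimated error variance (synthetic placeholder only)."""
    return 1.0 / (1.0 + 0.1 * sum(min(v, 4) for v in dfs.values()))

def bic(dfs):
    return N * np.log(fit_sigma2(dfs)) + np.log(N) * sum(dfs.values())

current = {v: 1 for v in MAX_DF}             # start model: all effects linear (df = 1)
best = bic(current)

improved = True
while improved:
    improved = False
    candidates = []
    for v in MAX_DF:                         # (a) vary each variable's df by +/- 1
        for step in (+1, -1):
            new_df = current[v] + step
            if 0 <= new_df <= MAX_DF[v]:
                candidates.append(dict(current, **{v: new_df}))
    cand = min(candidates, key=bic)          # (b) best candidate according to BIC
    if bic(cand) < best:                     # (c) accept and iterate, else stop
        best, current = bic(cand), cand
        improved = True

print("selected df per variable:", current, " BIC:", round(best, 1))
```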
brand           1      2      3      4      5      6      7      8      9
lowest price    5.99   4.99   4.99   6.99   5.99   5.99   5.99   5.99   5.99
highest price   8.49   7.49   7.49   8.99   8.99   7.99   7.99   7.99   9.49

Table 1. Price ranges
4 Empirical Study
In this section, we present results from an empirical application of our semiparametric framework to weekly store-level scanner data for nine brands of coffee offered in five German supermarkets. The data were provided by MADAKOM GmbH (50825 Cologne, Germany) and include unit sales, retail prices and deal codes indicating the use of non-price promotional instruments (display, feature, other advertising activities) for the nine brands over a time span of 104 weeks. Table 1 shows the price ranges of the individual brands across stores according to the weekly price data. The weekly market shares of all brands vary considerably, reflecting the frequent use of price promotions. Table 2 shows the BIC values for the best semiparametric and exponential models selected by the stepwise routine, respectively. Importantly, the stepwise procedure based on the exponential model (4) was only allowed to select own- and cross-price effects parametrically (i.e., at df = 1), as opposed to the stepwise selection with simultaneous smoothing parameter choice based on the semiparametric model (1). In addition, the overall degrees of freedom for the start model, dftotal/start, which includes all effects parametrically, versus the overall degrees of freedom for the best semiparametric model, dftotal/semipar, are reported as a benchmark for model improvement through variable selection. With the exception of brands 3, 7 and 8, the semiparametric approach clearly outperforms the strictly parametric approach, with the most dramatic improvement occurring for brand 9. The improvement from nonparametric modeling is only slight for brand 7, while no differences between the final models occur for brands 3 and 8. The latter implies that nonparametric modeling of price effects does not matter for these two brands, and that the semiparametric model here actually degenerates into the exponential model. A comparison of the number of degrees of freedom used in the start model with those used in the best semiparametric model demonstrates the usefulness of the stepwise routine in providing very parsimonious sales response models. The following results refer to the brands for which the improvement in BIC values for the semiparametric approach is substantial (i.e., not brands 3, 7 and 8): (a) For five out of six brands, exactly the same price variables were selected in the semiparametric model (1) and the exponential model (4). This implies that the greater flexibility in nonlinear effects for the price variables provided by the semiparametric approach is the reason for the BIC improvement relative to the exponential model. (b) For each brand, the current
brand    semiparametric model    exponential model    dftotal/start    dftotal/semipar
1        −1001.13                −969.15              66               23
2        −319.18                 −311.13              64               19
3        −873.38                 −873.38              60               13
4        −1201.98                −1182.00             62               18
5        −556.69                 −536.91              64               16
6        −542.31                 −512.17              61               12
7        −521.62                 −519.85              60               15
8        −502.43                 −502.43              61               14
9        −616.63                 −543.38              62               15

Table 2. BIC values and overall degrees of freedom
own-price effect is included nonparametrically. For three brands, however, the BIC improvement can also be attributed to nonparametrically selected cross-price effects which show strong nonlinearities. (c) Out of 72 possible cross-price effects (8 per brand), only 21 were selected across brands. This confirms previous empirical findings that only some of the brands in a product category may be close substitutes to each other. (d) Nearly all selected non-price promotional instruments (referring to the use of display, feature and other advertising activities) have signs in the expected direction, i.e., positive for own-promotional effects, negative for cross-promotional effects. Figure 1 illustrates the differences between the semiparametric model and the strictly parametric one, considering the estimated own-item price effect for brand 9 and the cross-item price effect of brand 6 on the unit sales of brand 4 as two examples. Although the nonparametric and parametric own-price response curves for brand 9 are shaped rather similarly, the differences in predicted sales are substantial. In particular, the parametric model dramatically understates the effect for low prices (up to a difference of 800 units at 5.99), and it overstates the effect for medium prices. With respect to the cross-promotional price effect of brand 6 on brand 4, the parametric model understates the sales effect for low and high prices and overstates the sales effect for medium prices. Importantly, the nonparametric curve reveals a threshold effect at 6.99, up to which the unit sales of brand 4 are insensitive to price changes of brand 6.
5 Conclusions
We presented a semiparametric regression model including a stepwise procedure for variable selection to analyze promotional data. While the semiparametric model provides high flexibility in modeling nonlinear effects for the continuous price variables, the stepwise routine is used to identify the relevant predictors in markets with many competing items and many promotional instruments. The new approach is illustrated in an empirical application using weekly store-level scanner data.
[Figure 1 comprises two panels: the left panel plots the unit sales of brand 9 against the price of brand 9, the right panel plots the unit sales of brand 4 against the price of brand 6; each panel compares the P-spline and the exponential estimates.]
Fig. 1. Nonparametrically estimated own-/cross-promotional price effects
Implications of Probabilistic Data Modeling for Mining Association Rules
Michael Hahsler1, Kurt Hornik2, and Thomas Reutterer3
1 Department of Information Systems and Operations, Wirtschaftsuniversität Wien, A-1090 Wien, Austria
2 Department of Statistics and Mathematics, Wirtschaftsuniversität Wien, A-1090 Wien, Austria
3 Department of Retailing and Marketing, Wirtschaftsuniversität Wien, A-1090 Wien, Austria
Abstract. Mining association rules is an important technique for discovering meaningful patterns in transaction databases. In the current literature, the properties of algorithms to mine association rules are discussed in great detail. We present a simple probabilistic framework for transaction data which can be used to simulate transaction data when no associations are present. We use such data and a real-world grocery database to explore the behavior of confidence and lift, two popular interest measures used for rule mining. The results show that confidence is systematically influenced by the frequency of the items in the left-hand-side of rules and that lift performs poorly to filter random noise in transaction data. The probabilistic data modeling approach presented in this paper not only is a valuable framework to analyze interest measures but also provides a starting point for further research to develop new interest measures which are based on statistical tests and geared towards the specific properties of transaction data.
1 Introduction
Mining association rules (Agrawal et al., 1993) is an important technique for discovering meaningful patterns in transaction databases. An association rule is a rule of the form X ⇒ Y , where X and Y are two disjoint sets of items (itemsets). The rule means that if we find all items in X in a transaction it is likely that the transaction also contains the items in Y . A typical application of mining association rules is market basket analysis where point-of-sale data is mined with the goal to discover associations between articles. These associations can offer useful and actionable insights to retail managers for product assortment decisions (Brijs et al., 2004), personalized product recommendations (Lawrence et al., 2001), and for adapting promotional activities (Van den Poel et al., 2004). For web-based systems (e.g., e-shops, digital libraries, search engines) associations found between articles/documents/web pages in transaction log files can even be used to automatically and continuously adapt the user interface by presenting associated items together (Lin et al., 2002).
Association rules are selected from the set of all possible rules using measures of statistical significance and interestingness. Support, the primary measure of significance, is defined as the fraction of transactions in the database which contain all items in a specific rule (Agrawal et al., 1993). That is,

supp(X ⇒ Y ) = supp(X ∪ Y ) = count(X ∪ Y ) / m,   (1)
where count(X ∪ Y ) represents the number of transactions which contain all items in X or Y , and m is the number of transactions in the database. For association rules, a minimum support threshold is used to select the most frequent (and hopefully important) item combinations called frequent itemsets. The process of finding these frequent itemsets in a large database is computationally very expensive since it involves searching a lattice which in the worst case grows exponentially in the number of items. In the last decade, research has centered on solving this problem and a variety of algorithms were introduced which render the search feasible by exploiting various properties of the lattice (see Goethals and Zaki (2004) as a reference to the currently fastest algorithms). From the frequent itemsets found, rules are generated using certain measures of interestingness, for which numerous proposals were made in the literature. For association rules, Agrawal et al. (1993) suggest confidence. A practical problem is that with support and confidence often too many association rules are produced. In this case, additional interest measures, such as lift, can be used to further filter or rank the found rules. Several authors (e.g., Aggarwal and Yu, 1998) constructed examples to show that in some cases the use of support, confidence and lift can be problematic. Instead of constructing such examples, we will present a simple probabilistic framework for transaction data which is based on independent Bernoulli trials. This framework can be used to simulate data sets which contain only random noise and in which no associations are present. Using such data and a transaction database from a grocery outlet we will analyze the behavior and problems of the interest measures confidence and lift. The paper is structured as follows: First, we introduce a probabilistic framework for transaction data. In section 3 we describe the real-world and simulated data sets used. In sections 4 and 5 we analyze the implications of the framework for confidence and lift. We conclude the paper with the main findings and a discussion of directions for further research.
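To make the definition in (1) concrete, the following minimal Python sketch (our illustration, not code from the paper) computes the support of a rule X ⇒ Y from a small list of transactions; the item names are invented for the example.

```python
# Support as defined in (1): the fraction of transactions that contain every
# item of the itemset X ∪ Y.

def support(transactions, itemset):
    """Fraction of transactions containing every item of `itemset`."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

# Example: supp(X => Y) = supp(X ∪ Y) for X = {"milk"}, Y = {"bread"}
transactions = [{"milk", "bread"}, {"milk"}, {"bread", "butter"}, {"milk", "bread", "butter"}]
print(support(transactions, {"milk"} | {"bread"}))  # 0.5
```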
2 A Simple Probabilistic Framework for Transaction Data
A transaction database consists of a series of transactions, each transaction containing a subset of the available items. We consider transactions which are recorded in a fixed time interval of length t.
[Figure 1 shows an example transaction database as a binary incidence matrix with transactions Tr1, . . . , Trm in the rows and items l1, . . . , ln in the columns, together with the row of item success probabilities p and the row of per-item transaction counts c.]
Fig. 1. Example transaction database with success probabilities p and transaction counts per item c.
We assume that transactions occur randomly following a (homogeneous) Poisson process with parameter θ. The number of transactions m in time interval t is then Poisson distributed with parameter θt, where θ is the intensity with which transactions occur during the observed time interval:

P(M = m) = e^(−θt) (θt)^m / m!   (2)
We denote the items which occur in the database by L = {l1, l2, . . . , ln} with n being the number of different items. For the simple framework we assume that all items occur independently of each other and that for each item li ∈ L there exists a fixed probability pi of being contained in a transaction. Each transaction is then the result of n independent Bernoulli trials, one for each item, with success probabilities given by the vector p = (p1, p2, . . . , pn). Figure 1 contains the typical representation of an example database as a binary incidence matrix with one column for each item. Each row labeled Tr 1 to Tr m contains a transaction, where a 1 indicates presence and a 0 indicates absence of the corresponding item in the transaction. Additionally, in Figure 1 the success probability for each item is given in the row labeled p and the row labeled c contains the number of transactions each item is contained in (sum of the ones per column). Following the model, ci can be interpreted as a realization of a random variable Ci. Under the condition of a fixed number of transactions m this random variable has the following binomial distribution:

P(Ci = ci | M = m) = (m choose ci) · pi^ci · (1 − pi)^(m−ci)   (3)
However, since for a fixed time interval the number of transactions is not fixed, the unconditional distribution gives:

P(Ci = ci) = Σ_{m=ci}^{∞} P(Ci = ci | M = m) · P(M = m) = e^(−pi θt) (pi θt)^ci / ci!   (4)
The unconditional probability distribution of each Ci has a Poisson distribution with parameter pi θt. For short we will use λi = pi θt and introduce the parameter vector λ = (λ1 , λ2 , . . . , λn ) of the Poisson distributions for all items. This parameter vector can be calculated from the success probability vector p and vice versa by the linear relationship λ = pθt. For a given database, the values of the parameter θ and the success vectors p or alternatively λ are unknown but can be estimated from the database. The best estimate for θ from a single database is m/t. The simplest estimate for λ is to use the observed counts ci for each item. However, this is only a very rough estimate which especially gets unreliable for small counts. There exist more sophisticated estimation approaches. For example, DuMouchel and Pregibon (2001) use the assumption that the parameters of the count processes for items in a database are distributed according to a continuous parametric density function. This additional information can improve estimates over using just the observed counts.
3 Simulated and Real-world Database
We use 1 month (t = 30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. For convenience, we use categories (e.g., popcorn) instead of the individual brands. In the available m = 9835 transactions we found n = 169 different categories for which articles were purchased. The estimated transaction intensity θ for the data set is m/t = 327.5 (transactions per day). We use the same parameters to simulate comparable data using the framework. For simplicity we use the relative observed item frequencies as estimates for λ and calculate the success probability vector p by λ/θt. With this information we simulate the m transactions in the transaction database. Note that the simulated database does not contain any associations (all items are independent), and thus differs from the grocery database, which is expected to contain associations. In the following we will use the simulated data set not to compare it to the real-world data set, but to show that interest measures used for association rules exhibit similar effects on real-world data as on simulated data without any associations. For the rest of the paper we concentrate on 2-itemsets, i.e., the co-occurrences between two items denoted by li and lj with i, j = 1, 2, . . . , n and i ≠ j. Although itemsets and rules of arbitrary length can be analyzed using the framework, we restrict the analysis to 2-itemsets since interest measures for these associations are easily visualized using 3D-plots.
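The simulation procedure just described can be sketched in a few lines of Python. This is a hedged illustration of the framework, not the authors' implementation; the item counts used to estimate λ are the illustrative values from Figure 1 rather than the grocery data.

```python
# Simulation under the independence model: the number of transactions is drawn
# from a Poisson distribution with parameter theta*t, and each item enters a
# transaction independently with its own success probability p_i.
import numpy as np

rng = np.random.default_rng(0)

t = 30.0                                   # length of the observation period (days)
theta_hat = 327.5                          # estimated transaction intensity m/t
item_counts = np.array([99, 201, 7, 411])  # illustrative observed counts c_i (as in Figure 1)
lambda_hat = item_counts                   # rough estimate: lambda_i ~ c_i
p_hat = lambda_hat / (theta_hat * t)       # success probabilities p = lambda / (theta*t)

m = rng.poisson(theta_hat * t)             # number of transactions in the interval
# binary incidence matrix: one Bernoulli trial per item and transaction
X = rng.binomial(1, p_hat, size=(m, len(p_hat)))
print(X.shape, X.sum(axis=0))              # simulated counts per item
```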
Fig. 2. Support (simulated)
Fig. 3. Support (grocery)
Fig. 4. Confidence (simulated)
Fig. 5. Confidence (grocery)
In these plots, the x- and y-axes each represent the items ordered from the most frequent to the least frequent (from left to right and from front to back), and on the z-axis we plot the analyzed measure. First we compare the 2-itemset support. Figures 2 and 3 show the support distribution of all 2-itemsets. Naturally, the most frequent items also together form the most frequent itemsets (to the left in the front of the plots). The general forms of the two support distributions are very similar. The grocery data set reaches higher support values with a median of 0.000203 compared to 0.000113 for the simulated data. This indicates that the grocery data set contains associated items which co-occur more often than expected under independence.
4 Implications for the Interest Measure Confidence
Confidence is defined by Agrawal et al. (1993) as

conf(X ⇒ Y ) = supp(X ∪ Y ) / supp(X),   (5)
where X and Y are two disjoint itemsets. Often confidence is understood as the conditional probability P (Y |X) (e.g., Hipp et al., 2000), where the definition above is seen as an estimate for this probability. From the 2-itemsets we generate all rules of the form li ⇒ lj and present the confidence distributions in Figures 4 and 5. Confidence is generally much lower for the simulated data (with a median of 0.0086 compared to 0.0140 for the real-world data), which indicates that the confidence measure is able to suppress noise.
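The following short sketch (an illustration on our own toy data, not the study's code) computes the confidence of all rules li ⇒ lj from a binary incidence matrix exactly as in equation (5).

```python
# Confidence of all 2-item rules l_i => l_j: supp({l_i, l_j}) / supp({l_i}).
import numpy as np

X = np.array([[1, 1, 0],    # toy incidence matrix: rows = transactions,
              [1, 0, 1],    # columns = items l_1, l_2, l_3
              [1, 1, 1],
              [0, 1, 0]])

m, n = X.shape
supp_item = X.mean(axis=0)              # supp({l_i})
supp_pair = (X.T @ X) / m               # supp({l_i, l_j}) for all item pairs

conf = supp_pair / supp_item[:, None]   # conf(l_i => l_j) = supp(i, j) / supp(i)
np.fill_diagonal(conf, np.nan)          # antecedent and consequent must be disjoint
print(np.round(conf, 2))
```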
Fig. 6. Lift (simulated)
Fig. 7. Lift (grocery)
Fig. 8. Lift supp > 0.1% (simulated)
Fig. 9. Lift supp > 0.1% (grocery)
However, the plots in Figures 4 and 5 show that confidence always increases as the item in the right-hand-side of the rule (lj) becomes more frequent. This behavior directly follows from the way confidence is calculated (see equation 5). Especially for the grocery data set in Figure 5 we see that this effect dominates the confidence measure. The fact that confidence clearly favors some rules makes the measure problematic when it comes to selecting or ranking rules.
5 Implications for the Interest Measure Lift
Typically, rules mined using minimum support (and confidence) are filtered or ordered using their lift value. The measure lift (also called interest, Brin et al., 1997) is defined on rules of the form X ⇒ Y as

lift(X ⇒ Y ) = conf(X ⇒ Y ) / supp(Y ).   (6)
A lift value of 1 indicates that the items are co-occurring in the database as expected under independence. Values greater than one indicate that the items are associated. For marketing applications it is generally argued that lift > 1 indicates complementary products and lift < 1 indicates substitutes (cf., Hruschka et al., 1999). Figures 6 to 9 show the lift values for the two data sets. The general distribution is again very similar. In the plots in Figures 6 and 7 we can only see that very infrequent items produce extremely high lift values. These values
are artifacts occurring when two very rare items co-occur once together by chance. Such artifacts are usually avoided in association rule mining by using a minimum support on itemsets. In Figures 8 and 9 we applied a minimum support of 0.1%. The plots show that there exist rules with higher lift values in the grocery data set than in the simulated data. However, in the simulated data we still find 64 rules with a lift greater than 2. This indicates that the lift measure performs poorly at filtering random noise in transaction data, especially if we are also interested in relatively rare items with low support. The plots in Figures 8 and 9 also clearly show lift's tendency to produce higher values for rules containing less frequent items, with the result that the highest lift values always occur close to the boundary of the selected minimum support. We refer the reader to Bayardo and Agrawal (1999) for a theoretical treatment of this effect. If lift is used to rank discovered rules, this means that there is not only a systematic tendency towards favoring rules with less frequent items, but also that the rules with the highest lift will change whenever the user-specified minimum support is changed.
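Continuing the confidence sketch from Section 4, lift and the minimum-support filter used for Figures 8 and 9 can be added as follows; the 0.1% threshold matches the one applied above, while the toy data remain illustrative.

```python
# lift(l_i => l_j) = conf(l_i => l_j) / supp(l_j), computed from the arrays
# conf, supp_item and supp_pair of the previous sketch.
lift = conf / supp_item[None, :]          # divide each column j by supp(l_j)

min_support = 0.001                       # 0.1% minimum support on the 2-itemsets
lift_filtered = np.where(supp_pair >= min_support, lift, np.nan)
print(np.round(lift_filtered, 2))
```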
6 Conclusion
In this contribution we developed a simple probabilistic framework for transaction data based only on independent items. The framework can be used to simulate transaction data which contains only noise and does not include associations. We showed that mining association rules on such simulated transaction data produces similar distributions for interest measures (support, confidence and lift) as on real-world data. This indicates that the framework is appropriate to describe the basic stochastic structure of transaction data. By comparing the results from the simulated data with the results from the real-world data, we showed how the interest measures are systematically influenced by the frequencies of the items in the corresponding itemsets or rules. In particular, we found that the measure lift performs poorly at filtering random noise and always produces the highest values for the rules containing the least frequent items. These findings suggest that the existing interest measures need to be supplemented by suitable statistical tests, which still need to be developed. Using such tests will improve the quality of the mined rules and the reliability of the mining process. The presented framework provides many opportunities for further research. For example, explicit modeling of dependencies between items would enable us to simulate transaction data sets with properties close to real data and with known associations. Such a framework would provide an ideal test bed to evaluate and to benchmark the effectiveness of different mining approaches and interest measures. The proposed procedure can also be used to develop tests against the independence model. Another research direction is to develop new interest measures based on the probabilistic features of the presented framework. A first step in this direction was already done by Hahsler et al. (2005).
References
AGGARWAL, C.C. and YU, P.S. (1998): A new framework for itemset generation. PODS 98, Symposium on Principles of Database Systems. Seattle, WA, USA, 18–24.
AGRAWAL, R., IMIELINSKI, T. and SWAMI, A. (1993): Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD International Conference on Management of Data. Washington D.C., 207–216.
BAYARDO, R.J., JR. and AGRAWAL, R. (1999): Mining the most interesting rules. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery in Databases & Data Mining (KDD99), 145–154.
BRIJS, T., SWINNEN, G., VANHOOF, K. and WETS, G. (2004): Building an association rules framework to improve product assortment decisions. Data Mining and Knowledge Discovery, 8(1), 7–23.
BRIN, S., MOTWANI, R., ULLMAN, J.D. and TSUR, S. (1997): Dynamic itemset counting and implication rules for market basket data. SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data. Tucson, Arizona, USA, 255–264.
DUMOUCHEL, W. and PREGIBON, D. (2001): Empirical Bayes screening for multi-item associations. In: F. Provost and R. Srikant (Eds.): Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery in Databases & Data Mining (KDD01). ACM Press, 67–76.
GOETHALS, B. and ZAKI, M.J. (2004): Advances in frequent itemset mining implementations: Report on FIMI'03. SIGKDD Explorations, 6(1), 109–117.
HAHSLER, M., HORNIK, K. and REUTTERER, T. (2005): Implications of probabilistic data modeling for rule mining. Report 14, Research Report Series, Department of Statistics and Mathematics, Wirtschaftsuniversität Wien, Augasse 2–6, 1090 Wien, Austria.
HIPP, J., GÜNTZER, U. and NAKHAEIZADEH, G. (2000): Algorithms for association rule mining — A general survey and comparison. SIGKDD Explorations, 2(2), 1–58.
HRUSCHKA, H., LUKANOWICZ, M. and BUCHTA, C. (1999): Cross-category sales promotion effects. Journal of Retailing and Consumer Services, 6(2), 99–105.
LAWRENCE, R.D., ALMASI, G.S., KOTLYAR, V., VIVEROS, M.S. and DURI, S. (2001): Personalization of supermarket product recommendations. Data Mining and Knowledge Discovery, 5(1/2), 11–32.
LIN, W., ALVAREZ, S.A. and RUIZ, C. (2002): Efficient adaptive-support association rule mining for recommender systems. Data Mining and Knowledge Discovery, 6(1), 83–105.
VAN DEN POEL, D., DE SCHAMPHELAERE, J. and WETS, G. (2004): Direct and indirect effects of retail promotions on sales and profits in the do-it-yourself market. Expert Systems with Applications, 27(1), 53–62.
Copula Functions in Model Based Clustering
Krzysztof Jajuga and Daniel Papla
Department of Financial Investments and Insurance, Wroclaw University of Economics, Wroclaw, Poland

Abstract. Model based clustering is a common approach used in cluster analysis. Here each cluster is characterized by some kind of model, for example a multivariate distribution, a regression, a principal component, etc. One of the most well known approaches in model based clustering is the one proposed by Banfield and Raftery (1993), where each class is described by a multivariate normal distribution. Due to the eigenvalue decomposition, one gets flexibility in modeling the size, shape and orientation of the clusters, still assuming a generally elliptical shape of the set of observations. In the paper we consider another proposal based on the general stochastic approach in two versions:
– classification likelihood approach, where each observation comes from one of several populations;
– mixture approach, where observations are distributed as a mixture of several distributions.
We propose the use of the copula approach, by representing the multivariate distribution as the copula function of univariate marginal distributions. We give the theoretical bases for such an approach and the algorithms for practical use. The discussed methods are illustrated by some simulation studies and real examples using financial data.
1 Model Based Clustering – Introduction
One of the most common approaches used in clustering is the so-called model based clustering. It is based on the assumption that multivariate data can be considered as a sample drawn from a population consisting of a number of classes (subpopulations), denoted by K, and a particular multivariate distribution is a model for each class. There are two common approaches in such stochastic model based clustering:
– classification likelihood approach (e.g. Scott, Symons (1971));
– mixture approach (e.g. Wolfe (1970)).
In the classification likelihood approach the likelihood function for n observations is given as:

L(θ | x1, x2, ..., xn) = ∏_{i=1}^{n} f(xi | θ) = ∏_{i=1}^{n} fγi(xi | θγi),   (1)

where γi = j ⇔ xi ∈ Πj. Assuming that the number of parameters for each class is equal to s, we get the total number of parameters to be estimated equal to Ks + n.
The estimation of parameters is performed by an iterative algorithm where for a given assignment of observations to classes the parameters are estimated and then the assignment (classification) is updated. In the mixture approach the likelihood function for n observations is given as:

L(θ | x1, x2, ..., xn) = ∏_{i=1}^{n} f(xi | θ) = ∏_{i=1}^{n} ( Σ_{j=1}^{K} Pj fj(xi | θj) )   (2)
Assuming that the number of parameters for each class is equal to s, we get the total number of parameters to be estimated equal to Ks + K − 1. It can be proved (Wolfe (1970)) that maximum likelihood estimates of prior probabilities, class parameters and posterior probabilities in a mixture approach can be obtained through the following equations (after taking the derivatives of the log-likelihood function):

P̂j = (1/n) Σ_{i=1}^{n} p̂(j | xi)   (3)

Σ_{i=1}^{n} p̂(j | xi) ∇θ̂j [log fj(xi | θ̂j)] = 0   (4)

p̂(j | xi) = P̂j fj(xi | θ̂j) / Σ_{l=1}^{K} P̂l fl(xi | θ̂l)   (5)
The estimation of parameters is performed by an iterative algorithm where for given posterior probabilities the estimation of parameters (prior probabilities and class parameters) is performed and then the posterior probabilities are updated. The particular models for the classes (clusters) depend on the choice of the multivariate distribution. As one can expect, the most popular models assume a multivariate normal distribution. Banfield and Raftery (1993) showed – for the classification likelihood approach – that some well-known deterministic and stochastic criteria for clustering can be derived from the multivariate normal model. Of course, this model is suitable for clusters of elliptical shape (generally: hyperellipsoidal shape). In this paper we give a proposal of another, more general approach, which can be more suitable for clusters having other than elliptical shapes. The proposal is based on the copula function.
2 Copula Function – A Way of the Analysis of Multivariate Distribution
The main idea behind the use of copula functions is in the decomposition of the multivariate distribution into two components, namely marginal distributions and the copula function linking these marginal distributions. Copula
functions reflect the dependence between the components of the random vector. This idea is presented in Sklar's theorem (Sklar (1959)), given as:

F(x1, ..., xm) = C(F1(x1), ..., Fm(xm))   (6)
Where: F – the multivariate distribution function; Fi – the distribution function of the i-th marginal distribution; C – copula function. The other notion strictly connected to the copula function is the copula density. It is given as:

c(u1, ..., um) = ∂^m C(u1, ..., um) / (∂u1 · · · ∂um)   (7)

f(x1, ..., xm) = c(F1(x1), ..., Fm(xm)) · f1(x1) · ... · fm(xm)   (8)
Where: f – the multivariate density function, fi – the univariate density function, c – copula density. As we see, the analysis of the multivariate distribution function is conducted by "separating" the analysis of the univariate distributions from the analysis of the dependence. There are many possible copula functions, analyzed in theory and used in practice (Nelsen (1999)). Often the copula functions are one-parameter functions; this parameter, denoted by θ, can be interpreted as the dependence parameter of two components of a random vector. From the point of view of statistical inference, the basic problem is the estimation of the parameters of the multivariate distribution by the maximum likelihood method. The log-likelihood function is given as:

l(θ) = Σ_{i=1}^{n} log c(F1(xi1), ..., Fm(xim)) + Σ_{i=1}^{n} Σ_{j=1}^{m} log fj(xij)   (9)
One of the basic estimation algorithms is performed in two steps. The first step is the maximum likelihood estimation of the parameters of the marginal distributions (for each j), through the maximization of the following function:

l(θj) = Σ_{i=1}^{n} log fj(xij)   (10)
The second step is the maximum likelihood estimation of the parameters of the copula function (given the estimates obtained in the first step), through the maximization of the following function:

l(α) = Σ_{i=1}^{n} log c(F1(xi1), ..., Fm(xim))   (11)
Of course, the particular solution of the maximum likelihood estimation depends on the copula density function.
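As an illustration of the two-step estimation in (10) and (11), the following Python sketch fits normal marginal distributions and a bivariate Clayton copula; the choice of marginals, the Clayton family and the optimization bounds are assumptions of this sketch, not prescriptions of the paper.

```python
# Two-step ML estimation: (1) fit the marginals, (2) fit the copula parameter
# on the probability-transformed data.
import numpy as np
from scipy import stats, optimize

def clayton_log_density(u, v, theta):
    # log of the Clayton copula density, valid for theta > 0
    return (np.log(theta + 1.0)
            - (theta + 1.0) * (np.log(u) + np.log(v))
            - (2.0 + 1.0 / theta) * np.log(u ** (-theta) + v ** (-theta) - 1.0))

def two_step_fit(x1, x2):
    # Step 1: ML estimation of the marginal parameters (formula (10))
    mu1, sd1 = stats.norm.fit(x1)
    mu2, sd2 = stats.norm.fit(x2)
    u = stats.norm.cdf(x1, mu1, sd1)
    v = stats.norm.cdf(x2, mu2, sd2)
    # Step 2: ML estimation of the dependence parameter (formula (11))
    neg_ll = lambda theta: -np.sum(clayton_log_density(u, v, theta))
    res = optimize.minimize_scalar(neg_ll, bounds=(1e-3, 50.0), method="bounded")
    return (mu1, sd1), (mu2, sd2), res.x

# toy data with positive dependence
rng = np.random.default_rng(1)
z = rng.normal(size=500)
x1 = z + rng.normal(scale=0.5, size=500)
x2 = z + rng.normal(scale=0.5, size=500)
print(two_step_fit(x1, x2)[2])   # estimated dependence parameter theta
```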
3 Application of Copula Function in Model Based Clustering
Now we move to the proposal to apply the copula function in model based clustering. The idea is rather simple, since the model for each cluster is given through the multivariate distribution function decomposed according to Sklar's theorem. Therefore for each cluster we have the following distribution function and density function:

Fj(x1, ..., xm) = Cj(Fj1(x1), ..., Fjm(xm))   (12)

fj(x1, ..., xm) = cj(Fj1(x1), ..., Fjm(xm)) · fj1(x1) · ... · fjm(xm)   (13)
As one can see, the model for each cluster "consists of" separate models for each component of the random vector and a model for the dependence between these components. By introducing the copula model given in (13) into the classification likelihood approach (formula (1)) and the mixture approach (formula (2)) we get new proposals for these two approaches of model based clustering. Of course, the particular model depends on the choice of the copula function. In any case, to estimate the parameters of the models one should apply an iterative algorithm. Now we will present such algorithms for both the classification likelihood approach and the mixture approach.
3.1 Algorithm for the Classification Likelihood Approach
1. Start from an initial classification; it can be given randomly or by some prior information.
2. In each iteration:
• estimate the parameters of the distribution in each class using the two-step estimation given in formulas (10) and (11) (parameters of the marginal distributions, parameters of dependence);
• calculate the density of each observation given each class – in total n times K numbers;
• update the classification by assigning each observation to the class of highest density.
3. Iterate until the classification does not change.
3.2 Algorithm for the Mixture Approach
1. Start from the initial posterior probabilities.
2. In each iteration:
• estimate the parameters of the distribution for each class using (4) and (13) with the two-step estimation (parameters of the marginal distributions, dependence parameters) and estimate the prior probabilities using (3);
• calculate the new posterior probabilities using (5).
3. Iterate until the posterior probabilities do not change significantly, for example until the maximal difference between posterior probabilities obtained in two consecutive iterations is less than some small number (e.g. 0.01); a compressed sketch of this iteration is given below.
Of course, computational studies should be performed to assess the performance of the proposed methods and algorithms. We will present some introductory studies below. Theoretical considerations lead to the conclusion that the proposed approach could be better suited to situations where the observations belonging to different classes are not multinormally distributed.
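The sketch below compresses the mixture iteration into a short Python function. The helper fit_class_density stands in for the two-step copula estimation of Section 2: it is assumed to accept the data and the current posterior weights of a class and to return a fitted joint density function, and is a placeholder of this illustration.

```python
# Iterative mixture estimation following formulas (3)-(5).
import numpy as np

def mixture_cluster(X, K, fit_class_density, n_iter=50, tol=0.01):
    n = X.shape[0]
    rng = np.random.default_rng(0)
    post = rng.dirichlet(np.ones(K), size=n)        # initial posterior probabilities
    for _ in range(n_iter):
        priors = post.mean(axis=0)                  # prior probabilities, formula (3)
        densities = np.column_stack([
            fit_class_density(X, post[:, j])(X)     # weighted ML fit per class, formula (4)
            for j in range(K)
        ])
        new_post = priors * densities
        new_post /= new_post.sum(axis=1, keepdims=True)   # posterior update, formula (5)
        if np.max(np.abs(new_post - post)) < tol:   # stop when posteriors stabilize
            break
        post = new_post
    return post
```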
4 Some Empirical Studies
Now we present the results of some empirical studies. Due to the limited scope of the paper we are able to give only a sample of the results, obtained in two types of studies.
4.1 Simulation Studies
We present the results for the Frank copula. Simulation studies were performed in such a way that for each set of parameters 100 repetitions of time series were obtained, each time series consisting of 2000 observations. Those 2000 observations were generated according to a given stochastic structure: two classes with different dependence parameters (theta) in the Frank copula and uniform marginal distributions. The results are presented in Tables 1-6. The headers of the columns denote:
– sett1, sett2 – parameter theta of the first and the second class;
– realt1, realt2 – means of the estimates of parameter theta of the first and the second class (using the given classification);
– estt1, estt2 – means of the estimates of parameter theta of the first and the second class (using the obtained classification);
– t1err, t2err – mean squared errors of the estimation of parameter theta of the first and the second class (using the obtained classification);
– mcsr – the misclassification error, i.e. the ratio of observations that were misclassified to the total number of observations.
One of the main conclusions is that both methods give better results when the thetas for both classes differ significantly and there is an equal number of elements in each class. Interesting conclusions can be drawn when we compare the results for both methods. When both classes have the same number of elements, the results do not differ much, as we can see from comparing Tables 1 and 4.
sett1   realt1     estt1      t1err      sett2   realt2    estt2     t2err     mcsr
-10     -10.7702   -11.9492   0.21367    10      9.8748    12.1005   0.23286   0.13607
-5      -4.8022    -8.5028    0.40123    5       5.1755    7.8907    0.39032   0.2503
-2      -2.0537    0.062831   0.99563    2       1.695     -3.1665   2.0979    0.48491
-1      -1.0089    0.28002    0.85823    1       1.0399    -1.0833   2.1562    0.49172
1       0.98271    -0.08281   0.09818    10      10.431    19.8188   0.35574   0.29877
Table 1. The classification LH method: 1. class – 1000 obs., 2. class – 1000 obs.
sett1   realt1     estt1      t1err      sett2   realt2    estt2     t2err     mcsr
-10     -10.1594   -10.5452   0.46136    10      10.5585   12.9235   0.24226   0.14298
-5      -4.4376    -3.3694    0.212      5       5.179     12.4831   0.31789   0.30568
-2      -2.5085    -2.9056    0.15296    2       2.1072    11.1939   0.34708   0.43882
-1      -0.78311   -3.1623    0.1849     1       1.1188    10.4174   0.33062   0.48518
1       1.0959     2.243      0.070913   10      10.22     25.6981   0.42573   0.36895
Table 2. The classification LH method: 1. class – 500 obs., 2. class – 1500 obs.
sett1   realt1     estt1     t1err      sett2   realt2    estt2     t2err     mcsr
-10     -10.9853   1.8846    0.1027     10      9.8389    25.4142   0.41749   0.3697
-5      -6.6909    -0.0519   0.094324   5       4.9788    17.3537   0.3045    0.43572
-2      -2.5938    -1.9211   0.094435   2       2.0012    12.7836   0.25857   0.49903
-1      -1.3209    -2.8983   0.15987    1       0.88916   10.7516   0.30793   0.47808
1       1.0679     4.4596    0.068371   10      9.7081    32.3503   0.43988   0.48153
Table 3. The classification LH method: 1. class – 200 obs., 2. class – 1800 obs.
sett1   realt1     estt1      t1err      sett2   realt2    estt2      t2err      mcsr
-10     -10.9925   -10.1026   0.1046     10      10.4629   10.0965    0.13175    0.1368
-5      -5.1323    -4.7056    0.25616    5       5.1145    4.9025     0.26838    0.25886
-2      -2.2021    -0.57051   0.13011    2       2.3694    0.56571    0.13322    0.41181
-1      -1.117     -0.07374   0.022984   1       0.72939   0.015279   0.023421   0.46629
1       0.79766    0.80217    0.07518    10      10.2623   9.9037     0.15549    0.2945
Table 4. Mixture approach: 1. class – 1000 obs., 2. class – 1000 obs.
sett1   realt1     estt1      t1err      sett2   realt2    estt2     t2err      mcsr
-10     -10.0375   -10.1988   0.19223    10      10.3235   9.8283    0.092179   0.10194
-5      -4.4221    -2.0814    0.12955    5       5.0211    6.4077    0.066078   0.2111
-2      -2.0111    0.60028    0.13027    2       2.1439    1.4452    0.14381    0.48047
-1      -1.0179    0.47399    0.031808   1       0.95      0.5842    0.029809   0.46427
1       0.67266    2.4578     0.08062    10      10.1426   11.5999   0.095519   0.21468
Table 5. Mixture approach: 1. class – 500 obs., 2. class – 1500 obs.
sett1   realt1     estt1     t1err      sett2   realt2   estt2     t2err      mcsr
-10     -10.2067   -5.2113   0.61484    10      9.7521   10.6831   0.069435   0.055042
-5      -5.1065    1.6822    0.14277    5       4.7872   6.7742    0.17868    0.27028
-2      -1.4504    1.4289    0.058705   2       1.8088   1.7605    0.069099   0.46251
-1      -2.2055    0.67689   0.044399   1       1.0854   0.82425   0.04326    0.45799
1       -0.329     5.1571    0.098023   10      9.3967   12.6939   0.096888   0.23052
Table 6. Mixture approach: 1. class – 200 obs., 2. class – 1800 obs.
Estimates of theta using the obtained classification are closer to the assumed values for the mixture approach. But when the classes are significantly different with respect to the number of elements, the mixture approach performed better than the classification likelihood method. Both estimation errors and misclassification rates are smaller in Tables 5 and 6 than in the corresponding Tables 2 and 3.
4.2 Example from Financial Market
Next we present the results of the proposed methods for data from selected financial markets. We used data for two pairs of stock market indices, namely WIG and WIG20 (indices of the Warsaw Stock Exchange), and WIG and DAX. We took daily logarithmic returns from the period 11.01.1999 – 2.03.2005 (1449 bivariate observations). As marginal distributions the empirical distribution functions were used. Four different copula functions were used. The results are presented in Tables 7 and 8.
Copula            Class   Classification θ   Mixture θ
Clayton           1       20.8996            13.7452
Clayton           2       0.7373             0.5830
Ali-Mikhail-Haq   1       0.9999             0.9999
Ali-Mikhail-Haq   2       -0.9999            0.9986
Gumbel            1       12.4563            5
Gumbel            2       1.3270             1.5
Frank             1       50.8139            40.5634
Frank             2       1.7612             1.7531
Table 7. Estimates of θ for given copulas after classification for WIG and WIG20
From the analysis of Tables 7 and 8 one can draw the conclusion that both methods classify the given data into a class with high dependence of the indices (high value of the parameter theta) and a class with low or even negative dependence of the indices (low value of the parameter theta).
Copula            Class   Classification θ   Mixture θ
Clayton           1       7.3508             0.7335
Clayton           2       0.1162             0.2104
Ali-Mikhail-Haq   1       0.98838            0.83321
Ali-Mikhail-Haq   2       -1                 0.66961
Gumbel            1       4.1053             5
Gumbel            2       1                  1.5
Frank             1       12.6373            2.7057
Frank             2       -2.1192            1.6083
Table 8. Estimates of θ for given copulas after classification for WIG and DAX
For each copula one can see that the dependence between WIG and WIG20 is much higher than between WIG and DAX, especially when comparing the results for the first class. There is a similar pattern when one considers the second, lower-dependence classes for each copula. This pattern is most evident for the Clayton and Frank copulas. Although the proposed method has proved useful, more studies are still needed. The important problems to be solved are:
– the selection of the best copula function;
– more empirical studies comparing different model based clustering methods;
– coping with local optima problems in the algorithms.
References
BANFIELD, J.D. and RAFTERY, A.E. (1993): Model-Based Gaussian and Non-Gaussian Clustering. Biometrics, 49, 803–821.
SCOTT, A.J. and SYMONS, M.J. (1971): Clustering Methods Based on Likelihood Ratio Criteria. Biometrics, 27, 387–397.
SKLAR, A. (1959): Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut de Statistique de l'Université de Paris, 8, 229–231.
WOLFE, J.H. (1970): Pattern Clustering by Multivariate Mixture Analysis. Multivariate Behavioral Research, 5, 329–350.
Attribute-aware Collaborative Filtering
Karen Tso and Lars Schmidt-Thieme
Computer-based New Media Group (CGNM), Institute for Computer Science, University of Freiburg, 79110 Freiburg, Germany
Abstract. One of the key challenges in large information systems such as online shops and digital libraries is to discover the relevant knowledge from the enormous volume of information. Recommender systems can be viewed as a way of reducing large information spaces and to personalize information access by providing recommendations for information items based on prior usage. Collaborative Filtering, the most commonly-used technique for this task, which applies the nearest-neighbor algorithm, does not make use of object attributes. Several so-called content-based and hybrid recommender systems have been proposed, that aim at improving the recommendation quality by incorporating attributes in a collaborative filtering model. In this paper, we will present an adapted as well as two novel hybrid techniques for recommending items. To evaluate the performances of our approaches, we have conducted empirical evaluations using a movie dataset. These algorithms have been compared with several collaborative filtering and non-hybrid approaches that do not consider attributes. Our experimental evaluations show that our novel hybrid algorithms outperform state-of-the-art algorithms.
1 Introduction
Recommender systems use collaborative filtering to generate recommendations by predicting what users might be interested in, given some user's profile. It is commonly used as a customization tool in e-commerce and is seen as a personalization technology. Unlike the conventional approach where all users view identical recommendations, a recommender system further personalizes these recommendations such that each user will receive customized recommendations that suit his/her tastes. A few prominent online commercial sites (e.g., amazon.com and ebay.com) offer this kind of recommendation service. Two prevailing approaches to developing these systems are Collaborative Filtering (CF; Goldberg et al. 1992) and Content-Based Filtering (CBF). There are two different recommendation tasks typically considered: (i) predicting the ratings, i.e., how much a given user will like a particular item, and (ii) predicting the items, i.e., which N items a user will rate, buy or visit next (topN). As most e-commerce applications deal with implicit ratings, the latter seems to be the more important task and we will focus on it for the rest of the paper.
In CF, recommendations are generated first by computing the similarities between users' profiles to identify a set of users, called the "neighborhood", pertaining to a particular user's profile. Usually, the similarities between the profiles are measured using Pearson's Correlation or Vector Similarity. Finally, the recommendations are derived from this neighborhood. One technique for generating the topN recommendations is the Most-Frequent Recommendation (Sarwar et al. 2000), where the frequency of all items of the neighborhood is considered and the N items with the highest frequency are returned. There are two general classes of CF algorithms — Memory-based (User-Based) and Model-based (Resnick et al. 1994; Breese et al. 1998; Sarwar et al. 2000). User-Based CF is one of the most successful and prevalent techniques used in recommender systems. The entire database is employed to compute the similarities between users. Using this similarity, a dualistic form of the User-Based CF called the Item-Based topN algorithm emerged (Deshpande and Karypis 2004). It uses the items instead of the users to determine the similarities. The Item-Based CF has been claimed to significantly outperform the User-Based CF. On the other hand, the model-based CF builds a model by learning from the database (Breese et al. 1998; Aggarwal et al. 1999). In CBF methods, each user is defined by the associated features of his/her rated items. These features are usually the attributes or description of the object. In contrast to CF techniques, CBF recommends items to users based solely on the historical data from the users (Balabanovic and Shoham 1997; Burke 2002; Ziegler et al. 2004). Since attributes usually contain meaningful and descriptive information about objects, there have been attempts at combining these two approaches, so-called hybrid approaches, to gain better performance. In this article, we will introduce three methods which incorporate item attributes and focus on the topN recommendation algorithm. Our first two techniques use the standard hybrid model by combining content-based and collaborative filtering. Our third technique integrates attributes directly into collaborative filtering, instead of incorporating attributes via a content-based submodel.
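As a concrete illustration of User-Based CF with Vector Similarity and Most-Frequent topN recommendation, consider the following minimal sketch; the binary rating matrix and the parameter values are invented for the example, and the code is not the implementation used in this paper.

```python
# User-Based CF: find the k most similar users (cosine/vector similarity) and
# recommend the N items that occur most frequently among them.
import numpy as np

def user_based_topn(R, user, k=2, N=2):
    norms = np.linalg.norm(R, axis=1)
    norms[norms == 0] = 1.0
    sims = (R @ R[user]) / (norms * norms[user])   # vector similarity to all users
    sims[user] = -np.inf                           # exclude the active user
    neighbors = np.argsort(sims)[::-1][:k]         # k most similar users
    freq = R[neighbors].sum(axis=0).astype(float)  # item frequencies in the neighborhood
    freq[R[user] == 1] = -np.inf                   # do not recommend already known items
    return np.argsort(freq)[::-1][:N]              # N most frequent items

R = np.array([[1, 1, 0, 0, 1],    # toy binary user-item matrix
              [1, 0, 1, 0, 1],
              [0, 1, 1, 1, 0],
              [1, 1, 0, 1, 0]])
print(user_based_topn(R, user=0, k=2, N=2))
```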
2 Related Work
There are many ways of incorporating attributes into collaborative filtering. One of the first hybrid recommender systems is Fab (Balabanovic and Shoham 1997). Its recommender engine first identifies items (pages) on a current attribute (topic). It then receives highly rated items from the user's similar neighbors and discards items that have already been seen by the user. When the user rates a new item, his/her profile will be updated and this information will be passed on to his/her neighbors. One of the simplest hybrid approaches is a weighted linear combination of CBF and CF predictions (Claypool et al. 1999). A few others attempted to use
the inductive learning approach. For instance, Basu et al. (1998) considered recommendation as a classification problem and used hybrid features to predict whether a user will like or dislike an item. The learning task can also be treated by developing a kernel that learns a mapping from user-item pairs to a set of ratings (Basilico and Hofmann 2004). Another method is to learn a vector of weighted attributes using the Winnow algorithm (Pazzani 1999). CF is then applied using the matrix containing the weight of each user's content-based profile, instead of using the rating matrix. Melville et al. (2002) followed a two-stage approach: first they applied a naïve Bayesian classifier as a content-based predictor to complete the rating matrix, then they re-estimated ratings from this full rating matrix by CF. In our paper, we have selected Melville's model in its adapted form as our hybrid baseline model.
3 Hybrid Attribute-aware CF Methods
We propose three effective attribute-aware collaborative filtering methods:
• Sequential CBF and CF (adapted content-boosted CF),
• Joint Weighting of CF and CBF, and
• Attribute-aware Item-Based CF.
All three approaches recommend the topN items that occur with the highest frequency among the neighboring items. Similarity between two users is computed using Vector Similarity. The first two algorithms apply the CBF and CF paradigms in two separate processes before combining them at the point of prediction. Our third approach, however, does not employ a CBF algorithm; instead, item attributes are directly incorporated at the model-building stage.
Sequential CBF and CF Our first approach, termed "Sequential CBF and CF", is an adapted form of Melville's original hybrid model — Content-Boosted Collaborative Filtering (CBCF) (Melville et al. 2002). We do not use the CBCF directly because the original model is intended for predicting ratings, whereas this paper focuses on the topN problem. Hence, the CBCF is adapted such that it recommends N items to the user instead of inferring the rating of an item. This model is used as our hybrid baseline for evaluating the other two approaches. Recommendations are generated using CF. CBCF first uses a naïve Bayesian classifier to build a content-based model for each user. Next, a full matrix is formed by combining the actual ratings and the predicted ratings learned from the CBF predictor. The adaptation takes place when applying CF: instead of finding the weighted sum of ratings of other users to compute the prediction ratings for the current user, the full matrix is sparsified by considering solely items with high ratings.
Fig. 1. CF and CBF processes done in sequence
Fig. 2. CF and CBF processes done in parallel
Joint Weighting of CF and CBF Similarly, our second approach also applies both CBF and CF. Again, a naïve Bayesian classifier is utilized here. However, instead of inferring the class or rating of an item based on attributes, it predicts how much a user will like the attributes. Let
• U be a set of users,
• I be a set of items,
• B be a set of (binary) item attributes,
• Di,b ∈ {0, 1} specify whether item i ∈ I has attribute b ∈ B,
• Ou,i ∈ {0, 1} specify whether item i ∈ I occurred with user u ∈ U (i.e., u has rated/bought/visited item i).

p̂cb(Ou,· = 1 | D·,b, b ∈ B) := (1/k) · P(Ou,·) · ∏_{b∈B} P(D·,b | Ou,·),   (1)

where k := P(D·,b, b ∈ B). Unlike the first approach, where the two processes are done sequentially (content-based first, then CF), the order of these processes is unimportant here and the two serve as complementary views of each other. Equation 1 generates predictions using attributes (CBF) and this is joined with the outputs of CF by computing the geometric mean of the outputs. This mean combination is then used for performing the topN prediction.

p̂(Ou,i = 1) ∼ p̂cb(Ou,i = 1)^λ · p̂cf(Ou,i = 1)^(1−λ),  with λ ∈ [0, 1]   (2)
where λ is used to weight the content-based and collaborative methods, e.g., for λ = 0, we get pure collaborative filtering and for λ = 1, pure content-based filtering.
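A minimal sketch of the combination step in equation (2) is given below; the per-item score vectors p_cb and p_cf are assumed to be precomputed by the content-based and collaborative components, and the numbers are purely illustrative (λ = 0.15 is the value later reported in Table 1).

```python
# Weighted geometric mean of content-based and collaborative scores, then topN.
import numpy as np

def joint_weighting_topn(p_cb, p_cf, lam=0.15, N=10):
    combined = (p_cb ** lam) * (p_cf ** (1.0 - lam))   # equation (2)
    return np.argsort(combined)[::-1][:N]

p_cb = np.array([0.10, 0.40, 0.05, 0.30])   # illustrative content-based scores for one user
p_cf = np.array([0.20, 0.10, 0.50, 0.30])   # illustrative collaborative scores for one user
print(joint_weighting_topn(p_cb, p_cf, lam=0.15, N=2))
```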
Attribute-aware Item-Based CF Our third approach extends the Item-Based topN CF (Deshpande and Karypis 2004). Rather than using CBF algorithms, it exploits the content/attribute information by computing the similarities between items using attributes and thereupon combining them with the similarities between items computed from the user ratings. This is shown in Equations 3 and 4, where isimratings corresponds to the item similarities computed using Vector Similarity with the ratings and isimattributes to those computed with the attributes.

isimattributes(i, j) := ⟨Di,·, Dj,·⟩ / (‖Di,·‖2 ‖Dj,·‖2)   (3)

isimcombined := (1 − λ) isimratings + λ isimattributes,  with λ ∈ [0, 1]   (4)
Again, λ is used to adjust the corresponding weight on CBF and CF. In this case, setting λ to 0 is the same as computing Pure Item-Based.
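Equations (3) and (4) can be sketched as follows; R denotes a user × item rating matrix and D the item × attribute matrix introduced above, and the code is an illustration rather than the implementation evaluated in Section 4.

```python
# Item-item cosine similarities from ratings and from attributes, combined with
# weight lambda as in equation (4).
import numpy as np

def cosine_sim(M):
    norms = np.linalg.norm(M, axis=1)
    norms[norms == 0] = 1.0
    return (M @ M.T) / np.outer(norms, norms)

def combined_item_sim(R, D, lam=0.05):
    isim_ratings = cosine_sim(R.T)      # items are the columns of R
    isim_attributes = cosine_sim(D)     # equation (3)
    return (1 - lam) * isim_ratings + lam * isim_attributes   # equation (4)

R = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1]])   # 3 users x 3 items (toy data)
D = np.array([[1, 0], [1, 1], [0, 1]])            # 3 items x 2 attributes (toy data)
print(np.round(combined_item_sim(R, D, lam=0.05), 2))
```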
4 Evaluation and Experimental Results
In this section, we present the evaluation of our three attribute-aware recommendation algorithms and compare their performance with various non-hybrid baseline models, as well as with the Sequential CBF-CF as a comparison against an existing hybrid model. The non-hybrid models we have selected are: Most Popular, Pure CF and Pure CBF. Most Popular is the most basic model that simply returns the N most frequently rated items over all users, i.e., it is not personalized. Pure CF corresponds to the classical User-Based CF. Pure CBF uses the naïve Bayesian classifier as predictor, as shown in Equation 1, and applies it to the topN problem by returning the N items which contain the attributes the user likes most. We evaluated the performance of our algorithms with the data obtained from MovieLens (ml; MovieLens 2003), which corresponds to movie ratings. The ratings are expressed on a 5-point rating scale and indicate how much a user likes a movie. Since our algorithms do not take the actual ratings into account, the ratings are treated as a binary indicator of whether the user has seen a movie. We have chosen the ml dataset containing approximately one million ratings of 3592 movies made by 6,040 users. In addition, the genres of each movie are provided. There are in total 18 different genres for the ml dataset. The genres of each movie, which are identical to the ones provided by the Internet Movie Database (IMDB), are selected as the content information/attributes for each item. The dataset is split into an 80% training set and a 20% testing set by randomly assigning the non-zero entries of rows from the rating matrix to the testing set. The quality of these predictive models is measured by comparing the recommendations (topN set) predicted using the training data against the actual items from the testing set.
Fig. 3. F1 of different recommendation algorithms
The experiments are tested on ten random subsets of the ml dataset with 1000 users and 1500 items each. The results we present here are the average of the ten random trials. Metrics Our paper focuses on the topN problem, which is to predict a fixed number of top recommendations and not the ratings. Suitable evaluation metrics are Precision and Recall. Similar to Sarwar et al. (2000), our evaluations consider any item in the topN set that matches any item in the testing set as a “hit”. F1 measure is also used to combine Precision and Recall into a single metric. Number of hits Number of recommendations Number of hits Recall = Number of items in test set 2 ∗ Precision ∗ Recall F1 = Precision + Recall
Precision =
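These metrics can be computed directly from the topN set and the test set of a single user, as in the following small sketch (our illustration); averaging over users is omitted.

```python
# Precision, Recall and F1 for one user's topN recommendations.
def precision_recall_f1(topn, test):
    hits = len(set(topn) & set(test))                 # items that appear in both sets
    precision = hits / len(topn) if topn else 0.0
    recall = hits / len(test) if test else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

print(precision_recall_f1(topn=[3, 7, 11], test=[7, 2, 11, 5]))  # (0.667, 0.5, 0.571)
```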
Experiment Results The results of the average of ten random trials are presented in Figure 3. The parameters selected for each algorithm are shown in Table 1. They are selected to be optimal for our algorithms by means of grid search. Additional parameters, threshold and max, for the Sequential CBF-CF are set to 50 and 2, respectively, as chosen in the original model. Comparing the performance achieved by our two novel hybrid algorithms, we can see that Attribute-aware Item-Based CF and Joint Weighting CF-CBF outperform the other classical models. The results of the CBF and Sequential CBF-CF models are far below the baseline Most Popular model.
Name                  Neighborhood Size   λ
joint weight CF-CBF   90                  0.15
attr-item CF          400                 0.05
item based            400                 -
sequential cb-cf      90                  -
user-based            90                  -
Table 1. The parameters chosen for the respective algorithms.
Although Melville et al. (2002) reported that CBCF performed better than User-Based and Pure CBF for ratings, it fails to provide quality topN recommendations for items in our experiments. Thus, we focus our discussion mainly on our other two algorithms. To evaluate the immediate effect on the quality of recommendations after the incorporation of attributes, we compare the Attribute-aware Item-Based CF and Joint Weighting CF-CBF methods with their base algorithms that do not consider attributes. Although only 18 attributes are used, our Attribute-aware Item-Based and Joint Weighting CF-CBF already show notably good results. As we can see from Figure 3, the performance increases by about 5.7% after introducing attributes into its base algorithm, Item-Based topN. Integrating attributes using the Joint Weighting CF-CBF method gives even better performance. As this model is derived from the CF and CBF models, it performs approximately 14% better than CF and shows an increase of more than 100% in comparison with the CBF method. Furthermore, the Joint Weighting CF-CBF algorithm has the smallest standard deviation (5.26%), which indicates that the results from this model are reasonably reliable.
5 Conclusions and Future Works
The aim of this paper is to improve the quality of topN recommendations by enhancing CF techniques with content information of items. We have proposed three different hybrid algorithms: one is an adapted form of an existing hybrid model (Sequential CBF-CF), and the other two are novel hybrid models, Attribute-aware Item-Based and Joint Weighting CF-CBF. We have shown that our two novel hybrid models give the best performance in comparison with the Most Popular, User-Based, Item-Based, Content-Based and Sequential CBF-CF models. Incorporating a small number of attributes already gives reasonably significant results; we can anticipate that providing more informative attributes should increase the quality of recommendations accordingly. Experiments with more attributes, as well as tests of the algorithms on various larger datasets, are planned for future work.
References
AGGARWAL, C. C., WOLF, J. L., WU, K.-L. and YU, P. S. (1999): Horting hatches an egg: A new graph-theoretic approach to collaborative filtering. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York.
BALABANOVIC, M. and SHOHAM, Y. (1997): Fab: Content-based, collaborative recommendation. Commun. ACM 35, 66-72.
BASILICO, J. and HOFMANN, T. (2004): Unifying collaborative and content-based filtering. In Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004.
BASU, C., HIRSH, H. and COHEN, W. (1998): Recommendation as classification: Using social and content-based information in recommendation. In Proceedings of the 1998 Workshop on Recommender Systems. AAAI Press, Reston, Va., 11-15.
BILLSUS, D. and PAZZANI, M. J. (1998): Learning collaborative information filters. In Proceedings of ICML, 46-53.
BREESE, J. S., HECKERMAN, D. and KADIE, C. (1998): Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI-98). G. F. Cooper and S. Moral, Eds. Morgan Kaufmann, San Francisco, Calif., 43-52.
BURKE, R. (2002): Hybrid Recommender Systems: Survey and Experiments. User Modeling and User Adapted Interaction, 12/4, 331-370.
CLAYPOOL, M., GOKHALE, A. and MIRANDA, T. (1999): Combining content-based and collaborative filters in an online newspaper. In Proceedings of the SIGIR-99 Workshop on Recommender Systems: Algorithms and Evaluation.
DESHPANDE, M. and KARYPIS, G. (2004): Item-based top-N recommendation algorithms. ACM Transactions on Information Systems, 22/1, 143-177.
GOLDBERG, D., NICHOLS, D., OKI, B. M. and TERRY, D. (1992): Using collaborative filtering to weave an information tapestry. Commun. ACM 35, 61-70.
MELVILLE, P., MOONEY, R. J. and NAGARAJAN, R. (2002): Content-Boosted Collaborative Filtering for Improved Recommendations. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI-2002), pp. 187-192, Edmonton, Canada, July 2002.
MITCHELL, T. (1997): Machine Learning. McGraw-Hill, New York, NY.
MOVIELENS (2003): Available at http://www.grouplens.org/data.
PAZZANI, M. J. (1999): A framework for collaborative, content-based and demographic filtering. Artificial Intelligence Review, 13(5-6), 393-408.
RESNICK, P., IACOVOU, N., SUCHAK, M., BERGSTROM, P. and RIEDL, J. (1994): GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of the 1994 Conference on Computer Supported Collaborative Work. R. Furuta and C. Neuwirth, Eds. ACM, New York, 175-186.
SARWAR, B. M., KARYPIS, G., KONSTAN, J. A. and RIEDL, J. (2000): Analysis of recommendation algorithms for E-commerce. In Proceedings of the 2nd ACM Conference on Electronic Commerce (EC'00). ACM, New York, 285-295.
ZIEGLER, C., SCHMIDT-THIEME, L. and LAUSEN, G. (2004): Exploiting Semantic Product Descriptions for Recommender Systems. Proceedings of the 2nd ACM SIGIR Semantic Web and Information Retrieval Workshop (SWIR '04), July 25-29, 2004, Sheffield, UK.
Towards a Flexible Framework for Open Source Software for Handwritten Signature Analysis
Richard Guest1, Mike Fairhurst1, and Claus Vielhauer2
1 University of Kent, Canterbury, CT2 7NT, UK
2 Otto-von-Guericke Magdeburg University, 39016 Magdeburg, Germany
Abstract. The human signature is still the most widely used and accepted form of personal authorisation and verification applied to documents and transactions. In this paper we describe the design and implementation of a flexible and highly configurable framework for the experimental construction and performance investigation of biometric signature verification systems. We focus on a design approach which encompasses a general process model for automatic signature processing, reference instances relating to specific data parsing, feature extraction and classification algorithms and detail the provision of a framework whereby unique signature systems can be easily constructed, trialed and assessed.
1
Introduction
The human signature has a long history of usage for personal authentication. Indeed, it is still the most widely used legally admissible technique for transaction and document authorisation (Jain et al.). Despite this long history of usage, conventional visual methods of authenticity assessment are prone to forgery and fraud - a situation recognised by credit card companies and banks as they move to alternative systems for consumer transactions (such as chip-and-pin). Modern automatic biometric signature systems assess the constructional ("on-line") aspects of the signature (for example timing, velocity and pen rhythms) alongside the conventional ("off-line") assessment of the drawn signature image, utilising data captured on a graphics tablet or dedicated signature device (Plamondon and Srihari). These devices capture data in a pen position (X/Y) and pressure format at a constant sampling frequency. Over the last five years, signature verification systems have also found a use in securing mobile devices such as PDAs, Tablet PCs and mobile phones. Other recent device developments include a number of systems which collect data from accelerometers mounted in a pen, which removes the need for a fixed capture surface. Additional signal types for on-line signature verification, such as pen altitude and pen azimuth signals, have also been explored (Hangai et al.). As mobile and 'novel' devices become more prevalent, addressing the issues of verification on these platforms will
become an even greater focus of research, as will the applications in which they are used. Research in the field of automatic signature verification can be seen to be concentrated, at the technological level, on two main strands: (a) the development of signature measurement features and novel methods of assessing the physical and constructional properties of a signature. Reported techniques within this first strand include pen direction and distance encoding, velocity and dynamic profiling, signature shape features, force and pressure characteristics, and spectral and wavelet analysis (Nakanishi et al.). Studies have also analysed features derived from the classical assessment of signatures in the forensic community. (b) The development of methods for selecting and combining feature measurements and verifying/identifying signature ownership. Techniques within this strand of research include multiple classifier structures and decision fusion algorithms, PCA, neural networks, probabilistic classification, dynamic time warping/matching and Hidden Markov Modeling (Rabiner). Important studies have also focused on issues such as enrolment strategies, template storage and update, forgery assessment, and so on (see, for example, Plamondon and Srihari). This diversity of work illustrates the need for appropriate software tools to support and facilitate effective future research in automatic signature verification.
2
An Experimental Framework
The motivation for the development of a flexible framework for the implementation, investigation and evaluation of signature systems is born out of two major issues within the research community. As further research is carried out into new systems and algorithms, so the diversification of standards and reference systems increases. Until recently, no standard reference signature database was available, making it impossible to accurately compare system performance; due to the sensitivity of the data, many research groups and institutions keep signature test sets private, storing data in a range of proprietary formats. Likewise, due to proprietary requirements, there are as yet no standard reference systems to provide a baseline for performance comparison. A general purpose framework is currently being developed within which researchers can efficiently implement, investigate and evaluate techniques and system components for signature verification. Its key feature is the implementation of a software toolbox to facilitate investigation of both on-line and off-line signature processing, providing open source reference software to the research community and thereby enabling the system to be contributed to by third-party developers. The key design ethos of the system is the simple configuration of system modules (pre-processing, feature extraction and selection, classifiers and storage) for performance evaluation with normalised performance metrics, thereby providing a standard against which other systems can be measured. The principal characteristics of the experimental
Fig. 1. Framework Subsections
framework include: a modular design of system components with respect to handwriting/signature analysis; the means to collect/import test data from diverse sources; the means to evaluate semantics beyond signatures (e.g. hand-written passwords, pass-phrases or personal identification numbers); and support for reproducible evaluation and exchange (and reuse) of module instances.
3
Framework Implementation
The design of the open source framework allows for flexibility in system implementation and restructuring alongside the addition of new modules and techniques. The latter is achieved through the release into the public domain of data structures containing the input and output formats for modules. The framework can be defined as consisting of six software subsections; the relationship between each subsection can be seen in Figure 1. Each of the subsections has a defined input and output data class structure, enabling the development and integration of third-party routines for use within the framework. To fully embrace the open source nature of the project, the framework is constructed using GNU CC. Each module instance is implemented as a Linux dynamic library, which allows the framework controller to perform dynamic binding, thereby producing an optimally compiled system for each configuration.
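As an illustration of this plug-in style of composition (the framework itself uses C and Linux shared objects; the Python sketch below is only an analogy, and the module and function names are hypothetical), a controller can resolve the modules named in a configuration at run time and chain them:

import importlib

# Hypothetical module chain for one experimental configuration; each named module
# is expected to expose a process(data) function operating on the shared data structure.
PIPELINE = ["parser_svc2004", "prefilter_lowpass", "features_basic", "classifier_levenshtein"]

def run_pipeline(raw_input, module_names=PIPELINE):
    data = raw_input
    for name in module_names:
        module = importlib.import_module(name)  # late ("dynamic") binding of the module instance
        data = module.process(data)             # each stage consumes and produces the common structure
    return data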
Fig. 2. Data parsing class structure
3.1
Data Parsing
Typically, a single signature sample captured from a subject is stored in an individual text file. Currently, there are no standard methods for storing data captured from a signature/writing device, with most research groups and commercial software companies using their own proprietary data formats. This is one of the primary motivations for the development of a modality interchange format currently being undertaken by ISO/IEC (NIST). This subsection of the framework parses data adhering to a particular format into the internal data structure of the framework. For experimentation purposes and as a conceptual proof of our design, a parser has been constructed to read files in the SVC 2004 format, one of the most widely available and used signature databases in recent years (Yeung et al.). This format stores the signature as a series of timestamped sample points comprising x and y position and pen pressure values. The data structure comprises five data classes; the relationship between these can be seen in Figure 2. CDataSet is the parent class containing such information as the size of the signature capture file, the date and time of the sample capture and the semantic class (signature, drawing, other, etc.). CPData contains information about the test subject, while CDevice details the technical details of the capture device. Definitions of these classes can be reused across sample instances. CSourceData contains the raw sample data parsed from the input file, while CData contains normalised values. 3.2
Preprocessing
Due to the wide range of capture devices providing the signature data, pre-processing prior to feature extraction is often necessary. In this subsection, data stored in the standard framework structure defined above is pre-processed and then stored back into the same structure. Common pre-processing routines such as low-pass filtering and spatial and temporal interpolation are implemented in the initial module set.
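As a rough illustration of these two stages (the exact column layout of SVC 2004 capture files and the framework's C data classes are not reproduced here; the column order and millisecond timestamps below are assumptions), a minimal parser plus temporal resampling and smoothing might look as follows:

import numpy as np

def parse_signature(path):
    """Read a whitespace-separated capture file into arrays (assumed column order: x, y, t, pressure)."""
    rows = np.loadtxt(path, skiprows=1)                      # the first line is assumed to hold the point count
    return rows[:, 2], rows[:, 0], rows[:, 1], rows[:, 3]    # t, x, y, p

def resample(t, x, y, p, rate_hz=100.0):
    """Linear temporal interpolation onto a fixed sampling grid."""
    t_new = np.arange(t[0], t[-1], 1000.0 / rate_hz)         # assumes timestamps in milliseconds
    return t_new, np.interp(t_new, t, x), np.interp(t_new, t, y), np.interp(t_new, t, p)

def lowpass(signal, window=5):
    """Simple moving-average low-pass filter."""
    return np.convolve(signal, np.ones(window) / window, mode="same")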
Fig. 3. Feature extraction class structure
3.3
Feature Extraction and Selection
Feature modules individually extract the on-line and off-line performance characteristics from a signature sample (for example the width of the signature or the time taken to produce it). Separate features are implemented in individual modules, increasing the flexibility in system construction. Selection of which features form templates or are presented to classifiers is also performed in this section. For experimentation, routines computing a variety of statistical features have been implemented alongside basic selection configuration. Output from the feature extraction and selection module is stored in a hierarchical class structure represented in Figure 3. At the lowest level, a collection of separate Feature Data (from different feature extraction modules) can be grouped into a Feature Vector. The defined class structure allows a complete signature capture file to be segmented as a collection of feature vectors and also allows for multiple vectors per segment. A collection of these Feature Segments is brought together to form a Feature Set. Under this scheme, Feature Extraction modules have the freedom to ignore one or more segments of the original sample (for example, an investigator may only be interested in the first n seconds of all files). Selection of features, vectors and segments is defined in the configuration file and managed by the framework controller. Two types of classification systems are allowed within the framework design. In the context of signature biometrics, Reference Storage Systems extract a feature set from a series of training signatures and store it in feature-space and/or template form (in the Reference Storage subsection). A comparison can then be made between test and training data, with the output being an (optionally normalised) matching score based on the distance between the testing and training sets. Training and testing data are formed using the same Data Parsing, Pre-processing and Feature Extraction chain. For initial experimentation a Levenshtein distance metric has been implemented for this category (Schimke et al.). The second subset, Training-based Systems,
relies upon the use of a training set to configure the internal parameters of a classifier (for example a neural network system). These internal parameters are stored (often in proprietary form) for later classification of testing data. The framework provides a structure for the training of a classifier system and recognition through testing with the trained system. Again, the output is an optionally normalised matching score. The Client Model provides storage for these systems. For initial experimentation an HMM-based (Hidden Markov Model) classification system has been implemented. 3.4
Framework Controller
The Framework Controller provides the control, configuration and reporting mechanisms for the software subsections, or modules. Prerequisite and compatibility issues for each of the modules within the framework are defined and are verified by the controller before dynamic binding during configuration, thereby ensuring that incompatible configurations are not selected. Key functions within the framework controller include: the selection of modules to form an experimental system, following a check on the validity of the selected routines; the presentation of a list of available routines to the experimenter; the selection and management of enrolment and training data; and the calculation of an output matching score signifying the match between enrolment and verification data. Systems are configured using a text-file script system, which allows for ease and flexibility in system construction. The script uses a series of keywords to enable the definition of each subsection and is parsed by the framework controller. 3.5
Example System Configurations
Two examples of typical framework implementations are shown below. The first (Figure 4) details a signature system utilising many of the standard routines initially implemented within the framework to assess on-line features from the SVC 2004 database. Following parsing, filtering and interpolation, 15 user-implemented features are extracted and selected from each sample and either used to create a template (training) or to test the system. The framework controller checks whether the selected configuration (as defined by an external script) is valid and handles the presentation of training and testing data. The second example (Figure 5) shows an off-line evaluation system using a neural network training-based classifier. In this configuration, the experimenter has defined and implemented a number of features and parser instances for their own proprietary data format according to the open source data framework. Again, the framework controller assesses compatibility prior to dynamic binding as well as handling the division of training and testing data.
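The keyword syntax of the configuration scripts is not specified in this paper; purely as an illustration of the idea, a controller could map keyword lines such as the hypothetical ones below onto a module chain before checking compatibility and binding:

EXAMPLE_SCRIPT = """
PARSER      svc2004
PREPROCESS  lowpass interpolate
FEATURES    width duration velocity
CLASSIFIER  levenshtein
"""

def parse_config(text):
    """Turn 'KEYWORD value ...' lines into a dictionary the controller can validate."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        keyword, *values = line.split()
        config[keyword.lower()] = values
    return config

# parse_config(EXAMPLE_SCRIPT) ->
# {'parser': ['svc2004'], 'preprocess': ['lowpass', 'interpolate'],
#  'features': ['width', 'duration', 'velocity'], 'classifier': ['levenshtein']}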
Fig. 4. On-line System Configuration Example
Fig. 5. Off-line System Configuration Example
3.6
Future Work and Usage
In this paper, we have introduced a novel design and implementation of an open and flexible framework for the evaluation of on-line signature verification modules. We have further introduced an initial set of reference modules for data parsing and feature extraction, and have shown two exemplary system configurations. It is envisaged that the framework will be of benefit to the signature verification community through the provision of both an experimental system for development and investigation and, through a standardised framework configuration, a reference system for performance comparison. The flexibility in configuration and the open source nature of the specification mean that additional feature routines and classifiers adhering to the system standard are easy to implement. This widens the scope of the system's use beyond signature verification to other handwritten forms (such as drawings and forensic writing investigations) and even to other time-based
measurement systems. In the short term, experimentation with the developed framework will be conducted as part of the EU BioSecure activities (BIOSECURE) focussing on an investigation of optimum system configuration and authoring of additional features and pre-processing modules. Acknowledgements The work described in this paper has been supported in part by the European Commission through the IST Programme under Contract IST-2002-507634 BIOSECURE.
References
BIOSECURE: BioSecure Network of Excellence, http://www.biosecure.info
HANGAI, S. et al. (2000): On-Line Signature Verification based on Altitude and Direction of Pen Movement. In: Proceedings of the IEEE International Conference on Multimedia and Expo, 1, 489–492.
JAIN, A.K. et al. (1999): Biometrics: Personal Identification in Networked Society. The Kluwer International Series in Engineering and Computer Science, Vol. 479, Springer, New York.
NAKANISHI, I. et al. (2004): On-line signature verification based on discrete wavelet domain adaptive signal processing. Proc. Biometric Authentication, LNCS 3072, 584–591.
NIST: The National Institute of Standards and Technology, Common Biometric Exchange File Format (CBEFF), http://www.itl.nist.gov/div895/isis/bc/cbeff/
MARTENS, R. and CLAESEN, L. (1996): On-Line Signature Verification by Dynamic Time Warping. In: Proceedings of the 13th IEEE International Conference on Pattern Recognition, Vienna, Austria, 1, 38–42.
PLAMONDON, R. and SRIHARI, S.N. (2000): On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey. IEEE Trans. PAMI, 22(1), 63–84.
RABINER, L.R. (1989): A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77(2), 257–286.
SCHIMKE, S. et al. (2004): Using Adapted Levenshtein Distance for On-line Signature Verification. Proc. IEEE International Conference on Pattern Recognition (ICPR), Vol. 2, 931–934.
YEUNG, D.Y. et al. (2004): SVC2004: First International Signature Verification Competition. Proc. International Conference on Biometric Authentication (ICBA), 16–2.
Multimodal Biometric Authentication System Based on Hand Features
Nikola Pavešić1, Tadej Savič1, and Slobodan Ribarić2
1 Faculty of Electrical Engineering, University of Ljubljana, 1000 Ljubljana, Slovenia
2 Faculty of Electrical Engineering and Computing, University of Zagreb, 10000 Zagreb, Croatia
Abstract. In this paper we present a multimodal biometric authentication system based on features of the human hand. A new approach to biometric authentication, based on eigen-coefficients of the palm, of the fingers between the first and third phalanx, and of the fingertips, is described. The system was tested on a database containing 10 grey-level images of the left hand and 10 grey-level images of the right hand of 43 people. Preliminary experimental results showed high accuracy of the system in terms of the correct recognition rate (99.49 %) and the equal error rate (0.025 %).
1
Introduction
Biometrics is a scientific discipline that involves methods of recognizing people by their physical and/or behavioral characteristics. The most common physical and behavioral characteristics of a person used for automatic biometric authentication (identification or verification) are the following: fingerprint, hand-geometry, palmprint, face, iris, retina, DNA, ear, signature, speech, keystroke dynamics, gesture and gait (Jain et al. (2004)). Biometric systems based on a single biometric characteristic are referred to as unimodal systems. There are several human and technical factors that influence the performance and operation of a unimodal system; among the most important are the following: universality, uniqueness, permanence, collectability, accuracy, acceptability, circumvention, maturity, scalability and cost. Figure 1 provides a visual comparison of the six most common unimodal authentication systems in terms of the above factors. The human hand contains a wide variety of measurable characteristics that can be used by biometric systems, e.g., the shape of the hand, the dermatoglyphic patterns on the palmar surface of the hand, and the vein patterns on the dorsal surface of the hand; see Figure 2. A single physical or behavioral characteristic of a person can sometimes fail to be sufficient for authentication. For this reason, multimodal biometric systems, i.e., systems that integrate two or more different biometric characteristics, are being developed to increase the accuracy of decisions and to decrease the possibility of circumventing an authentication procedure. Palmprint (print of the hand between the wrist and fingers), digitprints (prints of
Fig. 1. Kiviat graphs of six most common unimodal biometric systems: a) fingerprint; b) palmprint; c) hand-geometry; d) face; e) iris; f) voice. H, M, and L denote high, medium, and low, respectively. The area of the ten-sided polygon of a Kiviat graph reflects the degree of ”successfulness” of the system (better systems are represented with larger area polygons).
fingers between the first and third phalanx) and fingerprints (prints of the fingertips) are particularly convenient for fusion because they can be extracted from a single-shot measurement - a visible image of the palmar surface of the hand. In this paper a three-modal biometric authentication system is described, based on the integration of palmprint, digitprints, and fingerprints features extracted from a single image of the palmar surface of the hand by means of the Karhunen-Loève (K-L) transform. The system operates in parallel mode and integrates information at the matching-score level. The rest of the paper is organized as follows: Section 2 presents related work in the field of palmprint- and fingerprint-based unimodal biometric
Fig. 2. Images of the hand: a) Visible image of the palmar surface of the hand; b) Infrared image of the dorsum of the hand.
systems and hand-based multimodal biometric systems. Section 3 describes the proposed biometric system based on the fusion of palmprint, digitprints and fingerprints features at the system matching-score level. The experimental results on combining the three biometric modalities are presented in Section 4. Conclusions and future research directions are given in Section 5.
2
Related Work
A recent overview of hand-geometry- and palmprint-based unimodal authentication systems, as well as of multimodal hand-based authentication systems, is given by Pavešić et al. (2004). Most reported hand-geometry-based systems involve determining the lengths and widths of the fingers and of parts of the palm at different points on the hand contour, while palmprint-based biometric systems exploit features such as: end- and middle-points of the principal lines, prominent palm-line features, different texture features, orthogonal moments, and coefficients as well as functions of orthogonal transformations. The state of the art in fingerprint recognition is described in detail in the monograph by Maltoni et al. (2003). Most fingerprint-based authentication systems follow minutiae-based approaches, which have reached a high level of refinement but suffer from some serious inherent problems: the difficult automatic extraction of complete ridge structures for a considerable part of the human population, and computationally demanding matching algorithms, especially in cases where two fingerprint representations contain different numbers of minutiae. The developed non-minutiae-based approaches are based on fingerprint grey-level images and exploit features such as texture, directional-field orientation, ridge shape, and coefficients of the Fourier-Mellin transform. An alternative approach to hand-based biometric authentication is based on the detection of vein patterns in infrared images of the dorsal surface of the hand. There are at least two important advantages of this approach: firstly, veins are hidden and therefore much harder to forge than external hand features, and secondly, the blood coursing through the veins gives an assurance of aliveness.
Fig. 3. Scanned image of the right hand, contour of the hand, reference points on the contour and nine regions-of-interest.
The authentication system developed by Lin and Fan (2004) already achieves acceptable accuracy. Authentication systems based on the fusion of hand-geometry and palmprint features at the matching-score and decision levels are described by Shu and Zhang (1998), Ribarić et al. (2002) and Kumar et al. (2003). Recently, an authentication system based on the matching-score level fusion of eigen-coefficients of the palmprint and 5 digitprints has been proposed by Ribarić and Fratrić (2005).
3
System Description
In the proposed system, the palmar surface of the hand is acquired by a low-cost office scanner at a resolution of 600 dpi, 256 grey levels. The hand is placed on the scanner with the fingers spread naturally. There are no pegs or other hand-position constrainers on the scanner. At the preprocessing stage, the hand image is processed in 3 consecutive steps: 1) the contour of the hand is extracted from the image and reference points are determined from local minima and maxima of the contour curve; 2) the locations of nine regions-of-interest (ROI) are determined based on the contour reference points as follows: four ROIs on the tips of four fingers, four ROIs on the four fingers between the first and third phalanx, and one ROI on the palm; 3) the subimages determined by the ROIs are cropped from the original image, rotated to the same position, sized to fixed dimensions (palmprint and fingerprints subimages to (64 × 64) pixels, and digitprints subimages to (64 × 16) pixels), and the lighting is normalized. The scanned image with marked contour, reference points and ROIs is shown in Figure 3. At the feature extraction stage, features are generated from the ROIs via the Karhunen-Loève transform as follows: each ROI, i.e. the r-pixel subimage, is represented by an r-dimensional vector x_i formed by lexicographic ordering of the subimage pixels and subsequently projected onto the subspace spanned by the n ≤ min{M, r} eigenvectors corresponding to the largest eigenvalues of the ROI covariance matrices

C_i = \frac{1}{M} \sum_{j=1}^{M} (x_{ij} - \mu_i)(x_{ij} - \mu_i)^T, \quad i = 1, 2, \ldots, 9,
where M denotes the number of hand images in the clients (training) set and µ_i = E[x_i] the mean vector of x_i. Thus, at the end of the feature extraction stage, the image of the palmar surface of the hand is represented with nine n-dimensional feature vectors consisting of coefficients of the K-L transform (also called eigen-coefficients). At the matching stage, the feature vectors are compared against the feature vectors (templates) stored in the system database. Nine nearest-neighbor (1-NN) classifiers based on the Euclidean distance are used. At the fusion stage, the nine matching scores (i.e. Euclidean distances D_i; i = 1, 2, ..., 9) are normalized by the min-max procedure and subsequently converted to similarities S_i according to the formula S_i = 1/(D_i^n + 1), where D_i^n denotes the normalized distance. Assuming statistical independence and unequal importance of the matching scores, the fused normalized score TSM is computed as

TSM = \sum_{i=1}^{9} w_i S_i,

where w_i (0 ≤ w_i ≤ 1; \sum_i w_i = 1) represents the weight of the i-th matcher. In our experiments, the weights were assigned to the individual matchers proportionally to their recognition rates. At the decision stage, the person presented to the system is authenticated by comparing TSM with the decision threshold T: he or she is authenticated if TSM > T.
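A compact sketch of the normalization, fusion and decision steps just described is given below (the distances, min-max bounds and threshold are placeholder values; in practice the bounds would be estimated on training data, and the weights here are simply taken as proportional to assumed recognition rates):

import numpy as np

def fuse_scores(distances, d_min, d_max, weights):
    """Min-max normalize nine distances D_i, convert to similarities S_i = 1/(D_i^n + 1), and fuse as TSM = sum_i w_i S_i."""
    d_norm = (np.asarray(distances, dtype=float) - d_min) / (d_max - d_min)
    similarities = 1.0 / (d_norm + 1.0)
    return float(np.dot(weights, similarities))

rates = np.array([95.6, 82.6, 88.6, 91.5, 88.2, 94.6, 93.9, 96.2, 93.4])  # per-matcher recognition rates
weights = rates / rates.sum()
distances = np.random.rand(9) * 10.0
tsm = fuse_scores(distances, d_min=0.0, d_max=10.0, weights=weights)
authenticated = tsm > 0.48   # compare with a decision threshold T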
4
Experimental Results
For testing purposes hand images of 43 people (26 males and 17 females) with 10 images of right hands and 10 images of left hands were acquired (a total of 860 images). As dermatoglyphic patterns on the palmar surface of left and right hand are different (Lu et al. (2003)), left hand images were mirrored and used as right hand images of ”new” persons. In this way 86 image classes with 10 images per class were obtained. Two experiments of person authentication were performed: closed-set identification and verification. 4.1
Closed-set Identification Test
Five images from each image class in the database were chosen randomly and used in the enrolment stage to create the client database. The remaining 5 images were used to test the system. The feature vectors generated from each test image were compared with the feature vectors of all hand images of all clients in the system database. Each feature vector of the test image was identified individually, as well as within the fusion scheme at the matching-score level. At the decision stage the decision threshold was set to zero (T = 0). There were 430 identification trials for each feature vector and each fusion scheme during the experiment. The identification test was repeated 20 times, and every time another 5 hand images were chosen randomly for the client database.
   Region-of-interest                        n=20    50   100   200   300   430
 1 Palmprint                                 90.9  94.2  95.2  95.2  95.1  95.6
 2 Fingerprint (little finger)               76.0  80.9  82.1  82.5  82.4  82.6
 3 Fingerprint (ring finger)                 83.2  86.9  88.0  88.2  88.3  88.6
 4 Fingerprint (middle finger)               85.5  90.1  90.9  91.0  91.2  91.5
 5 Fingerprint (index finger)                81.8  86.5  87.5  87.7  88.0  88.2
 6 Digitprint (little finger)                92.6  94.8  94.5  94.1  94.3  94.6
 7 Digitprint (ring finger)                  90.9  93.7  93.9  93.6  93.7  93.9
 8 Digitprint (middle finger)                92.7  95.3  95.8  96.0  96.1  96.2
 9 Digitprint (index finger)                 89.6  93.0  93.0  92.7  92.9  93.4
10 Palmprint + digitprints                   98.4  98.8  99.0  99.0  99.0  99.1
11 Palmprint + fingerprints                  98.3  98.7  99.0  99.0  99.0  99.0
12 Digitprints + fingerprints                99.1  99.3  99.4  99.4  99.4  99.4
13 Palmprint + digitprints + fingerprints    99.2  99.4  99.5  99.5  99.5  99.5

Table 1. Average rates of correct recognitions (ARCR) based on n-dimensional feature vectors representing the palmprint ROI, 4 fingerprints ROIs, 4 digitprints ROIs and 4 possibilities of their fusion at the matching score level.
In order to find the optimal number of eigen-coefficients for the description of the hand ROIs, the identification test was performed with 20-, 50-, 100-, 200-, 300- and 430-dimensional feature vectors. Table 1 shows the average rates of correct recognition in percent (%) from 20 repeated experiments based on closed-set identification of individual ROIs and their different possibilities of fusion, for different feature vector lengths. The results demonstrate that very high recognition rates can be achieved only with systems based on the score-level fusion of hand ROIs, on condition that they are represented with at least 50 coefficients of the K-L transform (see rows 10 - 13 in the Table). The results additionally show that, if an authentication system is based on a single hand ROI, the features generated via the K-L transform are less appropriate for the ROI representation (this holds especially for the fingerprints; see rows 2 - 5 in the Table). 4.2
Verification Test
For the verification test the database was divided into two parts: 65 (i.e. ≈ 75 %) classes were used for client experiments, while the remaining 21 (i.e. ≈ 25 %) classes were used for impostor experiments. Hand images of the classes used for client experiments were divided into two parts: 5 of the 10 images were used in the enrolment stage to create the client database; the remaining 5 images were used for testing. Client experiments were performed by comparing the 5 test images of the 65 test classes with the corresponding class in the client database. A total of 325 (65 test classes × 5 test images) client experiments were made. Impostor experiments were performed
Fig. 4. Average verification test results; the dependence of FRR and FAR on the threshold value.
by comparing 10 impostor images of 21 classes with each class in the client database. A total of 13,650 (21 impostor classes × 10 impostor images × 65 client classes) impostor experiments were made. In each experiment, client and impostor, the fused normalized scores between the test image and the 5 images from the claimed class in the client database were calculated. In the decision stage the best score TSM_max = max_{j=1,...,5} {TSM_j} was compared with the decision threshold T. If TSM_max > T, the claimed identity was accepted; otherwise it was rejected. The verification test was repeated 20 times, and every time another 5 hand images were chosen randomly for the client database. Based on the results of the test described in Subsection 4.1, the verification test was performed only with 100-dimensional feature vectors. Figure 4 presents the average verification test results from the 20 repeated tests, and shows the dependency of the false rejection rate (FRR) and the false acceptance rate (FAR) on the threshold value. The system achieved: an equal error rate (EER) of 0.025 % at the threshold T = 0.48, the lowest FRR at which no false acceptances occur (zeroFAR) of 0.46 % at the threshold T = 0.60, and the lowest FAR at which no false rejections occur (zeroFRR) of 0.056 % at the threshold T = 0.40.
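A minimal sketch of how FRR/FAR curves and the equal error rate can be derived from lists of client and impostor scores is shown below (the scores here are synthetic stand-ins for the fused scores TSM described above):

import numpy as np

def far_frr(client_scores, impostor_scores, thresholds):
    """False rejection and false acceptance rates over a grid of decision thresholds."""
    client = np.asarray(client_scores)
    impostor = np.asarray(impostor_scores)
    frr = np.array([1.0 - (client > t).mean() for t in thresholds])   # clients wrongly rejected
    far = np.array([(impostor > t).mean() for t in thresholds])       # impostors wrongly accepted
    return frr, far

def equal_error_rate(client_scores, impostor_scores, thresholds):
    """Threshold at which FRR and FAR are (approximately) equal, and the corresponding error rate."""
    frr, far = far_frr(client_scores, impostor_scores, thresholds)
    idx = int(np.argmin(np.abs(frr - far)))
    return thresholds[idx], (frr[idx] + far[idx]) / 2.0

thresholds = np.linspace(0.2, 0.7, 501)
clients = 0.60 + 0.05 * np.random.randn(325)        # 325 client trials, synthetic scores
impostors = 0.35 + 0.05 * np.random.randn(13650)    # 13,650 impostor trials, synthetic scores
t_eer, eer = equal_error_rate(clients, impostors, thresholds)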
5
Conclusions
In this paper, a multimodal biometric system that uses palmprint, digitprints and fingerprints features for person authentication has been presented. The initial results obtained (a rate of correct recognition of 99.49 % in the closed-set identification test and an EER of 0.025 % in the verification test), as well as the fact that the system uses a single sensor (a low-cost scanner) as its input device, make the system suitable for home use and for many network-based applications, for example access control or virtual access control (web access, e-commerce).
Future work may include increasing the size of the database, as well as experimenting with new sets of palmprint, digitprints, and fingerprints features.
References
JAIN, A.K., ROSS, A. and PRABHAKAR, S. (2004): An Introduction to Biometric Recognition. IEEE Tr. on Circuits and Systems for Video Technology, Special Issue on Image- and Video-Based Biometrics, 14(1), 4-20.
KUMAR, A., WONG, D.C.M., SHEN, H.C. and JAIN, A.K. (2003): Personal Verification Using Palmprint and Hand Geometry Biometric. In: J. Kittler and M.S. Nixon (Eds.): Proc. of 4th Int'l Conf. on Audio- and Video-Based Biometric Person Authentication (AVBPA), Springer, 668-678.
LIN, C.L. and FAN, K.C. (2004): Biometric Verification Using Thermal Images of Palm-Dorsa Vein Patterns. IEEE Tr. on Circuits and Systems for Video Technology, 14(2), 199-213.
LU, G., ZHANG, D. and WANG, K. (2003): Palmprint Recognition using Eigenpalms Features. Pattern Recognition Letters, 24, 1463-1467.
MALTONI, D., MAIO, D., JAIN, A.K. and PRABHAKAR, S. (2003): Handbook of Fingerprint Recognition. Springer, New York.
PAVEŠIĆ, N., RIBARIĆ, S. and RIBARIĆ, D. (2004): Personal authentication using hand-geometry and palmprint features - the state of the art. In: C. Vielhauer et al. (Eds.): Biometrics: Challenges Arising from Theory to Practice, Cambridge, 17-26.
RIBARIĆ, S., RIBARIĆ, D. and PAVEŠIĆ, N. (2002): A biometric identification system based on the fusion of hand and palm features. In: M. Falcone et al. (Eds.): Proc. of the Advent of Biometrics on the Internet, Rome, 79-82.
RIBARIĆ, S. and FRATRIĆ, I. (2005): A Biometric Identification System Based on Eigenpalm and Eigenfinger Features. Accepted for publication in IEEE Tr. on PAMI.
SHU, W. and ZHANG, D. (1998): Automated Personal Identification by Palmprint. Opt. Eng., 37(8), 2359-2362.
Labelling and Authentication for Medical Imaging Through Data Hiding
Alessia De Rosa, Roberto Caldelli, and Alessandro Piva
Media Integration and Communication Center, University of Florence, Florence, Italy
Abstract. In this paper two potential applications of data hiding technology in a medical scenario are considered. The first application refers to labelling: the watermarking algorithm provides the possibility of embedding the data of the patient directly into a medical image. The second application regards authentication and integrity verification: data hiding is applied to verify whether, and where, the content has been modified or falsified since its distribution. Two algorithms developed for these specific purposes will be presented.
1
Introduction
Data hiding is the embedding of some data (usually called a watermark) within a digital document, in such a way that the data are intrinsically tied to the document. Its first application was copyright protection, where the embedded data represent some characteristics of the owner or the authorized user of the document. Subsequently, other interesting applications of data hiding emerged: copy protection, authentication, labelling, fingerprinting, tracking, and so on. Depending on the particular application, different requirements have to be satisfied by the data hiding system. In general, the main common ones are: imperceptibility, that is, the quality of the marked data must remain high; robustness, that is, the embedded watermark must resist intentional or unintentional attacks; security, that is, the system must not be forged by unauthorized people; and payload, that is, the amount of data to be inserted. In a medical scenario, two applications of data hiding seem to be of great interest and usability: labelling and authentication/integrity verification. The first application refers to the possibility of embedding the data of the corresponding patient directly into a medical image. Current standards for medical image data exchange, like DICOM, store image data and textual information separately in different record fields, so that the link between the image and the patient's data could get mangled. By means of data hiding it is possible to embed patient records directly into biomedical images, to prevent errors of mismatching between patient records and images, and to prevent the loss of the metadata when a file format conversion is applied (for example when a radiography must be compressed for sending over the Internet). Furthermore, the size of the image does not increase when embedding additional data.
Regarding the second application, data hiding is applied in order to verify whether the content has been modified or falsified since its distribution. In this case the watermark should be embedded at the beginning of the chain, i.e. when the digital image is acquired; thereafter, every time the integrity of the image must be verified, the watermark can be extracted from the possibly corrupted object and the information conveyed by the watermark can be used to reveal whether some manipulation occurred. An interesting scenario for such an application is, for example, the field of insurance: by means of data hiding technology it is possible to pinpoint the region of the medical image that has been illegally tampered with. In the following sections the two above-mentioned applications will be analyzed and two algorithms developed for such purposes described.
2
Labelling
The idea is to embed an amount of data into digital medical images; specifically, we consider digital radiographs. In the following we analyse the specific requirements imposed by the considered application, and we present the developed algorithm and some experimental results (Piva et al. (2003)). 2.1
Requirement Analysis
For designing a data hiding system suitable for this kind of scenario, three main requirements have been taken into account: the amount of hidden information, invisibility and robustness. Regarding the payload, we have analysed the DICOM standard (DICOM (2001)) that manages medical images and considered all the metadata that such a standard stores in separate fields. From these metadata we have selected a set of the most important ones, for a total amount of about 6 thousand bits. The second requirement concerns the preservation of the quality of the radiography after modifications have been introduced to embed the required payload. For such an application, invisibility of the modifications means the correctness of the diagnosis of a radiologist examining the marked radiography instead of the original one: the alterations introduced by the embedding process should be low enough to avoid any misleading interpretation of the digital radiography content. To this aim, once the payload has been established, the maximum level of the watermark energy is fixed based on the opinion of a radiologist. Finally, regarding robustness, taking into account the usual processing applied to medical radiographs, we only consider DCT-based JPEG compression with a high quality factor, so as not to degrade the medical image itself. 2.2
An Informed Watermarking Algorithm
Many watermarking algorithms have been developed in the last decade. The most recent studies have highlighted that informed watermarking techniques
(Chen and Wornell (2001), Eggers et al. (2003)) provide better performance than the non-informed ones (e.g. classical spread spectrum). In particular, such techniques provide good results for high values of DWR (Document to Watermark Ratio) and WNR (Watermark to Noise Ratio), that is, for low values of watermark strength and attack level. Such conditions are met in the framework of data hiding for medical applications: in fact, to assure invisibility the watermark strength has to be low, and the attacks applied to the images are in practical cases only a mild JPEG compression. Based on these considerations, we have adopted a Dither Modulation algorithm, which is based on the quantization of the host features (in our case the magnitude of DFT coefficients) with two quantizers, selected according to the inserted bit (Piva et al. (2003)). To embed an information bit b_i, a host feature x_i (i.e. a DFT coefficient of the original image) is quantized with a quantizer Q_∆{x_i} with step ∆, having a shift depending on the bit value, thus obtaining the corresponding marked feature y_i:
y_i = \begin{cases} Q_{\Delta_0}\{x_i\} = Q_\Delta\{x_i\} & \text{if } b_i = -1 \\ Q_{\Delta_1}\{x_i\} = Q_\Delta\{x_i - \Delta/2\} + \Delta/2 & \text{if } b_i = +1 \end{cases} \qquad (1)

Regarding decoding, a hard or a soft decision is possible: the former is based on each feature individually, the latter considers all the features quantized with the same bit. From the analyzed image, the received feature r_i is considered and quantized, and the decision is taken on the difference between the analyzed feature and its quantization: z_i = Q_\Delta\{r_i\} - r_i. In the case of hard decoding, the decision rule is expressed as:

\hat{b}_i = \pm 1 \quad \text{if} \quad |z_i| \gtrless \Delta/4. \qquad (2)
While hard decoding is optimum in the absence of attacks, soft decoding increases the performance of the decoder in the presence of attacks. An optimum soft decoding rule in the presence of AWGN attacks (i.e. r_i = y_i + n_i, where n_i is white Gaussian noise) has been derived:

\hat{b}_k = \pm 1 \quad \text{if} \quad \sum_i (r_i - Q_{\Delta_0}\{r_i\})^2 \gtrless \sum_i (r_i - Q_{\Delta_1}\{r_i\})^2. \qquad (3)
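A minimal numerical sketch of uniform dither modulation with hard decoding, in the spirit of (1) and (2), is shown below (the extraction of DFT-magnitude features is omitted; the feature values and the noise model standing in for mild JPEG compression are arbitrary):

import numpy as np

def q(x, delta):
    """Uniform quantizer Q_delta{x}."""
    return delta * np.round(np.asarray(x, dtype=float) / delta)

def dm_embed(x, bits, delta):
    """Bit -1 keeps Q_delta; bit +1 uses the Delta/2-shifted quantizer, as in (1)."""
    x = np.asarray(x, dtype=float)
    y0 = q(x, delta)
    y1 = q(x - delta / 2.0, delta) + delta / 2.0
    return np.where(np.asarray(bits) == -1, y0, y1)

def dm_hard_decode(r, delta):
    """Hard rule of (2): decide +1 if |Q_delta{r} - r| > delta/4, else -1."""
    z = q(r, delta) - r
    return np.where(np.abs(z) > delta / 4.0, 1, -1)

bits = np.array([-1, 1, 1, -1, 1])
features = np.array([10.3, 4.7, 7.1, 2.2, 9.9])
marked = dm_embed(features, bits, delta=2.0)
decoded = dm_hard_decode(marked + 0.1 * np.random.randn(5), delta=2.0)   # mild additive noise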
To take the visibility issue into account, we proposed an adaptive dither modulation: instead of uniform quantizers with a constant step ∆, non-uniform quantizers are used, characterized by an increasing step ∆_i, so that the distortion is proportional to the original host feature value. We thus propose the following quantization laws (Piva et al. (2003)):

\Delta_{0,1} = \Delta, \qquad \Delta_{0,i} = \Delta_{0,(i-1)} + \Delta \left( \frac{2+a}{2-a} \right)^{i}, \quad i \in \{2, 3, 4, \ldots\},

\Delta_{1,i} = \frac{\Delta_{0,i} + \Delta_{0,(i-1)}}{2}, \quad i \in \{1, 2, 3, \ldots\}, \qquad a \in [0, 2) \qquad (4)
Fig. 1. Logarithm of BER as a function of the mean JPEG compression ratio for a payload of 5000 bits: with fixed PSNR = 55dB (a) and with maximum embedding parameters that do not introduce visible artifacts (b). [ASS, Additive Spread Spectrum; AMSS, Additive-Multiplicative Spread Spectrum; UDM, Uniform Dither Modulation; ADM, Adaptive Dither Modulation]
where ∆ is the starting step size and a is a parameter controlling the degree of step-size increase. Note that if a = 0, this method reduces to the uniform-quantization-step DM algorithm. During the decoding step, the two possible decoding rules (hard and soft) can again be adopted; they are expressed, respectively, as:

\hat{b}_i = \pm 1 \quad \text{if} \quad |r_i - Q_{\Delta_1}\{r_i\}| \gtrless |r_i - Q_{\Delta_0}\{r_i\}|, \qquad \hat{b}_k = \pm 1 \quad \text{if} \quad \sum_i |r_i - Q_{\Delta_0}\{r_i\}| \gtrless \sum_i |r_i - Q_{\Delta_1}\{r_i\}|. \qquad (5)

2.3
Experimental Results
In the experiments, the two dither modulation methods, uniform and adaptive, are compared, also against classical spread spectrum techniques: additive spread spectrum and multiplicative spread spectrum (Barni et al. (2003)). A set of 100 digital radiographs in raw format, having size 1024 × 1024 pixels and 8 bits/pixel, was collected. On these images two kinds of test have been carried out: in the former, all the images have been modified by imposing a fixed PSNR value (55 dB); in the latter, the images have been modified by using, for each of the four methods, the maximum energy assuring perceptual invisibility of the introduced artifacts, thus leading to different values of PSNR for each approach. The information bits have been embedded in the magnitude of a set of DFT coefficients. According to the analysis carried out on the mandatory information required by the DICOM standard, we decided to test the algorithms by embedding into each image a set of 5000 bits. As a possible attack, JPEG compression was considered, with a quality factor decreasing from 100% to 70%, which corresponds to a mean JPEG compression ratio increasing from 9.40 to 122.1. Experimental results are shown in Figures 1(a) and 1(b). As can be seen, the results demonstrate the superiority of host-interference rejecting meth-
ods with respect to non-rejecting ones. The difference in BER decreases as the JPEG compression ratio increases, until the behaviour becomes similar when the attack is very strong. Regarding the difference in performance between Uniform Dither Modulation and Adaptive Dither Modulation, it is very low in the first test, while in the second test it becomes slightly larger.
3
Authentication
In this case a watermarking scheme for embedding a digest of the original image within the to-be-authenticated image has been developed. The aim of such an algorithm is to make it possible to recover the original content (i.e. the embedded digest) in order to compare it with the to-be-verified content and to localize malevolent modifications, with a good level of security and watermark invisibility. We propose a very simple self-recovery authentication technique that hides an image digest into some DWT subbands of the image itself (Piva et al. (2005)). In this case authentication is achieved by means of a robust algorithm, given that the embedded digest must be recovered to prove integrity. Considering medical applications, as for labelling, we only take DCT-based JPEG compression into account as a possible attack. 3.1
Digest Embedding
The data embedding part of the proposed scheme is sketched in Figure 2. Given an N × N image, after applying a 1-level DWT, the two horizontal and vertical detail subbands are further DWT decomposed. The full-frame DCT of the low-pass version is computed and the DCT coefficients are scaled down to decrease their obtrusiveness, by using the JPEG quantization matrix. The first M lowest-frequency coefficients (except the DC one) are selected and further scaled by using a secret key (Key1) (the need for this step will be clarified below). Each DCT coefficient can now be hidden in each sub-band more than once, thus ensuring a certain degree of robustness. The DCT coefficients are substituted for the DWT coefficients in the two detail sub-bands highlighted in dark grey in Figure 2. Before the replacement, a scrambling process, depending on a secret key (KeyA), is applied, so that the replicas of each DCT coefficient will occupy different locations in the two sub-bands: this is important because, if a manipulation occurs, we can be quite confident that not all the replicas of a given coefficient will be removed by the attack. Finally, the inverse DWT is applied and the authenticated image is obtained. The original image and the authenticated one appear very similar from a quality point of view, and a PSNR of about 36 dB has been obtained with different test images. The secret scaling using Key1 has been introduced for visibility and security reasons. First of all, the scrambling applied to the watermarked coefficients
Fig. 2. Sketch of the embedding procedure.
before introducing them in place of the original DWT coefficients can cause high-amplitude values to fall close to low-amplitude values, resulting in an unpleasant quality degradation. To avoid this effect, before scrambling, a sort of de-emphasis operation is applied according to the following rule:

c_{scaled}(i) = c(i) \cdot \alpha \cdot \ln(i + 2 + rand(i)), \qquad (6)
where c(i) indicates the DCT coefficient in position i within the zig-zag scan and c_{scaled}(i) is the corresponding scaled coefficient; α is a strength factor (usually slightly higher than 1) which is set on the basis of the final image quality, and rand is a shift parameter (ranging between −0.5 and 0.5) generated pseudo-randomly by means of a PRNG (Pseudo-Random Number Generator) initialized with the secret key Key1. The insertion of such a random scaling, dependent on a secret key, makes the estimation of the scrambling rule unfeasible, thus increasing the security of the system with respect to potential attackers; otherwise, a hacker could crack the scrambling rule and thus create a seemingly authentic image by reintroducing into the right DWT sub-bands the informative data related to his forged image. 3.2
Integrity Verification
In the integrity verification phase the DWT of the to-be-checked N × N image is computed and the two sub-bands supposed to contain the informative data are selected. These data are rearranged into a vector, which is inversely scrambled by means of the secret key KeyA. By knowing the private scaling key Key1 it is possible to correctly invert the scaling operation performed
Fig. 3. Original image (a); watermarked, i.e. authenticated, image (b); its manipulated version (c).
during the authentication phase. The inversely scaled coefficients are then put in the correct positions, in such a way as to obtain an estimate of the DCT of the reference image (missing elements are set to zero, and a DC coefficient with value 128 is reinserted). These values are weighted back with the JPEG quantization matrix, and the inverse DCT is then applied to obtain an approximation of the original reference image. The quality of this extracted image (having size N/2 × N/2) is very satisfactory and permits a good comparison with the image being checked for authenticity. An automatic system for the detection of manipulations has also been implemented, by simply computing a pixel-wise absolute difference between the sub-sampled to-be-verified image and the extracted image digest. The difference is then suitably thresholded, yielding a binary image in which the white pixels indicate a local difference between the two images (small differences due to noise are neglected by the thresholding). 3.3
Experimental Results
The proposed algorithm has been tested on various medical images, both with and without JPEG compression. In Figure 3 the original radiograph (a), the watermarked image (b) and a manipulated version of it (c) are presented. The maximum level of the watermark energy has been fixed so as to assure a quality level of the marked image acceptable according to medical opinion. By analyzing the manipulated, but authenticated, radiograph through the proposed data hiding algorithm, it is possible to extract the reference image embedded in it (Figure 4(a)) and thus to localize where the protected radiograph has been tampered with (Figure 4(c)). When the embedded reference image is recovered after the authenticated image has been JPEG compressed, the sharpness of the extracted image is slightly poorer than in the case in which no compression is applied, and a sort of noise is superimposed on the image. Notwithstanding this undesired effect, the reference image is still good enough to determine if and where something has been changed in the radiograph (more experimental results can be found in Piva et al. (2005)).
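A simple sketch of the final tamper-localization step, i.e. the pixel-wise absolute difference followed by thresholding, is given below (the digest extraction itself, i.e. the key-dependent descrambling and DWT/DCT inversion, is not reproduced; the arrays stand in for real images):

import numpy as np

def tamper_map(extracted_digest, suspect_image, threshold=25):
    """Compare the recovered N/2 x N/2 digest with the sub-sampled suspect image;
    True pixels mark locations where the content differs beyond the noise level."""
    suspect_small = suspect_image[::2, ::2].astype(float)     # naive 2x sub-sampling
    diff = np.abs(suspect_small - extracted_digest.astype(float))
    return diff > threshold

digest = np.random.randint(0, 256, (128, 128))    # placeholder for the extracted reference image
suspect = np.random.randint(0, 256, (256, 256))   # placeholder for the to-be-verified image
mask = tamper_map(digest, suspect)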
Fig. 4. Extracted reference image (a), sub-sampled analysed image (b) and pixelwise absolute difference (thresholded for achieving a binary image)(c).
Acknowledgements This publication has been produced with the assistance of the European Union, in the framework of the Culture Tech Project. The contents of this publication are the sole responsibility of the project partners and can in no way be taken to reflect the views of the European Union.
References
BARNI, M., BARTOLINI, F., DE ROSA, A. and PIVA, A. (2003): Optimum Decoding and Detection of Multiplicative Watermarks. IEEE Trans. on Signal Processing, Special Issue on Signal Processing for Data Hiding, 51(4), 1118–1123.
CHEN, B. and WORNELL, G.W. (2001): Quantization index modulation: A class of provably good methods for digital watermarking and information embedding. IEEE Trans. on Information Theory, 47(4), 1423–1443.
DICOM (2001): Digital Imaging and Communications in Medicine. National Electrical Manufacturers Association, Rosslyn, Virginia, USA.
EGGERS, J.J., BAUML, R., TZSCHOPPE, R. and GIROD, B. (2003): Scalar Costa Scheme for Information Embedding. IEEE Trans. on Signal Processing, Special Issue on Signal Processing for Data Hiding, 51(4), 1003–1019.
PIVA, A., BARTOLINI, F., COPPINI, I., DE ROSA, A. and TAMBURINI, E. (2003): Analysis of data hiding technologies for medical images. In: Wong and Delp (Eds.): Security and Watermarking of Multimedia Contents V, SPIE, Santa Clara, CA, USA, 5020, 379–390.
PIVA, A., BARTOLINI, F. and CALDELLI, R. (2005): Self recovery authentication of images in the DWT domain. International Journal of Image and Graphics, 5(1), 149–165.
Hand-geometry Recognition Based on Contour Landmarks
Raymond Veldhuis, Asker Bazen, Wim Booij, and Anne Hendrikse
Signals and Systems Group, Dept. of Electrical Engineering, University of Twente, Enschede, The Netherlands
Abstract. This paper demonstrates the feasibility of a new method of hand-geometry recognition based on parameters derived from the contour of the hand1. The contour can be modelled by parameters, or features, that can capture more details of the shape of the hand than is possible with the standard geometrical features used in hand-geometry recognition. The set of features considered in this paper consists of the spatial coordinates of certain landmarks on the contour. The verification performance obtained with contour-based features is compared with the verification performance of other methods described in the literature.
1
Introduction
Most reported systems for hand-geometry recognition, e.g. Golfarelli et al (1997), Jain, Ross and Pankanti (1999), and Sanchez-Reillo et al (2000) use standard geometrical features as inputs. An overview of these methods is given by Paveˇsi´c et al (2004). A different approach, based on the contour of the hand was published in Jain and Duta (1999). Examples of standard geometrical features are the widths and the lengths of fingers and of parts of the palm, and the angles between line segments connecting certain points. These features are measured from a black-and-white or gray-level image of the hand as shown in Figure 1. The lengths of the line segments and the angles are the features. The alignment pegs appear as black disks. The three larger black disks are for calibration. The performance of hand-geometry recognition is, in spite of its simplicity, quite acceptable. Equal-error rates of about 0.5% have been reported in the literature. This paper demonstrates the feasibility of a new method of contour-based hand-geometry recognition. The contour is completely determined by the black-and-white image of the hand and can be derived from it by means of simple image-processing techniques. It can be modelled by parameters, or features, that capture more details of the shape of the hand than the standard geometrical features do. The features considered in this paper are the spatial coordinates of certain landmarks on the contour. Section 2 discusses the features and the recognition method. The method presented here differs from the one presented in Jain and Duta (1999) in that the latter does not 1
¹ This paper is a short version of Veldhuis et al (2005).
Fig. 1. Binary image of the hand and geometrical features. The lengths of the line segments and the angles in the image are used as features.
The method presented here differs from the one presented in Jain and Duta (1999) in that the latter does not use landmarks, but the fingers are extracted from the contour and aligned pairwise. The mean alignment error is then used to compare contours. The new method has been evaluated experimentally in a verification context. The verification performance obtained with contour-based features has been compared with the verification performance of a reference system using standard geometrical features. A comparison with results presented in the literature has also been made. The experiment and the results are presented in Sections 3 and 4.
2 Contour-based Recognition
Images of the right hand are used for recognition. The part of the contour that is used runs counterclockwise from a point at a fixed distance below the basis of the little finger to a point at a fixed distance below the basis of the thumb. The parts of the contour below those points are not used, because they are unreliable due to sleeves or cuffs that may appear in the image. The alignment pegs are removed from the extracted contour. Possible dents at their locations are smoothed by linear interpolation. The number of landmarks on a contour can be chosen freely, but the minimum set consists of 11 reference landmarks. These are: the start and end point of the contour, the fingertips and the inter-finger points. A number of n_l ≥ 0 additional landmarks can be placed on the contour at equidistant positions between adjacent reference landmarks. This means that there are l = 10 n_l + 11 landmarks in total. Their spatial coordinates (x, y) constitute the feature vector. The dimensionality m of the feature vector is, therefore, twice the number of landmarks.
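As an illustration of this feature construction, the following sketch places n_l equidistant landmarks between adjacent reference landmarks of a contour that is assumed to be given as an ordered array of (x, y) points, with the indices of the 11 reference landmarks already located. Equidistance is taken along the contour index here for simplicity, and all names are our own, not from the paper.

```python
import numpy as np

def landmark_features(contour, ref_idx, n_l=4):
    """Build the feature vector of landmark coordinates.

    contour : (P, 2) array of (x, y) points, ordered along the hand contour.
    ref_idx : increasing indices into `contour` of the 11 reference landmarks
              (start point, finger tips, inter-finger points, end point).
    n_l     : number of additional landmarks placed equidistantly between
              each pair of adjacent reference landmarks.

    Returns a vector of length 2 * (10 * n_l + 11).
    """
    landmarks = []
    for a, b in zip(ref_idx[:-1], ref_idx[1:]):
        # reference landmark at the start of the segment
        landmarks.append(contour[a])
        # n_l equidistant landmarks strictly between the two reference points
        steps = np.linspace(a, b, n_l + 2)[1:-1]
        for s in steps:
            landmarks.append(contour[int(round(s))])
    landmarks.append(contour[ref_idx[-1]])   # final reference landmark
    return np.asarray(landmarks, dtype=float).ravel()
```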
Fig. 2. Original contour (thin) and final contour (thick) with 51 landmarks (n_l = 4) indicated by circles. The reference landmarks are indicated by disks.
The verification is based on a log-likelihood-ratio classifier. It is assumed that the feature vectors have multi-variate Gaussian probability densities². The total probability density, i.e. the probability density of a feature vector x without prior knowledge of the specific class of x, is

p(x) = (2π)^{−m/2} |Σ_T|^{−1/2} exp(−(1/2)(x − µ_T)^T Σ_T^{−1}(x − µ_T)),    (1)

with m the dimensionality of the feature space, µ_T the total mean and Σ_T the total covariance matrix. The superscript T denotes vector or matrix transposition. It is assumed that a class c is characterized by its class mean µ_c and that all classes have the same within-class covariance matrix Σ_W. The within-class probability density, i.e. the probability density of a feature vector x ∈ c, is

p(x|c) = (2π)^{−m/2} |Σ_W|^{−1/2} exp(−(1/2)(x − µ_c)^T Σ_W^{−1}(x − µ_c)).    (2)
² The reader may wonder why Gaussian densities are assumed. In fact, there are no good reasons. Usually, the following arguments are presented: Many physical processes can be modelled as Gaussian. The linear transformations that are applied for dimensionality reduction will make the data more Gaussian-like. The Gaussian assumption will lead to a solvable problem, which cannot be said of other assumptions.
Prior to classification the feature vector is mapped onto a lower-dimensional subspace by means of a linear transform. The d × m transform matrix M simultaneously diagonalizes the within-class and the total covariance matrix, such that the latter is an identity matrix. This results in a log-likelihood-ratio classifier that has a computational complexity that is linear, rather than quadratic, with the dimensionality d. The log-likelihood-ratio is then given by

l(y) := log p(y|c)/p(y) = −(1/2)(y − ν_c)^T Λ^{−1}(y − ν_c) + (1/2)(y − ν_T)^T(y − ν_T) − (1/2) log(|Λ|),    (3)
with y = Mx, ν_c = Mµ_c, ν_T = Mµ_T, and Λ = M Σ_W M^T a diagonal matrix. If l(y) is above a threshold T, the user is accepted, otherwise he is rejected. The coefficients of the transformation matrix M and the parameters (ν_c, ν_T, Λ) of the classifier must be estimated from training data consisting of the landmarks of a number of s subjects. This training procedure is described in detail in Veldhuis et al (2005). For the understanding of the experiment described below, it is important to know that the reduction of dimensionality achieved by the d × m matrix M depends on two parameters: p, the number of dimensions that are retained after a first principal component analysis, and d, the final dimensionality after a subsequent linear discriminant analysis.
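A direct transcription of the matching score (3) might look as follows; numpy is assumed, and the transform M, the means ν_c and ν_T and the diagonal of Λ are taken as given (estimated beforehand as described in Veldhuis et al (2005)). This is a sketch of the formula, not the authors' code.

```python
import numpy as np

def llr_score(x, M, nu_c, nu_T, lam_diag):
    """Log-likelihood ratio (3) for a feature vector x.

    M        : (d, m) transform that diagonalizes the covariance matrices.
    nu_c     : (d,) transformed class mean M mu_c (the template).
    nu_T     : (d,) transformed total mean M mu_T.
    lam_diag : (d,) diagonal of Lambda, the within-class variances after M.
    """
    y = M @ x
    dc = y - nu_c
    dT = y - nu_T
    return (-0.5 * np.sum(dc * dc / lam_diag)
            + 0.5 * np.sum(dT * dT)
            - 0.5 * np.sum(np.log(lam_diag)))

# accept the claimed identity if the score exceeds a threshold T:
# accepted = llr_score(x, M, nu_c, nu_T, lam_diag) > T
```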
3 Experimental Evaluation
A lab system, similar to the one described in Sanchez-Reillo et al (2000), has been realized. A black-and-white image of a hand and the reference pins, obtained with this lab system, is shown in Figure 1. The geometrical features are indicated in this figure. The lab system was used for an experimental comparison of two methods: a reference method based on 30 standard geometrical features, similar to those described in Sanchez-Reillo et al (2000), and the contour-based method described above. The reference method also uses a log-likelihood-ratio classifier based on Gaussian probability densities, and the dimensionality is reduced by the same procedure as is used for the contour parameters. A database containing 10 to 20 black-and-white images of the right hand of each of 51 subjects was collected. It contains a total of about 850 images. The equal-error rates were estimated from two grand sets containing all the genuine and impostor matching scores, respectively, that were measured in 20 experimental trials. In each of these trials the feature vectors were randomly divided into 2 groups: a fraction of 75% was used as a training set; the remaining 25% were used as a test set. The transform matrix M and the classifier parameters (ν_c, ν_T, Λ) were estimated from the training set. The
matching scores were computed from the test set. Three types of tests were performed:
• The first test was one-to-template testing. The class means served as templates in the verification process. Therefore, the enrollment was part of the training. Log-likelihood ratios (3), with ν_c taken as the class means, served as matching scores.
• The second test was inclusive one-to-one testing. In each experiment the feature vectors of two hands were compared. Again (3) was used to compute the matching scores, but now y represented one feature vector and ν_c the other. For each class 75% of the examples were added to a training set; the remaining 25% were added to a test set.
• The third test was exclusive one-to-one testing, in which in each trial a random selection of 75% of the classes (i.e. 38 classes) was used for training and the other 25% (i.e. 13) for testing.
One-to-template testing predicts the performance of a verification system with extensive enrollment. This type of enrollment will yield, as will be shown later, the best verification performance, but is not very user-friendly. One-to-one testing predicts the performance of a verification system whose enrollment consists of only one measurement. This is a common type of enrollment in biometric systems. In inclusive one-to-one testing the data are split per class and divided over the training and test set. This has the advantage that the training data will be representative of the test data. It is not always realistic, since in practice systems may be trained by the manufacturer while enrollment is taken care of by the user. This is accounted for in the exclusive one-to-one testing, where the test set contains other classes than the training set. The parameters of the trials were the number of most significant principal components p, the final dimensionality d of the feature vector, and the number of landmarks l = 10 n_l + 11. The number of most significant principal components p was 26 for the reference method and 65 for the new contour-based method. The precise value of p is not critical in the new method. In the case of one-to-template testing, the equal-error rates could not be measured for some values of d, because there was no overlap between the matching scores of the genuine and the impostor attempts. Instead of choosing an equal-error rate of 0 in these cases, we have approximated the logs of the estimated false-accept and false-reject rates as functions of the matching scores by straight lines. These approximations are based on the 10 matching scores that were closest to their (non-measurable) cross-over point. The error rate at which these linear approximations cross is taken as the equal-error rate. The reader is referred to Veldhuis et al (2005) for more details on this approximation.
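The linear approximation of the equal-error rate can be sketched as follows, assuming the genuine and impostor score sets do not overlap; using the 10 scores on each side that are closest to the cross-over point follows the description above, while the remaining fitting details are our own simplification.

```python
import numpy as np

def eer_linear_extrapolation(genuine, impostor, k=10):
    """Approximate the EER when genuine and impostor scores do not overlap.

    The logs of the false-reject and false-accept rates are fitted by
    straight lines, using the k scores on each side that are closest to
    the (non-measurable) cross-over point; the EER is read off at the
    intersection of the two lines.
    """
    genuine = np.sort(np.asarray(genuine, dtype=float))
    impostor = np.sort(np.asarray(impostor, dtype=float))

    # k smallest genuine scores: thresholds at which 1..k genuines are rejected
    g = genuine[:k]
    frr = np.arange(1, k + 1) / genuine.size
    # k largest impostor scores: thresholds at which 1..k impostors are accepted
    i = impostor[-k:][::-1]
    far = np.arange(1, k + 1) / impostor.size

    a1, b1 = np.polyfit(g, np.log(frr), 1)   # log FRR as a line in the threshold
    a2, b2 = np.polyfit(i, np.log(far), 1)   # log FAR as a line in the threshold
    t_star = (b2 - b1) / (a1 - a2)           # intersection of the two lines
    return float(np.exp(a1 * t_star + b1))   # error rate at the cross-over
```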
Fig. 3. One-to-template testing: Equal-error rates as functions of the final dimensionality d, obtained with the reference method with 30 standard geometrical features (line), and with the new method with 51 (dots), 91 (dash dots), and 211 (dashes) landmarks. Equal-error rates that are the result of a linear approximation of the false-accept and false-reject rates are denoted by circles. −1
Fig. 4. Inclusive one-to-one testing: Equal-error rates as functions of the final dimensionality d, obtained with the reference method with 30 standard geometrical features (line), and with the new method with 51 (dots), 91 (dash dots), and 211 (dashes) landmarks.
4 Results
Figures 3, 4, and 5 present the results obtained by one-to-template testing, inclusive one-to-one testing, and exclusive one-to-one testing. All these figures show the equal-error rates as functions of the final dimensionality d. Table 1 compares the results obtained with the reference and the new contour-based method with those published in the literature, in particular
Fig. 5. Exclusive one-to-one testing: Equal-error rates as functions of the final dimensionality d, obtained with the reference method with 30 standard geometrical features (line), and with the new method with 51 (dots), 91 (dash dots), and 211 (dashes) landmarks.

Method                         | 1–T                 | 1–1:I    | 1–1:E
Golfarelli et al (1997)        | 1.2·10⁻³ – 1.2·10⁻² |          |
Jain, Ross and Pankanti (1999) | 5.0·10⁻²            |          |
Sanchez-Reillo et al (2000)    | 5.0·10⁻³            |          |
Reference method               | 3.0·10⁻³            | 8.0·10⁻³ | 2.0·10⁻²
-------------------------------|---------------------|----------|---------
Jain and Duta (1999)           |                     |          | 2.5·10⁻²
New contour-based method       | ≤ 1.0·10⁻⁵          | 2.0·10⁻⁴ | 2.0·10⁻³
Table 1. Comparison with equal-error rates presented in the literature. 1–T denotes one-to-template testing; 1–1:I denotes inclusive one-to-one testing, and 1–1:E denotes exclusive one-to-one testing. The first 4 methods are based on standard geometrical features and the last 2 on contours.
Golfarelli et al (1997)³, Jain, Ross and Pankanti (1999), Sanchez-Reillo et al (2000), and Jain and Duta (1999). The 4 methods based on standard geometrical features and the 2 based on contours are separated by a horizontal line. The new contour-based method achieves, by far, the lowest equal-error rates for all three types of testing.
³ With respect to the equal-error rate of 1.2·10⁻³ reported in Golfarelli et al (1997) it must be remarked that in this reference it is said that ‘at the cross-over point we observed 1 FR and 118 FA’. The number of tests was 800 for the false-reject rate and 9900 for the false-accept rate. This means that the equal-error rate may be anywhere between 1.2·10⁻³ and 1.2·10⁻².
5 Conclusion
A new method for hand-geometry verification, based on the contour of the hand, has been presented. The feature vectors consist of the spatial coordinates of landmarks on the contour. The verification is based on a loglikelihood-ratio classifier. An experiment based on a data set containing a total of 850 hand contours of 51 subjects has been performed. The new method was tested in three ways: one-to-template, one-to-one with test classes represented in the training set and one-to-one without test classes in the training set. Depending on the test, the equal-error rate varied between below 1.0·10−5 and 2.0 · 10−3 . This is substantially better than both the equal-error rate of a reference method based on standard geometrical features and the performances of other methods reported on in the literature.
References
GOLFARELLI, M., MAIO, D., and MALTONI, D. (1997): On the error-reject trade-off in biometric verification systems. IEEE Trans. PAMI 19, 786–796.
JAIN, A., ROSS, A., and PANKANTI, S. (1999): A prototype hand geometry-based verification system. Proc. 2nd Int. Conf. on Audio- and Video-Based Personal Authentication (AVBPA), pp. 166–171. Washington.
SANCHEZ-REILLO, R., SANCHEZ-AVILA, C., and GONZALEZ-MARCOS, A. (2000): Biometric identification through hand geometry measurements. IEEE Trans. Pattern Analysis and Machine Intelligence 22, 1168–1171.
PAVEŠIĆ, N., RIBARIĆ, S., and RIBARIĆ, D. (2004): Personal authentication using hand-geometry and palmprint features – the state of the art. Workshop Proceedings – Biometrics: Challenges arising from Theory to Practice, pp. 17–26. Cambridge, UK.
JAIN, A., and DUTA, N. (1999): Deformable matching of hand shapes for verification. Proc. IEEE Int. Conf. on Image Processing. Kobe, Japan.
VELDHUIS, R., BAZEN, A., BOOIJ, W., and HENDRIKSE, A. (2005): Hand-geometry recognition based on contour parameters. Proc. SPIE Biometric Technology for Human Identification II, pp. 344–353. Orlando, FL, USA.
A Cross-cultural Evaluation Framework for Behavioral Biometric User Authentication
F. Wolf¹, T. K. Basu², P. K. Dutta², C. Vielhauer¹, A. Oermann¹, and B. Yegnanarayana³
¹ Otto-von-Guericke University Magdeburg, 39106 Magdeburg, Germany
² Indian Institute of Technology Kharagpur, Kharagpur 721302, India
³ Indian Institute of Technology Madras, Chennai, India
Abstract. Today, biometric techniques are based either on passive (e.g. iris scan, face) or active methods (e.g. voice and handwriting). In our work we focus on the evaluation of the latter. These methods, also described as behavioral biometrics, are characterized by a trait that is learnt and acquired over time. Several approaches for user authentication have been published, but they have not yet been evaluated under cultural aspects such as language, script and the personal background of users. Especially for handwriting, such cultural aspects can have a significant and essential impact, as different spoken and written languages are used and the scripts used for handwriting differ in nature.
1 Motivation
The goal of our work is to analyze cross-cultural aspects of handwriting data as a digital input for biometric user authentication. Therefore, we have designed and developed a biometric evaluation framework within the CultureTech project, which focuses on cultural impacts on technology in a European-Indian cross-cultural context. The framework, its methodology as well as a short outline of evaluation aspects have already been presented in Schimke et al. (2004). In this paper we elaborate the evaluation aspects in detail and derive first hypotheses about the correctness and usability of biometric user authentication systems for different cultures. The evaluation aspects to be analyzed draw on two different but related sources. First, so-called meta data is collected. A taxonomy for meta data is presented in Vielhauer et al. (2005). Following this taxonomy, meta data is differentiated into two main categories, the technical and the non-technical one. While the technical meta data comprises hardware and software parameters, the non-technical meta data addresses the cultural and personal background of a person and is the subject of our research and the focus of
This publication has been produced with the assistance of the European Union (project CultureTech, see http://amsl-smb.cs.uni-magdeburg.de/culturetech/).
this paper. We introduce a new classification of meta data in order to reach two distinct goals: Given the fact that personal information like age and gender can be statistically estimated by analyzing human handwriting (Tomai et al. (2004)), the first evaluation goal is the derivation of cultural characteristics of a person, such as ethnicity, education, and language, by statistical or analytical means from handwriting dynamics. The second goal is to evaluate the impact that certain meta data can have on biometric user authentication systems based on handwriting. Especially the impacts of additional facts of the personal background, like culture, spoken and written languages as well as ethnicity, on a biometric handwriting user authentication process shall be analyzed in order to estimate its accuracy. In this paper we focus on the latter. As the second evaluation aspect, biometric handwriting data and the related non-technical meta data are analyzed to estimate the effects that a person's condition during the process of experimental testing can have on the behavioral biometric data. In this context, additional incidents, as a special class of meta data, can influence biometric handwriting data in certain ways. Among others, these incidents are determined through cross-cultural experiences of the person in the far or near past, e.g. the person's sojourns abroad, familiarity with given tasks, familiarity with the hardware such as the digitizer tablet and pen, and the attitude towards digital biometric systems in general. Hence, this class of meta data has to be specified and analyzed to adapt the recognition or authentication algorithms in order to enhance their performance and quality, measured by the Equal Error Rate (EER). To read more about the EER we refer to Scheidat and Vielhauer (2005) and Vielhauer (2006). Thus, higher security reliability of biometric user authentication systems shall be achieved. Considering non-technical meta data and the cross-cultural context, our methodology is as follows: In order to evaluate the process of user authentication in a bilingual or multilingual environment, handwriting data is collected in three different countries, India, Italy and Germany. Based on this, we focus on developing hypotheses based on behavioral biometric handwriting input and the collected meta data. As Vielhauer (2006) shows, meta data can have an essential impact on achieving more reliable and correct results in biometric user authentication systems for handwriting. In this paper, hypotheses are derived by analyzing biometric handwriting data and the subject-related meta data. These hypotheses not only address particular factors influencing a biometric system, but also evaluate them. Our framework will be of relevance in two main areas of the cross-cultural biometric field. First, it is an enhanced evaluation system for biometric user authentication in multilingual environments and it provides more reliable results. Second, our system can be used for user verification in a cross-cultural context.
Fig. 1. The meta data hierarchy
The paper is structured as follows: In section 2, an enhanced classification of meta data is presented. In section 3, the process of data collection is introduced and the experimental framework is briefly outlined. This is followed by the description of the evaluation methodology and the formulation of hypotheses in section 4. First results, which are based on experimental data collected in Germany, are presented in section 5. Finally, section 6 concludes by summarizing the paper and providing a perspective on future work.
2 Meta Data — Definition
As briefly mentioned in our introduction, the overall definition of meta data needs to be specified and classified. Different, but closely related classifications of meta data, as Figure 1 illustrates, can be found in the literature. A basic meta data taxonomy is presented in Vielhauer et al. (2005) and differentiates technical and non-technical meta data. Technical meta data include aspects of the used device such as hardware and software specifications. For handwriting sampling, technical meta data specify the digitizer tablet and the used pen as well as the used framework. There exist three classes of non-technical meta data. One class includes aspects of biological meta data. These meta data, described in Jain (2004a) and Jain (2004b) as soft biometric traits, are continuous or discrete parameters which provide some information about the individual's biological background. Ethnicity refers to the second class of non-technical meta data, the cultural class, which joins religious, linguistic and ethnic aspects as almost static parameters. The third class of non-technical meta data is determined through dynamic, conditional parameters of the person. This class is divided into long-term and the more dynamic short-term conditional meta data. In this paper we focus on the cultural and conditional aspects of non-technical meta data.
Biological parameters like year of birth, ethnicity, gender, and handedness never change and are valid for one specific person, whenever the data collection may take place. While Jain (2004a) and Jain (2004b) use these biological parameters to limit the group of subjects a biometric authentication process is applied to, our goal is to investigate to which extent the meta data influences the biometric data during collection. The reason for meta data being a major focus of recent research in the field of biometric user authentication is their potential to improve the performance of traditional biometric systems. In our investigations we establish a double-tracked procedure. In order to improve the accuracy and reliability of algorithms for biometric handwriting user authentication systems, we analyze static as well as dynamic parameters of meta data. Static meta data on the cultural background of a person is collected at the beginning of a sample enrollment and stored as a profile in a database. Once collected, this meta data is valid for all upcoming tests concerning the specified subject. Dynamic meta data on the conditional background of a person is collected through a questionnaire before and after the enrollment. This meta data includes the experiences of a person which have been gained during his or her biographic past. These dynamic parameters can significantly change over time. New experiences can be made, old experiences can be forgotten. Short-term conditional parameters have a very dynamic characteristic. They apply exclusively during data enrollment, since they describe the person's actual condition during recording. Both classes of meta data, the cultural as well as the conditional, essentially influence the output of the sample collection. Our aim is to analyze their impact in order to improve biometric user authentication systems for handwriting.
3 Data Collection and Experimental Framework
In this section we briefly present the environmental and technical concept of the system as described in Schimke et al. (2004). This includes the data collection and the description of the experimental framework. The structure of all components such as handwriting recording, meta data, and conditional information will be described. Our framework contains a generic system design considering additional meta data models. It consists of the following components:
• Sample tasks: The subjects are asked to write 48 given writing samples, each to be repeated 10 times. The different samples are available in English and German:
a) traditional handwriting tasks like giving a signature,
b) words/sentences (statements and questions) of different complexity,
c) numbers,
d) questions about name, heritage and age to be answered.
• Data Recorder: For sampling, tablet PC hardware is used.
• Evaluation Database: Stores the complete handwriting signals along with synchronized meta data, non-technical (stored once for each subject) and technical (stored after each sample).
• Questionnaire: Independent of the system; questions concern a) long-term and b) short-term conditional parameters.
Further, we define a test module as a specified set of handwriting of one person in one language. A whole test session can be set up as follows:
1. Collection of the meta data of the subject
2. Recording of the test modules
3. Filling in the questionnaire
During our recordings the subjects were invited to two test sessions: First, handwriting data in their native language was collected, and second, handwriting data in a second language (usually English) was collected.
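As an illustration only, the session structure described above could be captured in a small data model such as the following; all field names are hypothetical and are not part of the PlataSign software or the evaluation database schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class StaticMetaData:
    """Cultural/biological profile, collected once per subject."""
    year_of_birth: int
    gender: str
    handedness: str
    native_language: str
    learned_languages: List[str]
    scripts: List[str]
    education: str

@dataclass
class HandwritingSample:
    task_id: int                         # one of the 48 sample tasks
    repetition: int                      # 1..10
    language: str
    signal: List[Tuple[float, ...]]      # (x, y, pressure, altitude, azimuth, t) samples
    technical_meta: Dict[str, str] = field(default_factory=dict)

@dataclass
class TestSession:
    subject_id: str
    static_meta: StaticMetaData
    samples: List[HandwritingSample] = field(default_factory=list)
    questionnaire: Dict[str, str] = field(default_factory=dict)  # long-/short-term conditions
```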
4 Methodology — First Hypotheses
In our test modules, data of 29 persons, 10 female and 19 male, has been enrolled. During these experiments, certain meta data could be categorized and specified as follows.
Technical meta data concerning the aspects of recording and environment:
• Data recorded in a laboratory environment
• Consistent test modules (48 samples, 10 repetitions each)
• Tool for handwriting recording: software PlataSign and digitizer tablet
Non-technical meta data that can be assumed as true for all registered subjects of the test modules:
• Educational background: academic (recordings took place at the university)
• Native language: German
• Learned languages: English
• Learned languages: at least 1 (English), maximum of 3
• Scripts: maximum of 2
• Age between 19 and 30
• Subject's handedness: right¹
• Subject's religion: Christian (Protestant) or no religion
• Equality of gender representation
¹ Rigid reeducation of left- to right-handedness in the former GDR [6].
By analyzing the questionnaires, a high motivation and willingness of the subjects could be observed. Based on the collected handwriting data, meta data, and the questionnaires, our hypotheses are structured as follows: A variety of hypotheses, which initially concern only visually noticeable features, can be derived from the collected data. In this paper, we have chosen hypotheses concerning two of the most obvious aspects and parameters which have been collected. First, hypotheses about differences of handwriting, depending on the used languages, will be outlined. Second, gender specifics and familiarity with different languages will be analyzed. A subject's input, which consists of conditional and cultural meta data as well as recorded biometric data, results in certain test values or test module parameters. Test values are analyzed under two different aspects, syntactic and semantic. Thus, hypotheses can be derived and retrospectively cross-checked with the input. Our methodology has the following structure:
a.) Analysis of syntactic aspects. The syntax is the physical entity of the handwriting samples. It contains:
• dynamic writing features (e.g. velocity, pressure and the pen's angles, in particular altitude and azimuth),
• writing features (e.g. position of points, tilt angle, gaps, horizontal and vertical dimensions, length of lines).
b.) Analysis of semantic aspects. The semantics describe aspects of the content. They contain:
• personal and special meanings of freely chosen answers to questions,
• added, elided or twisted aspects of words and sentences,
• appearance of test modules.
c.) Meta data have been analyzed and investigated considering special aspects of a.) and b.), in particular with regard to gender and languages.
d.) Questionnaires have been analyzed.
e.) Development of hypotheses which group aspects of a & c, a & d, b & c and b & d.
Obviously, the connections from syntactic or semantic aspects to meta data or the questionnaire are analyzed separately, performing 1:1 relationships.
5 Results
Based on the introduced methodology, the following hypotheses considering cross-cultural aspects can be summarized: The first hypothesis refers to the influence of a person's conditional meta data, in particular sojourns abroad, on his or her writing style, especially of numbers. Independent of other meta data such as gender, attitude or age, a high semantic variability of the written numbers “1”, “7” and “9” could be noticed.
Subjects who stayed abroad show a similar writing style and tend to use the “English standard”² instead of the expected “German standard”³ which they learned in school. As a further cross-cultural aspect, staying abroad also influences the choice of individual samples: English phrases are preferably used. It also affects the orthography. The second hypothesis concerns influences of gender on syntactic features. By analyzing the 19 male subjects' test modules compared to the 10 female subjects' test modules, it could be observed that men have a much higher writing pressure than women. Despite this, the variability of female pressure was higher than the male. On average, the male subjects held the pen much lower, using tilt angles between 0° and 30°, than the female subjects, who used angles between 30° and 100°. The horizontal and vertical dimensions also varied. The third hypothesis concerns influences of gender on semantic features. On average, the male subjects showed a much higher writing variability than the female subjects. Besides these hypotheses, more hypotheses can be formulated, concerning attitudes, influences of the used soft- and hardware and relationships of subjects to the supervisor, just to name a few. But as mentioned before, we restrict ourselves to hypotheses considering the most obvious parameters in a cross-cultural manner.
6 Conclusions and Future Work
In order to investigate and recognize differences between cultures (India, Italy, Germany) and languages, we have introduced a new approach to formulate hypotheses concerning the impact of certain meta data (cultural and biological background) and conditions (experiences and attitudes) on behavioral biometric data, focusing on handwriting data. By evaluating our hypotheses, a new research area in the field of cross-cultural, as well as multi-modal, user interfaces will be opened. Especially for behavioral biometric authentication systems, more accurate and reliable results and higher security levels against forgery may be accomplished. Based on new data, collected in different countries (India and Italy), the hypotheses will be further tested and verified. Especially the assumption of a similar writing style in the same or similar cultural groups and areas will be examined. Subject of future investigations is the detailed analysis of whether and how particular cross-cultural groups can be characterized by sharing writing habits. The results obtained so far are promising.
² “1” just having a light up stroke, “7” lacking the cross bar, “9” having a stroke instead of a curvature.
³ “1” having a slight up stroke with a little stroke on top, “7” including a cross bar in the middle, “9” having a curvature under the upper circle.
A focus of recent and future investigations is the enhancement of the test set with audio data collection as a third test session. Hence, not only handwriting but also speech can be tested and improved as an active biometric user authentication modality in a cross-cultural manner. Further, handwriting and speech can be combined and compared with reference to meta data in order to find the most reliable and best-performing behavioral biometric user authentication system.
Acknowledgements The information in this document is provided as is, and no guarantee or warranty is given or implied that the information is fit for any particular purpose. The user thereof uses the information at its sole risk and liability. The work described in this paper has been supported by the EU-India project CultureTech. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the European Union. The content of this publication is the sole responsibility of the University Magdeburg and their co-authors and can in no way be taken to reflect the views of the European Union. Thanks to all partners of the EU-India project CultureTech.
References
JAIN, A. K. et al. (2004a): Soft Biometric Traits for Personal Recognition Systems. In: Proceedings of International Conference on Biometric Authentication (ICBA). Hong Kong, LNCS 3072, 731–738.
JAIN, A. K. et al. (2004b): Can soft biometric traits assist user recognition? In: Proceedings of SPIE Biometric Technology for Human Identification. Orlando, FL, U.S.A., 5404, 561–572.
PATIL, H. A. and BASU, T. K. (2004): Speech corpus for text/language independent speaker recognition in Indian languages. The National Symposium on Morphology, Phonology and Language Engineering, SIMPLE'04, A1–A4.
SCHEIDAT, T. and VIELHAUER, C. (2005): Fusion von biometrischen Verfahren zur Benutzerauthentifikation. In: P. Horster (Ed.), D-A-CH Security 2005 Bestandsaufnahme, Konzepte, Anwendungen, Perspektiven. 82–97.
SCHIMKE, S. et al. (2004): Cross Cultural Aspects of Biometrics. In: Proceedings of Biometrics: Challenges arising from Theory to Practice. 27–30.
TOMAI, C. I. et al. (2004): Group Discriminatory Power of Handwritten Characters. In: Proceedings of SPIE-IS&T Electronic Imaging. 5296, 116–123.
VIELHAUER, C. et al. (2005): Finding Meta Data in Speech and Handwriting Biometrics. In: Proceedings of SPIE-IS&T. 5681, 504–515.
VIELHAUER, C. (2006): Biometric User Authentication for IT Security: From Fundamentals to Handwriting. Springer, New York, U.S.A., to appear 2006.
On External Indices for Mixtures: Validating Mixtures of Genes
Ivan G. Costa¹ and Alexander Schliep¹,²
¹ Department of Computational Molecular Biology, Max-Planck-Institute for Molecular Genetics, Ihnestraße 73, D-14195 Berlin, Germany
² Institut für Mathematik-Informatik, Martin-Luther-Universität Halle-Wittenberg, 06099 Halle, Germany
Abstract. Mixture models represent results of gene expression cluster analysis in a more natural way than 'hard' partitions. This is also true for the representation of gene labels, such as functional annotations, where one gene is often assigned to more than one annotation term. Another important characteristic of functional annotations is their higher degree of detail in relation to groups of co-expressed genes. In other words, genes with similar function should be grouped together, but the inverse does not hold. Both these facts, however, have been neglected by validation studies in the context of gene expression analysis presented so far. To overcome the first problem, we propose an external index extending the corrected Rand for the comparison of two mixtures. To address the second and more challenging problem, we perform a clustering of terms from the functional annotation, which deals with the difference in coarseness of the two mixtures to be compared. We resort to simulated and biological data to show the usefulness of our proposals. The results show that we can only differentiate between distinct solutions after applying the component clustering.
1 Introduction
Biology suggests that a single gene will often participate not in one, but in multiple metabolic pathways, regulatory networks or protein-complexes. As a result, mixture models represent the results of gene expression clustering analysis in a more natural way than ’hard’ partitions (Schliep et al. (2005)). This is true not only for the clustering results, but also for the representations of gene labels. Biological sources of information, such as functional annotations, transcription binding sites or protein-protein interactions are formed by overlapping categories. However, this has been neglected so far by validation studies for gene expression analysis. A classical approach for comparing two partitions is the use of external indices (Jain and Dubes (1988)). Their basic definition only allows the comparison of ’hard’ clusterings. To overcome this limitation, we propose extensions of external indices, such as the corrected Rand (CR), suitable for comparing mixtures or overlapping partitions (encoded as mixtures). In order to investigate the characteristics of the proposed index, we make use of experiments with simulated data sets.
Other important characteristics of most biological information are its complex structure, large size and specificity of information. Gene Ontology (G.O. Consortium (2000)), for example, is composed of a redundant directed acyclic graph with thousands of biological terms. The terms in Gene Ontology (GO) can either describe general concepts, such as 'development', which has more than 17,000 annotated genes, or very specific concepts, such as 'pupal cuticle biosynthesis', which has only one associated gene. The construction of a 'compact' and 'meaningful' mixture from such a complex structure is nontrivial. Furthermore, one should not expect that the information contained in a single gene expression data set is as specific as the information contained in GO. Biologically speaking, co-regulated genes should share similar function, but clusters of co-regulated genes will be associated not with one, but with several biological functions. The use of CR to compare two mixtures (or partitions), where one of the mixtures is a coarser representation of the data, yields too conservative CR values, given the high number of false positives. As a consequence, a procedure for clustering GO terms prior to the comparison of the mixtures – clustering of components – is necessary in order to achieve more general representations of GO. This compact representation of GO yields a better basis for comparison of distinct results. To evaluate the proposal, we perform an analysis of gene expression time-courses from Yeast during sporulation (Chu et al. (1998)). The results with and without the component clustering are then compared with the Yeast annotation from GO.
2 External Indices
External indices assess the agreement between two partitions, where one partition U represents the result of a clustering method, and the other partition V represents a priori knowledge of the clustered data. A number of external indices have been introduced in the literature, but the use of the corrected Rand (CR) has been suggested given its favorable characteristics (Hubert and Arabie (1985)). Among others, CR has its values corrected for chance agreement, and is not dependent on the object distribution in U or V (Milligan and Cooper (1986)). This work proposes an extension of the corrected Rand, in order to assess the agreement of partitions with overlap (encoded as mixtures) or mixture models, by comparing their posterior distributions for a fixed data set. The main idea of the extended corrected Rand (ECR) is to redefine the indicator functions, as defined in Jain and Dubes (1988), giving them a probabilistic interpretation. To simplify the notation, we consider for a given mixture model f(·|Θ) = Σ_{k=1}^{K} α_k f_k(·|Θ_k)¹ the components U = {u_k}_{1≤k≤K}; similarly V = {v_l}_{1≤l≤L} for a second mixture model. Let O = {o_n}_{1≤n≤N} be the set of objects to be clustered, U be the estimated mixture model (or clustering solution), and V
¹ Θ_k and α_k are the mixture model parameters (McLachlan and Peel (2000)).
be the mixture defined by the a-priori classification. The posterior distribution defines the probability that a given object o ∈ O belongs to a component u_k from U or v_l from V, {P[u_k|o]}_{1≤k≤K} and {P[v_l|o]}_{1≤l≤L}. We denote the event that a pair of objects has been generated by the same component in model U, the co-occurrence event, as o_i ≡ o_j given U. Assuming independence of the components in U, the probability of the co-occurrence of o_i and o_j given U for 1 ≤ i ≤ j ≤ N is:

P[o_i ≡ o_j given U] = Σ_{k=1}^{K} P[u_k|o_i] P[u_k|o_j]    (1)
We use the above formula to redefine the variables a, b, c and d, used in the definition of CR, which are equivalent to the number of true positives, false positives, false negatives and true negatives respectively.

a = Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} P[o_i ≡ o_j given U] P[o_i ≡ o_j given V]    (2)

b = Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} P[(o_i ≡ o_j given U)^C] P[o_i ≡ o_j given V]    (3)

c = Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} P[o_i ≡ o_j given U] P[(o_i ≡ o_j given V)^C]    (4)

d = Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} P[(o_i ≡ o_j given U)^C] P[(o_i ≡ o_j given V)^C]    (5)
From these the extended corrected Rand (ECR) can be calculated by the original formula for the CR, as defined below:

ECR = [(a + d) − ((a + b)(a + c) + (c + d)(b + d)) p⁻¹] / [p − ((a + b)(a + c) + (c + d)(b + d)) p⁻¹]    (6)
where p is equal to the sum a + b + c + d, or the total number of object pairs. ECR takes values from −1 to 1, where 1 represents perfect agreement, while values of ECR near or below zero represent agreements that occurred by chance. The original CR, proposed in Hubert and Arabie (1985), estimates the expected Rand value by assuming that the baseline distributions of the partitions are fixed. By definition, ECR is an extension of CR. It works exactly as the latter when hard partitions are given. In this terminology, a 'hard' partition can be described by the following posterior:

P[u_k|o] = 1, if o ∈ u_k; 0, otherwise.    (7)
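For concreteness, equations (1)–(6) can be evaluated directly from the two posterior matrices; the sketch below vectorizes the pairwise co-occurrence probabilities with numpy and is only an illustration of the definitions above, not the authors' implementation.

```python
import numpy as np

def extended_corrected_rand(post_u, post_v):
    """Extended corrected Rand (ECR) between two mixtures.

    post_u : (N, K) posterior P[u_k | o_i] of the estimated mixture U.
    post_v : (N, L) posterior P[v_l | o_i] of the reference mixture V.
    """
    post_u = np.asarray(post_u, dtype=float)
    post_v = np.asarray(post_v, dtype=float)
    # co-occurrence probabilities P[o_i = o_j given U] for all pairs, eq. (1)
    s_u = post_u @ post_u.T
    s_v = post_v @ post_v.T

    n = post_u.shape[0]
    iu = np.triu_indices(n, k=1)          # object pairs with i < j
    pu, pv = s_u[iu], s_v[iu]

    a = np.sum(pu * pv)                   # eq. (2)
    b = np.sum((1.0 - pu) * pv)           # eq. (3)
    c = np.sum(pu * (1.0 - pv))           # eq. (4)
    d = np.sum((1.0 - pu) * (1.0 - pv))   # eq. (5)

    p = a + b + c + d                     # total number of object pairs
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / p
    return ((a + d) - expected) / (p - expected)   # eq. (6)
```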
Fig. 1. We display three hypothetical partitions, U and U′, which represent two distinct clustering results, and V, which represents the true labels (the objects in U and U′ are depicted in the corresponding label color defined in V). Both clusterings failed to recover the three true components. U splits the objects from v2 in half, while U′ joined the objects of v2 and v3. Comparing the partitions with V, U has a CR value of 0.57 and U′ a value of 0.53. Assuming, however, that the classes v2 and v3 cannot be distinguished in the clustered data, and joining these two components, U would have a CR of 0.56 while U′ a value of 0.78.
3 Component Clustering
The component clustering deals with the problem of difference in coarseness of two mixtures (or partitions). Given the two mixtures U and V, using the ECR (or CR) to compare the agreement will always result in low values when #U << #V, even when U is a coarser representation of V. A simple example of this, in the context of partitions, can be seen in Fig. 1. In some real-world problems, as with the use of functional annotation of genes to validate co-regulation of genes, it is reasonable to assume that U is a coarser representation of V, and the clustering of components in V yields a better comparative basis for choosing between distinct solutions U, and hence between different methods. More formally, given that the number of components of the model V is higher than the one in U, and assuming that model U is a more general description of V, we want to find a partition P = {p_k}_{1≤k≤K} of the components in V. This partitioning can be used to define a new model V′, where each group of components in P is a single component in V′ and V′ is similar to U. A natural choice of a criterion for evaluating the 'similarity' of the two models is the mutual information:
I(X, Y) = Σ_{i=1}^{J} Σ_{j=1}^{L} P[X = x_i, Y = y_j] log ( P[X = x_i, Y = y_j] / (P[X = x_i] P[Y = y_j]) )    (8)
Given the mixture models U and V, their posteriors on O and, assuming independence between them, we can define the joint probability P[U, V|O] and the probability distribution P[U|O] as:

P[U = u_k, V = v_l | O] = (1/N) Σ_{i=1}^{N} P[u_k|o_i] P[v_l|o_i]    (9)

P[U = u_k | O] = (1/N) Σ_{i=1}^{N} P[u_k|o_i]    (10)
We accomplish the component clustering by applying an algorithm similar to hierarchical clustering. It joins one pair of groups of components at a time, until a certain number of clusters is reached. At each step, the partition in the set of candidate partitions (C) with the highest mutual information is selected. Starting with the singleton partition, where p_i = {v_i} for 1 ≤ i ≤ L, the method works as follows:
1. while (#P > #U) do
2.   C = ∅
3.   for each pair (p_i, p_j), where 1 ≤ i < j ≤ #P do
4.     P′ = P \ {p_j}
5.     p′_i = p_i ∪ p_j
6.     C = C ∪ {P′}
7.   P = argmax_{H∈C} I(U, merge(V, H))
where merge(V, P) defines a new model V′ from V, with #V′ = #P and P[v′_k|o] = Σ_{i∈p_k} P[v_i|o].
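One possible realization of this greedy merging, with the posteriors stored as numpy arrays and the mutual information computed from (8)–(10), is sketched below; it is deliberately unoptimized (every candidate merge is re-evaluated from scratch) and all names are ours.

```python
import numpy as np

def mutual_information(post_u, post_v):
    """I(U, V) estimated from the posteriors via eqs. (8)-(10)."""
    joint = post_u.T @ post_v / post_u.shape[0]       # eq. (9)
    pu = joint.sum(axis=1, keepdims=True)             # eq. (10)
    pv = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (pu @ pv)[nz])))

def cluster_components(post_u, post_v):
    """Greedily merge components of V until #V' == #U, maximizing I(U, V')."""
    groups = [[l] for l in range(post_v.shape[1])]    # singleton partition
    while len(groups) > post_u.shape[1]:
        best, best_groups = -np.inf, None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                # candidate partition: merge groups i and j
                cand = [g for k, g in enumerate(groups) if k not in (i, j)]
                cand.append(groups[i] + groups[j])
                merged = np.stack([post_v[:, g].sum(axis=1) for g in cand], axis=1)
                mi = mutual_information(post_u, merged)
                if mi > best:
                    best, best_groups = mi, cand
        groups = best_groups
    merged_post = np.stack([post_v[:, g].sum(axis=1) for g in groups], axis=1)
    return groups, merged_post
```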
4 Experiments
To evaluate the extended corrected Rand, we make use of simulated data from multivariate mixtures of normals. We use a simple test data set with two normal components to compare the characteristics of ECR and CR when distinct overlaps are present. Then, we make use of biological data in order to show the applicability of the proposal, in particular the component clustering method, to real data. The Expectation-Maximization (EM) algorithm is used to fit multivariate normal mixtures with unrestricted covariance matrices (McLachlan and Peel (2000)). For each data set, 15 repetitions of the EM algorithm with random initialization are performed, and the result with maximum likelihood is selected. In the simulated data experiments, 50 test data sets are generated for each proposed mixture.
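The fitting protocol just described (full-covariance normal mixtures, 15 random EM restarts, keeping the run with maximum likelihood) can be reproduced, for example, with scikit-learn; this is merely one possible realization, not the implementation used by the authors.

```python
from sklearn.mixture import GaussianMixture

def fit_mixture(data, n_components, n_restarts=15, seed=0):
    """Fit a full-covariance Gaussian mixture, keeping the best of several EM runs."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full",   # unrestricted covariance matrices
                          n_init=n_restarts,        # random restarts, best likelihood kept
                          init_params="random",
                          random_state=seed)
    gmm.fit(data)
    return gmm, gmm.predict_proba(data)             # model and posteriors P[u_k | o]
```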
4.1 Simulated Data
We perform experiments with a normal mixture with two equiprobable components to evaluate the characteristics of the proposed index in the presence of distinct overlaps. The components have means µ_1 = [0, 0]^T, µ_2 = [d, 0]^T, covariance matrices C_1 = C_2 = I, and 0.0 < d < 7.5 (structured data) (Figueiredo and Jain (2002)). For each component we draw 200 samples (or objects), and the multivariate normal densities of the mixtures are used to obtain the distributions P[V|o]. We also display the value of the CR, obtained by the following partition assignment of the objects for a given posterior distribution:
Fig. 2. Results of the mixture estimation with the bivariate normal data. The larger d, the lower the overlap between the two components.
P[u_i|o] = 1, if i = argmax_{1≤k≤K} P[u_k|o]; 0, otherwise.    (11)
Additionally, we generate random noise data to serve as a null case. This consists of data generated from a single normal component with µ = [d/2, 0]^T and C = I. A 'hypothetical solution' (V) with the same number of components and object distributions is calculated from the definition of the respective structured data. We carried out a non-parametric equal-means hypothesis test based on the bootstrap (Efron and Tibshirani (1993)) to compare the mean ECR (or CR) obtained with the structured (s) and random data (r):

H_0: r = s   and   H_1: r < s    (12)
As displayed in Fig. 2, for data with high overlap ECR has higher values than CR, while for data with low overlap both indices have similar values. With random data, both indices take on mean values near zero and low variance (< 0.001), which indicates that ECR is successful in the correction for randomness. In relation to the hypothesis test, H_0 is rejected for all d values at α = 0.001 with the use of ECR, while for data with very high overlap (d < 0.4) the null hypothesis is not rejected (α = 0.001) with the use of CR. From this we can conclude that ECR is able to show significant distinctions between the agreement of the random and structured data, even when the overlap is great, while CR fails.

4.2 Biological Data
We use gene expression data from Yeast (Chu et al. (1998)) in our evaluation. This data set contains gene expression measurements during sporulation for over 6400 genes of budding yeast. The measurements were taken at seven time points (0h, 0.5h, 2h, 5h, 7h, 9h and 11h). Clones with more than 20% of values missing were excluded. The data is pre-processed by extracting all those genes
Fig. 3. On the left, we show the ECR values obtained for distinct levels of GO; on the right, we show the number of GO terms and annotated genes for distinct GO levels. The higher the level, the lower the number of genes. The number of GO terms increases until level 3, reaching a peak of 234, and decreases afterwards.
with an absolute fold change of at least two in at least one time point. The resulting data set contains 1171 genes. We perform mixture estimation, as described in Sec. 4, and we use the Bayesian information criterion to determine the optimal number of components (10 for this data set).
Gene Ontology. Gene Ontology (GO) describes genes in three distinct categories (G.O. Consortium (2000)): cellular component, molecular function and biological process. Such an ontology has the form of a directed acyclic graph (DAG), where the leaves are genes and the internal nodes are terms (or annotations) describing gene function, gene cellular localization or the biological processes genes take part in. Genes are associated not only with the terms to which they are directly linked, but also with all parents of these terms. Given this parent relation and the number of GO terms, a reasonable way to obtain a mixture from GO is to cut it at a fixed level m, where each GO term in level m represents one component of the mixture T^m = {t^m_p}_{1≤p≤P}. For a given set of genes O, one can define a simple posterior distribution of a gene o given T^m by:

P[t^m_p|o] = 1/#{i | o ∈ t^m_i, i = 1, ..., P}, if o ∈ t^m_p; 0, otherwise.    (13)

Applying the component clustering after the mixture estimation led to a considerable increase in the ECR values (Fig. 3), while the ECR values obtained with the mixture estimation alone are not far from zero (similar results are encountered with other gene expression data sets). The main reason for this difference is the reduction in the number of false positives obtained after the application of the clustering of components. In relation to the use of GO, the choice of the level at which to cut the DAG is a rather subjective task. Figure 3 shows that high levels of GO should be
avoided, since there is a lower percentage of annotated genes. The levels two and three represent a better choice, since they obtained the highest ECR while they still maintain a reasonable number of genes. These characteristics, however, are dependent on the data set analyzed and on the GO annotation used.
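The posterior in (13) assigns each gene a uniform distribution over the level-m terms that annotate it. A small sketch of how such a posterior matrix might be assembled is given below; the gene-to-term mapping is a placeholder for whatever GO annotation source is used.

```python
import numpy as np

def go_level_posterior(genes, annotations, terms):
    """Uniform posterior (13) over the level-m GO terms annotating each gene.

    genes       : list of gene identifiers O.
    annotations : dict gene -> iterable of GO terms at level m annotating that gene.
    terms       : list of the GO terms at level m (the components of T^m).
    """
    index = {t: p for p, t in enumerate(terms)}
    post = np.zeros((len(genes), len(terms)))
    for i, gene in enumerate(genes):
        hits = [index[t] for t in annotations.get(gene, ()) if t in index]
        if hits:                         # genes without annotation keep a zero row
            post[i, hits] = 1.0 / len(hits)
    return post
```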
5 Conclusions
The use of simulated data allows us to assess the characteristics of the extended corrected Rand. It displayed superior results in comparison to the original corrected Rand when high overlap is present, and values near zero when the data is random. With the biological data, the results indicate that (1) there is a low agreement between the results of mixture analysis and GO and (2) this agreement is greatly enhanced by a clustering of components. We can conclude that the use of component clustering prior to ECR is important when structures with distinct levels of coarseness are compared, allowing us to choose between different solutions that were previously not very distinguishable. Despite the importance of this problem, it has been neglected in the bioinformatics literature, where in several problems we are faced with the comparison of data with such distinctions in coarseness.
References
CHU, S. et al. (1998), The Transcriptional Program of Sporulation in Budding Yeast, Science, 282, 5389, 699–705.
EFRON, B. and TIBSHIRANI, R. (1993), An Introduction to the Bootstrap, Chapman & Hall, New York.
FIGUEIREDO, M. and JAIN, A.K. (2002), Unsupervised learning of finite mixture models, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 3, 381–396.
HUBERT, L.J. and ARABIE, P. (1985), Comparing partitions, Journal of Classification, 2, 63–76.
JAIN, A.K. and DUBES, R.C. (1988), Algorithms for clustering data, Prentice Hall, New York.
MCLACHLAN, G. and PEEL, D. (2000), Finite Mixture Models, Wiley, New York.
MILLIGAN, G.W. and COOPER, M.C. (1986), A study of the comparability of external criteria for hierarchical cluster analysis, Multivariate Behavioral Research, 21, 441–458.
SCHLIEP, A., COSTA, I.G., STEINHOFF, C. and SCHÖNHUTH, A. (2005), Analyzing gene expression time-courses, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(3), 179–193.
T. G. O. CONSORTIUM (2000), Gene ontology: tool for the unification of biology, Nature Genetics, 25, 25–29.
Tests for Multiple Change Points in Binary Markov Sequences
Joachim Krauth
Institute of Experimental Psychology, University of Düsseldorf, D-40225 Düsseldorf, Germany
Abstract. In Krauth (2005) we derived a finite conditional conservative test for a change point in a Bernoulli sequence with first-order Markov dependence. This approach was based on the property of intercalary independence of Markov processes (Dufour and Torrès (2000)) and on the CUSUM statistic considered in Krauth (1999, 2000) for the case of independent binomial trials. Here, we derive finite conditional tests for multiple change points in binary first-order Markov sequences using in addition conditional modified maximum likelihood estimates for multiple change points (Krauth, 2004) and Exact Fisher tests.
1 Introduction
A problem which is often considered in the analysis of deoxyribonucleic acid (DNA) sequences is the dissection of these sequences into homogeneous segments. Braun and Müller (1998) give an overview of the statistical methods which are used in this field. It is obvious that DNA sequence segmentation requires the detection of change points in DNA sequences. The observations along such a sequence take on one of the four values of the DNA alphabet (A = adenine, G = guanine, T = thymine, C = cytosine). In order to reduce the number of unknown parameters, many authors classify the four nucleic acids (or bases) further according to their physical and chemical properties. This is of particular importance if one is interested in exact statistical tests, where nuisance parameters may cause problems. Three of these alphabets with only two classes are cited in Table 1 of Braun and Müller (1998), namely purine vs. pyrimidine (R (A or G) vs. Y (C or T)), heavy vs. light (S (C or G) vs. W (A or T)), keto vs. amino (K (T or G) vs. M (A or C)). Of course, other alphabets with two classes are also possible. E.g., in Krauth (2004) we considered the classes A vs. the rest (G, T, C). All four possible alphabets generated in this way (i.e. A vs. G, T, C; G vs. A, T, C; T vs. A, G, C; C vs. A, G, T) were considered e.g. in Avery and Henderson (1999b), Avery (2001) and Krauth (2003). We thus reduce the problem of DNA sequence segmentation to the problem of detecting change points in a binary sequence. Many authors have provided methods which are related to this problem. We can classify most of these results with respect to the following aspects:
• Estimating locations of change points vs. testing for the existence of change points by finite or asymptotic tests
• considering models with only one change point vs. models with multiple change points
• assuming independent outcomes vs. permitting dependent outcomes
2 Procedure
We consider a binary sequence of n (n ≥ 9) random variables X1, ..., Xn ∈ {0, 1} and m ∈ {1, ..., (n − 5)/4} presumed change points τ1, ..., τm with 0 < τ1 < τ2 < ... < τm < n. In addition, we define τ0 := 0, τm+1 := n. With m change points we have (m + 1) segments and we assume that each segment has at least length 4. The length (n) of the total sequence is assumed to be
odd. Otherwise, we omit the last observation. We define P(Xi = 1) = 1 − P(Xi = 0) =: πj+1 for τj + 1 ≤ i ≤ τj+1, j = 0, 1, ..., m; τj ∈ {τj−1 + 4, ..., n − 4(m − j + 1)}, j = 1, 2, ..., m; 0 < π1, π2, ..., πm, πm+1 < 1. Further, we allow for a first-order Markov dependence with stationary transition probabilities πst := P(Xi = t | Xi−1 = s) for i = 2, ..., n; s, t ∈ {0, 1}. While the length (n) of the sequence is known from the data, the number of change points (m) has to be fixed before starting the procedure. If we choose m to be small in relation to the length (n) of the sequence (e.g. m = 1, 2 or 3), if n is large and if change points exist in reality, we have a good chance to detect some of these change points. If, however, we choose m to be large in relation to the length (n) of the sequence (e.g. m = 10), the power of the procedure may be low and no change points may be detected even if they exist. The construction of exact significance tests for the existence of change points is made difficult by the possible presence of a Markov dependence. We tackle this problem by utilizing the property of "intercalary independence" and the "truncation property" for Markov processes (Dufour and Torrès (2000)). These properties were investigated in particular for binary first-order Markov chains by Krauth (2005). This allows us to derive conditional exact conservative tests for change points. From the property of "intercalary independence" it follows that the random variables X2i, 1 ≤ i ≤ (n − 1)/2, of the "even sequence" are conditionally independent for fixed values of the "odd sequence" X2i−1, 1 ≤ i ≤ (n + 1)/2. From the "truncation property" it can be concluded that the conditional distribution of X2i depends only on the values of its two neighbors X2i−1, X2i+1 for 1 ≤ i ≤ (n − 1)/2. Under the null hypothesis (H0) of no change points we assume that X1, ..., Xn are identically distributed, with stationary transition probabilities. In this case, only three different conditional distributions occur for the variables of the "even sequence". One distribution results for the neighbors X2i−1 = 1, X2i+1 = 1, one for the neighbors X2i−1 = 0, X2i+1 = 0, and one for the neighbors X2i−1 = 1, X2i+1 = 0 or X2i−1 = 0, X2i+1 = 1, for 1 ≤ i ≤ (n − 1)/2, as indicated in Krauth (2005). Thus, we can assume that under H0 the "even sequence" is composed of three conditionally independent subsequences (11), (00), and (10, 01) consisting of conditionally independent identically distributed Bernoulli variables if the "odd sequence" is fixed. Because the "odd sequence" is fixed we know how many and exactly which random variables of the "even sequence" belong to the subsequences (11), (00), and (10, 01). By applying the algorithm described in Krauth (2004) to the "odd sequence" we get modified maximum likelihood estimates τ̂1, ..., τ̂m for the locations (indices) of the m change points τ1, ..., τm and corresponding estimates π̂1, ..., π̂m+1 for the probabilities π1, ..., πm+1 of the value 1 ("success
probabilities") for the (m + 1) segments. For this algorithm it is not necessary that the transition probabilities are stationary for the total "odd sequence" but it suffices that they are identical for the trials in each segment. To each estimate τ̂j, j ∈ {1, ..., m}, correspond the indices τ̂j−1 + 1, ..., τ̂j and τ̂j + 1, ..., τ̂j+1 of two adjacent segments in the "odd sequence". For j = 1 we define in addition τ̂0 := τ0 := 0 and for j = m we define τ̂m+1 := τm+1 := n. The union of the two segments comprises (τ̂j+1 − τ̂j−1) trials of the "odd sequence". Between these trials lie (τ̂j+1 − τ̂j−1 − 1) variables of the original sequence which belong to the "even sequence". For each of these latter trials we can decide on the basis of the information contained in the trials of the two segments of the "odd sequence" above whether it belongs to the subsequence (11), (00), or (10, 01). We decide now which of these three sequences we want to use for the further analysis and select one of the following two one-sided test problems: H0(1): πj ≤ πj+1, H1(1): πj > πj+1; H0(2): πj ≥ πj+1, H1(2): πj < πj+1. We determine how many values 1 and 0 occur in our subsequence ((11), (00) or (10, 01)) in the segment given by {τ̂j−1 + 1, ..., τ̂j} before the trial corresponding to the index τ̂j and how many values 1 and 0 occur in the segment given by {τ̂j + 1, ..., τ̂j+1} after this trial. These four frequencies form a fourfold table which can be evaluated by means of a one-sided Exact Fisher test. If this test yields a significant result we have detected a change point and can identify the index τ̂j⁰ = 2τ̂j − 1 corresponding to τ̂j in the original sequence of n trials. We propose to select the test problem (H0(1), H1(1)) if we found π̂j > π̂j+1 in the "odd sequence" and (H0(2), H1(2)) otherwise. With respect to the selection of one of the subsequences (11), (00) or (10, 01) we propose the following approach: It is obvious that the power of the Exact Fisher test will be small if we consider a short subsequence. With respect to power it seems best to select that subsequence where the two conditionally independent samples before and after the change point have about equal size. Therefore, we propose to count for each of the two segments of the "odd sequence" defined above the number of (11), (00), and (10 or 01) neighbors and to multiply these two numbers for each of the three subsequences. Then that subsequence should be selected for which this number is maximum. In the same way we perform one-sided Exact Fisher tests for all m change points. Because it occurs (m − 1) times for m > 1 that the same segment is used in two different tests we have for m > 1 a multiple test problem with m dependent tests. Therefore, the Bonferroni, Holm or another appropriate multiple test procedure has to be used (cf. e.g. Bernhard et al. (2004)).
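To make the sequence of steps concrete, the following sketch (in Python; all function names are ours and the implementation is only illustrative, not the author's code) splits a binary sequence into the "odd" and "even" subsequences, classifies each even-indexed trial by its odd-indexed neighbors into (11), (00) or (10, 01), evaluates the fourfold table of one change point with a one-sided Exact Fisher test (scipy.stats.fisher_exact), and applies a Holm correction over the m tests:

from scipy.stats import fisher_exact

def split_odd_even(x):
    """x: 0/1 list of odd length n (positions are 1-based in the paper)."""
    return x[0::2], x[1::2]      # (X1, X3, ..., Xn), (X2, X4, ..., X(n-1))

def classify_even_trials(odd):
    """Label the i-th even trial X(2i) by its neighbors X(2i-1), X(2i+1)."""
    labels = []
    for left, right in zip(odd[:-1], odd[1:]):
        if left == 1 and right == 1:
            labels.append("11")
        elif left == 0 and right == 0:
            labels.append("00")
        else:
            labels.append("10,01")
    return labels

def fisher_p_value(ones_before, zeros_before, ones_after, zeros_after, alternative):
    """One-sided Exact Fisher test for the fourfold table of one change point.
    alternative='less' corresponds to H1(2): pi_j < pi_{j+1} (relatively fewer
    1's before the estimate), alternative='greater' to H1(1): pi_j > pi_{j+1}."""
    table = [[ones_before, ones_after], [zeros_before, zeros_after]]
    _, p = fisher_exact(table, alternative=alternative)
    return p

def holm(p_values, alpha=0.05):
    """Holm step-down procedure for the m dependent change-point tests."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (m - rank):
            break
        rejected[i] = True
    return rejected

# toy illustration
x = [1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1]
odd, even = split_odd_even(x)
print(classify_even_trials(odd))
# counts of the (00) subsequence in the m = 1 example of Section 4:
print(fisher_p_value(51, 236, 19, 26, alternative="less"))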
3 Power Considerations
At first sight it seems that our procedure for estimating and testing the locations of multiple change points is extremely conservative because only a small portion of the information in the data is used. However, a more detailed discussion seems to indicate that perhaps the loss of power due to neglecting a considerable part of the data can be tolerated. The following arguments may be of interest in this respect: (i) Half of the data (the “odd sequence”) are not directly used in the tests. However, the information in these data is used in several ways: (1) By fixing these data we can consider the trials of the “even sequence” as conditionally independent. This is of importance for the performance of exact tests in the presence of a Markov dependence in the original sequence. Otherwise it would have been necessary to derive asymptotic tests where the unknown nuisance parameter describing the dependence of the trials had to be estimated. Then, not only the power but also the exact size of such asymptotic tests may be difficult to evaluate. (2) The data of the “odd sequence” are not “lost” but are used for estimating the locations of change points in the “odd sequence” and for identifying in the resulting estimates of the segments the subsequences ((11), (00) or (10, 01)) of conditionally independent and under H0 identically distributed random variables in the “even sequence”. Both informations are necessary for performing the exact tests. Further, the data of the “odd sequence” are used for estimating the “success probabilities” for the different segments and this information is used for selecting a one-sided test problem for each change point. (3) As is described in Section 4 we can use the data of the “odd sequence” for computing the Bayesian information criterion (BIC) which may be used to estimate the appropriate number (m) of change points before any tests have been performed. (ii) Only one of the three subsequences in the “even subsequence” is used for each test for a change point and the data of the two other subsequences are neglected. There may be situations where we gain power by using all three subsequences at the same time, e.g. if these subsequences have about the same length. However, as we discussed in Krauth (2005), the restriction to the data in the longest subsequence should be preferred from the point of view of power in most situations. Here, we selected for each change point that subsequence for which the product of the number of observations before and after the change-point estimate is maximum because the power of two-sample tests does not only depend on the sum of the two sample sizes but is also larger for equal sample sizes. (iii) It is not guaranteed that the estimates of the locations of the change points and of the “success probabilities” in the “odd sequence” are similar
to the corresponding estimates in the original sequence or in the "even sequence" or that they are near to the true parameters. Both the wrong selection of a test problem due to misleading estimates of the "success probabilities" and a change-point estimate deviating considerably from the true location of the corresponding change point will cause a loss of power. But any procedure ignoring the specific information contained in the "odd sequence" which is used here will necessarily also cause a loss of power. (iv) If tests for more than one change point are performed we have a multiple test problem with dependent tests. Using any of the available procedures for controlling the multiple level α we lose power in comparison with the performance of a single test. But this loss of power will also result for any other test procedure for multiple change points. Considering the arguments above it seems that, though the present approach may lack power, it may be difficult to provide a less conservative procedure.
4 Example
Just as in Krauth (2003, 2004, 2005) we consider the nucleotide sequence reported by Robb et al. (1998, Fig. 1). This is 1,200 nt in length, is constructed from overlapping clones and is based on the analysis of up to 181 mouse embryos. Just as in Krauth (2004, 2005) we coded the letter A (corresponding to the purine adenine) by 1 and the other three letters (G = guanine, T = thymine, C = cytosine) by 0 and generated in this way a binary sequence with 1,200 trials. After omitting the last trial we have 600 trials in the "odd sequence" and 599 trials in the "even sequence". For illustrating the new approach we consider first the case of only one change point (m = 1) corresponding to two segments. The modified ML estimates (Krauth, 2004) for the "odd sequence" yield τ̂1 = 498, π̂1 = .235, π̂2 = .376. The location estimate τ̂1 = 498 in the "odd sequence" corresponds to the location estimate τ̂1⁰ = 2τ̂1 − 1 = 995 in the original sequence with 1,199 trials. In view of π̂1 = .235 and π̂2 = .376 we consider the one-sided test problem H0(2): π1 ≥ π2, H1(2): π1 < π2. The "even sequence" is composed of 44 trials of the subsequence (11), 332 trials of the subsequence (00), and 223 trials of the subsequence (10, 01). In (11) there are 24 trials before and 20 trials after τ̂1⁰, in (00) the corresponding numbers are 287 and 45, and in (10, 01) we find 186 and 37. The three products yield 480, 12,915, and 6,882. Thus we decide to consider the subsequence (00). In this subsequence we have 51 1's before τ̂1⁰ and 19 1's after τ̂1⁰. Likewise we have 236 0's before τ̂1⁰ and 26 0's after τ̂1⁰. For the corresponding fourfold table Fisher's Exact test yields for the one-sided test problem selected above a p-value of p1 = .000432, indicating that there is evidence for a change point near to τ̂1⁰ = 995. For m = 2 we find τ̂1⁰ = 993, τ̂2⁰ = 1,017, π̂1 = .236, π̂2 = .800, π̂3 = .336
and the p-values p1 = .415, p2 = .734, i.e. we have no evidence for any change point. The reason for this might be that the ML estimation procedure identified a rather small center segment of only τ̂2⁰ − τ̂1⁰ = 24 trials with a high "success probability" of .800 and that this may be only an artifact. For m = 3 we have τ̂1⁰ = 137, τ̂2⁰ = 185, τ̂3⁰ = 995, π̂1 = .235, π̂2 = .000, π̂3 = .234, π̂4 = .376 and the p-values p1 = .048, p2 = .104, p3 = .000551. Using the Bonferroni or Holm correction we detect a single change point near to τ̂3⁰ = 995. For m = 4 we find τ̂1⁰ = 545, τ̂2⁰ = 585, τ̂3⁰ = 727, τ̂4⁰ = 865, p1 = .978, p2 = .904, p3 = .154, and p4 = .000601, i.e. a change point near to τ̂4⁰ = 865 is detected, while for m = 5 we get τ̂1⁰ = 545, τ̂2⁰ = 585, τ̂3⁰ = 727, τ̂4⁰ = 857, τ̂5⁰ = 1,005, p1 = .978, p2 = .904, p3 = .171, p4 = .057, p5 = .096 and no change point is detected. Of course, the choice of the number of segments or change points, respectively, should be based primarily on biological considerations and not on statistical arguments. However, if several candidate models are under discussion we might consider the proposal of Churchill (1992) and decide for the model with the maximum value of the Bayesian information criterion (BIC), which is defined by
BIC = l(θ̂) − (1/2) k log n.
Here l(θ̂) is the maximized loglikelihood, k is the number of free parameters in the model and n is the sequence length. The estimation procedure described in Krauth (2004) gives the maximized modified loglikelihood, the number of free parameters (π1, ..., πm+1, λ1, ..., λm+1, τ1, ..., τm or π11(1), ..., π11(m + 1), π00(1), ..., π00(m + 1), τ1, ..., τm, respectively) is given by k = 3m + 2, and the sequence length is that of the "odd sequence". In our example, we have n = 600 and find BIC (m = 1) = −350.801, BIC (m = 2) = −355.823, BIC (m = 3) = −360.514, BIC (m = 4) = −364.466, and BIC (m = 5) = −369.925. The largest value results for m = 1 change point, i.e. for two segments. According to the BIC criterion this model seems to explain the data in the most appropriate way.
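The BIC comparison can be written in a few lines. The sketch below is our own illustration; the log-likelihood values in the final call are made up purely to show the usage and are not taken from the example. It picks the number of change points m with the largest BIC, using k = 3m + 2 free parameters:

import math

def bic(loglik, m, n):
    """BIC = l(theta_hat) - (1/2) * k * log(n) with k = 3m + 2 free parameters."""
    k = 3 * m + 2
    return loglik - 0.5 * k * math.log(n)

def best_m(logliks, n):
    """logliks: dict mapping m to the maximized (modified) log-likelihood."""
    return max(logliks, key=lambda m: bic(logliks[m], m, n))

# hypothetical log-likelihood values, only to illustrate the call:
print(best_m({1: -320.0, 2: -318.5, 3: -317.0}, n=600))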
References
AVERY, P.J. (2001): The Effect of Dependence in a Binary Sequence on Tests for a Changepoint or a Changed Segment. Applied Statistics, 50, 234–246.
AVERY, P.J. and HENDERSON, D.A. (1999a): Fitting Markov Chain Models to Discrete State Series such as DNA Sequences. Applied Statistics, 48, 53–61.
AVERY, P.J. and HENDERSON, D.A. (1999b): Detecting a Changed Segment in DNA Sequences. Applied Statistics, 48, 489–503.
BERNHARD, G., KLEIN, M. and HOMMEL, G. (2004): Global and Multiple Test Procedures Using Ordered P-Values - A Review. Statistical Papers, 45, 1–14.
BRAUN, J.V. and MÜLLER, H.G. (1998): Statistical Methods for DNA Sequence Segmentation. Statistical Science, 13, 142–162.
CHURCHILL, G.A. (1989): Stochastic Models for Heterogeneous DNA Sequences. Bulletin of Mathematical Biology, 51, 79–94.
CHURCHILL, G.A. (1992): Hidden Markov Chains and the Analysis of Genome Structure. Computers & Chemistry, 16, 107–115.
DUFOUR, J.M. and TORRÈS, O. (2000): Markovian Processes, Two-Sided Autoregressions and Finite-Sample Inference for Stationary and Nonstationary Autoregressive Processes. Journal of Econometrics, 98, 255–289.
KRAUTH, J. (1999): Discrete Scan Statistics for Detecting Change-Points in Binomial Sequences. In: W. Gaul and H. Locarek-Junge (Eds.): Classification in the Information Age. Springer, Berlin, 196–204.
KRAUTH, J. (2000): Detecting Change-Points in Aircraft Noise Effects. In: R. Decker and W. Gaul (Eds.): Classification and Information Processing at the Turn of the Millenium. Springer, Berlin, 386–395.
KRAUTH, J. (2003): Change-Points in Bernoulli Trials with Dependence. In: M. Schader, W. Gaul and M. Vichi (Eds.): Between Data Science and Applied Data Analysis. Springer, Berlin, 261–269.
KRAUTH, J. (2004): Multiple Change Points and Alternating Segments in Binary Trials with Dependence. In: D. Baier and K.D. Wernecke (Eds.): Innovations in Classification, Data Science, and Information Systems. Springer, Berlin, 154–164.
KRAUTH, J. (2005): Test for a Change Point in Bernoulli Trials with Dependence. In: C. Weihs and W. Gaul (Eds.): Classification: The Ubiquitous Challenge. Springer, Berlin, 346–353.
ROBB, L., MIFSUD, L., HARTLEY, L., BIBEN, C., COPELAND, N.G., GILBERT, D.J., JENKINS, N.A. and HARVEY, R.P. (1998): Epicardin: A Novel Basic Helix-Loop-Helix Transcription Factor Gene Expressed in Epicardium, Branchial Arch Myoblasts, and Mesenchyme of Developing Lung, Gut, Kidney, and Gonads. Developmental Dynamics, 213, 105–113.
UnitExpressions: A Rational Normalization Scheme for DNA Microarray Data Alfred Ultsch Databionics Research Group, University of Marburg, 35032 Marburg/Lahn, Germany
Abstract. A new normalization scheme for DNA microarray data, called UnitExpression, is introduced. The central idea is to derive a precise model of unexpressed genes. Most of the expression rates in a typical microarray experiment belong to this category. Pareto probability density estimation (PDE) and EM are used to calculate a precise model of this distribution. UnitExpressions represent a lower bound on the probability that a gene on a microarray is expressed. With UnitExpressions experiments from different microarrays can be compared even across different studies. UnitExpressions are compared to standardized LogRatios for distance calculations in hierarchical clustering.
1 Introduction
Computational analysis of DNA microarrays searches for gene patterns that play important roles in the progress of certain diseases. In chronic lymphatic leukemia (CLL), the most common leukemia in Western countries, patients at the same early stage of the disease (Binet stage A) develop the disease in very different ways. It can be assumed that there are at least two subgroups of patients, one with a better chance of longer survival and the other facing an earlier death (Rosenwald et al. 2001). To find gene expression patterns which are able to predict the development of the cancer would be essential for the treatment of such patients. With several thousand gene expressions measured for a single patient on one microarray, the detection of differentially expressed genes is a challenge. Besides the small number of cases in many studies, e.g. 12-16 per group in Rosenwald (2001), the distributions of the measurements are non-Gaussian. Of the many genes measured (typically 1,000-40,000) only a small fraction are under- or over expressed; most genes are unexpressed. The large absolute values of the expressed genes make estimations of variances very difficult. Proper variances are, however, crucial for a calculation of the relevance of a particular gene. In contrast to a model of over- or under expression we propose to calculate a model of the distribution of the unexpressed genes. The large number of unexpressed genes compared to the few expressed genes on typical microarrays puts this approach on more solid empirical grounds. The data used in this paper is published at http://llmpp.nih.gov/cll/data.shtml. For each microarray 328 gene expressions are measured. There
Fig. 1. Distribution of gene expressions on a microarray (PDE likelihood against array measurement; curves: empirical gene expression PDE, Gaussian, and model of unexpressed genes)
are a total of 39 microarrays of CLL and 40 microarrays of another lymphoid malignancy: diffuse large B cell lymphoma (DLCL). Details of the data are published in Rosenwald et al. (2001).
2 Modeling Gene Expression Distributions
The estimation of the distribution of over- or under expressed genes is difficult due to the small number of such genes on typical microarrays. Therefore we propose to model the distribution of the unexpressed genes. Taking the empirical mean me and variance se of the data is, however, not a good model for this distribution (see Figure 1). Figure 1 shows a typical distribution of the expressions on a c-DNA microarray. Similar distributions can be found on Affymetrix and Bead arrays. The distribution is analyzed using Pareto Probability Estimation (PDE) (Ultsch 2003). PDE is shown (solid line) together with the empirical Gaussian N(me, se) (dashed line) and the model of the distribution of the unexpressed genes as described below. The gene expressions consist of a central part of unexpressed genes plus the distributions of the over- and under expressed genes. The latter bias the means and variances. If there is no systematic error in the measurements the distribution of unexpressed genes should, however, be a Gaussian. An ideal model of this distribution can be obtained using the Expectation Maximization (EM) algorithm. EM converges from a good initial distribution towards an optimal model. EM is initialized with a Gaussian estimated from the data trimmed to the 10 to 90 percentile limits. EM is run until no substantial change in the sum of absolute differences to the PDE within these limits is
observed. This results in a model of the unexpressed genes (1) (see dotted line in Figure 1):
UnEx = N(mu, su) · wu.   (1)
A standardizing transformation of the gene expressions such that the unexpressed genes are N(0,1) distributed is called Unit- or, for short, u-transformation:
u = (x − mu)/su.   (2)
Using the u-transformation renders the expression values of different microarray experiments comparable.
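The u-transformation (2) itself is a one-line operation once mu and su are available. The following sketch is our own illustration, not the published program, and it simplifies matters by using the 10-90 percentile trimmed data directly for mu and su, which in the paper only serves as the initial value of the EM refinement against the PDE:

import numpy as np

def unexpressed_model(x):
    """Rough Gaussian model of the unexpressed genes from trimmed data."""
    lo, hi = np.percentile(x, [10, 90])
    core = x[(x >= lo) & (x <= hi)]
    return core.mean(), core.std()

def u_transform(x, m_u, s_u):
    """Eq. (2): standardize so that unexpressed genes are roughly N(0,1)."""
    return (x - m_u) / s_u

rng = np.random.default_rng(0)
# toy array: mostly unexpressed genes plus a few over-expressed ones
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(0.5, 0.5, 28)])
m_u, s_u = unexpressed_model(x)
u = u_transform(x, m_u, s_u)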
3 Unit Expression Values
Although the u-transformed values of unexpressed genes are N(0,1) distributed, there may be extremely large positive or negative values. Absolute expression values ≥ 6 have been observed. Such values bias Euclidean and correlation distances between expression patterns, which are used for clustering. In Ultsch (2003) the usage of relative differences (RelDiff) instead of the commonly used log ratios is proposed. This limits the values to the range [−2, 2]. The calculation of RelDiffs requires, however, the knowledge of the basic color measurements (Cy3/Cy5). A normalization scheme such that unexpressed genes are mapped to zero, over expressed genes to ]0,1] and under expressed to [−1,0[ is naturally achieved by an estimation of the cumulative distribution for expressed genes. Let cdfover(x) denote the probability Pr{expression ≤ x and gene is over expressed}. Let e0 be an expression such that cdfover(x) = 0 for all x < e0, and e1 such that cdfover(x) = 1 for all x > e1. The limits e0 and e1 can be estimated from the PDE of the u-transformed values of all given microarrays. Within the interval [e0, e1] the empirical probability density function is p̂dfover(x) = PDE(x) − UnEx(x). Under the assumption that pdfover(x) is linearly proportional to the expression x, cdfover(x) is a quadratic function. Therefore the empirical ĉdfover(x), obtained by numerical integration of p̂dfover(x), is fitted with a polynomial of second degree p2(x). Adjusting p2(x) such that p2(e0) = 0 results in a model of cdfover(x) as follows:
cdfover(x) = min(1, p2(x) − p2(e0)), x ∈ [e0, e1].   (3)
Figure 2 shows ĉdfover(x) and cdfover(x) for a particular microarray. For under expressed genes the calculations for cdfunder(x) are symmetrical to the calculations described above, using the negative side of the u-transformed expression values. UnitExpression is defined as follows:
UnitExpression(x) = sign(x) · (cdfunder(x) + cdfover(x)).   (4)
Fig. 2. Probability density for over expression
Under expression is denoted as negative, over expression as positive values of UnitExpression. UnitExpression has the same range [-1, 1] for all microarrays. The properties of the individual microarrays are accounted for by the individual calculation for cdfunder (x) and cdfover (x) for each array. More technical details can be found in Ultsch (2005). Ready to use programs are provided at the author’s homepage.
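A compact sketch of how the mapping (3)-(4) could be implemented on u-transformed values is given below. It is only an illustration under simplifying assumptions: a Gaussian kernel density estimate stands in for the Pareto density estimate (PDE), and e0, e1 and the weight of the unexpressed model are passed in as fixed numbers instead of being estimated from all arrays of a study; the function names are ours.

import numpy as np
from scipy.stats import gaussian_kde, norm

def fit_cdf_over(u, e0=1.0, e1=6.0, w_unexpressed=0.9):
    """Return a callable cdf_over(x) following eq. (3)."""
    grid = np.linspace(e0, e1, 200)
    pde = gaussian_kde(u)(grid)                 # stand-in for PDE(x)
    unex = w_unexpressed * norm.pdf(grid)       # UnEx(x), model of unexpressed genes
    pdf_over = np.clip(pde - unex, 0.0, None)   # empirical pdf_over(x)
    emp_cdf = np.cumsum(pdf_over)
    if emp_cdf[-1] > 0:
        emp_cdf = emp_cdf / emp_cdf[-1]
    p2 = np.polyfit(grid, emp_cdf, 2)           # second-degree polynomial p2(x)
    def cdf_over(x):
        x = np.asarray(x, dtype=float)
        val = np.polyval(p2, x) - np.polyval(p2, e0)
        return np.where(x < e0, 0.0, np.clip(val, 0.0, 1.0))
    return cdf_over

def unit_expression(u, cdf_over, cdf_under):
    """Eq. (4): negative = under expression, positive = over expression."""
    u = np.asarray(u, dtype=float)
    return np.sign(u) * (cdf_under(u) + cdf_over(u))

# cdf_under can be built symmetrically from the negated u-values, e.g.
# cdf_o_neg = fit_cdf_over(-u); cdf_under = lambda x: cdf_o_neg(-x)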
4 Results
Euclidean distances were calculated for all cases using standardized log ratio values (LogRatio distance) and using UnitExpressions. Figure 3 compares the two distance measurements. There is some correlation between the two distance measures. For many gene patterns, however, the distances differ considerably. Consider two extreme scenarios: first, cases with large UnitExpression distances and small LogRatio distances (A in Figure 3) and second, cases with small UnitExpression and large LogRatio distances (B in Figure 3). In the A cases the differences between unexpressed and expressed genes were enlarged in UnitExpressions compared to LogRatios. In the B cases the very large absolute values of the unlimited range of LogRatios amounted to the large distances in LogRatios. Furthermore, differences among the unexpressed genes amounted to large case distances, although the expression patterns for expressed genes were rather similar.
Fig. 3. Distances between cases in LogRatios and UnitExpressions (Euclidean distances; UnitExpression distance against LogRatio distance; A and B mark the two extreme scenarios discussed in the text)
5 Discussion
U-transformation is a standardization such that the unexpressed genes are N(0,1) distributed. The basic assumption of UnitExpression is that the likelihood of a gene to be expressed is linearly proportional to the u-transformed absolute value. This assumption is rather conservative. The absolute values of UnitExpression can thus be regarded as a lower bound on the probability that a gene is under- or over expressed, respectively. For many gene measurements UnitExpression equals zero, thus indicating neither under- nor over expression. For two such genes the differences in expression ratios can be attributed to measurement errors. Using Euclidean distances on such genes, however, results in a nonzero distance, although the expression is the same. Using clustering algorithms on gene expression data depends critically on a meaningful distance. In Figure 4 a hierarchical clustering (Eisen et al. 1998) of the data shows completely different cluster structures. The clustering with the two clear clusters coincides with the two different diseases DLCL and CLL. Using UnitExpressions, genes that account for the most differences between DLCL and CLL were selected and presented to experts in CLL diseases (Kofler et al. 2004, Mayr et al. 2005). Our clinical partners from the laboratory for cellular immunotherapy of the University of Cologne found our results convincing. Genes related to cell death (apoptosis) were found to be relevant. Apoptosis is one of the central factors in the development of CLL. This initiates further research on the implications of the genes found for the prognosis of CLL survival rates.
Fig. 4. Hierarchical clustering for UnitExpression (left) and LogRatios (right)
6 Conclusion
A rational normalization scheme for DNA microarray data is introduced. The central idea is to derive a precise model of unexpressed genes, since most of the expression rates in a typical microarray belong to this category. Using PDE (Ultsch (2003)) and EM an optimal model of unexpressed genes can be derived. A lower bound is estimated for the probability that a gene on a particular microarray is expressed. This estimation is used to normalize the data to UnitExpressions. Unexpressed genes have a zero UnitExpression value. Absolute values of UnitExpressions are within the unit interval. Positive and negative values distinguish between over- and under expression. With UnitExpressions experiments from different microarrays can be compared even across different studies. Since microarray experiments are expensive, only relatively few data are available. A meta analysis of these data becomes feasible. With respect to clustering we could demonstrate the usefulness of UnitExpressions to differentiate between two different diseases.
References
EISEN et al. (1998): Cluster Analysis and Display of Genome-Wide Expression Patterns. Proc Natl Acad Sci U S A, 95, 14863–14868.
KOFLER, D.M. et al. (2004): Engagement of the B-cell antigen receptor (BCR) allows efficient transduction of ZAP-70-positive primary B-CLL cells by recombinant adeno-associated virus (rAAV) vectors. Gene Ther., 18, 1416–1424.
MAYR, C. et al. (2005): Fibromodulin as a novel tumor-associated antigen (TAA) in chronic lymphocytic leukemia (CLL), which allows expansion of specific CD8+ autologous T lymphocytes. Blood, 105(4), 1566–1573.
ROSENWALD, A. et al. (2001): Relation of gene expression phenotype to immunoglobulin mutation genotype in B cell chronic lymphocytic leukemia. J Exp Med, 194(11), 1639–1647.
ULTSCH, A. (2004): Density Estimation and Visualization for Data containing Clusters of unknown Structure. In: Proc. GfKl 2004, Dortmund, 94–105.
ULTSCH, A. (2005): UnitExpressions: Normalizing DNA microarray data across different experiments. Technical Report, Department of Computer Science, University of Marburg, May 2005.
A Ridge Classification Method for High-dimensional Observations Martin Grüning and Siegfried Kropf Institute for Biometry and Medical Informatics, University of Magdeburg, 39120 Magdeburg, Germany
Abstract. Currently, experimental techniques such as gene expression analysis with microarrays result in the situation that the number of variables exceeds the number of observations by far. Then application of the standard classification methodology fails because of the singularity of the covariance matrix. One of the possibilities to circumvent this problem is to use ridge estimates instead of the sample covariance matrix. Raudys and Skurichina presented an analytic formula for the asymptotic error of the one-parametric ridge classification rule. Based on their approach we derived a new formula which, unlike that of Raudys and Skurichina, is also valid in the case of a singular covariance matrix. Under suitable conditions the formula allows one to calculate the ridge parameter which minimizes the classification error. Simulation results are presented.
1 Introduction
1.1 The Linear Discrimination Problem
We consider the following situation: Observations
x1(j), ..., xnj(j)   (1)
of random vectors
Xα(j) ∼ N(µ(j), Σ)  (j = 1, 2; α = 1, ..., nj)   (2)
are given, and additionally an observation x of a random vector
X ∼ N(µ(jX), Σ),   (3)
where
µ(1) ∈ Rp, µ(2) ∈ Rp, Σ ∈ PD(p) and jX ∈ {1, 2}   (4)
are unknown. The problem which arises here is to predict the class parameter jX ∈ {1, 2} of the random vector X. The task is to find a suitable decision rule δ which allows a good prediction.
1.2 Classical Method
The Linear Discriminant Analysis (LDA) method uses the discriminant function derived by R. A. Fisher (1936),
f(x) = (x − ½(x̄(1) + x̄(2)))′ S⁻¹ (x̄(1) − x̄(2)),   (5)
with the pooled sample covariance matrix
S = 1/(n − 2) Σj=1,2 Σα=1,...,nj (xα(j) − x̄(j))(xα(j) − x̄(j))′.   (6)
If f(x) > 0 then δ(x) = 1 is chosen, otherwise δ(x) = 2.
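A minimal sketch of the rule (5)-(6) in Python (our own illustration, using numpy only and classes coded 1 and 2) makes the role of the pooled covariance matrix explicit; the explicit matrix inversion is exactly the step that breaks down in the situation discussed in Section 1.3:

import numpy as np

def pooled_covariance(x1, x2):
    """Pooled sample covariance matrix (6) from two data matrices (rows = cases)."""
    n1, n2 = len(x1), len(x2)
    c1 = x1 - x1.mean(axis=0)
    c2 = x2 - x2.mean(axis=0)
    return (c1.T @ c1 + c2.T @ c2) / (n1 + n2 - 2)

def lda_classify(x, x1, x2):
    """Fisher's discriminant function (5); fails if S is singular (p close to n)."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    s_inv = np.linalg.inv(pooled_covariance(x1, x2))
    f = (x - 0.5 * (m1 + m2)) @ s_inv @ (m1 - m2)
    return 1 if f > 0 else 2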
1.3 The Problem of a Singular Sample Covariance Matrix
Let us consider for example the problem of gene expression analysis with microarrays. Here the number of variables amounts to several thousand, while the number of observations, n = n1 + n2, is typically less than 100. The consequence in such situations is that the sample covariance matrix
S = 1/(n − 2) Σj=1,2 Σα=1,...,nj (xα(j) − x̄(j))(xα(j) − x̄(j))′   (7)
is singular, i.e. not invertible. It follows that the LDA method is not applicable in this situation. And even with sample sizes that are slightly larger than the number of variables, the classical LDA shows poor results (Läuter (1992)).
1.4 The One-parametric Ridge Method as an Alternative Method
One possible alternative in such situations is the one-parametric ridge method which uses ridge estimates
Sridge = S + λI   (8)
instead of S, where I is the p × p identity matrix and λ is a "regularization parameter". The resulting discriminant function is then
f(x) = (x − ½(x̄(1) + x̄(2)))′ (S + λI)⁻¹ (x̄(1) − x̄(2)),   (9)
where again δ(x) = 1 if f(x) > 0 and δ(x) = 2 else. This procedure is called "Regularized Discriminant Analysis" (RDA). The additional problem which arises here is how to choose the parameter λ suitably. Certainly there are
different possibilities to determine the parameter λ. However, in classification studies the crucial criterion is the classification error, and therefore the task is to choose λ so that the classification error is as small as possible. Because the distribution of the discriminant function is usually unknown, it is difficult to determine the error. The conventional procedure here is to estimate the classification error iteratively by cross-validation. This may be very expensive in computing time and memory, especially in high-dimensional problems. In view of that we wish to find a direct method to determine the classification error analytically.
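A corresponding sketch of the ridge rule (9), again only an illustration with our own function names, differs from the LDA sketch above only in the regularized matrix that appears in the discriminant function (here handled with a linear solve instead of an explicit inverse):

import numpy as np

def rda_classify(x, x1, x2, lam):
    """One-parametric ridge rule (9); defined even when S is singular (p > n)."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    c1, c2 = x1 - m1, x2 - m2
    s = (c1.T @ c1 + c2.T @ c2) / (n1 + n2 - 2)      # pooled covariance S
    s_ridge = s + lam * np.eye(len(m1))              # S + lambda * I
    f = (x - 0.5 * (m1 + m2)) @ np.linalg.solve(s_ridge, m1 - m2)
    return 1 if f > 0 else 2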
2 Classification Error Analysis
2.1 The Classification Error
We define a loss function L : {1, 2} × {1, 2} → {0, 1} according to
L(j, i) = 1 if j ≠ i,  L(j, i) = 0 if j = i,   (10)
where j denotes the true and i the predicted parameter. The risk of the decision rule δ is given by
R(j, δ) = EPj L(j, δ(x)) = Σi=1,2 L(j, i) pji,   (11)
where
pji = Pj({x ∈ M : δ(x) = i}).   (12)
If furthermore an a-priori distribution Q on {1, 2} is known, we consider the Bayes risk
r(Q, δ) = EQ R(j, δ) = Σj=1,2 R(j, δ) qj,   (13)
where qj are the single probabilities for the classes j (j = 1, 2). In classification the goal is surely to minimize the Bayes risk (i.e. the classification error). But, as mentioned above, the distribution of the decision rule is usually unknown.
2.2 The Asymptotic Formula of Raudys and Skurichina (1995)
Š. Raudys and M. Skurichina derived an asymptotic formula for the classification error of the ridge classification rule (RDA):
r(Q, RDA) ≈ Φ( −(∆/2) √(n/(n − p)) (1 + 4p/(n∆²))^(−1/2) √(1 + λB)/(1 + λC) ),   (14)
where B and C are given by
B = 2β2/(1 − y) + trΛ⁻¹/n,   C = β1/(1 − y),   y = p/n,
βi = (m′Λ⁻¹m/∆²) αi + trΛ⁻¹/(n − p)  (i = 1, 2),
α1 = 1,   α2 = ( 1 + (4p/(n∆²)) (1 + 4 trΛ⁻¹/(n m′Λ⁻¹m)) )⁻¹.
Here Λ is the diagonal matrix with the eigenvalues of Σ, Γ is the corresponding orthogonal matrix of eigenvectors, and m := Γ′(µ(1) − µ(2)). Differentiation of (14) with respect to λ gives:
λopt = (B − 2C)/(BC).   (15)
Expression (14), however, is only defined if n > p, so that otherwise this formula cannot be applied.
2.3 A New Asymptotic Approach
Now an asymptotic formula for the classification error which is also valid in the case n ≤ p is desired. Raudys and Skurichina used the following approximation by Taylor expansion:
(S + λI)⁻¹ ≈ S⁻¹ − λS⁻².   (16)
The function
f(λ) = (S + λI)⁻¹   (17)
is here considered a matrix-valued function of a scalar. We want to derive an asymptotic formula in a similar manner as Raudys and Skurichina. For that purpose the following reparametrisation is useful:
• Multiplying the discriminant function by 1 + λ (this does not change the discrimination), we get the function g:
g(λ) = (1 + λ)(S + λI)⁻¹ = ( (1/(1 + λ)) S + (λ/(1 + λ)) I )⁻¹.   (18)
• Reparametrisation according to λ̃ := 1/(1 + λ):
g(λ̃) = ( λ̃S + (1 − λ̃)I )⁻¹.   (19)
As the first result we propose the following lemma (here S(p) denotes the set of all symmetric p × p matrices):
Lemma 1. Let M ∈ S(p) and ‖·‖ be a norm on S(p) for which the condition
‖AB‖ ≤ ‖A‖ ‖B‖   (20)
holds. If the condition
|λ̃| < 1/‖M − I‖   (21)
is fulfilled, the following identity is valid:
( λ̃(M − I) + I )⁻¹ = Σk=0,...,∞ Ck = Σk=0,...,∞ λ̃^k (I − M)^k.   (22)
k=0
Application of this statement leads to the following approximation by using only the two first members of the series: ˜ ˜ −1 ≈I+λ(I ˜ − S). (λS+(1 − λ)I)
(23)
˜ = 1 and multiplicated with 1 + λ Then by retransformating according to λ 1+λ we get the following discriminant function: 1 (1) (2) (1) f (x) = x− (¯ x ) ((λ + 2)I − S)(¯ x −¯ x(2) ). (24) x +¯ 2
The procedure which uses (24) with the usual decision 1, if f (x) > 0, δ(x) = 2 otherwise
(25)
is called RDA∗ . Applying this result we derived an asymptotic formula for the risk r∞ (Q, RDA∗ ). The result is given in the following theorem. Theorem 1. With assumptions as before let additionally be Q the uniform distribution on {1, 2}. Then the asymptotic risk (for n1 → ∞, n2 → ∞) r∞ (Q, RDA∗ ) is given by 1 µ ((λ + 2)I − Σ)µ ∗ r∞ (Q, RDA ) = r(λ) = Φ − 2 µ ((λ + 2)I − Σ)Σ((λ + 2)I − Σ)µ (26) (Φ: standard normal distribution function, µ := µ(1) −µ(2) ).
Equation (26) determines the asymptotic risk of the procedure RDA∗ which is in a certain manner an approximation of the procedure RDA. Therefore it can be considered the approximative asymptotic risk of RDA. The function r of λ given by (26) now is to be optimized by differentiating and setting r (λ) = 0. In this manner we get the optimum λ0 : µ µµ Σ µ − µ Σµµ Σ µ − 2. µ µµ Σ2 µ − (µ Σµ)2 3
λ0 =
2
(27)
The minimum property of λ0 could be proven. A question concerning the characteristics of λ0 is the following: under which conditions is λ0 positive and lies in the convergence region of the series (22)? This could be proven under special additional assumptions. Now there are the following two possible applications for classification procedures:
1. Usage of the ridge classification rule (RDA) with the discriminant function
f(x) = (x − ½(x̄(1) + x̄(2)))′ (S + λ0I)⁻¹ (x̄(1) − x̄(2))   (28)
and
2. the classification procedure RDA*, which uses the approximative discriminant function
f(x) = (x − ½(x̄(1) + x̄(2)))′ ((λ0 + 2)I − S)(x̄(1) − x̄(2)).   (29)
The parameters µ and Σ are to be replaced by their estimates. In each case λ0 can either be used directly as the regularization parameter or only as a suitable starting value for a subsequent iterative determination of the optimal value by cross-validation. The latter procedure is recommended by Raudys and Skurichina (1994).
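A plug-in sketch of this proposal (our own illustration, not the authors' program) computes λ0 from (27) with µ and Σ replaced by the mean difference and the pooled sample covariance matrix, and then applies the approximative rule RDA* from (29); as noted above, the resulting value could also serve merely as a starting point for cross-validation:

import numpy as np

def lambda_0(mu, sigma):
    """Eq. (27) with a_k = mu' Sigma^k mu, k = 0,...,3."""
    a0 = mu @ mu
    a1 = mu @ sigma @ mu
    a2 = mu @ sigma @ sigma @ mu
    a3 = mu @ sigma @ sigma @ sigma @ mu
    return (a0 * a3 - a1 * a2) / (a0 * a2 - a1 ** 2) - 2.0

def rda_star_classify(x, x1, x2):
    """RDA* rule (29) with plug-in estimates for mu and Sigma."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    c1, c2 = x1 - m1, x2 - m2
    s = (c1.T @ c1 + c2.T @ c2) / (n1 + n2 - 2)
    lam = lambda_0(m1 - m2, s)
    f = (x - 0.5 * (m1 + m2)) @ (((lam + 2.0) * np.eye(len(m1)) - s) @ (m1 - m2))
    return 1 if f > 0 else 2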
2.4 Simulation Study
To examine first whether the value λ0 computed from the true parameter values indeed approximately minimizes the true classification error, a simulation study was carried out. For the simulations the normal distribution model as described in Section 1.1 was assumed, where the parameters were chosen as follows:
µ(1) = (1, . . . , 1)′, µ(2) = (0, . . . , 0)′,   (30)
Σ = [σij]i,j=1,...,p with σii = 1 (i = 1, . . . , p), σij = ρ > 0 if i ≠ j (i, j = 1, . . . , p).   (31)
The number of observations was set to n = 50, the number of variables to p = 500. The values ρ = 0.1, ρ = 0.5 and ρ = 0.9 were used as pairwise correlations. In every run n + 1 data vectors were generated, including n1 = n2 = n/2 training data for every class and one additional observation to classify. For each parameter combination 10,000 runs were executed. The procedure RDA with different pre-defined values of λ was applied for classification. For comparison also the value λ0, computed from the true parameters, was used. The results are shown in Table 1 (the minimal error rates obtained are marked bold). It can be seen that under these assumptions the procedure has the lowest error rates for large λ. The error rates for λ0 are very close to the minimal values.
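For reference, the simulation setting (30)-(31) can be reproduced with a few lines of numpy; the sketch below is our own code (with ρ denoting the pairwise correlation) and is only meant to document the data-generating model:

import numpy as np

def simulate(n1, n2, p, rho, rng):
    """Equicorrelated Gaussian model (30)-(31): unit variances, mean shift 1 vs. 0."""
    sigma = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
    x1 = rng.multivariate_normal(np.ones(p), sigma, size=n1)
    x2 = rng.multivariate_normal(np.zeros(p), sigma, size=n2)
    return x1, x2

rng = np.random.default_rng(1)
x1, x2 = simulate(25, 25, 50, 0.5, rng)   # smaller p here, just for illustration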
λ          ρ = 0.1   ρ = 0.5   ρ = 0.9
10^−2      0.0748    0.2747    0.3349
10^−1.5    0.0747    0.2743    0.3341
10^−1      0.0746    0.2734    0.3307
10^−0.5    0.0737    0.2715    0.3259
10^0       0.0718    0.2668    0.3127
10^0.5     0.0637    0.2573    0.3036
10^1       0.0606    0.2484    0.3016
10^1.5     0.0582    0.2433    0.3005
10^2       0.058     0.2442    0.2996
10^2.5     0.0581    0.2447    0.2996
10^3       0.0583    0.2443    0.2995
10^3.5     0.0583    0.2441    0.2995
10^4       0.0582    0.2442    0.2994
∞          0.0582    0.2442    0.2994
λ0         0.0575    0.2445    0.2994
Table 1. Error rates obtained by simulation in dependence of λ
2.5 Comparison with other Procedures
In a further simulation study the procedures RDA and RDA* are compared with other classification procedures. Here always the value λ0, computed from the estimated parameters, was used as the regularization parameter of both procedures. The following procedures are included in the study:
1. RDA
2. RDA*
3. PCA: the well-known Principal Components Analysis method
4. The Ridge rule in the version of Läuter (1992),
SRidge := S + [p(n − 2)/((n − 4)(n + p − 3))] Diag(S [Diag(S)]⁻¹ S)   (32)
5. The Multifactor rule (Läuter (1992)),
SMultifactor = Diag(S [Diag(S)]⁻¹ S)   (33)
The simulations were based again on a normal distribution model with the following parameters: µ(1), µ(2) were chosen as above,
Σ = [σij]i,j=1,...,p with σii = 1 (i = 1, . . . , p), σij = 0.5 if i ≠ j (i, j = 1, . . . , p).   (34)
The number of training observations in each group was set to n1 = n2 = 10, and the number of features was varied as p = 5, 10, 15, 20, 50, 100, 200. Also
              Number of variables
              5       10      15      20      50      100     200
PCA           0.264   0.261   0.264   0.267   0.278   0.272   0.264
Ridge         0.285   0.279   0.272   0.261   0.250   0.253   0.245
Multifactor   0.267   0.263   0.258   0.254   0.245   0.250   0.245
RDA           0.269   0.262   0.258   0.252   0.245   0.251   0.244
RDA*          0.275   0.266   0.263   0.254   0.245   0.251   0.244
Table 2. Error rates obtained by simulation in dependence of the number of variables
as above, for each parameter combination 10,000 repetitions were run. The results of this study are shown in Table 2. Here it can be seen that the results obtained by both procedures RDA and RDA* are good compared to the competitors; RDA was slightly better than RDA*.
3 Conclusions
Based on the approach of Raudys and Skurichina (1994) we could derive two ridge-like classification methods which determine the regularization parameter λ directly. These methods are also applicable in the case n < p where the number of variables exceeds the number of observations. Because the identity matrix is used for regularization, these methods are especially suitable for similarly scaled variables. In the simulation study, where this condition was fulfilled, acceptable results were obtained, especially for large p. In comparison with other procedures the new methods also obtained good results. Therefore, at least under suitable conditions, the procedures seem to be applicable.
References
FISHER, R. A. (1936): The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
LÄUTER, J. (1992): Stabile multivariate Verfahren. Diskriminanzanalyse, Regressionsanalyse, Faktoranalyse. Akademie Verlag, Berlin.
RAUDYS, Š. (2001): Statistical and Neural Classifiers: An Integrated Approach to Design. Springer Verlag, London.
RAUDYS, Š. and SKURICHINA, M. (1994): Small sample properties of ridge estimate of the covariance matrix in statistical and neural net classification. In: E. M. Tiit, T. Kollo, H. Niemi (Eds.): New Trends in Probability and Statistics: Multivariate Statistics and Matrices in Statistics. TEV, Vilnius and VSP, Utrecht, 3:237–245.
Assessing the Trustworthiness of Clustering Solutions Obtained by a Function Optimization Scheme Ulrich Möller and Dörte Radke Leibniz Institute for Natural Products Research and Infection Biology, Hans Knöll Institute, 07745 Jena, Germany; Jena Centre for Bioinformatics
Abstract. We present a method for finding clustering structures which are good and trustable. The method analyzes re-clustering results obtained by varying the search path in the space of partitions. From the scatter of results the joint optimum of given quality criteria is determined and the re-occurrence probability of this optimum (called optimum consensus) is estimated. Then the finest structure is determined that emerged robustly with scores typical of high partition quality. When applied to tumor gene expression benchmark data the method assigned fewer tissue samples to a wrong class compared to methods based on either consensus or quality criteria.
1 Introduction
Although clustering is widely used for unsupervised classification, the goal is not universely defined (What constitutes a cluster?), a single best clustering algorithm is not available, and the problem of finding the optimal parameter set for an algorithm is usually too complex to be solved. Under these circumstances, clustering consensus is a principle that is inuitively interpretable, generally applicable, and useful to avoid several sources of bias. The least common denominator of the results from all clustering methods is the membership assigned to each datum. Therefore, membership is the most natural feature for measuring clustering consensus. For genomic (DNA microarray) data analysis addressed in this paper the methods by Monti et al. (2003) and Swift et al. (2004) are recent examples. Several common clustering methods are based on function optimization. That is, once a data set X and an objective function Q have been fixed, a unique target exists: the partition of X that represents the (global) optimum of Q. Hence, looking for consensus among arbitrary (good and poor) partitions of X seems to be counterproductive. We present an approach for assessing the trustworthiness of the best partition generated in a number of clustering trials. It is expected that a combination of both quality and consensus criteria may improve class discovery. Results are given and discussed for benchmark data representing classes of simulated and real gene expression of tumor samples.
2 Methods
Let Q be the target criterion for clustering a fixed data set X, where π∗, the best partition of X, is represented by the (global) optimum of Q. As some criteria have a trivial optimum (e.g., Q = 0 if the number of clusters is equal to the size of X), Q may be a composite of different criteria {Q1, Q2, . . .} such as an objective function of clustering and a cluster validity index (cf. Theodoridis and Koutroumbas (1999)). Then we consider the simple case where the optimum result π∗ is characterized by the joint optimum of all Qi. (This is a putative optimum, because π∗ does not necessarily exist for an arbitrary set Q.) Let {πt}, t = 1, . . . , T, be a set of partitions of X, generated by an algorithm with the goal of finding π∗: πt = ALG(X, ξt, p), where ξ is a parameter that determines the search path of ALG through the space of partitions. ξ is assumed to have no influence on the partition π∗ for a given Q. p are the remaining parameters of ALG. An estimate π̂ of the optimum partition π∗ is found if all values of Q obtained for π̂ are the best values among the T observed results. The trustworthiness of π̂ can be characterized
by its robustness under the variation of ξ (i.e., ξ1, . . . , ξt, . . . , ξT):
OC = (1/T) Σt=1,...,T I(πt ≡ π̂) · 100,
where I is the indicator function; I(true) = 1 and I(false) = 0. We call OC the optimum consensus (in contrast to membership consensus). OC has a range from 0 to 100%.
Algorithmic Options Used
ALG: fuzzy C-means (FCM) clustering algorithm; ξ: random initial partitions; p: standard objective function implemented in MATLAB, Release 13 (The Mathworks Inc.); fuzzy exponents 1.1 for the simulated data and 1.2 for the microarray data; maximum number of iterations 300; minimum objective function improvement 10^−8; number of clustering trials: T = 50; quality criteria: Q1 – FCM objective function, Q2 – Davies-Bouldin cluster validity index (DBI).
Q1 represents the sum of distances from each data point to a cluster center weighted by that data point's membership grade. Q2 is described as the average similarity between each cluster and its most similar one. We seek clusterings that minimize both Q1 and Q2. For the FCM and the DBI see Theodoridis and Koutroumbas (1999).
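The OC computation itself is straightforward once the T re-clustering results and their quality scores are available. The following sketch is our own illustration in Python (not the MATLAB implementation used in the study); scikit-learn's adjusted Rand index is used here to decide whether two label vectors describe the same partition up to label permutation. It returns the estimate π̂ and its optimum consensus:

import numpy as np
from sklearn.metrics import adjusted_rand_score

def optimum_consensus(partitions, q1, q2, tol=1e-12):
    """partitions: list of T label vectors; q1, q2: quality scores per run
    (e.g. FCM objective and DBI), both to be minimized."""
    q1, q2 = np.asarray(q1, float), np.asarray(q2, float)
    # candidate pi_hat: jointly best in both criteria among the T results
    joint_best = np.flatnonzero((q1 <= q1.min() + tol) & (q2 <= q2.min() + tol))
    if joint_best.size == 0:
        return None, 0.0          # the putative joint optimum does not exist
    pi_hat = partitions[joint_best[0]]
    same = [adjusted_rand_score(pi_hat, pt) >= 1.0 - 1e-12 for pt in partitions]
    oc = 100.0 * sum(same) / len(partitions)
    return pi_hat, oc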
3 Data sets
Since we present an unsupervised and novel classification method, the data structure should be known, thus permitting an evaluation of the method’s capabilities. Therefore, we analyzed benchmark data (Table 1).
Data set     No. of cases   No. of features   No. of classes   No. of classes, Monti et al. (2003)
Simulated6   60             600               6+               7 (HC), 6 (SOM)
Leukemia3    38             999               3                5 (HC), 4 (SOM)
Lung4        197            1000              4                5, 7 (SOM)
Table 1. Characterization of the data sets analyzed. HC = hierarchical clustering, SOM = self-organizing map, used as the basic clustering algorithm
Simulated6 : a set of 60 artificial expression profiles (Figure 1C). 50 marker genes in each class are upregulated. 300 noise genes have the same distribution in all classes. The data contain an unintentionally generated complication: in case no. 8 the first 100 rather than 50 genes are upregulated (plotted with cluster 1 in Figure 1C). Leukemia3 : 38 gene expression profiles from three classes of adult leukemia (Figure 2C). Lung4 : 197 gene expression profiles from three classes of lung cancer and one class of normal lung (Figure 3C). Genes in the leukemia and lung data sets were selected to permit the use of the phenotype (class labels) as a gold standard against which to test the clustering tool. For more details and results see Monti et al. (2003).
4 Results
OC and the DBI were used to estimate the number of clusters based on both partition robustness and partition quality. Clear and unbiased (non-random) clustering structure is expected to be identified by a low DBI and/or high optimum consensus (below also called consensus). Moreover, we consider the optimization effort required to obtain a partition, which is expected to be low if the clusters are easily distinguishable (cf. Möller (2005)). Simulated6. 100% consensus and small DBI values for partitions with two and three clusters are strong evidence of a coarse data structure (Figure 1A); the first and the first two clusters, respectively, were separated from the other data. Accordingly, no structure was recognized where consensus was lacking and the DBI values were high (results for four, five, and more than seven clusters). Between these extremes markers of finer structure were found. All results for six clusters – consensus, DBI, and clustering effort – were more similar to typical results of clear structure than to results not indicating structure (Figure 1B). If 22% consensus is regarded as significant, the seven-cluster partition was the finest structure recognized. Both partitions differ only in case no. 8, which has the features of clusters 1 and 2: in the seven-cluster partition this case forms a singleton cluster, whereas in the six-cluster partition it does not. All the other cases were correctly classified.
Fig. 1. Results for the data set Simulated6. A) Normalized DBI (y-axis) against the rescaled FCM objective function. + median, × mean, ◦ optimum consensus result. B) DBI and average number of FCM iterations against optimum consensus. Data labels denote the number of clusters. Dash-dot lines separate results of recognizable structure (lower right) from other results. C) The data of each simulated cluster superimposed
Fig. 2. Results for the data set Leukemia3. For A and B see the comments in Figure 1. C) Classification table: biological class labels representing the phenotype (rows) versus the results of the three-cluster optimum partition
Fig. 3. Results for the data set Lung4. For A and B see the comments in Figure 1. C) Classification table: biological class labels representing the phenotype (rows) versus the results of the six-cluster optimum partition
Leukemia3. Apparently, just one scenario is to be seriously considered. DBI and consensus clearly indicated the presence of three classes (Figure 2A, B). Moreover, partitioning into three clusters required the lowest clustering effort. The classes derived from the three-cluster partition are confirmed by the phenotype information underlying this data set (Figure 2C). Apart from a coarse-grained structure of two clusters that can be found for many data sets, the scatter in Figure 2A indicates a non-random partition when generating four or five clusters. However, the associated validity index values did not significantly differ from the results obtained for more than five clusters. Lung4. Here the interpretation follows the scheme described above for the simulated data. Markers of a clustering structure were found for the partitions with two, four, and six clusters based on all criteria used: DBI, consensus, and clustering effort (Figure 3A, B). Evidence of a unique structure is lacking for partitions with three, five, and more than seven clusters. The seven-cluster partition emerged robustly, but its scores for the validity index and the clustering effort resembled the results with no clear indications of structure. Hence, the finest structure well recognizable by this method has six clusters. In fact, six was the smallest number of clusters essentially distinguishing the biological classes based on the phenotype (Figure 3C).
5 Discussion
Our study showed that a combination of quality and consensus criteria applied to re-clustering results may improve the discovery of classes in a data set. This finding is supported by several arguments. Clustering results with top scores of one quality criterion may have been misinterpreted without considering their trustworthiness (see Figure 2A, 7 clusters, DBI ≈ 0). Conversely, when neglecting partition quality robust results could have been falsely regarded as the finest resolution of good clusters (Figure 3A, 7 clusters). Based on both types of criteria an adequate characterization was obtained for the simulated classes and the tumor classes based on the phenotype as the gold standard. Also from the theoretical point of view the utilization of complementary criteria is reasonable: indications of either high quality or consensus may occur spuriously based on heuristic quality criteria or heuristic choices for a multiple clustering. We compared the results of our optimum consensus (OC) method with those of the resampling-based membership consensus (MC) method by Monti et al. (2003) (see Table 1). The simulated case with the features of two clusters was consistently regarded as a singleton cluster according to MC. Our method characterized the ambivalent situation: depending on whether higher weight is given to optimum consensus or partition quality this case was interpreted as a singleton cluster or as an outlier recognizable by its fuzzy cluster membership. The given classes of leukemia data were separated by the MC method at the expence of a possible over-partitioning, whereas the OC method avoided
over-partitioning at the expense of a single misclassification. Both MC and OC provided results that enabled the analyst to substantially separate the classes of lung tissues with a slight over-partitioning, where the six-cluster OC partition contained 12 errors and the five-cluster MC partition involved 29 errors. Note that the OC method ran 50 clustering trials for the full data set, whereas the MC method performed 500 hierarchical clustering trials or 200 self-organizing map trials, each on a randomly selected 80% of the original data. Overall, both methods performed comparatively well. OC may have some advantage due to its optimization aspect. That (global) optimization methods may be superior has been observed by Swift et al. (2004) in a comparison of different methods. The representation of results as in Figure 1A and B allows the user to intuitively estimate the degree of evidence of an underlying structure at different levels of resolution. This requires a decision about the lowest degree of evidence that is regarded as significant. Automatic processing would also be desirable; for initial solutions see Möller (2005). Another question is how to balance quality and robustness for a final decision. The results of our optimum consensus method are encouraging given the high dimensionality of and the noise in the data analyzed and in comparison to other benchmark results. To our knowledge the combination of partition consensus and partition quality is a novel approach for gene expression data analysis. Optimum consensus is generally applicable to many methods for cluster analysis. To ensure that optimum consensus indeed describes a unique partition, cluster memberships have to be compared. This was done in our study, although not explicitly demonstrated. We also obtained encouraging results with other versions of the OC approach and for more comprehensive stochastic data models as well as for data generated by resampling techniques. This remains a field of our research.
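The following sketch illustrates the general idea of combining a quality index with a consensus measure over repeated clustering runs. It is not the authors' FCM-based optimum consensus method but a simplified analogue using k-means, the Davies–Bouldin index, and the adjusted Rand index as agreement measure; all names and parameters are our own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, adjusted_rand_score

def quality_and_consensus(X, k_values=range(2, 10), n_runs=20, seed=0):
    """For each number of clusters, report the best Davies-Bouldin index over
    repeated runs and the mean pairwise agreement (ARI) between the runs."""
    rng = np.random.RandomState(seed)
    summary = {}
    for k in k_values:
        labelings, dbi = [], []
        for _ in range(n_runs):
            labels = KMeans(n_clusters=k, n_init=1,
                            random_state=rng.randint(1 << 30)).fit_predict(X)
            labelings.append(labels)
            dbi.append(davies_bouldin_score(X, labels))
        agreement = [adjusted_rand_score(labelings[i], labelings[j])
                     for i in range(n_runs) for j in range(i + 1, n_runs)]
        summary[k] = {"quality": min(dbi), "consensus": float(np.mean(agreement))}
    return summary
```

Candidate numbers of clusters would then be those where the quality index is low and the consensus is high, mirroring the combined use of DBI and consensus described above.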
References
MÖLLER, U. (2005): Estimating the Number of Clusters from Distributional Results of Partitioning a Given Data Set. In: B. Ribeiro, R.F. Albrecht, A. Dobnikar, D.W. Pearson and N.C. Steele (Eds.): Adaptive and Natural Computing Algorithms. Springer, Wien, 151–154.
MONTI, S., TAMAYO, P., MESIROV, J., and GOLUB, T. (2003): Consensus Clustering: a Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Machine Learning, 52, 91–118.
SWIFT, S., TUCKER, A., VINCIOTTI, V., MARTIN, N., ORENGO, C., LIU, X., and KELLAM, P. (2004): Consensus Clustering and Functional Interpretation of Gene-Expression Data. Genome Biology, 5, R94.
THEODORIDIS, S. and KOUTROUMBAS, K. (1999): Pattern Recognition. Academic Press, San Diego.
Variable Selection for Discrimination of More Than Two Classes Where Data are Sparse
Gero Szepannek and Claus Weihs
Fachbereich Statistik, Universität Dortmund, Vogelpothsweg 87, 44225 Dortmund, Germany
Abstract. In classification, with an increasing number of variables, the required number of observations grows drastically. In this paper we present an approach to put into effect the maximal possible variable selection by splitting a K-class classification problem into pairwise problems. The principle makes use of the possibility that a variable that discriminates two classes will not necessarily do so for all such class pairs. We further present the construction of a classification rule based on the pairwise solutions by the Pairwise Coupling algorithm according to Hastie and Tibshirani (1998). The suggested procedure can be applied to any classification method. Finally, situations with lack of data in multidimensional spaces are investigated on different simulated data sets to illustrate the problem and the possible gain. The principle is compared to the classical approach of linear and quadratic discriminant analysis.
1 Motivation and Idea
In most classification procedures, the number of unknown parameters grows more than linearly with the dimension of the data. It may be desirable to apply a method of variable selection for a meaningful reduction of the set of variables used for the classification problem. In this paper an idea is presented as to how to maximally reduce the number of variables used in the classification rule in a manner of partial variable selection. To motivate this, consider the example of 5 classes distributed in a variable as shown in Figure 1. It will hardly be possible to discriminate, e.g., whether an observation is of class 1 or 2. An object of class 5, in contrast, will probably be well recognized. The following matrix (rows and columns denoting the classes) shows which pairs of classes can be discriminated in this variable:

        C2   C3   C4   C5
  C1    −    −    −    +
  C2         −    −    +          (1)
  C3              −    +
  C4                   +
Fig. 1. Example of 5 classes: estimated class densities of classes 1–5 over the value of the variable.
We conclude that, since variables may serve for discrimination of some class pairs while at the same time not doing so for others, a class pair specific variable selection may be meaningful. Therefore we propose the following procedure (a code sketch of these steps is given below):
1. Perform "maximal" variable subset selection for all K(K − 1)/2 class pairs.
2. Build K(K − 1)/2 class pairwise classification rules on possibly differing variable subspaces.
3. To classify a new object, perform K(K − 1)/2 pairwise decisions, returning the same number of pairwise posterior probabilities.
The remaining question consists in building a classification rule out of these K(K − 1)/2 pairwise classifiers.
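A minimal sketch of steps 1–3, assuming scikit-learn's LDA as the pairwise base classifier and a user-supplied variable selection function select_vars (a KS-test version is sketched in Section 3); the names are illustrative, not the authors' implementation.

```python
from itertools import combinations
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_pairwise_models(X, y, select_vars):
    """Steps 1 and 2: per class pair, select a variable subset and fit a
    two-class rule on that subspace."""
    models = {}
    for a, b in combinations(np.unique(y), 2):
        mask = np.isin(y, [a, b])
        vars_ab = select_vars(X[mask], y[mask], a, b)          # step 1
        clf = LinearDiscriminantAnalysis().fit(X[mask][:, vars_ab], y[mask])
        models[(a, b)] = (vars_ab, clf, int(mask.sum()))       # step 2 (+ n_ij)
    return models

def pairwise_posteriors(x, models):
    """Step 3: evaluate all K(K-1)/2 pairwise rules for one observation x;
    returns r[(a, b)] = estimated P(a | a or b)."""
    r = {}
    for (a, b), (vars_ab, clf, _) in models.items():
        proba = clf.predict_proba(x[vars_ab].reshape(1, -1))[0]
        r[(a, b)] = proba[list(clf.classes_).index(a)]
    return r
```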
2 Pairwise Coupling
2.1 Definitions
We now tackle the problem of finding posterior probabilities of a K-class classification problem given the posterior probabilities for all K(K − 1)/2 pairwise comparisons. Let us start with some definitions. Let p(x) = p = (p1 , . . . , pK ) be the vector of (unknown) posterior probabilities. p depends on the specific realization x. For simplicity in notation
we will omit x. Assume the "true" conditional probabilities of a pairwise classification problem to be given by

  ρij = Pr(i | i ∪ j) = pi / (pi + pj)          (2)

Let rij denote the estimated posterior probabilities of the two-class problems. The aim is now to find the vector of probabilities pi for a given set of values rij.
Example 1: Given p = (0.7, 0.2, 0.1). The ρij can be calculated according to equation (2) and presented in a matrix:

            (  .    7/9   7/8 )
  {ρij} =   ( 2/9    .    2/3 )          (3)
            ( 1/8   1/3    .  )

The inverse problem does not necessarily have a proper solution, since there are only K − 1 free parameters but K(K − 1)/2 constraints.
Example 2: Consider

            (  .    0.9   0.4 )
  {rij} =   ( 0.1    .    0.7 )          (4)
            ( 0.6   0.3    .  )
In machine learning, majority voting ("Which class wins the most comparisons?") is a well-known approach to such problems. But here it will not lead to a result, since every class wins exactly one comparison. Intuitively, class 1 may be preferable since it dominates its comparisons most clearly.
2.2 Algorithm
In this section we present the Pairwise Coupling algorithm of Hastie and Tibshirani (1998) to find p for a given set of rij. They transform the problem into an iterative optimization problem by introducing a criterion to measure the fit between the observed rij and the ρ̂ij calculated from a possible solution p̂. To measure the fit they define the weighted Kullback-Leibler distance:

  l(p̂) = Σ_{i<j} nij [ rij · log(rij / ρ̂ij) + (1 − rij) · log((1 − rij) / (1 − ρ̂ij)) ]          (5)
nij is the number of objects that fall into one of the classes i or j. The best solution p̂ of posterior probabilities is found as in Iterative Proportional Scaling (IPS) (for details on the IPS method see e.g. Bishop, Fienberg and Holland, 1975). The algorithm consists of the following three steps:
1. Start with any p̂ and calculate all ρ̂ij.
2. Repeat until convergence, cycling over i = 1, 2, . . . , K, 1, . . .:

     p̂i ← p̂i · ( Σ_{j≠i} nij rij ) / ( Σ_{j≠i} nij ρ̂ij )          (6)

   renormalize p̂ and calculate the new ρ̂ij.
3. Finally scale the solution to p̂ ← p̂ / Σ_i p̂i.
Motivation of the Algorithm: Hastie and Tibshirani (1998) show that l(p̂) increases at each step. Since it is bounded above by 0, the algorithm converges. The limit satisfies Σ_{j≠i} nij ρij = Σ_{j≠i} nij rij for every class i = 1, . . . , K if a solution p exists; p̂ and ρ̂ij are consistent. Even if the choice of l(p) as optimization criterion is rather heuristic, it can be motivated in the following way: consider a random variable nij rij, the rate of class i among the nij observations of classes i and j. This random variable can be considered to be binomially distributed, nij rij ∼ B(nij, ρij), with "true" (unknown) parameter ρij. Since the same (training) data is used for all pairwise estimates rij, the rij are not independent, but if they were, l(p) of equation (5) would be equivalent to the log-likelihood of this model (see Bradley and Terry, 1952). Then, maximizing l(p) would correspond to maximum-likelihood estimation of ρij. Going back to Example 2, we obtain p̂ = (0.47, 0.25, 0.28), a result consistent with the intuition that class 1 may be slightly preferable.
3 Validation of the Principle
In this section, the suggested procedure of pairwise variable selection combined with Pairwise Coupling [PVS] is compared to usual classification using linear and quadratic discriminant analysis [LDA, QDA]. Variable Selection: The method of variable selection in our implementation is quite simple. We used class-pairwise Kolmogorov-Smirnov tests (see Hajek, 1969, pp. 62–69) to check whether the distributions of two classes differ in a variable or not. For every class pair and every variable, the statistic

  D = max_x | F_{n_{k1}}(x) − F_{n_{k2}}(x) |          (7)

is calculated, where the F_{n_{ki}}(x) are the empirical distribution functions of class ki, i = 1, 2. A variable is taken into a pairwise model if its p-value strongly indicates differing densities. Of course, any other variable selection could be used instead.
n     Classes 1–9                  Class 10
      LDA    PVS(LDA)  QDA        LDA    PVS(LDA)  QDA
 4    0.323  0.251     –          0.698  0.596     –
 6    0.224  0.158     –          0.646  0.533     –
 8    0.187  0.136     –          0.631  0.544     –
10    0.164  0.120     –          0.621  0.524     –
15    0.141  0.116     0.790      0.584  0.533     0.818
20    0.125  0.107     0.536      0.569  0.519     0.766
50    0.105  0.098     0.214      0.543  0.533     0.632

Table 1. Averaged error rates of LDA, QDA and PVS at varying class sizes.
3.1 A First Example
Our first example is chosen according to the introductory example in Section 1 to again illustrate the problem. Data are simulated in 10 classes and 10 variables. Class i is distributed according to X ∼ N(2 · 1.64 · ei, I) if i < 10 and X ∼ N(0, I) if i = 10. Here ei represents the standard basis vector, 0 is the zero vector and I is the identity matrix. This means two classes k ≠ l, k, l < 10, differ in their distributions in only 2 variables (k and l). Class 10 can be discriminated from any other class i only in variable i. By construction, no variable can be omitted. For that reason, variable selection using usual discriminant analysis will not remove any of the variables. We computed the results for LDA and QDA and compared them to the pairwise variable selection using LDA. We ran simulations with varying (equal) class sizes in the training data to investigate the effect of sparse data. In the test data each class contains 50 objects. Error rates are averaged over 50 repetitive trials. The results are given in Table 1. QDA classification rules can only be built if there is enough data. Even at larger class sizes QDA error rates are very high. If there are few observations, the PVS approach shows strong advantages compared to usual LDA. For larger class sizes the differences in the error rates of both methods seem to vanish.
3.2 Differing Variances
We now extend the situation of the first example. In real life it may be possible that one is confronted with data where one of the classes is strongly concentrated in a specific variable. Of course, this class can be more easily identified by its realizations in this variable. LDA will fail to detect this, since it pools all classes' covariances. We modelled this situation with data consisting of 10 classes and 10 variables. Class i is distributed following X ∼ N(ei, Σ) with Σ being the identity except that (Σ)ii := 0.1. An illustration of the phenomenon is given in Figure 2.
Fig. 2. Example of unequal variances and their pooled estimators (estimated densities over x; panels: "Unequal variances" and "Pooled variances").
Intuitively, QDA seems to be more appropriate in this situation. The results for varying training data sizes are shown in Figure 3. Astonishingly, LDA still shows smaller error rates than QDA here. For QDA, there does not seem to be enough data. Both methods can be largely improved by a class-pairwise variable selection using QDA. The line with the smallest error rates is a reference line for the case that the "perfect" subset of variables is always found. Surprisingly, the KS-test yields results that are very close to that line. Note, however, that such variable selection will fail to detect correlations between variables.
3.3 Waveform Data
In order to obtain more general results we also wanted to apply the method to a common and well-known classification problem. We chose the Waveform data introduced by Breiman et al. (1984). The problem consists of simulated data of three classes and 21 variables. Three waveforms over the variables (indexed by j) are given by h1(j) = max(6 − |j − 11|, 0), h2(j) = h1(j − 4) and h3(j) = h1(j + 4); h2 (h3) is equal to h1 but shifted to the left (right). For each object, its mean lies uniformly distributed between two of these waveforms over all 21 variables. The objects (depending on their class memberships) are given by

  class 1:  Xj = U h1(j) + (1 − U) h2(j) + εj          (8)
Fig. 3. Averaged error rates on test data over varying class size, for LDA, QDA, PVS with KS-test selection, and PVS with the optimal variable subset.
                   LDA    PVS(LDA)  QDA    PVS(QDA)  Bayes risk
New Simulation     20.02  16.96     21.31  19.77     20.5  14.9
Breiman's results  19.1

Table 2. Averaged error rates over 100 trials of Waveform data.
  class 2:  Xj = U h1(j) + (1 − U) h3(j) + εj          (9)
  class 3:  Xj = U h2(j) + (1 − U) h3(j) + εj          (10)
where Xj denotes variable j, U is an object-specific uniform random number from the interval [0, 1], and εj is an additional iid standard normal error. For classes 2 and 3 the combination of waveforms changes. The training data consists of 300 observations, each class having equal prior probabilities. The test data has 500 observations. We simulated 100 repetitions and averaged the error rates. It can be seen in Table 2 that the results of both linear and quadratic discriminant analysis can be improved by using the PVS approach.
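A sketch that simulates the Waveform data exactly as in Eqs. (8)–(10); function and argument names are our own.

```python
import numpy as np

def waveform_data(n_per_class, seed=0):
    """Simulate Breiman's waveform data: 21 variables, classes 1-3, Eqs. (8)-(10)."""
    rng = np.random.default_rng(seed)
    j = np.arange(1, 22)
    h1 = np.maximum(6 - np.abs(j - 11), 0)
    h2 = np.maximum(6 - np.abs((j - 4) - 11), 0)    # h2(j) = h1(j - 4)
    h3 = np.maximum(6 - np.abs((j + 4) - 11), 0)    # h3(j) = h1(j + 4)
    combos = {1: (h1, h2), 2: (h1, h3), 3: (h2, h3)}
    X, y = [], []
    for c, (ha, hb) in combos.items():
        u = rng.uniform(size=(n_per_class, 1))              # U ~ Uniform[0, 1]
        eps = rng.standard_normal((n_per_class, 21))        # iid standard normal noise
        X.append(u * ha + (1 - u) * hb + eps)
        y += [c] * n_per_class
    return np.vstack(X), np.array(y)

X_train, y_train = waveform_data(100)   # 100 objects per class = 300 training observations
```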
3.4 Additional Remarks
Situations where a class-pairwise variable selection does not lead to a further reduction of the variable space compared to a K-class overall variable selection have not been investigated yet.
In K-class LDA the posterior probabilities are found by normalizing the density estimates of the classes, given the observation x. Therefore, the conditional pairwise posterior probabilities for two classes using K-class LDA will be the same as in the pairwise approach, except in situations of different covariances of the classes: the covariance of PVS-LDA is pooled over only two instead of all K class covariances. In the case of QDA there should not be any such difference between the K-class and pairwise classification rules, since all covariances are estimated separately.
4 Summary
A principle is suggested to perform the maximal possible variable selection by splitting a K-class classification problem into K(K − 1)/2 two-class problems, and an algorithm is presented to build a classification rule from the pairwise results. This principle can be applied to any classification method and any variable selection procedure. The method is investigated on different simulated data sets using (linear and quadratic) discriminant analysis and compared to the original methods. A gain in classification error rate can be noticed, especially if the number of observations is not very large. Additionally, the pairwise variable subset selection can give interpretational insight into which variables characterize the differences between two classes. On the other hand, the computation time grows since K(K − 1)/2 classification models have to be built. Also, for each object the classification rule has to be evaluated by the Pairwise Coupling algorithm.
References
BISHOP, Y., FIENBERG, S. and HOLLAND, P. (1975): Discrete Multivariate Analysis. MIT Press, Cambridge.
BRADLEY, R. and TERRY, M. (1952): The Rank Analysis of Incomplete Block Designs, I. The Method of Paired Comparisons. Biometrics, 324–345.
BREIMAN, L., FRIEDMAN, J., OLSHEN, R. and STONE, C. (1984): Classification and Regression Trees. Chapman & Hall, NY.
HAJEK, J. (1969): A Course in Nonparametric Statistics. Holden Day, San Francisco.
HASTIE, T. and TIBSHIRANI, R. (1998): Classification by Pairwise Coupling. Annals of Statistics, 26(1), 451–471.
SCOTT, D. (1992): Multivariate Density Estimation. Wiley, NY.
The Assessment of Second Primary Cancers (SPCs) in a Series of Splenic Marginal Zone Lymphoma (SMZL) Patients
Stefano De Cantis and Anna Maria Taormina
1 Dipartimento di Metodi Quantitativi per le Scienze Umane, Viale delle Scienze, Edificio 13, Università degli Studi di Palermo, 90128 Palermo
2 Dipartimento di Scienze Statistiche e Matematiche "Silvio Vianelli", Viale delle Scienze, Edificio 13, Università degli Studi di Palermo, 90128 Palermo
Abstract. The purpose of this study is to estimate the risk of second primary cancer (SPC) in 129 consecutive patients with splenic marginal zone lymphoma (SMZL) diagnosed in three Italian haematological centres. The person-years method, in which the expected number is derived as a sum of products of age- and sex-specific rates and the corresponding time at risk, was used. The SPC Standardized Incidence Ratio (SIR) was 2.03 with a 95% confidence interval of [1.05, 3.56] (p < 0.05), and the corresponding Absolute Excess Risk (AER) was 145.8 (per 10000 SMZL patients per year). Our findings evidence a high frequency of additional cancers in patients with SMZL and suggest that the incidence rate of SPCs is significantly different from that expected in the general population.
1 Introduction
Second primary cancers (SPCs) have become an issue of extensive research (Neugut et al., 1999). A critical factor determining the increased risk of developing an SPC is the enhanced probability of surviving the first neoplasm. For example, if cancer were a disease with no associated mortality and were evenly distributed throughout the population, then, assuming a lifetime cumulative incidence of approximately 33%, one would expect that 1 in 9 people would develop two primary cancers in their lifetime. Although chance or random distribution probably plays the most important role in second malignancies, a statistically elevated association between two tumour types may point to a specific aetiology (Rheingold et al., 2002). In addition to shared environmental risk factors, lifestyle factors (such as smoking, alcohol, exercise, and diet), genetic predisposition and treatment-related malignant effects, two other critical and more complex (from a statistical point of view) factors are: the
The paper is the result of a joint work of both authors; however, S. De Cantis wrote Sections 1, 2 and 4, while A.M. Taormina wrote Section 3. Special thanks go to Emilio Iannitto (MD) and his workgroup for some basic ideas inside this paper, for making available the dataset and for his valuable suggestions and comments.
probability of surviving a first neoplasm and detection bias (seen with the enhanced long-term surveillance for second malignancies in cancer survivors). Splenic marginal zone lymphoma (SMZL) is an infrequent neoplasm (Franco et al., 2003) whose peak of incidence occurs in the 7th decade of life. SMZL pursues a rather indolent clinical course, with an estimated 2/3 of patients alive 5 years after diagnosis and a median survival exceeding 10 years. The long median survival time of SMZL patients provides enough time for a second malignancy to develop. Up to 50% of patients die from causes not related to the lymphoma and some die from secondary cancers. However, the issue of the risk of developing an SPC has not yet been specifically addressed. The aim of our study was to investigate the frequency of additional cancers in a series of 129 patients consecutively diagnosed with SMZL in three Italian centres (Palermo, Reggio Calabria, Verona). We assessed whether the number of SPCs observed in our series was greater than that expected on the basis of the incidence of all cancers in the general population, as calculated from Italian cancer registers.
2 Observed-to-expected Number of SPCs
SPC was defined as any invasive neoplasm occurring more than 3 months after SMZL diagnosis. A malignancy was considered concurrent when it was diagnosed in a period ranging from 3 months before to 3 months after SMZL diagnosis. Malignancies detected more than 3 months before the diagnosis of SMZL were considered preceding. The time at risk for a second malignancy was defined as the time from diagnosis of SMZL to the occurrence of a second malignancy, death, or last contact; accordingly, time to SPC was defined as the time interval between SMZL diagnosis and SPC diagnosis. Survival analyses were calculated by the Kaplan-Meier method, and the cumulative probability of an SPC, an event in the presence of other competing events, was estimated according to the competing risks approach (Kalbfleisch and Prentice (1980), Marubini and Valsecchi (1995)). If the probability of a given outcome is related to time, outcome measures (the SPC incidence rate) are affected by the length of the observational period. For this reason, the SPC incidence rate must be the number of persons who develop a second cancer over a defined interval of time or age divided by the corresponding person-years at risk. A person-year is defined as the equivalent of the experience of one individual for one year. All subjects contribute as many years as they have actually been observed (or exposed). By determining for each individual the amount of observation time contributed to a given age × calendar period category and summing up the contributions over all cohort members, we obtained the total number of person-years of observation in that category (Breslow and Day, 1987).
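A minimal sketch of this person-year bookkeeping (a Lexis-type split of one patient's time at risk into age × calendar period cells); the age and period breaks follow Table 1 below, while the example patient values are purely illustrative.

```python
import numpy as np

AGE_BREAKS = [35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
PERIOD_BREAKS = [1983, 1988, 1993, 1998, 2004]

def person_years(age_at_entry, follow_up, entry_year,
                 ages=AGE_BREAKS, periods=PERIOD_BREAKS):
    """Person-years contributed by one patient to each age x period cell
    (all times in decimal years)."""
    table = np.zeros((len(ages) - 1, len(periods) - 1))
    end_year = entry_year + follow_up
    for i in range(len(ages) - 1):
        for j in range(len(periods) - 1):
            # overlap of follow-up, age class i and calendar period j on the calendar axis
            lo = max(entry_year, entry_year + ages[i] - age_at_entry, periods[j])
            hi = min(end_year, entry_year + ages[i + 1] - age_at_entry, periods[j + 1])
            table[i, j] = max(0.0, hi - lo)
    return table

# illustrative patient: diagnosed mid-1995 at age 63.2 and followed for 6.4 years
print(person_years(63.2, 6.4, 1995.5).sum())   # total of 6.4 person-years
```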
                          Calendar Period
Age Classes   1983–1988            1988–1993            1993–1998            1998–2004
35–40         p^k_11, r^k_11       p^k_12, r^k_12       p^k_13, r^k_13       p^k_14, r^k_14
40–45         p^k_21, r^k_21       p^k_22, r^k_22       p^k_23, r^k_23       p^k_24, r^k_24
...           ...                  ...                  ...                  ...
80–85         p^k_10,1, r^k_10,1   p^k_10,2, r^k_10,2   p^k_10,3, r^k_10,3   p^k_10,4, r^k_10,4

Table 1. Sex-, calendar period- and age-specific person-years (p^k_ij) and rates (r^k_ij).
A comparison of the SPC observed incidence with the expected incidence of second malignancies (O/E) provides the Standardized Incidence Ratio (SIR), an estimate of the risk of developing a second malignancy in a given cancer survivor. For example, an SIR of 2.0 would indicate that an individual diagnosed with a given malignancy has twice the risk of developing a second malignancy of a given type, compared with a similarly aged member of the general population of the same sex (Rheingold et al., 2002). The Absolute Excess Risk (AER) was determined by subtracting the expected number from the observed number of second cancers and then dividing the difference by the number of person-years at risk. The number of excess second cancers was expressed per 10000 SMZL patients per year. The expected number of subjects with SPC was calculated by using cancer incidence rates specific for 5-year age classes and sex, from calendar-period specific incidence rates estimated for the Italian population. As there is no tumour register covering the whole Italian area, we relied on established local registers, in particular the Ragusa register for patients from Southern Italy (Palermo and Reggio Calabria units) and the Parma register for patients from Northern Italy (Verona unit). Calendar periods were 1983 to 1987 (Zanetti and Crosignani, 1992), 1988 to 1992 (Zanetti et al., 1997), 1993 to 1997 (Zanetti et al., 2002) and 1998 to 2003. To obtain estimates for the specific rates in the calendar period from 1998 to 2003 (unpublished and at least partially unknown data at the present date), we forecast a linear trend from the incidence rates of the previous calendar periods. Let gender, age class and calendar period be labelled by k, i, j, respectively. In particular, k = 1, 2 (1: male, 2: female), i = 1, 2, . . . , 10 (1: 35–40 years, 2: 40–45 years, . . . , 10: 80–85 years) and j = 1, 2, 3, 4 (1: 1983–1988, 2: 1988–1993, 3: 1993–1998, 4: 1998–2004). Let p^k_ij be the number of person-years at risk for patients (male or female) in each 5-year age and period class, and let r^k_ij be the corresponding sex-, period- and age-specific rate (see Table 1). Let E^k (k = 1, 2) be the expected number of SPCs in SMZL patients according to the specific rates obtained from geographically homogeneous local registries. It holds that:

  E^k = Σ_i Σ_j p^k_ij × r^k_ij,    k = 1, 2.
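In code, with the person-year tables and the matching registry rates stored as arrays of the same shape, the expected numbers reduce to an element-wise product (a sketch; the rate values themselves would come from the registries):

```python
import numpy as np

def expected_spcs(p_years, rates):
    """E^k = sum_i sum_j p^k_ij * r^k_ij, for each sex k."""
    return {k: float(np.sum(p_years[k] * rates[k])) for k in p_years}

# p_years["male"], rates["male"], ... are 10 x 4 arrays (age classes x calendar periods)
```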
Fig. 1. Event-free survival (EFS) and cumulative incidence of second primary cancers for 129 SMZL patients
The SPC Standardized Incidence Ratio (SIR), or observed-to-expected number, is simply SIR^k = O^k / E^k, where O^k is the observed number of second cancers in SMZL (male and female) patients. Tests of significance for the standardized cancer incidence ratio and the corresponding confidence interval (CI) were calculated assuming that the observed number of SPCs follows a Poisson distribution (for details, see Breslow and Day (1987)). The exact Poisson confidence interval limits for the SIR can be expressed in the form SIR_L = SIR × M_L and SIR_U = SIR × M_U, where M_L and M_U are multipliers determined by α and O, the observed number of second cancers; M_L and M_U for different confidence limit factors are tabulated in Haenszel et al. (1962). A two-sided test was used to test the equality of the observed and expected number of cancers. R 2.1.1, a language and environment for statistical computing, and in particular the survival package and the cmprsk package (Gray (1988), Fine and Gray (1999)), were used for these calculations (http://cran.r-project.org).
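A sketch of these computations. Instead of the tabulated multipliers of Haenszel et al. (1962) it uses the equivalent chi-square representation of the exact Poisson limits, and the two-sided p-value is a simple exact Poisson test; with the observed and expected numbers reported below (O = 12, E = 5.9229, 416.79 person-years) it should approximately reproduce the SIR, CI and AER quoted in this paper.

```python
from scipy.stats import chi2, poisson

def sir_analysis(observed, expected, person_years, alpha=0.05):
    """Standardized incidence ratio with exact Poisson CI, two-sided p-value
    and absolute excess risk per 10000 patients per year."""
    sir = observed / expected
    lo = chi2.ppf(alpha / 2, 2 * observed) / (2 * expected) if observed > 0 else 0.0
    hi = chi2.ppf(1 - alpha / 2, 2 * (observed + 1)) / (2 * expected)
    p = 2 * min(poisson.cdf(observed, expected), poisson.sf(observed - 1, expected))
    aer = (observed - expected) / person_years * 10000
    return sir, (lo, hi), min(p, 1.0), aer

print(sir_analysis(12, 5.9229, 416.79))
```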
3 Data and Results
From January 1988 to December 2003, 129 adult patients were consecutively diagnosed with SMZL in three Italian haematological centres (University of Palermo, University of Verona, and Reggio Calabria General Hospital). The clinical information was retrospectively obtained from the patients' medical records. We also asked the cooperating centres to provide complete information on the dates of diagnosis of the additional cancers, the nature of the tumours, the SMZL status at the time of the SPC diagnosis, the initial therapy for SMZL, and the patient follow-up. Only patients with histologically proven additional cancers were considered for the study. For clinical details (patients' characteristics, clinical and biological characteristics on diagnosis, therapy, and survival data), see Iannitto et al. (2005), while event-free survival (EFS) is reported in Fig. 1.
Second cancer         SIR (O/E)          CI             p-value
All types             2.03 (12/5.9229)   [1.05, 3.55]   < 0.05
Genitourinary tract   3.70 (4/1.0801)    [1.01, 9.48]   < 0.05
Lung                  9.16 (4/0.4365)    [1.41, 13.25]  < 0.05

Centre- and gender-specific SIR (O/E):
Second cancer         PA               RC               VR               Male             Female
All types             1.84 (4/2.1741)  1.40 (2/1.4317)  2.59 (6/2.3171)  1.75 (8/4.5815)  2.98 (4/1.3414)
Genitourinary tract   2.63 (1/0.3801)  4.71 (2/0.4246)  3.63 (1/0.2754)  2.87 (3/1.0447)  15.30 (1/0.0654)
Lung                  7.31 (2/0.2735)  6.13 (1/0.1630)  2.98 (1/0.3361)  4.21 (3/0.7125)  16.63 (1/0.0601)

Table 2. Standardized Incidence Ratios (SIRs), corresponding confidence intervals (CIs) and p-values to estimate the risk of second cancers in SMZL patients, for all types and for selected cancers. Centre-specific (PA: Palermo, RC: Reggio Calabria, VR: Verona) and gender-specific SIRs are also reported.
Seventeen additional cancers were detected; five were previous malignancies and twelve were SPCs. The average time from SMZL diagnosis to SPC detection was 36.3 months (range: 3.7 - 69.5 months), and the 3- and 5-year cumulative incidence rates of SPCs were 5.5% and 18.3%, respectively (see Fig.1). The mean age on diagnosis of SMZL patients who developed an SPC was not significantly different from that of patients who remained free of SPC (67.7 vs 64.7, p > 0.1). Of the 17 additional cancers recorded, only those that developed at least 3 months after the diagnosis of SMZL were included in the analysis (4 pulmonary, 4 genitourinary, 1 breast, 1 rectal, 1 endometrial, 1 hepatic). The follow-up of the 129 cohort members was based on a total of 416.79 person-years. A higher risk of developing an SPC was identified when considering all SPCs (SIR: 2.03; 95% CI: [1.05, 3.55]; p < 0.05; AER: 145.8). A higher risk was also detected when considering genitourinary tract cancers alone (SIR: 3.70; 95% CI: [1.01, 9.48]; p < 0.05; AER: 70.1), but the risk was mainly confined to female patients. A greater risk of developing a lung cancer was detected (SIR: 9.16; 95% CI: [1.41, 13.25]; p < 0.05; AER: 85.5). The risk was high for both male and female patients (SIR: 4.21 and 16.63, respectively). Our findings evidence a high frequency of additional cancers in patients with SMZL and suggest that the incidence of SPCs is significantly different from that expected in the general population. The frequency of cases with genitourinary tract and lung malignancies in our series is higher than expected. Although confirmatory data are needed, it is our opinion that SMZL patients are at risk of second cancer and should be carefully investigated on diagnosis and monitored during the follow-up.
4 Discussion
The present study is a retrospective analysis of the frequency of additional cancers in a series of SMZL patients. Our data indicate a 5-year cumulative incidence rate of 13% additional cancers, a higher than expected risk of developing an SPC (SIR = 2.03) and a very high SPC absolute excess risk per 10000 SMZL patients per year (145.8). This issue has drawn much attention in recent years. SPCs have been reported in SMZL patients (Parry-Jones et al. (2003), Mulligan et al. (1991)).
Fig. 2. Gender- and age-specific cancer incidence rates from the Ragusa and Parma registries for different calendar periods: 1983-1987 (dashed line), 1988-1992 (dot-dashed line), 1993-1997 (long-dashed line) and 1998-2003 (dotted line).
However, this report is, to our knowledge, the first to focus on the risk of developing a second malignancy among SMZL patients. In a study on NHL treatment, an international cohort of 6,171 patients was retrospectively examined; approximately 1 in 5 patients developed a second cancer (Travis et al., 1993). The investigators concluded that NHL patients continue to be at significantly elevated risk of developing an SPC for up to two decades following their first cancer diagnosis. The exposure to cytotoxic therapy, the improvement in early detection due to the extensive and recurring use of imaging techniques to follow up lymphoma patients, and the prolonged survival could contribute significantly to the observed risk of a second cancer. However, genetic predisposition or some other common cause such as environmental factors shared by both primary and secondary cancer may also play a major role, as has been demonstrated in some histotypes (Neugut et al. (1999), Howe (2003)). As far as the SPC histotype distribution is concerned, we observed an excess of second genitourinary and lung cancers. The incidence of these cancers proved to be higher than expected, as calculated on tumour register data. Among the possible long-term side effects of cancer therapy, the development of an SPC is one of great concern. An association between genitourinary urologic and lymphoproliferative disorders has already been reported (Mulligan et al. (1991), Travis et al. (1995)). Cyclophosphamide metabolites have been alleged to play a major
role, at least in bladder cancer, although relatively high doses are needed and only one case of bladder cancer was reported in a recent series of 2,837 large B-cell lymphomas (Andre et al., 2004). An increased incidence of secondary lung cancers has also been reported in patients with lymphoid malignancies. The increased risk of secondary lung cancer has been shown to be linked to alkylating agents in a dose-dependent fashion and regardless of treatment category. The unexpected finding of a higher risk of developing a lung cancer in patients treated with chemotherapy than in those treated with radiotherapy or with both modalities has also been reported (Kaldor et al., 1992). It is noteworthy that cigarette smoking is a well-known risk factor for the development of bladder cancer (Khan et al., 1998). Smoking has also been debated as a possible risk factor for the development of malignant lymphomas, particularly for follicular NHL (Peach and Barnett, 2001). Regrettably, information regarding smoking habits was not systematically reported in our medical records. Although the role of shared aetiological factors remains unclear, the pattern of excess cancers in this SMZL series does not support the hypothesis that therapy played a major role. Even if it is only a hypothesis, we think that an influence of an immunodeficiency associated with the lymphoma itself could be worth exploring. In conclusion, we documented a high frequency of additional cancers in a series of SMZL patients. The limited size of the sample suggests a note of caution until this observation has been verified in other studies. Due to the study design we cannot comment on the relative risk of developing an SPC in SMZL patients as compared to those with other types of lymphoma, or on the role of therapy. With these shortcomings in mind, we think that data on the frequency of additional neoplasms may be valuable in terms of patient management. Given the prolonged survival of patients with SMZL, it is important for physicians to be alert to the occurrence of second cancers, particularly when new symptoms or physical findings arise. We suggest that in SMZL a clinically reasonable check-up for a secondary concomitant neoplasia should be carried out on diagnosis, particularly in elderly patients. Further studies should address the role of shared risk factors, host determinants, gene-environment interactions, and other influences.
References
ANDRE, M. and MOUNIER, N. et al. (2004): Second cancers and late toxicities after treatment of aggressive non-Hodgkin's lymphoma with the ACVBP regimen: a GELA cohort study on 2837 patients. Blood, 103, 1222–1228.
BRESLOW, N.E. and DAY, N.E. (1987): Statistical Methods in Cancer Research. Volume 2. IARC, Lyon.
FINE, J.P. and GRAY, R.J. (1999): A proportional hazards model for the subdistribution of a competing risk. JASA, 94, 496–509.
FRANCO, V., FLORENA, A. and IANNITTO, E. (2003): Splenic marginal zone lymphoma. Blood, 101, 2464–2472.
GRAY, R.J. (1988): A class of K-sample tests for comparing the cumulative incidence of a competing risk. Annals of Statistics, 16, 1141–1154.
HAENSZEL, W., LOVELAND, D. and SIRKEN, M.G. (1962): Lung cancer mortality as related to residence and smoking histories. Journal of the National Cancer Institute, 28, 947–1001.
HOWE, H.L. (2003): A review of the definition for multiple primary cancers in the United States. North American Association of Central Cancer Registries, 1–40.
IANNITTO, E. and MINARDI, V. et al. (2005): Assessment of the frequency of additional cancers in patients with splenic marginal zone lymphoma. European Journal of Haematology, unpublished.
KALBFLEISCH, J.D. and PRENTICE, R.L. (1980): The Analysis of Failure Time Data. Wiley, New York.
KALDOR, J.M. and DAY, N.E. et al. (1992): Lung cancer following Hodgkin's disease: a case-control study. Int. J. Cancer, 52, 677–681.
KHAN, M.A. and TRAVIS, L.B. et al. (1998): P53 mutations in cyclophosphamide-associated bladder cancer. Cancer Epidemiol. Biomarkers Prev., 7, 397–403.
MARUBINI, E. and VALSECCHI, M.G. (1995): Analyzing Survival Data from Clinical Trials and Observational Studies. Wiley, New York.
MULLIGAN, S., MATUTES, E. and DEARDEN, C. (1991): Splenic lymphoma with villous lymphocytes: natural history and response to therapy in 50 cases. Br. J. Haematol., 78, 206–209.
NEUGUT, A.I., MEADOWS, A.T. and ROBINSON, E. (1999): Multiple Primary Cancers. Lippincott Williams & Wilkins, Philadelphia.
PARRY-JONES, N. and MATUTES, E. et al. (2003): Prognostic features of splenic lymphoma with villous lymphocytes: a report on 129 patients. Br. J. Haematol., 120, 759–764.
PEACH, H.G. and BARNETT, N.E. (2001): Critical review of epidemiological studies of the association between smoking and non-Hodgkin's lymphoma. Hematological Oncology, 19, 67–80.
RHEINGOLD, S.R., NEUGUT, A.I. and MEADOWS, A.T. (2002): Secondary cancer: incidence, risk factors and management. In: Holland, J.F. and Frei, E. (Eds.): Cancer Medicine. BC Decker Inc., Hamilton (Canada).
TRAVIS, L.B. and CURTIS, E.L. et al. (1993): Second cancers among long-term survivors of non-Hodgkin's lymphoma. Journal of the National Cancer Institute, 85, 1932–1937.
TRAVIS, L.B. and CURTIS, E.L. et al. (1995): Bladder and kidney cancer following cyclophosphamide therapy for non-Hodgkin's lymphoma. Journal of the National Cancer Institute, 87, 524–530.
ZANETTI, R. and CROSIGNANI, P. (Eds.) (1992): Cancer in Italy. Incidence Data from Cancer Registries 1983-1987. Lega Italiana per la Lotta contro i Tumori e Associazione Italiana di Epidemiologia, Torino.
ZANETTI, R., CROSIGNANI, P. and ROSSO, S. (Eds.) (1997): Cancer in Italy: 1988-1992. Il Pensiero Scientifico, Roma.
ZANETTI, R. and GAFA, L. et al. (Eds.) (2002): Cancer in Italy. Incidence Data from Cancer Registries. Third volume: 1993-1998. Il Pensiero Scientifico Editore, Roma.
Heart Rate Classification Using Support Vector Machines
Michael Vogt, Ulrich Moissl, and Jochen Schaab
1 Institute of Automatic Control, Darmstadt University of Technology, 64283 Darmstadt, Germany
2 Institute of Flight Systems and Automatic Control, Darmstadt University of Technology, 64287 Darmstadt, Germany
Abstract. This contribution describes a classification technique that improves the heart rate estimation during hemodialysis treatments. After the heart rate is estimated from the pressure signal of the dialysis machine, a classifier decides if it is correctly identified and rejects it if necessary. As the classifier employs a support vector machine, special interest is put on the automatic selection of its user parameters. In this context, a comparison between different optimization techniques is presented, including a gradient projection method as the latest development.
1 Heart Rate Estimation
Hemodialysis is the treatment of choice for permanent kidney failure. Blood is taken from the body via an artificial vascular access and pumped through a special extracorporeal filter (dialyzer) which removes harmful wastes and excess water, see Fig. 1. A major problem in hemodialysis is the unphysiologically high rate of fluid removal from the blood compartment, which leads to hypotensive crises and cardiovascular instability in about 30 % of all treatments.
Fig. 1. Detection of the heart pressure signal in the extracorporeal circuit of a dialysis machine.
Fig. 2. Features used for classification of determined HR values: a) normalized amplitude spectrum |X(n)| over frequency, with the height of the second largest peak (Feature 1); b) area under the spectrum between 0.5 and 3 Hz (Feature 2); c) heart rate over time together with the average heart rate; their difference (Feature 3) flags sudden deviations, e.g. due to patient movement.
For early detection of critical episodes the heart rate (HR) can be used as an indicator of the patient's cardiovascular state (Wabel et al. (2002)). As indicated in Fig. 1, the heart pressure signal can be picked up in the extracorporeal blood line after filtering out the pressure signal of the blood pump (Moissl et al. (2000)). The HR is then determined by searching for the maximum in a Fourier spectrum of the heart signal over a time window of 10 seconds. As the S/N ratio is very low, the HR determination may occasionally be incorrect due to noise in the spectrum arising from patient movement or inadequate pump signal filtering. Therefore each HR estimate is classified by means of three features to decide whether it should be accepted or rejected. These features are based on the Fourier spectrum, as depicted in Fig. 2. Feature 1 is the height of the second largest peak in the normalized spectrum. The idea behind this feature is that in the absence of noise there will only be a peak at the HR but almost no power at the remaining frequencies, resulting in a low value of feature 1. Feature 2 is the area under the spectrum in the range of 0.5 to 3 Hz (corresponding to an HR between 30 and 180 beats per minute). Again, in the absence of noise this feature is very low. Feature 3 detects sudden jumps in the HR, which are unlikely to occur in reality. Figure 2 (c) shows the estimated HR over time and a low-pass filtered average HR; the feature is defined as the difference between both values. At t = 20 sec the HR suddenly drops from 80 to 40 bpm, indicating that the peak of the heart signal in the spectrum is no longer the maximum, but rather a noise component in the low frequency range. Only a few seconds later the noise decreases and the HR is determined correctly again. These three features are chosen to maintain transparency in the underlying reasoning principles, which is preferable in medical applications. Therefore they were chosen from a set of 15 different features in order to gain highest acceptance by a group of medical doctors. The following sections show how a classifier decides, based on these features, whether the heart rate is correctly estimated or not.
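A sketch of the three feature computations for one 10-second window (names and the exact peak-picking details are our own; in particular, the second largest peak is taken among the local maxima of the band-limited, normalized spectrum):

```python
import numpy as np
from scipy.signal import find_peaks

def hr_features(spectrum, freqs, hr_estimate, hr_average):
    """Features 1-3 from the amplitude spectrum of one 10 s window;
    hr_estimate and hr_average (low-pass filtered) are in bpm."""
    band = (freqs >= 0.5) & (freqs <= 3.0)            # 30 ... 180 bpm
    x = spectrum[band] / np.max(spectrum[band])       # normalize to the main peak
    peaks, _ = find_peaks(x)
    heights = np.sort(x[peaks])[::-1]
    f1 = heights[1] if heights.size > 1 else 0.0      # height of 2nd largest peak
    f = freqs[band]
    f2 = np.sum(0.5 * (x[:-1] + x[1:]) * np.diff(f))  # area under the spectrum
    f3 = abs(hr_estimate - hr_average)                # sudden HR jumps
    return f1, f2, f3
```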
Fig. 3. Separating two overlapping classes (y_i = ±1) with a linear decision function: decision boundary, margin m, slack variable ξ_i, and support vectors in the (x_1, x_2) plane.
2 Support Vector Machine Classification
A two-class classification problem is defined by the data set {(x_i, y_i)}, i = 1, . . . , N, with the feature vectors x_i and the class labels y_i ∈ {−1, 1}. Linear Support Vector Machines (SVMs) aim to find the coefficient vector w and the bias term b of a maximally flat decision function f(x) = w^T x + b. Positive values f > 0 indicate class +1, negative ones (f < 0) class −1. Consequently, the decision boundary is given by f(x) = 0. In the case of separable classes, maximal flatness corresponds to a maximal margin m = 2/√(w^T w) between the two classes (Schölkopf and Smola (2002)). For non-separable classes, margin errors are considered by slack variables ξ_i ≥ 0 measuring the distance between the margin and the data, see Fig. 3. This concept leads to the soft margin classifier:

  min_{w,ξ}  J_p(w, ξ) = (1/2) w^T w + C Σ_{i=1}^N ξ_i          (1a)
  s.t.       y_i (w^T x_i + b) ≥ 1 − ξ_i                          (1b)
             ξ_i ≥ 0,  i = 1, . . . , N                            (1c)
where C is a user parameter describing the trade-off between maximal margin and correct classification. For unbalanced classes or asymmetric costs, different C values for both classes are advantageous, which can be expressed by data-dependent constants C_i with

  C_i = C_+ = C        if y_i = +1
  C_i = C_− = r · C    if y_i = −1          (2)

Here r is a user-defined weighting factor. The solution of the primal quadratic program (1) is found by solving its dual optimization problem, which yields the Lagrange multipliers α_i ∈ [0, C_i] of the primal constraints. With these multipliers, f can be written in its support vector expansion

  f(x) = Σ_{α_i ≠ 0} α_i y_i K(x, x_i) + b .          (3)
Those x_i corresponding to α_i ≠ 0 are called support vectors and typically comprise only a small fraction of the data set. K(x, x′) is a kernel function (Schölkopf and Smola (2002)) that introduces nonlinearity into the model. For linear SVMs, it is simply the scalar product K(x, x′) = x^T x′. However, the dual formulation allows employing a variety of nonlinear functions such as Gaussians

  K(x, x′) = exp( −‖x − x′‖² / (2σ²) ) .          (4)

Applying Lagrangian theory to the primal problem (1) leads to the following dual problem which has to be solved for the multipliers α:
  min_α  J_d(α) = (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j K(x_i, x_j) − Σ_{i=1}^N α_i          (5a)
  s.t.   0 ≤ α_i ≤ C_i,   i = 1, . . . , N                                                        (5b)
         Σ_{i=1}^N α_i y_i = 0                                                                     (5c)
An interesting variant of this problem is to omit the bias term b, i.e., to keep it fixed at b = 0. As a result, the equality constraint (5c) will vanish so that (5) reduces to a “box-constrained” QP problem which is significantly easier to solve. This modification is studied in detail by Vogt and Kecman (2005).
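For illustration, a model with exactly these user parameters can be set up with scikit-learn's LIBSVM-based SVC, where class_weight realizes the class-dependent costs (2) and gamma = 1/(2σ²) gives the Gaussian kernel (4). This is only a sketch, not the authors' implementation.

```python
from sklearn.svm import SVC

def soft_margin_svm(C=10.0, r=1.0, sigma=0.1):
    """Soft-margin SVM with Gaussian kernel; C_+ = C, C_- = r * C via class_weight."""
    return SVC(C=C, kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2),
               class_weight={1: 1.0, -1: r})

# clf = soft_margin_svm().fit(X_train, y_train)   # labels y in {-1, +1}
```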
3 Solving the SVM Classification Problem
The following discussion considers the solution of the dual problem (5), possibly lacking the equality constraint (5c). Optimization is a key issue of the SVM method for (at least) two reasons: First, the size of the QP problem is O(N²), where N is the number of data samples and may be very large in practical applications. And second, user parameters like C_i have to be determined according to validation results. This requires varying them over a wide range, which has a strong influence on the computation time. Working-set Methods. Most SVM optimization methods employ working-set (or decomposition) algorithms since they are suitable even for very large data sets due to their memory complexity O(N). The basic idea is to repeatedly select a set of variables (the working set W) and solve (5) with respect to it while the rest of the variables is kept fixed. A popular variant is the SMO algorithm (Platt (1999)) that reduces W to only 2 variables (or even a single variable, see Kecman et al. (2005)) so that the minimization on W can be done analytically. In Sect. 4, two implementations will be used:
1. Our own C implementation with the improvements by Keerthi et al. (2001) and a cache for kernel function values.
2. LIBSVM (Chang and Lin (2005)), a well-established C++ library that implements a strategy similar to SMO with many enhancements.
Fig. 4. Computing the Cauchy point α_C from the current iterate α^(k) using the projected gradient ∇J_d(α^(k)).
Active-set Methods. Active-set algorithms are an alternative to working-set methods since they compensate for some of their drawbacks. However, active-set strategies require more memory and are slower in some cases: their memory consumption is O(N + N_f²), where N_f is the number of free variables, i.e., those with 0 < α_i < C_i. The computation time is roughly dependent on the number of support vectors. The primary goal is to find the active set A, i.e., those inequality constraints that are met with equality (α_i = 0 or α_i = C_i). If A is known, then a system of linear equations yields the solution of (5). But since A is unknown in the beginning, it is constructed iteratively by adding and removing variables and testing if the solution remains feasible. For the experiments in Sect. 4, our own C implementation is used (Vogt and Kecman (2005)).
The matrix-vector product (6) dominates the computation time of the algorithm. Since it starts with α = 0, this is critical if the number of support vectors is large. In that case gradient projection steps (Mor´e and Toraldo (1989)) can accelerate the algorithm by changing multiple variables per step. This technique projects the current gradient ∇Jd (α) onto the feasible region and searches for the first local minimum on the resulting piecewiselinear path – the Cauchy point αC , see Fig 4. The active components in αC define the active set for the next step. Because the algorithm is derived for box-constrained QP problems only, it is applied to the “no-bias SVM” (b = 0) mentioned in Sect. 2. Because SVM classifiers typically produce ill-conditioned QP problems, only a fixed number of the largest gradient components is projected. This also limits the size of the linear system when the Cauchy point is found in the first interval (and the system would therefore be as large as the full problem). Currently, gradient projection is implemented as extension to the active-set method described above replacing its inactivation part.
Fig. 5. Features and class boundary in the input space (axes: Feature 1, Feature 2, Feature 3).
4 Results
The heart rate classification problem described in Sect. 1 is solved by a nonlinear SVM with Gauss kernel (4). Its parameters to be chosen by the user are the trade-off constant C, the class weighting factor r (see (2)) and the Gaussians' width σ. They are found by computing the generalization error on a test sample and selecting the combination with the smallest test error. Figure 5 shows the distribution of the 3 features in the input space and a typical class boundary. To find out reasonable ranges for C, r and σ, a series of experiments with a reduced data set is carried out first. It consists of only 1000 samples to limit the computation time in the critical cases. Table 1 shows an example for the variation of C with r = 1 and σ = 0.1. NSV denotes the number of support vectors, whereas NBV is the number of bounded SVs (α_i = C). All algorithms run under MATLAB on a Pentium-III PC at 800 MHz and use a kernel cache large enough to store all necessary entries. For small values of C, SMO and LIBSVM are very fast, but the computation time increases by orders of magnitude for large C. Both algorithms show this behavior, but LIBSVM is generally faster due to its enhancements and more sophisticated implementation. As pointed out in Sect. 3, the active-set method's computation times are mainly dependent on the number of support vectors.
C            10^-2    10^-1    10^0     10^1     10^2     10^3     10^4     10^5
SMO          0.67 s   0.31 s   0.24 s   0.24 s   0.51 s   2.86 s   27.7 s   135 s
LIBSVM       0.18 s   0.09 s   0.09 s   0.11 s   0.18 s   0.76 s   6.16 s   50.7 s
Active-set   3.34 s   1.99 s   0.69 s   0.44 s   0.34 s   0.29 s   0.30 s   0.27 s
NSV          276      229      157      121      96       90       87       79
NBV          274      223      132      83       56       43       36       30
Error        13.5%    6.45%    5.07%    3.92%    3.90%    3.60%    3.47%    4.08%

Table 1. Computation times of different algorithms for a variation of C.
Components   1        2        3        5        10       15
Iterations   336      211      136      127      158      189
Gradients    362      191      126      84       51       42
Time         3.58 s   2.11 s   1.50 s   1.12 s   0.85 s   0.85 s

Table 2. Gradient projection with limited number of components.
Fig. 6. Results of the heart rate classification: percentage of correct estimates among the accepted ones (top) and percentage of accepted estimates (bottom), plotted over the required level of accuracy (%).
Consequently, the active-set method is fast for large C and much less dependent on C than the other algorithms. Table 2 shows how gradient projection steps can accelerate the active-set algorithm when the number NSV of support vectors is large, e.g., for small C. It also confirms the appraisal of Sect. 3: projecting more than 10–15 components does not further speed up the algorithm (even though there are fewer gradient evaluations), but will cause additional definiteness problems. The overall problem formulated in Sect. 1 is an optimization problem with multiple (i.e., two) objectives: on the one hand, accepted estimates should be correct; on the other hand, as few estimates as possible should be rejected. This dilemma is solved by turning one of the objectives into a restriction (a sketch of this selection is given after the list):
1. Define a required level of correct estimates, e.g. 99.5 %.
2. Compute SVMs for the variation of C, r and σ.
3. Select all SVMs respecting the required level (not only the best).
4. Choose the one with the smallest rejection rate.
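A sketch of steps 3 and 4, assuming the validation results of the parameter grid have been collected in a list of dictionaries (the key names are our own):

```python
def pick_classifier(grid_results, required_precision=0.995):
    """grid_results: dicts with 'C', 'r', 'sigma', 'precision' (correct among
    accepted estimates) and 'acceptance_rate' from validation."""
    admissible = [g for g in grid_results if g["precision"] >= required_precision]
    if not admissible:
        return None                      # no setting meets the required level
    return max(admissible, key=lambda g: g["acceptance_rate"])
```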
The results are illustrated in Fig. 6: If a classifier respects a higher level of accuracy, more estimates are rejected. E.g., if 99.5 % of the estimates need to be correct, then approximately 79 % are accepted. This is a very good proportion since the estimation procedure (without classification) leads to 85 % of correct values, i.e., the additional classifier drops only 6 % of correct values. The class boundary for this case is shown in Fig. 5.
5 Conclusions
The result of this contribution is twofold: Regarding the heart rate estimation it turned out that classification can significantly increase the quality of the estimates. The user just has to prescribe a required level of accuracy, and the classifier will reject as many estimates as necessary to meet this level. On the other hand, the study includes a comparison between different SVM optimization techniques. The importance of this topic originates from the automatic selection of user parameters (C, r, σ) according to validation results. It has been shown that active-set methods are a reasonable alternative to the currently used working-set strategies, and that their drawbacks can be attenuated by gradient projection steps. This class of algorithms might be a promising direction for future work on SVM optimization.
References
CHANG, C.C. and LIN, C.J. (2005): LIBSVM: A Library for Support Vector Machines. Technical Report, National Taiwan University, Taipei, Taiwan.
KECMAN, V., HUANG, T.M. and VOGT, M. (2005): Iterative Single Data Algorithm for Training Kernel Machines from Huge Data Sets. In: Support Vector Machines: Theory and Applications. Springer-Verlag, Berlin.
KEERTHI, S. et al. (2001): Improvements to Platt's SMO Algorithm for SVM Classifier Design. Neural Computation, 13, 637–649.
MOISSL, U., WABEL, P., LEONHARDT, S. and ISERMANN, R. (2000): Online-Herzfrequenzerkennung während der Dialysebehandlung mit Hilfe einer neuen Formfilter-Methode. Biomedizinische Technik, 45, 417–418.
MORÉ, J.J. and TORALDO, G. (1989): Algorithms for Bound Constrained Quadratic Programming Problems. Numerische Mathematik, 55(4), 377–400.
PLATT, J.C. (1999): Fast Training of Support Vector Machines using Sequential Minimal Optimization. In: Schölkopf, B. et al. (Eds.): Advances in Kernel Methods – Support Vector Learning. MIT Press, Cambridge, MA.
SCHÖLKOPF, B. and SMOLA, A. (2002): Learning with Kernels. MIT Press, Cambridge, MA.
VOGT, M. and KECMAN, V. (2005): Active-Set Methods for Support Vector Machines. In: Support Vector Machines: Theory and Applications. Springer-Verlag, Berlin.
WABEL, P., MOISSL, U., LEONHARDT, S. and ISERMANN, R. (2002): Ansätze zur Identifikation von Patientenparametern während der Hämodialysetherapie. at – Automatisierungstechnik, 50(5), 220–227.
Visual Mining in Music Collections
Fabian Mörchen, Alfred Ultsch, Mario Nöcker, and Christian Stamm
Data Bionics Research Group, Philipps-University Marburg, 35032 Marburg, Germany
Abstract. We describe the MusicMiner system for organizing large collections of music with databionic mining techniques. Visualization based on perceptually motivated audio features and Emergent Self-Organizing Maps enables the unsupervised discovery of timbrally consistent clusters that may or may not correspond to musical genres and artists. We demonstrate the visualization capabilities of the U-Map. An intuitive browsing of large music collections is offered based on the paradigm of topographic maps. The user can navigate the sound space and interact with the maps to play music or show the context of a song.
1
Introduction
Humans consider certain types of music as similar or dissimilar. Teaching a computer system to learn and display this perceptual concept of similarity is a difficult task. The raw audio data of polyphonic music is not suited for direct analysis with data mining algorithms. In order to use machine learning and data mining algorithms for musical similarity, music is often represented by a vector of features. We generalized many existing low level features and evaluated a large set of temporal and non temporal statistics for the high level description of sound (Mörchen et al. (2005)). From the huge set of candidate sound descriptors, we selected a small set of non-redundant features to represent perceptual similarity based on a training set of manually labeled music. Clustering and visualization based on these feature vectors can be used to discover emergent structures in collections of music that correspond to the concept of perceptual similarity. We demonstrate the clustering and visualization capabilities of the new audio features with Emergent Self-Organizing Maps (ESOM) (Ultsch (1992)). First, some related work is discussed in Section 2 in order to motivate our approach. The datasets are described in Section 3. The method to generate and select the audio features is very briefly explained in Section 4. Visualization of music collections with U-Map displays of Emergent SOM is explored in Section 5. Results and future research are discussed in Section 6, followed by a brief summary in Section 7.
2
Related Work and Motivation
Many approaches to musical similarity represent songs by mixture models of a large set of Mel Frequency Cepstral Coefficient (MFCC) feature
vectors (e.g. Logan and Salomon (2001), Aucouturier and Pachet (2002)). These model-based representations cannot easily be used with data mining algorithms that require the calculation of a prototype representing the notion of an average or centroid, like SOM, k-Means, or LVQ. In Tzanetakis and Cook (2002) a single feature vector is used to describe a song, opening the musical similarity problem to many standard machine learning methods; genre classification with an accuracy of 66% is performed. The problem with genre classification is the subjectivity and ambiguity of the categorization used for training and validation (Aucouturier and Pachet (2003)). An analysis of musical similarity showed bad correspondence with genres, again explained by their inconsistency and ambiguity (Pampalk et al. (2003)). In Aucouturier and Pachet (2003) the dataset is therefore chosen to be timbrally consistent irrespective of the genre. Recently, interest in the visualization of music collections has been increasing. Song based visualizations offer a more detailed view into a music collection than album or artist based methods. In Torrens et al. (2004) disc plots, rectangle plots, and tree maps are used to display the structures of a collection defined by the meta information on the songs, like genre and artist. But these visualizations do not display similarity of sound; the quality of the displays thus depends on the quality of the meta data. Principal component analysis is used in Tzanetakis et al. (2002) to compress intrinsic sound features to 3D displays. In Pampalk et al. (2002) it was already demonstrated that SOM are capable of displaying music collections based on audio features.
3
Data
We have created two datasets to test the visualization of music collections. Our motivation for composing the data sets was to avoid genre classification and to create clusters of similar sounding pieces within each group, while achieving high perceptual distances between songs from different groups. We selected 200 songs in five perceptually consistent groups (Acoustic, Classic, Hiphop, Metal/Rock, Electronic) and will refer to this dataset as 5G. The validation data was created in a similar way as the training data: eight internally consistent but mutually very different sounding groups, totalling 140 songs, were compiled. This dataset will be called 8G.
4
Audio Features
We briefly present our method of generating a large set of audio features and selecting a subset for modelling perceptual distances. The full details are given in Mörchen et al. (2005). First, more than 400 low level features were extracted from short sliding time windows, creating a downsampled time series of feature values. The features included time series descriptions like volume or zerocrossings, spectral descriptions like spectral bandwidth or
Fig. 1. Probability densities for Electronic music vs. different music (feature: 2nd CC 7th Sone Band; legend: Electronic, different music, ML decision error)

Fig. 2. Distance scores on training (5G) and validation (8G) data:

Features     5G    8G
MusicMiner   0.41  0.42
MFCC         0.16  0.20
McKinney     0.26  0.30
Tzanetakis   0.18  0.20
Mierswa      0.12  0.16
FP           0.10  0.04
PH           0.07  0.07
SH           0.05  0.09
rolloff (Li et al. (2001)), and MFCC as well as generalizations thereof. The aggregation of the low level time series to high level features describing the sound of a song with one or a few numbers was performed systematically. Temporal statistics were used to discover the potential lurking in the behavior of low level features over time. More than 150 static and temporal aggregations were used, e.g. simple moments, spectral descriptions, and non-linear methods. The cross product of the low level features and high level aggregations resulted in a huge set of about 66,000 mostly new audio features. A feature selection was necessary to avoid noisy and redundant attributes and to select features that model perceptual distance. We performed a supervised selection based on the perceptually different sounding musical pieces in the training data. The ability of a single feature to separate a group of music from the rest was measured with a novel score based on Pareto Density Estimation (PDE) (Ultsch (2003)) of the empirical probability densities. Figure 1 shows the estimated densities for a single feature and the Electronic group vs. all other groups. It can be seen that the values of this feature for songs from the Electronic group are likely to differ from those of other songs, because there is little overlap between the two densities. Using this feature as one component of a feature vector describing each song will contribute significantly to a large distance between the Electronic group and the rest. This intuition is formalized as the Separation score, calculated as one minus the area under the minimum of the two probability density estimates. Based on this score a feature selection is performed, including a correlation filter to avoid redundancies. Based on the training data, the top 20 features are selected for clustering and visualization in the next section.
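The following minimal R sketch illustrates the idea behind the Separation score. It substitutes base R's kernel density estimator for the Pareto Density Estimation used in the paper, so the function name, the grid size, and the estimator itself are assumptions rather than the authors' implementation.

```r
## Sketch of the Separation score: one minus the overlap (area under the
## pointwise minimum) of the two estimated densities. density() stands in
## for Pareto Density Estimation here.
separation_score <- function(x_group, x_rest, n_grid = 512) {
  rng <- range(c(x_group, x_rest))
  d1 <- density(x_group, from = rng[1], to = rng[2], n = n_grid)
  d2 <- density(x_rest,  from = rng[1], to = rng[2], n = n_grid)
  step <- diff(d1$x[1:2])
  overlap <- sum(pmin(d1$y, d2$y)) * step   # numerical integral of the minimum
  1 - overlap
}
```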
We compared our feature set to seven sets of features previously proposed for musical genre classification or clustering: MFCC (Aucouturier and Pachet (2002)), McKinney (McKinney et al. (2003)), Tzanetakis (Tzanetakis and Cook (2002)), Mierswa (Mierswa and Morik (2005)), Spectrum Histograms (SH), Periodicity Histograms (PH), and Fluctuation Patterns (FP) (Pampalk et al. (2003)). The comparison of the feature sets with respect to their ability to cluster and visualize different sounding music was performed using a measure independent from the ranking scores: the ratio of the median of all inner cluster distances to the median of all pairwise distances, similar to Pampalk et al. (2003). One minus this ratio is called the distance score, listed in Fig. 2 for all feature sets; note that the validation data (8G) was not used for the feature selection. The MusicMiner features perform best by large margins on both datasets. The best of the other feature sets is McKinney, followed by MFCC and Tzanetakis. The fact that McKinney is the best among the rest might be due to the incorporation of the temporal behavior of the MFCC in the form of modulation energies. The worst performing feature sets in this experiment were Spectrum Histograms and Periodicity Histograms. This is surprising, because SH was found to be the best in the evaluation of Pampalk et al. (2003). In summary, our feature set showed superior behavior in creating small inner cluster and large between cluster distances on the training and validation data. Any data mining algorithm for visualization or clustering will profit from this.
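A minimal sketch of this distance score is given below; the feature matrix and group labels are placeholders, and since the distance metric is not spelled out above, Euclidean distance is assumed.

```r
## Sketch of the distance score: one minus the ratio of the median
## within-group distance to the median of all pairwise distances.
## 'features' is a songs x features matrix, 'groups' the group labels.
distance_score <- function(features, groups) {
  d <- as.matrix(dist(features))            # Euclidean distances (assumption)
  ut <- upper.tri(d)
  same <- outer(as.character(groups), as.character(groups), "==")
  1 - median(d[ut & same]) / median(d[ut])
}
```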
5
Visualization of Music Collections
Equipped with a numerical description of sound that corresponds to perceptual similarity, our goal was to find a visualization method that fits the needs and constraints of browsing a music collection. A 20 dimensional space is hard to grasp. Clustering can reveal groups of similar music within a collection in an unsupervised process. Classification can be used to train a model that reproduces a given categorization of music on new data. In both cases the result will still be a strict partition of the music in the form of text labels. Projection methods can be used to visualize the structures in the high dimensional data space and offer the user an additional interface to a music collection apart from traditional text based lists and trees. There are many methods that offer a two dimensional projection w.r.t. some quality measure. Most commonly, principal component analysis (PCA), preserving total variance, and multidimensional scaling (MDS), preserving distances as well as possible, are used. The output of these methods is, however, merely a set of coordinates in a two dimensional plane. Unless there are clearly separated clusters in a dataset it will be hard to recognize groups; see Mörchen et al. (2005) for examples. Emergent SOM offer more visualization capabilities than simple low dimensional projections: in addition to a low dimensional projection preserving the topology of the input space, the original high dimensional distances can be visualized with the canonical U-Matrix (Ultsch (1992)) display. This way sharp cluster boundaries can be distinguished from groups blending into one another. The visualization can be interpreted as height values on top of the usually two dimensional grid of the ESOM, leading to an intuitive landscape paradigm. With proper coloring, the data space can be displayed in the form of topographical maps, intuitively understandable also by users without scientific education. Clearly defined borders between clusters, where large distances in data space are present, are visualized in the form of high mountains. Smaller intra cluster distances or borders of overlapping clusters form smaller hills. Homogeneous regions of data space are placed in flat valleys.
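As an illustration of the U-Matrix idea, the sketch below computes a U-height for every map unit as the average distance to its grid neighbours. It assumes a 4-neighbourhood on a toroid grid and codebook vectors stored row by row; the actual ESOM tools may use a different neighbourhood, so this is a simplification and not the authors' implementation.

```r
## Sketch of a U-Matrix: per map unit, the mean Euclidean distance between
## its codebook vector and those of its (toroid) grid neighbours.
## 'weights' is a (rows*cols) x d matrix in row-major order.
u_matrix <- function(weights, rows, cols) {
  idx <- function(r, c) ((r - 1) %% rows) * cols + ((c - 1) %% cols) + 1
  u <- matrix(0, rows, cols)
  for (r in 1:rows) for (c in 1:cols) {
    nb <- c(idx(r - 1, c), idx(r + 1, c), idx(r, c - 1), idx(r, c + 1))
    diffs <- sweep(weights[nb, , drop = FALSE], 2, weights[idx(r, c), ])
    u[r, c] <- mean(sqrt(rowSums(diffs^2)))
  }
  u  # high values mark cluster borders ("mountains"), low values valleys
}
```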
Fig. 3. U-Map of the 5G training data (M=Metal/Rock, A=Acoustic, C=Classical, H=HipHop, E=Electronic) and detailed view with inner cluster relations
Training data: For the 5G data set used in the feature selection method, we trained a toroid 50 × 80 ESOM with the MusicMiner features using the Databionics ESOM Tools (Ultsch and Mörchen (2005); http://databionic-esom.sf.net). Figure 3 shows the U-Map. Dark shades represent large distances in the original data space, bright shades imply similarity w.r.t. the extracted features. The songs from the five groups are depicted by the first letter of the group name. In the following paragraphs we analyze the performance of this map. The Classical music is placed in the upper right corner. It is well separated from the other groups. But at the border to the Acoustic group, neighboring to the lower left, the mountain range is a little lower. This means that there is a slow transition from one group to the other. Songs at the borderline will be somewhat similar to the other group. The Metal/Rock group is placed in the center part of the map. The border to the Acoustic group is much more emphasized, thus songs from these groups differ more than between Acoustic and Classic. The Electronic and Hiphop groups reside in the upper and lower left parts of the map, respectively. The distinction of both these groups from Metal/Rock is again rather strong. The Electronic group is clearly recognized as the least homogeneous one, because the map is generally much darker in this area. In summary, a successful global organization of the different styles of music was achieved. The previously known groups of perceptually different music are displayed in contiguous regions on the map, and the inter cluster
similarity of these groups is visible due to the topology preservation of the ESOM. The ESOM/U-Map visualization offers more than many clustering algorithms. We can also inspect the relations of songs within a valley of similar music. In the Metal/Rock region of the map two very similar songs, Boys Sets Fire - After the Eulogy and At The Drive In - One Armed Scissor, are arranged next to each other on a plane (see Figure 3). These two songs are typical American hard rock songs of recent years. They are similar in fast drums, fast guitar, and loud singing, but both have slow and quiet parts, too. The song Bodycount - Bodycount in the House is influenced by the Hiphop genre. The singing is closer to spoken style, and the song is therefore placed closer to the Hiphop area and at a noticeable distance from the former two songs. The Electronic group also contains some outliers, both within areas of electronic music as well as in regions populated by other music. The lonely song in the center of the map, surrounded by a black mountain range, is Aphrodite - Heat Haze, the only Drum & Bass song. The Electronic song placed in the Classical group at the far right is Leftfield - Song Of Life. Note that this song is not really that far from 'home', because of the toroid topology of the ESOM: the left end of the map immediately neighbors the right side, and the top is connected to the bottom. The song contains spheric synthesizer sounds, sounding similar to background strings with only a few variations. The two Metal/Rock songs placed between the Hiphop and the Electronic group in the upper left corner are Incubus - Redefine and Filter - Under. The former has a strong break beat, synthesizer effects and scratches, more typically found in Hiphop pieces. The latter happens to have several periods of quietness between the aggressive refrains. This probably 'confused' the temporal feature extractors and created a rather random outcome. In summary, most of the songs presumably placed in the wrong regions of the map really did sound similar to their neighbors and were in a way bad examples for the groups we placed them in. This highlights the difficulties in creating a ground truth for musical similarity, be it genre or timbre. Visualization and clustering with U-Maps can help in detecting outliers and timbrally consistent groups of music in unlabeled datasets. Validation data: For the 8G validation dataset, the U-Map of a toroid ESOM trained with the MusicMiner features is shown in Figure 4. Even though this musical collection contains groups of music which are significantly different from those of our training data (e.g. Jazz, Reggae, Oldies), the global organization of the different styles works very well. Songs from the known groups of music are almost always displayed immediately neighboring each other. Again, cluster similarity is shown by the global topology. Note that, contrary to our expectations, there is not a complete high mountain range around each group of different music. While there is a wall between Alternative Rock and Electronic, there is also a gate in the lower center part of the map where these two groups blend into one another. With real life music
Fig. 4. U-Map of the 8G validation data (A=Alternative Rock, O=Opera, G=Oldies, J=Jazz, E=Electronic, H=Hiphop, C=Comedy, R=Reggae)
collections this effect will be even stronger, stressing the need for visualization that can display these relations rather than applying strict categorizations.
6
Discussion
Clustering and visualization of music collections with the perceptually motivated MusicMiner features worked successfully on the training data and the validation data. The visualization based on topographical maps enables end users to navigate the high dimensional space of sound descriptors in an intuitive way. The global organization of a music collection worked well; timbrally consistent groups are often shown as valleys surrounded by mountains. In contrast to the strict notion of genre categories, soft transitions between groups of somewhat similar sounding music can be seen. Most songs that were not placed close to the other songs of their timbre group turned out to be somewhat timbrally inconsistent after all. In comparison to the Islands of Music (Pampalk et al. (2002)), the first SOM visualization of music collections, we have used fewer but more powerful features, larger maps for a higher resolution view of the data space, toroid topologies to avoid border effects, and distance based visualizations. The Spectrum Histograms used by Pampalk et al. (2002) did not show good clustering and visualization performance (see Mörchen et al. (2005)).
7
Summary
We described the MusicMiner method for clustering and visualization of music collections based on perceptually motivated audio features. U-Map displays of Emergent Self-Organizing Maps offer an added value compared to
other low dimensional projections that is particularly useful for music data with no or few clearly separated clusters. The displays in the form of topographical maps offer an intuitive way to navigate the complex sound space. The results of the study are put to use in the MusicMiner software (http://musicminer.sf.net) for the organization and exploration of personal music collections.
Acknowledgements. We thank Ingo Löhken, Michael Thies, Niko Efthymiou, and Martin Kümmerer for their help in the MusicMiner project.
References
AUCOUTURIER, J.-J. and PACHET, F. (2002): Finding songs that sound the same. In: Proc. of IEEE Benelux Workshop on Model based Processing and Coding of Audio, 1–8.
AUCOUTURIER, J.-J. and PACHET, F. (2003): Representing musical genre: a state of the art. JNMR, 31(1), 1–8.
LI, D., SETHI, I.K., DIMITROVA, N., and MCGEE, T. (2001): Classification of general audio data for content-based retrieval. Pattern Recognition Letters, 22, 533–544.
LOGAN, B. and SALOMON, A. (2001): A music similarity function based on signal analysis. In: IEEE Intl. Conf. on Multimedia and Expo, 190–194.
MCKINNEY, M.F. and BREEBART, J. (2003): Features for audio and music classification. In: Proc. ISMIR, 151–158.
MIERSWA, I. and MORIK, K. (2005): Automatic feature extraction for classifying audio data. Machine Learning Journal, 58, 127–149.
MÖRCHEN, F., ULTSCH, A., THIES, M., LÖHKEN, I., NÖCKER, M., STAMM, C., EFTHYMIOU, N., and KÜMMERER, M. (2005): MusicMiner: Visualizing perceptual distances of music as topographical maps. Technical Report 47, CS Department, University Marburg, Germany.
PAMPALK, E., DIXON, S., and WIDMER, G. (2003): On the evaluation of perceptual similarity measures for music. In: Intl. Conf. on Digital Audio Effects (DAFx), 6–12.
PAMPALK, E., RAUBER, A., and MERKL, D. (2002): Content-based organization and visualization of music archives. In: Proc. of the ACM Multimedia, 570–579.
TORRENS, M., HERTZOG, P., and ARCOS, J.L. (2004): Visualizing and exploring personal music libraries. In: Proc. ISMIR.
TZANETAKIS, G. and COOK, P. (2002): Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5).
TZANETAKIS, G., ERMOLINSKYI, A., and COOK, P. (2002): Beyond the query-by-example paradigm: New query interfaces for music. In: Proc. ICMC.
ULTSCH, A. (1992): Self-organizing neural networks for visualization and classification. In: Proc. GfKl, Dortmund, Germany.
ULTSCH, A. (2003): Pareto Density Estimation: Probability Density Estimation for Knowledge Discovery. In: Proc. GfKl, Cottbus, Germany, 91–102.
ULTSCH, A. and MÖRCHEN, F. (2005): ESOM-Maps: tools for clustering, visualization, and classification with Emergent SOM. Technical Report 46, CS Department, University Marburg, Germany.
Modeling Memory for Melodies
Daniel Müllensiefen1 and Christian Hennig2
1 Musikwissenschaftliches Institut, Universität Hamburg, 20354 Hamburg, Germany
2 Department of Statistical Science, University College London, London WC1E 6BT, United Kingdom
Abstract. The aim of the presented study was to find structural descriptions of melodies that influence recognition memory for melodies. 24 melodies were played twice to 42 test persons. In the second turn, some of the melodies were changed, and the subjects were asked whether they thought that the melody was exactly the same as in the first turn or not. The variables used to predict the subject judgments comprise data about the subjects' musical experience, features of the original melody and its position in the music piece, and information about the change between the first and the second turn. Classification and regression methods have been carried out and tested on a subsample. The prediction problem turned out to be difficult. The results seem to be influenced strongly by differences between the subjects and between the melodies that had not been recorded among the regressor variables.
1
Introduction
The main aim of the presented study was to find structural descriptions of melodies that influence recognition memory for melodies. A further aim was the exemplary comparison of statistical modeling approaches for data from psycho-musicological experiments. Data have been obtained from a recognition experiment where melodies were presented twice to the experimental subjects. Some of the melodies were manipulated for the second presentation and subjects had to decide whether the melody had been changed or not. The experiment is described in detail in Section 2. We tried to explain the judgments of the subjects with 19 predictor variables. This has been done by several classification and regression methods, which have been compared on a test set. The rating scale is ordinal, but we also carried out methods that predict variables on a nominal or interval scale. The prediction methods are described in Section 3 and some results are presented in Section 4. The best results are obtained by ordinal logistic regression and a random forest. The prediction problem turned out to be hard. Even the best methods are not much superior to using the overall mean of the observations for prediction. In Section 5 we discuss some reasons. It seems that properties of the subjects and of the melodies that have not been captured by the explanatory variables play a crucial role.
2
The Experiment
The primary motivation of the experimental design was to create a more realistic experimental scenario for a musical memory task than what is commonly used in similar studies (e.g. Eiting (1984), Taylor and Pembrook (1984), Dowling et al. (1995)). Thus, the design made use of musical material from a style that all subjects were familiar with (pop songs), it presented the objects to be remembered (melodies) in a musical context (arrangement), and the task required no specific musical training. The sample consisted of 42 adults with a mean age of 29 and an average level of musical training that is similar to the German population. The musical material consisted of 36 MIDI polyphonic piano arrangements of existing but little known pop songs. The duration of each arrangement had been reduced to 50 seconds. From each song, a single line melody ("test melody", 15 seconds) had been extracted. The task followed the "recognition paradigm" widely used in memory research (e.g., Dowling et al. (2002)). Subjects listened to the song arrangement and were played the test melody immediately afterwards. Then they were asked if the test melody had been manipulated or was an exact copy of one of the melodies heard in the song. The ratings were done on a six-point scale encoding the subjects' decision and their judgmental confidence in three levels ("very sure no", "sure no", "no", "yes", "sure yes", "very sure yes"). The subjects were tested individually via headphones. The idea behind the recognition paradigm is that correct memorization should result in the ability to detect possible differences between the melody in the song and the test melody. 24 melodies out of 36 (16 out of 24 for each subject) had been manipulated. The following 19 predictor variables have been used:
• Time related factors:
– position of the comparison melody in the song in seconds, in notes, in melodies, in halves of the song,
– position of the manipulation in the test melody in seconds, in phrases of the melody, in notes of a phrase (or "no change"),
– duration of the test melody in seconds, in notes.
• Musical dimensions of the melodies:
– similarity of accent structures (as defined in Müllensiefen (2004)), overall similarity of the melodies (Müllensiefen and Frieler (2004)),
– manipulation of the melody parameters rhythm, intervals, contour (or "no change"),
– manipulation of the structural parameters range, harmonic function, occurrence of the repeated structure (or "no change").
• Musical background of the subjects: musical activity, musical consumption (summarizing scores have been defined from a questionnaire).
There are 995 valid observations. Subjects were asked whether they knew the song, and the corresponding observations have been excluded from the data analysis. Particular features of these data are:
• The dependent variable is ordinal (though such scales have often been treated as interval scales in the literature). It is even more particular, because the six-point scale can be partitioned into the two halves that mean "I believe that the melody is manipulated" vs. ". . . not manipulated".
• The observations are subject-wise dependent.
• Some variables are only meaningful for the changed melodies. They have been set to 0 (all values for changed melodies are larger) for unchanged melodies, but this is doubtful at least for linear methods.
3
Prediction Methods
Several prediction methods have been compared. The methods can be split up into regression methods (treating the scale as interval), classification methods (trying to predict one of six classes), and methods taking into account the nature of the scale. There were two possible codings of the six levels of the dependent variable, namely "1 very sure changed", . . . , "6 very sure unchanged" ("CHANGERAT") and "1 correct prediction and very sure", . . . , "6 wrong prediction and very sure" ("PQUALITY"), where the values 2, 3 indicate a correct answer by the subject but with less confidence in his or her rating, and the values 4, 5 stand for a wrong answer with less confidence. For some methods, the coding makes a difference. One coding can be obtained from the other by using information present in the predictor variables, but it depends on the coding which and how many predictor variables are needed. Not all methods worked best with the same coding. The following regression methods have been used:
• a linear model with stepwise variable selection (backward and forward, optimizing the AIC) including first-order interactions (products),
• a linear mixed model with a random effect for "subject" (variable selection as above),
• a regression tree,
• a regression random forest (Breiman (2001); default settings of the implementation in the statistical software R have been used for the tree and the forest).
The following classification methods have been used:
• a classification tree,
• a classification random forest,
• nearest neighbor.
The methods that take into account the nature of the scale were:
• ordinal logistic (proportional odds) regression (Harrell (2001), Chapter 13) with stepwise variable selection with a modified AIC (Verweij and Van Houwelingen (1994)) and prediction by the predictive mean,
• a two-step classification tree and random forest, where first the two-class problem ("correct" vs. "wrong", PQUALITY coding) has been solved and then, conditionally, the three-class problem "very sure"/"sure"/"not sure".
The trivial methods of predicting everything by the overall mean or, as an alternative, by the most frequent category have been applied as well. To assess the quality of the prediction methods, the data set has been divided into three parts of about the same size. The first part has been used for variable selection, the second part has been used for parameter estimation in a model with reduced dimension, and the third part has been used to test and compare the methods. Methods with a built-in or without any variable selection have been trained on two thirds of the data. The three subsets have initially been independent, i.e., consisting of 14 subjects each. After obtaining the first results, we constructed a second partition into three data subsets, this time dividing the observations of every single subject into three about equally sized parts, because we were interested in the effect caused by the subject-wise dependence. We used three performance measures on the test sample, namely the ratio of the squared prediction error to the error using the mean (R1), the relative frequency of correct classification in the six-class problem (R2), and the relative frequency of correct classification in the two-class problem (R3, "change"/"no change" or "correct"/"wrong", respectively). These measures are not adapted to ordinal data. A more problem-adapted loss function could be defined as follows: from a subject-matter viewpoint, it is "about acceptable" to predict a neighboring category. A prediction error larger than or equal to 3 can be treated as "absolutely wrong", and it is reasonable to assume a convex loss function up to 3. Therefore, the squared error with all larger errors set to 9 would be adequate. The results with this loss function should hardly deviate from R1 without truncation, though, because most predictions have been in the middle of the scale, and prediction errors larger than 3 hardly occurred.
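A minimal sketch of these three measures is given below. How continuous predictions are mapped to the six classes and which mean serves as the reference are not spelled out above, so the rounding and the use of an overall reference mean are assumptions.

```r
## Sketch of the three performance measures on the test sample.
## 'pred' are (possibly continuous) predictions, 'obs' the observed ratings
## on the 1..6 scale, 'ref_mean' the overall mean used as trivial predictor.
performance <- function(pred, obs, ref_mean) {
  R1 <- sum((obs - pred)^2) / sum((obs - ref_mean)^2)   # relative squared error
  R2 <- mean(pmin(pmax(round(pred), 1), 6) == obs)      # six-class accuracy
  R3 <- mean((pred > 3.5) == (obs > 3.5))               # two-class accuracy
  c(R1 = R1, R2 = R2, R3 = R3)
}
```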
4
Results
Because of space limitations we only present selected results. We concentrate on R1 , which seems to be the most appropriate one of the measures described above. The results are given in Table 1. While the classification tree was better than the regression tree under R2 , both were dominated by the regression forest (R2 = 0.327). Under R3 , the two-step forest (ignoring the second step)
Table 1. R1 results (all methods with optimal coding).

Method                        Partition 1 (independent)  Partition 2 (subject-wise)
Mean                          1.000                      1.000
Linear model                  0.995                      0.850
L. m./random effect           0.912                      NA
L. m./r. e. (2/3 estimation)  0.890                      NA
Regression tree               0.945                      0.872
Regression forest             0.899                      0.833
Reg. for. (subject ind.)      NA                         0.762
Classification tree           1.062                      NA
Classification forest         1.170                      NA
Nearest neighbor              1.586                      NA
Ordinal regression            0.912                      0.815
Ord. reg. (all vars)          0.892                      0.806
Two-step forest               1.393                      NA
Two-step tree                 1.092                      NA
Fig. 1. Residuals (test sample) of regression random forest by melody.
was optimal (R3 = 0.670), but not much better than the trivial guess “all judgments correct”. Under R2 and R3 , only a minority of the methods have been superior to the trivial “most frequent category” (R2 = 0.3, R3 = 0.645). Under R1 on the initial partition, the classification methods yielded values larger than 1 (i.e., worse than overall mean) and have been outperformed by the regression and ordinal methods. The regression forest (CHANGERAT coding) yielded a relatively good performance and provides useful information about the variable importance. The variable importance statistic “MSE
increase if the variables would have been left out” for the random forest is more stable and therefore better interpretable than the selections of the stepwise methods because of the resampling character of the forest. The most important variables have been the overall melodic similarity, similarity of accent structures and the musical activity of the test persons. These variables are also among the four variables that appear in the regression tree. Better results have been obtained by the ordinal regression on all variables without selection (while full models have been worse than models with reduced dimensionality for the linear models) and for a random effect linear model with variable selection. In general, the results are much worse than expected and demonstrate that the involved regression methods extract only slightly more information from the data than trivial predictors. We suspected that this tendency is due to the fact that between-subjects differences dominate the judgments in a more complex manner than captured by the variables on musical background or the additive random effect of the mixed model. Therefore we repeated the comparison (without classification methods) on a partition of the data set where the same subjects have been present in all data subsets. The regression forest and the ordinal regression were the best methods in this setup (note that the overall mean, which is used as a reference in the definition of R1 , yielded a better MSE as well on this partition). By far the best result was obtained by a random forest including subject indicators as variables. The three variables mentioned above yielded again the highest importance statistics values. The predictions have been improved on the second partition, but they still seem to be heavily dominated by random variations or influences not present in the predictor variables.
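A sketch of this best-performing variant is shown below. It assumes the randomForest package (the usual R implementation of Breiman's random forests) and invented data-frame and column names such as train, test, CHANGERAT and subject, so it illustrates the setup rather than reproducing the authors' code.

```r
## Sketch: regression random forest with subject indicators as additional
## predictors, plus permutation-based variable importance.
library(randomForest)
train$subject <- factor(train$subject)        # subject indicator as a factor
rf <- randomForest(CHANGERAT ~ ., data = train, importance = TRUE)
imp <- importance(rf)                         # first column: MSE increase
head(imp[order(-imp[, 1]), ])                 # most important variables first
pred <- predict(rf, newdata = test)
```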
5
Further Exploration and Conclusion
We explored further the reasons for the generally weak performance of the methods compared to the trivial predictors. This led to two ideas:
• The familiarity of the structure of a melody (frequency and plausibility of melodic features) may play a key role. Figure 1 shows, by way of example, how the residuals of the random forest for the initial partition depend on the melody. A music-analytic look at the melodies with the highest positive residuals (1, 14, 18, 27, 28) reveals that they all include short and significant motifs of great "Prägnanz" (highly individual character), a feature that is hard to assess with quantitative methods.
• Different subjects show different rating behavior. It can be seen in Figure 2 that some subjects prefer less extreme ratings than others. The quality of the ratings varies strongly as well. These variations cannot be fully explained by the musical activity and musical consumption scores or handled adequately by subject factors in the random forest or random effects.
Fig. 2. Ratings (CHANGERAT coding) by subject. Every histogram gives frequencies for the ratings 1 to 6 over all melodies for one particular subject (numbers are subject indicators). Subjects are ordered according to their personal mean of PQUALITY (best raters on bottom right side, the worst raters - highest PQUALITY mean - are no. 41, 37, 1 and so on) and colored by musical activity (black
high activity, white low activity).
Figure 2 shows that high musical activity is related to a good rating quality, but the worst raters have medium values on the activity variable. Musical consumption (not shown) seems even less related to the subject differences. An idea to include these subject differences in the present study has been to perform a cluster analysis on the subjects' rating behavior, characterized by mean, variance and skewness of the two codings CHANGERAT and PQUALITY. A tentative visual cluster analysis revealed three clusters of particular subjects and a large "normal" group. We repeated the random forest on the second data partition including three cluster indicators. This yielded R1 = 0.766. This result is biased because all observations were used for the clustering, so the test sample was no longer independent of the predictions. If done properly, the clustering should be performed on the first third of the data and the regression forest should be trained on the second third. But this would leave only 8 observations to cluster the subjects, which is not enough.
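The following sketch shows how such rating-behaviour features could be computed and clustered in R; the skewness is computed by hand to stay in base R, the data-frame columns are invented placeholders, and the paper's own clustering was done visually rather than with this procedure.

```r
## Sketch: per-subject mean, variance and skewness of the two codings,
## followed by a hierarchical clustering of the standardized features.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
rating_features <- function(changerat, pquality, subject) {
  f <- function(x) c(mean = mean(x), var = var(x), skew = skewness(x))
  t(sapply(split(data.frame(changerat, pquality), subject),
           function(d) c(CR = f(d$changerat), PQ = f(d$pquality))))
}
feats <- rating_features(dat$CHANGERAT, dat$PQUALITY, dat$subject)
plot(hclust(dist(scale(feats))))   # look for a few atypical subject clusters
```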
In general, the regression random forest seemed to be the most useful prediction method, especially for the assessment of the variable importance. The ordinal regression did a good job as well, but the main result of the study is the remaining large unexplained variation. This outcome suggests that the model is still lacking important predictors from the area of musical features. Such predictors should, for example, capture the "Prägnanz" of individual motifs. It is interesting to see that in all applied models the two measures of melodic similarity and structure similarity are the variables with the largest explanatory potential. From the viewpoint of a cognitive memory model this means that the structural relation and the quantifiable differences between the melody in the song and the single line test melody are more decisive for memory performance than experimental parameters (like the position of the target melody in the song or the duration of the different song parts) or information about the subjects' musical background. In this sense, the results of this study shed some valuable light on the factors influencing recognition memory for melodies (even though the large amount of unexplained variance makes reliable indications of variable importance somewhat dubious). Melodic features that may serve as further predictors are melodic contour, melodic and rhythmic complexity, coherence of melodic accents, and the familiarity of these features as measured by their relative frequency in a genre-specific database. The construction of new models making use of these novel melodic features is currently under investigation.
References
BREIMAN, L. (2001): Random forests. Machine Learning, 45, 5–32.
DOWLING, W.J., KWAK, S., and ANDREWS, M.W. (1995): The time course of recognition of novel melodies. Perception & Psychophysics, 57(2), 136–149.
DOWLING, W.J., TILLMANN, B., and AYERS, D.F. (2002): Memory and the Experience of Hearing Music. Music Perception, 19(2), 249–276.
EITING, M.H. (1984): Perceptual Similarities between Musical Motifs. Music Perception, 2(1), 78–94.
HARRELL, F.E., jr. (2001): Regression Modeling Strategies. Springer, New York.
MÜLLENSIEFEN, D. (2004): Variabilität und Konstanz von Melodien in der Erinnerung. Ein Beitrag zur musikpsychologischen Gedächtnisforschung. PhD Thesis, University of Hamburg.
MÜLLENSIEFEN, D. and FRIELER, K. (2004): Cognitive Adequacy in the Measurement of Melodic Similarity: Algorithmic vs. Human Judgements. Computing in Musicology, 13, 147–176.
TAYLOR, J.A. and PEMBROOK, R.G. (1984): Strategies in Memory for Short Melodies: An Extension of Otto Ortmann's 1933 Study. Psychomusicology, 3(1), 16–35.
VERWEIJ, P.J.M. and VAN HOUWELINGEN, J.C. (1994): Penalized likelihood in Cox regression. Statistics in Medicine, 13, 2427–2436.
Parameter Optimization in Automatic Transcription of Music
Claus Weihs and Uwe Ligges
Fachbereich Statistik, Universität Dortmund, 44221 Dortmund, Germany
This work has been supported by the Deutsche Forschungsgemeinschaft, Sonderforschungsbereich 475.
Abstract. Based on former work on automatic transcription of musical time series into sheet music (Ligges et al. (2002), Weihs and Ligges (2003, 2005)) in this paper parameters of the transcription algorithm are optimized for various real singers. Moreover, the parameters of various artificial singer models derived from the models of Rossignol et al. (1999) and Davy and Godsill (2002) are estimated. In both cases, optimization is carried out by the Nelder-Mead (1965) search algorithm. In the modelling case a hierarchical Bayes extension is estimated by WinBUGS (Spiegelhalter et al. (2004)) as well. In all cases, optimal parameters are compared to heuristic estimates from our former standard method.
1
Introduction
The aim of this paper is the comparison of different methods for the automatic transcription of vocal time series into sheet music by classification of estimated frequencies using minimal background information. Time series analysis leads to local frequency estimation and to automatic segmentation of the wave into notes, and thus to automatic transcription into sheet music (Ligges et al. (2002), Weihs and Ligges (2003, 2005)). The idea is to use as little information as possible about the song to be transcribed and the singer interpreting the song, in order to be able to transcribe completely unknown songs interpreted by unknown singers. For automatic accompaniment Raphael (2001) uses Bayes Belief Networks. Cano et al. (1999) use Hidden Markov Models (HMMs) for training along known sheet music. Rossignol et al. (1999) propose a model for pitch tracking, local frequency estimation and segmentation taking into account the extensive vibrato produced by, e.g., professional singers. Davy and Godsill (2002) use an MCMC model for polyphonic frequency estimation. The MAMI (Musical Audio-Mining, cp. Lesaffre et al. (2003)) project has developed software for pitch tracking. There are some software products available for transcription (or at least fundamental frequency tracking), such as AmazingMidi (http://www.pluto.dti.ne.jp/~araki/amazingmidi), Akoff Music Composer (http://www.akoff.com), Audio to score (logic) (http://www.emagic.de), Autotune
(http://www.antarestech.com), DigitalEar (http://www.digital-ear.com), Melodyne (http://www.celemony.com), IntelliScore (http://www.intelliscore.net), and Widi (http://www.widisoft.com). None of them produced satisfying results in our test with a professional soprano singer, either because of inability to track the frequency, or because of non-robustness against vibrato. All our calculations are made in R (R Development Core Team (2004)).
2
Heuristic Automatic Transcription
For automatic transcription we assume that CD-quality recordings are available, downsampled to 11025 Hz (in 16 bit). As an example we studied the classical song "Tochter Zion" (G.F. Händel) (see, e.g., Weihs and Ligges (2005)). The heuristic transcription algorithm proposed in Weihs and Ligges (2003) has the following form:
• Pass through the vocal time series by sections of size 512.
• Estimate the pitch of each section by the heuristic ff_heur = h + [(s − h)/2] · d_s/d_h, where h = first peaking Fourier frequency, s = peaking neighbor, and d_h and d_s are the corresponding spectral density values (see the sketch after this list). This way, |error| < 2 Hz can be shown for pure sine waves in the frequency range of singing.
• Check by means of higher partial tones whether the estimated pitch relates to the fundamental frequency.
• Classify the note for each section using the estimated fundamental frequencies (given well-tempered tuning and the (estimated) concert pitch of a').
• Smooth the classified notes because of vibrato by means of a doubled running median with window width 7.
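A minimal R sketch of the per-section pitch heuristic is given below. The peak picking is simplified (the strongest periodogram peak stands in for the "first peaking" Fourier frequency), so it illustrates the interpolation formula rather than the exact implementation.

```r
## Sketch of the pitch heuristic for one section of 512 samples.
estimate_pitch <- function(block, sr = 11025) {
  sp <- spec.pgram(ts(block, frequency = sr), taper = 0, plot = FALSE)
  dens <- sp$spec; freq <- sp$freq              # frequencies in Hz here
  i <- which.max(dens)                          # simplification: strongest peak
  cand <- c(i - 1, i + 1)
  cand <- cand[cand >= 1 & cand <= length(dens)]
  j <- cand[which.max(dens[cand])]              # the peaking neighbor
  h <- freq[i]; s <- freq[j]
  h + (s - h) / 2 * dens[j] / dens[i]           # interpolated frequency estimate
}
# halftones relative to a' (the estimated concert pitch, here 440 Hz):
# note_class <- round(12 * log2(ff / 440))
```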
An example of the result of this algorithm can be seen in Figures 1 and 2 for the soprano singer S5. Note that a' corresponds to 0. Singer S5 has an intensive vibrato. Thus the classification switches rapidly between 2 (b'), 3 (c''), and 4 (c#'') in the first two rows before smoothing (Figure 1).
Fig. 1. Unsmoothed classification for singer S5
Fig. 2. Smoothed classification for singer S5
Fig. 3. Periodogram: only first overtone easily visible (right: zoomed in)
Fig. 4. Outcome of the heuristics
Unfortunately, smoothing does not lead to the correct note 3 (c'') (Figure 2): e.g., the classification yields a note one octave too high in the beginning. To see the reason, consider the corresponding periodogram (based on 512 observations) in Figure 3, where the first overtone (c''') has the only high peak, and neither the fundamental (c'') nor the second overtone is reasonably present. In order to produce sheet music, the blocks of 512 observations corresponding to eighths are combined assuming constant tempo, and the mode of the corresponding classes is taken as the pitch estimator. Figure 4 compares the outcome of this heuristic with the correct sheet music (grey horizontal bars) for singer S5. Note that energy indicates the relative amplitude of the local wave; see Weihs and Ligges (2005) for a definition. Very low energy indicates rests, consonants or breathing.
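The smoothing and aggregation step can be sketched as follows in R; the alignment of 512-sample blocks to eighth notes and the handling of NA blocks are assumptions, not taken from the paper.

```r
## Sketch: doubled running median over per-block note classes, then the
## mode of the smoothed classes within each eighth note.
aggregate_notes <- function(notes, blocks_per_eighth) {
  notes[is.na(notes)] <- median(notes, na.rm = TRUE)   # crude NA handling
  sm <- runmed(runmed(notes, k = 7), k = 7)            # doubled running median
  grp <- rep(seq_along(sm), each = blocks_per_eighth, length.out = length(sm))
  sapply(split(sm, grp),
         function(x) as.numeric(names(which.max(table(x)))))   # mode per eighth
}
```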
3
Parameter Optimization of Heuristics
The idea of this paper is to try to improve the heuristics in various ways. First, the parameters of the heuristics are adjusted optimally to the individual singer whose wave should be transcribed. This is carried out by means of a Nelder-Mead (1965) optimization of the error rate based on the third part of the example song, i.e. on the last 8 measures of "Tochter Zion". Note that such optimization needs training with known scores before application. Thus, this analysis just indicates to what extent the heuristics could be improved by means of a-priori learning. The parameters of the heuristics are (defaults in parentheses):
• pkhght: "High peaks" need to have a peak height > a percentage of the maximum peak height (1.5%).
• slnc: "Low energy periods" are a certain percentage of the periods with lowest energy (20%).
• minp: "Silence" is defined as low energy periods with more than a minimum no. of high peaks (noise) (7).
• srch1-4: Parameters deciding about the meaningfulness of a candidate fundamental frequency cff, also based on overtones (ots).
• srch1: Multiplier m (1.8) of the frequency of the first high peak (fp) so that cff ∈ [fp, m · fp].
• srch2: No. of unimportant smallest Fourier frequencies (10).
• srch3,4: Multipliers ml, mr (1.35, 1.65) of cff so that a high peak ∈ [ml · cff, mr · cff] only if the 1st overtone was found instead of the fundamental frequency ff.
• mdo: Order of the median smoother (3), so that the window width = 2 · mdo + 1.
• mdt: No. of median smoother replications (2).
• htthr: Halftone threshold from where on the next halftone is classified: displacement from 50 cents = 0.5 halftone (0).
Error rates are calculated based on eighths as (# erroneously classified eighth notes, not counting rests) / (# all eighth notes − # eighth rests). In our example, 64 eighth notes in 8 measures are considered. To create real sheet music, equal sequential notes are joined. Note that this rule should be improved by identification of the onset times of new notes. Table 1 shows the optimization results for the sopranos S1, S2, S4, S5 and the tenors T3, T6, T7. The first row indicates the defaults, rows 2 and 3 the starting values for the optimization. Obviously, the only professional (S5) is the most outstanding, and at the same time the worst case concerning the error rate. Figures 5 and 6 compare the original sheet music with the optimized outcome for S5. Note that parameter optimization overall leads to an optimized error estimate (opte) that roughly halves the heuristic error rate (heue). Further studies will have to show whether the optimized parameters are general enough to be used for different performances of the same singer.
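A sketch of this tuning loop is given below; transcribe(), wave and truth are placeholders for the transcription heuristic, the recording and the known score, and the listed start values are illustrative only.

```r
## Sketch: minimize the eighth-note error rate over (a subset of) the
## heuristic's parameters with optim()'s Nelder-Mead method.
error_rate <- function(pred, truth) {
  keep <- !is.na(truth)                 # eighth rests are not counted
  mean(pred[keep] != truth[keep])
}
fit <- optim(
  par     = c(pkhght = 1.6, slnc = 15, minp = 10, srch1 = 1.8),  # illustrative
  fn      = function(p) error_rate(transcribe(wave, p), truth),  # placeholders
  method  = "Nelder-Mead",
  control = list(maxit = 5000)
)
fit$par   # singer-specific parameter estimates
```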
Table 1. Results of Nelder-Mead optimization in R (2004)

         pkhght  slnc  minp  srch1  srch2  srch3  srch4  mdo  mdt  htthr   opte  heue
default    1.50  20.0     7   1.80     10   1.35   1.65    3    2  0.0000
start1     1.60  15.0    10   1.80     22   1.30   1.65    5    3  0.0000
start2     1.20  25.0     6   1.80      9   1.36   1.70    3    2  0.0000
S1         1.30  24.7     4   1.81     10   1.37   1.71    3    2  0.0026    5.7  13.1
S2         1.66  25.4     6   1.80      9   1.36   1.70    4    2  0.0035    3.9   7.7
S4         1.20  25.0     6   1.97      9   1.36   1.70    3    2  0.0000    7.5  10.9
S5         1.57  23.9    10   1.81     23   1.31   1.66    5    3  0.0441    7.8  16.4
T3         1.67  25.4     6   1.81      9   1.45   1.70    3    2  0.0089    1.7   1.7
T6         1.39  23.2     8   1.80      9   1.38   1.72    2    2  0.0194    7.0  12.1
T7         2.23  23.6     6   1.82     11   1.38   1.68    3    2  0.0182    1.7   1.8
Fig. 5. Original sheet music of “Tochter Zion”
Fig. 6. Optimized outcome of the example’s data, singer S5
4
Model-based Automatic Transcription
Another way of improving pitch estimation might be the use of a wave model. Therefore, we combine two models from the literature, one of Rossignol et al. (1999), modelling vibrato in music, and one of Davy and Godsill (2002). Based on this model we carry out a controlled experiment with artificial data and estimate the unknown parameters of the model in two ways, once based on periodograms and once based on the original wave data. In the first case a frequentist model is used, in the second case a Bayesian model. In the frequentist model, vibrato is modelled as a sine oscillation around the heard frequency. Moreover, phase displacements are modelled as well as frequency displacements of the overtones:

y_t = \sum_{h=1}^{H} B_h \cos\left[ 2\pi (h + \delta_h) f_0 t + \phi_h + (h + \delta_h) A_v \sin(2\pi f_v t + \phi_v) \right] + \epsilon_t,
where t = time index, f_0 = fundamental frequency, H = no. of partial tones (fundamental frequency + H − 1 overtones), B_h = amplitude of the h-th partial tone, δ_h = frequency displacement of the h-th partial tone (δ_1 := 0), φ_h = phase displacement of the h-th partial tone, f_v = frequency of the vibrato, A_v = amplitude of the vibrato, φ_v = phase displacement of the vibrato, and ε_t = model error.
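The following sketch simulates this wave model in R for given parameter values; the defaults shown are illustrative and are not the settings or estimates from the experiments described below.

```r
## Sketch: simulate the vibrato wave model for H partial tones.
simulate_wave <- function(n = 2048, sr = 11025, f0 = 250,
                          B = c(3, 2, 1), delta = c(0, 0, 0), phi = c(0, 0, 0),
                          fv = 5, Av = 25, phiv = 0, sigma = 0) {
  t <- (0:(n - 1)) / sr                          # time in seconds
  y <- numeric(n)
  for (h in seq_along(B)) {                      # partial tones h = 1, ..., H
    y <- y + B[h] * cos(2 * pi * (h + delta[h]) * f0 * t + phi[h] +
                        (h + delta[h]) * Av * sin(2 * pi * fv * t + phiv))
  }
  y + rnorm(n, sd = sigma)                       # add model error
}
```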
In the (hierarchical) Bayes MCMC variant of the same model the following stochastic model extensions are used:
• f_0, the fundamental frequency, is uniformly distributed in [0, 3000] Hz;
• H − 1, the no. of overtones, is truncated Poisson distributed with a maximum of 11, the expected value of which is Gamma(H, 1) distributed;
• B_h, the amplitudes, are normally distributed with a Gamma(0.01, 0.01) distributed precision (= inverse variance);
• δ_h, the frequency displacements, are normally distributed with a big Gamma(100, 1) distributed precision;
• φ_h, the phase displacements, are uniformly distributed in [−π/2, π/2];
• f_v, the vibrato frequency, is uniformly distributed in [0, 12] Hz;
• A_v, the vibrato amplitude, is normally distributed with a general Gamma(0.01, 0.01) distributed precision;
• φ_v, the vibrato phase displacement, is uniformly distributed in [−π/2, π/2];
• ε, the model error, is normally distributed with a Gamma(0.5, 2) distributed precision.
The design of experiments used is a full factorial in 5 variables, namely type of singer (professional female vs. amateur female), pitch (high vs. low, i.e. 1000 vs. 250 Hz), vibrato frequency (5 vs. 9 Hz), vibrato amplitude / vibrato frequency (5 vs. 15), and vibrato phase displacement (0 vs. 3). In 4 additional experiments the vibrato amplitude was set to 0, with vibrato frequency and vibrato phase arbitrary (set to 0 here). For data generation, professionals were modelled by ff + 2 ots with B_1 = 3, B_2 = 2, and B_3 = 1, amateurs by ff + 1 ot with B_1 = 3 and B_2 = 1; displacements and noise were set to 0. For the estimation of the unknown parameters 512 or 2048 observations are used, respectively. Heuristic estimates of the fundamental frequency are taken from one block of 512 observations or as the median over the estimates in 7 half-overlapping blocks of 512 observations (without any smoothing). Estimations based on spectral information are based on the periodograms of the 7 half-overlapping blocks of 512 observations. The resulting 1792 = 256 · 7 Fourier frequencies form the basis for a Nelder-Mead optimization of the unknown parameters using the following three starting vectors: ff = median(ff_heur) + 2, 0, −2 Hz, B_h = 0.5 for h > 1, f_v = 7, A_v = 5, φ_v = 0. A model with ff and 2 overtones was used for estimation in any case. Note that standardized periodograms are used, so B_1 = 1 was fixed for identification; estimated amplitudes B_h for h > 1 are thus relative to B_1. The default stopping criteria of the R function optim were used with a maximum of 5000 iterations. For the estimation of the hierarchical Bayes model WinBUGS optimization (Spiegelhalter et al. (2004)) is used (the WinBUGS model is available from the authors). 512 observations are used and the starting values are the same as for the optimization based on periodograms, except that B_1 is now free to be estimated and the number of overtones H − 1 is estimated as well. As a stopping criterion, every 100 iterations it is checked whether a linear regression of the last 50 residuals against the iteration number delivers a slope significant at the 10% level, with a maximum of 2000 iterations. An overall comparison of the results by means of the mean absolute deviation (MAD) and the root mean squared deviation (RMSD) of the estimated fundamental frequency as well as the run time (see Table 2) leads to the conclusion that the heuristics are as good as the more complicated estimation procedures, but much, much faster. Only an increase of the number of observations leads to a distinct improvement. Note in particular that already with 512 observations
Table 2. Deviations of the estimated fundamental frequency for each method

                  Heur. (1)   Heur. (median)   NM (spectral)   WinBUGS
ff MAD (cent)          5.06             2.38            1.29      4.88
ff RMSD (cent)         6.06             2.74            3.35      6.44
run time            < 1 sec            2 sec             4 h      31 h

Fig. 7. Boxplots of deviations of the estimated fundamental frequencies
Fig. 8. Estimates of the vibrato frequency
WinBUGS optimization needs 31 hours for the 36 experiments. Simpler methods programmed in C are in development. The results of the optimizations are compared with the results of the heuristic pitch estimation in more detail in the boxplots of the estimated fundamental frequencies in Figure 7, where the horizontal lines at ±50 cents = ±0.5 halftone correspond to the natural thresholds to the next halftone above or below. Note that the heuristic based on 2048 observations leads to perfect note classification, whereas (spectral) Nelder-Mead is most often much more exact, but in some cases even wrong in classification. The WinBUGS results are comparable with the results from the heuristic based on 512 observations. Estimates of the vibrato frequency in the model are compared as well (see Figure 8). Here (spectral) Nelder-Mead is nearly perfect in the examples with 9 Hz, but unacceptable for 5 Hz. Also, the WinBUGS results vary less with 9 Hz.
5
Conclusion
From the experiments in this paper it is learned that Heuristic Transcription can be individually improved by training, that a wave model is not better than the heuristics concerning ff classification, and that the estimation procedure is not good enough for vibrato frequency determination, except for high vibrato frequency and the spectral data estimator. Next steps will include experiments in the polyphonic case as well.
References
CANO, P., LOSCOS, A., and BONADA, J. (1999): Score-Performance Matching using HMMs. In: Proceedings of the International Computer Music Conference. Beijing, China.
DAVY, M. and GODSILL, S.J. (2002): Bayesian Harmonic Models for Musical Pitch Estimation and Analysis. Technical Report 431, Cambridge University Engineering Department.
LESAFFRE, M., TANGHE, K., MARTENS, G., MOELANTS, D., LEMAN, M., DE BAETS, B., DE MEYER, H., and MARTENS, J.-P. (2003): The MAMI Query-By-Voice Experiment: Collecting and annotating vocal queries for music information retrieval. In: Proceedings of the International Conference on Music Information Retrieval. Baltimore, Maryland, USA, October 26-30.
LIGGES, U., WEIHS, C., and HASSE-BECKER, P. (2002): Detection of Locally Stationary Segments in Time Series. In: W. Härdle and B. Rönz (Eds.): COMPSTAT 2002 - Proceedings in Computational Statistics - 15th Symposium held in Berlin, Germany, Physika, Heidelberg, 285–290.
NELDER, J.A. and MEAD, R. (1965): A Simplex Method for Function Minimization. The Computer Journal, 7, 308–313.
R DEVELOPMENT CORE TEAM (2004): R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
RAPHAEL, C. (2001): A Probabilistic Expert System for Automatic Musical Accompaniment. Journal of Computational and Graphical Statistics, 10, 487–512.
ROSSIGNOL, S., RODET, X., DEPALLE, P., SOUMAGNE, J., and COLLETTE, J.-L. (1999): Vibrato: Detection, Estimation, Extraction, Modification. Digital Audio Effects Workshop (DAFx'99).
SPIEGELHALTER, D.J., THOMAS, A., BEST, N.G. and LUNN, D. (2004): WinBUGS: User Manual. Version 2.0, Cambridge: Medical Research Council Biostatistics Unit.
WEIHS, C. and LIGGES, U. (2003): Automatic Transcription of Singing Performances. Bulletin of the International Statistical Institute, 54th Session, Proceedings, Volume LX, Book 2, 507–510.
WEIHS, C. and LIGGES, U. (2005): From Local to Global Analysis of Musical Time Series. In: K. Morik, A. Siebes and J.-F. Boulicault (Eds.): Local Pattern Detection, Springer Lecture Notes in Artificial Intelligence, 3539, Springer, Berlin, 217–231.
GfKl Data Mining Competition 2005: Predicting Liquidity Crises of Companies

Jens Strackeljan (1), Roland Jonscher (2), Sigurd Prieur (2), David Vogel (3), Thomas Deselaers (4), Daniel Keysers (4), Arne Mauser (4), Ilja Bezrukov (4), and Andre Hegerath (4)

(1) Otto-von-Guericke-Universität Magdeburg
(2) Sparkassen Rating und Risikosysteme GmbH, Berlin
(3) A.I. Insight, Orlando
(4) Lehrstuhl für Informatik VI – Computer Science Department, RWTH Aachen University, Aachen
Abstract. Data preprocessing and a careful selection of the training and classification method are key steps for building a predictive or classification model with high performance. Here, we present the winning approaches submitted to the 2005 GfKl Data Mining Competition. The task to be solved in the competition was the prediction of a possible liquidity crisis of a company. The binary classification was to be based on a set of 26 variables describing attributes of the companies, with unknown semantics.
1 Introduction
Scientific competitions are well-known instruments that have motivated scientists to develop new ideas for several centuries. Today, too, large sums of money are sometimes offered as prizes for solutions to unsolved problems in various fields of natural science. The Clay Mathematics Institute has singled out seven so-called Millennium Problems, for which prizes totalling seven million dollars have been offered. These problems include, for instance, the task of deriving a more fundamental understanding of the Navier-Stokes differential equation on the basis of concrete mathematical proofs. Since this equation constitutes the basis for the entire field of fluid mechanics, this competition problem is of special scientific as well as technical interest. A particularly difficult problem in conceiving such competitions is the definition of an unambiguous and useful measure of error, since a ranking of the submitted solutions is not possible without such a definition. For classification tasks in advanced data analysis the definition of the error criterion is simple, since the winner is the participant with the smallest classification error. As a rule, such a criterion also approximates the real problem very closely.

The credit business is a fundamental part of banking. Through lending, financial institutions sustain investment activity, especially among small and medium-sized businesses, which are a main pillar of many economies. To ensure sustainable and value-creating investments, financial institutions are forced to
Fig. 1. Ranking of all 40 participants
evaluate promising and viable companies, because the profits from these projects increase the opportunities to continue this business. The credit rating is an important part of this evaluation, since it provides banks with short but comprehensive information about the risks of potential investments. Sparkassen Rating und Risikosysteme GmbH, the owner of the data, is responsible for the implementation of existing rating methods and the development of new ones for all German Sparkassen.
2 Problem Task and General Results
The object of the competition was to predict a liquidity crisis based on a subset of 26 variables describing attributes of companies with unknown semantics. Only the sponsor, Sparkassen Rating und Risikosysteme GmbH, knew the exact meanings of the variables. The participants were given 20,000 labeled training records with a class distribution of 10% positive cases, i.e. cases where a liquidity crisis occurred, and 90% negative cases, as well as 10,000 unlabeled test records. Each participant had to submit a file listing the IDs of the 1,000 companies (out of the 10,000 test records) with the highest predicted measure of liquidity crisis (e.g. a probability or membership value) together with the predicted values of this measure, as well as a short report describing the method used to achieve the classification results. The maximum number of correctly classified companies was 1,111 out of 2,000. The participants' approaches used Logistic Regression, Naive Bayes, Support Vector Machines, Multilayer Perceptrons, Decision Trees, k-Nearest Neighbours, and Fuzzy Rule Sets, as well as combinations of different models. In a very close competition, the best model, by David Vogel, classified 896 companies correctly; Ilja Bezrukov and Thomas Deselaers came in second, each with 894. In consideration of the different approaches, which covered nearly
the complete range of current data analysis techniques, we are not able to give a clear statement about a best-practice guideline for solving the task. The following short descriptions of the three winning models should be seen as examples that demonstrate the efficiency of these techniques.
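For illustration, the following minimal Python sketch assembles a submission of the kind described above: it ranks the test companies by a predicted crisis measure and keeps the 1,000 highest-scoring IDs. Column names, the file format, and the use of pandas are our assumptions, not part of the competition specification.

# Sketch of building a competition-style submission file: rank the 10,000
# test companies by the predicted crisis measure and keep the 1,000 IDs
# with the highest scores. Column names ("id", "score") are illustrative.
import pandas as pd

def build_submission(ids, scores, n_top=1000, path="submission.csv"):
    """Write the n_top company IDs with the highest predicted crisis measure."""
    sub = (pd.DataFrame({"id": ids, "score": scores})
             .sort_values("score", ascending=False)
             .head(n_top))
    sub.to_csv(path, index=False)
    return sub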
3 Overview of the Winning Model
The winning model used logistic regression. In itself, logistic regression would not be sufficient for the complexities of this data set. However, using A.I. Insight's interactive modeling tool MITCH (Multiple Intelligent Computer Heuristics; see http://www.aiinsight.com), the variables could be analyzed and pre-processed in such a way that a simple linear technique would adequately capture the complex patterns in the data. The model obtained most of its accuracy from three forms of pre-processing: (1) creation of additional derived features, (2) transformations, and (3) discovery of interaction terms. The MITCH engine is an artificial intelligence modeling tool that incorporates multiple technologies and processes in order to provide highly accurate forecasting and data profiling capabilities, targeted at solving practical problems in multiple industries. Its ability to combine traditional statistical technologies and non-traditional artificial intelligence paradigms at each step of the analytical process ensures a good solution in a broad range of applications without excessive effort.
3.1 Derived Features
The derived features consisted of missing value indicators and some ad-hoc binary features. In the raw data, missing values were indicated by the value “9999999”. For 8 variables there were a sufficient number of missing values to determine that the creation of a binary feature could allow a logistic regression model to better utilize this information. Logistic regression lends itself to continuous, monotone relationships when dealing with continuous predictors. Additional analysis was therefore done on the continuous variables to look for discontinuous or non-monotone relationships between the predictor and the outcome. There were 14 such relationships in the competition data set. To account for these relationships in the final model, binary features were created to represent the ranges of these variables.
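A minimal sketch of these two kinds of derived features is given below, assuming a pandas DataFrame and the sentinel value 9999999 for missing entries. Column names and cut points are hypothetical; the actual feature construction was performed with the proprietary MITCH tool.

# Sketch: missing-value indicators and binary range features as inputs for a
# logistic regression model. The sentinel 9999999 marks missing values as in
# the competition data; column names and cut points are illustrative only.
import pandas as pd

MISSING = 9999999

def add_missing_indicators(df, columns):
    """Add one binary indicator per column that has missing values."""
    out = df.copy()
    for col in columns:
        out[f"{col}_missing"] = (out[col] == MISSING).astype(int)
    return out

def add_range_indicators(df, col, cut_points):
    """Add binary features for value ranges of a predictor whose relationship
    to the outcome is discontinuous or non-monotone."""
    out = df.copy()
    edges = [float("-inf")] + list(cut_points) + [float("inf")]
    for b, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        out[f"{col}_range{b}"] = ((out[col] > lo) & (out[col] <= hi)).astype(int)
    return out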
Fig. 2. “Nearest Probability Imputation” on predictor No. 23
3.2 Transformations
Creation of the missing value indicators is only the first of two steps needed to properly take care of missing values. The second step is to replace the “9999999” with an imputed value, to avoid the numerical problems that such a high value would cause in a logistic regression model. In the competition data set there were 12 variables with missing values, and for all of them the values were replaced using a technique that can be described as “Nearest Probability Imputation”. To implement this technique, a one-dimensional model must be created for the values of the predictor that are available; the model must be monotone for a unique imputed value to exist. The probability of the outcome class is then calculated for the records with null values. Once this probability is computed, the one-dimensional model must be solved in terms of this probability to determine the imputed value of the predictor. Figure 2 illustrates this algorithm and how it was used to impute a value of 18.8 for null values of the 23rd predictor in the data set.

Spline transformations were used on three variables that had non-linear relationships with the outcome probabilities. Vogel and Wang (2004) have demonstrated how a one-dimensional least-squares spline transformation can be used to increase the predictive power of a feature in a linear regression model. Although the modeling technique used here is logistic regression, the same principle is applicable. Figure 3 shows a spline fitting the non-linear relationship of the 12th predictor in the raw data. Figure 4 demonstrates the desired result of the spline by fitting the transformed variable with a regression line. This form of the predictor is likely to contribute more to the final model.
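The following sketch illustrates the idea of “Nearest Probability Imputation”, using a one-dimensional logistic regression as the monotone model; the paper does not name the model class, so this choice is an assumption. The model is fitted on the observed values, the outcome probability is estimated among the records with missing values, and the fitted model is inverted at that probability.

# Sketch of "Nearest Probability Imputation" with a one-dimensional logistic
# regression as the monotone model (an assumption for illustration).
# x and y are 1-D numpy arrays; 9999999 marks missing predictor values.
import numpy as np
from sklearn.linear_model import LogisticRegression

MISSING = 9999999

def nearest_probability_impute(x, y):
    """Return a single imputed value for the missing entries of x."""
    observed = x != MISSING
    model = LogisticRegression()
    model.fit(x[observed].reshape(-1, 1), y[observed])

    # Outcome probability among the records whose predictor is missing.
    p = y[~observed].mean()

    # Invert the monotone 1-D model: p = sigmoid(w * x + b)
    # => x = (logit(p) - b) / w   (requires w != 0 and 0 < p < 1).
    w = model.coef_[0, 0]
    b = model.intercept_[0]
    return (np.log(p / (1.0 - p)) - b) / w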
Fig. 3. The non-linear relationship between the predictor and the outcome probability, fitted with a one-dimensional spline
Fig. 4. Replacing the predictor with the spline approximation yields a new variable with a monotone linear relationship to the outcome probability.
3.3 Interaction Terms
Using NICA (Numerical Interaction CAlibrator), a tool developed at A.I. Insight, the data was analyzed to detect interactions between variables. In a regression model (logistic or linear), the form of the model is such that it expects the predictive effects of variables to be additive, or at the very least cumulative. If it is determined that two or more predictors have a combined effect on the outcome that is not additive, then this pattern can be described as an interaction. Figure 5 illustrates the interactive relationship between the 20th predictor and the binary feature generated for a discontinuous value of the 15th predictor.
Fig. 5. Illustration of the interaction between two predictors
If the predictors were additive, these two graphs would lie on top of one another with the same shape, but with a vertical translation. Since this is not the case for these two predictors, an interaction term was introduced to add the value of this interaction to the final model. While thousands of statistically significant interactions were detected in the data, most were not likely to contribute enough to the overall model to justify the increase in model complexity. In the final logistic regression model, 128 first-order interactions and 206 second-order interactions were used.
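A sketch of how a single interaction term can be added to the design matrix of a logistic regression model is shown below. The interaction is encoded as the product of the two predictors, the standard way to represent a non-additive joint effect; detecting which pairs interact was done with the proprietary NICA tool, so the chosen pair here is purely hypothetical.

# Sketch: append an interaction term (the product of two predictors) to the
# feature matrix of a logistic regression model. The column indices below are
# hypothetical; NICA's interaction detection is not reproduced here.
import numpy as np
from sklearn.linear_model import LogisticRegression

def add_interaction(X, i, j):
    """Append the product of columns i and j as a new feature."""
    return np.column_stack([X, X[:, i] * X[:, j]])

# Hypothetical usage with random data: 26 predictors, roughly 10% positive cases.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 26))
y = (rng.random(2000) < 0.1).astype(int)
X_int = add_interaction(X, 19, 14)      # e.g. 20th and 15th predictor (0-based)
model = LogisticRegression(max_iter=1000).fit(X_int, y)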
4 Method Used by the RWTH Aachen Group
The method used by the RWTH Aachen group (Ilja Bezrukov, Thomas Deselaers, Andre Hegerath, Daniel Keysers, and Arne Mauser) draws its potential from the use of well-known techniques for preprocessing, training, and classification. For various tasks it can be observed that different classification methods perform nearly equally well given suitable preprocessing. In addition, it is very important to avoid over-fitting to the training data: the trained methods have to generalize well to unseen test data. Thus, data preprocessing and a careful selection of training and classification methods are the key steps for building a predictive model with high performance.
4.1 Preprocessing
Real-world data like the data used here usually suffer from deficiencies that make classification tasks difficult: missing values, outliers, and noisy distributions affect the performance of classification algorithms. Many classifiers perform better if the feature values are adjusted to a common interval or if they
Fig. 6. Feature transformation: untransformed distribution, order-preserving histogram, and equi-depth histogram
are generalized using histograms. We transformed the data using binary features and two different variants of histograms: “order-preserving” histograms and “equi-depth” histograms. “Order-preserving” histograms have as many bins as there are distinct feature values in the data; each feature value is then replaced by the normalized index of its bin. The aim of this transformation is to normalize the distances between neighboring feature values, so that outliers are moved towards the mean value. In “equi-depth” histograms, the bin borders are adjusted such that each bin contains approximately the same number of elements; each feature value is then replaced by the center value of its bin. This procedure approximately preserves the original distance proportions but discretizes the feature space. The effects of these transformations are shown in Figure 6. The leftmost image shows the original distribution of feature 14 as a cumulative histogram; this distribution is heavily distorted by outliers. It can clearly be seen that the two transformations lead to smoother distributions, which, in our experience, often improves classifier performance. Additionally, for features containing a significant number of unknown values or zeros we included binary indicator features. For the subsequent classification experiments, we created two datasets with 46 features, containing the binary features and the features transformed using one of the described histograms.
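A minimal numpy sketch of the two transformations for a one-dimensional feature vector follows; bin counts and the handling of ties are simplified relative to the actual preprocessing.

# Sketch of the two feature transformations described above.
import numpy as np

def order_preserving(x):
    """Replace each value by the normalized index of its bin; there is one bin
    per distinct value, so distances between neighbouring values are equalized
    and outliers are pulled towards the bulk of the data."""
    values = np.unique(x)                     # sorted distinct values
    index = np.searchsorted(values, x)        # bin index of each element
    return index / max(len(values) - 1, 1)    # normalize to [0, 1]

def equi_depth(x, n_bins=20):
    """Replace each value by the centre of its equi-depth (quantile) bin, so
    that every bin holds roughly the same number of elements."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    centres = (edges[:-1] + edges[1:]) / 2.0
    return centres[idx]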
4.2 Training and Testing
From our experience with data mining tasks, we know that it is crucial to avoid over-fitting. Over-fitting results from the fact that the training error is not a good estimate of the test error. Figure 7 depicts the typical effect: the training error decreases with model complexity, typically dropping to zero if the model is sufficiently complex. However, a model with zero training error is likely to be over-fitted to the training data and will probably generalize poorly to unseen data. In classification, many choices have to be made, and all of them may be subject to over-fitting: which classifier to use, which preprocessing is suitable, which combination of features to take, which combination of classifiers is appropriate, and which parameters to choose for the selected classifier.
Fig. 7. Model complexity vs. classifier performance on training and test data: prediction error on the training sample and on the test sample as a function of model complexity (low to high)
Each of these choices has to be made with care to avoid over-fitting. To do so, we first separated 20% of the training data into a hold-out set used for validating the internal results. These data were never used to determine parameters, except for selecting the final models to be submitted from a small set of well-performing candidates. The remaining 80% of the training data were then used to examine different methods of preprocessing and different classifiers, and to tune the parameters of the classifiers. All these experiments were done using five-fold cross-validation. We employed a variety of standard off-the-shelf classifiers as available, e.g., in Netlab (Nabney, 2001; http://www.ncrg.aston.ac.uk/netlab/) and Weka (Witten and Frank, 1999; http://www.cs.waikato.ac.nz/ml/weka/). Among the classifiers we examined were neural networks, nearest-neighbor techniques, decision trees, and support vector machines, as well as some in-house classifiers for maximum entropy training and naive Bayes estimation. For each of the classifiers we assessed suitable parameters and followed those approaches that gave the best results on the cross-validation data. Finally, we chose around ten parameter setups that performed well on the cross-validation data and evaluated these on the hold-out set. Thus we had a small set of candidates for submission with two scores: (1) the cross-validation performance, which might be subject to over-fitting, and (2) the hold-out performance, which should behave similarly to new test data, as it had not been considered in building the models. These models were additionally examined using bootstrapping to estimate the probability of improvement (POI) of one model over another. The POI is calculated as follows: given two competing models A and B and their classification decisions, for each decision there are three possible cases: ‘1’) classifier A outperforms classifier
B, ‘-1’) classifier B outperforms classifier A, and ‘0’) classifiers A and B agree. We randomly draw, with replacement, n independent samples from the resulting array of cases and calculate the mean of the drawing. This drawing is repeated m times, and we count how often the calculated mean is positive (A outperforms B) and how often it is negative (B outperforms A). The relative frequencies of these events directly give the POI. Based on these results, we chose five models to be submitted: naive posterior with maximum entropy (cf. next section), a logistic model tree, an alternating decision tree, logistic regression, and a combination of three classifiers.
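A sketch of this bootstrap estimate of the probability of improvement is given below. Function and variable names are ours, and the encoding of “outperforms” on a single decision (e.g. A correct while B is wrong) is our reading of the description.

# Sketch of the bootstrap "probability of improvement" (POI) described above.
# cases[t] is +1 if model A outperforms model B on decision t (e.g. A correct,
# B wrong), -1 if B outperforms A, and 0 if both agree.
import numpy as np

def probability_of_improvement(cases, m=10000, seed=0):
    cases = np.asarray(cases)
    n = len(cases)
    rng = np.random.default_rng(seed)
    wins_a = wins_b = 0
    for _ in range(m):
        mean = rng.choice(cases, size=n, replace=True).mean()
        if mean > 0:
            wins_a += 1        # A outperforms B in this bootstrap draw
        elif mean < 0:
            wins_b += 1        # B outperforms A in this bootstrap draw
    return wins_a / m, wins_b / m    # relative frequencies -> POI estimates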
4.3 Classification Methods Used
In this section we describe a selection of the classification methods we used, focusing on the approaches that were ranked high in the competition, either alone or in combination. Combining two successful approaches to classification, logistic regression and decision trees, the logistic model tree (LMT) (Landwehr et al., 2003) showed good results on the given data. The idea is to modify the well-known C4.5 decision tree learning algorithm to use logistic regression in the decision nodes, trained with the LogitBoost algorithm; the number of boosting iterations is determined using cross-validation during training. The multi-layer perceptron (MLP) (Bishop, 1996), a successful concept for classification for some time now, also contributed to the good results in combination with other classifiers.

An idea that had already proved its usability for this kind of classification task in another competition was the combination of “naive posterior” probabilities using a maximum entropy approach. The basic concept is derived from the naive Bayes classifier, which relies on the assumption that the features $x_i$ are (conditionally) independent of each other, so that the product rule gives the joint class-conditional probability as $\Pr(x \mid k) = \prod_i \Pr(x_i \mid k)$. In practice, the sum rule often gives better results than the product rule; applying it to the per-feature posterior distributions $\Pr(k \mid x_i)$ leads to the “naive posterior” rule $\Pr(k \mid x) \propto \sum_i \Pr(k \mid x_i)$. Here we estimated the single distributions by relative frequencies after the preprocessing of the features. For weighting the individual distributions we decided to use a maximum entropy approach (Berger et al., 1996). The resulting distribution then has a so-called log-linear or exponential functional form:
\[
  p_\Lambda(k \mid x) = \frac{\exp\bigl(\sum_i \lambda_i \Pr(k \mid x_i)\bigr)}
                             {\sum_{k'} \exp\bigl(\sum_i \lambda_i \Pr(k' \mid x_i)\bigr)},
  \qquad \Lambda = \{\lambda_i\}.
\]
Method                CV-Score  V-Score  Test-data score  Rank
Combination               1445      360              894     2
LMT                       1408      358              894     2
MLP                       1395      358              884     6
ADT                       1426      357              883     7
NB-ME                     1412      362              881     9
Theoretical maximum       1796      448             1111     –
Winner (D. Vogel)            –        –              896     1

Table 1. Results for cross-validation (CV-Score), for validation on the hold-out set (V-Score), and for the competition (test-data score).
The corresponding optimization problem is convex and has a unique global maximum. For computing the maximum, we used generalized iterative scaling (Keysers et al., 2002).
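The sketch below combines per-feature posteriors with maximum entropy weights as in the formula above. For simplicity the weights Λ are fitted here by plain gradient ascent on the log-likelihood instead of generalized iterative scaling; since the problem is convex with a unique global maximum, both reach the same solution. The estimation of the per-feature posteriors Pr(k|x_i) by relative frequencies is not shown.

# Sketch: naive posterior combination with maximum entropy weights lambda_i.
# P has shape (n_samples, n_features, n_classes) with P[t, i, k] = Pr(k|x_i)
# for sample t; y holds the true class indices. Gradient ascent replaces
# generalized iterative scaling here as a simplification.
import numpy as np

def fit_lambdas(P, y, lr=0.1, iters=500):
    n, d, K = P.shape
    lam = np.zeros(d)
    for _ in range(iters):
        scores = np.einsum("tik,i->tk", P, lam)              # sum_i lam_i Pr(k|x_i)
        p = np.exp(scores - scores.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)                    # p_Lambda(k|x)
        # Gradient of the log-likelihood w.r.t. lam_i:
        # sum_t [ Pr(y_t|x_i) - sum_k p_Lambda(k|x_t) Pr(k|x_i) ]
        observed = P[np.arange(n), :, y].sum(axis=0)
        expected = np.einsum("tk,tik->i", p, P)
        lam += lr * (observed - expected) / n
    return lam

def predict(P, lam):
    return np.einsum("tik,i->tk", P, lam).argmax(axis=1)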
4.4 Conclusion
The final result showed that the five submissions from Aachen were within the top ten ranks, with two submissions tied for second place. The fact that each submission used a different classifier (a combination of classifiers, a logistic model tree, an alternating decision tree, maximum entropy and naive Bayes, and neural nets) illustrates that, by using appropriate preprocessing techniques, it is possible to create an accurate predictive model without knowledge of the content of the data. Table 1 gives an overview of the five models, the theoretical maximum, and the solution of the winner.

What motivates participation in such a competition? Without any doubt, a successful score can positively accentuate one's personal curriculum vitae. This is especially important for participants from universities, but it may also prove beneficial for one's career in business and industry. Above all, however, a good result makes evident that the company's own R&D personnel perform at a high level and need not be afraid of comparisons with specialists elsewhere. This aspect is of particular interest, since very few opportunities for benchmarking otherwise exist for such positions in a company. A competition offers a possibility of appraising the status of one's own algorithms, that is, how good one's own approach is in comparison with those of other specialists. Every serious scientist must have an interest in comparisons of this kind, since continuing development is feasible only on the basis of such position determinations. Whenever the task involves a real problem encountered in industry, a valuable contact with the associated industrial partner can result from participation in the competition.

So much for the motivation of participants. However, what can a company expect if it participates in a competition and furnishes data for the purpose? The answer can be summarised in a very simple way: whoever is willing to invest time in the preparation and evaluation can profit immensely from such
a competition; without active cooperation of this kind, however, the result will not be satisfactory, and especially the assessment of its significance for one's own company will be difficult.
References

BERGER, A.L., DELLA PIETRA, S., and DELLA PIETRA, V.J. (1996): A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1), 39–71.
BISHOP, C.M. (1996): Neural Networks for Pattern Recognition. Clarendon Press.
KEYSERS, D., OCH, F.J., and NEY, H. (2002): Efficient Maximum Entropy Training for Statistical Object Recognition. Informatiktage 2002 der Gesellschaft für Informatik, 342–345. Bad Schussenried, Germany.
LANDWEHR, N., HALL, M., and FRANK, E. (2003): Logistic Model Trees. Proc. 14th Int. Conference on Machine Learning, 241–252. Springer-Verlag, Berlin, Germany.
NABNEY, I.T. (2001): Netlab: Algorithms for Pattern Recognition. Advances in Pattern Recognition. Springer-Verlag Telos.
VOGEL, D. and WANG, M. (2004): 1-Dimensional Splines as Building Blocks for Improving Accuracy of Risk Outcomes Models. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 841–846.
WITTEN, I.H. and FRANK, E. (1999): Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, CA, USA.
Author Index
Agostinelli, C., 246 Albert, A., 87 Amenta, P., 286 Anti´c, G., 310, 382 Arnrich, B., 87 Basu, T.K., 134, 654 Bazen, A., 646 Becker, C., 254 Belitz, C., 590 Benden, C., 318 Berrer, H., 502 Bezrukov, I., 748 Bickel, S., 342 Biemann, C., 326 Bioch, J.C., 566 Bloehdorn, S., 334 Booij, W., 646 Boztu˘ g, Y., 558 Brinker, K., 206 B¨ uhlmann, P., 1 Cal` o, D.G., 166 Caldelli, R., 638 Cantis, S., 708 Cerioli, A., 262 Choulakian, V., 294 Ciavolino, E., 286 Cimiano, P., 334 Cl´emen¸con, S., 214 Costa, I.G., 662 Croux, C., 230, 270 Dambra, L., 294 Dehmer, M., 406 Deselaers, T., 748 Dias, J.G., 95 Drost, I., 342 Dutta, P.K., 134, 654 Eissen, S.M., 430 Enache, D., 470 Espen, P.J., 230, 270 Etschberger, S., 526
Fairhurst, M., 622 Fellner, D., 13 Fenk, A., 350 Fenk-Oczlon, G., 350 Filzmoser, P., 230, 270 Fock, H., 526 Fortuna, B., 358 Galimberti, G., 174 Gamrot, W., 111 Ganczarek, A., 550 Garczarek, U., 470 G¨ artner, T., 75 Gatnar, E., 119 Gibbon, D., 366 Gleim, R., 406 Golz, M., 150 Grcar, M., 374 Grobelnik, M., 374, 398 Groenen, P.J.F., 566 Gr¨ uning, M., 684 Grzybek, P., 310, 382 Guest, R., 622 Hahsler, M., 598 Hall, L.O., 21 Hanafi, M., 222 Hantke, W., 478 Havemann, S., 13 Hegerath, A., 748 Hendrikse, A., 646 Hennig, C., 732 Hering, F., 302 Hildebrandt, L., 558 H¨ oppner, F., 438 Hornik, K., 598 Horv´ ath, T., 75 H¨ ose, S., 534 Hotho, A., 334 Hughes, B., 366 Jajuga, K., 606 Jonscher, R., 748 K¨ ampf, D., 486
Kanade, P.M., 21 Kelih, E., 310, 382 Keysers, D., 748 Klawonn, F., 446 Klein, C., 526 Korn, B., 454 Krauth, J., 670 Krolak-Schwerdt, S., 190 Kropf, S., 684 Kunze, J., 494 Lang, S., 590 Lavraˇc, N., 32 Lesot, M.-J., 462 Ligges, U., 740 Louw, N., 126 Lugosi, G., 214 Malmendier, J., 582 Mauser, A., 748 Mazanec, J.A., 40 Mehler, A., 406 Messaoud, A., 302 Mladeniˇc, D., 52, 358, 374, 398 Moissl, U., 716 M¨ oller, U., 692 Montanari, A., 166 M¨ orchen, F., 278, 724 M¨ ullensiefen, D., 732 Nalbantov, G., 566 Neiling, M., 63 Neumann, G., 390 N¨ ocker, M., 724 Novak, B., 398 Oermann, A., 654 Osswald, R., 326 Ostermann, T., 198 Paaß, G., 414 Papla, D., 606 Patil, H.A., 134 Paulssen, M., 574 Paveˇsi´c, N., 630 Pellizzari, P., 246 Pillati, M., 182 Piva, A., 638 Polasek, W., 502
Prieur, S., 748 Qannari, E.M., 222 Raabe, N., 510 Radke, D., 692 Rapp, R., 422 Reutterer, T., 598 Riani, M., 262 Ribari´c, S., 630 Rosa, A., 638 Rozkrut, D., 518 Rozkrut, M., 518 Rungsarityotin, W., 103 Sahmer, K., 222 Saviˇc, T., 630 Schaab, J., 716 Schebesch, K.B., 542 Scheffer, T., 342 Schliep, A., 103, 662 Schmidt-Thieme, L., 614 Scholz, S.P., 254 Schuster, R., 198 Serikova, E., 142 Serneels, S., 230, 270 Simonetti, B., 294 Soffritti, G., 174 Sommer, D., 150 Sommerfeld, A., 574 Spengler, T., 582 Squillacciotti, S., 238 Stadlober, E., 310, 382 Stamm, C., 724 Stecking, R., 542 Steel, S.J., 126 Stein, B., 430 Steiner, W.J., 590 Strackeljan, J., 748 Szepannek, G., 700 Taormina, A.M., 708 Theis, W., 510 Trippel, T., 366 Trzpiot, G., 550 Tso, K., 614 Ultsch, A., 278, 486, 678, 724 Vayatis, N., 214
Veldhuis, R., 646 Vielhauer, C., 622, 654 Viroli, C., 166, 182 Vogel, D., 748 Vogl, K., 534 Vogt, M., 716 Vries, H., 414 Walter, J., 87 Webber, O., 510 Weber, M., 103
Weber, T., 158 Weihs, C., 302, 470, 510, 700, 740 Wiedenbeck, M., 190 Wolf, F., 654 Wrobel, S., 75 Yegnanarayana, B., 654 Zhuk, E., 142 Zwergel, B., 526