DATA HANDLING IN SCIENCE AND TECHNOLOGY — VOLUME 20B
Handbook of Chemometrics and Qualimetrics: Part B
DATA HANDLING IN SCIENCE AND TECHNOLOGY
Advisory Editors: B.G.M. Vandeginste and S.C. Rutan
Other volumes in this series:
Volume 1 Microprocessor Programming and Applications for Scientists and Engineers, by R.R. Smardzewski
Volume 2 Chemometrics: A Textbook, by D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte and L. Kaufman
Volume 3 Experimental Design: A Chemometric Approach, by S.N. Deming and S.L. Morgan
Volume 4 Advanced Scientific Computing in BASIC with Applications in Chemistry, Biology and Pharmacology, by P. Valko and S. Vajda
Volume 5 PCs for Chemists, edited by J. Zupan
Volume 6 Scientific Computing and Automation (Europe) 1990, Proceedings of the Scientific Computing and Automation (Europe) Conference, 12-15 June 1990, Maastricht, The Netherlands, edited by E.J. Karjalainen
Volume 7 Receptor Modeling for Air Quality Management, edited by P.K. Hopke
Volume 8 Design and Optimization in Organic Synthesis, by R. Carlson
Volume 9 Multivariate Pattern Recognition in Chemometrics, illustrated by case studies, edited by R.G. Brereton
Volume 10 Sampling of Heterogeneous and Dynamic Material Systems: Theories of Heterogeneity, Sampling and Homogenizing, by P.M. Gy
Volume 11 Experimental Design: A Chemometric Approach (Second, Revised and Expanded Edition), by S.N. Deming and S.L. Morgan
Volume 12 Methods for Experimental Design: Principles and Applications for Physicists and Chemists, by J.L. Goupy
Volume 13 Intelligent Software for Chemical Analysis, edited by L.M.C. Buydens and P.J. Schoenmakers
Volume 14 The Data Analysis Handbook, by I.E. Frank and R. Todeschini
Volume 15 Adaption of Simulated Annealing to Chemical Optimization Problems, edited by J. Kalivas
Volume 16 Multivariate Analysis of Data in Sensory Science, edited by T. Naes and E. Risvik
Volume 17 Data Analysis for Hyphenated Techniques, by E.J. Karjalainen and U.P. Karjalainen
Volume 18 Signal Treatment and Signal Analysis in NMR, edited by D.N. Rutledge
Volume 19 Robustness of Analytical Chemical Methods and Pharmaceutical Technological Products, edited by M.W.B. Hendriks, J.H. de Boer and A.K. Smilde
Volume 20A Handbook of Chemometrics and Qualimetrics: Part A, by D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, S. De Jong, P.J. Lewi and J. Smeyers-Verbeke
Volume 20B Handbook of Chemometrics and Qualimetrics: Part B, by B.G.M. Vandeginste, D.L. Massart, L.M.C. Buydens, S. De Jong, P.J. Lewi and J. Smeyers-Verbeke
DATA HANDLING IN SCIENCE AND TECHNOLOGY — VOLUME 20B Advisory Editors: B.G.M. Vandeginste and S.C. Rutan
Handbook of Chemometrics and Qualimetrics: Part B B.G.M. VANDEGINSTE Unilever Research Laboratorium, Vlaardingen, The Netherlands
D.L. MASSART Farmaceutisch Instituut, Dienst Farmaceutische en Biomedische Analyse, Vrije Universiteit Brussel, Brussels, Belgium
L.M.C. BUYDENS Vakgroep Analytische Chemie, Katholieke Universiteit Nijmegen, Faculteit Natuurwetenschappen, Nijmegen, The Netherlands
S. DE JONG Unilever Research Laboratorium, Vlaardingen, The Netherlands
P.J. LEWI Janssen Research Foundation, Center for Molecular Design, Vosselaar, Belgium
J. SMEYERS-VERBEKE Farmaceutisch Instituut, Dienst Farmaceutische en Biomedische Analyse, Vrije Universiteit Brussel, Brussels, Belgium
ELSEVIER Amsterdam - Boston - London - New York - Oxford - Paris San Diego - San Francisco - Singapore - Sydney - Tokyo
ELSEVIER SCIENCE B.V. Sara Burgerhartstraat 25 P.O. Box 211, 1000 AE Amsterdam, The Netherlands © 1998 Elsevier Science B.V. All rights reserved. This work is protected under copyright by Elsevier Science, and the following terms and conditions apply to its use: Photocopying Single photocopies of single chapters may be made for personal use as allowed by national copyright laws. Permission of the Publisher and payment of a fee is required for all other photocopying, including multiple or systematic copying, copying for advertising or promotional purposes, resale, and all forms of document delivery. Special rates are available for educational institutions that wish to make photocopies for non-profit educational classroom use. Permissions may be sought directly from Elsevier Science via their homepage (http://www.elsevier.com) by selecting 'Customer support' and then 'Permissions'. Alternatively you can send an e-mail to:
[email protected], or fax to: (+44) 1865 853333. In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA; phone: (+1) (978) 7508400, fax: (+1) (978) 7504744, and in the UK through the Copyright Licensing Agency Rapid Clearance Service (CLARCS), 90 Tottenham Court Road, London WIP OLP, UK; phone: (+44) 207 631 5555; fax: (+44) 207 631 5500. Other countries may have a local reprographic rights agency for payments. Derivative Works Tables of contents may be reproduced for internal circulation, but permission of Elsevier Science is required for external resale or distribution of such material. Permission of the Publisher is required for all other derivative works, including compilations and translations. Electronic Storage or Usage Permission of the Publisher is required to store or use electronically any material contained in this work, including any chapter or part of a chapter. Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the Publisher. Address permissions requests to: Elsevier Science Global Rights Department, at the fax and e-mail addresses noted above. Notice No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made. First edition 1998 Second impression 2003 Library of Congress Cataloging-in-Publication Data Handbook of chemometrics and qualimetrics / B.G.M. Vandeginste ... [et al.]. p. cm. — (Data handling in science and technology ; v. 20B) Includes index. ISBN 0-444-82853-2 (pt. 20B : acid-free paper) 1. Chemistry, Analytic-Statistical methods. 2. Chemistry, Analytic—Mathematics. 3. Chemistry, Analytic—Data processing. I. Vandeginste, B . G . M . II. Series. QD75.4.S8H36 1998 543\00r5195-dc21
98-42544 CIP
British Library Cataloguing in Publication Data A catalogue record from the British Library has been applied for.
ISBN: 0-444-82853-2 (Vol. 20B)
ISBN: 0-444-82854-0 (set)
The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper). Printed in The Netherlands.
Preface
In 1991 two of us, Luc Massart and Bernard Vandeginste, discussed, during one of our many meetings, the possibility and necessity of updating the book Chemometrics: a textbook. Some of the newer techniques, such as partial least squares and expert systems, were not included in that book which was written some 15 years ago. Initially, we thought that we could bring it up to date with relatively minor revision. We could not have been more wrong. Even during the planning of the book we witnessed a rapid development in the application of natural computing methods, multivariate calibration, method validation, etc. When approaching colleagues to join the team of authors, it was clear from the outset that the book would not be an overhaul of the previous one, but an almost completely new book. When forming the team, we were particularly happy to be joined by two industrial chemometricians. Dr. Paul Lewi from Janssen Pharmaceutica and Dr. Sijmen de Jong from Unilever Research Laboratorium Vlaardingen, each having a wealth of practical experience. We are grateful to Janssen Pharmaceutica and Unilever Research Vlaardingen that they allowed Paul, Sijmen and Bernard to spend some of their time on this project. The three other authors belong to the Vrije Universiteit Brussel (Prof. An Smeyers-Verbeke and Prof. D. Luc Massart) and the Katholieke Universiteit Nijmegen (Professor Lutgarde Buydens), thus creating a team in which university and industry are equally well represented. We hope that this has led to an equally good mix of theory and application in the new book. Much of the material presented in this book is based on the direct experience of the authors. This would not have been possible without the hard work and input of our colleagues, students and post-doctoral fellows. We sincerely want to acknowledge each of them for their good research and contributions without which we would not have been able to treat such a broad range of subjects. Some of them read chapters or helped in other ways. We also owe thanks to the chemometrics community and at the same time we have to offer apologies. We have had the opportunity of collaborating with many colleagues and we have profited from the research and publications of many others. Their ideas and work have made this book possible and necessary. The size of the book shows that they have been very productive. Even so, we have cited only a fraction of the literature and we have not included the more sophisticated work. Our wish was to consolidate and therefore to explain those methods that have become more or less accepted, also to
newcomers to chemometrics. Our apologies, therefore, to those we did not cite or not extensively: it is not a reflection on the quality of their work. Each chapter saw many versions which needed to be entered and re-entered in the computer. Without the help of our secretaries, we would not have been able to complete this work successfully. All versions were read and commented on by all authors in a long series of team meetings. We will certainly retain special memories of many of our two-day meetings, for instance the one organized by Paul in the famous abbey of the regular canons of Premontre at Tongerlo, where we could work in peace and quiet as so many before us have done. Much of this work also had to be done at home, which took away precious time from our families. Their love, understanding, patience and support was indispensable for us to carry on with the seemingly endless series of chapters to be drafted, read or revised.
Contents

Preface
Chapter 28
Introduction to Part B References
Chapter 29
Vectors, Matrices and Operations on Matrices 29.1 Vector space 29.2 Geometrical properties of vectors 29.3 Matrices 29.4 Matrix product 29.5 Dimension and rank 29.6 Eigenvectors and eigenvalues 29.7 Statistical interpretation of matrices 29.8 Geometrical interpretation of matrix products References
Chapter 30
Cluster Analysis 30.1 Clusters 30.2 Measures of (dis)similarity 30.2.1 Similarity and distance 30.2.2 Measures of (dis)similarity for continuous variables 30.2.2.1 Distances 30.2.2.2 Correlation coefficient 30.2.2.3 Scaling 30.2.3 Measures of (dis)similarity for other variables 30.2.3.1 Binary variables 30.2.3.2 Ordinal variables 30.2.3.3 Mixed variables 30.2.4 Similarity matrix 30.3 Clustering algorithms 30.3.1 Hierarchical methods 30.3.2 Non-hierarchical methods 30.3.3 Other methods 30.3.4 Selecting clusters 30.3.4.1 Measures for clustering tendency 30.3.4.2 How many clusters? 30.3.5 Conclusion References
Chapter 31
Analysis of Measurement Tables Introduction 31.1 Principal components analysis 31.1.1 Singular vectors and singular values 31.1.2 Eigenvectors and eigenvalues
31.1.3 Latent vectors and latent values 31.1.4 Scores and loadings 31.1.5 Principal components 31.1.6 Transition formulae 31.1.7 Reconstructions 31.2 Geometrical interpretation 31.2.1 Line of closest fit 31.2.2 Distances 31.2.3 Unipolar axes 31.2.4 Bipolar axes 31.3 Preprocessing 31.3.1 No transformation 31.3.2 Column-centering 31.3.3 Column-standardization 31.3.4 Log column-centering 31.3.5 Log double-centering 31.3.6 Double-closure 31.4 Algorithms 31.4.1 Singular value decomposition 31.4.2 Eigenvalue decomposition 31.5 Validation 31.5.1 Scree-plot 31.5.2 Malinowski's F-test 31.5.3 Cross-validation 31.6 Principal coordinates analysis 31.6.1 Distances defined from data 31.6.2 Distances derived from comparisons of pairs 31.6.3 Eigenvalue decomposition 31.7 Non-linear principal components analysis 31.7.1 Extensions of the data by higher order terms 31.7.2 Non-linear transformations of the data 31.7.3 Non-linear PCA biplot 31.8 Three-way principal components analysis 31.8.1 Unfolding 31.8.2 The Tucker3 model 31.8.3 The PARAFAC model 31.9 PCA and cluster analysis References
Chapter 32
Analysis of Contingency Tables 32.1 Contingency table 32.2 Chi-square statistic 32.3 Closure 32.3.1 Row-closure 32.3.2 Column-closure 32.3.3 Double-closure 32.4 Weighted metric 32.5 Distance of chi-square 32.5.1 Row-closure 32.5.2 Column-closure
32.5.3 Double-closure 32.6 Correspondence factor analysis 32.6.1 Historical background 32.6.2 Generalized singular value decomposition 32.6.3 Biplots. 32.6.4 Application 32.7 Log-linear model 32.7.1 Historical introduction 32.7.2 Algorithm 32.7.3 Application References
Chapter 33
Supervised Pattern Recognition 33.1 Supervised and unsupervised pattern recognition 33.2 Derivation of classification rules 33.2.1 Types of classification rules 33.2.2 Canonical variates and linear discriminant analysis 33.2.3 Quadratic discriminant analysis and related methods 33.2.4 The k-nearest neighbour method 33.2.5 Density methods 33.2.6 Classification trees 33.2.7 UNEQ, SIMCA and related methods 33.2.8 Partial least squares 33.2.9 Neural networks 33.3 Feature selection and reduction 33.4 Validation of classification rules References
Chapter 34
Curve and Mixture Resolution by Factor Analysis and Related Techniques 34.1 Abstract and true factors 34.2 Full-rank methods 34.2.1 A qualitative approach 34.2.2 Factor rotations 34.2.3 The Varimax rotation 34.2.4 Factor rotation by target transformation factor analysis (TTFA) 34.2.5 Curve resolution based methods 34.2.5.1 Curve resolution of two-factor systems 34.2.5.2 Curve resolution of three-factor systems 34.2.6 Factor rotation by iterative target transformation factor analysis (ITTFA) 34.3 Evolutionary and local rank methods 34.3.1 Evolving factor analysis (EFA) 34.3.2 Fixed-size window evolving factor analysis (FSWEFA) 34.3.3 Heuristic evolving latent projections (HELP) 34.4 Pure column (or row) techniques 34.4.1 The variance diagram (VARDIA) technique 34.4.2 Simplisma 34.4.3 Orthogonal projection approach (OPA) 34.5 Quantitative methods for factor analysis 34.5.1 Generalized rank annihilation factor analysis (GRAFA) 34.5.2 Residual bilinearization (RBL)
34.5.3 Discussion 34.6 Application of factor analysis for peak purity check in HPLC 34.7 Guidance for the selection of a factor analysis method References
Chapter 35
Relations between Measurement Tables 35.1 Introduction 35.2 Procrustes analysis 35.2.1 Introduction 35.2.2 Algorithm 35.2.3 Discussion 35.3 Canonical correlation analysis 35.3.1 Introduction 35.3.2 Algorithm 35.3.3 Discussion 35.4 Multivariate least squares regression 35.4.1 Introduction 35.4.2 Algorithm 35.4.3 Discussion 35.5 Reduced rank regression 35.5.1 Introduction 35.5.2 Algorithm 35.5.3 Discussion 35.5.4 Example 35.6 Principal components regression 35.6.1 Introduction 35.6.2 Algorithm 35.6.3 Discussion 35.7 Partial least squares regression 35.7.2 NIPALS-PLS Algorithm 35.7.3 Discussion 35.7.4 Alternative PLS algorithms 35.8 Continuum regression methods 35.9 Concluding remarks References
Chapter 36
Multivariate Calibration 36.1 Introduction 36.2 Calibration methods 36.2.1 Classical least squares 36.2.2 Inverse least squares 36.2.3 Principal Components Regression 36.2.4 Partial least squares regression 36.2.5 Other linear methods 36.3 Validation 36.4 Other aspects 36.4.1 Calibration design 36.4.2 Data pretreatment 36.4.3 Outliers 36.5 New developments
36.5.1 Feature selection 36.5.2 Transfer of calibration models 36.5.3 Non-linear methods References
Chapter 37
Quantitative Structure-Activity Relationships (QSAR) 37.1 Extrathermodynamic methods 37.1.1 Hansch analysis 37.1.2 Free-Wilson analysis 37.2 Principal components models 37.2.1 Principal components analysis 37.2.2 Spectral map analysis 37.2.3 Correspondence factor analysis 37.3 Canonical variate models 37.3.1 Linear discriminant analysis 37.3.2 Canonical correlation analysis 37.4 Partial least squares models 37.4.1 PLS regression and CoMFA 37.4.2 Two-block PLS and indirect QSAR 37.5 Other approaches References
Chapter 38
Analysis of Sensory Data 38.1 Introduction 38.2 Difference tests 38.2.1 Triangle test 38.2.2 Duo-trio test 38.2.3 Paired comparisons 38.3 Multidimensional scaling 38.4 The analysis of Quantitative Descriptive Analysis profile data 38.5 Comparison of two or more sensory data sets 38.6 Linking sensory data to instrumental data 38.7 Temporal aspects of perception 38.8 Product formulation References
Chapter 39
Pharmacokinetic Models Introduction 39.1 Compartmental analysis 39.1.1 One-compartment open model for intravenous administration 39.1.2 Two-compartment catenary model for extravascular administration 39.1.3 Two-compartment catenary model for extravascular administration with incomplete absorption 39.1.4 One-compartment open model for continuous intravenous infusion 39.1.5 One-compartment open model for repeated intravenous administration 39.1.6 Two-compartment mammillary model for intravenous administration using Laplace transform
39.1.7 Multi-compartment models 39.1.7.1 The convolution method 39.1.7.2 The Y-method 39.2 Non-compartmental analysis 39.3 Compartment models versus non-compartmental analysis 39.4 Linearization of non-linear models References
Chapter 40
Signal Processing 40.1 Signal domains 40.2 Types of signal processing 40.3 The Fourier transform 40.3.1 Time and frequency domain 40.3.2 The Fourier transform of a continuous signal 40.3.3 Derivation of the Fourier transform of a sine 40.3.4 The discrete Fourier transformation 40.3.5 Frequency range and resolution 40.3.6 Sampling 40.3.7 Zero filling and resolution 40.3.8 Periodicity and symmetry 40.3.9 Shift and phase 40.3.10 Distributivity and scaling 40.3.11 The fast Fourier transform 40.4 Convolution 40.5 Signal processing 40.5.1 Characterization of noise 40.5.2 Signal enhancement in the time domain 40.5.2.1 Time averaging 40.5.2.2 Smoothing by moving average 40.5.2.3 Polynomial smoothing 40.5.2.4 Exponential smoothing 40.5.3 Signal enhancement in the frequency domain 40.5.4 Smoothing and filtering: a comparison 40.5.5 The derivative of a signal 40.5.6 Data compression by a Fourier transform 40.6 Deconvolution by Fourier transform 40.7 Other deconvolution methods 40.7.1 Maximum Likelihood 40.7.2 Maximum Entropy 40.8 Other transforms 40.8.1 The Hadamard transform 40.8.2 The time-frequency Fourier transform 40.8.3 The wavelet transform References
Chapter 41
Kalman Filtering 41.1 Introduction 41.2 Recursive regression of a straight line 41.3 Recursive multicomponent analysis 41.4 System equations
41.4.1 System equation for a kinetics experiment 41.4.2 System equation of a calibration line with drift 41.5 The Kalman filter 41.5.1 Theory 41.5.2 Kalman filter of a kinetics model 41.5.3 Kalman filtering of a calibration line with drift 41.6 Adaptive Kalman filtering 41.6.1 Evaluation of the innovation 41.6.2 The adaptive Kalman filter model 41.7 Applications References
Chapter 42
Applications of Operations Research 42.1 An overview 42.2 Linear programming 42.3 Queuing problems 42.3.1 Queuing and Waiting 42.3.2 Application in analytical laboratory management 42.4 Discrete event simulation 42.5 A shortest path problem References
Chapter 43
Artificial Intelligence: Expert and Knowledge Based Systems 43.1 Artificial intelligence and expert systems 43.2 Expert systems 43.3 Structure of expert systems 43.4 Knowledge representation 43.4.1 Rule-based knowledge representation 43.4.2 Frame-based knowledge representation 43.5 The inference engine 43.5.1 Rule-based inferencing 43.5.2 Frame-based inferencing 43.5.2.1 Inheritance 43.5.2.2 Object-oriented programming techniques 43.5.3 Reasoning with uncertainty 43.6 The interaction module 43.7 Tools 43.8 Development of an expert system 43.8.1 Analysis of the application area 43.8.2 Definition of knowledge domain, sources and tools 43.8.3 Knowledge acquisition 43.8.4 Implementation 43.8.5 Testing, validation and evaluation 43.8.6 Maintenance 43.9 Conclusion References
Chapter 44
Artificial Neural Networks 44.1 Introduction 44.2 Historical overview 44.3 The basic unit — the neuron
44.4 The linear learning machine and the perceptron network 44.4.1 Principle 44.4.2 Learning strategy 44.4.3 Limitations 44.5 Multilayer feed forward (MLF) networks 44.5.1 Introduction 44.5.2 Structure 44.5.3 Signal propagation 44.5.4 The transfer function 44.5.4.1 Role of the transfer function 44.5.4.2 Transfer function of the output units 44.5.4.3 Transfer function in the hidden units 44.5.5 Learning rule 44.5.6 Learning rate and momentum term 44.5.7 Training and testing an MLF network 44.5.7.1 Network performance 44.5.7.2 Local minima 44.5.8 Determining the number of hidden units 44.5.9 Data preprocessing 44.5.9.1 Scaling 44.5.9.2 Variable selection and reduction 44.5.10 Validation of MLF networks 44.5.11 Aspects of use 44.5.12 Chemical applications 44.6 Radial basis function networks 44.6.1 Structure 44.6.2 Training 44.6.3 An example 44.6.4 Applications 44.7 Kohonen networks 44.7.1 Structure 44.7.2 Training 44.7.3 Interpretation of the Kohonen map 44.7.4 Applications 44.8 Adaptive resonance theory networks 44.8.1 Introduction 44.8.2 Structure 44.8.3 Training 44.8.4 Application References Index
Chapter 28
Introduction to Part B
In the introduction to Part A we discussed the "arch of knowledge" [1] (see Fig. 28.1), which represents the cycle of acquiring new knowledge by experimentation and the processing of the data obtained from the experiments. Part A focused mainly on the first step of the arch: a proper design of the experiment based on the hypothesis to be tested, evaluation and optimization of the experiments, with the accent on univariate techniques. In Part B we concentrate on the second and third steps of the arch, the transformation of data and results into information and the combination of information into knowledge, with the emphasis on multivariate techniques. In order to obtain information from a series of experiments, we need to interpret the data. Very often the first step in understanding the data is to visualise them in a plot or a graph. This is particularly important when the data are complex in nature. These plots help in discovering any structure that might be present in the data and which can be related to a property of the objects studied. Because plots have to be represented on paper or on a flat computer screen, data need to be projected and compressed. Analytical results are often represented in a data table, e.g., a table of the fatty acid compositions of a set of olive oils. Such a table is called a two-way multivariate data table. Because some olive oils may originate from the same region and others from a different one, the complete table has to be studied as a whole instead as a collection of individual samples, i.e., the results of each sample are interpreted in the context of the results obtained for the other samples. For example, one may ask for natural groupings of the samples in clusters with a common property, namely a similar fatty acid composition. This is the objective of cluster analysis (Chapter 30), which is one of the techniques of unsupervised pattern recognition. The results of the clustering do not depend on the way the results have been arranged in the table, i.e., the order of the objects (rows) or the order of the fatty acids (columns). In fact, the order of the variables or objects has no particular meaning. In another experiment we might be interested in the monthly evolution of some constituents present in the olive oil. Therefore, we decide to measure the total amount of free fatty acids and the triacylglycerol composition in a set of olive oil
Fig. 28.1. The arch of knowledge. (Labels in the original figure: Experiment, data, Information, Induction (analysis), Hypothesis, Knowledge, intelligence, creativity, Deduction (synthesis), design.)
samples under fixed storage conditions. Each month a two-way table is obtained — six in total after six months. We could decide to analyse all six tables individually. However, this would not provide information on the effect of the storage time and its relation to the origin of the oil. It is more informative to consider all tables together. They form a so-called three-way table. The analysis of such a table is discussed in Chapter 31. If, in addition, all olive samples are split into portions which are stored under different conditions, e.g., open and closed bottles, darkness and daylight, we obtain several three-way tables or in general a multi-way data table. Some analytical instruments produce a table of raw data which need to be processed into the analytical result. Hyphenated measurement devices, such as HPLC linked to a diode array detector (DAD), form an important class of such instruments. In the particular case of HPLC-DAD, data tables are obtained consisting of spectra measured at several elution times. The rows represent the spectra and the columns are chromatograms detected at a particular wavelength. Consequently, rows and columns of the data table have a physical meaning. Because the data table X can be considered to be a product of a matrix C containing the concentration profiles and a matrix S containing the pure (but often unknown) spectra, we call such a table bilinear. The order of the rows in this data table corresponds to the order of the elution of the compounds from the analytical column. Each row corresponds to a particular elution time. Such bilinear data tables are therefore called ordered data tables. Trilinear data tables are obtained from LC-detectors which produce a matrix of data at any instance during the
elution, e.g., an excitation-emission spectrum as a function of time. Bilinear and trilinear data tables are also measured when a chemical reaction is monitored over time, e.g., the biomass in a fermenter by Near infrared Spectroscopy. An example of a non-bilinear two-way table is a table of MS-MS spectra or a 2D-NMR spectrum. These tables cannot be represented as a product of row and column spectra. So far, we have discussed the structure of individual data tables: two-way to multiway, which may be bilinear to multilinear. In some cases two or more tables are obtained for a set of samples. The simplest situation is a two-way data table associated with a vector of properties, e.g., a table with the fatty acid composition of olive oils and a vector with the coded region of the oils. In this case we do not only want to display the data table, but we want to derive a classification rule, which classifies the collected oils in the right category. With this classification rule, oils of unknown origin can be classified. This is the area of supervised pattern recognition, which classically is based on multivariate statistics (Chapter 33) but more recently neural nets have been introduced in this field (Chapter 44). The technique of neural networks belongs to the area of so-called natural computation methods. Genetic algorithms — another technique belonging to the family of natural computation methods — were discussed in Part A. They are called natural because the principle of the algorithms to some extent mimics the way biological systems function. Originally, neural networks were considered to be a model for the functioning of the brain. In the example above, the property is a discrete class, region of origin, healthy or ill person, which is not necessarily a quantitative value. However, in many cases the property may be a numerical value, e.g., a concentration or a pharmacological activity (Chapter 37). Modelling a vector of properties to a table of measurements is the area of multivariate calibration, e.g., by principal components regression or by partial least squares, which are described in Chapter 36. Here the degree of complexity of the data sets is almost unlimited: several data tables may be predictors for a table of properties, the relationship between the tables may be non-linear and some or all tables may be multiway. As indicated before, the columns and the rows of a bilinear or trilinear dataset have a particular meaning, e.g., a spectrum and a chromatogram or the concentration profiles of reactants and the reaction products in an equilibrium or kinetic study. The resulting data table is made up by the product of the tables of these pure factors, e.g., the table of the elution profiles of the pure compounds and the table of the spectra of these compounds. One of the aims of a study of such a table is the decomposition of the table into its pure spectra and pure elution profiles. This is done by factor analysis (Chapter 34). A special type of table is the contingency table, which has been introduced in Chapter 16 of Part A. In Part B the 2x2 contingency table is extended to the general case (Chapter 32) which can be analyzed in a multivariate way. The above
examples illustrate that the complexity of data and operations discussed in Part B require advanced chemometric techniques. Although Part B can be studied independently from Part A, we will implicitly assume a chemometrics background equivalent to Part A, where a more informal treatment of some of the same topics can be found. For instance, vectors and matrices are introduced for the first time in Chapter 9 of Part A, but are treated in more depth in Chapter 29 of Part B. Principal components analysis, which was introduced in Chapter 17, is discussed in more detail in Chapter 31. These two chapters provide the basis for more advanced techniques such as Procrustes analysis, canonical correlation analysis and partial least squares discussed in Chapter 35. In Part B we also concentrate on a number of important application areas of multivariate statistics: multivariate calibration (Chapter 36) quantitative structure-activity relationships (QSAR, Chapter 37), sensory analysis (Chapter 38) and pharmacokinetics (Chapter 39). The success of data analysis depends on the quality of the data. Noise and other instrumental factors may hide the information in the data. In some instances it is possible to improve the quality of the data by a suitable preprocessing technique such as signal filtering and signal restoration by deconvolution (Chapter 40). Prior to signal enhancement and restoration it may be necessary to transform the data by, e.g., a Fourier transform or wavelet transform. Both are discussed in Chapter 40. A special type of filter is the Kalman filter which is particularly applicable to the real-time modelling of systems of which the model parameters are time dependent. For instance, the slope and intercept of a calibration line may be subject to a drift. At each new measurement of a calibration standard, the Kalman filter updates the calibration factors, taking into account the uncertainty in the calibration factors and in the data. Because the Kalman filter is driven by the difference between a new measurement and the predicted value of that measurement, the filter ignores outlying measurements, caused by stochastic or systematic errors. The last step of the arch of knowledge is the transformation of information into knowledge. Based on this knowledge one is able to make a decision. For instance, values of temperatures and pressures in different parts of a process have to be interpreted by the operator in the control room of the plant, who may take a series of actions. Which action to take is not always obvious and some guidance is usually found in a manual. In analytical method development the same situation is encountered. Guidance is required for instance to select a suitable stationary phase in HPLC or to select the solvents that will make up the mobile phase. This type of knowledge may be available in the form of "If Then" rules. Such rules can be combined in a rule-based knowledge base, which is consulted by an expert system (Chapter 43). Questions such as "What If can also be answered by models developed in Operations Research (Chapter 42). For instance, the average time a sample has to wait in the sample queue can be predicted by queuing theory for various priority strategies.
After completion of one cycle of the arch of knowledge, we are back at the starting point of the arch, where we should accept or reject the hypothesis. At this stage a new cycle can be started based on the knowledge gained from all previous ones. The chemometric techniques described in Part A and B aim to support the scientist in running through these cycles in an efficient way.
References
1. D. Oldroyd, The Arch of Knowledge. Methuen, New York (1986).
Chapter 29
Vectors, Matrices and Operations on Matrices

This chapter is an extension and generalization of the material presented in Chapter 9. Here we deal with the calculus of vectors and matrices from the point of view of the analysis of a two-way multivariate data table, as defined in Chapter 28. Such data arise when several measurements are made simultaneously on each object in a set [1]. Usually these raw data are collected in tables in which the rows refer to the objects and the columns to the measurements. For example, one may obtain physicochemical properties such as lipophilicity, electronegativity, molecular volume, etc., on a number of chemical compounds. The resulting table is called a measurement table. Note that the assignment of objects to rows and of measurements to columns is only conventional. It arises from the fact that often there are more objects than measurements, and that printing of such a table is more convenient with the smallest number of columns. In a cross-tabulation each element of the table represents a count, a mean value or some other summary statistic for the various combinations of the categories of the two selected measurements. In the above example, one may cross the categories of lipophilicity with the categories of electronegativity (using appropriate intervals of the measurement scales). When each cell of such a cross-tabulation contains the number of objects that belong to the combined categories, this results in a contingency table or frequency table which is discussed extensively in Chapter 32. In a more general cross-tabulation, each cell of the table may refer, for example, to the average molecular volume that has been observed for the combined categories of lipophilicity and electronegativity. One of the aims of multivariate analysis is to reveal patterns in the data, whether they are in the form of a measurement table or in that of a contingency table. In this chapter we will refer to both of them by the more algebraic term 'matrix'. In what follows we describe the basic properties of matrices and of operations that can be applied to them. In many cases we will not provide proofs of the theorems that underlie these properties, as these proofs can be found in textbooks on matrix algebra (e.g. Gantmacher [2]). The algebraic part of this section is also treated more extensively in textbooks on multivariate analysis (e.g. Dillon and Goldstein [1], Giri [3], Cliff [4], Harris [5], Chatfield and Collins [6], Srivastava and Carter [7], Anderson [8]).
29.1 Vector space

In accordance with Section 9.1, we represent a vector z as an ordered vertical arrangement of numbers. The transpose z^T then represents an ordered horizontal arrangement of the same numbers. The dimension of a vector is equal to the number of its elements, and a vector with dimension n will be referred to as an n-vector. A set of p vectors (z_1 ... z_p) with the same dimension n is linearly independent if the expression:

\sum_{j=1}^{p} c_j z_j = 0        (29.1)

holds only when all p coefficients c_j are zero. Otherwise, the p vectors are linearly dependent (see also Section 9.2.8). The following three vectors z_1, z_2 and z_3 with dimension four are linearly independent:

z_1 = [-1, 4, 0, 2]^T    z_2 = [2, 2, 3, -1]^T    z_3 = [-5, 1, -6, 4]^T

as it is not possible to find a set of coefficients c_1, c_2, c_3 which are not all equal to zero and which satisfy the system of four equations:

-1 c_1 + 2 c_2 - 5 c_3 = 0
 4 c_1 + 2 c_2 + 1 c_3 = 0
 0 c_1 + 3 c_2 - 6 c_3 = 0
 2 c_1 - 1 c_2 + 4 c_3 = 0

In the case of linearly dependent vectors, each of them can be expressed as a linear combination of the others. For example, the last of the three vectors below can be expressed in the form z_3 = z_1 - 2 z_2:

z_1 = [-1, 4, 0, 2]^T    z_2 = [2, 2, 3, -1]^T    z_3 = [-5, 0, -6, 4]^T
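Linear (in)dependence of a small set of vectors is conveniently checked numerically. The following Python/NumPy sketch (added here as an illustration; it is not part of the original text, and the variable names are ours) verifies both examples by computing the rank of the matrix whose columns are z_1, z_2 and z_3:

```python
import numpy as np

# Columns are z1, z2, z3 of the first (linearly independent) example.
Z_indep = np.array([[-1,  2, -5],
                    [ 4,  2,  1],
                    [ 0,  3, -6],
                    [ 2, -1,  4]])

# Same z1 and z2, but now z3 = z1 - 2*z2 (the dependent example).
Z_dep = np.array([[-1,  2, -5],
                  [ 4,  2,  0],
                  [ 0,  3, -6],
                  [ 2, -1,  4]])

print(np.linalg.matrix_rank(Z_indep))  # 3 -> the columns are linearly independent
print(np.linalg.matrix_rank(Z_dep))    # 2 -> the columns are linearly dependent
```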
A vector space spanned by a set of p vectors (z_1 ... z_p) with the same dimension n is the set of all vectors that are linear combinations of the p vectors that span the space [3]. A vector space satisfies the three requirements of an algebraic set which follow hereafter. (1) Any vector obtained by vector addition and scalar multiplication of the vectors that span the space also belongs to this space. This includes the null vector whose elements are all equal to zero. (2) Addition of the null vector to any vector of the space reproduces the vector. (3) For every vector that belongs to the space, another one can be found, such that vector addition of these two vectors produces the null vector. A set of n vectors of dimension n which are linearly independent is called a basis of an n-dimensional vector space. There can be several bases of the same vector space. The set of unit vectors of dimension n defines an n-dimensional rectangular (or Cartesian) coordinate space S^n. Such a coordinate space S^n can be thought of as being constructed from n base vectors of unit length which originate from a common point and which are mutually perpendicular. Hence, a coordinate space is a vector space which is used as a reference frame for representing other vector spaces. It is not uncommon that the dimension of a coordinate space (i.e. the number of mutually perpendicular base vectors of unit length) exceeds the dimension of the vector space that is embedded in it. In that case the latter is said to be a subspace of the former. For example, the basis of S^4 is:

u_1 = [1, 0, 0, 0]^T    u_2 = [0, 1, 0, 0]^T    u_3 = [0, 0, 1, 0]^T    u_4 = [0, 0, 0, 1]^T

Any vector x in S^n can be uniquely expressed as a linear combination of the n basis vectors u_i:

x = \sum_{i=1}^{n} x_i u_i        (29.2)

where the n vectors (u_1 ... u_n) form a basis of S^n. The n coefficients (x_1 ... x_n) are called the coordinates of the vector x in the basis (u_1 ... u_n). For example, a vector x in S^4 may be expressed in the previously defined usual basis as the vector sum:

x = 3 [1, 0, 0, 0]^T + (-4) [0, 1, 0, 0]^T + 0 [0, 0, 1, 0]^T + 2 [0, 0, 0, 1]^T = [3, -4, 0, 2]^T

where the coefficients 3, -4, 0 and 2 are the coordinates of x in the usual basis. Another basis of S^4 can be defined by means of the set of vectors:

[1, 1, 1, 1]^T    [1, -1, -1, 1]^T    [-1, 1, -1, 1]^T    [-1, -1, 1, 1]^T

and the same vector x which we have defined above can be expressed in this particular basis by means of the vector sum:

x = 0.25 [1, 1, 1, 1]^T + 2.25 [1, -1, -1, 1]^T + (-1.25) [-1, 1, -1, 1]^T + 0.75 [-1, -1, 1, 1]^T

where the coefficients 0.25, 2.25, -1.25 and 0.75 now represent the coordinates of x in this particular basis. From the above it follows that a vector can be expressed by different coordinates according to the particular basis that has been chosen. In multivariate data analysis one often changes the basis in order to highlight particular properties of the vectors that are represented in it. This automatically causes a change of their coordinates. A change of basis and its effect on the coordinates can be defined algebraically, as is shown in Chapters 31 and 32.

29.2 Geometrical properties of vectors

Every n-vector can be represented as a point in an n-dimensional coordinate space. The n elements of the vector are the coordinates along n basis vectors, such as defined in the previous section. The null vector 0 defines the origin of the coordinate space. Note that the origin together with an endpoint define a directed line segment or axis, which also represents a vector. Hence, there is an equivalence between points and axes, which can both be thought of as geometrical representations of vectors in coordinate space. (The concepts discussed here are extensions of those covered previously in Sections 9.2.4 to 9.2.5.) In this and subsequent sections we will make frequent use of the scalar product (also called inner product) between two vectors x and y with the same dimension n, which is defined by:
x^T y = \sum_{i=1}^{n} x_i y_i        (29.3)

By way of example, if x = [2, 1, 3, -1]^T and y = [4, 2, -1, 5]^T then we obtain that the scalar product of x with y equals:

x^T y = 2×4 + 1×2 + 3×(-1) + (-1)×5 = 2

Note that in Section 9.2.2.3 the dot product x·y is used as an equivalent notation for the scalar product x^T y. In Euclidean space we define the squared distance from the origin of a point x by means of the scalar product of x with itself:

x^T x = \sum_{i=1}^{n} x_i^2 = ||x||^2        (29.4)

where ||x|| is to be read as the norm or length of vector x. Likewise, the squared distance between two points x and y is given by the expression:

(x - y)^T (x - y) = \sum_{i=1}^{n} (x_i - y_i)^2 = ||x - y||^2        (29.5)

where ||x - y|| is the norm or length of the vector x - y. Note that the expression of distance from the origin in eq. (29.4) can be derived from that of distance between two points in eq. (29.5) by replacing the vector y by the null vector 0:

x^T x = (x - 0)^T (x - 0) = ||x - 0||^2 = ||x||^2

Given the vectors x = [2, 1, 3, -1]^T and y = [4, 2, -1, 5]^T, we derive the norms:

||x||^2 = 2^2 + 1^2 + 3^2 + (-1)^2 = 15
||y||^2 = 4^2 + 2^2 + (-1)^2 + 5^2 = 46
||x - y||^2 = (2-4)^2 + (1-2)^2 + (3-(-1))^2 + (-1-5)^2 = 57

It may be noted that some authors define the norm of a vector x as the square of ||x|| rather than ||x|| itself, e.g. Gantmacher [2]. Angular distance or angle between two points x and y, as seen from the origin of space, is derived from the definition of the scalar product in terms of the norms of the vectors:

x^T y = \sum_{i=1}^{n} x_i y_i = ||x|| ||y|| cos θ        (29.6)

where θ represents the angular distance between the vectors x and y. The geometrical interpretation of the scalar product of the vectors x and y is that of an arithmetic product of the length of y, i.e. ||y||, with the projection of x upon y, i.e. ||x|| cos θ (Fig. 29.1).

Fig. 29.1. Geometrical interpretation of the scalar product x^T y as the projection of the vector x upon the vector y. The lengths of x and y are denoted by ||x|| and ||y||, respectively, and their angular separation is denoted by θ.

From the expression in eq. (29.6) we derive that cos θ equals the scalar product of the normalized vectors x and y:

cos θ = x^T y / (||x|| ||y||) = (x/||x||)^T (y/||y||)        (29.7)

Note that normalization of an arbitrary vector x is obtained by dividing each of its elements by the norm ||x|| of the vector. The geometric properties of vectors can be combined into the triangle relationship, also called the cosine rule, which states that:

||x - y||^2 = ||x||^2 + ||y||^2 - 2 ||x|| ||y|| cos θ        (29.8)

This relationship is of importance in multivariate data analysis as it relates distance between endpoints of two vectors to distances and angular distance from the origin of space. A geometrical interpretation is shown in Fig. 29.2.

Fig. 29.2. Distance ||x - y|| between two vectors x and y of length ||x|| and ||y||, separated by an angle θ.

Using the vectors x and y from our previous illustration, we derive that:

cos θ = 2 / sqrt(15 × 46) = 0.0761

or:

θ = 85.64 degrees

One can define three special configurations of two vectors, namely parallel in the same direction, parallel in opposite directions, and orthogonal (or perpendicular). The three special configurations depend on the angular distances between the two vectors, being 0, 180 and 90 degrees respectively (Fig. 29.3).

Fig. 29.3. Three special configurations of two vectors x and y and their corresponding scalar product x^T y. For angular separations of 0, 180 and 90 degrees the scalar product equals ||x|| ||y||, -||x|| ||y|| and 0, respectively.

More generally, two vectors x and y are orthogonal when their scalar product is zero:

x^T y = 0        (29.9)
It can be shown that the vectors x = [2, -1, 8, 0]^T and y = [10, 4, -2, 3]^T are orthogonal, since:

x^T y = 2×10 + (-1)×4 + 8×(-2) + 0×3 = 0

Hence cos θ equals 0, or equivalently θ equals 90 degrees. Two orthogonal vectors are orthonormal when, in addition to orthogonality, the norms of these vectors are equal to one:

||x|| = ||y|| = 1

or equivalently:

x^T x = y^T y = 1        (29.10)

The n basis vectors which define the basis of a coordinate space S^n are n mutually orthogonal and normalized vectors. Together they form a frame of reference axes for that space. If we represent by x̄ and ȳ the arithmetic means of the elements of the vectors x and y:

x̄ = (1/n) \sum_{i=1}^{n} x_i        ȳ = (1/n) \sum_{i=1}^{n} y_i        (29.11)

then we can relate the norms of the vectors (x - x̄) and (y - ȳ) to the standard deviations s_x and s_y of the elements in x and y (Section 2.1.4):

s_x^2 = (1/n) \sum_{i=1}^{n} (x_i - x̄)^2 = (1/n) ||x - x̄||^2
s_y^2 = (1/n) \sum_{i=1}^{n} (y_i - ȳ)^2 = (1/n) ||y - ȳ||^2        (29.12)

Note that in data analysis we divide by n in the definition of standard deviation rather than by the factor n - 1 which is customary in statistical inference. Likewise we can relate the product-moment (or Pearson) coefficient of correlation r (Section 8.3.1) to the scalar product of the vectors (x - x̄) and (y - ȳ):

r = \sum_{i=1}^{n} (x_i - x̄)(y_i - ȳ) / [ \sum_{i=1}^{n} (x_i - x̄)^2 \sum_{i=1}^{n} (y_i - ȳ)^2 ]^{1/2} = (x - x̄)^T (y - ȳ) / (||x - x̄|| ||y - ȳ||) = cos φ        (29.13)

where φ is the angular distance between the vectors (x - x̄) and (y - ȳ).
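These relationships are easy to reproduce numerically. The following Python/NumPy sketch (an added illustration, not part of the original text) recomputes the worked example of this section and checks eq. (29.13) by comparing the Pearson correlation with the cosine of the angle between the centred vectors:

```python
import numpy as np

x = np.array([2.0, 1.0, 3.0, -1.0])
y = np.array([4.0, 2.0, -1.0, 5.0])

print(x @ y)                        # scalar product: 2.0
print(x @ x, y @ y)                 # squared norms: 15.0, 46.0
print(np.sum((x - y) ** 2))         # squared distance: 57.0

cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(np.degrees(np.arccos(cos_theta)))   # angle in degrees (approximately 85.6)

# Eq. (29.13): the correlation coefficient equals the cosine of the angle
# between the mean-centred vectors.
xc, yc = x - x.mean(), y - y.mean()
r = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
print(r, np.corrcoef(x, y)[0, 1])   # both expressions give the same value
```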
29.3 Matrices

A matrix is defined as an ordered rectangular arrangement of scalars into horizontal rows and vertical columns (Section 9.3). On the one hand, one can consider a matrix X with n rows and p columns as an ordered array of p vectors of dimension n, each of the form:

x_j = [x_1j, ..., x_nj]^T        with j = 1, ..., p

On the other hand, one can also regard the same matrix X as an ordered array of n vectors of dimension p, each of the form:

x_i^T = [x_i1, ..., x_ip]        with i = 1, ..., n

In our notation, x_ij represents the element of matrix X at the crossing of row i and column j. The vector x_j defines a vector which contains the n elements of the jth column of X. The vector x_i refers to a vector which comprises the p elements of the ith row of X. In the matrix X of the following example:

X = [  3   2  -2
       2   1   0
       0  -4  -3
      -1   2   4 ]

we denote the second column by means of the vector x_{j=2}:

x_{j=2} = [2, 1, -4, 2]^T
Fig. 29.4. Schematic representation of a matrix X as a stack of horizontal rows x_i, and as an assembly of vertical columns x_j.
and the third row by means of the vector x_{i=3}:

x_{i=3}^T = [0, -4, -3]

In the illustration of Fig. 29.4 we regard the matrix X as either built up from n horizontal rows x_i of dimension p, or as built up from p vertical columns x_j of dimension n. This exemplifies the duality of the interpretation of a matrix [9]. From a geometrical point of view, and according to the concept of duality, we can interpret a matrix with n rows and p columns either as a pattern of n points in a p-dimensional space, or as a pattern of p points in an n-dimensional space. The former defines a row-pattern P^n in column-space S^p, while the latter defines a column-pattern P^p in row-space S^n. The two patterns and spaces are called dual (or conjugate). The term dual space also possesses a specific meaning in another mathematical context which is distinct from the one which is implied here. The occasion for confusion, however, is slight.

Fig. 29.5. Geometrical interpretation of an n×p matrix X as either a row-pattern of n points P^n in p-dimensional column-space S^p (left panel) or as a column-pattern of p points P^p in n-dimensional row-space S^n (right panel). The p vectors u_j form a basis of S^p and the n vectors v_i form a basis of S^n.

In Fig. 29.5, the column-space S^p is represented as a p-dimensional coordinate space in which each row x_i of X defines a point with coordinates (x_i1, .., x_ij, .., x_ip) in an orthonormal basis (u_1, .., u_j, .., u_p) such that:

x_i = x_i1 u_1 + ... + x_ij u_j + ... + x_ip u_p

with:

u_j = [0, ..., 0, 1, 0, ..., 0]^T        (a 1 in the jth position and zeros elsewhere), for j = 1, ..., p

Each element x_ij can thus be reconstructed from the scalar product:

x_ij = x_i^T u_j = u_j^T x_i        (29.14)

In the same Fig. 29.5, the row-space S^n is shown as an n-dimensional coordinate space in which each column x_j of X defines a point with coordinates (x_1j, .., x_ij, .., x_nj) in an orthonormal basis (v_1, .., v_i, .., v_n) such that:

x_j = x_1j v_1 + ... + x_ij v_i + ... + x_nj v_n

with:

v_i = [0, ..., 0, 1, 0, ..., 0]^T        (a 1 in the ith position and zeros elsewhere), for i = 1, ..., n

Each element x_ij can also be reconstructed from the scalar product:

x_ij = v_i^T x_j = x_j^T v_i        (29.15)
The dimension of a matrix X with n rows and p columns is n×p (pronounced n by p). Here X is referred to as an n×p matrix or as a matrix of dimension n×p. A matrix is called square if the number of rows is equal to the number of columns. Such a matrix can be referred to as a p×p matrix or as a square matrix of dimension p. The transpose of a matrix X is obtained by interchanging its rows and columns and is denoted by X^T. If X is an n×p matrix then X^T is a p×n matrix. In particular, we have that:

(X^T)^T = X        (29.16)

The transpose of the matrix X in the previous example is given below:

X^T = [  3   2   0  -1
         2   1  -4   2
        -2   0  -3   4 ]
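The row/column duality and the transpose are mirrored directly in array software. A small Python/NumPy illustration (added here; not part of the original text) using the example matrix:

```python
import numpy as np

X = np.array([[ 3,  2, -2],
              [ 2,  1,  0],
              [ 0, -4, -3],
              [-1,  2,  4]])

print(X.shape)     # (4, 3): n = 4 rows (objects), p = 3 columns (measurements)
print(X[:, 1])     # second column x_{j=2}: [ 2  1 -4  2]
print(X[2, :])     # third row x_{i=3}:     [ 0 -4 -3]
print(X.T)         # the p x n transpose
print(np.array_equal(X.T.T, X))   # eq. (29.16): (X^T)^T = X -> True
```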
A square matrix A is called symmetric if:

A^T = A        (29.17)

In a square n×n matrix A, the main diagonal or principal diagonal consists of the elements a_ii for all i ranging from 1 to n. The latter are called the diagonal elements; all other elements are off-diagonal. A diagonal matrix D is a square matrix in which all off-diagonal elements are zero, i.e.:

d_ij = 0    if i ≠ j,    with i and j = 1, ..., n

An identity matrix I is a diagonal matrix in which all diagonal elements are equal to unity and all off-diagonal elements are zero, i.e.:

i_ii = 1    and    i_ij = 0 (i ≠ j),    with i and j = 1, ..., n

The following 3×3 matrices illustrate a square, a diagonal and an identity matrix:

A = [  3   2  -2
       2   1   0
       0  -4  -3 ]

D = [  3   0   0
       0   0   0
       0   0  -3 ]

I = [  1   0   0
       0   1   0
       0   0   1 ]
Special matrices are the null matrix 0 in which all elements are zero, and the sum matrix 1 in which all elements are unity. In the case of 3×3 matrices we obtain:

0 = [  0   0   0
       0   0   0
       0   0   0 ]

1 = [  1   1   1
       1   1   1
       1   1   1 ]
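In NumPy these special matrices have direct constructors. The short sketch below (an added illustration, not part of the original text; the diagonal values are those of the example above) builds the diagonal, identity, null and sum matrices:

```python
import numpy as np

D = np.diag([3, 0, -3])   # diagonal matrix with the given main diagonal
I = np.eye(3)             # 3x3 identity matrix
O = np.zeros((3, 3))      # null matrix (all elements zero)
S = np.ones((3, 3))       # sum matrix (all elements unity)

print(D, I, O, S, sep="\n")
```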
29.4 Matrix product

Chapter 9 dealt with the basic operations of addition of two matrices with the same dimensions, of scalar multiplication of a matrix with a constant, and of arithmetic multiplication element-by-element of two matrices with the same dimensions. Here, we formalize the properties of the matrix product that have already been introduced in Section 9.3.2.3. If X is of dimension n×p and Y is of dimension p×q, then the product Z = XY is an n×q matrix, the elements of which are defined by:

z_ik = \sum_{j=1}^{p} x_ij y_jk        with i = 1, ..., n and k = 1, ..., q        (29.18)

Note that the inner dimensions of X and Y must be equal. For this reason the operation is also called inner product, as the inner dimensions of the two terms vanish in the product. Any element of the product, say z_ik, can also be thought of as being the sum of the products of the corresponding elements of row i of X with those of column k of Y. Hence the descriptive name of rows-by-columns product. In terms of the scalar product (Section 29.2) we can write:

z_ik = x_i^T y_k        (29.19)
Throughout the book, matrices are often subscripted with their corresponding dimensions in order to provide a check on the conformity of the inner dimensions of matrix products. For example, when a 4×3 matrix X is multiplied with a 3×2 matrix Y rows-by-columns, we obtain a 4×2 matrix Z:

X = [  3   2  -1
       0  -5   4
       2   1  -2
       4   3   0 ]

Y = [  3   8
      -2   4
       4  -6 ]

Z = X Y = [  1   38
            26  -44
            -4   32
             6   44 ]
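A quick numerical check of this rows-by-columns product, written in Python/NumPy (added as an illustration; not part of the original text):

```python
import numpy as np

X = np.array([[3,  2, -1],
              [0, -5,  4],
              [2,  1, -2],
              [4,  3,  0]])
Y = np.array([[ 3,  8],
              [-2,  4],
              [ 4, -6]])

Z = X @ Y        # rows-by-columns product of a 4x3 and a 3x2 matrix
print(Z)         # [[ 1  38] [26 -44] [-4  32] [ 6  44]]
print(Z.shape)   # (4, 2): the inner dimension (3) has vanished
```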
In this theoretical chapter, however, we do not follow this convention of subscripted matrices for the sake of conciseness of notation. Instead, we will take care to indicate the dimensions of matrices in the accompanying text whenever this is appropriate. The operation of matrix multiplication can be shown to be associative, meaning that X(YZ) = (XY)Z. But it is not commutative, as in general we will have that XY ≠ YX. Matrix multiplication is distributive with respect to matrix addition, which implies that (X + Y)Z = XZ + YZ. When this expression is read from right to left, the process is called factoring-out [4]. Multiplication of an n×p matrix X with an identity matrix leaves the original matrix unchanged:

X I_p = I_n X = X        (29.20)

where I_p is the identity matrix with dimension p, and I_n is the identity matrix with dimension n. For example, by working out the rows-by-columns product one can easily verify that, for the 3×2 matrix

[  3   4
   2  -1
   0   3 ]

both post-multiplication by I_2 and pre-multiplication by I_3 reproduce the matrix itself.
A matrix is orthogonal if the product with its own transpose produces a diagonal matrix. An orthogonal matrix of dimension n×p satisfies one or both of the following relationships:

X X^T = D_n    or    X^T X = D_p        (29.21)

where D_n is a diagonal matrix of dimension n and D_p is a diagonal matrix of dimension p. In an orthogonal matrix we find that all rows or all columns of the matrix are mutually orthogonal as defined above in Section 29.2. In the former case we state that X is row-orthogonal, while in the latter case X is said to be column-orthogonal. The following 3×3 matrix X can be shown to be column-orthogonal:

X = [  2.351   0.060   0.686
       4.726   1.603  -0.301
      -1.840   4.196   0.105 ]

as can be seen by working out the matrix product:

X^T X = [ 31.248    0         0
           0       20.180     0
           0        0         0.572 ]
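Column-orthogonality is easy to verify numerically: X^T X should be diagonal within rounding error. A Python/NumPy sketch (an added illustration, not part of the original text):

```python
import numpy as np

X = np.array([[ 2.351, 0.060,  0.686],
              [ 4.726, 1.603, -0.301],
              [-1.840, 4.196,  0.105]])

G = X.T @ X
print(np.round(G, 3))               # approximately diag(31.248, 20.180, 0.572)

off_diag = G - np.diag(np.diag(G))
print(np.allclose(off_diag, 0, atol=1e-2))   # True: the columns are mutually orthogonal
```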
A matrix is called orthonormal if additionally we obtain that:

X X^T = I_n    or    X^T X = I_p        (29.22)

where I_n and I_p have been defined before. In an orthonormal matrix X, all row-vectors or all column-vectors are mutually orthonormal. In the former case, X is row-orthonormal, while in the latter case we state that X is column-orthonormal. A square matrix U is orthonormal if we can write that:

U U^T = U^T U = I        (29.23)

where I is the identity matrix with the same dimension as U (or U^T). The 3×3 matrix U shown below is both row- and column-orthonormal:

U = [  0.4205   0.0134   0.9072
       0.8455   0.3569  -0.3972
      -0.3291   0.9340   0.1388 ]

as can be seen by working out the matrix products:

U U^T = U^T U = [  1   0   0
                   0   1   0
                   0   0   1 ]

An important property of the matrix product is that the transpose of a product is equal to the product of the transposed terms in reverse order:
(XY)^T = Y^T X^T        (29.24)

This property can be readily verified by means of an example, using the 4×3 matrix X and the 3×2 matrix Y introduced above. Transposing the product XY gives:

(X Y)^T = [  1   26   -4    6
            38  -44   32   44 ]

and multiplying the transposed terms in reverse order, with

Y^T = [  3  -2   4
         8   4  -6 ]

and

X^T = [  3   0   2   4
         2  -5   1   3
        -1   4  -2   0 ]

yields the same 2×4 matrix Y^T X^T.
The trace of a square matrix A of dimension n is equal to the sum of the n elements on the main diagonal:

tr(A) = \sum_{i=1}^{n} a_ii        (29.25)

For example:

tr [  4   2  -1
      2   8  -4      =  4 + 8 + 3 = 15
      1  -4   3 ]

If X is of dimension n×p and if Y is of dimension p×n, then we can show that:

tr(XY) = tr(YX)        (29.26)

In particular, we can prove that:

tr(X^T X) = tr(X X^T) = \sum_{j=1}^{p} x_j^T x_j = \sum_{i=1}^{n} x_i^T x_i        (29.27)

where x_i represents the ith row and x_j denotes the jth column of the n×p matrix X. The proof follows from working out the products in the manner described above. This relationship is important in the case when Y equals X^T.
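Both trace identities can be checked numerically. The following Python/NumPy sketch (an added illustration, not part of the original text) uses the 4×3 matrix X from the product example above and takes Y = X^T:

```python
import numpy as np

X = np.array([[3,  2, -1],
              [0, -5,  4],
              [2,  1, -2],
              [4,  3,  0]])

# Eq. (29.26) with Y = X^T (a 3x4 matrix): tr(XY) = tr(YX)
print(np.trace(X @ X.T), np.trace(X.T @ X))   # both equal 89

# Eq. (29.27): this common value is the sum of squares of all elements of X
print(np.sum(X ** 2))                         # 89
```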
Matrix multiplication can be applied to vectors, if the latter are regarded as one-column matrices. This way, we can distinguish between four types of special matrix products, which are explained below and which are represented schematically in Fig. 29.6.

(1) In the matrix-by-vector product one postmultiplies an n×p matrix X with a p-vector y, which results in an n-vector z:

X y = z        (29.28)

For example, with the 4×3 matrix X used above and y = [3, -2, 4]^T:

X y = [1, 26, -4, 6]^T
(2) The vector-by-matrix product involves an n vector x^ which premultiplies an nxp matrix Y to yield a/? vector z^: x^Y = z" For example:
(29.29)
24
1 ^
P
z = Xy
1 z 1
Y
P
X P
1
n
T
T„
z =x Y 1z
p
^
y
Z = xy'
T
z=x y
Fig. 29.6. Schematic illustration of four types of special matrix products: the matrix-by-vector product, the vector-by-matrix product, the outer product and the scalar product between vectors, respectively from top to bottom.
25
3 [3
2
2
-1] -2 4
1 = [1 5] 3
(3) The outer product results from premultiplying an n vector x with ap vector y^ yielding an nxp matrix Z: (29.30)
xy^ = Z
The outer product of two vectors can be thought of as the matrix product between a single-column matrix with a single-row matrix:
[3, - 2 , 4] =
2x3
2x(-2)
2x4
(-5)x3
(-5)x(-2)
(-5)x4
1x3
lx(-2)
1x4
3x3
3x(-2)
3x4
6 -4 15 10 3 -2 9
-6
8 -20 4 12
For the purpose of completeness, we also mention the vector product which is extensively used in physics and which is defined as: (29.31)
xxy = z
(read as x cross y) where x, y and z have the same dimension n. Geometrically, we can regard x and y as two vectors drawn from the origin ofS"^ and forming an angle (x, y) between them. The resulting vector product z is perpendicular to the plane formed by x and y, and has a length defined by: llzll = llxll llyllsin(x,y)
(29.32)
(4) In the scalar product, which we described in Section 29.2, one multiplies a vector x^ with another vector y of the same dimension, which produces a scalar z: T
(29.33)
x'y = z For example: •-1
[3 0 2 -1]
4 -2 0
:3x(-l) + 0 x 4 + 2 x ( - 2 ) + (-l)x0 = - 7
26
The product of a matrix with a diagonal matrix is used to multiply the rows or the columns of a matrix with given constants. If X is an nxp matrix and if D^ is a diagonal matrix of dimension n we obtain a product Y in which the iih row yequals the iih row of X, i.e. x,, multiplied by the iih element on the main diagonal of Y = D„X
(29.34)
and in particular for the /th row: y, = i/„x,
with «"= !,...,«
The matrix D„ effectively scales the rows of X. For example: •2
0 0
0]
3
2
-1'
-6-4
2"
0
1 0
0
0
-5
4
0 - 5
4
3 0
2
1 -2
6
4
3
0
8
0 0
2J
0 0 0
3-6 6
0
Likewise, if D^ is a diagonal matrix of dimension p we obtain a product Y in which they'th column y^ equals theyth column of X, i.e. x^, multiplied by theyth element on the main diagonal of D^: Y = XD^
(29.35)
with; = l,...,p
and in particular for theyth column:
The matrix D^ effectively scales the columns of X. For example: 3
2
0
-5
-1] [2 4
2
1 -2
4
3
0 0
0 '-
0 -1
0' 0
0 3_
6
-2
-3
0
5
12
4
-1
-6
8
-3
0
Pre- or postmultiplication with a diagonal matrix is useful in data analysis for scaling rows or columns of a matrix, e.g. such that after scaling all rows or all columns possess equal sums of squares.
27
29.5 Dimension and rank It has been shown that the/? columns of an nxp matrix X generate a pattern of p points in S"^ which we call PP. The dimension of this pattern is called rank and is indicated by r(P^). It is equal to the number of linearly independent vectors from which all p columns of X can be constructed as Hnear combinations. Hence, the rank of P^ can be at most equal top. Geometrically, the rank of P^can be seen as the minimum number of dimensions that is required to represent the p points in the pattern together with the origin of space. Linear dependences among the p columns of X will cause coplanarity of some of the p vectors and hence reduce the minimum number of dimensions. The same matrix X also generates a pattern oin points in S^ which we call P*^ and which is generated by the n rows of X. The rank of P"is denoted as riP"^) and equals the number of linearly independent vectors from which all n rows of X can be produced as linear combinations. Hence, the rank of P"^ can be at most equal to n. Using the same geometrical arguments as above, one can regard the rank of P^ as the minimum number of dimensions that is required to represent the n points in the pattern together with the origin of space. Linear dependences between the n rows of X will also reduce this minimum number of dimensions because of coplanarity of some of the corresponding n vectors. It can be shown that the rank of P^ must be equal to that of P'^ and, hence, that the rank of X is at most equal to the smaller of n and p [3]: r{P^ ) = r{PP) = r(X) < min(n, p)
(29.36)
where r(X) is the rank of the matrix X (see also Section 9.3.5). For example, the rank of a 4x3 matrix can be at most equal to 3. In the case when there are linear dependences among the rows or columns of the matrix, the rank can be even smaller. An nxp matrix X with n>p is called singular if linear dependences exist between the columns of X, otherwise the matrix is called non-singular. In this case the rank of X equals p minus the number of linear dependences among the columns of X. If n < p, then X is singular if linear dependences exist between the rows of X, otherwise X is non-singular. In that case, the rank of X equals n minus the number of linear dependences among the rows of X. A matrix is said to be of full rank when X is non-singular or alternatively when r(X) equals the smaller of n or/?. Dimensions and rank of a matrix are distinct concepts. A matrix can have relatively large dimensions say 100x50, but its rank can be small in comparison with its dimensions. This point can be made more clearly in geometrical terms. In a 100-dimensional row-space 5^^, it is possible to represent the 50 columns of the matrix as 50 points, the coordinates of which are defined by the 100 elements in each of them. These 50 points form a pattern which we represent by P^^. It is clear
28
that the true dimension of this pattern of 50 points must be less than the number of coordinate axes of the space 5^^ in which they are represented. In fact, it cannot be larger than 50. The true dimension of the pattern P^^ defines its rank. In an extreme case when all 50 points are located at the origin of 5^^, the rank is zero. In another extreme situation we may obtain that all 50 points are collinear on a line through the origin, in which case the rank is one. All 50 columns may be coplanar in a plane that comprises the origin, which results in a rank of 2. In practical situations, we often find that the points form patterns of low rank, when the data are sufficiently filtered to eliminate random variation and artifacts. Multivariate data analysis capitalizes on this point, and in a subsequent section on eigenvectors we will deal with the algebra which allows us to find the true number of dimensions or rank of a pattern in space. We can now define the rank of the column-pattern P^^ as the number of linearly independent columns or rank of X. If all 50 points are coplanar, then we can reconstruct each of the 50 columns, by means of linear combinations of two independent ones. For example, if x^^j and Xj^2 ^^^ linearly independent then we must have 48 linear dependences among the 50 columns of X: ^7=3 ~ ^ 1 3 ^7=1 "''^23 ^7=2 X,=4 = C j 4 Xy^i + C 2 4
Xj=2
^j=50 ~^1,50 ^7=1 "^^2,50 ^7=2
where 0,3, C23,... are the coefficients of the linear combinations. The rank of X and hence the rank of the column-pattern P^^ is thus equal to 50 - 48 = 2. In this case it appears that 48 of the 50 columns of X are redundant, and that a judicious choice of two of them could lead to a substantial reduction of the dimensionality of the data. The algebraic approach to this problem is explained in Section 29.6 on eigenvectors. A similar argument can be developed for the dual representation of X, i.e. as a row-pattern of 100 points P^^ in a 50-dimensional column-space 5^^. Here again, it is evident that the rank of P^^ can at most be equal to 50 (as it is embedded in a 50-dimensional space). This implies that in our illustration of a 100x50 matrix, we must of necessity have at least 100 - 50 = 50 linear dependences among the rows of X. In other words, we can eliminate 50 of the 100 points without affecting the rank of the row-pattern of points in 5^^. Let us assume that this is obtained by reducing the 100x50 matrix X into a 50x50 matrix X'. The resulting pattern of the rows in X' now comprises only 50 points instead of 100 and is denoted here by the symbol pioo-50 ^hich has the same rank as P^^. Since we previously assumed 48 linear
29
dependences among the columns of X we must necessarily also have 48 additional linear dependences among the rows of X'. Hence the rank of p^^^^^ in S^^ is also equal to 2. Summarizing the results obtained in the dual spaces we can write that: r(P^^) = r(P^^) = r(X)=2 We will attempt to clarify this difficult concept by means of an example. In the 4x3 matrix X we have an obvious linear dependence among the columns: "3
5 2
X=
2 -1
smce Xy^3 = Xy^i + Xy^2 • By simple algebra we can derive that there is also a linear dependence among the rows of X, namely:
11X/=i
^i=4 ~ '
29
+
29
^i=2
Hence we can remove the fourth row of X without affecting its rank, which results in the 3x3 matrix X": 3 5 8 X= 7 2 9 1 2 3 Note that the linear dependence among the columns still persists. We now show that there remains a second linear dependence among the rows of X':
X i=3
11
= •
29
29
i=2
Thus we have illustrated that the number of independent rows, the number of independent columns and the rank of the matrix are all identical. Hence, from geometrical considerations, we conclude that the ranks of the patterns in row- and column-space must also be equal. The above illustration is also rendered geometrically in Fig. 29.7. The rank of a product of two matrices X and Y is equal to the smallest of the rank
30
Fig. 29.7. Illustration of a pattern of points with rank of 2. The pattern is represented by a matrix X with dimensions 5x4 and a linear dependence between the three columns of X is assumed. The rank is shown to be the smallest number of dimensions required to represent the pattern in column-space S^ and in row-space S".
ofXandY: r(XY) = min(r(X), r(Y))
(29.37)
This follows from the fact that the columns of XY are linear combinations of X and that the rows of XY are linear combinations of Y [3]. From the above property, it follows readily that: r(XX^) = r(X^X) = HX)
(29.38)
where the products XX^ and X^X are of special interest in data analysis as will be explained in Section 29.7.
29.6 Eigenvectors and eigenvalues A square matrix A of dimension p is said to be positive definite if: x^Ax > 0
(29.39)
for all non-trivial p vectors x (i.e. vectors that are distinct from 0). The matrix A is said to be positive semi-definite if:
31
x^Ax > 0
for all X ;^ 0
(29.40)
It can be shown that all symmetric matrices of the form X^X and XX^ are positive semi-definite [2]. These cross-product matrices include the widely used dispersion matrices which can take the form of a variance-covariance or correlation matrix, among others (see Section 29.7). An eigenvalue or characteristic root of a symmetric matrix A of dimension p is a root X^ of the characteristic equation: IA-^2II = 0
(29.41)
where IA - ^^11 means the determinant of the matrix A - V'l [2]. The determinant in this equation can be developed into a polynomial of degree p of which all/7 roots yj- are real. Additionally, if A is positive semi-definite then all roots are nonnegative. Furthermore, it can be shown that the sum of the eigenvalues is equal to the trace of the symmetric matrix A:
I^i=tr(A)
(29.42)
and that the product of the eigenvalues is equal to the determinant of the symmetric matrix A:
n^=iAi
(29.43)
By way of example we construct a positive semi-definite matrix A of dimensions 2x2 from which we propose to determine the characteristic roots. The square matrix A is derived as the product of a rectangular matrix X with its transpose in order to ensure symmetry and positive semi-definitiveness: "2 -5
-\ 4
A = XTX =
1 -2 0 _3
39
-24
-24
21
from which follows the characteristic equation:
|39_^2 \X-XH\--
-24
_24 I =0 21-^2 1
The determinant in this characteristic equation can be developed according to the methods described in Section 9.3.4:
32
IA-;i2i|^(39-;i2)(21-X^)-(-24)(-24) = 0 which leads to a quadratic equation in }}\ (X^y -(21+39)>.2 + 3 9 x 2 1 - 2 4 x 2 4 = 0 or (?i2)2 _60;i2 +243 = 0
From the form of this equation we deduce that the characteristic equation has two positive roots: ?i^ 2 = 30 ± (30^ - 243)^/2 ^ 30 + 25.632
or X] =55.632
and
X\ =4.368
It can be easily verified that: 2
XM=tr(A) = 60
n^
2^=IAI = 243
If A is a symmetric positive definite matrix then we obtain that all eigenvalues are positive. As we have seen, this occurs when all columns (or rows) of the matrix A are linearly independent. Conversely, a linear dependence in the columns (or rows) of A will produce a zero eigenvalue. More generally, if A is symmetric and positive semi-definite of rank r < /?, then A possesses r positive eigenvalues and (p - r) zero eigenvalues [6]. In the previous section we have seen that when A has the form of the product of a matrix X with its transpose, then the rank of A is the same as the rank of X. This can be easily demonstrated by means of a simplified illustration: 1 -2" =
-3 2 -1
6 -4
A = X^X =
15 -30
-30 60
2_
Note that there is a linear dependence in X which is transmitted to the matrix of cross-products A:
33
The singularity of A can also be ascertained by inspection of the determinant lAI which in this case equals zero. As a result, the characteristic equation has the form of a degenerated quadratic: (Vy
- ( 1 5 + 60);i2 + 1 5 x 6 0 - 3 0 x 3 0 = 0
The last term of the characteristic equation is always equal to the determinant of A, which in this case equals zero. Hence we obtain:
which leads to: X] = 75
and
X\ = 0
with 2
XM=tr(A) = 75 k
and
n^i=iAi=o k
An eigenvector or characteristic vector is a nontrivial normalized vector v (distinct from 0) which satisfies the eigenvector relation: {k-XH)\^Q
or
Av = ?i2v
(29.44)
from which follows that: v'^Av = X^
(29.45)
because of the orthonormality condition: v^v=l We have seen above that a symmetric non-singular matrix of dimensions/?x;? hasp positive eigenvalues which are roots of the characteristic equation. To each of these p eigenvalues V- one can associate an eigenvector v. The p eigenvectors are normalized and mutually orthogonal. This leads us to the eigenvalue decomposition (EVD) of a symmetric non-singular matrix A: V^AV^A^
(29.46)
34
with the orthonormality condition:
where I^ is thepxp identity matrix and where A^ is apx/? diagonal matrix in which the elements of the main diagonal are the p eigenvalues associated to the eigenvectors (columns) in the pxp matrix V. Because A^ is a diagonal matrix, the decomposition is also called diagonalization of A. Algorithms for eigenvalue decomposition are discussed in Section 31.4. These are routinely used in the multivariate analysis of measurement tables and contingency tables (Chapters 31 and 32). Because of the orthonormality condition we can rearrange the terms of the decomposition in eq. (29.46) into the expression: A = VA^V^
(29.47)
which is known as the spectral decomposition of A. The latter can also be expanded in the form: A = ^^ V, v^ + . . . + ^^, V, v l + . . . + ^
V, v^
(29.48)
where v^ represents the Jdh column of V and where \\ is the eigenvalue associated tov^. By way of example, we propose to extract the eigenvectors from the symmetric matrix A of which the eigenvalues have been derived at the beginning of this section: A=
39 -24
-24 21
and which has been found to be positive definite with eigenvalues: ?.2= 55.632
and
X^ =4.368
The eigenvector relation for the first eigenvector can then be written as:
{K-X\l)y,
=
"39 - 55.632 -24
-24 21-55.632, , v^x
=0
where v^ and Vj2 are the unknown elements of the eigenvector Vj associated to A,^. The determinant of the corresponding system of homogeneous linear equations equals zero: IA-?i2i| = (_i6.632)(-34.632)-(-24)(-24) = 0 within the limits of precision of our calculation. Hence, we can solve for the unknowns v'^ and v'21, which are the non-normalized elements v^ and V2\.
35
-16.632 v ' „ -
24v'2j=0
-24v',i - 34.632 v'21 =0 This leads to the solution: v'n = 1 21
-16.632/24 = -24/34.632 = -0.693
Since the norm of \\ is defined as: IIv^ Il = (v7i + v'l, )i/2 =(1 + 0.6932)1/2 ^1.217 we derive the elements of the normalized eigenvector Vj from: vji =v'ii/llvVI = l/l.217=0.822 V2i= v'21/llv'ill = -0.693/1.217 = - 0 . 5 6 9 or Vi = [0.822-0.569f In order to compute the second eigenvector, we make use of the spectral decomposition of the matrix A: A-?l2 y 1
..T
_ •
M^l
' 2 V.Vn^ ^2^2
where the left-hand member is called the residual matrix (or deflated matrix) of A after extraction of the first eigenvector. This reduces the problem to that of finding the eigenvector V2 associated to 'k\ from the residual matrix:
-Mv,v;^ =
39 -24 "1.417 2.045
0.822 -24 [0.822 - 55.6322 21_ -0.569
-0.569]
2.045 2.S>51
We now have to solve the eigenvector relation: {K-X\y,y1
-'k\l)y^
=
1.417 - 4.368 2.045
2.045 2.951-4.368
'12
=0
'22
where v^2 and V22 are the unknown elements of the eigenvector V2 associated to X, 2. The determinant of the residual matrix is also zero:
36
\A~X]
v,v^ ^ 1 II = (-2.951)(-1.417)-2.045x2.045 = 0
within the limits of precision of our calculation. Now, we can solve the system of homogeneous linear equations for the unknowns v',, 12 'and v'„ 22 which are the non-normalized elements of v,^ 12 ' and v "22-2.951 v',2 -l-2.045v'22 :0 2.045 v' 12
1.417 v'22 = 0
from which we derive that: 1
12 •
v'„ = 2.951/2.045 = 2.045/1.417 = 1.443 After normalization we obtain the elements v,2 and V22 of the eigenvector V2 associated to A, 2: v,2 =v',2 /llv'2 11=1/1.756 = 0.569 V22 =v'22 /llv'211= 1.433/1.756 = 0.822 where llv'2 II = (1+1.4332)"2 ^ i 7 5 g It can be shown that the second eigenvector Vj can also be computed directly from the original matrix A, rather than from the residual matrix A - X\\f\J, by solving the relation: ( A - ^ 1)^2 = 0 This follows from the orthogonality of the eigenvectors v, and V2. We have preferred the residual matrix because this approach is used in iterative algorithms for the calculation of eigenvectors, as is explained in Section 31.4. Finally, we arrange the eigenvectors column-wise into the matrix V, and the eigenvalues into the diagonal matrix A^: V=
0.822 0.569 -0.569 0.822
A^ =
55.632 0 0 4.368
From V and A-^ one can reconstruct the original matrix A by working out the consecutive matrix products: VA^V T
_
0.822 0.569 -0.569 0.822
55.632 0 0 4.368
0.822 0.569
-0.569 0.822
37
39
-24
-24
21
=A
The 'paper-and-pencir method of eigenvector decomposition can only be performed on small matrices, such as illustrated above. For matrices with larger dimensions one needs a computer for which efficient algorithms have been designed (Section 31.4). Thus far we have considered the eigenvalue decomposition of a symmetric matrix which is of full rank, i.e. which is positive definite. In the more general case of a symmetric positive semi-definite pxp matrix A we will obtain r positive eigenvalues where r?. In the general case we obtain a/7Xr matrix of eigenvectors V such that: V ^ A V = A2
or
A = VA2V^
(29.49)
with the orthonormality condition:
where I^ represents the identity matrix of dimension rxr, and where Js} is the diagonal matrix whose elements on the main diagonal are the r positive eigenvalues, with r < /?. As a general rule we cannot write that VV^ equals I^ but the product possesses a property which is analogous to that of an identity matrix: VV^A = A VV^ = A
(29.50)
which holds for the matrix A from which V has been computed. This remarkable property is illustrated by means of the singular 2x2 matrix A which we have already introduced at the beginning of this section: A=
15 -30
-30 60
0.447 [75] [0.447 -0.894
- 0.894] = VA^ V^
and which possesses only one eigenvector and associated eigenvalue. By working out the matrix product of V with its transpose we obtain that: VV T
0.2 -0.4
_
-0.4 0.8
which is different from the 2x2 identity matrix. Nevertheless we find that: AW
T
_
15 -301 0.2 30 60 -0.4
-0.4 0.8
15 -30 =A -30 60
38
and 0.2 •0.4
- 0 . 4 ] r 15 -30 0.8 -30 60
15 -30 =A -30 60
The spectral decomposition of a symmetric matrix A can be rewritten in the form: A = VA^VT = VAAVr = SST
(29.51)
where: S = VA and where A is a diagonal matrix whose main diagonal elements are equal to the square roots of those of A^. This expression is also known as the Young-Householder factorization of a symmetric matrix [6]. The spectral decomposition can also be used to compute various powers of a symmetric matrix. For example, the square of a symmetric matrix A of dimension p can be computed by means of its spectral decomposition: A^ = A A = ( V A 2 V T ) ( V A 2 V ^ ) = VA^V'^
(29.52)
where use is made of the orthonormality of the columns of V. This procedure generally applies to all real exponents and holds also for singular symmetric matrices (for which r
(29.53)
for all real exponents s.
It should be noted that in the case of a singular matrix A, the dimensions of V and A^ are pxr and rxr, respectively, where r is smaller than p. The expression in eq. (29.53) allows us to compute the generalized inverse, specifically the MoorePenrose inverse, of a symmetric matrix A from the expression: (29.54)
A-^ = VA-^V^
which holds even when A is singular. It provides a more general method for computing inverses, but requires that an algorithm for eigenvector extraction is available. For the previously defined 2x2 matrix A we obtain the inverse from its eigenvalue decomposition, which has already been derived: A=
39 -24
-24 21
0.822 -0.569
0.569 0.822
55.632 0 0 4.368
0.822 0.569
-0.569 0.822
39 2\TT
= \A^\ A-' =
0.822 0.569 -0.569 0.822
0.017975 0 0.822 -0.569 0 0.22894 0.569 0.822
0.0863 0.987 " 0.987
0.1605
It can be easily verified that: AA-^=A-iA = l2 within the numerical precision of our calculation. From the spectral decomposition we can deduce that A and A^ always have the same rank, since the rank of a matrix is equal to the number of nonzero eigenvalues which are also the elements on the main diagonal of A^ (or A^OA theorem, which we do not prove here, states that the nonzero eigenvalues of the product AB are identical to those of BA, where A is an nxp and where B is a pxn matrix [3]. This applies in particular to the eigenvalues of matrices of cross-products XX^ and X^X which are of special interest in data analysis as they are related to dispersion matrices such as variance-covariance and correlation matrices. If X is an nxp matrix of rank r, then the product X^X has r positive eigenvalues in A^ and possesses r eigenvectors in V since we have shown above that: X^X = VA^2\7T V
with
\^Y = V
and
r
(29.55)
The transposed product XX^ has the same r eigenvalues in A^, but generally possesses r different eigenvectors in U: 2ITT XXT = UA2U
with
U^U = L
and
r
(29.56)
We have already mentioned that the determinant of a square symmetric pxp matrix A of the form X^X can be expressed in terms of its p eigenvalues: IAI = IX'rXI = n
X
(29.57)
This is an alternative way of computing determinants which, unlike the usual one described in Section 9.3.4, allows for a geometrical interpretation. As we have shown above (Section 29.3), a pxn matrix X can be interpreted geometrically as a pattern of points P*" in the space S^. In the case of an (hyper) ellipsoidal pattern we find that the axes of symmetry coincide with the eigen-
40
vectors of X^X and that the lengths of the semi-axes are equal to the square roots of the corresponding eigenvalues. This is the case when the data represented by X follow a multivariate normal distribution [3]. The product of the square roots of the eigenvalues can thus be regarded as the product of the lengths of the semi-axes of the ellipsoid. Hence, this product is proportional to the volume enclosed by the (hyper)ellipsoid. From eq. (29.57) follows that the determinant lAI or IX'^XI can also be regarded as being proportional to the volume enclosed by an ellipsoidal pattern. In this case, the semi-axes of the (hyper)ellipsoid are also oriented in the direction of the eigenvectors of X^X but their lengths are proportional to the corresponding eigenvalues. When /? = 2 we obtain that the determinant is proportional to the area of the ellipse nX] X\; when /? = 3 the determinant is proportional to the volume of the ellipsoid (4n/3)X]X\'k\, etc. In some applications it is required that this volume be minimal. For example, in an experimental design with design matrix X, D-optimality is achieved when the determinant, i.e. the volume of the ellipsoid, derived from (X^X)~^ is minimized (Section 24.2.1). This geometrical interpretation may clarify the concept of degeneracy, when two or more eigenvalues are identical. In that case, the (hyper)ellipsoid which encloses the pattern P^ can be seen to be spherical in two or more dimensions of S^. Singularity of the matrix A occurs when one or more of the eigenvalues are zero, such as occurs if linear dependences exist between the p rows or columns of A. From the geometrical interpretation it can be readily seen that the determinant of a singular matrix must be zero and that under this condition, the volume of the pattern P^ has collapsed along one or more dimensions of 5^. Applications of eigenvalue decomposition of dispersion matrices are discussed in more detail in Chapter 31 from the perspective of data analysis. Singular value decomposition (SVD) of a rectangular matrix X is a method which yields at the same time a diagonal matrix of singular values A and the two matrices of singular vectors U and V such that: X = UAV^
with
U'^U = V^V = I,
(29.58)
where r is defined as above. This approach is also covered in greater detail in Chapter 31. We only state here that the singular vectors in U and V of X are identical to the eigenvectors of X^X and XX'^, respectively, up to algebraic signs, and that the singular values in the diagonal matrix A are equal to the positive square roots of the corresponding eigenvalues in A^. It is important to realize that the eigenvectors in U and in V are computed independently from one another, while the corresponding singular vectors are computed simultaneously (such as in the NIPALS algorithm which is discussed in Section 31.4.1). The difference between eigenvectors and singular vectors resides only in a possible disagreement between their algebraic signs.
41
From the 4x2 matrix X of our previous illustration we already derived V and A^ from the eigenvalue decomposition of the 2x2 cross-product matrix X^X:
r 2 -f X=
-5
4
1 -2 _ 3
0_
X'^X = VA^V^ 0.822 0.569] 55.632 0 •0.569 O522J 0 4.368 39
-24'
•24
21_
0.822 0.569
-0.569 0.822
In a similar way we can derive the eigenvalue decomposition of the corresponding 4x4 cross-product matrix XX^: XX^
=UA2U^
0.297
0.152
-0.856
0.210
0.263
-0.514
0.331
0.818
55.632 0
0
0.297
-0.856
4.368
0.152
0.210
0.263 0.331 -0.514 0.818
6 4 5 -14 41 -13 -15 14 4 -13 5 3 6 -15 3 9 From these results we can now define the singular vector decomposition of the 4x2 data matrix X:
42
0.297 0.152] -0.856 0.210 r7.459 0 1 [0.822 -0.569 UAV = 0.263 -0.514 [o 0.822 2.09oJ [o.569 0.331 0.818 J 2 -1" -5 4 =X 1 -2 3 0 within the limits of the precision of our calculation. Note that the algebraic signs of the columns in U and V are arbitrary as they have been computed independently. In the above illustration, we have chosen the signs such as to be in agreement with the theoretical result. This problem does not occur in practical situations, when appropriate algorithms are used for singular vector decomposition.
29.7 Statistical interpretation of matrices The vector of column-means m^ of an nx/? matrix X is defined as follows: m
II X
or
1 '' ^7 = - Z ^ ^ 7
withy = 1, ...,p
(29.59)
where 1„ is the sum vector of dimension n which is composed of n elements equal to 1 [ 1 ]. For example, when n equals four, we define the sum vector by means of:
h=
The elements of m^ are the coordinates of the column-centroid. Geometrically, this point corresponds with the center of mass of the row-pattern P'^ of points formed in column-space S^. In this chapter we assume that all points in a pattern are given the same mass. In data analysis it is possible to assign different masses to individual
43
points, thus giving more weight to certain points than to others. This aspect will be further developed in Chapter 32. In order to be consistent with the concept of the two dual data spaces, we must also define the vector of row-means m^: 1 1 ^ m ^ = —XI or ^i =-~y\^ij with/= 1,..., n (29.60) P P j where 1^ is the sum vector of dimension p. The elements of m^ represent the coordinates of the row-centroid, which corresponds with the center of mass of the column-pattern P^ of points formed in row-space S"^. The global mean m is a scalar defined by: m = — llXl^ np
or
m= — XS^.y np i J
(29.61)
Usually, the raw data in a matrix are preprocessed before being submitted to multivariate analysis. A common operation is reduction by the mean or centering. Centering is a standard transformation of the data which is applied in principal components analysis (Section 31.3). Subtraction of the column-means from the elements in the corresponding columns of an nxp matrix X produces the matrix of deviations from column-means or the column-centered matrix Y^: Y^ = X-l„ml
(29.62)
or y^j =x-j - mj
with /= 1, ..., n andy = 1, ..,/?
where the outer product has dimension nxp. Note that this outer product is required in order for the above expression to be consistent with matrix addition. The geometrical interpretation of column-centering is a parallel translation of the pattern of points P^ representing the rows in column-space S^ such that the center of mass coincides with the origin of space. Column-centering does not affect distances between points of P"^ in S^. It does change, however, distances of P^ in the dual space S"^. Figures 29.8a and b show the dual patterns in the dual spaces before and after column-centering. The geometrical interpretation of columncentering is more fully explained in Chapter 31. Similarly as in eq. (29.62), subtraction of the row-means from the elements in the corresponding rows of the matrix X results in the matrix of deviations from row-means or row-centered matrix Y„: Y„ = X-m„i;^
(29.63)
44
Fig. 29.8. (a) Pattern of points in column-space S^ (left panel) and in row-space S" (right panel) before column-centering, (b) After column-centering, the pattern in S^ is translated such that the centroid coincides with the origin of space. Distances between points in S^ are conserved while those in 5" are not. (c) After column-standardization, distances between points in 5^ and S" are changed. Points in S" are located on a (hyper)sphere centered around the origin of space.
45
or J IJ
^ IJ
with /= 1,..., n andy = 1, ...,p
"^l
where the outer product has dimension nxp. Row-centering achieves a parallel translation of the pattern of points P^ representing the columns in row-space S"^ such that the center of mass coincides with the origin. It leaves distances in 5^^ unchanged, but affects distances in 5^. Finally, we can center the matrix X simultaneously by rows and by columns, which yields the matrix of deviations from row- and column-means or the doublecentered matrix Y:
Y=
X-\ml-m„\l+m\M
(29.64)
or yij =
X,
rri: m, + m
with / = 1,..., n andj = 1, ...,p
where all outer products are of dimension nxp. For example, given a 4x3 matrix X:
X=
1
5
7
-1
-3
9
1
3
-2
we derive (a) the vector of column-means nip and the column-centered matrix Y^: m =[1.50
p
4.00
2.75^
-0.5 1 5.5 - 5
0.25 1.25
-4.5
3.25
-0.5
5 -1
-475
It can be verified that the sums of Y^ computed column-wise are zero. (b) the vector of row-means m„ and the row-centered matrix Y„: m„ =[3.000 3.333 4.000 0.667]^
46
-2.000 Y„ =
2.000
0.000'
3.667 -4.333
0.667
-7.000
5.000
2.000
0.333
2.333
-2J667
It can be verified that the sums of Y„ computed row-wise are zero. (c) the global mean m and the double-centered matrix Y: m = 2.750
Y=
-0.750
0750
0.000
4.917
-5.583
0.667
-5.750
3750
2.000
1.583
1.083
-2.667
In the above matrix Y we obtain that the sums computed row- and column-wise are zero. A vector of column-standard deviations dp provides for each column of X a measure of the spread of the elements around the corresponding column-mean: \ Ml
,1/2
dT =
or
p
dj =
withy = 1, ...,/?
z^yl
(29.65)
where Y^ is the column-centered matrix obtained from X and where the power symbols indicate that all elements of Y^ are to be squared and that the square root of all elements in the vector-to-matrix product is to be taken. We divide by n rather than by n-\ in the above definition of standard deviation. The former seems to be more appropriate in exploratory data analysis, where the emphasis lies in revealing geometrical structure in the data. The latter is preferred in confirmatory data analysis, where the focus is on estimating statistical properties from sampled data. Each element dj of the vector d^ contains the norm or length of the corresponding column y^ in Y^, multiplied by a constant: .1/2
1 dj =
|y./|
smce
iy,n =
\yl
The geometrical interpretation of column-standard deviations is in terms of distances of the points representing the columns of Y^ from the origin of S'^Cmulti-
47
plied by the constant 1 / V AI). Division of each element of an nxp column-centered matrix Y^ by its corresponding column-standard deviation yields the matrix of column-standardized scores or the column-standardized matrix Z„:
z.=Y^;
(29.66)
or with /= 1, ..., n andy = 1, ...,/?
'yij'^^j
provided that d- ^ 0, and where D^ is the diagonal matrix of dimension p in which the main diagonal elements are equal to the elements in d^. All columns in Z^ have mean zero and unit standard deviation. After column-standardization, the rowpattern P" is centered about the origin of 5^. The points in the column-pattern P^ of p points in the dual space S^ are at equal distances (exactly from the origin. Hence, after column-standardization, points in S"^ are forced to lie on a (hyper)sphere around the origin. Note that angular distances in 5^ after column-standardization are the same as those defined by column-centering (Fig. 29.8c). In S^ distances after column-standardization are different from those defined by column-centering. Similar definitions to those in eqs. (29.65) and (29.66) can be derived for the matrix of deviations from row-means Y„ and for the row standardized matrix Z„: 111
All
d
-
•YM,
\P
or
with/=!,...,«
d, = /p y j
(29.67)
J ,1/2
Since
iy,n = VJ
(29.68)
Z . =D-^ Y or Zij=yij/di
with / = 1, ..., n
and
j=l,...,p
provided that d^ ^ 0, and where D„ is the diagonal matrix of dimension n in which the main diagonal elements are equal to the elements in d^. Row-standardization forces points in S^ on a (hyper)sphere about the origin. In 5^, angular distances after row-standardization are the same as those defined by row-centering. Distances defined in 5^ by row-standardization are different from
48
those defined by row-centering. Note that it is not generally possible to obtain a double-standardized matrix, i.e. a matrix which has at the same time unit row- and unit column-standard deviations. Using the same matrix X as in the illustration above we derive: (a) the vector of column-standard deviations d^ and the column-standardized matrix Z^ from the column-centered matrix Y^: d^ =[3.571 3.606 2.947]'^
z.-
-0.140
0.277
0.085
1.540
-1.387
0.424
-1.260
1.387
1.103
-0.140
-0.277
-1.612
(b) the vector of row-standard deviations d^ and the row-standardized matrix Z^ from the row-centered matrix Y^: d„-[1.633 3.300 5.099 2.055]'^ -1.225 1.225 0.000 1.111 -1.313 0.202 -1.373 0.981 0.392 0.162 1.136 -1.298 After preprocessing of a raw data matrix, one proceeds to extract the structural features from the corresponding patterns of points in the two dual spaces as is explained in Chapters 31 and 32. These features are contained in the matrices of sums of squares and cross-products, or cross-product matrices for short, which result from multiplying a matrix X (or X^) with its transpose:
s =xx^ s, = x^x
(29.69) (29.70)
where X is an nxp data matrix. The nxn matrix S„ is symmetric and contains all scalar products that can be formed from the rows of X. The px/? matrix S^ is also symmetric and contains all scalar products that can be formed from the columns of X. Using the previously defined properties of vectors we can derive from these cross-products the norms of the vectors, their angular distances and the distances between their endpoints. This shows that the structural information of the patterns in space are contained in the cross-product matrices.
49
From a previously derived property of the trace of a product of two matrices (eq. (29.26)) follows that: tr(S,) = tr(XXT) = tr(X^X) = tr(Sp
(29.71)
For the 4x3 matrix X in the previous illustration in this section, we derive the cross-product matrices: 35 14
s„ = 60
14 66
60 -6
10 -4
-6
126
12
10
-4
12
14
s.=
' 60
-26
11"
-26
116
59
11
59
65
for which holds that: tr(S„) = tr(SJ = 241 A special form of cross-product matrix is the variance-covariance matrix (or covariance matrix for short) C^, which is based on the column-centered matrix Y^ derived from an original matrix X: C p = - Y ^p Yp n
(29.72)
The matrix C^ contains the variances of the columns of X on the main diagonal and the covariances between the columns in the off-diagonal positions (see also Section 9.3.2.4.4). The correlation matrix R^ is derived from the column-standardized matrix Z^: (29.73)
n
which contains the coefficients of correlation between the columns of X. From Y^ and Z^, which have been computed in the previous examples, we obtain the corresponding column-covariance matrix C^ and column-correlation matrix R^: 12.750 Cp = -12.500 -1.375
-12.500 13.000
-1.375 3.750
3.750
8.688
50
1 R.
-0.971
-0.131
-0.971
1
0.353
-0.131
0.353
1
Similar expressions to those in eqs. (29.72) and (29.73) can be derived for the variance-covariances and correlations between the rows of X: 1
(29.74)
Y Y^ n
n
(29.75) P where Y„ and Z„ now represent the row-centered and row-standardized matrices, respectively, as defined above. The row-covariance matrix C„ and the row-correlation matrix R„ are obtainable from the previously computed Y„ and Z„:
C =
2.667
-5.333
8.000
-5.333
10.889
-15.333
8.000
-15.333 -3.556
26.000
-0.990 1
0.961
1.333 1 -0.990 R„ = 0.961 0.397
1.333
1.333' -3.556 1.333 4.222_
0.397" -0.911 •-0.524 1 0.127 -0.911 0.127 -0.524 1
The eigenvectors extracted from the cross-product matrices or the singular vectors derived from the data matrix play an important role in multivariate data analysis. They account for a maximum of the variance in the data and they can be likened to the principal axes (of inertia) through the patterns of points that represent the rows and columns of the data matrix [10]. These have been called latent variables [9], i.e. variables that are hidden in the data and whose linear combinations account for the manifest variables that have been observed in order to construct the data matrix. The meaning of latent variables is explained in detail in Chapters 31 and 32 on the analysis of measurement tables and contingency tables.
51
©
® Fig. 29.9. (a) Geometrical representation of the image s in 5" of a vector v in S^. To each point in S" corresponds a vector in S^, and vice versa, (b) Geometrical representation of the image 1 in S^ of a vector u in S". To each point in S^ corresponds a vector in 5", and vice versa.
29.8 Geometrical interpretation of matrix products The matrix-to-vector product can be interpreted geometrically as a. projection of a pattern of points upon an axis. As we have seen in Section 29.4 on matrix products, if X is an nxp matrix and if v is a /? vector then the product of X with v produces the n vector s: s = Xv
(29.76)
in particular, we can express the /th element of s by means of the scalar product of the /th row of X with v: S:
=X
with /= 1, ..., n
52
The matrix X defines a pattern P"" of n points, e.g. x^, in 5^ which are projected perpendicularly upon the axis v. The result, however, is a point s in the dual space 5". This can be understood as follows. The matrix X is of dimension nxp and the vector V has dimensions p. The dimension of the product s is thus equal to n. This means that s can be represented as a point in 5^^. The net result of the operation is that the axis v in 5^ is imaged by the matrix X as a point s in the dual space 5". For every axis v in 5^ we will obtain an image s formed by X in the dual space. In this context, we use the word image when we refer to an operation by which a point or axis is transported into another space. The word projection is reserved for operations which map points or axes in the same space [11]. The imaging of v in S^ into s in S"" is represented geometrically in Fig. 29.9a. Note that the patterns of points P"^ and P^ are represented schematically by elliptic envelopes. In a similar way, we can think of X^as representing a pattern P^ ofp points, e.g. Xy, in 5" which can be projected upon an axis u and which results into a point I (not to be confounded with the sum vector 1) in the complementary (or dual) space 5^ (Fig. 29.9b): (29.77)
l = X^u
in particular, we can express the jih element of 1 as the scalar product of the jth column of X with u: withy = 1, ...,/7
/. =xTu
Using the same argument as above, we can see that the product is of dimension p, since the matrix X^ has dimensions pxn and the vector u possesses dimension n. Hence, the vector can be imaged as a single point in the dual space S^. We state that the vector u in 5" is imaged by the matrix X^ into the point 1 in the dual space S^.
:P
^
Fig. 29.10. Geometrical interpretation of multiple linear regression (MLR). The pattern of points in S^ representing a matrix X is projected upon a vector b, which is imaged in S" by the point y. The orientation of the vector b is determined such that the distance between y and the given y is minimal.
53
In a general way, we can state that the projection of a pattern of points on an axis produces a point which is imaged iJi the dual space. The matrix-to-vector product can thus be seen as a device for passing from one space to another. This property of swapping between spaces provides a geometrical interpretation of many procedures in data analysis such as multiple! linear regression and principal components analysis, among many others [12] (see Chapters 10 and 17). In multiple linear regression (A^LR) we are given an nxp matrix X and an n vector y. The problem is to find aii unknown p vector b such that the product y of X with b is as close as possible to the original y using a least squares criterion: y = Xb
(29.78)
where X describes n objects (rows) by means of p independent measurements (columns) and where y represents the dependent variable for each of the n objects. In particular, for the ith object we write: y^ =xjh
with / = 1, ..., n
The least squares criterion states that the norm of the error between observed and predicted (dependent) measurements 11 y - yl I must be minimal. Note that the latter condition involves the minimization of a sum of squares, from which the unknown elements of the vector b can be determined, as is explained in Chapter 10. The geometrical interpretation of MLR is given in Fig. 29.10. The n rows (objects) of X form a pattern P"" of points (represented by x^) which is projected upon an (unknown) axis b. This causes the axis b in S^ to be imaged by X in the dual space S"^ at the point y. The vector of observed measurements y has dimension n and, hence, is also represented as a point in S'^. Is it possible then to define an axis b in S^ such that the predicted y coincides with the observed y? Usually this will not be feasible. One may propose finding the best possible b such that y comes as close to y as possible. A criterion for closeness is to ask for the distance between y and y, which is equal to the norm 11 y - yl I, to be as small as possible. From the above we conclude that the product of a matrix with a vector can be interpreted geometrically as an operation by which a pattern of points is projected upon an axis. This projection produces an image of the axis at a point in the dual space. The concept can be extended to the product of a matrix with another matrix. In this case we can conceive of the latter as a set of axes, each of which produces image points in the dual space. In the special case when this matrix has only two columns, the product can be regarded as a projection of a pattern of points upon the plane formed by the two axes. As a result one obtains two image points (one for each axis that defines the plane of projection) in the dual space. The least squares solution of MLR can be formally defined in terms of matrix products (Section 10.2):
54
h = (X'X)-'X'y
(29.79)
where X is the nxp matrix of independent variables, y is the n vector of dependent observations and b is the p vector of regression coefficients. This leads to an expression for the n vector of estimated dependent observations y: y = X(X'^X)-^X^y
(29.80)
where the term X(X^X)"^X^ is called an orthogonal projection operator. In geometrical terms, this operator projects from 5'' to 5^ and back to S"^. When applied to the dependent observations in y, it returns the predictions in y that can be obtained by using the information contained in X. In other words, y reproduces the part of the variation in y that can be predicted from X. The above expression is useful, for example, in the case when one obtained p independent measurements from a number n^ of additional samples in the form of an n^xp additional data matrix X^,. Predictions for the corresponding dependent observations y^ are then derived from: i
=:X,(X^X)-^X^y
(29.81)
In the general case we use the symbols U and V to represent projection matrices in S'^ and 5^, each containing r projection vectors, and the symbols S and L to represent their images in the dual space:
s = xv
for a projection in 5^, producing an image in S^
(29.82)
L = X^U
for a projection in 5^^, producing an image in S^
(29.83)
Fig. 29.11. Geometrical interpretation of a rotation of 5^ as a change of the frame of coordinate axes to new directions which are defined by the columns in the rotation matrix V (left panel). Likewise, a rotation ofS" can be interpreted as a change of the frame of coordinate axes to new directions which are defined by the columns in the rotation matrix U (right panel).
55
where X is an nxp matrix and where V and U are/7Xr and nxr projection matrices, respectively. The images S and L are nxr and/?xr matrices, respectively. We have chosen the symbols S and L in accordance with conventional nomenclature used in multivariate data analysis, where projections of rows of a data matrix are called scores and where projections of columns are called loadings. An orthogonal projection is obtained when the projection matrices V or U are orthonormal: V^V = U^U = I,
(29.84)
where I^ is the identity matrix of dimension r, i.e. the number of projection vectors in V and U. Eigenvector projections are those in which the projection vectors u and v are eigenvectors (or singular vectors) of the data matrix. They play an important role in multivariate data analysis, especially in the search for meaningful structures in patterns in low-dimensional space, as will be explained further in Chapters 31 and 32 on the analysis of measurement tables and general contingency tables. A projection is called a rotation when the projection matrix U or V is nonsingular and square. In a geometrical sense, one obtains in this case that the number of projection vectors equals the number of dimensions of space. The result of a rotation is a change of the frame of coordinates (Fig. 29.11). A non-orthogonal rotation will result in an oblique frame of coordinate axes, and interdistances between points that represent rows and columns will generally not be conserved. Of special interest are orthogonal rotations which are obtained by multiplying a matrix with an orthonormal rotation matrix. In the case of an nxp non-singular matrix X with n>p,we have: S = XV
with
VV^ = V^V = I^
(29.85)
where V is apxp orthonormal rotation matrix and where S denotes the rotated nxp matrix X. In the case of an nxp non-singular matrix X with « < p, we obtain: L = X^U
with
UU^ = U^U = I,
(29.86)
where U is an nxn orthonormal rotation matrix and where L stands for the rotated pxn matrix X^. Orthogonal rotation produces a new orthogonal frame of reference axes which are defined by the column-vectors of U and V. The structural properties of the pattern of points, such as distances and angles, are conserved by an orthogonal rotation as can be shown by working out the matrices of cross-products: SS^ = XVV^X^ = XX'^
(29.87)
56
or LL^ = X^UU^X = X'^X
(29.88)
where use is made of the orthogonality of rotation matrices. After an orthogonal rotation one can also perform a backrotation toward the original frame of reference axes: SV^ = XVV^ = X
(29.89)
or LU^ = X'^UU^ = X^
(29.90)
where V and U are orthonormal rotation matrices and where use is made of the same property of orthogonality as stated above. References 1. 2. 3. 4. 5. 6. 7. 8. 9.
10. 11. 12.
W.R. Dillon and M. Goldstein, Multivariate Analysis, Methods and Applications. Wiley, New York, 1984. F.R. Gantmacher, The Theory of Matrices. Vols. 1 and 2. Chelsea Publ., New York, 1977. N.C. Giri, Multivariate Statistical Inference. Academic Press, New York, 1972. N. Cliff, Analyzing Multivariate Data. Academic Press, San Diego, CA, 1987. R.J. Harris, A Primer on Multivariate Statistics. Academic Press, New York, 1975. C. Chatfield and A.J. Collins, Introduction to Multivariate Analysis. Chapman and Hall, London, 1980. M.S. Srivastana and E.M. Carter, An Introduction to Applied Multivariate Statistics. North Holland, New York, 1983. T.W. Anderson, An Introduction to Multivariate Statistical Analysis. Wiley, New York, 1984. O.M. Kvalheim, Interpretation of direct latent-variable projection methods and their aims and use in the analysis of multicomponent spectroscopic and chromatographic data. Chemom. Intell. Lab. Syst., 4 (1988) 11-25. P.E. Green and J.D. Carroll, Mathematical Tools for Applied Multivariate Analysis. Academic Press, New York, 1976. A. Gifi, Non-linear Multivariate Analysis. Wiley, Chichester, UK, 1990. K. Beebe and B.R. Kowalski, An introduction to multivariate calibration and analysis. Anal. Chem., 59 (1987) 1007A-1009A.
57
Chapter 30
Cluster analysis 30.1 Clusters Clustering or cluster analysis is used to classify objects, characterized by the values of a set of variables, into groups. It is therefore an alternative to principal component analysis for describing the structure of a data table. Let us consider an example. About 600 iron meteorites have been found on earth. They have been analysed for 13 inorganic elements, such as Ir, Ni, Ga, Ge, etc. One wonders if certain meteorites have similar inorganic composition patterns. In other words, one would like to classify the iron meteorites according to these inorganic composition patterns. One can view the meteorites as the 600 objects of a data table, each object being characterized by the concentration of 13 elements, the variables. This means that one views the 600 objects as points (or vectors) in a 13-dimensional space. To find groups one could obtain, as we learned in Chapter 17, a principal component plot and consider those meteorites that are found close together as similar and try to distinguish in this way clusters or groups of meteorites. Instead of proceeding in this visual way, one can try to use more formal and therefore more objective methods. Let us first attempt such a classification by using only two variables (for instance, Ge and Ni). Fictitious concentrations of these two metals for a number of meteorites (called A, B, ..., J) are shown in Fig. 30.1. A classification of these meteorites permits one to distinguish first two clusters, namely ABDFG and CEHIJ. On closer observation, one notes that the first such group can be divided into two sub-groups, namely ABF and DG, and that in the second group one can also discern two sub-groups, namely CEIJ and H. There are two ways of representing these data by clustering. The first is depicted by the tree, also called a dendrogram, of Fig. 30.2 and consists in the elaboration of a hierarchical classification of meteorites. It is hierarchical because large groups are divided into smaller ones (for instance, the group ABDFG splits into ABF and DG). These are then split up again until eventually each group consists of only one meteorite. This type of classification is very often used in many areas of science. Figure 30.3 shows a very small part of the classification of plants. Individual species are
58 Concentration of Ni
i
Concentration of Ge
Fig. 30.1. Concentration of Ni and Ge for ten meteorites A to J.
ABCDEFGHIJ
1
. ABFDG
I
I
I
DG
CEJI
rti
rh
I r I I
A B F
D G
C E J I
ABF
'
CEHIJ
'
1 H
Fig. 30.2. Hierarchical clustering for the meteorites described in Fig. 30.1.
grouped in genera, genera in families, etc. This classification was obtained historically by determining characteristics such as the number of cotyledons, the flower formula, etc. More recently, botanists and scientists from other areas such as bacteriology where classification is needed, have reviewed the classifications in their respective fields by numerical taxonomy [1,2]. This consists of considering the species as objects, characterized by certain variables (number of cotyledons, etc.). The data table thus obtained is then subjected to clustering. Numerical taxonomy has inspired other experimental scientists, such as chemists to apply clustering techniques in their own field. The other main possibility of representing clustered data is to make a table containing different clusterings. A clustering is a partition into clusters. For the example of Fig. 30.1, this could yield Table 30.1. Such a table does not necessarily yield a complete hierarchy (e.g. in going from 6 to 7, objects J and I, separated for clustering 6, are joined again for 7). Therefore, the presentation is called nonhierarchical. Classical books in the field are the already cited book by Sneath and Sokal [1] and that by Everitt [3]. A more recent book has been written by Kaufman and
59
(angiosperms)
(dicotyledones)
(monocotyledones)
(papilionaceae)
(rosaceae)
r
pi
o
&
I I I Fig. 30.3. Taxonomy of some plants.
TABLE 30.1 A list of clusterings derived from Fig. 30.1 by non-hierarchical clustering No. of clusters
1 2 3 4 6 7 0
Composition of the clusters
A A A A A A A
B B B B
-
C F F F B B B
D D
F
-
E G D D
F F
-
G G D D
-
F
-
-
-
H E C C
G G D
-
G C
I H E E C C G
J I H J I
J I I
-
H
-
J
-
E C
-
E J E
J I -
H
-
H
J
-
Rousseeuw [4]. Massart and Kaufman [5] and Bratchell [6] wrote specifically for chemometricians. Massart and Kaufman's book contains many examples, relevant to chemometrics, including the meteorite example [7]. More recent examples concern classification, for instance according to structural descriptions for toxicity testing [8] or in connection with combinatorial chemistry [9], according to chemical
60
composition for aerosol particles [10], Chinese teas [11] or mint species [12] and according to physicochemical parameters for solvents [13]. The selection of representative samples from a larger group for multivariate calibration (Chapter 36) by clustering was described by Naes [14].
30.2 Measures of (dis)similarity 30.2.1 Similarity and distance To be able to cluster objects, one must measure their similarity. From our introduction, it is clear that "distance" may be such a measure. However, many types of similarity coefficients may be applied. While the terms similarity or dissimilarity have no unique definitions, the definition of distance is much clearer [6]. A dissimilarity between two objects / and i' is a distance if Dii^ > 0 where D,-. = 0 if x,- = x,.
(30.1)
(where x- and x- are the row-vectors of the data table X with the measurements describing objects / and /')
Ar = A .
(30.2)
D,, + D , , > A r
(30.3)
Equation (30.1) shows that distances are zero or positive; eq. (30.2) that they are symmetric. Equation (30.3), where a is another object, is called the metric inequality. It states that the sum of the distances from any object to objects / and i' can never be smaller than the distance between / and i\ 30.2.2 Measures of (dis)similarity for continuous variables 30.2.2.1 Distances The equation for the Euclidean distance between objects / and i is
Ar=JX(^,v-^.';)'
(30.4)
where m is the number of variables. The concept of Euclidean distance was introduced in Section 9.2.3. In vector notation this can be written as: D^,=(x,-x,)^(x,-x,)
In some cases, one wants to give larger weights to some variables. This leads to the weighted Euclidean distance:
D_{ii'} = \sqrt{\sum_{j=1}^{m} w_j (x_{ij} - x_{i'j})^2}, \qquad \text{with } \sum_{j=1}^{m} w_j = 1        (30.5)
The standardized Euclidean distance is given by:
D_{ii'} = \sqrt{\sum_{j=1}^{m} \left[ (x_{ij} - x_{i'j}) / s_j \right]^2}        (30.6)

where s_j is the standard deviation of the values in the jth column of X:

s_j = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}        (30.7)
It can be shown that the standardized Euclidean distance is the Euclidean distance of the autoscaled values of X (see further Section 30.2.2.3). One should also note that in this context the standard deviation is obtained by dividing by n, instead of by (n - 1). The Mahalanobis distance [15] is given by:

D_{ik}^2 = (\mathbf{x}_i - \bar{\mathbf{x}}_k)^{\mathrm T} \mathbf{C}^{-1} (\mathbf{x}_i - \bar{\mathbf{x}}_k)        (30.8)

where C is the variance-covariance matrix of a cluster represented by \bar{\mathbf{x}}_k (e.g. \bar{\mathbf{x}}_k is the centroid of the cluster). It is therefore a distance between a group of objects and a single object i. The distance is corrected for correlation. Consider Fig. 30.4a; the distance between the centre C of the cluster and the objects A and B is the same in Euclidean distances but, since B is part of the group of objects outlined by the ellipse, while A is not, one would like a distance measure such that CA is larger than CB. The point B is "closer" to C than A because it is situated in the direction of the major axis of the ellipse, while A is not: the objects situated within the ellipse have values of x_1 and x_2 that are strongly correlated. For A this will not be the case. It follows that the distance measure should take correlation (or covariance) into account. In the same way, in Fig. 30.4b, clusters G1 and G2 are closer together than G3 and G4, although the Euclidean distances between the centres are the same. All groups have the same shape and volume, but G1 and G2 overlap, while G3 and G4 do not. G1 and G2 are therefore more similar than G3 and G4 are. The generalized distance summarizes eqs. (30.4) to (30.8). It is a weighted distance of the general form:
Fig. 30.4. Mahalanobis distance: (a) object B is closer to centroid C of cluster G1 than object A; (b) the distance between clusters G1 and G2 is smaller than between G3 and G4.
D_{ii'}^2 = (\mathbf{x}_i - \mathbf{x}_{i'})^{\mathrm T} \mathbf{W} (\mathbf{x}_i - \mathbf{x}_{i'})        (30.9)
where W represents an m×m weighting matrix. Four particular cases of the generalized distance are mentioned below:
- W = I defines ordinary Euclidean distances;
- W = diag(w) produces weighted Euclidean distances;
- W = diag(1/d²), where d represents the vector of column-standard deviations of X, yields standardized Euclidean distances;
- W = C⁻¹, where C represents the variance-covariance matrix as defined in eq. (30.8), defines Mahalanobis distances.
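To make the relation between these special cases concrete, the following minimal sketch computes eq. (30.9) for the different choices of W. It is not taken from the book; it assumes NumPy is available, and the small two-variable data matrix is an invented example (note also that np.cov uses the n - 1 divisor, whereas the standard deviation in eq. (30.7) uses n).

```python
# Sketch of the generalized distance of eq. (30.9) for different weighting matrices W.
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 14.0],
              [3.0, 18.0],
              [2.5, 12.0]])            # n = 4 objects, m = 2 variables (invented)

def generalized_distance(xi, xk, W):
    d = xi - xk
    return float(np.sqrt(d @ W @ d))    # D = sqrt((xi - xk)^T W (xi - xk))

xi, xk = X[0], X[1]
m = X.shape[1]

W_euclid = np.eye(m)                                  # W = I: ordinary Euclidean
W_std    = np.diag(1.0 / X.std(axis=0, ddof=0)**2)    # W = diag(1/d^2): standardized
W_mahal  = np.linalg.inv(np.cov(X, rowvar=False))     # W = C^-1: Mahalanobis

for name, W in [("Euclidean", W_euclid),
                ("standardized Euclidean", W_std),
                ("Mahalanobis", W_mahal)]:
    print(name, round(generalized_distance(xi, xk, W), 3))
```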
Euclidean distances (ordinary or standardized) are used very often for clustering purposes. This is not the case for the Mahalanobis distance. An application of Mahalanobis distances can be found in Ref. [16].

30.2.2.2 Correlation coefficient

Another way of measuring similarity between i and i' is to measure the correlation coefficient between the two row-vectors x_i and x_{i'}. The difference between using Euclidean distance and correlation is explained with the help of Fig. 30.5 and Table 30.2. In Chapter 9 it was shown that r equals the cosine of the angle between vectors. Consider the objects i, i' and i''. The Euclidean distance D_{ii'} in Fig. 30.5 is the same as D_{ii''}. However, the angle between x_i and x_{i'} is much smaller than between x_i and x_{i''}, and therefore the correlation coefficient is larger. How to choose between the two is not evident and requires chemical considerations. This is shown with the example of Table 30.2, which gives the retention indices of five substances on three gas chromatographic stationary phases (SFs). The question is which of these phases should be considered similar. The similarity measure to be chosen depends on the point of view of the analyst. One point of view might be that
Fig. 30.5. The point i is equidistant to i' and i'' according to the Euclidean distances (D_{ii'} and D_{ii''}) but much closer to i' (cos θ_{ii'}) than to i'' (cos θ_{ii''}) when a correlation-based similarity measure is applied.
those SFs that have more or less the same over-all retention, i.e., the same polarity towards a variety of substances, are considered to be similar. In that case, SF3 is very dissimilar from both SF1 and SF2, while SF1 and SF2 are quite similar. The best way to express this is the Euclidean distance: D13 and D23 are then much higher than D12. On the other hand, the analyst might not be interested in global retention indices. Indeed, by increasing the temperature for SF3, he would obtain similar retention indices as for the other two. He will then observe that the relative retention times, i.e. the retention times of the substances compared with each other, are the same for SF1 and SF3 and different from SF2. Chemically, this means that SF3 has a different polarity from SF1, but the same specific interactions. This is best expressed by using the correlation coefficient as the similarity measure. Indeed, r13 = 1, indicating complete similarity, while r12 and r23 are much lower. Since both r = 1 and r = -1 are considered to indicate absolute similarity and if, as with Euclidean distance, one would like the numerical value of the similarity measure to increase with increasing dissimilarity, one should use, for instance, 1 - |r|.

TABLE 30.2
Retention indices of five substances on three stationary phases in GLC
Stationary phase (SF)    Substance 1    Substance 2    Substance 3    Substance 4    Substance 5
1                        100            130            150            160            170
2                        120            110            170            150            145
3                        200            260            300            320            340
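As a concrete illustration of the two points of view, the following minimal sketch (not taken from the book; it assumes NumPy is available) computes both measures for the retention indices of Table 30.2 and reproduces the conclusion that r13 = 1 while the Euclidean distance D13 is large.

```python
# Euclidean distance (eq. 30.4) versus correlation coefficient for the GLC
# stationary phases of Table 30.2 (rows = SF1, SF2, SF3; columns = substances 1-5).
import numpy as np

SF = np.array([[100, 130, 150, 160, 170],   # SF1
               [120, 110, 170, 150, 145],   # SF2
               [200, 260, 300, 320, 340]])  # SF3

def euclid(a, b):
    return float(np.sqrt(((a - b) ** 2).sum()))

print("D12 =", round(euclid(SF[0], SF[1]), 1))   # small: similar overall retention
print("D13 =", round(euclid(SF[0], SF[2]), 1))   # large: very different polarity
print("r12 =", round(np.corrcoef(SF[0], SF[1])[0, 1], 3))
print("r13 =", round(np.corrcoef(SF[0], SF[2])[0, 1], 3))  # 1.0: same relative retention
```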
30.2.2.3 Scaling

In the meteorite example, the concentration of Ni is of the order of 50000 ppm and the Ga content of the order of 50 ppm. Small relative changes in the Ni content then have, of course, a much higher effect on the Euclidean distance than equally high relative changes of the Ga content. One might also consider two metals M and N, one ranging in concentration from 900 to 1100 ppm, the other from 500 to 1500 ppm. Concentration changes from one end of the range to the other in N would then be more important in the Euclidean distance than the same kind of change in M. It is probable that the person carrying out the classification will not agree with these numerical consequences and will consider them as artefacts. Both problems can be solved by scaling the variables. The most usual way of doing this is the z-transform, also called autoscaling (see also Chapter 3.3). One then determines

z_{ij} = (x_{ij} - \bar{x}_j) / s_j        (30.10)

where x_{ij} is the value for object i of variable j, \bar{x}_j is the mean for variable j, and s_j is the standard deviation for variable j. One then uses z in eq. (30.4), which is equivalent to applying the standardized Euclidean distance (eq. (30.6)) to the x-values. Other possibilities are range scaling and logarithmic transformation. In range scaling one does not divide by s_j as in eq. (30.6), but by the range r_j of variable j:

z_{ij} = x_{ij} / r_j        (30.11)

If one wants z_{ij} expressed on a 0-1 scale, this becomes:

z_{ij} = (x_{ij} - x_{j,\min}) / r_j        (30.12)

where x_{j,min} is the lowest value of x_j. The logarithmic transform, too, reduces variation between variables. Its effect is not to make absolute variation equal but to make variation comparable in the following sense. Suppose that variable 1 has a mean value of 100 and variable 2 a mean value of 10. Variable 1 varies between 50 and 150. If the variation is proportional to the mean values of the variables, then one expects variable 2 to vary between more or less 5 and 15. In absolute values the variation in variable 1 is therefore much larger. When one transforms variables 1 and 2 by taking their logarithms, the variation in the two transformed variables becomes comparable. Log-transformation to correct for heteroscedasticity in a regression context is described in Section 8.2.3.1. It also has the advantage that the scaling does not change when data are added. This is not so for eqs. (30.10) and (30.11), since one must recompute \bar{x}_j, r_j or s_j.
Scaling is a very important operation in multivariate data analysis and we will treat the issues of scaling and normalisation in much more detail in Chapter 31. It should be noted that scaling has no impact on the correlation coefficient (except when the log transform is used) and that the Mahalanobis distance is also scale-invariant, because the C matrix contains covariances (related to correlation) and variances (related to standard deviation).
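A minimal sketch of the three transformations discussed above follows. It is not from the book; it assumes NumPy is available, applies eqs. (30.10) and (30.12) column-wise (with the n divisor for the standard deviation, as noted earlier), and reuses the data of Table 30.3 purely as an illustration.

```python
# Autoscaling (z-transform), 0-1 range scaling and log transformation of the
# columns of a data matrix X, following eqs. (30.10)-(30.12).
import numpy as np

X = np.array([[100.0, 80.0, 70.0, 60.0],
              [ 80.0, 60.0, 50.0, 40.0],
              [ 80.0, 70.0, 40.0, 50.0],
              [ 40.0, 20.0, 20.0, 10.0],
              [ 50.0, 10.0, 20.0, 10.0]])   # the data of Table 30.3

Z_auto  = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)          # eq. (30.10)
Z_range = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # eq. (30.12), 0-1 scale
Z_log   = np.log10(X)                                            # logarithmic transform

print(Z_auto.round(2))
print(Z_range.round(2))
print(Z_log.round(2))
```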
30.2.3 Measures of (dis)similarity for other variables

30.2.3.1 Binary variables

Binary variables usually have values of 0 (for attribute absent) or 1 (for attribute present). The simplest type of similarity measure is the matching coefficient. For two objects i and i' and attribute j:

s_{ii'j} = 1 \ \text{if} \ x_{ij} = x_{i'j}
s_{ii'j} = 0 \ \text{if} \ x_{ij} \neq x_{i'j}

The matching coefficient is the mean of the s-values for all m attributes:

S_{ii'} = \frac{1}{m} \sum_{j=1}^{m} s_{ii'j}        (30.13)
This means that one counts the number of attributes for which i and i' have the same value and divides this by the number of attributes. The Jaccard similarity coefficient is slightly more complex. It considers that the simultaneous presence of an attribute in objects i and i' indicates similarity, but that the simultaneous absence of the attribute has no meaning. Therefore:

s_{ii'j} = 1 \ \text{if} \ x_{ij} = x_{i'j} = 1
s_{ii'j} = 0 \ \text{if} \ x_{ij} \neq x_{i'j}
\text{attribute } j \text{ is ignored if } x_{ij} = x_{i'j} = 0

The Jaccard similarity coefficient is then computed with eq. (30.13), where m is now the number of attributes for which at least one of the two objects has a value of 1. This similarity measure is sometimes called the Tanimoto similarity. The Tanimoto similarity has been used in combinatorial chemistry to describe the similarity of compounds, e.g. based on the functional groups they have in common [9]. Unfortunately, the names of similarity coefficients are not standard, so that it can happen that the same name is given to different similarity measures or more than one name is given to a certain similarity measure. This is the case for the Tanimoto coefficient (see further).
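A minimal sketch of the matching coefficient and the Jaccard (Tanimoto) coefficient follows. It is not from the book; it assumes NumPy is available and the two binary attribute vectors are invented examples.

```python
# Simple matching coefficient (eq. 30.13) and Jaccard/Tanimoto similarity
# for two binary attribute vectors.
import numpy as np

xi  = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # invented attribute patterns
xi2 = np.array([1, 0, 0, 1, 0, 1, 1, 0])

def matching(a, b):
    return np.mean(a == b)                       # agreements / m

def jaccard(a, b):
    both_present = np.sum((a == 1) & (b == 1))
    considered   = np.sum((a == 1) | (b == 1))   # 0/0 pairs are ignored
    return both_present / considered

print("matching:", matching(xi, xi2))   # 0.75 (6 of 8 attributes agree)
print("Jaccard :", jaccard(xi, xi2))    # 0.6  (3 shared of 5 present)
```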
The Hamming distance is given by:

D_{ii'} = \sum_{j=1}^{m} d_{ii'j}

where

d_{ii'j} = 1 \ \text{if} \ x_{ij} \neq x_{i'j} \quad \text{and} \quad d_{ii'j} = 0 \ \text{if} \ x_{ij} = x_{i'j}

It can be shown [5] that the Hamming distance is a binary version of the city-block distance (Section 30.2.3.2). Some authors use the Hamming distance as the equivalent of the Euclidean distance of binary data. In that case:

D_{ii'} = \sqrt{\sum_{j=1}^{m} d_{ii'j}}

The literature also mentions a normalized Hamming distance, which is then equal to either:

D_{ii'} = \frac{1}{m} \sum_{j=1}^{m} d_{ii'j}

or

D_{ii'} = \sqrt{\frac{1}{m} \sum_{j=1}^{m} d_{ii'j}}

The first of these two is also called the Tanimoto coefficient by some authors. It can be verified that, since distance = 1 - similarity, it corresponds to the simple matching coefficient. Clearly, confusion is possible and authors using a certain distance or similarity measure should always define it unambiguously.

30.2.3.2 Ordinal variables

For those variables that are measured on a scale of integer values consisting of more than two levels, one uses the Manhattan or city-block distance. This is also referred to as the L1-norm. It is given for variable j by:
D_{ii'} = \sum_{j=1}^{m} d_{ii'j}, \qquad \text{with } d_{ii'j} = |x_{ij} - x_{i'j}|        (30.14)
Fig. 30.6. D_{ii'} is the Euclidean distance between i and i'; d_{ii'1} and d_{ii'2} are the city-block distances between i and i' for variables x_1 and x_2, respectively. The city-block distance is d_{ii'1} + d_{ii'2}.
Here, too, scaling can be required when the ranges of the variables are dissimilar. In this case, one divides the distances by the range r_j of variable j:

d_{ii'j} = |x_{ij} - x_{i'j}| / r_j

In this way one obtains d-values from 0 to 1. Then s_{ii'j} = 1 - d_{ii'j}. Manhattan distances can also be used for continuous variables, but this is rarely done, because one prefers Euclidean distances in that case. Figure 30.6 compares the Euclidean and Manhattan distances for two variables. While the Euclidean distance between i and i' is measured along a straight line connecting the two points, the Manhattan distance is the sum of the distances parallel to the axes. The equations for both types of distances are very similar in appearance. In fact, they both belong to the Minkowski distances given by:
D_{ii'} = \left( \sum_{j=1}^{m} |x_{ij} - x_{i'j}|^{r} \right)^{1/r}        (30.15)
The Manhattan distance is obtained for r = 1 and the Euclidean distance for r = 2. In this context the Euclidean distance is also referred to as the L2-norm.

30.2.3.3 Mixed variables

In some cases, one needs to combine variables of mixed types (binary, ordinal or continuous). The usual way to do this is to eliminate the effect of varying ranges by scaling. All variables are transformed so that they take values from 0 to 1, using range scaling for the continuous variables or the procedure for scaling described
for ordinal variables in Section 30.2.3.2, while binary variables are expressed naturally on a 0-1 scale. The range-scaled similarity for variables on an interval scale is obtained from the range-scaled values z_{ij} and z_{i'j} as defined in eq. (30.12). One can then determine the similarity of the objects i and i' by summing the range-scaled similarities d_{ii'j} for all variables j, where d_{ii'j} is the range-scaled similarity between objects i and i' for variable j; a distance measure can be obtained from the corresponding dissimilarities.
30.2.4 Similarity matrix

The similarities between all pairs of objects are measured using one of the measures described earlier. This yields the similarity matrix or, if the distance is used as measure of (dis)similarity, the distance matrix. It is a symmetrical n×n matrix containing the similarities between each pair of objects. Let us suppose, for example, that the meteorites A, B, C, D and E in Table 30.3 have to be classified and that the distance measure selected is the Euclidean distance. Using eq. (30.4), one obtains the similarity matrix in Table 30.4. Because the matrix is symmetrical, only half of it needs to be used.
TABLE 30.3
Example of a data matrix

System    Concentrations (arbitrary units)
          Metal a    Metal b    Metal c    Metal d
A         100        80         70         60
B         80         60         50         40
C         80         70         40         50
D         40         20         20         10
E         50         10         20         10
30.3 Clustering algorithms

30.3.1 Hierarchical methods

There is a wide variety of hierarchical algorithms available and it is impossible to discuss all of them here. Therefore, we shall only explain the most typical ones, namely the single linkage, the complete linkage and the average linkage methods. In the similarity matrix, one seeks the two most similar objects, i.e., the objects for which S_{ii'} is largest. When using distance as the similarity measure, this means that one looks for the smallest D_{ii'} value. Let us suppose that it is D_{qp}, which means that of all the objects to be classified, q and p are the most similar. They are considered to form a new combined object p*. The similarity matrix is thereby reduced to (n - 1) × (n - 1). In average linkage, the similarities between the new object and the others are obtained by averaging the similarities of q and p with these other objects. For example, D_{ip*} = (D_{iq} + D_{ip})/2. In single linkage, D_{ip*} is the distance between the object i and the nearest of the linked objects, i.e., it is set equal to the smallest of the two distances D_{iq} and D_{ip}: D_{ip*} = min(D_{iq}, D_{ip}). Complete linkage follows the opposite approach: D_{ip*} is the distance between i and the furthest of the objects q and p. In other words, D_{ip*} = max(D_{iq}, D_{ip}). At the same time, one starts constructing the dendrogram by linking together q and p at the level D_{qp}. This process is repeated until all objects are linked in one hierarchical classification system, which is represented by a dendrogram. This procedure can now be illustrated using the data of Tables 30.3 and 30.4. The smallest D is 14.1 (between D and E). D and E are combined first and yield the combined object D*. The successive reduced matrices obtained by average linkage are given in Table 30.5, those obtained by single linkage in Table 30.6 and those obtained by complete linkage in Table 30.7. The dendrograms are shown in Fig. 30.7. Clusters are then obtained by cutting the highest link(s). For instance, by breaking the highest link in Fig. 30.7a, one obtains the clusters (ABC) and (DE). Cutting the second highest links leads to the clustering (A) (BC) (DE). How many links to cut is not always evident (see also Section 30.3.4.2).

TABLE 30.4
Similarity matrix (based on Euclidean distance) for the objects from Table 30.3
     A        B        C        D        E
A    0
B    40.0     0
C    38.7     17.3     0
D    110.4    70.7     78.1     0
E    111.4    72.1     80.6     14.1     0
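The distance matrix of Table 30.4 can be reproduced directly from Table 30.3; the following minimal sketch (not from the book; it assumes NumPy and SciPy are available) does so.

```python
# Euclidean distance matrix (eq. 30.4) for the data of Table 30.3,
# reproducing Table 30.4.
import numpy as np
from scipy.spatial.distance import pdist, squareform

labels = ["A", "B", "C", "D", "E"]
X = np.array([[100, 80, 70, 60],
              [ 80, 60, 50, 40],
              [ 80, 70, 40, 50],
              [ 40, 20, 20, 10],
              [ 50, 10, 20, 10]])

D = squareform(pdist(X, metric="euclidean"))   # symmetric 5x5 distance matrix
print(np.round(D, 1))                          # e.g. D[3, 4] = 14.1 (objects D and E)
```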
Fig. 30.7. Dendrograms for the data of Tables 30.3-30.7: (a) average linkage; (b) single linkage; (c) complete linkage.

TABLE 30.5
Successive reduced matrices for the data of Table 30.4 obtained by average linkage

(a)
      A        B        C        D*
A     0
B     40.0     0
C     38.7     17.3     0
D*    110.9    71.4     79.3     0
D* is the object resulting from the combination of D and E.

(b)
      A        B*       D*
A     0
B*    39.3     0
D*    110.9    75.3     0
B* is the object resulting from the combination of B and C.

(c)
      A*       D*
A*    0
D*    93.1     0
A* is the object resulting from the combination of A and B*.

(d) The last step consists in the junction of A* and D*. The resulting dendrogram is given in Fig. 30.7(a).
TABLE 30.6
Successive reduced matrices for the data of Table 30.4 obtained by single linkage

(a)
      A        B        C        D*
A     0
B     40.0     0
C     38.7     17.3     0
D*    110.4    70.7     78.1     0

(b)
      A        B*       D*
A     0
B*    38.7     0
D*    110.4    70.7     0

(c)
      A*       D*
A*    0
D*    70.7     0

(d) The last step consists in the junction of A* and D*. The resulting dendrogram is given in Fig. 30.7(b).
We observe that, in this particular instance, the only noteworthy difference between the algorithms is the distance at which the last link is made (from 111.4 for complete linkage to 70.7 for single linkage). When larger data sets are studied, the differences may become more pronounced. In general, average linkage is preferred. In the average linkage mode, one may introduce a weighting of the objects when clusters of unequal size are linked. Both weighted and unweighted methods exist. Another method which gives good results (i.e., has been shown to give meaningful clusters) is known as Ward's method [17]. It is based on a heterogeneity criterion, defined as the sum of the squared distances of each member of a cluster to the centroid of that cluster. Elements or clusters are joined with the criterion that the sum of heterogeneities of all clusters should increase as little as possible. Single linkage methods have a tendency to chain together ill-defined clusters (see Fig. 30.8). This eventually leads to the inclusion of rather different objects (A to X of Fig. 30.8) in the same long drawn-out cluster. For that reason one sometimes
TABLE 30.7
Successive reduced matrices for the data of Table 30.4 obtained by complete linkage

(a)
      A        B        C        D*
A     0
B     40.0     0
C     38.7     17.3     0
D*    111.4    72.1     80.6     0

(b)
      A        B*       D*
A     0
B*    40.0     0
D*    111.4    80.6     0

(c)
      A*       D*
A*    0
D*    111.4    0

(d) The last step consists in the junction of A* and D*. The resulting dendrogram is given in Fig. 30.7(c).
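The successive reductions of Tables 30.6 and 30.7 can be verified with a standard hierarchical clustering routine; the following minimal sketch is not from the book and assumes SciPy is available. Note that SciPy's 'average' method is the unweighted group average (UPGMA), whereas the simple pairwise averaging used in Table 30.5 corresponds to SciPy's 'weighted' method.

```python
# Single and complete linkage on the distance matrix of Table 30.4,
# reproducing the merge levels of Tables 30.6 and 30.7 / Fig. 30.7(b) and (c).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[100, 80, 70, 60],
              [ 80, 60, 50, 40],
              [ 80, 70, 40, 50],
              [ 40, 20, 20, 10],
              [ 50, 10, 20, 10]])
d = pdist(X, metric="euclidean")

Z_single   = linkage(d, method="single")     # merge heights: 14.1, 17.3, 38.7, 70.7
Z_complete = linkage(d, method="complete")   # merge heights: 14.1, 17.3, 40.0, 111.4
print(np.round(Z_single, 1))
print(np.round(Z_complete, 1))
# dendrogram(Z_single, labels=list("ABCDE"))  # uncomment to draw with matplotlib
```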
Fig. 30.8. Dissimilar objects A and X are chained together in a cluster obtained by single linkage.
says that the single linkage method is space contracting. The complete linkage method leads to small, tight clusters and is space dilating. Average linkage and Ward's method are space conserving and seem, in general, to give the better results.
TABLE 30.8
Values characterizing the objects of Fig. 30.9

      x1    x2
A     45    24
B     24    43
C     14    23
D     64    52
E     36    121
F     56    140
G     20    148
TABLE 30.9
Euclidean distances between the points in Fig. 30.9 (from Ref. [18])

      A      B      C      D      E      F      G
A     0
B     28     0
C     32     23     0
D     35     40     60     0
E     100    80     103    76     0
F     119    104    128    90     29     0
G     127    105    126    105    30     35     0
Single linkage has the advantage of mathematical simplicity, particularly when it is calculated using an operational research technique called the minimal spanning tree [18]. Although the computations seem to be very different from those in Table 30.6, exactly the same results are obtained. To explain the method we need a matrix with some more objects. The data matrix is given in Table 30.8 and the resulting similarity matrix (Euclidean distances) in Table 30.9. We may think of these objects as towns, the distances between which are given in the table, and suppose that the seven towns must be connected to each other by highways (or a production unit serving six clients using a pipeline). This must be done in such a way that the total length of the highway is minimal. Two possible configurations are given in Fig. 30.9. Clearly, (a) is a better solution than (b). Both (a) and (b) are graphs that are part of the complete graph containing all possible links and both are connected graphs (all of the nodes are linked directly or indirectly to each other). These graphs are called trees and the tree for which the sum of the values of the links is minimal is called the minimal spanning tree. This
Fig. 30.9. Examples of trees in a graph; (a) is the minimal spanning tree [18].
minimal spanning tree is also the optimal solution for the highway problem. The terminology used in this chapter comes from graph theory, which is described in Chapter 42. Several algorithms can be used to find the minimal spanning tree. One of these is Kruskal's algorithm [19], which can be stated as follows: add to the tree the edge with the smallest value which does not form a cycle with the edges already part of the tree. According to this algorithm, one selects first the smallest value in Table 30.9 (link BC, value 23). The next smallest value is 28 (link AB). The next smallest values are 29 and 30 (links EF and EG). The next smallest value in the table is 32 (link AC). This would, however, close the cycle ABC and is therefore eliminated. Instead, the next link that satisfies the conditions of Kruskal's algorithm is AD and the last one is DE. The minimal spanning tree obtained in this way is that given in Fig. 30.9(a). After careful inspection of this figure, one notes that two clusters can be obtained in a formal way by breaking the longest edge (DE). When a more detailed classification is needed, one breaks the second longest edge, and so on until the desired number of classes is obtained (see Section 30.3.4.2). In the same way, clusters were obtained from Fig. 30.7 by breaking first the lowest link, i.e. the one with the highest distance. An example of an application is shown in Fig. 30.10. This concerns the classification of 42 solvents based on three solvatochromic parameters (parameters that describe the interaction of the solvents with solutes) [13]. Different methods were applied, among which was the average linkage method, the result of which is shown in the figure. Depending on the method applied, several clusterings can be found. For instance, the first cluster to split off from the majority of solvents consists of solvents 36, 37, 38, 39, 40, 41, 42 (t-butanol, isopropanol, n-butanol,
Fig. 30.10. Hierarchical agglomerative classification of solvents according to solvent-solute and solvent-solvent interactions [13].
ethanol, methanol, ethyleneglycol and water). This solvent class consists of amphiprotic solvents (alcohols and water). It is then split further into the monoalcohols on the one hand and ethyleneglycol and water, which have a higher association ability, on the other. In this way, one can develop a detailed classification of the solvents. Another use of such a classification is to select different, representative objects. Snyder [20] used this approach to select a few solvents that would be different and representative of certain types of solvent-solute interactions. These solvents were then used in a successful strategy for the optimization of mobile phases for liquid chromatographic separation. The hierarchical methods discussed so far are called agglomerative. Good results can also be obtained with hierarchical divisive methods, i.e., methods that first divide the set of all objects in two so that two clusters result. Then each cluster is again divided in two, etc., until all objects are separated. These methods also lead to a hierarchy. They present certain computational advantages [21,22]. Hierarchical methods are preferred when a visual representation of the clustering is wanted. When the number of objects is not too large, one may even compute a clustering by hand using the minimal spanning tree. One of the problems in the approaches described above and, in fact, also in those described in the next sections, is that when objects are added to the data set, the
Fig. 30.11. The three-distance clustering method [23]. The new object A has to be classified. In node V_a it must be decided whether it fits better in the group of nodes represented by V_1, the group of nodes represented by V_2, or does not fit in any of the nodes already represented by V_a.
whole clustering must be carried out again. A hierarchical procedure which avoids this problem has been proposed by Zupan [23,24] and is called the three-distance clustering method (3-DM). Let us suppose that a hierarchical clustering has already been obtained and define V_a as a node in the dendrogram, representing all the objects above it. Thus

\mathbf{v}_a = \frac{1}{n_a} \sum_{i=1}^{n_a} \mathbf{x}_i, \qquad \text{i.e.} \quad v_{aj} = \frac{1}{n_a} \sum_{i=1}^{n_a} x_{ij} \quad (j = 1, \ldots, m)

where v_a is the mean vector of the n_a vectors x_i representing the n_a objects i = 1, ..., n_a above it, and m is the number of variables. For instance, in Fig. 30.11, V_1 is the mean of the three objects O_1, O_2, O_3. Suppose now also that in an earlier stage one has decided that the new object A belongs rather to the group of objects above V_a than to the group represented by the other node at that level. We will now take one of three possible decisions: (a) A belongs to V_1; (b) A belongs to V_2; (c) A does not belong to V_1 or V_2. This depends on the similarity or distance between A and V_1, V_2 and V_a. One determines the similarities S_{A,V1}, S_{A,V2} and S_{A,Va}. If S_{A,V1} is highest, then A belongs to V_1 and the same process as for V_a is repeated for V_1. If S_{A,V2} is highest, then A belongs to the group represented by V_2, and if S_{A,Va} is highest, then A belongs to V_a but not to V_1 or V_2. A new branch is then started for A between V_a and V_{a-1}.

30.3.2 Non-hierarchical methods

Let us now cluster the objects of Table 30.8 with a non-hierarchical algorithm. Instead of clustering by joining objects successively, one wants to determine
Fig. 30.12. Forgy's non-hierarchical classification method. A, ..., G are objects to be classified; *1, ..., *4 are successive centroids of clusters.
directly a K-clustering, by which is meant a classification into K clusters. We will apply this for two clusters. Of course, one is able to see that the correct 2-clustering is (A,B,C,D) (E,F,G). In general, one uses m-dimensional data and it is then not possible to observe clusters visually. In this section, we will also suppose that we are not able to do this. To obtain two clusters, one selects two seed points among the objects and classifies each of the objects with the nearest seed point. In this way, an initial clustering is obtained. For the objects of Table 30.8, A and B are selected as first seed points. In Fig. 30.12 it can be seen that this is not a good choice (A and E would have been better), but it should be remembered that one is supposed to be unable to observe this. D is nearest to A, and C, E, F and G are nearest to B (Table 30.9). The initial clustering is therefore (A,D) (B,C,E,F,G). For each of these clusters, one determines the centroid (the point with the mean values of variables x_1 and x_2 for each cluster). For cluster (A,D), the centroid (*1) is characterized by

x_1 = (45 + 64) / 2 = 54.5
x_2 = (24 + 52) / 2 = 38

and for cluster (B,C,E,F,G) the centroid (*2) is given by

x_1 = (24 + 14 + 36 + 56 + 20) / 5 = 30
x_2 = (43 + 23 + 121 + 140 + 148) / 5 = 95

The two centroids are shown in Fig. 30.12. In the next step, one reclassifies each object according to whether it is nearest to *1 or *2. This now leads to the clustering (A,B,C,D) (E,F,G). The whole procedure is then repeated: new centroids are computed for the clusters (A,B,C,D) and (E,F,G). These new centroids are
situated at *3 and *4. Reclassification of the objects again leads to (A,B,C,D) (E,F,G). Since the new clustering is the same as the preceding one, this clustering is considered definitive. The method used here is called Forgy's method [25]. This is one of the K-center or K-centroid methods, another well-known variant of which is MacQueen's K-means method [26]. Forgy's method involves the following steps:
(1) Select an initial clustering.
(2) Determine the centroids of the clusters and the distance of each object to these centroids.
(3) Locate each object in the cluster with the nearest centroid.
(4) Compute new cluster centroids and go to step (3).
One continues to do this until convergence occurs (i.e., until the same clustering is found in two successive assignment steps). Instead of using centroids as the points around which the clusters are constructed, one can select some of the objects themselves. These are then called centrotypes. Such a method might be preferred if one wants to select representative objects: the centrotype object will be considered to be representative for the cluster around it. Returning to the simple example of Fig. 30.12, suppose that one selects objects A and E as centrotypes. Then B, C and D would be classified with A, since they are nearer to it than to E, and F and G would be clustered with E. This method is based on an operations research model (Chapter 42), the so-called location model. The points A-G might then be cities in which some central facilities must be located. The criterion to select A and E as centrotypes is that the sum of the distances from each town to the nearest facility is minimal when the facilities are located in A and E. This means that A, B, C and D will then be served by the facility located in A, and E, F and G by the one located in E. An algorithm that allows one to do this was described in the clustering literature under the name MASLOC [27]. Numerical algorithms such as genetic algorithms or simulated annealing can also be applied (e.g. Ref. [11]). Both methods described above belong to a class of methods that is also called partitioning or optimization or partitioning-optimization techniques. They partition the set of objects into subsets according to some optimization criterion. Both methods use representative elements, in one case an object of the set to be clustered (the centrotype), in the other an object with real values for the variables that is not necessarily (and usually not) part of the objects to be clustered (the centroid). In general, one maximizes the between-cluster Euclidean distance or minimizes the within-cluster Euclidean distance or variance. This really amounts to the same. As described by Bratchell [6], one can partition the total variation, represented by T, into between-group (B) and within-group (W) components:

T = B + W
Fig. 30.13. Agglomerative methods will first link A and B, so that meaningless clusters may result. The non-hierarchical K = 2 clustering will yield clusters I and II.
where T is the total sums of squares and products matrix, related to the variance-covariance matrix, B is the same matrix for the centroids and W is obtained by pooling the sums of squares and products matrices of the clusters. One can also write

tr(T) = tr(B) + tr(W)

Since T, and therefore also tr(T), is constant, minimizing tr(W) is equivalent to maximizing tr(B). It can be shown that tr(B) is the sum of squared Euclidean distances between the group centroids. An advantage of non-hierarchical methods compared to hierarchical methods is that one is not bound by earlier decisions. A simple example of how disastrous this can be is given in Fig. 30.13, where an agglomerative hierarchical method would start by linking A and B. On the other hand, the agglomerative methods allow better visualization, although some visualization methods (e.g. Ref. [28]) have been proposed for non-hierarchical methods.

30.3.3 Other methods

A group of methods quite often used is based on the idea of describing high local densities of points. This can be done in different ways. One such way, mode analysis [29], is described by Bratchell [6]. A graph theoretical method (see also Chapter 42) by Jardine and Sibson [30] starts by considering each object as a node and by linking those objects which are more similar than a certain threshold. If a Euclidean distance is used, this means that only those nodes are linked for which the distance is smaller than a chosen threshold distance. When this is done, one determines the so-called maximal complete subgraphs. A complete subgraph is a
Fig. 30.14. Step 1 of the Jardine and Sibson method [30]. Objects less distant than Dj are linked.
set of nodes for which all the nodes are connected to each other. Maximal complete subgraphs are then the largest (i.e. those containing most objects) of these complete subgraphs. In Fig. 30.14, they are (B, G, F, D, E, C), (A, B, C, F, G), (H, I, J, K), etc. Each of these is considered as the kernel of a cluster. One now joins those kernels that overlap to a large degree, with as criterion that they should have at least a prespecified number of nodes in common (for instance, 3). Since only the first two kernels satisfy this requirement, one considers A, B, C, D, E, F, G as one cluster and H, I, J, K as another. Another technique, originally derived from the potential methods described for supervised pattern recognition (Chapter 33), was described by Coomans and Massart [31]. A kernel or potential density function is constructed around each object. In Fig. 30.15, A, B, C and D are objects around which a triangular potential field is constructed (solid lines). The potentials in each point are summed (broken line). One selects the point with the highest summed potential (B) as cluster centre and measures the summed potential in the closest point. All points, such as A and C, that can be reached from B along a potential path which decreases continuously belong to the same summed potential hill, and such objects are considered to be part of the cluster. When the potential increases again at a certain object, or when there is a point mid-way between two points which has a lower potential, this means that one has started to climb a new hill and the object is therefore part of another cluster. The method has the advantage that the form of the cluster is not important, while most other methods select spherical or ellipsoidal clusters. The disadvantage is that the width of the potential field must be optimized. All these methods and the methods of the preceding section have one characteristic in common: an object may be part of only one cluster. Fuzzy clustering applies other principles. It permits objects to be part of more than one cluster. This leads to results such as those illustrated by Fig. 30.16. Each object i is given a value
Fig. 30.15. One-dimensional example of the potential method [31] (summed potential as a function of distance).
Fig. 30.16. Fuzzy clustering. Two fuzzy clusters (I and II) are obtained. For example, f_AI = 1, f_AII = 0, f_BI = 0, f_BII = 1, f_CI = 0.47, f_CII = 0.53.
f_{ik} for a membership function (see Chapter 19) in cluster k. The following relationships are defined for all i and k:

0 \le f_{ik} \le 1

and

\sum_{k=1}^{K} f_{ik} = 1

where K is the number of clusters. When f_{ik} = 1, this means that i unambiguously belongs to cluster k; otherwise, the larger f_{ik} is, the more i belongs to cluster k. The assignment of the membership values is done by an optimization procedure. Of the many criteria that have been described, probably the best known is that of Ruspini [32]:
where δ is an empirical constant and d_{ii'} a distance. The criterion is minimized and therefore requires the membership values of i and i' to be similar when their distance is small. Fuzzy clustering has been applied only to a very limited extent in chemometrics. A good example concerning the classification of seeds from images is found in Ref. [33]. As described in the introduction to this volume (Chapter 28), neural networks can be used to carry out certain tasks of supervised or unsupervised learning. In particular, Kohonen mapping is related to clustering. It will be explained in more detail in Chapter 44.

30.3.4 Selecting clusters

30.3.4.1 Measures for clustering tendency

Instead of carrying out the actual cluster analysis to find out whether there is structure in the data, one might wonder if it is useful to do so and try to measure the clustering tendency. Hopkins' statistic and modifications of it have been described in the literature [34,35]. The original procedure is based on the fact that if there is a clustering tendency, distances between points and their nearest neighbour will tend to be smaller than distances between randomly selected artificial points in the same experimental domain and their nearest neighbour. The method consists of the following steps (see also Fig. 30.17):
- select at random a small number (for example 5%) of the real data points;
- compute the distance d_i to the nearest data point for each selected data point i;
- generate at random an equal number of artificial points in the area studied;
- compute the distance u_j to the nearest real data point for each artificial point j;
- determine H = Σ u_j / (Σ u_j + Σ d_i).
If there is a clustering tendency, Σ u_j will tend to be larger than Σ d_i and H will be higher than 0.5.
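A minimal sketch of this procedure follows. It is not from the book; it assumes NumPy and SciPy are available, and the sampling fraction and the uniform sampling window used for the artificial points are illustrative choices.

```python
# Hopkins-type statistic H = sum(u_j) / (sum(u_j) + sum(d_i)) for clustering tendency.
import numpy as np
from scipy.spatial import cKDTree

def hopkins(X, frac=0.05, rng=None):
    rng = np.random.default_rng(rng)
    n = max(1, int(frac * len(X)))
    tree = cKDTree(X)

    # d_i: marked real points to their nearest other real point (k=2 skips the point itself)
    marked = X[rng.choice(len(X), size=n, replace=False)]
    d = tree.query(marked, k=2)[0][:, 1]

    # u_j: artificial points, drawn uniformly in the data range, to the nearest real point
    artificial = rng.uniform(X.min(axis=0), X.max(axis=0), size=(n, X.shape[1]))
    u = tree.query(artificial, k=1)[0]

    return u.sum() / (u.sum() + d.sum())

# Two tight, well-separated clusters -> H clearly above 0.5
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
print(round(hopkins(X, frac=0.2, rng=1), 2))
```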
Fig. 30.17. Square symbols are the actual objects and circled squares are the marked objects. Open circles are artificial points (adapted from Ref. [34]).
30.3.4.2 How many clusters?

In hierarchical clustering one can obtain any number of clusters K, 1 ≤ K ≤ n, by cutting the dendrogram at the appropriate level. The question is then which of these clusterings is significant. A possible approach is to plot a criterion, such as the similarity value at which links are cut or the optimization criterion of Section 30.3.3, as a function of K. Large steps in that function occur when a significant K-clustering is encountered. An information theoretical procedure is described in Ref. [10]. Another approach to select clusters is to validate them using a supervised pattern recognition technique (see Chapter 33), such as linear discriminant analysis (LDA) [36]. Cross-validation can be applied to decide whether the LDA does separate the clusters, in which case they are considered to be significant.

Fig. 30.18. Nine objects, A-I, are clustered. K = 2 and K = 3 clusterings are obtained. The 2-clustering is correct but not relevant. The 3-clustering is meaningful and robust.
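A minimal sketch of the criterion-plotting approach described above follows. It is not from the book; it assumes SciPy is available and reuses the data of Table 30.3: the linkage level of each merge is listed as a function of the number of clusters K, and a large jump points to a natural clustering.

```python
# "Large step" criterion: list the linkage level at which the clustering
# changes from K to K-1 clusters.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

X = np.array([[100, 80, 70, 60], [80, 60, 50, 40], [80, 70, 40, 50],
              [40, 20, 20, 10], [50, 10, 20, 10]])      # Table 30.3
Z = linkage(pdist(X), method="single")

n = len(X)
for step, height in enumerate(Z[:, 2]):
    K_before = n - step          # number of clusters before this merge
    print(f"{K_before} -> {K_before - 1} clusters at level {height:.1f}")
# The jump from 38.7 to 70.7 suggests that K = 2, i.e. (A,B,C) (D,E), is a natural choice.
```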
30.3.5 Conclusion

The result of the clustering procedure depends on which procedure is applied and on the similarity measure used. Each gives a different view of the complex reality in the data set. It is therefore highly recommended that a clustering method is combined with a PCA or PLS display (see Chapters 17, 31 and 35) and, if possible, that several clustering methods and several types of similarity measure are used.
References

1. P.H.A. Sneath and R.R. Sokal, Numerical Taxonomy. The Principles and Practice of Numerical Classification. Freeman, San Francisco, CA, 1973.
2. G.C. Mead, A.P. Norris and N. Bratchell, Differentiation of Staphylococcus aureus from freshly slaughtered poultry and strains 'endemic' to processing plants by biochemical and physiological tests. J. Appl. Bacteriol., 66 (1989) 153-159.
3. B.S. Everitt, Cluster Analysis. Heinemann, London, 1981.
4. L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York, 1990.
5. D.L. Massart and L. Kaufman, Interpretation of Analytical Data by the Use of Cluster Analysis. Wiley, New York, 1983.
6. N. Bratchell, Cluster analysis. Chemom. Intell. Lab. Syst., 6 (1989) 105-125.
7. D.L. Massart, L. Kaufman and K.H. Esbensen, Hierarchical non-hierarchical clustering strategy and application to classification of iron-meteorites according to their trace element patterns. Anal. Chem., 54 (1982) 911-917.
8. R.G. Lawson and P.C. Jurs, Cluster analysis of acrylates to guide sampling for toxicity testing. J. Chem. Inf. Comput. Sci., 30 (1990) 137-144.
9. R.D. Brown and Y.C. Martin, Use of structure-activity data to compare structure-based clustering methods and descriptors for use in compound selection. J. Chem. Inf. Comput. Sci., 36 (1996) 572-584.
10. I. Bondarenko, H. Van Malderen, B. Treiger, P. Van Espen and R. Van Grieken, Hierarchical cluster analysis with stopping rules built on Akaike's information criterion for aerosol particle classification based on electron probe X-ray microanalysis. Chemom. Intell. Lab. Syst., 22 (1994) 87-95.
11. L.X. Sun, F. Xu, Y.Z. Liang, Y.L. Xie and R.Q. Yu, Cluster analysis by the K-means algorithm and simulated annealing. Chemom. Intell. Lab. Syst., 25 (1994) 51-60.
12. E. Marengo, C. Baiocchi, M.C. Gennaro, P.L. Bertolo, S. Lanteri and W. Garrone, Classification of essential mint oils of different geographic origin by applying pattern recognition methods to gas chromatographic data. Chemom. Intell. Lab. Syst., 11 (1991) 75-88.
13. A. de Juan, G. Fonrodona and E. Casassas, Solvent classification based on solvatochromic parameters: a comparison with the Snyder approach. Trends Anal. Chem., 16(1) (1997) 52-62.
14. T. Naes, The design of calibration in near infra-red reflectance analysis by clustering. J. Chemom., 1 (1987) 129-134.
15. P.C. Mahalanobis, On the generalized distance in statistics. Proc. Nat. Inst. Sci. (India), 12 (1936) 49-55.
16. P. Dagnelie and A. Merckx, Using generalized distances in classification of groups. Biom. J., 33 (1991) 683-695.
17. J.H. Ward Jr., Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc., 58 (1963) 236-244.
18. D.L. Massart and L. Kaufman, Operations research in analytical chemistry. Anal. Chem., 47 (1975) 1244A-1253A.
19. J.B. Kruskal Jr., On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. Am. Math. Soc., 7 (1956) 48-50.
20. L.R. Snyder, Classification of the solvent properties of common liquids. J. Chrom. Sci., 16 (1978) 223-234.
21. P. MacNaughton-Smith, W.T. Williams, M.B. Dale and H.T. Clifford, Computer J., 14 (1972) 162.
22. A. Thielemans, M.P. Derde and D.L. Massart, CLUE. Elsevier Scientific Software, Elsevier, Amsterdam, 1985.
23. J. Zupan, A new approach to binary tree-based heuristics. Anal. Chim. Acta, 122 (1980) 337-346.
24. J. Zupan and D.L. Massart, Application of the three-distance clustering method in analytical chemistry. Anal. Chem., 61 (1989) 2098-2102.
25. E.W. Forgy, Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics, 21 (1965) 768.
26. J. MacQueen, Some methods for classification and analysis of multivariate observations. In: L. Le Cam and J. Neyman (eds.), Proceedings 5th Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, Berkeley, CA, 1967, pp. 281-297.
27. D.L. Massart, F. Plastria and L. Kaufman, Non-hierarchical clustering with MASLOC. Pattern Recogn., 16 (1983) 507-516.
28. P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20 (1987) 53-65.
29. D. Wishart, Mode analysis. In: A.J. Cole (ed.), Numerical Taxonomy. Academic Press, London, 1969, pp. 282-308.
30. N. Jardine and R. Sibson, The construction of hierarchic and non-hierarchic classifications. Computer J., 11 (1968) 177-184.
31. D. Coomans and D.L. Massart, Potential methods in pattern recognition. Part 2. CLUPOT: an unsupervised pattern recognition technique. Anal. Chim. Acta, 133 (1981) 225-239.
32. E. Ruspini, Numerical methods for fuzzy clustering. Inf. Sci., 2 (1970) 319-350.
33. Y. Chtioui, D. Bertrand, D. Barba and Y. Dattee, Application of fuzzy C-means clustering for seed discrimination by artificial vision. Chemom. Intell. Lab. Syst., 38 (1997) 75-87.
34. B. Hopkins, A new method for determining the type of distribution of plant individuals. Ann. Bot., 18 (1954) 213-227.
35. J. Jurs and R.G. Lawson, New index for clustering tendency and its application to chemical problems. Chemom. Intell. Lab. Syst., 10 (1991) 81-83.
36. E. Marengo and R. Todeschini, Linear discriminant hierarchical clustering: a modeling and cross-validable divisive clustering method. Chemom. Intell. Lab. Syst., 19 (1993) 43-51.

Additional reading

G.N. Lance and W.T. Williams, A general theory of classificatory sorting strategies. 1. Hierarchical systems. Comput. J., 9 (1967) 373-380.
P. Willett and V. Winterman, A comparison of some measures for the determination of inter-molecular structural similarity. Quant. Struct.-Act. Relat., 5 (1986) 18-25.
Chapter 31
Analysis of Measurement Tables Introduction Measurement tables are the raw data that result from measurements on a set of objects. For the sake of simplicity we restrict our arguments to measurements obtained by means of instruments on inert objects, although they equally apply to sensory observations and to living subjects. By convention, a measurement table is organized such that its rows correspond to objects (e.g. chemical substances) and that its columns refer to measurements (e.g. physicochemical parameters). Here we adopt the point of view that objects are described in the table by means of the measurements performed upon them. Objects and measurements will also be referred to in a more general sense as row-variables and column-variables. A heterogeneous table contains measurements that are defined with different units; for example, chromatographic retention time in seconds, molecular volume in nm^, biological activity in mg substance per kg body weight, partition coefficient (dimensionless), etc. In a homogeneous table, all measurements are expressed in the same unit, such as retention times obtained with various chromatographic methods in seconds, or biological activities from a battery of tests in mg substance per kg body weight, etc. A special type of homogeneous measurements is found in a compositional table which describes chemical samples by means of the relative concentrations of their components. By definition, relative concentrations in each row of a compositional table add up to unity or to 100%. Such a table is said to be closed with respect to the rows. In general, closure of a table results when their rows or columns add up to a constant value. This operation is only applicable to homogeneous tables. Yet another type of homogeneous table arises when the rows or columns can be ordered according to a physical parameter, such as in a table of spectroscopic absorptions by chemical samples obtained at different wavelengths. Multivariate analysis of these different types of measurements (heterogeneous, homogeneous, compositional, ordered) may require special approaches for each of them. For example, compositional tables that are closed with respect to the rows, require a different type of analysis than heterogeneous tables where the columns are defined with different units. The basic approach of principal components
analysis, however, can be very helpful in understanding the overall structure of the data, i.e. its dimensionality, correlations between measurements, clustering of objects, specificities of objects for certain measurements, patterns of objects correlating with external parameters, and much more. The objective of this chapter is to discuss principal components analysis in relation to measurement tables. An introduction to principal components analysis has already been presented in Chapter 17, the notation of which is followed here as closely as possible, with two exceptions which will be indicated where they first occur. A measurement table is different from a contingency table. The latter results from counting the number of objects that belong simultaneously to various categories of two measurements (e.g. molar refractivity and partition coefficient of chemical compounds). It is also called a two-way table or cross-tabulation, as the total number of objects is split up in two ways according to the two measurements that are crossed with one another. The analysis of contingency tables is dealt with specifically in Chapter 32. Most of the algebra of vectors and matrices that is used in this chapter has been explained in Chapters 9 and 29. Small discrepancies between the tabulated values in the examples and their exact values may arise from rounding of intermediate results.
31.1 Principal components analysis A first introduction to principal components analysis (PCA) has been given in Chapter 17. Here, we present the method from a more general point of view, which encompasses several variants of PCA. Basically, all these variants have in common that they produce linear combinations of the original columns in a measurement table. These linear combinations represent a kind of abstract measurements or factors that are better descriptors for structure or pattern in the data than the original measurements [1]. The former are also referred to as latent variables [2], while the latter are called manifest variables. Often one finds that a few of these abstract measurements account for a large proportion of the variation in the data. In that case one can study structure and pattern in a reduced space which is possibly two- or three-dimensional. Historically, a distinction has been made between PCA of column-variables and that of row-variables. These are referred to as R-mode or Q-mode PCA, respectively. The modem approach is to consider both analyses as dual and to unify the two views (of rows and columns) into a single display, which is called biplot and which will be discussed in greater detail later on.
31.1.1 Singular vectors and singular values

Let us suppose that we have a measurement table X with n rows and p columns. Each element x_{ij} of the table then represents the value of the jth measurement on the ith object, where i ranges from 1 to n and where j ranges from 1 to p. In contrast with the notation in Chapter 17, we use the symbol p instead of m to denote the number of columns in a data table, in order to avoid a conflict with the symbol m which we reserve for denoting means later on in this chapter. An important theorem of matrix algebra, called singular value decomposition (SVD), states that any n×p table X can be written as the matrix product of three terms U, Λ and V:

X = U Λ V^T \qquad \text{or} \qquad Λ = U^T X V        (31.1a)
where U is an n×r column-orthonormal matrix, V is a p×r column-orthonormal matrix and Λ is an r×r diagonal matrix. The dimension r can be at most equal to the smaller of the dimensions n or p. The elements on the diagonal of Λ are defined as positive values, while the values in the off-diagonal positions are zero. An equivalent notation of SVD, using the column-vectors of U and V and the diagonal elements of Λ, is of the form:

X = λ_1 \mathbf{u}_1 \mathbf{v}_1^T + λ_2 \mathbf{u}_2 \mathbf{v}_2^T + \ldots + λ_r \mathbf{u}_r \mathbf{v}_r^T = \sum_{k=1}^{r} λ_k \mathbf{u}_k \mathbf{v}_k^T        (31.1b)
Orthonormality implies that:

U^T U = V^T V = I_r        (31.2)

where I_r is the r×r identity matrix. This means that the columns in U and V are normalized to unit sums of squares and that their sums of cross-products are zero. Vectors that possess these properties are said to be normalized and orthogonal, hence orthonormal. It can be proved that the decomposition is always possible and that the solution is unique (except for the algebraic signs of the columns of U and V) [3]. Singular value decomposition of a rectangular table is an extension of the classical work of Eckart and Young [4] on the decomposition of matrices. The decomposition of X into U, V and Λ is illustrated below using a 4×3 data table which has been adapted from Van Borm [5]. (A similar example has been used for the introduction to PCA in Chapter 17.) The data in Table 31.1 represent the concentrations of the trace elements Na, Cl and Si (columns) in atmospheric samples that have been collected at the prevailing wind directions of 0, 90, 180 and 270 degrees (rows).
TABLE 31.1
Concentrations of trace elements in atmospheric samples at prevailing wind directions

Wind direction (degrees)    Na       Cl       Si
0                           0.212    0.399    0.190
90                          0.072    0.133    0.155
180                         0.036    0.063    0.213
270                         0.078    0.141    0.273
The SVD of the table of concentrations of trace elements in atmospheric samples yields (in accordance with Section 17.6):
U Λ V^T =
\begin{bmatrix} 0.753 & 0.618 \\ 0.343 & -0.127 \\ 0.302 & -0.567 \\ 0.473 & -0.529 \end{bmatrix}
\begin{bmatrix} 0.626 & 0 \\ 0 & 0.214 \end{bmatrix}
\begin{bmatrix} 0.371 & 0.690 & 0.622 \\ 0.280 & 0.556 & -0.783 \end{bmatrix}
=
\begin{bmatrix} 0.212 & 0.399 & 0.190 \\ 0.072 & 0.133 & 0.155 \\ 0.036 & 0.063 & 0.213 \\ 0.078 & 0.141 & 0.273 \end{bmatrix}
= X
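This numerical decomposition is easy to verify; the following minimal sketch is not from the book and assumes NumPy is available. Note that the signs of corresponding columns of U and V may come out reversed, which is the sign indeterminacy mentioned above.

```python
# Singular value decomposition of the data of Table 31.1 (eq. 31.1a).
import numpy as np

X = np.array([[0.212, 0.399, 0.190],
              [0.072, 0.133, 0.155],
              [0.036, 0.063, 0.213],
              [0.078, 0.141, 0.273]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.round(s, 3))         # singular values: 0.626, 0.214, ~0 (rank 2)
print(np.round(U[:, :2], 3))  # first two left singular vectors (up to sign)
print(np.round(Vt[:2], 3))    # first two right singular vectors (up to sign)

# Reconstruction with the two nonzero singular values (eq. 31.1b)
X_hat = U[:, :2] @ np.diag(s[:2]) @ Vt[:2]
print(np.allclose(X, X_hat, atol=1e-3))   # True
```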
The orthonormality condition for the singular vectors of the wind directions reads:

U^T U =
\begin{bmatrix} 0.753 & 0.343 & 0.302 & 0.473 \\ 0.618 & -0.127 & -0.567 & -0.529 \end{bmatrix}
\begin{bmatrix} 0.753 & 0.618 \\ 0.343 & -0.127 \\ 0.302 & -0.567 \\ 0.473 & -0.529 \end{bmatrix}
=
\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
= I_2
and the orthonormality condition for the singular vectors of the trace elements reads:

V^T V =
\begin{bmatrix} 0.371 & 0.690 & 0.622 \\ 0.280 & 0.556 & -0.783 \end{bmatrix}
\begin{bmatrix} 0.371 & 0.280 \\ 0.690 & 0.556 \\ 0.622 & -0.783 \end{bmatrix}
=
\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
= I_2
Note that, although X has dimensions 4×3, only 2 nonzero eigenvalues can be extracted. This is the result of a linear dependency among the columns of X, which takes the form:

x_{i1} = 0.5245 x_{i2} + 0.01437 x_{i3}
or, equivalently, in terms of the concentrations of Na, Cl and Si:

[Na] = 0.5245 [Cl] + 0.01437 [Si]

The above relation holds within the limited precision of our calculations of the singular values, which in the present example is about four significant digits. In the context of SVD, U is called the matrix of row-singular vectors, V the matrix of column-singular vectors and Λ denotes the diagonal matrix of associated singular values. U and V are also referred to as the matrix of left-singular vectors and the matrix of right-singular vectors, respectively. As we will discuss in more detail in Section 31.2 on geometrical interpretation, the kth column of U defines the kth singular vector in row-space S^n. Similarly, the kth column of V defines the kth singular vector in column-space S^p. Since the singular vectors in U and V are orthonormal, they define an orthogonal system of basis vectors in each of the dual spaces S^n and S^p. According to the previous definition in Section 9.2.1, we define row-space S^n as the coordinate space in which the p columns of an n×p matrix X can be represented as a pattern P^p of p points. Similarly, column-space S^p is the coordinate space in which the n rows of X can be represented as a pattern P^n of n points. These concepts are explained in greater detail in Chapter 29. Note that corresponding columns of U, V and Λ refer to the same singular vectors. The matrix of singular values is denoted here by the symbol Λ instead of the symbol W, which is used in Chapter 17, in order to maintain consistency with the notation Λ² introduced in Chapter 29 for the matrix of eigenvalues.

31.1.2 Eigenvectors and eigenvalues

The number of singular vectors r is at most equal to the smallest of the number of rows n or the number of columns p of the data table X. For the sake of simplicity we will assume here that p is smaller than n, which is most often the case with measurement tables. Hence, we can state here that r is at most equal to p, or equivalently that r ≤ p ≤ n. The singular value decomposition of X implies the eigenvalue decompositions of the two matrices of cross-products:

C_n = U Λ^2 U^T \qquad \text{and} \qquad C_p = V Λ^2 V^T        (31.3)

where C_p = X^T X is the p×p square symmetric matrix of cross-products of the columns of X, and where C_n = X X^T is the n×n square symmetric matrix of cross-products of the rows of X.
Equation (31.3) defines the eigenvalue decomposition (EVD), also referred to as the spectral decomposition, of a square symmetric matrix. The orthonormal matrices U and V are the same as those defined above with SVD, apart from the algebraic signs of the columns. As pointed out already in Section 17.6.1, the diagonal matrix Λ² can be derived from Λ simply by squaring the elements on its main diagonal. The decompositions of C_n and C_p into U, V and Λ² are illustrated in the following example, which makes use of the previously computed results. For the cross-products between wind directions we obtain:

U Λ^2 U^T =
\begin{bmatrix} 0.753 & 0.618 \\ 0.343 & -0.127 \\ 0.302 & -0.567 \\ 0.473 & -0.529 \end{bmatrix}
\begin{bmatrix} 0.392 & 0 \\ 0 & 0.046 \end{bmatrix}
\begin{bmatrix} 0.753 & 0.343 & 0.302 & 0.473 \\ 0.618 & -0.127 & -0.567 & -0.529 \end{bmatrix}
=
\begin{bmatrix} 0.240 & 0.098 & 0.073 & 0.125 \\ 0.098 & 0.047 & 0.044 & 0.067 \\ 0.073 & 0.044 & 0.051 & 0.070 \\ 0.125 & 0.067 & 0.070 & 0.100 \end{bmatrix}
= C_n
For the cross-products between trace elements we find that:

V Λ^2 V^T =
\begin{bmatrix} 0.371 & 0.280 \\ 0.690 & 0.556 \\ 0.622 & -0.783 \end{bmatrix}
\begin{bmatrix} 0.392 & 0 \\ 0 & 0.046 \end{bmatrix}
\begin{bmatrix} 0.371 & 0.690 & 0.622 \\ 0.280 & 0.556 & -0.783 \end{bmatrix}
=
\begin{bmatrix} 0.058 & 0.107 & 0.080 \\ 0.107 & 0.200 & 0.148 \\ 0.080 & 0.148 & 0.180 \end{bmatrix}
= C_p

In the context of EVD, U is called the matrix of row-eigenvectors, V the matrix of column-eigenvectors and Λ² the diagonal matrix of (associated) eigenvalues. Singular vectors and eigenvectors are identical, up to an algebraic sign, and the associated eigenvalues are the squares of the corresponding singular values. Eigenvectors in U and V are computed independently from the cross-product matrices C_n and C_p, and hence their algebraic sign is arbitrary. Singular vectors, however, are computed simultaneously from the data matrix X, and hence they always appear with the appropriate algebraic sign. If the sign of a particular
column of U happens to be reversed as a result of the calculation, then the sign of the corresponding column of V will also be reversed automatically. An alternative way of defining EVD is as follows:

U^T C_n U = V^T C_p V = Λ^2        (31.4)

which can be obtained by multiplying both sides of eq. (31.3) with U and U^T, or V and V^T, and applying the orthogonality conditions (eq. (31.2)). These operations show that the square symmetric matrix C_n is diagonalized by the orthonormal matrix U into the diagonal matrix Λ². Diagonalization is the operation by which a square matrix is transformed into a diagonal one. Likewise, the square symmetric matrix C_p is diagonalized by the orthonormal matrix V into the same diagonal matrix Λ². If u_1 and v_1 represent the eigenvectors corresponding to the largest eigenvalue λ_1², then according to eq. (31.4) we have:

\mathbf{u}_1^T C_n \mathbf{u}_1 = \mathbf{v}_1^T C_p \mathbf{v}_1 = λ_1^2        (31.5a)

where u_1 and v_1 are such that λ_1² is maximal [6]. By virtue of the orthonormality of the eigenvectors, we can rewrite the above expressions in the forms:

C_n \mathbf{u}_1 = λ_1^2 \mathbf{u}_1

or

C_p \mathbf{v}_1 = λ_1^2 \mathbf{v}_1        (31.5b)
The same applies to the other eigenvectors U2 and V2, etc., with additional constraints of orthonormality of Uj, U2, etc. and of Vj, V2, etc. By analogy with eq. (31.5b) it follows that the r eigenvalues in A^ must satisfy the system of linear homogeneous equations: (C^-X\IJu,=0,
with/:=l,...,r
(31.6a)
or
where !„ represents the nxn identity matrix and where Ip is thepxp identity matrix [7]. For the sake of completeness we mention here an alternative definition of eigenvalue decomposition in terms of a constrained maximization problem which can be solved by the method of Lagrange multipliers: max [u;^ C^ Ui - X\ (uj u, - 1)]
(31.6b)
94
or m^x[yj
Cpy,-X]{yJ
v^ -1)]
where X] represents the Lagrange multiplier associated with the first eigenvectors u, and Vj. The above Lagrange expressions define that the sum of squares of the projections of X^ and X upon the normalized vectors u^ and v, must be maximal. Differentiation of these expressions with respect to the unknown Uj and Vj, and equalization of the results to zero, leads to the expressions in eq. (31.5b). Similar Lagrange expressions as in eq. (31.6b) can be derived for the other eigenvectors U2 and V2, etc., with additional constraints of orthonormality of Uj, U2, etc. and of Vj, V2, etc. Differentiation with respect to the unknown eigenvectors then leads to the system of linear homogeneous equations of eq. (31.6a). A necessary condition for obtaining non-trivial solutions for u^ and v^ in eq. (31.6a) is that the determinants of the coefficient matrices must be zero, which results into the so-called characteristic equation: IC„ ->.2 I J = 0
withik=l,...,r
(31.7)
or
ic,-x^,i,i=o The determinants can be developed into a polynomial equation of degree r of which the r positive roots are the eigenvalues X\, where r
fi
P
^ ^ 2 = t r ( A ^ ) = tr(CJ = tr(C^) = X S ^ 0 = ^
(3L8)
where trace (tr) means sum of the elements on the main diagonal of a square matrix. Note that the global sum of squares c is denoted by the symbol SS^ in other chapters, where it is referred to as the total sum of squares. From the numerical results in the previous example we find that the global sum of squares of the concentrations of all trace elements in all wind directions amounts to: tr(A2) = 0.392 + 0.046 = 0.438
95
tr(CJ = 0.240 + 0.047 + 0.051 + 0.100 = 0.438 tr(Cp = 0.058 + 0.200 + 0.180 = 0.438 Each eigenvector u^ or v^ contributes an amount X\ to the global sum of squares c of X. Hence, eigenvectors can be ranked according to their contributions to c. From now on we assume that the columns in U and V are arranged in decreasing order of their contributions. 31.1.3 Latent vectors and latent values In order to avoid confusion we will in this chapter consistently use the term latent vector or latent variable as being synonymous with singular vector and with eigenvector. Similarly, we will use the term latent value when referring to singular value or to the square root of eigenvalue. In the literature, latent vectors are also indicated by means of the term factor or component. The process of extracting latent vectors from a matrix of cross-products according to eqs. (31.3) or (31.4) is cdiWtd factorization. In a broad sense, factor analysis designates a class of methods of multivariate data analysis which involve factorization, although, historically, the name factor analysis has been first proposed for the analysis of tables of correlation coefficients [8]. This subject is treated extensively in Chapter 34. 31.1.4 Scores and loadings We have seen above that the r columns of U represent r orthonormal vectors in row-space 5". Hence, the r columns of U can be regarded as a basis of an r-dimensional subspace S"" of 5". Similarly, the r columns of V can be regarded as a basis of an r-dimensional subspace S^ of column-space 5^. We will refer to S^ as the factor space which is embedded in the dual spaces S^ and S^. Note that r
(31.9)
where th^ factor scaling coefficient a usually takes the value 0,0.5 or 1. The matrix S is an nxr orthogonal matrix, called score matrix, where r
(31.10)
where A^" is a diagonal matrix obtained by raising the diagonal elements of A to the power 2a.
96
In the same way, we define the coordinates for the p columns of X in factorspace S^\ L = VAP
(31.11)
where the factor scaling coefficient p is usually assigned the value 0,0.5 or 1. The matrix L is a pxr orthogonal matrix, called loading matrix. Orthogonality of L can be derived from the orthonormality of V, using eq. (31.2): (31.12)
L ' ^ L = AP V ^ V A P = A 2 P
where A^^ is a diagonal matrix obtained by raising the diagonal elements of A to the power 2(3. A traditional notation in chemometrics for SVD defines scores and loadings by means of the symbols T and P such that X = T P^, which is equivalent to X = U A V, where T = U A and P = V. This notation corresponds with the case a = 1 and p = 0, which is the most frequently used combination of factor scaling coefficients in chemometrics. 31.1.5 Principal components In the special case where a = 1 we can express the score matrix S in the form: S = UA = XV
(31.13)
as follows fromeqs. (31.1) and (31.2). Each element of S can be regarded as a scalar product: p
Sii,=xJ v^=^jc,yVy^
with/= 1,...,/z and/: = 1,..., r
(31.14a)
j
where the vector x^ represents the rth row of X and where the vector v^ represents the kth column of V. Each column of S represents a row-principal component of X and can be interpreted as a linear combination of the columns of X using the elements of V as weighting coefficients: Sfc = Xv, = X^y^y^
^i^h k = 1,..., r
(31.14b)
In the case of a = 1 we compute the scores S for the wind directions in the current example of concentrations of trace elements in atmospheric samples as follows:
97
S = UA =
0.753
0.618
0.343
-0.127
0.626 0
0.302
-0.567
0
0.473
-0.529
0.212
0.399 0.190
0.072 0.133 0.155
= xv= 0.036
0.214
0.063 0.213
0.078 0.141 0.273
0.371 0.690 0.622
0.472
0.132
0.215 0.556 = 0.189 -0.783 0.296
-0.027
0.280'
-0.122 -0.113
In Fig. 31.1a these scores are used as the coordinates of the four wind directions in 2-dimensional factor-space. From this so-called score plot one observes a large degree of association between the wind directions of 90, 180 and 270 degrees, while the one at 0 degrees stands out from the others. When a = 0.5 we obtain somewhat different values for the scores in S: 0753 0.343 S = UAi'2^ 0.302 0.473
0.618 -0.127 -0.567 -0.529
0.791 0 0.462 0
0.596 0.271 0.239 0.374
0.286 -0.059 -0.262 -0.244
Finally, when a = 0 then S is equal to U according to eq. (31.9). In the same way, we consider the special case where P = 1, which yields for the loading matrix L the expression: (31.15)
L = VA = X^U by virtue of eqs. (31.1) and (31.2). Each element of L can be written as a scalar product: n
U = x j u^ = ^x^j Uij^
with; = 1,..., /? and /: = 1,..., r
(31.16a)
where the vector x^ represents the yth column of X and where the vector u^ represents the k\h column of U. Each column of L represents also a column-principal component of X and can be regarded as a linear combination of the rows of X using the elements of U as weighting coefficients: 1^ =X'^ u^ =^^iUij^
with^= 1,..., r
(31.16b)
98
©
a=l
\= .046
S''
^
1 A,.= .392
-.5
®
1 ^ , = .392
I
-.5
-.5 -^ Fig. 31.1. (a) Score plot in which the distances between representations of rows (wind directions) are reproduced. The factor scaling coefficient a equals 1. Data are listed in Table 31.1. (b) Loading plot in which the distances between representations of columns (trace elements) are preserved. The factor scaling coefficient p equals 1. Data are defined in Table 31.1.
99
The columns of the loading matrix L contain the principal components of X in column-space 5''. In the case of |3 = 1, one calculates the loadings L of the three trace elements of the current example as follows below: 0.371 L = V A = 0.690 0.622
0.212 XTU = 0.399
0.190
0.280 0.556 -0.783
0.072
0.036
0.626
0
0
0.214
0.078
0.133 0.063 0.141 0.155 0.213
0.232
0.060
0.432
0.119
0.390
-0.168
0.273
0.753
0.618
0.343
-0.127
0.302
-0.567
0.473
-0.529
These loadings have been used for the construction of the so-called loading plot in Fig. 31.1b which shows the positions of the three trace elements in 2-dimensional factor-space. The elements Na and CI are clearly related, while Si takes a position of its own in this plot. When p = 0.5 we obtain: 0.371 L = V A'/2 = 0.690 0.622
0.280 0.556 -0.783
0.791 0 0 0.462
0.292 0.546 0.492
0.129 0.257 -0.362
Finally, when |3 = 0 then L is equal to V as follows from eq. (31.11). Summarizing, we have obtained two sets of principal components S and L (for a particular a and |3) from one and the same data table X. In the next section on the geometrical interpretation of principal components, we will show in a more general way that S contains the coordinates of the rows of X in factor space and that L contains the coordinates of the columns of X in the same factor space. Sometimes one refers to U and V as the principal components of X. This is also legitimate when one refers to the special case where a = P = 0, as can be readily seen fromeqs. (31.9) and (31.11).
100
31.1.6 Transition formulae The relationships between scores, loadings and latent vectors can be written in a compact way by means of the so-called transition formulae:
s = xv U = S A-^
(from eq. (31.13))
(31.17)
L = X^U V = LA-^
(from eq. (31.15))
In this way, given U one can compute V, and vice versa. These transition formulae are the basis for the calculation of U and V by the so-called NIPALS algorithm which will be explained in Section 31.4.
31.1.7 Reconstructions In the previous subsection, we have described S and L as containing the coordinates of the rows and columns of a data table in factor-space. Below we show that, in some cases, it is possible to graphically reconstruct the data table and the two cross-product matrices derived from it. It is not possible, however, to reconstruct at the same time the data and all the cross-products, as will be seen. We distinguish between three types of reconstructions. In the case where a equals 1 we can reconstruct the diagonalized cross-product matrix A^: S'rS = AU'rUA = A2
(31.18)
and the cross-product matrix C^: SS^ = U A A U'^ = U A^ U^ = C„
(31.19)
using eqs. (31.2), (31.3) and (31.13). In the current example using a = 1, we can reconstruct the sums of crossproducts C„ between the wind directions from their scores S:
101
SS'T =
0.472 0.215
0.132] -0.027 ro.472
0.189 0.296
-0.122 [o.l32 -0.027 -0.122 -0.113 -0.113
0.240
0.098
0.073 0.125
0.098
0.047
0.044 0.067
0.073 0.044
0.051 0.070
0.125 0.067
0.070
0.215
0.189 0.296
= C.
0.100
In the case where P equals 1 we can reconstruct the diagonalized cross-product matrix A^: L^ L = A V^ V A = A^
(31.20)
and the cross-product matrix Cp. (31.21) using eqs. (31.2), (31.3) and (31.15). Note that the diagonal matrix A is invariant under transposition. In the current example using p = 1, we reconstruct the sums of cross-products C^ between the trace elements from their loadings L:
LL
T
_
0.232 0.432 0.390
0.060 0.119 -0.168
0.058 0.107 0.080
0.107 0.080 0.201 0.148 = C. 0.148 0.180
0.232 0.060
0.432 0.119
0.390 -0.168
A very special case arises when a + P equals 1. If we form the product of the score matrix S with the transpose of the loading matrix L, then we obtain the original measurement table X: SL'T =UA« AP yi" =UAV'^ = X by making use of eqs. (31.1), (31.9) and (31.11) where a + P = 1.
(31.22)
102
This implies that the scalar product of a score vector s^ with a loading vector ly allows us to reconstruct the value x^j in the table X. Written in full, this is equivalent to: r
^yi=l^ikhi:=^u
(31-23)
k
where the vector sj is the iih row of S and where the vector IJ is theyth row of L. Summarizing, we find that, depending on the choice of a and (5, we are able to reconstruct different features of the data in factor-space by means of the latent vectors. On the one hand, if a = 1 then we can reproduce the cross-products C„ between the rows of the table. On the other hand, if p equals 1 then we are able to reproduce the cross-products C^ between columns of the table. Clearly, we can have both a = 1 and p = 1 and reproduce cross-products between rows as well as between columns. In the following section we will explain that cross-products can be related to distances between the geometrical representations of the corresponding rows or columns. If we want at the same time to reconstruct the data in the table X we must have that a + p = 1. A frequently used compromise is a = 1 and p = 0, where we sacrifice distances between columns for distances between rows, while still satisfying the constraint a + p = 1 for reconstruction of the data. This is an asymmetrical choice for a and p. Another compromise is to have a = 0.5 and P = 0.5 which allows for an approximate reconstruction of distances together with the reconstruction of the data. This is a symmetrical choice for a and p. An important aspect of latent vectors analysis is the number of latent vectors that are retained. So far, we have assumed that all latent vectors are involved in the reconstruction of the data table (eq. (31.1)) and the matrices of cross-products (eq. (31.3)). In practical situations, however, we only retain the most significant latent vectors, i.e. those that contribute a significant part to the global sum of squares c (eq.(31.8)). If we only include the first r* latent variables we have to redefine our relationships between data, latent vectors and latent values: X = U* A* V*^ 4- E = X* + E
(31.24)
where U*, A* and V* represent the matrices formed from the first r* columns of U and V and from the first r* rows and columns of A, where E is the residual data matrix, and where X* refers to the partially reconstructed data matrix. The matrix X* can be regarded as a least squares approximation of the original matrix X by means of a reduced number of latent variables r* out of a total of r [4]. If we retain only the first factor in the current example of concentrations of trace elements in atmospheric samples, we obtain:
103
0.753
X* =U* A* V * j
0.343 0.302
(0.626) (0.371 0.690 0.622) =
0.473
0.037
0.074 -0.104
0.175 0.325 0.294
0.212 0.399 0.190
0.080 0.148 0.134
0.072 0.133 0.155
-0.008 -0.015
0.021
0.070 0.130 0.118
0.036 0.063 0.213
-0.034 -0.068
0.095
0.110 0.204 0.184
0.078 0.141 0.273
-0.032 -0.063
0.089
= X-E The residual matrix E contains the contributions of the remaining second factor to the concentrations of the trace elements at the various wind directions. If we disregard E, we have to rewrite the reconstruction formulae (eqs. (31.19), (31.21) and (31.22)) accordingly:
c: =s* s*"^
with a = 1
c;=L*L*T
with P = 1
T *T X* = S** L'
with a + P = 1
(31.25)
A measure for the goodness of the reconstruction is provided by the relative contribution y of the retained latent vectors to the global sum of squares c (eq. (31.8)):
IK 1^ (31.26)
I^ In our example, we find that y = 0.895 for r* = 1. We will discuss various methods which can guide in the choice of the number of relevant latent vectors r* in Section 31.5.
104
31.2 Geometrical interpretation 37.2./ Line of closest fit In the previous section we have developed principal components analysis (PCA) from the fundamental theorem of singular value decomposition (SVD). In particular we have shown by means of eq. (31.1) how an nxp rectangular data matrix X can be decomposed into an nxr orthonormal matrix of row-latent vectors U, a p x r orthonormal matrix of column-latent vectors V and an rxr diagonal matrix of latent values A. Now we focus on the geometrical interpretation of this algebraic decomposition. In Chapter 29 we introduced the concept of the two dual data spaces. Each of the n rows of the data table X can be represented as a point in the p-dimensional column-space S^. In Fig. 31.2a we have represented the n rows of X by means of the row-pattern P^. The curved contour represents an equiprobability envelope, e.g. a curve that encloses 99% of the points. In the case of multinormally distributed data this envelope takes the form of an ellipsoid. For convenience we have only represented two of the p dimensions of 5^ which is in reality a multidimensional space rather than a two-dimensional one. One must also imagine the equiprobability envelope as an ellipsoidal (hyper)surface rather than the elliptical curve in the figure. The assumption that the data are distributed in a multinormal way is seldom fulfilled in practice, and the patterns of points often possess more complex structure than is shown in our illustrations. In Fig. 31.2a the centroid or center of mass of the pattern of points appears at the origin of the space, but in the general case this needs not to be so. Similarly, Fig. 31.2b shows the column-pattern P^ of the p columns of the data table X by means of an elliptical envelope in the dual n-dimensional row-space S"^. The ellipses should be interpreted as (hyper)ellipsoidal equiprobability envelopes of multinormal data. In practice the data are rarely multinormal and the centroid (or center of mass) of the pattern does not generally appear at the origin of space. An essential feature is that the equiprobability envelopes are similarly shaped in Figs. 31.2a and b. The reason for this will become apparent below. Note that in the previous section we have assumed by convention that n exceeds /?, but this is not reflected in Figs. 31.2a and b. In Fig. 31.2a we have represented the rth row x, of the data table X as a point of the row-pattern P" in column-space S^. The additional axes Vj and YJ correspond with the columns of V which are the column-latent vectors of X. They define the orientation of the latent vectors in column-space S^. In the case of a symmetrical pattern such as in Fig. 31.2, one can interpret the latent vectors as the axes of symmetry or principal axes of the elliptic equiprobability envelopes. In the special case of multinormally distributed data, v^ and V2 appear as the major and minor
105
®
©
Fig. 31.2. Geometrical example of the duality of data space and the concept of a common factor space, (a) Representation of n rows (circles) of a data table X in a space S^ spanned hyp columns. The pattern P" is shown in the form of an equiprobability ellipse. The latent vectors V define the orientations of the principal axes of inertia of the row-pattern, (b) Representation of/? columns (squares) of a data table X in a space S" spanned by n rows. The pattern P^ is shown in the form of an equiprobability ellipse. The latent vectors U define the orientations of the principal axes of inertia of the column-pattern, (c) Result of rotation of the original column-space S^ toward the factor-space 5'' spanned by r latent vectors. The original data table X is transformed into the score matrix S and the geometric representation is called a score plot, (d) Result of rotation of the original row-space S" toward the factor-space 5'^ spanned by r latent vectors. The original data table X^ is transformed into the loading table L and the geometric representation is referred to as a loading plot, (e) Superposition of the score and loading plot into a biplot.
106 axes of symmetry of the ellipse. In the general case, Vj and V2 are called axes of inertia, because they account for a maximal part of the global sum of squares of the data which we have denoted by the symbol c in eq. (31.8). Note that inertia of a pattern of points can be related to the sum of squares of their coordinates, if we assume that the masses of the points are all equal. If we project a particular row x^ upon VJ, then from eq. (31.13) we deduce that the projection is at a distance s^^ from the origin: x^
y,=s,,
or more generally for all the rows: Xv, =Si
(31.27)
Since this latent vector is defined as the vector for which the sum of squares of the projections is maximum (eq. (31.5)), we can interpret Vj as an axis of maximal inertia: s^ s, =v;^ X^ Xvi ^v?' C^ Vi =\\
(31.28)
Hence ^^ is the fraction of the total sum of squares (or inertia) c of the data X that is accounted for by Vj. The sum of squares (or inertia) of the projections upon a certain axis is also proportional to the variance of these projections, when the mean value (or sum) of these projections is zero. In data analysis we can assign different masses (or weights) to individual points. This is the case in correspondence factor analysis which is explained in Chapter 32, but for the moment we assume that all masses are identical and equal to one. We now consider a subspace of SP which is orthogonal to Vj and we repeat the argument. This leads to V2, and in the multidimensional case to all r columns in V. By the geometrical construction, all r latent vectors are mutually orthogonal, and r is equal to the number of dimensions of the pattern of points represented by X. This number r is the rank of X and cannot exceed the number of columns /? in X and, in our case, is smaller than the number of rows in X (because we assume that n is larger than/?). One of the earliest interpretations of latent vectors is that of lines of closest fit [9]. Indeed, if the inertia along Vj is maximal, then the inertia from all other directions perpendicular to Vj must be minimal. This is similar to the regression criterion in orthogonal least squares regression which minimizes the sum of squared deviations which are perpendicular to the regression line (Section 8.2.11). In ordinary least squares regression one minimizes the sum of squared deviations from the regression line in the direction of the dependent measurement, which assumes that the independent measurement is without error. Similarly, the plane formed by Vj and V2 is a plane of closest fit, in the sense that the sum of squared deviations perpendicularly to the plane is minimal. Since latent vectors v^ contribute
107
in decreasing order to the global sum of squares (as measured by their contributions X\), we obtain increasingly better approximations to the data structure X by means of the (hyper)planes of closest fit which are defined by V. Usually we will not include all r latent vectors in such a regression model. If a large fraction of the total inertia is accounted for by the r* first latent variables, where r is much smaller than p, then we have achieved a substantial dimension reduction. The question of how many latent variables should be included is discussed in Section 31.5. The same geometrical considerations can be applied to the dual representation of the column-pattern P^ in row-space S"- (Fig. 31.2b). Here Uj is the major axis of symmetry of the equiprobability envelope. The projection of theyth column Xy of X upon Uj is at a distance from the origin given by: xj
n,=lj,
or more generally from eq. (31.15) for all the columns: X^ Ui = l i
(31.29)
Since the latent vector maximizes the sum of squares of the projections (eq. (31.5)) we can also interpret u^ as an axis of inertia: i;^ l,=vij
XX^ Ui = u ^ C, n,=X]
(31.30)
Note that the inertia X] loaded by Uj in S"^ is the same as that loaded by Vj in S^. For this reason, we must consider Uj and Vj as two different expressions of one and the same latent vector. The former is developed in S^ while the latter is constructed in 5^. It is important to note that the score vector Sj results from the projection of X upon Vj. While Vj is a latent vector in S^ (because it has p elements), Sj is a vector in 5" (because it possesses n elements). In Chapter 29 we have shown that the pattern P" in 5^, when projected upon the axis represented by Vj, produces an image thereof in the dual space S"^ at the point represented by Sj. If we normalize Sj (by division with the corresponding latent value X^) we obtain the latent vector Uj. Similarly, the loading vector Ij results from the projection of the pattern P^ in S^ upon Uj. While the latent vector Uj is in 5" (because of its n elements), Ij is in 5^ (because it hasp elements). By the same argument as employed above, we find that the pattern P^ in 5"^, when projected upon the axis represented by Uj, produces an image thereof in the dual space 5^ at the point represented by Ij. Normalization of Ij (by division with X^) yields the latent vector Vj. This is the geometrical basis for the transition formulae which have been discussed before (eq. (31.17)) and for the NIPALS algorithm which is used for the calculation of singular vectors and which is explained in Section 31.4.
108
Once we have obtained the projections S and L of X upon the latent vectors V and U, we can do away with the original data spaces S^ and S"^. Since V and U are orthonormal vectors that span the space of latent vectors S^, each row / and each columny of X is now represented as a point in S\ as shown in Figs. 31.2c and d. The coordinate axes of this space S^ are oriented along the axes of inertia of the equiprobability envelopes. In the case of multinormally distributed data the semiaxes of the (hyper)ellipsoids have lengths equal to the corresponding latent values in A. The lengths of these semi-axes can also be interpreted as the standard deviations of the corresponding latent vectors. Since the scores and the loadings are associated with the same latent values, we obtain that the semi-axes of the ellipses in Figs. 31.2a and b are identical. The geometrical interpretation of PCA, as expressed by eqs. (31.13) and (31.15) can thus be seen as an orthogonal rotation of the original data spaces 5^ and S"" into 5^ The rotation is called orthogonal because the rotation matrices V and U are orthonormal (eq. (31.2)). Orthogonal rotations do not change Euclidean distances between the points in the patterns. In particular, the total inertia of the pattern is invariant under orthogonal rotation, a property which has been defined already in eq. (31.8). Fig. 31.2c is traditionally called a score plot, while Fig. 31.2d is called a loading plot. Since U and V express one and the same set of latent vectors, one can superimpose the score plot and the loading plot into a single display as shown in Fig. 31.2e. Such a display was called a biplot (Section 17.4), as it represents two entities (rows and columns of X) into a single plot [10]. The biplot plays an important role in the graphic display of the results of PCA. A fundamental property of PCA is that it obviates the need for two dual data spaces and that instead of these it produces a single space of latent variables. 31.2.2 Distances It has been shown in the previous section that there is an infinity of decompositions of a data table X into scores S and loadings L depending upon the choice of the factor scaling coefficients a and p (eqs. (31.9) and (31.11)). The most current settings of a and (3, however, are 0,0.5 and 1. It has been noted that when a equals 1, distances between the representations of the rows are preserved in the score plot. In the case when p equals 1, distances between the representations of the columns are preserved in the loading plot. In the score plot of Fig. 31.3a we represent the Euclidean distance between rows / and i' by (i,,-, while their angular distance, as seen from the origin, is denoted by i^^-. Euclidean distances from the origin of the rows / and /' are indicated as d^ and J,, respectively. As shown in Section 9.2.3, this means in algebraic terms that:
109
df=sj
s,=xj
x.=c,
(31.31)
where the vectors sj and sj^ are the / and i'-th row of the score matrix S, where the vectors x^ and xj^ are the / and /'-th row of the matrix X and where C^ is the matrix of cross-products between rows of X (eq. (31.3)). The above relations can also be derived from eq. (31.13) and by taking account of the orthonormality of singular vectors. In particular, c^^ and C,Y represent the sums of squared elements of the rowvectors Xj and Xj/. Consequently, d^ and d^^ are the norms (or lengths) of the ith and i'-th rows, respectively. The relation that links the three expressions of eq. (31.31) together is called the triangle equation or cosine rule: dl = df + df^ - Id^d^. cos ^ii.
(31.32a)
and from eq. (31.31) we can also derive that: dl.=c,+c,,-2c,,
(31.32b)
where 0^^ is the sum of cross-products x] x,. between the / and i-\h rows of X. The term cos i^- is a measure for the association (or similarity) between the rows / and I. From the score plot in Fig. 31.3a we learn that it represents the angular distance between the points / and i as seen from the origin of space. This measure of association is often used to define similarities between objects in cluster analysis (Chapter 30). In the current example we can calculate the distances of the wind directions from the origin either from the data X, from the scores S or from the sums of cross-products C^. In the case of wind direction 0 we obtain: t/f^i =s;?Li s^.^i =(0.472^ +0.132^) = 0.240 = x^^i x.^i =(0.212^ +0.399^ +0.190^) = 0.240 ^ c , . i , =0.240 or J,^i =0.490 The distances between wind directions can be calculated from the sums of crossproducts in C„. In the case of wind directions 0 and 90° we derive: dl'=n =(^ii=u +^/r=22 -2c,.,.,=i2 =0.240 +0.047 - 2(0.098) = 0.092 or d..^^^2 =0.303
110
@
a=l
®
P=i
S^
>v
->
a + p=l
©
s^
—7
0
@
a+p=l
/N S"^ s / •P / /
s O
U 1
y\
0
/ •
©
,/
./ L
/
^
a+p=l
•
®
a+p=l /N
s-^
>
A^
S^
s, X ->
^
Fig. 31.3. (a,b) Reproduction of distances D and angular distances 0 in a score plot (a = 1) or loading plot (p = 1) in the common factor-space 5^ (c,d) Unipolar axis through the representation of a row or column and through the origin 0 of space. Reproduction of the data X is obtained by perpendicular projection of the column- or row-pattern upon the unipolar axis (a + P = 1). (e,0 Bipolar axis through the representation of two rows or two columns. Reproduction of differences (contrasts) in the data X is obtained by perpendicular projection of the column- or row-pattern upon the bipolar axis (a + p = 1).
Ill
The distances are reported in the score plot of Fig. 31.1a. By analogy we obtain for the distances from the origin d^ and dj^ of the columns^, / and for the distance J^y/between these columns: 4,=(l,-I,,)^(l,-V) = (x,-x,0^(x,-v) d^-^h=^^j='jj
(31-33)
where the vectors 1J and 1 J. are they and/-th rows of the loading matrix L, where the vectors Xj and Xy/ are the j and /-th columns of the matrix X and where C^ represents the matrix of cross-products between the columns of X (eq. (31.3)). In particular, Cjj and Cjy represent the sums of squared elements of the jth and y'-th columns of X respectively. The above relations can also be derived from eq. (31.15) and by taking account of the orthonormality of singular vectors. Here dj and di^ are called the norms (or lengths) of they andy'-th columns of X respectively. The triangle equation or cosine rule links the above expressions in eq. (31.33) together: d].. = dj + d} - Idjdy COS ^^y
(31.34a)
and from eq. (31.33) we can also derive that:
where Cjj^ is the sum of cross-products xJ Xy between the y and y'-th columnvariables of X. The term cos i^^^/ is a measure of association (or similarity) between these column-variables. From Fig. 31.3b we can see that it can be obtained from the angular distance between the pointsy andy" as seen from the origin of space. This measure of association is closely related to the coefficient of correlation between measurements (Section 8.3). From the current example we can calculate the distances from the origin of the trace elements, either from the data X, from the loadings L (with p = 1) or from the sums of cross-products C^. In the case of Na: d%^ ^ I j l i I,.^i =(0.232^ +0.060^) = 0.057 = x]^, x.^i =(0.212^ +0.072^ +0.036^ +0.0782) = 0.058 = c^..ii =0.058 or rf,.^i =0.240 The distances between trace elements are calculated from the sums of crossproducts in Cp. In the case of Na and CI:
112
4-=i2 =^jj=n +^/r=22 -^^jr=\2 =0.058+ 0.200-2(0.107) = 0.044 or djj^^^2 =0.208 The distances are reported in the loading plot of Fig. 31.1b. In the following section on preprocessing of the data we will show that column-centering of X leads to an interpretation of the sums of squares and cross-products in C^ in terms of the variances-covariances of the columns of X. Furthermore, cos 'd^- then becomes the coefficient of correlation between these columns. 31.2.3 Unipolar axes The interpretation of biplots is made easier by the construction of axes in it. These axes are used in the same way as in a bivariate Cartesian diagram. Perpendicular projection of the points in the diagrams upon a coordinate axis allows us to determine (or reconstruct) the values in the table. We consider the special biplot in which both rows and columns are represented in a single display of latent variables subjected to the constraint that a + p equals 1. As we have seen above, this constraint allows us to reconstruct the original data X from which the latent variables U, V and the latent values A have been computed (eq. (31.22)). Algebraically, the reconstruction of the values of X has been defined by the matrix product of the scores S with the transpose of the loadings L (eq. (31.22)). Geometrically, one reconstructs the value x^j by perpendicular projection of the point represented by 1^ upon the axis represented by s, as shown in Fig. 31.3c: X = SLT' or sjl • =iy s^ =Xij
with / = 1,..., nandy = 1,...,/?
(31.35)
The same result can be obtained by perpendicular projection of a point s- upon an axis ly as shown in Fig. 31.3d. We call the axes through the origin and through either s, and ly unipolar axes, because these axes are defined by a single variable or pole, as opposed to bipolar axes which are discussed below. Each row and each column of a table X may thus be seen as a pole through which unipolar axes can be drawn. Perpendicular projection of the row-pattern P" upon a unipolar axis through columny reproduces Xy. Likewise, perpendicular projection of the column-pattern P^ upon a unipolar axis through row / reproduces x,. In practical applications of biplots one may add tick marks to unipolar axes in order to facilitate their reading.
113
Figure 31.4 shows the biplot of the trace elements and wind directions for the case when a = p = 0.5. Since here we have that a -H (3 equals 1, we can reconstruct the values in the columns of the data table X by means of perpendicular projections upon unipolar axes. In Fig. 31.4a we have drawn a unipolar axis through CI. Perpendicular projection of the four wind directions upon this axis reconstructs the order of the concentrations of CI at the four wind directions as listed in Table 31.1. Now we have established a way which leads back from the graphic display to the tabulated data. This interpretation of the biplot emphasizes the one-to-one relationship between the data and the plot. Such a relationship is also inherent in the ordinary bivariate (or Cartesian) diagram. The distance from the origin d^ or dj is a measure of the information contained in the corresponding row or column. If the distance is relatively large, in comparison to others, then the corresponding row or column can be seen to contribute more information (inertia, variance) to the result of the analysis. In the case when the distance is zero, then the row or column carries no information (inertia, variance) since all its elements are zero. In the case when not all r latent variables have been retained, the reproduction will be approximate and we have to rewrite eq. (31.35) in the form: sY \]=\y
s* =jc*.
(31.36)
where the asterisk denotes that only the first r* latent variables have been included in the reconstruction of the data X, and where r* < r. 31.2.4 Bipolar axes A bipolar axis is defined by two rows s, and s^/ in Fig. 31.3e. If we project a column Ij upon this bipolar axis, we reconstruct the difference between two elements in X. This follows readily from eq. (31.22): (s,.-s,)^l;=sn,-sn,.=^,-^,,,
(31.37)
This bipolar axis defines a contrast between the row-variables / and /'. Note that a contrast involves the difference of two quantities. By virtue of the symmetry between scores and loadings, we can also construct bipolar axes through two columns ly and 1^^ such as is shown in Fig. 31.3f. When we project a row s^ upon this bipolar axis we construct a difference between two elements in X. The proof follows readily from eq. (31.22): sjilj-l,)
= sjlj-sllj,=x,-x,,
(31.38)
This bipolar axis defines a contrast between the column-variables7 and/. In Fig. 31.4b we have constructed a bipolar axis through CI and Si. Perpendicular projections of the wind directions upon this axis are in the same order as the
114
©
a = p = 0.5
6 180
®
O 270
a = P = 0.5
.5n
/
i Na
/
-.5
90^
.5
? 270.'
180
/ .5 -• Fig. 31.4. (a) Biplot in which the concentrations of an atmospheric trace element (CI) are reconstructed by perpendicular projection upon a unipolar axis, (b) Biplot in which the differences (contrasts) between two atmospheric trace elements (CI, Si) are reproduced by perpendicular projection upon a bipolar axis.
115
differences of the Si and CI concentrations in Table 31.1. Similar geometric operations allow us to reconstruct the values and the contrasts (differences) in the rows of the data table X. It is also possible to add tick marks to the axes such that the values and contrasts can be read off immediately. This is done by calibration of the axes, specifically by relating the distances on the axes from an arbitrary origin to the values in the table by means of linear regression. It is important to realize that the number of contrasts increases with the square of the number of variables {n or/?). Some contrasts, however, are more important than others and many of them are linearly dependent. To evaluate the importance of a contrast we have to examine the distances d^^^ and djy (Figs. 31.3a and b). If these are large in comparison to others, then the corresponding contrasts are more important. Of course, a distance of zero means that the contrasts are non-existent, i.e. the corresponding rows or columns in X are identical or differ only by a constant value. 31.3 Preprocessing Preprocessing is the operation which precedes the extraction of latent vectors from the data. It is an operation which is carried out on all the elements of an original data table X and which produces a transformed data table Z. We will discuss six common methods of preprocessing, including the trivial case in which the original data are left unchanged. The effects of each of these six types of preprocessing will be illustrated numerically by means of the small 4x3 data table from the study of trace elements in atmospheric samples which has been used in previous sections (Table 31.1). The various effects of the transformations can be observed from the two summary statistics (mean and norm). These statistics include the vector of column-means m^ and the vector of column-norms d^ of the transformed data table Z: 1 ""
^j=-^Zij
withy = 1, ...,/?
1 ^
i
and the vector of row-means m^ and the vector of row-norms d„: 1 ^ m^ = — y Zii PJ I ^
^N-E4 P
j
with / = 1, ..., n
(31.39b)
116
In our notation m^ represents an element of the vector m^ and rrij is an element of the vector m^. The same convention applies to the elements d-, dj and the corresponding vectors d„, d^. The global mean m and the global norm d are defined as follows: 1
n
^P
i
p
J
d'=-tt4
(31.40)
^Pi J Note that a norm is defined here as a root mean square rather than as a sum of squares, because of the division by n and/7. This greatly simplifies our notation. The vector of column-means m^ defines the coordinates of the centroid (or center of mass) of the row-pattern P^ that represents the rows in column-space S^. Similarly, the vector of row-means m„ defines the coordinates of the center of mass of the column-pattern P^ that represents the columns in row-space 5". If the column-means are zero, then the centroid will coincide with the origin of 5^ and the data are said to be column-centered. If both row- and column-means are zero then the centroids are coincident with the origin of both 5'^ and 5^. In this case, the data are double-centered (i.e. centered with respect to both rows and columns). In this chapter we assume that all points possess unit mass (or weight), although one can extend the definitions to variable masses as is explained in Chapter 32. The vector of column-norms d^ of the table defines the weighted distances of the points in P^ from the origin of space in S^, Similarly, the vector of row-norms d^ represents the weighted distances of the points in P"^ from the origin of space in S^. If all column-norms are equal then the pattern PP will appear on a (hyper)sphere about the origin of 5". In this case the data are said to be column-normalized. If, in addition, the center of mass of P^ is at the origin of S^, then we deal with column-standardized data (i.e. column-centered and column-normalized). Note that the norm is also proportional to the length of the vectors, i.e. the distance of its endpoint from the origin of space. These concepts are explained in Chapter 29. By way of graphical example of the various algebraic and geometrical concepts that are introduced in this chapter, we will make use of a measurement table adapted from Walczak etal. [II]. Table 31.2 describes 23 substituted chalcones in terms of eight chromatographic retention times. Chalcone molecules are constituted of two phenyl rings joined by a chain of three-carbon atoms which carries a double bond and a ketone function. Substitutions have been made on each of the phenyl rings at the para-positions with respect to the chain. The substituents are CF3, F, H, methyl, ethyl, /-propyl, r-butyl, methoxy, dimethylamine, phenyl and NO2. Not all combinations two-by-two of these substituents are represented in the
117 TABLE 31.2 Chromatographic retention times of 23 doubly substituted chalcones, as determined by 8 HPLC chromatographic methods using heptane as the mobile phase. The methods differ by addition of 0.5 percent of a chemical modifier. DMF
Mean
2.98 2.20 2.26 3.00 4.66 3.74 2.38 2.67 4.33 6.24 4.84 6.81 6.59 11.90 16.85 14.39 27.62 18.04 24.38 21.27 41.17 39.34 42.38
2.85 1.53 1.62 2.80 4.15 3.40 1.90 2.25 3.22 4.98 3.82 5.62 5.76 10.50 12.82 11.41 20.24 15.28 18.94 17.12 26.35 21.80 26.48
1.66 1.68 1.71 2.02 2.68 2.17 1.83 2.01 2.58 2.98 3.39 5.13 5.17 7.01 6.76 8.77 10.24 9.18 10.50 13.28 15.10 13.96 18.76
13.48
9.78
6.46
X-Y
CH2CI2
THE
Dioxan
Ethanol
Propano][ Octanol DMSO
H-CH3 H-tBu H-iPr H-H F-H H-F H-Et H-Me F-Me F-F
2.02 2.99 3.00 2.98 3.56 2.82 3.12 3.28 3.87 3.47 5.86 9.00 9.12 10.34 6.76 14.58 10.15 12.15 11.93 22.19 15.94 14.32 26.07
1.78 2.24 2.24 2.39 2.94 2.38 2.34 2.52 3.00 2.85 4.59 6.62 6.76 8.22 5.60 10.72 8.11 9.82 9.81 16.08 12.67 12.43 19.28
1.72 2.30 2.29 2.31 2.81 2.30 2.43 2.57 3.06 2.92 4.43 6.68 6.74 7.59 5.92 10.68 7.10 9.42 9.40 15.79 12.74 12.01 18.93
0.62 0.64 0.66 0.78 0.98 0.83 0.70 0.81 0.91 1.05 0.96 1.78 1.75 2.03 1.80 2.19 2.68 2.50 2.68 3.44 3.36 3.52 4.24
0.57 0.59 0.62 0.76 0.92 0.76 0.68 0.76 0.87 0.96 0.93 1.58 1.63 2.02 1.66 2.08 2.37 2.30 2.59 3.48 3.08 3.20 3.92
0.78 0.92 0.96 1.17 1.40 1.13 1.08 1.22 1.38 1.39 1.68 2.92 3.01 3.48 2.68 4.08 3.68 3.95 4.27 6.85 5.48 5.04 8.78
8.67
6.76
6.61
1.78
1.67
2.93
Me-Phe MeO-Me Me-MeO F-MeO H-NO2 MeO-Phe F-NO2 N02-Me NO2-H MeO-MeO MeO-NOj NO2-F NMe2-N02 Mean
measurement table. Retention times have been observed by high-performance liquid chromatography (HPLC) using heptane as the mobile phase with additions of 0.5% of one of eight different modifiers: CH2CI2, tetrahydrofurane (THF), dioxane, ethanol, propanol, octanol, dimethylsulfoxide (DMSO) and dimethylformamide (DMF). We expect to discover some structure in the set of methods (measurements) as well as in the set of substituted compounds (objects). Furthermore, we want to investigate the specific interactions of substituents for one or more of the modifiers. Biplots constructed from this table are shown in Figs. 31.5 to 31.11. The horizontal and vertical axes of these biplots represent scores and loadings of the
118
first two latent variables as defined by eqs. (31.9) and (31.11), using the compromise a = P = 0.5. This particular choice of a and (3 ensures that projections on unipolar and bipolar axes can be carried out, while compromising on the representation of distances between rows and between columns. The relative inertias contributed by the two latent vectors are indicated alongside the horizontal and vertical axes. In some cases we have reflected one or two of these axes such as to make the individual biplots comparable with each other. Rows of Table 31.2 refer to the 23 substituted chalcones which are represented on the biplots by means of graded circles. Areas of the circles are proportional to the mean retention times of the corresponding compounds as obtained with the eight chromatographic methods. Here large areas of circles correspond with compounds whose average retention times are large. Columns of the table refer to the chromatographic methods (heptane plus 0.5% of one of the different modifiers). They are represented by means of graded squares on the biplot. Areas of these squares are proportional to the mean retention times produced with the corresponding methods by the 23 substituted chalcones. Large areas of squares correspond with methods that on the average tend to produce large retention times in the 23 compounds. The areas of circles and squares provide an indication of the average size or importance of the objects and measurements described by the table, i.e. average retention time of compounds and average retention time with chromatographic methods. 31.3.1 No transformation The case in which the original data are left unchanged can be simply defined by means of: Zij-x^j
with / = 1, ..., n andy = 1, ...,/7
(31.41)
Generally when no preprocessing is applied, we will find that all means and norms are distinct as in Table 31.3. TABLE 31.3 Atmospheric data from Table 31.1, after no transformation. The means m and norms d have been computed column-wise, row-wise and globally over all elements of the table
Na
CI
Si
m„
d„
0 90 180 270
0.2120 0.0720 0.0360 0.0780
0.3990 0.1330 0.0630 0.1410
0.1900 0.1550 0.2130 0.2730
0.2670 0.1200 0.1040 0.1640
0.2830 0.1250 0.1299 0.1830
m^,
0.0995 0.1199
0.1840 0.2240
0.2078 0.2121
0.1638
—
—
0.1911
d.
119
— 9r'N02
)HmO-H02
^Q
# a-N02 DMFfH H02-H PROPAJfOL B-CF3\
B-lPr^-B^
»™^^^
2 ^kiatm2-lK)2 .H.-PH.
^,_^^ DIOXAH^TBT
«-°"^tM.-M*0
^MmO-MmO
97%
Fig. 31.5. Biplot of 23 substituted chalcones (circles) and 8 chromatographic methods (squares) as described by their retention times in Table 31.2, after no transformation of the data. Areas of circles and squares are related to the mean retention times of the corresponding compounds and methods, such as they appear in the margins of the table.
Figure 31.5 shows the biplot of the 23 substituted chalcones and eight chromatographic methods as described by the retention times in Table 31.2. The first latent variable accounts for 97% of the inertia, while the second one carries about 3%. A small fraction of the inertia (less than 0.3%) is distributed over the other latent variables. The origin of the space is represented by a small cross and appears at the far left of the biplot. The two centroids (of compounds and of methods) are manifestly at a large distance from the origin. Note that the circles and squares are ordered by increasing areas along the horizontal axis. We conclude from this, that the first latent variable represents a strong size component. It accounts for the size of compounds and measurements as appears from the marginal means of Table 31.2,
31.3.2 Column-centering Column-centering is a customary form of preprocessing in principal components analysis (Section 17.6.1). It involves the subtraction of the corresponding column-means m^ from each element of the table X:
120 TABLE 31.4
Atmospheric data from Table 31.1, after column-centering. Na
CI
Si
^n
dn
0 90 180 270
0.1125 -0.0275 -0.0635 -0.0215
0.2150 -0.0510 -0.1210 -0.0430
-0.0178 -0.0528 0.0053 0.0653
0.1032 -0.0437 -0.0598 0.0003
0.1405 0.0452 0.0790 0.0468
m^
0 0.0669
0 0.1278
0 0.0430
^
^iJ ^ "^(Z -m. ~
with / = ! , . . . , n andy = 1, ..'^P
0
—
0.0869
(31.42)
with 1 "" m After this transformation we find that the column-means m^ are zero as shown in Table 31.4. In this case the column-norms d^ are called column-standard deviations. The square of these numbers are the column-variances, whose sum represents the global variance in the data. Note that the column-variances are heterogeneous which means that they are very different from each other. The column-centered biplot of the chromatographic data is shown in Fig. 31.6. The first and second latent variables contribute 94% and 5% to the total inertia of the data (or total variance in this case). About 0.5% is distributed over the residual latent variables. Here the centroid of compounds coincides with the origin of space (indicated by the small cross near the middle of the plot). The centroid of the methods is not at the origin, since all methods are projected at the right side of the origin on Fig. 31.6. This is indicative of a strong size component, which is also evidenced by the ordering of the compounds according to increasing average retention time. This observation can also be made by looking at the areas of the circles which increase from left to right. Inspection of the positions of the chromatographic methods on the biplot of Fig. 31.6 shows that the three alcohols (ethanol, propanol and octanol) contribute relatively little to the global inertia (variance) of the data, since their distances from the origin are small in comparison to those of DMSO, DMF, THF, dioxane and methylenedichloride (CH2CI2). Judging from the angular distances, as seen from the origin, we find rather strong correlations between DMSO and DMF, on the one
121
P* \
X
Hm0-K02
00
-W02
\.
9'-
2 4
#
fl-N02
B-r
fl-fl,*
f
^^j^
W mC''^<'^^^
«^
Nv B-cr3*
N02-F^
V02-H^
•'-'
^''[UDMT
mr-H
X ^
/^20
1
..^^-^^^^^^^ ^^ /N
^^STHANOL M0-Ph« ^ . . ^ ^ ^ " ^ j^o ^«QP-w«>^ i ^ O ^ ^ ^ ^ - * ^
/
•1>^
^"'^^'^^'^
5
/
# rHtoONP^^^^*''^^^ \ I^
0
^'^'^N
/
A
^''•>
10
1
*"*,
lfi«A2-V02^k
M 0M0O-Ph«
\^^^/
CH2Cl2j3^
5 20 22
/^^ /
N ^ 24
WoO-Ate<J^
X
34 »
Fig. 31.6. Biplot of chromatographic retention times in Table 31.2, after column-centering of the data. Two unipolar axes and one bipolar axis have been drawn through the representations of the methods DMSO and methylenedichloride (CH2CI2). The projections of three selected compounds are indicated by dashed lines. The values read off from the unipolar axes reproduce the retention times in the corresponding columns. The values on the bipolar axis reproduce the differences between retention times.
hand, and between dioxane, THF and methylenedichloride, on the other hand. The correlation between the three alcohols cannot be established accurately on this plot because of their proximity to the origin. There are two outstanding poles on this biplot. DMSO and dimethylchloride are at a large distance from the origin and from one another. These poles are the most likely candidates for the construction of unipolar axes. As has been explained in the previous section, perpendicular projections of points (representing compounds) upon a unipolar axis (representing a method) leads to a reproduction of the data in Table 31.3. In this case we have to substitute the untransformed value x^^ in eq. (31.35) by z^. of eq. (31.42):
s7i 7
~'^ij
-^ij
^j
(31.43)
Since m^ is a constant for the given unipolar axis through the yth method, we obtain that the projections on this axis are equal to x^j minus a constant. The unipolar axis through the origin and DMSO reproduces rather well the data in the corresponding column of Table 31.2. By perpendicular projection of the
122
center of the circles we can read off the approximate retention times on this unipolar axis. For example, the compound with substituents F-NO2 projects at about a value of 28 which compares well with the tabulated value of 27.62. The DMSO axis separates compounds with NO2 and methoxy substituents from the others. The unipolar axis through methylenedichloride also reproduces the data in the corresponding column of Table 31.2. This additive seems to selectively delay the elution of compounds with dimethylamine and some methoxy substituents. Finally, we have constructed a bipolar axis through DMSO and methylenedichloride. By perpendicular projection of the centers of the circles upon this bipolar axis we obtain the differences in retention times obtained respectively with DMSO and with methylenedichloride. Using a similar reasoning as developed above for unipolar axes we can perform a substitution of eq. (31.42) in eq. (31.38) which leads to: sj{\j
-lj.)
= Zij -Zij'=Xij -x.y-{mj
-my)
(31.44)
This shows that the bipolar axis reproduces the difference x^j - x- between method7 a n d / of the table minus a constant (nij - m^). This bipolar axis defines a clear contrast between the N02-substituted compounds and the others. For example, projection of the compound with substituents dimethylamine-N02 defines a DMSO-methylenedichloride contrast of about 16 (Fig. 31.6), where the real difference is 42.38 - 26.07 or 16.31 (Table 31.2). Graphical estimation of the same contrast for the MeO-MeO compound yields a value o f - 1 , which compares well with the difference between tabulated data of 21.27 - 22.19 or -0.92. 31.3.3 Column-standardization Column-standardization is the most widely used transformation. It is performed by division of each element of a column-centered table by its corresponding column-standard deviation (i.e. the square root of the column-variance): (Xij -JHj)
Zij = -^
—
with / = 1, ..., n and; = 1,..., p
(31.45)
with
d]=^l(^,-mjr In the context of data analysis we divide by n rather than by (n - 1) in the calculation of the variance. This procedure is also called autoscaling. It can be verified in Table 31.5 how these transformed data are derived from those of Table 31.4.
123 TABLE 31.5 Atmospheric data from Table 31.1, after column-standardization. Na 0 90 180 270
1.681 -0.411 -0.949 -0.321
m^ d.
0 1
CI 1.683 -0.399 -0.947 -0.337 0 1
Si
in„
d.
-0.413 -1.228 0.122 1.519
0.984 -0.679 -0.591 0.287
1.394 0.782 0.777 0.917
0 1
0
—
—
1
In the corresponding column-standardized biplot of Fig. 31.7 we find all representations of the eight chromatographic methods more or less at the same distance from the origin of space. The circle is distorted because of the large difference between the contributions of the first and second latent variables (95 and 4%) and the choice of a = (3 = 0.5 which has been made at the outset. The combined effect is an apparent dilation of the vertical axis. The distances between compounds in Fig. 31.7 are not notably affected by the transformation in comparison with the previous Fig. 31.6. This biplot allows more easily to perceive the correlations between measurements. Three clusters are now put in evidence, namely (1) DMSO and DMF, (2) ethanol and propanol, (3) octanol, dioxane, THF and methylenedichloride. The line segments drawn from the origin have been added to emphasize these groupings. Unipolar axes could have been defined here in the same way as in Fig. 31.6. Bipolar axes on the column-standardized biplot, however, cannot be interpreted directly in terms of the original data in X. 31.3.4 Log column-centering The transformation by log column-centering consists of taking logarithms followed by column-centering. The choice of the base of the logarithms has no effect on the interpretation of the result, but decimal logs will be used throughout. y^j = log^.) z.. = y..~m^ with 1 ""
with / = 1,..., n andj = 1,...,/?
(31.46)
124
4%
# a-N02
• B-r
• a-LPr
• ' ^ '^
B-Hm B-Et
Fig. 31.7. Biplot of chromatographic retention times in Table 31.2, after column-standardization of the data. Unipolar semi-axes have been drawn through all points representing methods. The particular arrangement of the methods is indicative for the presence of a strong size component in the data.
In this case it is required that the original data in X are strictly positive. The effect of the transformation appears from Table 31.6. Column-means are zero, while column-standard deviations tend to be more homogeneous than in the case of simple column-centering in Table 31.4, as can be seen by inspecting the corresponding values for Na and Cl. In the log column-centered biplot of Fig. 31.8 one observes that the centroid of the compounds coincides with the origin. Also, the chromatographic methods are at a more uniform distance from the origin than is the case with simple column-centering in Fig. 31.6. The effect of log column-centering is to reduce the heterogeneity of the variances. The result is close to that of column-standardization in Fig. 31.7. While the logarithmic function reduces the effect of large values, it enhances that of the smaller ones. In the chromatographic application this is an advantage, as appears from the widening of the cluster of compounds with the substituents CF3, F, H, methyl, ethyl, i-propyl, t-butyl on the left side of Fig. 31.7. With log column-centering we obtain unipolar axes by substituting eq. (31.46) in eq. (31.35):

ẑ_ij = z_ij = y_ij − m_j = log(x_ij) − m_j    (31.47)
TABLE 31.6
Atmospheric data from Table 31.1, after log column-centering.

         Na        Cl        Si        m_i       d_i
0        0.4183    0.4326   -0.0297    0.2738    0.3479
90      -0.0507   -0.0445   -0.1181   -0.0711    0.0785
180     -0.3517   -0.3690    0.0200   -0.2336    0.2945
270     -0.0159   -0.0191    0.1278    0.0309    0.0751
m_j      0         0         0         0         —
d_j      0.2746    0.2853    0.0888    —         0.2343
where m_j is the logarithm of the geometric mean of the jth column in Table 31.2, which is a constant for this unipolar axis. Hence, the unipolar axis through DMSO reproduces the logarithms of the data in the corresponding column of the table. Note that the unipolar axis in Fig. 31.8 has been constructed with logarithmic subdivisions. A similar construction applies to the unipolar axis through methylenedichloride. The bipolar axis through DMSO and methylenedichloride defines logarithms of the ratio of the data in the corresponding columns. By substitution of eq. (31.46) in eq. (31.38) and after rearrangement we obtain that:

ẑ_ij − ẑ_ij' = z_ij − z_ij' = y_ij − y_ij' − (m_j − m_j') = log(x_ij / x_ij') − (m_j − m_j')    (31.48)
Consequently, the bipolar axis represents the (log) ratio of columns j and j' in the original data table, up to the term m_j − m_j', which is a constant for the given bipolar axis. Note that the axis which represents the ratio of retention times with DMSO/methylenedichloride is also divided according to a logarithmic scale. In Fig. 31.8 we determine the DMSO/methylenedichloride ratio of the dimethylamine-NO2 substituted compound at approximately 1.6, where the computed ratio from Table 31.2 is 42.38/26.07 or 1.63.

31.3.5 Log double-centering

Preprocessing by log double-centering consists of first taking logarithms and then centering the data both by rows and by columns:

y_ij = log(x_ij)
z_ij = y_ij − m_i − m_j + m    with i = 1, ..., n and j = 1, ..., p    (31.49)
Fig. 31.8. Biplot of chromatographic retention times in Table 31.2, after log column-centering of the data. The values on the bipolar axis reproduce the (log) ratios between retention times in the two corresponding columns.
with

m_i = (1/p) Σ_{j=1}^{p} y_ij,  m_j = (1/n) Σ_{i=1}^{n} y_ij  and  m = (1/(n p)) Σ_{i=1}^{n} Σ_{j=1}^{p} y_ij
It is assumed that the original data in X are strictly positive. As is evident from Table 31.7, both the row-means m_i and the column-means m_j of the transformed table Z are equal to zero. The biplot of Fig. 31.9 shows that both the centroids of the compounds and of the methods coincide with the origin (the small cross in the middle of the plot). The first two latent variables account for 83 and 14% of the inertia, respectively. Three percent of the inertia is carried by higher-order latent variables. In this biplot we can only make interpretations of the bipolar axes directly in terms of the original data in X. Three prominent poles appear on this biplot: DMSO, methylenedichloride and ethanol. They are called poles because they are at a large distance from the origin and from one another. They are also representative of the three clusters that have been identified already on the column-standardized biplot in Fig. 31.7.
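A minimal NumPy sketch of the log double-centering transformation of eq. (31.49) is given below; the function name is our own, the data are assumed to be strictly positive and decimal logs are used as in the text.

```python
import numpy as np

def log_double_center(X):
    """Log double-centering, eq. (31.49): take logs, then remove the
    row means and column means and add back the global mean."""
    Y = np.log10(np.asarray(X, dtype=float))    # strictly positive data assumed
    m_i = Y.mean(axis=1, keepdims=True)         # row means
    m_j = Y.mean(axis=0, keepdims=True)         # column means
    m = Y.mean()                                # global mean
    return Y - m_i - m_j + m                    # row and column means now zero
```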
TABLE 31.7
Atmospheric data from Table 31.1, after log double-centering.

         Na        Cl        Si        m_i       d_i
0        0.1446    0.1589   -0.3034    0         0.2146
90       0.0204    0.0266   -0.0470    0         0.0333
180     -0.1181   -0.1354    0.2536    0         0.1794
270     -0.0468   -0.0500    0.0969    0         0.0685
m_j      0         0         0         0         —
d_j      0.0968    0.1082    0.2049    —         0.1450
A bipolar axis through columns j and j' can be interpreted in the same way as in the log column-centered case (eq. (31.48)), since the terms m_i and m cancel out. The first (close to horizontal) axis between DMSO and ethanol represents the (log) ratios of the corresponding retention times. They can be read off by vertical projection of the compounds on this scale. Note that the scale is divided logarithmically. In the same way, one can read off the (log) ratios of methylenedichloride and ethanol from the second (close to vertical) axis on Fig. 31.9. Graphical estimation of these contrasts for the dimethylamine-NO2 substituted chalcone produces 9.5 on the DMSO/ethanol axis and 6.2 on the methylenedichloride/ethanol axis of Fig. 31.9. The exact ratios from Table 31.2 are 10.00 and 6.14, respectively. The first bipolar axis (DMSO/ethanol) accounts for the contrast between compounds with NO2 substitutions and those without. Compounds with a NO2 substituent systematically obtain higher scores on this bipolar axis than others. The second bipolar axis (methylenedichloride/ethanol) seems to produce an order of the substituents according to their electronic properties. To emphasize this point we have reproduced the log double-centered biplot again in Fig. 31.10. The dashed line near the middle separates the class of NO2 substituted chalcones from the other compounds. Further, we have joined substituents by line segments according to the sequence CF3, F, H, methyl, ethyl, i-propyl, t-butyl, methoxy, phenyl and dimethylamine. The electronic properties of these substituents vary progressively from electron acceptors to electron donors [11], in accordance with their scores on the second bipolar axis. The size component which may be strongly present (as in this chromatographic application) is eliminated by the operation of double-centering. Hence, double-centered latent variables only express contrasts. In column-centered biplots one may find that one latent variable expresses mainly size and the others mainly contrasts. In general, none of the latter is a pure component of size or of contrasts. If we want to see size and some contrasts represented in a biplot, column-centering is indicated.
Fig. 31.9. Biplot of chromatographic retention times in Table 31.2, after log double-centering. Two bipolar axes have been drawn through the representation of the methods DMSO, methylenedichloride (CH2Cl2) and ethanol.
If we only want to see contrasts, then double-centering is the method of choice. Compounds and chromatographic methods that are far away from the origin of the biplot possess a high degree of interaction (producing large contrasts). Those close to the center have little interaction (producing small contrasts). Compounds and methods that are at a distance from the origin and in the same direction possess a positive interaction with one another (they attract each other). Those that are opposite with respect to the origin show a negative interaction with each other (they repel each other). Interaction of compounds with methods (and vice versa) can be interpreted on the biplot by analogy with mechanical forces of attraction (for positive interaction) and repulsion (for negative interaction). The mechanical analogy also illustrates that interaction is mutual. If a compound is attracted (or repelled) by a chromatographic modifier, then that modifier is also attracted (or repelled) by the compound. This property of interactions finds an analogy in Newton's third law. In a chromatographic context, attraction is to be interpreted as an interaction between compound and modified stationary phase which results in a greater elution time of the compound. Similarly, repulsion leads to a shorter elution time of the compound.
Fig. 31.10. Same biplot of chromatographic retention times as in Fig. 31.9. The line segments connect compounds that share a common substituent. The horizontal contrast reflects the presence or absence of a NO2 substituent. The vertical contrast expresses the electronegativity of the substituents.
One can also state that the log double-centered biplot shows interactions between the rows and columns of the table. In the context of analysis of variance (ANOVA), interaction is the variance that remains in the data after removal of the main effects produced by the rows and columns of the table [12]. This is precisely the effect of double-centering (eq. (31.49)). Sometimes it is claimed that the double-centered biplot of latent variables 1 and 2 is identical to the column-centered biplot of latent variables 2 and 3. This is only the case when the first latent variable coincides with the main diagonal of the data space (i.e. the line that makes equal angles with all coordinate axes). In the present application to chromatographic data this is certainly not the case and the results are different. Note that projection of the compounds upon the main diagonal produces the size component. The transformation by log double-centering has received various names, among which are spectral mapping [13], logarithmic analysis [14], saturated RC association model [15], log-bilinear model [16] and spectral map analysis, or SMA for short [17].
31.3.6 Double-closure

A matrix is said to be closed when the sums of the elements of each row or column are equal to a constant, for example, unity. This is the case with a table of compositional data where each row or column contains the relative concentrations of various components of a sample. Such compositional data are closed to 100%. Centering is a special case where the rows or columns of a table are closed with respect to zero. Here we only consider closure with respect to unity. A data table becomes column-closed by dividing each element by the corresponding column-sum. A table is row-closed when each element has been divided by its corresponding row-sum. By analogy with double-centering, double-closure involves the division of each element of a table by its corresponding row- and column-sum and multiplication by the global sum:

z_ij = (x_ij x_++) / (x_i+ x_+j)    with i = 1, ..., n and j = 1, ..., p    (31.50)
with

x_i+ = Σ_{j=1}^{p} x_ij,  x_+j = Σ_{i=1}^{n} x_ij  and  x_++ = Σ_{i=1}^{n} Σ_{j=1}^{p} x_ij

where x_i+, x_+j and x_++ represent the vector of row-sums, the vector of column-sums and the global sum of the elements in the table X. Note that eq. (31.50) can also be written as:

z_ij = (x_ij m) / (m_i m_j)    (31.51)

where m_i, m_j and m represent the row-means, column-means and global mean of the elements of X. Double-closure is only applicable to data that are homogeneous, i.e. when measurements are expressed in the same unit. The data must also be non-negative. If these requirements are satisfied, then the row- and column-sums of the table can be thought to express the size or importance of the items that are represented by the corresponding rows and columns. Double-closure is the basic transformation of correspondence factor analysis (CFA), which is applicable to contingency tables and, by extension, to homogeneous data tables [18]. Data in contingency tables represent counts, while data in homogeneous tables need only be defined in the same unit. Although CFA will be discussed in greater detail in Chapter 32 on the multivariate analysis of contingency tables, it is presented here for comparison with the other methods of PCA which have been discussed above.
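A minimal NumPy sketch of double-closure, eqs. (31.50)-(31.51), is given below; the function name is our own, and the data are assumed to be non-negative and expressed in the same unit, as required above.

```python
import numpy as np

def double_closure(X):
    """Double-closure, eqs. (31.50)-(31.51): divide each element by its
    row- and column-sum and multiply by the global sum, i.e. divide X
    by its expected values under fixed marginal totals."""
    X = np.asarray(X, dtype=float)
    x_ip = X.sum(axis=1, keepdims=True)    # row sums    x_i+
    x_pj = X.sum(axis=0, keepdims=True)    # column sums x_+j
    x_pp = X.sum()                         # global sum  x_++
    return X * x_pp / (x_ip * x_pj)
```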
If the original data represent counts or frequencies one defines:

E(x_ij) = (x_i+ x_+j) / x_++ = (m_i m_j) / m    (31.52)
where E(x_ij) is called the expected value of x_ij under the condition of fixed marginal totals. (See Chapter 26 on 2×2 contingency tables.) For this reason the operation of double-closure can also be considered as a division by expected values. It follows that the transformed data Z in eqs. (31.50) or (31.51) can be regarded as ratios of X to their expected values E(X). In CFA the means m_i, m_j, m and norms d_i, d_j, d are computed by weighted sums and weighted sums of squares:

m_i = Σ_{j=1}^{p} w_j z_ij = 1    with i = 1, ..., n
m_j = Σ_{i=1}^{n} w_i z_ij = 1    with j = 1, ..., p
m = Σ_{i=1}^{n} Σ_{j=1}^{p} w_i w_j z_ij = 1
                                                    (31.53)
d_i^2 = Σ_{j=1}^{p} w_j z_ij^2    with i = 1, ..., n
d_j^2 = Σ_{i=1}^{n} w_i z_ij^2    with j = 1, ..., p
d^2 = Σ_{i=1}^{n} Σ_{j=1}^{p} w_i w_j z_ij^2

where the vectors of weight coefficients w_n and w_p are defined from the marginal sums:

w_i = x_i+ / x_++    and    w_j = x_+j / x_++    (31.54)
TABLE 31.8
Atmospheric data from Table 31.1, after double-closure. The weights w are proportional to the row- and column-sums of the original data table. They are normalized to unit sum.

         Na        Cl        Si        w_i       m_i      d_i
0        0.3067    0.3299   -0.4391    0.4075    0        0.3759
90      -0.0126   -0.0136    0.0181    0.1832    0        0.0155
180     -0.4303   -0.4609    0.6143    0.1587    0        0.5259
270     -0.2173   -0.2349    0.3121    0.2503    0        0.2672
w_j      0.2025    0.3744    0.4229    1         —        —
m_j      0         0         0         —         0        —
d_j      0.2821    0.3032    0.4036    —         —        0.3454
When all weight coefficients w_n and w_p are constant we have the special case:

w_i = 1/n    and    w_j = 1/p    (31.55)
which leads to the usual definition of means and norms of eq. (31.39). The effect of double-closure is shown in Table 31.8. For convenience, we have subtracted a constant value of one from all the elements of Z in order to emphasize the analogy of the results with those obtained by log double-centering in Table 31.7. The marginal means in the table are average values for the relative deviations from expectations and thus must be zero. The analysis of Table 31.2 by CFA is shown in Fig. 31.11. As can be seen, the result is very similar to that obtained by log double-centering in Figs. 31.9 and 31.10. The first latent variable expresses a contrast between NO2 substituted chalcones and the others. The second latent variable seems to be related to the electronic properties of the substituents. The combined contribution of the two latent variables to the total inertia is 96%. The double-closed biplot of Fig. 31.11 does not allow a direct interpretation of unipolar and bipolar axes in terms of the original data X. The other rules of interpretation are similar to those of the log double-centered biplot in the previous subsection. Compounds and methods that seem to have moved away from the center and in the same direction possess a positive interaction (attraction). Those that moved in opposite directions show a negative interaction (repulsion). There is a close analogy between double-closure (eq. (31.51)) and log double-centering (eq. (31.49)), which can be rewritten as:

z_ij = (y_ij + m) − (m_i + m_j)    (31.56)
Fig. 31.11. Biplot of chromatographic retention times in Table 31.2, resulting from correspondence factor analysis, i.e. after double-closure of the data. The line segments have been added to emphasize contrasts in the same way as in Fig. 31.10.
with

y_ij = log(x_ij)    with i = 1, ..., n and j = 1, ..., p

where m_i, m_j and m now represent the means of the logarithmically transformed data Y. It has been proved that, in the limiting case when the contrasts in the data are small, the two expressions produce equivalent results [14]. The main difference between the methods resides in the quantity that is analyzed. In CFA one analyses the global distance of chi-square of the data [18], as explained in Section 32.5. In log double-centered PCA or SMA one analyses the global interaction in the data [17]. Formally, distances of chi-square and interactions are defined in the same way as weighted sums of squares of the transformed data. Only the type of transformation of the data differs from one method to another. (See Goodman [15] for a thorough review of the two approaches.) Contrasts in CFA are expressed in terms of distances of chi-square, while in SMA they can be interpreted as (log) ratios. While the former is only applicable to homogeneous data, the latter lends itself also to heterogeneous data, since differences between units are cancelled out by the combined operations of logarithms and centering.
31.4 Algorithms

A large variety of algorithms is available for the extraction of latent vectors from rectangular data tables and from square symmetric matrices. We only discuss very briefly a few of these. The NIPALS algorithm [19] is applied to a rectangular data table and produces row- and column-latent vectors sequentially, in decreasing order of their (associated) latent values. A better performing, but also more complex, algorithm for rectangular data tables is the Golub-Reinsch [20] singular value decomposition, which produces all row- and column-latent vectors at once. The counterpart of NIPALS for square symmetric matrices is the power algorithm of Hotelling [21], which returns row- or column-latent vectors sequentially, in decreasing order of their (associated) latent values. More demanding alternatives are the Jacobi [22] and Householder-QR [23] algorithms, which produce all row- or column-latent vectors at once. The choice of a particular method depends on the size and shape (tall or wide) of the data table, the available computer resources and programming languages. If one is only interested in the first two latent vectors (e.g. for the construction of a biplot) one may take advantage of the iterative algorithms (NIPALS or power), since they can be interrupted at any time. The latter are also of particular interest with matrix-oriented computer notations such as APL [24], MATLAB [25] and SAS/IML [26]. Algorithms are also available in libraries of Fortran functions for solving problems in linear algebra such as LINPACK [27], in EISPACK [28], which is dedicated to the solution of eigenvalue problems, and in the Pascal library of numerical recipes [29].

31.4.1 Singular value decomposition

The 'non-linear iterative partial least squares' or NIPALS algorithm has been designed by H. Wold [19] for the solution of a broad class of problems involving relationships between several data tables. For a discussion of partial least squares (PLS) the reader is referred to Chapter 35. In the particular case where NIPALS is applied to a single rectangular data table one obtains the row- and column-singular vectors one after the other, in decreasing order of their corresponding singular values. The original concept of the NIPALS approach is attributed to Fisher [30]. In NIPALS one starts with an initial vector t with n arbitrarily chosen values (Fig. 31.12). In a first step, the matrix product of the transpose of the n×p table X with the n-vector t is formed, producing the p elements of vector w. Note that in the traditional NIPALS notation, w has a different meaning than that of a weighting vector, which has been used in Section 31.3.6. In a second step, the elements of the p-vector w are normalized to unit sum of squares. This prevents values from becoming too small or too large for the purpose of numerical computation.
Fig. 31.12. Dance-step diagram, illustrating a cycle of the iterative NIPALS algorithm. Step 1 multiplies the score vector t with the data table X, which produces the weight vector w. Step 2 normalizes w to unit sum of squares. In step 3, X is multiplied by w, yielding an updated t.
The third step involves the multiplication of X with w, which produces updated values for t. The complete cycle of the NIPALS algorithm can be represented by means of the following operations, given an initial vector t:

w = X^T t
w / (w^T w)^{1/2} ⇒ w    (31.57)
t = X w

which is equivalent to the transition formulae of eq. (31.17). The three steps of the cycle are represented in Fig. 31.12, which shows the transitions in the form of a dance-step diagram [31]. The cycle is repeated until convergence of w or t, when changes between current and previous values are within a predefined tolerance. At this stage one can derive the first normalized row-latent vector u_1 and column-latent vector v_1 from the final w and t:

u_1 = t / (t^T t)^{1/2}
v_1 = w / (w^T w)^{1/2}    (31.58)

(In practice, the normalization of w can be omitted and one can write v_1 = w.)
The corresponding latent value λ_1 is then defined by means of:

λ_1 = u_1^T X v_1    (31.59)

in accordance with eq. (31.1). A crucial operation in the NIPALS algorithm is the calculation of the residual data matrix, which is independent of the contributions by the first singular vector. This can be produced by the instruction:

X − λ_1 (u_1 v_1^T) ⇒ X    (31.60)
in which the new X represents the residual data matrix. Geometrically, one may consider the residual data matrix X as the projections of the corresponding patterns of points onto (dual) subspaces which are orthogonal to u_1 and v_1. If all elements of the residual table are zero (or near zero within given tolerance limits), then we have exhausted the data matrix X and no further latent vectors can be extracted. Otherwise, the algorithm (eq. (31.57)) is repeated, yielding the singular vectors u_2 and v_2 together with the (associated) singular value λ_2. The way in which the residual matrix has been defined (eq. (31.60)) ensures that u_2 is orthogonal to u_1 and that v_2 is orthogonal to v_1. A new residual data matrix is computed and tested for being zero. If not, a third singular vector is extracted and so on, until complete exhaustion of X. At the end we have decomposed the data table X into r orthogonal components (where r is at most equal to p, which is assumed to be smaller than n):

X = Σ_{k=1}^{r} λ_k (u_k v_k^T) = U Λ V^T    (31.61)
By construction, all singular vectors in U and V are normalized and mutually orthogonal. The NIPALS algorithm is easy to program, particularly with a matrix-oriented computer notation, and is highly efficient when only a few latent vectors are required, such as for the construction of a two-dimensional biplot. It is also suitable for implementation on personal or portable computers with limited hardware resources. A more efficient but also more demanding algorithm is the Golub-Reinsch [20] singular value decomposition. This is a non-iterative method. Its first step is a so-called Householder transformation [32] of a rectangular data table X, which produces a bidiagonal matrix, i.e. a matrix in which all elements must be zero except those on the principal diagonal and the diagonal immediately below (or above) (Fig. 31.13a). The second step reduces the bidiagonal matrix into a diagonal matrix Λ by means of the so-called QR transformation [33]. The elements of this principal diagonal are the singular values. At the end of the transformations, the Golub-Reinsch algorithm also delivers the row- and column-singular vectors in U and V.
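A minimal NumPy sketch of the NIPALS cycle of eqs. (31.57)-(31.61), with deflation after each extracted component, is given below; the function name, the starting vector, the tolerance and the maximum number of iterations are our own choices and are not prescribed by the text.

```python
import numpy as np

def nipals(X, n_factors=2, tol=1e-10, max_iter=500):
    """NIPALS decomposition X ~ U diag(lam) V^T (eqs. 31.57-31.61)."""
    X = np.asarray(X, dtype=float).copy()
    U, V, lam = [], [], []
    for _ in range(n_factors):
        t = X[:, 0].copy()                    # arbitrary initial score vector
        for _ in range(max_iter):
            w = X.T @ t                       # step 1: w = X^T t
            w /= np.sqrt(w @ w)               # step 2: normalize w
            t_new = X @ w                     # step 3: t = X w
            if np.linalg.norm(t_new - t) < tol:
                t = t_new
                break
            t = t_new
        u = t / np.sqrt(t @ t)                # eq. (31.58)
        lam_k = u @ X @ w                     # eq. (31.59)
        X -= lam_k * np.outer(u, w)           # eq. (31.60): deflation
        U.append(u); V.append(w); lam.append(lam_k)
    return np.column_stack(U), np.array(lam), np.column_stack(V)
```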
Fig. 31.13. Schematic example of three common algorithms for singular value and eigenvalue decomposition: (a) Golub-Reinsch, (b) Jacobi and (c) Householder-QR.
31.4.2 Eigenvalue decomposition

The power algorithm [21] is the simplest iterative method for the calculation of latent vectors and latent values from a square symmetric matrix. In contrast to NIPALS, which produces an orthogonal decomposition of a rectangular data table X, the power algorithm decomposes a square symmetric matrix of cross-products X^T X, which we denote by C_p. Note that C_p is called the column variance-covariance matrix when the data in X are column-centered. In the power algorithm one first computes the matrix product of C_p with an initial vector of p random numbers v, yielding the vector w:

w = C_p v    (31.62a)

The result is then normalized, which produces an updated vector v:

v = w / (w^T w)^{1/2}    (31.62b)

The normalization step prevents the elements in v from becoming either too large or too small during the numerical computation. The two operations above define the cycle of the power algorithm, which can be iterated until convergence of the elements in the vector v within a predefined tolerance. It can be easily shown that after n iterations the resulting vector w can be expressed as:

w ∝ C_p^n v    (31.63)

where the symbol C_p^n represents the nth power of the matrix C_p, i.e. the matrix product of n terms equal to C_p. The name of the method is derived from this property. The normalized vector v_1 which results from the iterative procedure is the first dominant eigenvector of C_p. Its associated eigenvalue λ_1^2 follows from eq. (31.5a):

λ_1^2 = v_1^T C_p v_1    (31.64)
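A minimal NumPy sketch of the basic power cycle, eqs. (31.62)-(31.64), is given below; the random starting vector, the stopping rule and the function name are our own choices.

```python
import numpy as np

def power_first_eigenvector(Cp, tol=1e-12, max_iter=1000):
    """Power algorithm for the dominant eigenvector of a square
    symmetric cross-product matrix Cp (eqs. 31.62 and 31.64)."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(Cp.shape[0])       # initial random vector
    v /= np.linalg.norm(v)
    for _ in range(max_iter):
        w = Cp @ v                             # eq. (31.62a)
        v_new = w / np.linalg.norm(w)          # eq. (31.62b)
        if np.linalg.norm(v_new - v) < tol:
            v = v_new
            break
        v = v_new
    lam2 = v @ Cp @ v                          # eq. (31.64): eigenvalue lambda^2
    return v, lam2
```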
A key operation in the power algorithm is the calculation of the deflated cross-product matrix, which is independent of the contribution by the first eigenvector. This is achieved by means of the instruction:

C_p − λ_1^2 (v_1 v_1^T) ⇒ C_p    (31.65)
in which the original matrix C_p is replaced by the residual matrix. The geometrical equivalent of the deflation operation is a projection of a pattern of points into a subspace which is orthogonal to v_1. If all elements of the residual matrix are zero (or near zero within a specified tolerance), then we have exhausted the cross-product matrix C_p. Otherwise, the algorithm (eq. (31.62)) is repeated, yielding the second eigenvector v_2 and its associated eigenvalue λ_2^2. Because of the definition of the residual matrix (eq. (31.65)), we obtain that v_2 is orthogonal to v_1. A new residual matrix is computed and the process is repeated until complete exhaustion of C_p. At the end we have decomposed the cross-product matrix C_p into r orthogonal components (where r is at most equal to p):

C_p = Σ_{k=1}^{r} λ_k^2 (v_k v_k^T) = V Λ^2 V^T    (31.66)
Once we have computed the matrix of column-eigenvectors V we can derive the corresponding row-eigenvectors U using eq. (31.13):

U = X V Λ^{−1}    (31.67)
By construction, all eigenvectors are normalized and mutually orthogonal. As we have remarked before, singular vectors and eigenvectors are identical (up to an algebraic sign) and have been called latent vectors in the context of data analysis. Similarly, singular values are the square roots of the eigenvalues and have been called latent values. The eigenvalues are the contributions to the trace of the matrix of cross-products, or global sum of squares, in conformity with eq. (31.8). Although the results of NIPALS are equivalent to those of the power algorithm, the latter converges more rapidly. It is, however, numerically less stable when results are computed with finite precision. In practice, use of the power algorithm is advantageous with a matrix-oriented computer notation, when hardware resources are limited and when only a few latent vectors are required (as for the construction of a two-dimensional biplot). In all other cases, one should make use of more powerful and efficient algorithms, such as those described below. A non-iterative algorithm for the diagonalization of a symmetric square matrix C is attributed to Jacobi [22]. Basically, this method begins by considering the 2×2 submatrix of C formed by the two symmetrical off-diagonal elements with the largest absolute value and its two corresponding diagonal elements. It is possible to find a simple 2×2 transformation matrix which diagonalizes this 2×2 symmetrical submatrix within the larger symmetrical matrix (Fig. 31.13b). Of course this affects the other off-diagonal elements, which in general still remain different from zero. The procedure is repeated by selecting the new off-diagonal element with the largest absolute value, which is again zeroed by means of an appropriate orthogonal transformation matrix. This process, when continued, converges to a state where all off-diagonal elements of the square symmetric matrix are zero (within given limits of tolerance). After convergence, the diagonalized matrix Λ^2 contains the eigenvalues. The product of the successive transformation matrices
produces the matrix of corresponding eigenvectors V. A detailed description of this algorithm has been given by Ralston and Wilf [34]. An even more efficient, but also more complex, algorithm for the diagonalization of a square matrix is the Householder-QR eigenvalue decomposition (Fig. 31.13c). It is similar in structure to the Golub-Reinsch singular value decomposition which we have mentioned above (Fig. 31.13a). The first part of this high-performance algorithm applies a Householder [32] transformation which converts a square matrix C into a tridiagonal matrix, i.e. a matrix in which all elements are equal to zero except those on the main diagonal and the diagonals below and above. In a second part, this tridiagonal matrix is transformed into a diagonal one by means of the QR transformation [33,35], which can be further optimized for efficiency [36]. After diagonalization one finds the eigenvalues on the main diagonal of the transformed square matrix Λ^2. As a result of the transformations one also obtains the corresponding eigenvectors V. Note that this algorithm can also be applied to non-symmetrical square matrices for the solution of general eigenproblems such as arise in linear discriminant analysis and canonical correlation analysis [37]. (The latter are also discussed in Chapters 33 and 35.) Computer programs for the Householder-QR algorithm are available from the function libraries mentioned above. A comparison of the performance of the three algorithms for eigenvalue decomposition has been made on a PC (IBM AT) equipped with a mathematical coprocessor [38]. The results, which are displayed in Fig. 31.14, show that the Householder-QR algorithm outperforms Jacobi's by a factor of about 4 and is superior to the power method by a factor of about 20. The time required by Householder-QR for diagonalization of a square symmetric matrix increases with the power 2.6 of the dimension of the matrix. Usually it is assumed that the number of rows of a data table X exceeds the number of columns. In the opposite case, the calculation of the column-eigenvectors V from the matrix X^T X may place high demands on computer time and memory. In the so-called kernel algorithm this inconvenience is greatly reduced [39]. The basic idea here is to first compute the row-eigenvectors U and eigenvalues Λ^2 of the matrix X X^T, and then to derive the corresponding column-eigenvectors V from the expression X^T U Λ^{−1} (according to eq. (31.17b)).
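A minimal NumPy sketch of this kernel idea for a wide table (p larger than n) is given below; np.linalg.eigh is used merely as a stand-in for any eigenvalue routine, and the function name and the selection of a few dominant factors are our own choices.

```python
import numpy as np

def kernel_factors(X, n_factors=2):
    """Kernel approach for a wide table X (n rows, many columns):
    eigendecompose the small n x n matrix X X^T, then derive
    V = X^T U Lambda^{-1} (cf. eq. 31.17b)."""
    X = np.asarray(X, dtype=float)
    evals, U = np.linalg.eigh(X @ X.T)              # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:n_factors]     # keep the dominant ones
    U = U[:, order]
    lam = np.sqrt(np.clip(evals[order], 0, None))   # singular values (latent values)
    V = X.T @ U / lam                               # column-eigenvectors
    return U, lam, V
```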
31.5 Validation

A question that often arises in multivariate data analysis is how many meaningful eigenvectors should be retained, especially when the objective is to reduce the dimensionality of the data. It is assumed that, initially, eigenvectors contribute only structural information, which is also referred to as systematic information.
Fig. 31.14. Performance of three computer algorithms for eigenvalue decomposition as a function of the dimension of the input matrix. The horizontal and vertical scales are scaled logarithmically. Execution time is proportional to a power 2.6 of the dimension.
With increasing number of eigenvectors, however, noise progressively contaminates the eigenvectors and eventually only pure noise may be carried by the higher-order eigenvectors. The problem then is to define the number of eigenvectors which account for a maximum of structure, while carrying a minimum of noise. Since one does not always have access to replicated data, there may be uncertainty about the extent of the random noise in the data and, hence, the problem is by no means trivial. The many solutions that have been proposed thus far have been reviewed recently by Deane [40]. Some empirical methods of validation are also discussed in Chapter 36. In this section we only discuss three representative approaches, i.e. empirical, statistical and by internal validation. (There are several variants of these approaches which we cannot discuss here.) It is not always clear which method should be preferred under given circumstances [37]. It may be useful to apply
several methods of validation in parallel and to be prepared for the possibility of conflicting outcomes. None of these approaches of internal validation, however, are perfect substitutes for replicated data and external validation. As is often the case, the proof of the pudding is in the eating. Each of the three approaches will be applied in this section to the transformed retention times of the 23 chalcones with eight chromatographic elution methods in Table 31.2. The transformation is defined by the successive operations of logarithms, double-centering and global normalization, which is typical for the method of spectral map analysis (SMA):

y_ij = log(x_ij)    with i = 1, ..., n and j = 1, ..., p
z_ij = (y_ij − m_i − m_j + m) / d    (31.68)

with

m_i = (1/p) Σ_{j=1}^{p} y_ij,  m_j = (1/n) Σ_{i=1}^{n} y_ij,  m = (1/(n p)) Σ_{i=1}^{n} Σ_{j=1}^{p} y_ij

and

d^2 = (1/(n p)) Σ_{i=1}^{n} Σ_{j=1}^{p} (y_ij − m_i − m_j + m)^2
The rank of the transformed table of chromatographic retention times Z is equal to seven.
31.5.1 Scree-plot

This empirical test is based on the so-called Scree-plot, which represents the residual variance as a function of the number of eigenvectors that have been extracted [42]. The residual variance V of the r*-th eigenvector is defined by:

V(r*) = Σ_{k=r*+1}^{r} λ_k^2    with 1 ≤ r* ≤ r    (31.69)

where r is the number of non-trivial (non-zero) eigenvalues.
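As a small numerical check of eq. (31.69), the residual variances can be recomputed from the eigenvalues λ_k^2 listed in Table 31.9; the snippet below is only a verification of the tabulated values and is not part of the original text.

```python
import numpy as np

# Eigenvalues (lambda_k^2) of the transformed retention times, from Table 31.9
lam2 = np.array([0.8286, 0.1366, 0.0229, 0.0066, 0.0031, 0.0015, 0.0008])

# Residual variance V(r*) = sum of the eigenvalues beyond r* (eq. 31.69)
V = np.array([lam2[r:].sum() for r in range(1, len(lam2) + 1)])
print(np.round(V, 4))   # matches the V(r*) column of Table 31.9
```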
Fig. 31.15. Scree-plot, representing the residual variance V(r*) as a function of the number of factors r* that have been extracted. The diagram is based on a factor analysis of Table 31.2 after log double-centering. A break point occurs after the second factor, which suggests the presence of only two structural factors, the residual factors being attributed to noise and artefacts in the data.
It is assumed that the structural eigenvectors explain successively less variance in the data. The error eigenvalues, however, when they account for random errors in the data, should be equal. In practice, one expects that the curve on the Scree-plot levels off at a point r* when the structural information in the data is nearly exhausted. This point determines the number of structural eigenvectors. In Fig. 31.15 we present the Scree-plot for the 23×8 table of transformed chromatographic retention times. From the plot we observe that the residual variance levels off after the second eigenvector. Hence, we conclude from this evidence that the structural pattern in the data is two-dimensional and that the five residual dimensions contribute mostly noise.

31.5.2 Malinowski's F-test

This is a statistical test designed by Malinowski [43] which compares the variance contributed by a structural eigenvector with that of the error eigenvectors. Let us suppose that λ_{r*}^2 is the variance contributed by the last structural eigenvector.
TABLE 31.9
Validation results obtained from factor analysis of Table 31.2, containing the retention times of 23 chalcones in 8 chromatographic methods, after log double-centering and global normalization. The results are used in Malinowski's F-test and in cross-validation by PRESS.

r*    r−r*    λ_{r*}^2    V(r*)     F(r*)    PRESS(r*)    W(r*)
1     6       0.8286      0.1715    29.0     45.181       0.4936
2     5       0.1366      0.0349    19.6     22.301       1.0600
3     4       0.0229      0.0120     7.6     23.639       1.6128
4     3       0.0066      0.0054     3.7     38.126       1.2486
5     2       0.0031      0.0023     2.7     47.603       1.1104
6     1       0.0015      0.0008     1.9     52.856       1.0472
7     0       0.0008      0          —       55.351       —
This structural variance should be larger than the average V̄(r*) of the variances contributed by the r − r* error eigenvectors that follow:

V̄(r*) = (1/(r − r*)) Σ_{k=r*+1}^{r} λ_k^2 = V(r*) / (r − r*)    with 1 ≤ r* < r    (31.70)

Hence, the number of structural eigenvectors is the largest r* for which Malinowski's F-ratio is still significant at a predefined level of probability α (say 0.05):

F(r*) = λ_{r*}^2 / V̄(r*)    with 1 ≤ r* < r    (31.71)
which is to be tested with 1 and (r − r*) degrees of freedom. Note that this test assumes that the errors in the data are homogeneous and normally distributed. In Table 31.9 we represent the results of Malinowski's F-test as computed from the eigenvalues of the transformed retention times in Table 31.2. The second eigenvector produces an F-value which still exceeds the critical F-statistic (6.6 with 1 and 5 degrees of freedom) at the 0.05 level of probability. Hence, from this evidence we conclude again that there are two, possibly three, structural eigenvectors in this data set.

31.5.3 Cross-validation

The method of cross-validation is based on internal validation, which means that one predicts each element in the data set from the results of an analysis of the remaining ones. This can be done by leaving out each element in turn, which in the case of an n×p table would require n×p analyses. Wold [44] has implemented a scheme for leaving out groups of elements at the same time, which reduces the number of analyses to four.
Fig. 31.16. Two of the four masks that are used in the calculation of a PRESS value in cross-validation. The remaining two masks are obtained by a parallel shift of the diagonal lines that represent data points to be deleted.
In this scheme, elements to be left out are located on equally spaced diagonals (pseudodiagonals), such as illustrated by the masks of Fig. 31.16. Actually, there are four such masks, but only two are shown here. This scheme ensures that not more and not less than about one fourth of each row and column are left out at the same time. The elements on the diagonals are predicted from the remaining ones. Then the diagonals are shifted one element to the right and the procedure is repeated. After four such analyses we have predicted all elements of the table. The predicted residual error sum of squares or PRESS for r* eigenvectors is given by:

PRESS(r*) = Σ_{i=1}^{n} Σ_{j=1}^{p} (z_ij − z*_ij)^2    (31.72)

where z*_ij is the value predicted for z_ij using r* eigenvectors [44]. For a general discussion of PRESS the reader is referred to Section 10.3.4 and to Chapter 36. As more structural eigenvectors are included we expect PRESS to decrease up to a point when the structural information is exhausted. From this point on we expect PRESS to increase again as increasingly more error eigenvectors are included. In order to determine the transition point r* one can compare PRESS(r*+1) with the previously obtained PRESS(r*). The number of structural eigenvectors r* is reached when the ratio:

W(r*) = PRESS(r*+1) / PRESS(r*)    with 1 ≤ r* < r    (31.73)

becomes larger than 1 [45]. Note that the symbols V and W in this section represent the traditional notation, which differs from our standardized nomenclature in this chapter.
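As a small numerical check of eq. (31.73), the W ratios can be recomputed from the PRESS values listed in Table 31.9; again, this only verifies the tabulated values.

```python
import numpy as np

# PRESS(r*) values from Table 31.9
press = np.array([45.181, 22.301, 23.639, 38.126, 47.603, 52.856, 55.351])

# W(r*) = PRESS(r*+1) / PRESS(r*), eq. (31.73)
W = press[1:] / press[:-1]
print(np.round(W, 4))   # matches the W(r*) column of Table 31.9;
                        # W first exceeds 1 at r* = 2, i.e. two structural factors
```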
In a more recent variant of cross-validation one replaces PRESS(r*) in the denominator by the residual sum of squares RSS(r*):

RSS(r*) = Σ_{i=1}^{n} Σ_{j=1}^{p} (z_ij − ẑ_ij)^2    (31.74)

where ẑ_ij is the reconstructed value of z_ij using r* eigenvectors and without leaving any element out. The results of cross-validation of the transformed retention times of Table 31.2 are compiled in the last two columns of Table 31.9. They point toward the presence of two structural factors.
31.6 Principal coordinates analysis

31.6.1 Distances defined from data

Principal coordinates analysis (PCoA) is applied to distance tables rather than to original data tables, as is the case with principal components analysis (PCA). We consider an n×n table D of distances between the n row-items of an n×p data table X. Distances can be derived from the data by means of various functions, depending upon the nature of the data and the objective of the analysis. Each of these functions defines a particular metric (or yardstick), and the graphical result of a multivariate analysis may largely depend on the particular choice of distance function. The squared Euclidean distance (also called Pythagorean distance) has been defined in Section 9.2.3:

d_ii'^2 = (x_i − x_i')^T (x_i − x_i') = Σ_{j=1}^{p} (x_ij − x_i'j)^2    with i, i' = 1, ..., n    (31.75)
j
where the vectors xj and x^. represent the iih and i'-th rows of X. It can be regarded as a special case of the squared weighted Euclidean distance (Section 30.2.2.1). A property of (weighted) Euclidean distance functions is that the distances between row-items D are invariant under column-centering of the table X: d'h=j^wj((x^j-mj)-(x,j-mj)y=j;^wj(x,j -x.j)' ./•
j
J
p
n
with ^ / = — X ^ij
^"^
X ^7 "^ ^
=dl
(31.76)
147
Geometrically, column-centering of X is equivalent to a translation of the origin of column-space toward the centroid of the points which represent the rows of the data table X. Hence, the operation of column-centering leaves distances between the row-points unchanged. The squared generalized distance is a weighted distance of the general form:

d_ii'^2 = (x_i − x_i')^T W (x_i − x_i')    (31.77)

where W represents a p×p weighting matrix. Depending upon the definition of W, the generalized distance encompasses the ordinary and weighted Euclidean distances as well as the Mahalanobis distance (Section 30.2.2.1). The Minkowski distance defines a class of distance functions which are characterized by the parameter r (Section 30.2.3.2):

d_ii' = (Σ_{j=1}^{p} |x_ij − x_i'j|^r)^{1/r}    with r ≥ 1    (31.78)

In the case of r = 2 we obtain the ordinary Euclidean distance of eq. (31.75), which is also called the L2-norm. In the case of r = 1 we derive the city-block distance (also called Hamming-, taxi- or Manhattan-distance), which is also referred to as the L1-norm:

d_ii' = Σ_{j=1}^{p} |x_ij − x_i'j|    (31.79)
The squared Chi-square distance is appropriate for the analysis of contingency tables (when the data represent counts) and for cross-tabulations (when the data represent parts of a whole):

d_ii'^2 = Σ_{j=1}^{p} (1/x_+j) (x_ij/x_i+ − x_i'j/x_i'+)^2    (31.80)
where x_i+, x_+j and x_++ represent the ith row-sum, the jth column-sum and the global sum over all rows and columns, respectively. The Chi-square distance can be seen as a weighted Euclidean distance on the transformed data:

z_ij = x_ij / E(x_ij)    (31.81)

where the expected value E(x_ij) has been defined in Section 16.2.3 as:

E(x_ij) = (x_i+ x_+j) / x_++    (31.82)
In this respect, the weight coefficients are proportional to the column-sums. Distances of Chi-square form the basis of correspondence factor analysis (CFA), which is discussed in Chapter 32.

31.6.2 Distances derived from comparisons of pairs

Suppose we have a set of n items which are to be compared two-by-two according to the presence or absence of p characteristics. The similarity between any pair i, i' can be defined, for example, by means of Gower's similarity index [46], which was defined in Chapter 30. Gower recommends transforming the similarities S into distances D by means of:

d_ii' = (1 − s_ii')^{1/2}    (31.83)
There are many alternative indices of similarity, but Gower's is considered to be fairly general and can be related to several other similarity indices [47].

31.6.3 Eigenvalue decomposition

In Section 31.2.2 we have derived the Euclidean distances D between two items i and i' from their corresponding variances and covariances C. The inverse relationship has been established by Young and Householder [48] and is also referred to as Gower's formula because of the extensive use he has made of it [49]:

c_ii' = −(1/2) (d_ii'^2 − d̄_i^2 − d̄_i'^2 + d̄^2)    (31.84)

where d̄_i^2, d̄_i'^2 and d̄^2 refer to the mean of row i, the mean of column i' and the global mean over all rows and columns, respectively, of the squared distance matrix D^2. The variance-covariance matrix C thus equals the double-centered squared distance matrix D^2, multiplied by the constant −0.5. Once the n×n variance-covariance matrix C has been derived one can apply eigenvalue decomposition (EVD) as explained in Section 31.4.2. In this case we obtain:

C = U Λ^2 U^T    with U^T U = I

where U represents the n×r matrix of orthonormal factor scores, Λ^2 is the r×r diagonal matrix of eigenvalues and r is the rank of C.
Factor scores for each of the n objects with respect to the r computed factors are then defined in the usual way by means of the n×r orthogonal factor score matrix S:

S = U Λ

These factor scores can be used for the construction of a score plot in which each object is represented as a point in the plane of the first two dominant factors. A close analogy exists between PCoA and PCA, the difference lying in the source of the data. In the former they appear as a square distance table, while in the latter they are defined as a rectangular measurement table. The result of PCoA also serves as a starting point for multidimensional scaling (MDS), which attempts to reproduce distances as closely as possible in a low-dimensional space. In this context PCoA is also referred to as classical metric scaling. In MDS, one minimizes the stress between observed and reconstructed distances, while in PCA one maximizes the variance reproduced by successive factors.
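A minimal NumPy sketch of principal coordinates analysis as described above is given below: double-center the squared distance matrix with Gower's formula (eq. 31.84), apply an eigenvalue decomposition and form the factor scores S = U Λ. The function name is our own, and np.linalg.eigh stands in for any EVD routine.

```python
import numpy as np

def pcoa(D, n_factors=2):
    """Principal coordinates analysis of an n x n distance matrix D."""
    D2 = np.asarray(D, dtype=float) ** 2
    row_mean = D2.mean(axis=1, keepdims=True)
    col_mean = D2.mean(axis=0, keepdims=True)
    C = -0.5 * (D2 - row_mean - col_mean + D2.mean())    # eq. (31.84)
    evals, U = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1][:n_factors]          # dominant eigenvectors
    lam = np.sqrt(np.clip(evals[order], 0, None))
    return U[:, order] * lam                             # factor scores S = U Lambda
```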
31.7 Non-linear principal components analysis

Non-linear PCA can be obtained in many different ways. Some methods make use of higher order terms of the data (e.g. squares, cross-products), non-linear transformations (e.g. logarithms), metrics that differ from the usual Euclidean one (e.g. city-block distance) or specialized applications of neural networks [50]. The objective of these methods is to increase the amount of variance in the data that is explained by the first two or three components of the analysis. We only provide a brief outline of the various approaches, with the exception of neural networks for which the reader is referred to Chapter 44.

31.7.1 Extensions of the data by higher order terms

One approach is to extend the columns of a measurement table by means of their powers and cross-products. An example of such non-linear PCA is discussed in Section 37.2.1 in an application of QSAR, where biological activity was known to be related to the hydrophobic constant by means of a quadratic function. In this case it made sense to add the square of a particular column to the original measurement table. This procedure, however, tends to increase the redundancy in the data.

31.7.2 Non-linear transformations of the data

Another approach is to transform the original columns of a measurement table by suitable non-linear functions in order to obtain some desirable effect. This is
done, for example, in the HOMALS program (Homogeneity and Alternating Least Squares) [51], where non-linear transformations are applied in order to maximize the homogeneity of the data. In this context, homogeneity is inversely related to the within-variance of the rows of the data table. For example, a table in which the values in each individual row are constant (but different from zero) possesses maximum homogeneity [52]. The HOMALS approach is most suitable in the case of categorical data, i.e. when the column-items represent categories of one or more variables such as age, gender, etc. Correspondence factor analysis (CFA), which is discussed in Chapter 32, may be regarded as a special case of HOMALS. The logarithmic transformation prior to column- or double-centered PCA (Section 31.3) can be considered as a special case of non-linear PCA. The procedure tends to make the row- and column-variances more homogeneous, and allows us to interpret the resulting biplots in terms of log ratios.

31.7.3 Non-linear PCA biplot

The non-linear PCA biplot approach makes use of distance metrics that differ from the ordinary and weighted Euclidean metrics. We have already discussed the linear biplot resulting from PCA of a column-centered measurement table (Section 31.3.3). In this case, row-items are represented as a pattern of points and column-items are depicted by means of unipolar axes, i.e. straight lines through the origin of the plot (which is at the same time the centroid of the pattern). The original data in a column of the table can also be read off by means of perpendicular projection of the row-points upon the unipolar axes and by interpolating between equidistant markers (tick marks). In the non-linear case, the axes are curved and the markers on them are no longer equidistant (Fig. 31.17). The theory of the non-linear PCA biplot has been developed by Gower [49] and can be described as follows. We first assume that a column-centered measurement table X is decomposed by means of classical (or linear) PCA into a matrix of factor scores S and a matrix of factor loadings L:

X = U Λ V^T = S L^T    with S = U Λ and L = V

This corresponds with a choice of factor scaling coefficients α = 1 and β = 0, as defined in Section 31.1.4. Note that classical PCA implicitly assumes a Euclidean metric as defined above. Let us consider the jth coordinate axis of column-space, which is defined by a p-vector of unit length e_j of the form:

e_j = (0 0 ... 1 ... 0 0)
Fig. 31.17. (a) In a classical PCA biplot, data values x_ij can be estimated by means of perpendicular projection of the ith row-point upon a unipolar axis which represents the jth column-item of the data table X. In this case the axis is a straight line through the origin (represented by a small cross). (b) In a non-linear PCA biplot, the jth column-item traces out a curvilinear trajectory. The data value x_ij is now estimated by defining the shortest distance between the ith row-point and the jth trajectory.
which has zeroes in all p positions except at the jth one. The projection of e_j onto factor-space yields the coordinates of the jth column-item of X, which are contained in the jth row of the factor loading matrix L. Instead of e_j we may also project the vector k_j e_j, where k_j is a variable coefficient ranging from −∞ to +∞. In that case, we obtain a series of points the locus of which is the unipolar axis through the origin of space and through the representation of the jth column-item of X. In classical PCA, the origin of space is also the centroid of the row-items of the column-centered table X. This projection method can also be used to define markers along the jth unipolar axis, by choosing appropriate values of the coefficient k_j (e.g. ..., −10, −5, 0, 5, 10, ...). If we set k_j equal to the successive values x_ij in the jth column of X we obtain the exact projections of the row-items (i = 1, ..., n) along the jth axis. These projections usually differ from the orthogonal projections in the plane of the biplot when the contribution of the first two factors is less than 100% of the variance in X. The same idea can be developed in the case of a non-Euclidean metric such as the city-block metric or L1-norm (Section 31.6.1). Here we find that the trajectories, traced out by the variable coefficient k_j, are curvilinear rather than linear. Markers between equidistant values on the original scales of the columns of X are usually not equidistant on the corresponding curvilinear trajectories of the non-linear biplot (Fig. 31.17b). Although the curvilinear trajectories intersect at the origin of space, the latter does not necessarily coincide with the centroid of the row-points of X. We briefly describe here the basic steps of the algorithm and we refer to the original work of Gower [53,54] for a formal proof. Given the original n×p measurement table X, one derives the table of n×n distances D between the n row-items, using a particular distance function φ:

d_ii'^2 = Σ_{j=1}^{p} φ(x_ij, x_i'j)
Using D as input we apply principal coordinates analysis (PCoA), which we discussed in the previous section. This produces the n×r factor score matrix S. The next step is to define a variable point along the jth coordinate axis, by means of the coefficient k_j, and to compute its distance d(k_j) from all n row-points:

d_i^2(k_j) = Σ_{j'=1}^{p} φ(x_ij', k_j δ_jj')    for i = 1, ..., n    (31.85)
where δ_jj' represents Kronecker's delta, which equals 1 if j = j' and 0 otherwise. As indicated above, one may either assign a set of equidistant values to k_j, or the n values x_ij in the jth column of X. The r coordinates of the variable point which traces out the trajectory of the jth column-item in the r-dimensional biplot are compiled in the r-vector s_j. The elements of the latter can be estimated by means of linear regression of the n×r factor scores S upon an n-vector f_j which is defined as:

f_ij = −(1/2) (d_i^2(k_j) − d̄_i^2 + (1/2) d̄^2)    (31.86)
where the terms of the right-hand member have been defined above. The result of the linear regression can be expressed in the form:

s_j = (S^T S)^{−1} S^T f_j    (31.87)
Note that d̄_i^2 and d̄^2 represent the row means and the global mean of the squared distance matrix D^2, respectively. The latter need only be computed once and for all. The n distances d_i(k_j) of the variable point to the n row-points, however, are to be re-evaluated for every column-item of X and for every marker which is to appear in the corresponding trajectory in the biplot. Usually, the range of k_j is limited between the minimum and maximum value in the jth column of X, or somewhat beyond (say 10% of the range on either side).
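To make the recipe of eqs. (31.85)-(31.87) concrete, a minimal NumPy sketch of one trajectory point is given below; the city-block metric is used as an example of φ, D is assumed to be the city-block distance matrix between the rows of X, S the PCoA factor scores derived from it, and the function name is ours, not Gower's.

```python
import numpy as np

def trajectory_point(X, S, D, j, k):
    """Biplot coordinates of the marker k on the non-linear trajectory of
    column j (eqs. 31.85-31.87), sketched for the city-block metric."""
    X = np.asarray(X, dtype=float)
    D2 = np.asarray(D, dtype=float) ** 2
    d_i2 = D2.mean(axis=1)            # row means of the squared distances
    d2 = D2.mean()                    # global mean of the squared distances
    # eq. (31.85): distance from the variable point k*e_j to every row-point,
    # using phi(x, y) = |x - y| (city-block contributions)
    point = np.zeros(X.shape[1]); point[j] = k
    d_k = np.abs(X - point).sum(axis=1)
    # eq. (31.86)
    f = -0.5 * (d_k**2 - d_i2 + 0.5 * d2)
    # eq. (31.87): regression of the factor scores S on f
    return np.linalg.solve(S.T @ S, S.T @ f)
```

The full trajectory of column j is then traced out by evaluating this function for a range of k values between the minimum and maximum of that column.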
31.8 Three-way principal components analysis

Multiway and particularly three-way analysis of data has become an important subject in chemometrics. This is the result of the development of hyphenated detection methods (such as in combined chromatography-spectrometry), which yield three-way data structures the ways of which are defined by samples, retention times and wavelengths. In multivariate process analysis, three-way data are obtained from various batches, quality measures and times of observation [55]. In image analysis, the three modes are formed by the horizontal and vertical coordinates of the pixels within a frame and the successive frames that have been recorded. In this rapidly developing field one already finds an extensive body of literature, and only a brief outline can be given here. For a more comprehensive reading and a discussion of practical applications we refer to the reviews by Geladi [56], Smilde [57] and Henrion [58]. In the present terminology, a three-way data array is defined by n rows, p columns and q layers, with indices i, j and k, respectively.
31.8.1 Unfolding

A straightforward approach to three-way PCA, which avoids some of the inherent problems of three-way analysis, is to unfold the three-way data array into a two-way array, i.e. a rectangular data table, which can then be processed according to the methods outlined in previous sections. The operation of unfolding can be likened to spreading out a stack of continuous forms (such as are used with mechanical pin-fed printers). Unfolding of a three-way data array, however, can be done in three different manners, as shown in Fig. 31.18. Theoretically, an n×p×q array will give rise to an n×(p×q), p×(n×q) or q×(n×p) rectangular data table. Application of PCA to each of these three unfoldings will generally produce a different result. In particular, one obtains three different biplots for one and the same three-way table, which may be undesirable. Strictly speaking, unfolding avoids the three-way problem by reducing it to the analysis of a two-way table.
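A minimal NumPy sketch of the three unfoldings is given below; which mode is kept as the row index is simply an argument of the hypothetical helper function, not a convention taken from the text.

```python
import numpy as np

def unfold(X, mode):
    """Unfold an n x p x q array X into a two-way table with the chosen
    mode (0, 1 or 2) as rows and the two remaining modes combined as columns."""
    X = np.asarray(X)
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

# Example: an n x p x q array yields tables of size n x (p*q), p x (n*q), q x (n*p)
X = np.random.default_rng(1).standard_normal((4, 3, 5))
print(unfold(X, 0).shape, unfold(X, 1).shape, unfold(X, 2).shape)  # (4, 15) (3, 20) (5, 12)
```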
Fig. 31.18. Three ways of unfolding a three-way table into a rectangular matrix.
31.8.2 The Tucker3 model

The so-called Tucker3 model is defined by the decomposition of a three-way table X into a three-way core matrix Z and three two-way loading matrices A, B, C (one for each mode):

x_ijk = Σ_{f=1}^{r} Σ_{g=1}^{s} Σ_{h=1}^{t} z_fgh a_if b_jg c_kh + e_ijk    (31.88)
where e_ijk represents the residual error term. In this way, an n×p×q table X is decomposed into an r×s×t core matrix Z and the n×r, p×s, q×t loading matrices A, B, C for the row-, column- and layer-items of X. The loading matrices are column-wise orthonormal, which means that:

A^T A = I_r,    B^T B = I_s    and    C^T C = I_t

where I_r, I_s and I_t represent the unit matrices of dimensions r, s and t, respectively. The number of factors r, s and t, assigned to each mode, is generally different. They are chosen such as to be less than the dimensions of the original three-way table in order to achieve a considerable amount of data reduction. The elements of Z represent the magnitude of the factors and the extent of their interaction [57]. Computationally, the core matrix Z and the loading matrices A, B and C are derived such as to minimize the sum of squared residuals. By analogy with singular value decomposition (SVD) and ordinary PCA, one can also define the Tucker3 model in an extended matrix notation:

            C
            ·
    X = A · Z · B + E    (31.89)
The extended matrix notation is represented here by the three dots that surround the core matrix. A graphical example of the Tucker3 model is rendered in Fig. 31.19.
Fig. 31.19. Tucker3 or core matrix decomposition of an n×p×q three-way table X. The matrix Z represents the r×s×t core matrix. A, B and C are the n×r, p×s and q×t loading matrices of the row-, column- and layer-items of X, respectively.
31.8.3 The PARAFAC model

The decomposition of a three-way table X can also be defined by means of the PARAFAC model in terms of three two-way loading matrices (one for each mode) A, B, C:

x_ijk = Σ_{f=1}^{r} a_if b_jf c_kf + e_ijk    (31.90)
where e_ijk represents a residual error term. This decomposition is also referred to as the trilinear model. In this case, the n×p×q three-way table X is decomposed into the n×r, p×r, q×r loading matrices A, B, C for the row-, column- and layer-items of X. In the PARAFAC model, the three loading matrices A, B, C are not necessarily orthogonal [56]. The solution of the PARAFAC model, however, is unique and does not suffer from the indeterminacy that arises in principal components and factor analysis. In contrast to the Tucker3 model, described above, the number of factors in each mode is identical. It is chosen to be much smaller than the original dimensions of the data table in order to achieve a considerable reduction of the data. The elements of the loading matrices A, B, C are computed such as to minimize the sum of squared residuals. The PARAFAC model can also be defined by means of an extended matrix notation:

            C
            ·
    X = A · I · B + E    (31.91)

where I represents an r×r×r identity array (with ones on the superdiagonal and zeroes in the off-diagonal positions). The decomposition is represented schematically in Fig. 31.20. It can be seen from eqs. (31.89) and (31.91) that the PARAFAC model is a special case of the Tucker3 model where Z = I and r = s = t.
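A minimal NumPy sketch that reconstructs a three-way array from given loading matrices according to eqs. (31.88) and (31.90) is shown below; it only illustrates the structure of the two models and does not fit them (fitting requires alternating least squares, which is beyond this sketch). The function names are our own.

```python
import numpy as np

def tucker3_reconstruct(Z, A, B, C):
    """x_ijk = sum_f sum_g sum_h z_fgh a_if b_jg c_kh   (eq. 31.88)"""
    return np.einsum('fgh,if,jg,kh->ijk', Z, A, B, C)

def parafac_reconstruct(A, B, C):
    """x_ijk = sum_f a_if b_jf c_kf   (eq. 31.90); equivalent to Tucker3
    with a superdiagonal identity core and r = s = t."""
    return np.einsum('if,jf,kf->ijk', A, B, C)
```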
31.9 PCA and cluster analysis

Cluster analysis (which is covered extensively in Chapter 30) can be performed on the factor scores of a data table using a reduced number of factors (Section 31.1.4) rather than on the data table itself. This way, one can apply cluster analysis on the structural information only, while disregarding the noise or artefacts in the data. The number of structural factors may be determined by means of internal
Fig. 31.20. PARAFAC decomposition of an n×p×q three-way table X. The core matrix I is an r×r×r three-way identity matrix. A, B and C represent the n×r, p×r and q×r loading matrices of the row-, column- and layer-items of X, respectively.
validation (Section 31.5). If only two or three structural factors are present in the data, it can be helpful to provide the clustering tree in addition to the plot of factor scores in order to better comprehend the latent structure. It is also possible to represent the cluster structure graphically on the score plot, for example, by grouping those points that cluster together at a certain level in the clustering tree [59]. Although cluster analysis is based on the distances or similarities between a set of objects, its graphic representation in the form of a clustering tree has no metric property. The separation between any two leaves (i.e. terminal objects) in the tree usually bears no relation to their actual distance as defined by the data table or on the score plot with a reduced number of factors. In fact, given n objects one can easily show that there are 2^{n-1} equivalent hierarchical clustering trees, which can be obtained by reversals of the branches of the tree. In some special cases it is possible, however, to derive a unique or canonical representation of the objects in a clustering tree, by relating the order of appearance in the tree to their corresponding positions on the score plot. This is relevant, for example, in so-called phylogenetic applications, where the distance or similarity between any two objects reflects the time that elapsed after their evolutionary divergence. Such applications arise in biological taxonomy, in the study of proteins, in the classification of DNA sequences, etc. In these cases, one often finds that the objects (species, proteins, DNA strings) in a score plot are arranged approximately on a circle around the origin, at least when a two-factor solution is capable of reproducing the hereditary
or ancestral structure adequately. If the center of the score plot represents the primeval ancestor, then one expects all living present-day species, proteins or DNA strings to be more or less at an equal distance in evolutionary space. There may be discrepancies, however, because biological clocks may tick at different rates in different branches of the evolutionary tree. The positions of the objects are defined unambiguously on the score plot, apart from reflections about the factor axes. Hence, given a particular orientation of the axes, one may rearrange the clustering tree such that the order of its leaves corresponds optimally with the order of the objects in the score plot [60]. This then leads to a canonical representation of the clustering tree which is based on a preliminary PCA of the data.
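A possible sketch (not from the original text) of clustering on a reduced set of factor scores, using NumPy and SciPy; the data table, the number of retained factors and the number of clusters are arbitrary assumptions.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 8))                  # hypothetical 20 x 8 data table

# PCA via SVD of the column-centred table; keep the first two factor scores.
Xc = X - X.mean(axis=0)
U, sv, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * sv[:2]                    # 20 x 2 matrix of factor scores

# Hierarchical clustering on the structural information (the scores) only.
tree = linkage(scores, method='average')
labels = fcluster(tree, t=3, criterion='maxclust')
print(labels)                                 # cluster membership of the 20 objects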
References

1. I.T. Joliffe, Principal Components Analysis. Springer, New York, 1986.
2. O.M. Kvalheim, Interpretation of direct latent-variable projection methods and their aims and use in the analysis of multicomponent spectroscopic and chromatographic data. Chemom. Intell. Lab. Syst., 4 (1988) 11-25.
3. J. Mandel, Use of the singular value decomposition in regression analysis. Am. Statistician, 36 (1982) 15-24.
4. C. Eckart and G. Young, The approximation of one matrix by another of lower rank. Psychometrika, 1 (1936) 211-218.
5. W. Van Borm, Source Apportionment of Atmospheric Particles by Electron Probe X-ray Microanalysis and Receptor Models. Doctoral Thesis, University of Antwerp, 1989.
6. F.R. Gantmacher, The Theory of Matrices. Vols. 1 and 2. Chelsea Publ., New York, 1977.
7. G.H. Dunteman, Principal Components Analysis. Sage Publications, Newbury Park, CA, 1989.
8. L.L. Thurstone, Multiple-factor Analysis. A Development and Expansion of the Vectors of Mind. Univ. Chicago Press, Chicago, 1947.
9. K. Pearson, On lines and planes of closest fit to systems of points in space. Phil. Mag., 2 (6th Series) (1901) 559-572.
10. K.R. Gabriel, The biplot graphic display of matrices with applications to principal components analysis. Biometrika, 58 (1971) 453-467.
11. B. Walczak, L. Morin-Allory, M. Chretien, M. Lafosse and M. Dreux, Factor analysis and experiment design in high-performance liquid chromatography. III. Influence of mobile phase modifications on the selectivity of chalcones on a diol stationary phase. Chemom. Intell. Lab. Syst., 1 (1986) 79-90.
12. H.F. Gollob, A statistical model which combines features of factor analytic and analysis of variance techniques. Psychometrika, 33 (1968) 73-111.
13. P.J. Lewi, Spectral mapping, a technique for classifying biological activity profiles of chemical compounds. Arzneim. Forsch. (Drug Res.), 26 (1976) 1295-1300.
14. J.B. Kasmierczak, Analyse logarithmique. Deux exemples d'application. Revue de Statistique Appliquee, 33 (1985) 13-24.
15. L.A. Goodman, Some useful extensions of the usual correspondence analysis approach and the usual log-linear models approach in the analysis of contingency tables. Int. Statistical Rev., 54 (1986) 243-309.
16. Y. Escoufier and S. Junca, Least squares approximation of frequencies or their logarithms. Int. Statistical Rev., 54 (1986) 279-283.
17. P.J. Lewi, Spectral map analysis. Analysis of contrasts, especially from log-ratios. Chemom. Intell. Lab. Syst., 5 (1989) 105-116.
18. J.P. Benzecri, L'Analyse des Donnees. Vol. II. L'Analyse des Correspondances. Dunod, Paris, 1973.
19. H. Wold, Soft modelling by latent variables: the non-linear iterative partial least squares (NIPALS) algorithm. In: Perspectives in Probability and Statistics, J. Gani (Ed.). Academic Press, London, 1975, pp. 117-142.
20. G.H. Golub and C. Reinsch, Singular value decomposition and least squares solutions. Numer. Math., 14 (1970) 403-420.
21. H. Hotelling, Analysis of a complex of statistical variables into principal components. J. Educ. Psychol., 24 (1933) 417-441.
22. C.G.J. Jacobi, Ueber ein leichtes Verfahren die in der Theorie der Saekularstoerungen vorkommender Gleichungen numerisch aufzuloesen. J. Reine Angew. Math., 30 (1846) 51-95.
23. B.N. Parlett, The Symmetric Eigenvalue Problem. Prentice-Hall, Englewood Cliffs, NJ, 1980.
24. K.E. Iverson, Notation as a tool of thought (1979 Turing Award lecture). CACM, 23 (1980) 444-465.
25. MATLAB, Engineering Computation Software. The MathWorks Inc., Natick, MA, 1992.
26. SAS/IML (Interactive Matrix Language) Software: User and Reference Manual. Version 6, 1st Ed., SAS Institute Inc., Cary, NC, 1990.
27. J.J. Dongarra, C.B. Moler, J.R. Bunch and G.W. Stewart, LINPACK. SIAM, Philadelphia, 1979.
28. B.T. Smith, Matrix Eigensystem Routines - EISPACK Guide. 2nd edn., Lecture Notes in Computer Science, Vol. 6. Springer, New York, 1976.
29. W.H. Press, B.P. Flannery, S.A. Teukolsky and W.T. Vetterling, Numerical Recipes. The Art of Scientific Computing. Cambridge Univ. Press, Cambridge, UK, 1986, pp. 335-363.
30. R. Fisher and W.A. MacKenzie, Studies in crop variation II. The manurial response of different potato varieties. J. Agricult. Sci., 13 (1923) 311-320.
31. P.J. Lewi, B. Vekemans and L.M. Gypen, Partial least squares (PLS) for the prediction of real-life performance from laboratory results. In: Scientific Computing and Automation (Europe) 1990, E.J. Karjalainen (Ed.). Elsevier, Amsterdam, 1990, pp. 199-210.
32. A.S. Householder, The Theory of Matrices in Numerical Analysis. Blaisdell, New York, 1964.
33. J.G.F. Francis, The QR transformation. Parts I and II. Computer J., 6 (1958) 378-392.
34. A. Ralston and H.S. Wilf, Mathematical Methods for Digital Computers. Wiley, New York, 1960.
35. H. Rutishauser, Solution of eigenvalue problems with the LR-transformation. Appl. Math. Ser. Nat. Bur. Stand., 49 (1958) 47-81.
36. J.H. Wilkinson, Global convergence of tridiagonal QR algorithm with origin shift. Linear Algebra Applic., 1 (1968) 409-420.
37. W.W. Cooley and P.P. Lohnes, Multivariate Procedures for the Behavioral Sciences. Wiley, New York, 1962.
38. A. Leysen, Systematic study of algorithms for eigenvector extraction on PC. Thesis, Kath. Ind. Hogeschool der Kempen, Geel, Belgium, 1989.
39. S. Rannar, F. Lindgren, P. Geladi and S. Wold, A PLS kernel algorithm for data sets with many variables and fewer objects. Part I: theory and algorithm. J. Chemom., 8 (1994) 111-125.
40. J.M. Deane, Data reduction using principal components analysis. In: Multivariate Pattern Recognition in Chemometrics, R. Brereton (Ed.). Chapter 5, Elsevier, Amsterdam, 1992, pp. 125-165.
41. R.G. Brereton, Chemometrics, Applications of Mathematics and Statistics to Laboratory Systems. Ellis Horwood, New York, 1990, p. 221.
42. R.B. Cattell, The Scree test for the number of factors. Multivariate Behavioral Res., 1 (1966) 245-276.
43. E.R. Malinowski, Statistical F-tests for abstract factor analysis and target testing. J. Chemom., 3 (1988) 49-60.
44. S. Wold, Cross-validatory estimation of the number of components in factor and principal components models. Technometrics, 20 (1978) 397-405.
45. M.A. Sharaf, D.L. Illman and B.R. Kowalski, Chemometrics. Wiley, New York, 1986, p. 255.
46. J.C. Gower, A general coefficient of similarity and some of its properties. Biometrics, 27 (1971) 857-874.
47. I.E. Frank and R. Todeschini, The Data Analysis Handbook. Elsevier, Amsterdam, 1994.
48. G. Young and A.S. Householder, Discussion of a set of points in terms of their mutual distances. Psychometrika, 3 (1938) 19-22.
49. J.C. Gower, Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53 (1966) 325-328.
50. G.E. Hinton, How neural networks learn from experience. Sci. Am., Sept. (1992) 145-151.
51. A. Gifi, Non-linear Multivariate Analysis. Wiley, Chichester, UK, 1990.
52. J.P. van de Geer, Multivariate Analysis of Categorical Data: Applications. Sage Publications, Newbury Park, CA, 1993.
53. J.C. Gower and S.A. Harding, Non-linear biplots. Biometrika, 75 (1988) 445-455.
54. J.C. Gower, Adding a point to vector diagrams in multivariate analysis. Biometrika, 55 (1968) 582-585.
55. T. Kourti and J.F. MacGregor, Process analysis, monitoring and diagnosis, using multivariate projection methods. Chemom. Intell. Lab. Syst., 28 (1995) 3-21.
56. P. Geladi, Analysis of multi-way (multi-mode) data. Chemom. Intell. Lab. Syst., 7 (1989) 11-30.
57. A.K. Smilde, Three-way analysis. Problems and prospects. Chemom. Intell. Lab. Syst., 15 (1992) 143-157.
58. R. Henrion, N-way principal component analysis. Theory, algorithms and applications. Chemom. Intell. Lab. Syst., 25 (1994) 1-23.
59. P.N. Nyambi, J. Nkengasong, P. Lewi, K. Andries, W. Janssens, K. Fransen, L. Heyndrickx, P. Piot and G. van der Groen, Multivariate analysis of human immunodeficiency virus type 1 neutralization data. J. Virol., 70 (1996) 6235-6243.
60. P.J. Lewi, H. Moereels and D. Adriaensen, Combination of dendrograms with plots of latent variables. Application of G-protein coupled receptor sequences. Chemom. Intell. Lab. Syst., 16 (1992) 145-154.
Chapter 32
Analysis of Contingency Tables

32.1 Contingency table

Table 32.1 describes 30 persons who have been observed to use one of four available therapeutic compounds for the treatment of one of three possible disorders. The four compounds in this measurement table are the benzodiazepine tranquillizers Clonazepam (C), Diazepam (D), Lorazepam (L) and Triazolam (T). The three disorders are anxiety (A), epilepsy (E) and sleep disturbance (S). In this example, both measurements (compounds and disorders) are defined on nominal scales. Measurements can also be defined on ordinal scales, or on interval and ratio scales, in which case they need to be subdivided into discrete and non-overlapping categories. Each column of a measurement table can be expanded into an indicator table. The rows of an indicator table refer to the same objects and in the same order as in the measurement table. The columns of the indicator table represent non-overlapping categories of the selected measurement. Table 32.1 has been expanded into the indicator Table 32.2 for compounds and into the indicator Table 32.3 for disorders. In the indicator table for compounds, a value of one in a particular row is recorded if a person has used the corresponding compound. In the indicator table for disorders, one in a particular row indicates that the person has been treated for the corresponding disorder. All other elements of the row are set to zero. Note that the order of the columns in the indicator tables is not relevant. A contingency table X can be constructed by means of the matrix product of two indicator tables:

X = I^T J     (32.1)

where I is the indicator table for the first measurement and J is the indicator table for the second measurement. The rows of I and J must refer to the same objects and in the same order. The dimensions of a contingency table are defined by the number of categories of the two measurements. Each element x_{ij} in an n×p contingency table X represents the number of objects that can be associated simultaneously with category i of the first measurement and with category j of the second measurement. The element x_{ij} can be interpreted as
TABLE 32.1
Measurement table describing the use of four compounds (C, D, L, T) for the treatment of three disorders (A, E, S) by 30 persons. For explanation of symbols, see text.

#    Compound   Disorder
1    L          A
2    D          E
3    C          S
4    T          S
5    L          A
6    D          A
7    C          E
8    C          S
9    D          A
10   C          E
11   D          S
12   L          A
13   D          S
14   C          E
15   C          S
16   D          A
17   L          S
18   D          E
19   L          S
20   L          A
21   C          E
22   T          S
23   D          A
24   T          S
25   D          E
26   T          A
27   T          S
28   C          E
29   D          E
30   D          A
TABLE 32.2
Indicator table for compounds, from Table 32.1

#    C   D   L   T
1    0   0   1   0
2    0   1   0   0
3    1   0   0   0
4    0   0   0   1
5    0   0   1   0
6    0   1   0   0
7    1   0   0   0
8    1   0   0   0
9    0   1   0   0
10   1   0   0   0
11   0   1   0   0
12   0   0   1   0
13   0   1   0   0
14   1   0   0   0
15   1   0   0   0
16   0   1   0   0
17   0   0   1   0
18   0   1   0   0
19   0   0   1   0
20   0   0   1   0
21   1   0   0   0
22   0   0   0   1
23   0   1   0   0
24   0   0   0   1
25   0   1   0   0
26   0   0   0   1
27   0   0   0   1
28   1   0   0   0
29   0   1   0   0
30   0   1   0   0
All  8   11  6   5
TABLE 32.3
Indicator table for disorders, from Table 32.1

#    A   E   S
1    1   0   0
2    0   1   0
3    0   0   1
4    0   0   1
5    1   0   0
6    1   0   0
7    0   1   0
8    0   0   1
9    1   0   0
10   0   1   0
11   0   0   1
12   1   0   0
13   0   0   1
14   0   1   0
15   0   0   1
16   1   0   0
17   0   0   1
18   0   1   0
19   0   0   1
20   1   0   0
21   0   1   0
22   0   0   1
23   1   0   0
24   0   0   1
25   0   1   0
26   1   0   0
27   0   0   1
28   0   1   0
29   0   1   0
30   1   0   0
All  10  9   11
TABLE 32.4
Contingency table X derived from the indicator Tables 32.2 and 32.3

                 Anxiety (A)   Epilepsy (E)   Sleep (S)   Sum
Clonazepam (C)        0             5             3         8
Diazepam (D)          5             4             2        11
Lorazepam (L)         4             0             2         6
Triazolam (T)         1             0             4         5
Sum                  10             9            11        30
the number of occurrences of i contingent (or conditional) upon j, or equivalently as the number of instances of j contingent upon i. The 4×3 contingency Table 32.4 is the result of multiplying the transpose of the 30×4 indicator Table 32.2 for compounds with the 30×3 indicator Table 32.3 for disorders. In this chapter we deal explicitly with contingency tables that result from combining two measurements and which are generally referred to as two-way contingency tables. From this point of view, it can be regarded as a generalization of Chapter 16 to the case of more than two categories for each of the two measurements. It is possible to cross multiple measurements, which then results in a multi-way contingency table. For example, when three measurements with respectively n, p and q categories are crossed with one another, this results in an n×p×q three-way contingency table. Recently there has been a growing interest in the analysis of this type of data structure. In this chapter, however, we only deal with two-way contingency tables.

Summation row-wise (horizontally) of the elements of a contingency table produces the vector of row-sums with elements x_{i+}. Summation column-wise (vertically) yields the vector of column-sums with elements x_{+j}. The global sum is denoted by x_{++}. These marginal sums are defined as follows:

x_{i+} = Σ_{j=1}^{p} x_{ij}     with i = 1, ..., n

x_{+j} = Σ_{i=1}^{n} x_{ij}     with j = 1, ..., p

x_{++} = Σ_{i=1}^{n} Σ_{j=1}^{p} x_{ij}     (32.2)
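A small NumPy sketch (not from the original text) of eqs. (32.1) and (32.2), built from the data of Table 32.1; the category orderings are the same as in Tables 32.2 to 32.4.

import numpy as np

# Table 32.1: compound and disorder recorded for each of the 30 persons.
compounds = list("LDCTLDCCDC" "DLDCCDLDLL" "CTDTDTTCDD")
disorders = list("AESSAAESAE" "SASESASESA" "ESASEASEEA")

comp_cats = ["C", "D", "L", "T"]
dis_cats = ["A", "E", "S"]

# Indicator tables I (30 x 4, Table 32.2) and J (30 x 3, Table 32.3).
I = np.array([[c == cat for cat in comp_cats] for c in compounds], dtype=int)
J = np.array([[d == cat for cat in dis_cats] for d in disorders], dtype=int)

# Contingency table of eq. (32.1): X = I^T J  (4 x 3, Table 32.4).
X = I.T @ J
print(X)                    # rows C, D, L, T; columns A, E, S
print(X.sum(axis=1))        # row-sums x_{i+}: [8 11 6 5]
print(X.sum(axis=0))        # column-sums x_{+j}: [10 9 11]
print(X.sum())              # global sum x_{++}: 30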
32.2 Chi-square statistic

Each element x_{ij} of a contingency table X can be thought of as a random variate. Under the assumption that all marginal sums are fixed, we can derive the expected values E(x_{ij}) for each of the random variates x_{ij} [1]:

E(x_{ij}) = x_{i+} x_{+j} / x_{++}     with i = 1, ..., n and j = 1, ..., p     (32.3)
Table 32.5 presents the expected values of the elements in the contingency Table 32.4. Note that the marginal sums in the two tables are the same. There are, however, large discrepancies between the observed and the expected values. Small discrepancies between the tabulated values of our illustrations and their exact values may arise due to rounding of intermediate results. For example, the observed value for Clonazepam in anxiety has been recorded as 0 in Table 32.4. The corresponding expected value in Table 32.5 has been computed from eq. (32.3) as follows:

E(x_{11}) = x_{1+} x_{+1} / x_{++} = (8 × 10) / 30 = 2.67
where x_{11} is the element in the first row and first column of the table, where x_{1+}, x_{+1} are the corresponding marginal sums, and where x_{++} is the global sum.

The generalization of Pearson's chi-square statistic χ² for 2×2 contingency tables, which has been discussed in Section 16.2.3, can be written as:

χ² = Σ_{i=1}^{n} Σ_{j=1}^{p} (x_{ij} - E(x_{ij}))² / E(x_{ij})     with (n-1)(p-1) degrees of freedom     (32.4)
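The calculation of eqs. (32.3) and (32.4) on Table 32.4 can be sketched as follows (not part of the original text); the result agrees with the value of 15.3 with 6 degrees of freedom quoted below.

import numpy as np
from scipy.stats import chi2

# Contingency Table 32.4 (rows C, D, L, T; columns A, E, S).
X = np.array([[0, 5, 3],
              [5, 4, 2],
              [4, 0, 2],
              [1, 0, 4]], dtype=float)

row = X.sum(axis=1, keepdims=True)     # x_{i+}
col = X.sum(axis=0, keepdims=True)     # x_{+j}
total = X.sum()                        # x_{++}

E = row @ col / total                  # expected values of eq. (32.3), Table 32.5
chi2_stat = ((X - E) ** 2 / E).sum()   # eq. (32.4)
dof = (X.shape[0] - 1) * (X.shape[1] - 1)
p_value = chi2.sf(chi2_stat, dof)

print(round(chi2_stat, 1), dof, round(p_value, 2))   # 15.3  6  0.02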
TABLE 32.5
Table of expected values E(X) computed from Table 32.4

              Anxiety   Epilepsy   Sleep   Sum
Clonazepam      2.67      2.40      2.93     8
Diazepam        3.67      3.30      4.03    11
Lorazepam       2.00      1.80      2.20     6
Triazolam       1.67      1.50      1.83     5
Sum            10         9        11       30
The chi-square statistic is a measure for the global deviation of the observed values in a contingency table from their expected values. The number of degrees of freedom of the chi-square statistic defines the number of elements of the table that can be varied independently of each other. Since n row-sums and p column-sums are held fixed and since both add up to the same global sum, the number of degrees of freedom of an n×p contingency table is np minus n + p - 1, or (n - 1)(p - 1). The chi-square statistic can be tested for significance using tabulated critical values of the statistic for a fixed level of significance (e.g. 0.05, 0.01 or 0.001) and for the specified degrees of freedom. In the case of the 4×3 contingency Table 32.4 we obtain a chi-square value of 15.3 with 6 degrees of freedom, which is significant at the 0.05 level of probability as it exceeds the critical value of 12.6. The probability corresponding with a chi-square value exceeding 15.3 equals 0.02. From this result we may be led to conclude that there are significant differences between the prescription patterns of the four compounds, as well as between the treatment patterns of the three disorders.

One should be cautious, however, in drawing such a conclusion, because of three considerations. First, the statistical evidence of this positive outcome is not overwhelming, as it leaves a two percent chance of being false. Secondly, the chi-square test is not quite appropriate in this case, since all but two cell frequencies are smaller than five and three of them are even zero. Finally, a statistically significant result does not necessarily mean that the differences between compounds and disorders are of practical relevance. These considerations should not preclude, however, an exploratory analysis of the data in order to search for meaningful correlations and trends that might be the object of more detailed studies in the future. In the case of a negative outcome of the chi-square test, one might have assumed that there are no differences between the drugs and between the disorders. This conclusion also might be erroneous in view of the lack of power that is inherent to a small sample (30 persons in this example). In this case, an exploratory analysis may still yield relevant associations and tendencies which can guide the design of future studies.

32.3 Closure

In the literature we encounter three common transformations of the contingency table. These can be classified according to the type of closure that is involved. By closure we mean the operation of dividing each element in a row or column of a table by its corresponding marginal sum. We reserve the word closure for the specific operation where the elements in a row or column of the table are reduced to unit sum. This way, we distinguish between closure and normalization, as the latter implies an operation which reduces the elements of a table to unit sums of squares. In a strict sense, closure applies only to tables with non-negative elements.
Fig. 32.1. Row-profiles representing the data in the rows of Table 32.4.
32.3.1 Row-closure

Comparison between rows of a contingency table X is made easier after dividing each element of the table by its corresponding row-sum. This operation is called row-closure as it forces all rows of the table to possess the same unit sum. After closure, the rows of the table are called row-profiles. These can be represented in the form of stacked histograms such as shown in Fig. 32.1. The average or expected row-profile is obtained by dividing the marginal row in the original table by the global sum. The matrix F of deviations of row-closed profiles from their expected values is defined by:

f_{ij} = x_{ij}/x_{i+} - x_{+j}/x_{++}     (32.5)

under the assumption of fixed marginal sums.

32.3.2 Column-closure

Comparison between columns of a contingency table X is also made easier after dividing each element of the table by its corresponding column-sum. This operation is referred to as column-closure as it makes all columns of the table possess the same unit sum. After closure, the columns of the table are called column-profiles. These can also be visualized in the form of stacked histograms as is shown in Fig. 32.2.
Fig. 32.2. Column-profiles representing the data in the columns of Table 32.4.
The average or expected column-profile results from dividing the marginal column by the global sum. The matrix G of deviations of column-profiles from their expected values is given by:

g_{ij} = x_{ij}/x_{+j} - x_{i+}/x_{++}     (32.6)
assuming fixed marginal sums.

32.3.3 Double-closure

Double-closure is the joint operation of dividing each element of the contingency table X by the product of its corresponding row- and column-sums. The result is multiplied by the grand sum in order to obtain a dimensionless quantity. In this context the term dimensionless indicates a certain symmetry in the notation. If x were to have a physical dimension, then the expressions involving x would appear as dimensionless. In our case, x represents counts and, strictly speaking, is dimensionless itself. Subsequently, the result is transformed into a matrix Z of deviations of double-closed data from their expected values:

z_{ij} = (x_{ij} x_{++}) / (x_{i+} x_{+j}) - 1 = (x_{ij} - E(x_{ij})) / E(x_{ij})     (32.7)

assuming fixed marginal sums.
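The three transformations can be sketched in NumPy as follows (not part of the original text); applied to Table 32.4, the double-closed matrix Z reproduces the body of Table 32.6.

import numpy as np

X = np.array([[0, 5, 3],
              [5, 4, 2],
              [4, 0, 2],
              [1, 0, 4]], dtype=float)

row = X.sum(axis=1, keepdims=True)     # x_{i+}
col = X.sum(axis=0, keepdims=True)     # x_{+j}
total = X.sum()                        # x_{++}

F = X / row - col / total              # row-closure,    eq. (32.5)
G = X / col - row / total              # column-closure, eq. (32.6)
Z = X * total / (row * col) - 1        # double-closure, eq. (32.7)

print(np.round(Z, 3))                  # compare with Table 32.6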
TABLE 32.6
Table of double-closed data Z, with weighted means, weighted sums of squares and weight coefficients added in the margins, from Table 32.4

Z              Anxiety   Epilepsy   Sleep    m_i^w   c_i^w   w_i
Clonazepam     -1.000     1.083      0.023   0.000   0.686   0.267
Diazepam        0.364     0.212     -0.504   0.000   0.151   0.367
Lorazepam       1.000    -1.000     -0.091   0.000   0.636   0.200
Triazolam      -0.400    -1.000      1.182   0.000   0.865   0.167
m_j^w           0.000     0.000      0.000   0.000
c_j^w           0.542     0.696      0.328           0.510
w_j             0.333     0.300      0.367                   1.000
The effect of these operations is shown in Table 32.6 using the data of our example in Table 32.4. For example, in the case of Clonazepam in anxiety, the observed value x_{11} has been shown to be equal to 0 in Table 32.4, and the corresponding expected value has been derived to be 2.67 in Table 32.5. The corresponding deviation of the double-closed value from its expected value is then computed as:

z_{11} = (x_{11} - E(x_{11})) / E(x_{11}) = (0 - 2.67) / 2.67 = -1.000

It is important to realize that closure may reduce the rank of the data matrix by one. This is the case with row-closure when n > p, and with column-closure when n < p; double-closure always reduces the rank by one.
32.4 Weighted metric

A weighted Euclidean metric is defined by the weighted scalar product:

x^T W y = Σ_{i=1}^{n} w_i x_i y_i     with Σ_{i=1}^{n} w_i = 1     (32.8)
where x and y are vectors with the same dimension n and where W is a diagonal matrix of dimension n: W = diag(w)
in which the elements of the weight vector w are positive constants which add to unity. In what follows we will omit the predicate 'Euclidean' which is intended implicitly throughout this chapter. A weighted metric is uniquely defined by the metric matrix W. The usual metric is but a special case of the weighted metric, which is obtained when W equals the identity matrix I (divided by the dimension n). In the usual metric we assign equal weights to all the axes of coordinate space. This is the metric that is used in daily life for measuring distances, lengths, widths, heights, etc. Weighted metrics play an important role in multivariate data analysis when different weights are assigned to objects and measurements according to their size or importance. This is the case in correspondence factor analysis which will be discussed in Section 32.6.

It has been shown in Chapter 29 that the set of vectors of the same dimension defines a multidimensional space S in which the vectors can be represented as points (or as directed line segments). If this space is equipped with a weighted metric defined by W, it will be denoted by the symbol S_w. The squared weighted distance between two points representing the vectors x and y in S_w is defined by the weighted scalar product:

||x - y||_w² = (x - y)^T W (x - y) = Σ_{i=1}^{n} w_i (x_i - y_i)²     (32.9)
where ||x - y||_w refers to the weighted norm or length of the vector x - y in the weighted metric defined by W. Distances in S_w are different from those in the usual space S. A weighted space S_w can be represented graphically by means of stretched coordinate axes [2]. The latter result when the basis vectors of the space are scaled by means of the corresponding quantities in √w, where the vector w contains the main diagonal elements of W. Figure 32.3 shows that a circle is deformed into an ellipse if one passes from usual coordinate axes in the usual metric I to stretched coordinate axes in the weighted metric W. In this example, the horizontal axis in S_w is stretched by a factor √1.6 = 1.26 and the vertical axis is shrunk by a factor √0.4 = 0.63. A similar effect in S_w is observed on angular distances (or angles), which are also different from those in the unweighted space S:

cos ϑ_w = x^T W y / (||x||_w ||y||_w)     (32.10)

with

||x||_w² = x^T W x   and   ||y||_w² = y^T W y     (32.11)

where ϑ_w is the angular distance and where ||x||_w and ||y||_w are the weighted norms or
Fig. 32.3. Effect of a weighted metric on distances. (a) Representation of a circle in the space S defined by the usual Euclidean metric. (b) Representation of the same circle in the space S_w defined by a weighted Euclidean metric, with metric matrix W = diag(1.6, 0.4).
lengths of the vectors x and y as measured from the origin of S_w. Two vectors x and y are said to be orthogonal in the weighted metric defined by W if and only if:

x^T W y = 0     (32.12)

Generally, two vectors that are orthogonal in S will be oblique in S_w, unless the vectors are parallel to the coordinate axes. This is illustrated in Fig. 32.4. Furthermore, if x and y are orthogonal vectors in S, then the vectors W^{-1/2} x and W^{-1/2} y are orthogonal in S_w. This follows from the definition of orthogonality in the metric W (eq. (32.12)):

(W^{-1/2} x)^T W (W^{-1/2} y) = x^T y = 0

In Section 29.3 it has been shown that a matrix generates two dual spaces: a row-space S^n in which the p columns of the matrix are represented as a pattern P^p, and a column-space S^p in which the n rows are represented as a pattern P^n. Separate weighted metrics for row-space and column-space can be defined by the corresponding metric matrices W_n and W_p. This results in the complementary weighted spaces S_w^n and S_w^p, each of which can be represented by stretched coordinate axes using the stretching factors in √w_n and √w_p, where the vectors w_n and w_p contain the main diagonal elements of W_n and W_p.
Fig. 32.4. Effect of a weighted metric on angular distances. (a) Representation of two line segments that are perpendicular in the space S defined by the usual Euclidean metric. (b) Representation of the same two line segments in the space S_w defined by a weighted Euclidean metric. The metric matrix W is the same as in Fig. 32.3b.
Weight coefficients can be used to define weighted measures of location and spread. In particular, we define the weighted mean m_x^w of a vector x by means of:

m_x^w = x^T w = x^T W 1 = Σ_{i=1}^{n} w_i x_i     (32.13)

and the weighted variance c_x^w by means of:

c_x^w = (x - m_x^w 1)^T W (x - m_x^w 1) = (x^T - m_x^w 1^T)² W 1 = Σ_{i=1}^{n} w_i (x_i - m_x^w)²     (32.14)

where 1 represents a sum vector with n elements equal to one, and where the superscript w indicates that the measures are taken in a weighted metric. The above expression of the weighted variance is a special case of a weighted sum of squares, which arises when the elements of the vector have been previously reduced by their mean value. Note that the weight coefficients are normalized to unit sum. In this notation, x² results in squaring each element of x. The expression x^T W 1 produces the weighted sum of elements in x. The usual expressions for the mean and the variance result when the weight coefficients are constant and defined as:

w_i = 1/n   and   w_j = 1/p     with i = 1, ..., n and j = 1, ..., p
In the case of an n×p matrix X we can define two sets of weighted measures of location and spread, one for each of the two dual spaces S^p and S^n, in terms of weighted means and weighted variances:

m_n^w = X W_p 1_p = X w_p,   or   m_i^w = Σ_{j=1}^{p} w_j x_{ij}

c_n^w = (X - m_n^w 1_p^T)² W_p 1_p,   or   c_i^w = Σ_{j=1}^{p} w_j (x_{ij} - m_i^w)²     (32.15)

m_p^w = X^T W_n 1_n = X^T w_n,   or   m_j^w = Σ_{i=1}^{n} w_i x_{ij}

c_p^w = (X^T - m_p^w 1_n^T)² W_n 1_n,   or   c_j^w = Σ_{i=1}^{n} w_i (x_{ij} - m_j^w)²     (32.16)
where 1_n and 1_p represent sum vectors with n and p elements equal to one, respectively. In this notation, X² results in squaring each element of X. The expression X W_p 1_p produces a weighted summation over the columns of X, and X^T W_n 1_n performs a weighted summation over the rows of X. One may assign greater weight (or mass) to objects and measurements that should have a greater influence in the analysis than the others. In regression problems one may define weights to be inversely proportional to the variance of the objects, such as to lend more influence to those that have been measured with greater precision. In correspondence factor analysis (CFA) the weights are defined from the marginal sums of the table, as will be shown in Section 32.6. The result of an exploratory analysis depends to some extent on the choice of the weight coefficients in w_n and w_p. Sometimes the effect is only slight, for example, when the observed data in a contingency table are close to their expected values. In other cases, when the discrepancies between observed and expected values are large, the effect of variable weighting can be prominent. As a general rule, one will select the weights according to the nature of the data and the objective of the analysis.
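A brief NumPy sketch (not from the original text) of the weighted measures of eqs. (32.15) and (32.16), applied to the double-closed matrix Z of Table 32.6; since Z is already centred, the weighted sums of squares around the weighted means coincide with those around the origin and reproduce the margins of that table.

import numpy as np

Z = np.array([[-1.000,  1.083,  0.023],
              [ 0.364,  0.212, -0.504],
              [ 1.000, -1.000, -0.091],
              [-0.400, -1.000,  1.182]])
w_n = np.array([0.267, 0.367, 0.200, 0.167])   # weights of the compounds
w_p = np.array([0.333, 0.300, 0.367])          # weights of the disorders

m_p = Z.T @ w_n                # weighted column-means, close to zero
c_p = (Z ** 2).T @ w_n         # 0.542, 0.696, 0.328 (column margins of Table 32.6)
c_n = (Z ** 2) @ w_p           # 0.686, 0.151, 0.636, 0.865 (row margins)

print(np.round(m_p, 3), np.round(c_p, 3), np.round(c_n, 3))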
32.5 Distance of chi-square

The global distance of chi-square δ of a contingency table X is derived from the chi-square statistic as follows:

δ² = χ² / x_{++} = Σ_{i=1}^{n} Σ_{j=1}^{p} (x_{ij} - E(x_{ij}))² / (x_{++} E(x_{ij}))     (32.17)
which is a dimensionless quantity. We refer to the quantity δ² as the global interaction between the rows and columns of the contingency table. The concept of distance of chi-square is important in the multivariate analysis of contingency tables. As we will see in Section 32.6 on correspondence factor analysis (CFA), results are often converted into a metric diagram in which distances between representations of rows and distances between representations of columns are related to chi-square values. Later in this section, we will show that the global distance of chi-square is equal to the global weighted sum of squares of the transformed contingency table. There are three ways by which the global distance of chi-square can be meaningfully rewritten as a weighted sum. These correspond with the three different ways of closing the data in the original contingency table X, such as has been described above in Section 32.3. The metric matrices W_n and W_p have to be defined differently for each of the three cases.

In the case of the contingency Table 32.4 we obtained a chi-square of 15.3. Taking into account that the global sum equals 30, this produces a global interaction of 15.3/30 = 0.510. The square root of this value is the global distance of chi-square which is equal to 0.714.

32.5.1 Row-closure

The terms in the expression of the global distance of chi-square δ can be rearranged into:

δ² = Σ_{i=1}^{n} (x_{i+}/x_{++}) Σ_{j=1}^{p} (x_{++}/x_{+j}) f_{ij}² = Σ_{i=1}^{n} w_i Σ_{j=1}^{p} w_j f_{ij}²     with w_i = x_{i+}/x_{++} and w_j = x_{++}/x_{+j}     (32.18)
176
where F is the matrix of deviations of row-closed profiles, cf. eq. (32.5): ^ ij
"^ +j
fij =
/ = 1,..., nandy = 1, ...,p
with
In 5^ the row-profiles are centred about the origin, as can be shown by working out the expression for the weighted column-means m^: •^i+ -^+7
m
y-^++
"++
=0
with;= 1, ...,/7
y
Weighted sums of squares cJJ' of the row-profiles in F around the origin of 5^ can be expressed as distances of chi-square: .2
r Ay
^r=l^jf.;=l
A+^
1
v^±:__^jiiZ__s2 =6
with /= 1, ..., n
(32.19)
"+j
The distance δ_i is a measure of the difference of the ith row-closed profile with respect to the average or expected row-profile. The distance δ_{ii'} between two row-profiles i and i' can also be expressed formally as a distance of chi-square:

δ_{ii'}² = Σ_{j=1}^{p} w_j (f_{ij} - f_{i'j})² = Σ_{j=1}^{p} (x_{++}/x_{+j}) (x_{ij}/x_{i+} - x_{i'j}/x_{i'+})²     with i, i' = 1, ..., n     (32.20)
32.5.2 Column-closure

Another rearrangement of the terms in the expression of the global distance of chi-square δ leads to:

δ² = Σ_{j=1}^{p} (x_{+j}/x_{++}) Σ_{i=1}^{n} (x_{++}/x_{i+}) g_{ij}² = Σ_{j=1}^{p} w_j Σ_{i=1}^{n} w_i g_{ij}²     with w_i = x_{++}/x_{i+} and w_j = x_{+j}/x_{++}     (32.21)

where G stands for the matrix of deviations of column-closed profiles, cf. eq. (32.6):

g_{ij} = x_{ij}/x_{+j} - x_{i+}/x_{++}     with i = 1, ..., n and j = 1, ..., p
In this case we find that the pattern of column-profiles in S_w is centred about the origin, as can be seen by computing the weighted row-means m_i^w:

m_i^w = Σ_{j=1}^{p} w_j g_{ij} = Σ_{j=1}^{p} (x_{+j}/x_{++}) (x_{ij}/x_{+j} - x_{i+}/x_{++}) = 0     with i = 1, ..., n
The weighted sums of squares c_j^w of the points around the origin of S_w can be expressed as distances of chi-square:

c_j^w = Σ_{i=1}^{n} w_i g_{ij}² = Σ_{i=1}^{n} (x_{++}/x_{i+}) (x_{ij}/x_{+j} - x_{i+}/x_{++})² = δ_j²     with j = 1, ..., p     (32.22)
The distance δ_j is a measure of the difference of the jth column-profile from the average or expected column-profile. The distance δ_{jj'} between two column-closed profiles j and j' can also be defined as a distance of chi-square:

δ_{jj'}² = Σ_{i=1}^{n} w_i (g_{ij} - g_{ij'})² = Σ_{i=1}^{n} (x_{++}/x_{i+}) (x_{ij}/x_{+j} - x_{ij'}/x_{+j'})²     with j, j' = 1, ..., p     (32.23)
32.5.3 Double-closure

A last rearrangement of terms in the expression of the global distance of chi-square δ produces:

δ² = Σ_{i=1}^{n} (x_{i+}/x_{++}) Σ_{j=1}^{p} (x_{+j}/x_{++}) z_{ij}² = Σ_{i=1}^{n} w_i Σ_{j=1}^{p} w_j z_{ij}²     with w_i = x_{i+}/x_{++} and w_j = x_{+j}/x_{++}     (32.24)

and where Z is the matrix of deviations of the double-closed data, cf. eq. (32.7):

z_{ij} = (x_{ij} x_{++}) / (x_{i+} x_{+j}) - 1     with i = 1, ..., n and j = 1, ..., p
The pattern of points produced by Z is centred in both dual spaces S^p and S^n, since the weighted row- and column-means m_i^w and m_j^w are zero:

m_j^w = Σ_{i=1}^{n} w_i z_{ij} = Σ_{i=1}^{n} (x_{i+}/x_{++}) ((x_{ij} x_{++})/(x_{i+} x_{+j}) - 1) = 0     with j = 1, ..., p

m_i^w = Σ_{j=1}^{p} w_j z_{ij} = Σ_{j=1}^{p} (x_{+j}/x_{++}) ((x_{ij} x_{++})/(x_{i+} x_{+j}) - 1) = 0     with i = 1, ..., n

The global weighted mean of Z can also be shown to be zero:

m^w = Σ_{i=1}^{n} w_i Σ_{j=1}^{p} w_j z_{ij} = 0
Weighted sums of squares c_i^w and c_j^w around the origins of the two spaces can be interpreted as distances of chi-square:

c_j^w = Σ_{i=1}^{n} w_i z_{ij}² = δ_j²     with j = 1, ..., p

c_i^w = Σ_{j=1}^{p} w_j z_{ij}² = δ_i²     with i = 1, ..., n     (32.25)
where δ_i and δ_j have been defined above in terms of the elements and marginal sums of X. These distances are measures for the differences of the row- or column-closed profiles from their average or expected profiles. The symbols c_i^w and c_j^w denote the weighted sums of squares of row i and column j in the transformed matrix Z. The values of Z in Table 32.6 already reveal some peculiar aspects of the data. One should compare these values with zero, i.e. the value that would be obtained if the observed values were equal to their expected ones. The larger the deviation from zero, the stronger is the interaction between the corresponding row and
column. Positive deviations point toward a positive interaction. For example, Triazolam shows a strong positive interaction with sleep (1.182), which means that the compound Triazolam is particularly indicated for the treatment of sleeplessness. Likewise, a large negative value reflects a strong negative interaction. For example, Triazolam also presents a strong negative interaction with epilepsy (-1.000), which means that Triazolam is rarely prescribed for the treatment of epilepsy. In the following section we will refer to interaction by the name of correspondence. Note that the mean and mean square values in Table 32.6 are the result of weighted averages.

Using the deviations Z of the double-closed data from their expected values and the column-weights w_p in Table 32.6, we compute the distance of chi-square from the origin of Triazolam (i = 4) using eq. (32.25):

δ_{i=4}² = Σ_{j=1}^{3} w_j z_{4j}² = 0.333(-0.400)² + 0.300(-1.000)² + 0.367(1.182)² = 0.865

or δ_{i=4} = 0.931. Similarly, we compute the distance of chi-square from the origin for epilepsy, using the same deviations Z and the row-weights w_n:

δ_{j=2}² = Σ_{i=1}^{4} w_i z_{i2}² = 0.267(1.083)² + 0.367(0.212)² + 0.200(-1.000)² + 0.167(-1.000)² = 0.696

or δ_{j=2} = 0.834.

Distances between points in S_w^p and S_w^n can also be defined formally as distances of chi-square:
c_{ii'}^w = Σ_{j=1}^{p} w_j (z_{ij} - z_{i'j})² = δ_{ii'}²     with i, i' = 1, ..., n     (32.26)

c_{jj'}^w = Σ_{i=1}^{n} w_i (z_{ij} - z_{ij'})² = δ_{jj'}²     with j, j' = 1, ..., p
i
where 5 - and 5^y. have been defined before in terms of the elements and marginal sums of X. We use the symbols cJJ. and cjj^ to denote the weighted sums of cross-products between rows / and i' and between columnsy and/ in the transformed matrix Z.
180
Using the deviations Z and the column-weights w^ in Table 32.6, we compute the distance of chi-square between Lorazepam (/ = 3) and Triazolam (/ = 4) from eq. (32.26): 3
S^=34 = X ^ / ^ 3 ; -^47)^ = 0.333(1.000 +0.400)2+ 0.300(-l.000+ 1.000)2 j
^2_ + 0.367(-0.091 - 1.182)2 = 1.248
or 8,r=34 = i n 7
In a similar way we derive the distance of chi-square between anxiety (/ = 1) and epilepsy (/ = 2) using the same deviations Z and the row-weights w„: 4
^r=\2 =Ya^Mi\
-^ii)^
=0.267(-1.000-1.083)2+ 0.367(0.364-0.212)2
+ 0.200(1.000 + 1.000)2 + o.l67(-0.400 + 1.000)^ = 2.025 or 5r=i2 = 1-423 The global weighted sum of squares c^ of the transformed data Z can be shown to be equal to the global interaction 52 between rows and columns:
^ " = X ^ / S^7^^=52 '
(32.27)
j
One finds an illustration of the measures of location and spread in the margins of Table 32.6. The geometric properties of row-closed and column-closed profiles are summarized in Fig. 32.5. In the following section on the analysis of contingency tables we will interpret the distances of chi-square in terms of contrasts. In the present context we use the word contrast in the sense of difference (see also Section 31.2.4). For example, we will show that the distance of chi-square from the origin δ_i can be related to the amount of contrast contained in row i of the data table, with respect to what can be expected. Similarly, the distance δ_j can be associated with the amount of contrast in column j, relative to what can be expected. In a geometrical sense, one will find rows and columns with large contrasts at a relatively large distance from the origin of S_w^p and S_w^n, respectively. The distance of chi-square δ_{ii'} then represents the amount of contrast between rows i and i' with respect to the difference between their expected values. Similarly, the distance of chi-square δ_{jj'} indicates the amount of contrast between columns j and j' in comparison to the difference between their
Fig. 32.5. (a) Geometric representation of row-closed profiles in the weighted column-space S_w^p. The row-profiles are shown as an elliptic pattern P^n. The figure shows the distance of a row-profile i from the origin and the distance between two row-profiles i and i'. (b) Geometric representation of column-closed profiles in the weighted row-space S_w^n. The column-profiles are shown as an elliptic pattern P^p. The figure shows the distance of a column-profile j from the origin and the distance between two column-profiles j and j'.
expected values. In geometrical terms, one will find in S_w^p rows with large contrast between them at a relatively large distance from one another. Likewise, columns with large contrast between them will also be found in S_w^n at a relatively large distance from one another. Note that double-closure yields the same results as those produced by row-closure in S_w^p and by column-closure in S_w^n. From an algorithmic point of view, double-closure is the more attractive transformation, although row- and column-closure possess a strong didactic appeal.
32.6 Correspondence factor analysis

32.6.1 Historical background

The origin of the analysis of contingency tables can be traced back to the definition by Pearson in 1900 of the chi-square statistic as a measure for goodness of fit [3]. The partition of the chi-square into main effects and interaction has been described around 1950 by Lancaster [4]. One of the first practical applications of multivariate analysis to contingency tables is due to Hill [5] who developed the method of reciprocal averaging. This method is used for the ordination of plant species from tabulated surveys carried out in various plots of land yielding a species × plots matrix. The result allows one to order species according to a latent vector which can be ascribed to environmental characteristics, such as density of the soil, abundance of sunlight, thickness of cover, etc. Reciprocal averaging is considered as the forerunner of correspondence factor analysis (CFA) which has been developed by Benzecri [6,7] and a group of French statisticians [8]. It is often also referred to as correspondence analysis (CA) for short. Originally, the term correspondence means association between rows and columns of a table, i.e. the number of joint occurrences of the categories represented by the rows and columns of a contingency table. The French term 'analyse des correspondances' points to the analysis of the interaction between rows and columns. The French school also used a geometrical approach for the interpretation of its results, especially by means of the biplot which has been designed by Gabriel [9]. In the discussion of measurement tables in Chapter 31, we have defined biplots as the joint representation of rows and columns of a table in a low-dimensional space spanned by latent vectors. The concept of latent vectors will be generalized to the case of weighted metrics in the following section.

Initially, CFA has been applied in linguistic studies. Later on, applications have been described in many diverse fields of the empirical sciences. The method of CFA can be extended to the analysis of tables with non-negative elements, not necessarily counts, provided that these have been recorded with a common unit of measurement. Such a table is called homogeneous, as opposed to a heterogeneous table in which the rows or columns are defined with different units. Heterogeneous tables can be analyzed by means of CFA, after subdivision of the measurements into discrete categories (e.g., in the case of lipophilicity: strongly lipophilic, weakly lipophilic or hydrophilic, strongly hydrophilic). CFA has played a decisive role in the development of multivariate data analysis, especially by means of its graphic aspect. It also has stimulated interest in related areas such as principal components analysis and cluster analysis. The position of CFA, as well as that of its related methods, is that of a preliminary step to statistical inference. In this respect, it has been regarded as an
inductive approach [2]. This means that, first, hypotheses have to be formulated from an analysis of empirical evidence. After this exploratory phase, the newly formulated hypotheses have to be tested more rigorously using appropriate statistical methods of inference.

Correspondence factor analysis can be described in three steps. First, one applies a transformation to the data which involves one of the three types of closure that have been described in the previous section. This step also defines two vectors of weight coefficients, one for each of the two dual spaces. The second step comprises a generalization of the usual singular value decomposition (SVD) or eigenvalue decomposition (EVD) to the case of weighted metrics. In the third and last step, one constructs a biplot for the geometrical representation of the rows and columns in a low-dimensional space of latent vectors.

32.6.2 Generalized singular value decomposition

We assume that Z is a transformed n×p contingency table (e.g. by means of row-, column- or double-closure) with associated metrics defined by W_n and W_p. Generalized SVD of Z is defined by means of:

Z = A μ B^T     (32.28)

where μ is an r×r diagonal matrix of singular values and where r is the rank of Z. The matrices A and B are the orthonormal matrices of generalized singular vectors, respectively with dimensions n×r and p×r in the weighted metrics W_n and W_p. Orthonormality in the weighted metrics implies that:

A^T W_n A = B^T W_p B = I_r     (32.29)

where I_r is the identity matrix of dimension r. It can be proved that the above generalized SVD of the matrix Z can be derived from the usual SVD of the matrix W_n^{1/2} Z W_p^{1/2}:

W_n^{1/2} Z W_p^{1/2} = U Λ V^T     with U^T U = V^T V = I_r     (32.30)

where Λ is a diagonal matrix of dimension r, and where U and V are the usual orthonormal matrices of dimensions n×r and p×r, which contain the singular vectors of W_n^{1/2} Z W_p^{1/2}, and where I_r and r are defined as above.

The results of applying these operations to the double-closed data in Table 32.6 are shown in Table 32.7. The analysis yielded two latent vectors with associated singular values of 0.567 and 0.433.
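A possible NumPy sketch (not from the original text) of eqs. (32.30) and (32.31) applied to the data of Table 32.6; because of the sign indeterminacy of the SVD and the rounding of Z, the signs and last digits may differ slightly from the tabulated values.

import numpy as np

Z = np.array([[-1.000,  1.083,  0.023],
              [ 0.364,  0.212, -0.504],
              [ 1.000, -1.000, -0.091],
              [-0.400, -1.000,  1.182]])
w_n = np.array([0.267, 0.367, 0.200, 0.167])
w_p = np.array([0.333, 0.300, 0.367])

Wn_sqrt = np.diag(np.sqrt(w_n))
Wp_sqrt = np.diag(np.sqrt(w_p))

# Usual SVD of W_n^(1/2) Z W_p^(1/2), cf. eq. (32.30).
U, sv, Vt = np.linalg.svd(Wn_sqrt @ Z @ Wp_sqrt)
V = Vt.T

# Generalized singular vectors, cf. eq. (32.31).
A = np.diag(1 / np.sqrt(w_n)) @ U[:, :2]
B = np.diag(1 / np.sqrt(w_p)) @ V[:, :2]

print(np.round(sv[:2], 3))          # singular values, about 0.567 and 0.434
print(np.round(A, 3))               # compare with Table 32.8 (signs may flip)
print(np.round(B, 3))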
TABLE 32.7
Usual singular vectors extracted from the transformed Z in Table 32.6, i.e. W_n^{1/2} Z W_p^{1/2}

Compound        u1       u2          Disorder     v1       v2
Clonazepam     0.750    0.113        Anxiety    -0.644   -0.502
Diazepam      -0.025   -0.540        Epilepsy    0.762   -0.346
Lorazepam     -0.619   -0.148        Sleep      -0.075    0.792
Triazolam     -0.232    0.820
In Table 32.7 we observe a contrast (in the sense of difference) along the first row-singular vector u1 between Clonazepam (0.750) and Lorazepam (-0.619). Similarly we observe a contrast along the first column-singular vector v1 between epilepsy (0.762) and anxiety (-0.644). If we combine these two observations then we find that the first singular vector (expressed by both u1 and v1) is dominated by the positive correspondence between Clonazepam and epilepsy and between Lorazepam and anxiety. Equivalently, the observations lead to a negative correspondence between Clonazepam and anxiety, and between Lorazepam and epilepsy. In a similar way we can interpret the second singular vector (expressed by both u2 and v2) in terms of positive correspondences between Triazolam and sleep and between Diazepam and anxiety.
B = W-'^^\
(32.31)
|Ll = A where U, V and A have been defined above in eq. (32.30). The results of generalized SVD, when applied to the 4x3 contingency table, are presented in Table 32.8. The generalized singular vectors can be interpreted in the same way as we have done for the usual ones in Table 32.7. The only difference is that the elements of A and B contain the coordinates of the four compounds and of the three disorders in a two-dimensional diagram with a weighted metric defined by w^ and w^, the coefficients of which are shown in Table 32.6. Triazolam obtained the smallest weight (0.167) of the four compounds and Diazepam is given the largest weight (0.367). This amounts to a factor of about 2 between the largest and the smallest weights. The weights of the disorders are more homogeneous as they range between 0.300 and 0.367. The heterogeneity among the weights
TABLE 32.8
Generalized singular vectors extracted from Z in Table 32.6

Compound        a1       a2          Disorder     b1       b2
Clonazepam     1.452    0.220        Anxiety    -1.115   -0.870
Diazepam      -0.042   -0.892        Epilepsy    1.390   -0.632
Lorazepam     -1.385   -0.331        Sleep      -0.124    1.308
Triazolam     -0.569    2.010
assigned to the compounds is responsible for a distortion of the contrasts in Table 32.8 when compared to those of Table 32.7. This can be understood as follows. The elements of U and A are more or less proportional to one another with a proportionality constant of about 2. The same is true for the elements of V and B. For example, in the case of Clonazepam we find that the values a_{11}, a_{12} (1.452, 0.220) are about twice the corresponding values u_{11}, u_{12} (0.750, 0.113). Similarly, in the case of epilepsy, we obtain that the values b_{21}, b_{22} (1.390, -0.632) are about twice the corresponding values v_{21}, v_{22} (0.762, -0.346). Some values, however, do not conform to this proportionality. For example, in the case of Triazolam we observe that a_{42} (2.010) is more than twice the corresponding value u_{42} (0.820), and in the case of anxiety we have that b_{12} (-0.870) is less than twice the corresponding v_{12} (-0.502). These distortions are the result of variable weights of the compounds in the generalized SVD. In a sense the interpretation of A and B is fairer than that of U and V, since the former accounts for the differences between importance of the compounds (and to a lesser extent of the disorders). Compounds that are more often prescribed have a greater influence on the result of the analysis than those that are less frequently used.

One can define transition formulae for the two sets of generalized latent vectors in A and B (see also Section 31.1.6):

A = Z W_p B Λ^{-1}   and   B = Z^T W_n A Λ^{-1}     (32.32)

These transition formulae express one set of generalized latent vectors (A or B) in terms of the other set (B or A). They follow readily from the definition of the generalized SVD problem which has been stated above.

Generalized EVD can be developed along similar lines. By analogy with the analysis of measurement tables in Chapter 31, we distinguish between EVD on rows and EVD on columns. In the row-problem we compute the matrix of generalized row-eigenvectors A and the diagonal matrix of eigenvalues μ² from:
Z W_p Z^T = A μ² A^T     with A^T W_n A = I_r     (32.33)

From the solution of the usual EVD problem (Section 31.4.2) it follows that:

W_n^{1/2} Z W_p Z^T W_n^{1/2} = U Λ² U^T     with U^T U = I_r     (32.34)

which yields:

A = W_n^{-1/2} U   and   μ = Λ
with
B'^W^B = I,
(32.35)
From the solution of the usual EVD problem follows also: W^^^Z'^ W^ZW;^^^ = VA^ V^
with
V^V = I,
(32.36)
which yields: B = W;^/2V
and
^ =A
Note that the eigenvalues in A^ are the same in both the row- and column-problems. The rank r of Z is also the rank of the weighted cross-products matrices Z^W^Z and TN^^, Singular vectors and eigenvectors are the same, and it is appropriate to refer to them by the more general terms of latent vectors [10]. Singular values are the square roots of the corresponding eigenvalues. The latter express the contribution of the associated latent vectors to the global interaction between rows and columns. The trace of A^ is equal to the traces of the weighted cross-product matrices, which in turn are equal to the global weighted sum of squares c"^ or global interaction 6^ (eq. (32.27)): tr(A2) = tr(Wy2Z'^W„ZW^/2)^ = tr(W^/2ZW Z'^W^/2)^^w ^§2
(3237)
187
32.6.3 Biplots In CFA we can derive biplots for each of the three types of transformed contingency tables which we have discussed in Section 32.3 (i.e., by means of row-, column- and double-closure). These three transformations produce, respectively, the deviations (from expected values) of the row-closed profiles F, of the column-closed profiles G and of the double-closed data Z. It should be reminded that each of these transformations is associated with a different metric as defined by W^ and W^. Because of this, the generalized singular vectors A and B will be different also. The usual latent vectors U, V and the matrix of singular values A, however, are identical in all three cases, as will be shown below. Note that the usual singular vectors U and V are extracted from the matrix W^^^ZW^^^. In what follows we restrict the discussion of biplots to the case of double-closed data Z as defined by the elements: ^.. = ^JL^IL _ 1
^ith / = 1, ..., n
and; = 1,...,/?
(32.38)
using the metric matrices W^ and W^ with main diagonal elements: w-=
and
Wj-
^++
^++
All other cases can be readily derived by analogy. In the case of row-closed profiles we have to perform the following substitutions: (32.39)
z.:/ = fii =
w,. = - ^
and
Wf
_ •^++
J
•^-\-+
^+7
With column-closed profiles we sul ^ =
w,.
_
(32.40)
Sij =
•^++
•*^/+
and
W . := J
- ^ •^++
It can be proved that the matrix W^^^ ZW^^^ is the same in all cases. Hence, the decomposition UAV^ is also the same, and since the decomposition is unique we obtain that U, V and A are the same for all three cases. The matrices A and B, however, depend on the type of closure that has been applied to the data. From the latent vectors and singular values one can compute the nxr generalized score matrix S and the pxr generalized loading matrix L. These matrices contain the coordinates of the rows and columns in the space spanned by the latent vectors: S = AA«
(32.41a)
and L = BAP
(32.41b)
where the factor scaling coefficients a and P may take the values 0, 0.5 and 1. In Chapter 31 we have amply discussed the implications of the various choices of a and p for the interpretation of the biplots. The main considerations are reviewed below in the context of CFA: a=l for reproduction of distances from the origin 5, and distances 6,,/ between the row-profiles / and i\ p=i for reproduction of distances from the origin 5y and distances 5^^. between the column-profilesy and/, a + p=l for reproduction of the values Zij by means of perpendicular projections upon unipolar axes; also for reproduction of the differences (or contrasts) z^ - Zi^j and Zij - Zif by means of perpendicular projections upon bipolar axes. Unipolar and bipolar axes have been discussed in Section 31.2. Briefly, a unipolar axis is defined by the origin and the representation of a row or column. A bipolar axis is drawn through the representations of two rows or through the representations of two columns. Projections upon unipolar axes reproduce the values in the transformed data table. Projections upon bipolar axes reproduce the contrasts (i.e. differences) between values in the data table. In the case when a = 1 we obtain: SS^ ^AAAA'r = A A 2 A ' ^ (32.42)
189
as follows from eq. (32.41a). In particular we can write: s^ s.=\\s,\\^=j^sf,=j^wjzl=bj k
with/=l,...,n
(32.43)
j
which shows that the distances of the rows of the biplot in S'^ are equal to the distances of chi-square in 5^ which we have derived in Section 32.5. In the case of (3 = 1 we can show that: LL^ =BAABT =BA^B^ (32.44) = W-'^^ VA^V^ W;i/^ = Z ^ W, Z as follows from eq. (32.41b). In particular we have:
IJl,.=lll,.|P=t'/* = i ^ . - 4 = S ? k
with7=l,...,p
(32.45)
i
from which we conclude that distances of the columns on the biplot in S"^ are equal to the distances of chi-square in 5^ which we have already derived in Section 32.5. In the special case of a + (3 = 1 we can state: SL^
^AA^APB'T = A A B T
(32.46) = W;i/2 UAV^ W;i/2 = Z which follows from eq. (31.41). In particular we obtain: sj 1. = ^ s^j^ Ijj^ = Zij
with / = 1,..., n andy = 1,..., p
(32.47)
which shows that the projection in 5"^ of point s, upon a unipolar axis 1^ reproduces the value z„: r
sj (IJ - ly) = ^ s^^ilj^ - ly^) = Zij - z^y
with / = 1,..., n a n d ; , / = 1,..., p
k
(32.48) which defines the projection in S"^ of point s^ upon the bipolar axis 1^ - \y resulting in the contrast Zij - Zif.
190 TABLE 32.9 Generalized scores and loadings computed from Z in Table 32.6 Compound Clonazepam Diazepam Lorazepam Triazolam
Disorder 0.823 -0.024 -0.785 -0.322
0.095 -0.388 -0.144 0.873
Anxiety Epilepsy Sleep
-0.632 0.788 -0.070
-0.378 -0.275 0.568
By similar arguments we derive that: r
(s,. - s,.,)'' \j = X(^ik - ^n)hk = ^ - ^i'j
^ith /, r = 1,..., n
k
and7= l,...,p
(32.49)
which defines the projection in y of point \j upon the bipolar axis s, - s,^. The generalized scores and loadings from the 4x3 contingency table are given in Table 32.9 for the particular choice of a = (3 = 1. A plot of these generalized scores and loadings is shown in Fig. 32.6 using the coordinates of the points which are listed in Table 32.9. Distances from the origin and distances between points have been computed according to the expressions given in Section 32.5. In this example we find two distinct eigenvectors with eigenvalues of 0.322 and 0.188, respectively. This is compatible with the maximal rank of a 4x3 data matrix after double-closure. It should be reminded that double-closure always reduces the rank of the data matrix by one. The sum of the eigenvalues (0.510) is equal to the global sum of squares (interaction) which is shown in Table 32.6. The two plots can be superimposed into a biplot as shown in Fig. 32.7. Such a biplot reveals the correspondences between the rows and columns of the contingency table. The compound Triazolam is specific for the treatment of sleep disturbances. Anxiety is treated preferentially by both Lorazepam and Diazepam. The latter is also used for treating epilepsy. Clonazepam is specifically used with epilepsy. Note that distances between compounds and disorders are not to be considered. This would be a serious error of interpretation. A positive correspondence between a compound and a disorder is evidenced by relatively large distances from the origin and a common orientation (e.g. sleep disturbance and Triazolam). A negative correspondence is manifest in the case of relatively large distances from the origin and opposite orientations (e.g. sleep disturbance and Diazepam). The geometrical reconstruction of the values in Z by perpendicular projection of points upon axes can be justified algebraically by the matrix product of the scores S with the transpose of the loadings L according to eqs. (32.41) and (32.46):
Fig. 32.6. (a) Generalized score plot derived by correspondence factor analysis (CFA) from Table 32.4. The figure shows the distance of Triazolam from the origin, and the distance between Triazolam and Lorazepam. (b) Generalized loading plot derived by CFA from Table 32.4. The figure shows the distance of epilepsy from the origin, and the distance between epilepsy and anxiety.
S L^T = A Λ^α Λ^β B^T = A Λ B^T = Z     (32.50)

provided that α + β = 1. The most common choices for the exponents in CFA are α = β = 1 and α = β = 0.5. The choices α = 1, β = 0 and α = 0, β = 1 will produce biplots that are
Fig. 32.7. CFA biplot resulting from the superposition of the score and loading plots of Figs. 32.6a and b. The coordinates of the products and the disorders are contained in Table 32.9.
multidimensional extensions of the triangular (also called barycentric) diagrams that are used for representing ternary mixtures.

The reconstruction Z* of the transformed contingency table Z in a reduced space of latent vectors follows from:

S* L*^T = Z*     (32.51)

where S* and L* are obtained from the first r* columns of S and L, with r* < r. It is assumed that the columns of S and L are arranged in decreasing order of their corresponding singular values. The r*-dimensional subspace of dominant latent vectors can be interpreted as a (hyper)plane of closest fit to the pattern of points represented in the r-dimensional space of latent vectors. This is analogous to the interpretation of the results of principal components analysis (PCA) which has been discussed in Chapters 17 and 31. The goodness of the fit can be judged from the relative contribution γ of the first r* latent vectors to the global interaction, which can be expressed in the form:

γ = (Σ_{k=1..r*} λ_k²) / (Σ_{k=1..r} λ_k²)     (32.52)
CFA can also be defined as an expansion of a contingency table X using the generalized latent vectors in A, B and the singular values in Λ:

x_ij = (x_i+ x_+j / x_++) (1 + Σ_k λ_k a_ik b_jk)     with i = 1, ..., n and j = 1, ..., p     (32.53)

as can be derived from eqs. (32.28) and (32.38).
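Equations (32.50) to (32.53) can be made concrete with a few lines of numpy. The sketch below is ours (the 4×3 table of counts is an arbitrary example, not data from this chapter): it performs double closure, carries out the generalized SVD in the weighted metrics W_n and W_p, and checks the reconstitution of eq. (32.53).

```python
import numpy as np

# Hypothetical 4x3 contingency table of counts (not the data of the text).
X = np.array([[25.,  5., 10.],
              [10., 20.,  5.],
              [ 5., 10., 30.],
              [ 8., 12., 15.]])

wn = X.sum(axis=1) / X.sum()      # row weights (proportional to marginal sums)
wp = X.sum(axis=0) / X.sum()      # column weights

# Double closure: z_ij = x_ij * x_++ / (x_i+ * x_+j) - 1
Z = X * X.sum() / np.outer(X.sum(axis=1), X.sum(axis=0)) - 1.0

# Generalized SVD  Z = A Lam B^T  with  A^T W_n A = B^T W_p B = I  (eq. 32.50)
U, sv, Vt = np.linalg.svd(np.diag(np.sqrt(wn)) @ Z @ np.diag(np.sqrt(wp)))
r = int(np.sum(sv > 1e-12))       # double closure reduces the rank by one
A = np.diag(1 / np.sqrt(wn)) @ U[:, :r]
B = np.diag(1 / np.sqrt(wp)) @ Vt[:r, :].T
Lam = np.diag(sv[:r])

S  = A @ Lam                      # generalized scores   (alpha = 1)
Ld = B @ Lam                      # generalized loadings (beta  = 1)

# Reconstitution of X from the latent vectors, eq. (32.53)
X_rec = np.outer(X.sum(axis=1), X.sum(axis=0)) / X.sum() * (1 + A @ Lam @ B.T)
print(np.allclose(X_rec, X))      # True
print(sv[:r] ** 2)                # eigenvalues; their sum is the global interaction
```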
32.6.4 Application

CFA is applied to the contingency Table 32.10, which has been adapted from a retrospective study of US doctorates in chemistry and in other fields awarded to men and women [11]. The rows of this table refer to consecutive time intervals from 1920 up to 1989. The columns indicate non-overlapping categories of doctorates. The marginal sums provide the totals by time interval and by category of doctorate. These will be used for the assignment of weights to the rows and columns of the table. Note that the data from 1920 up to 1959 have been grouped into intervals of 10 or 5 years in order to level out statistical fluctuations due to small counts, especially in the category of women doctorates in chemistry. This pooling of rows does not affect the subsequent analysis by CFA. Indeed, rows (or columns) that are similar can be grouped together, provided that the corresponding weights are also added together. This is called the principle of distributional equivalence of CFA [2].

Casual inspection of the table reveals that the total number of doctorates reached a peak value of 33727 in 1973, which has only been surpassed in the last year of the time series. Peak values are also observed in the 1973 category of men in other fields and in the 1970 category of men in chemistry. In both categories related to women, no such peak values are observed as the annual counts seem to increase more or less steadily over the whole period.

A more comprehensive analysis is obtained by CFA, the results of which are summarized in Tables 32.11 and 32.12. The first column of these tables contains the diagonal elements of the metric matrices W_n and W_p. These normalized weight coefficients are proportional to the marginal sums of the original contingency Table 32.10 and sum to unity. The following three columns form the matrix of generalized scores S and the matrix of generalized loadings L. These matrices satisfy the relations (eq. (32.41)):

S = A Λ     and     L = B Λ

which follow from the generalized SVD (eq. (32.50)):

Z = A Λ B^T     with     A^T W_n A = B^T W_p B = I_r

where the elements of the transformed table Z are defined in eq. (32.54) below.
TABLE 32.10
Doctorates awarded in the US between 1920 and 1989 [11]

Year         Women in    Men in      Women in       Men in         Total
             chemistry   chemistry   other fields   other fields
1920-1929        142        1782         1674           8291        11889
1930-1939        256        3707         3507          18116        25586
1940-1949        221        5105         3871          21358        30555
1950-1954        222        4950         3370          30102        38644
1955-1959        228        4840         4388          34714        44170
1960              45        1060         1045           7850        10000
1961              66        1162         1185           9692        12105
1962              66        1163         1185           9692        12106
1963              66        1163         1186           9691        12106
1964              93        1427         1702          13865        17087
1965              94        1427         1702          13865        17088
1966              94        1427         1702          13865        17088
1967             120        1644         2295          16236        20295
1968             139        1643         2761          18291        22834
1969             146        1801         3225          20562        25734
1970             180        2043         3764          23449        29436
1971             172        2032         4403          25165        31772
1972             202        1809         5080          25910        33001
1973             179        1670         5903          25975        33727
1974             174        1618         6241          24967        33000
1975             192        1570         7001          24150        32913
1976             189        1434         7487          23813        32923
1977             180        1390         7665          22437        31672
1978             196        1349         8117          21188        30850
1979             220        1347         8701          20932        31200
1980             255        1283         9132          20312        30982
1981             235        1376         9637          20071        31319
1982             271        1407         9786          19584        31048
1983             297        1462        10188          19243        31190
1984             320        1445        10340          19148        31253
1985             362        1474        10337          19028        31201
1986             396        1507        10848          19019        31770
1987             406        1568        10964          19340        32278
1988             429        1589        11361          20077        33456
1989             497        1474        12013          20335        34319
Total           7350       65148       203766         680333       956597
TABLE 32.11
Weights, scores, distances, contributions and precisions from CFA applied to Table 32.10

i            w_i      s_1      s_2      s_3      δ_i      γ_i      π_i
1920-1929   0.012    -0.96     0.95     0.11     1.35    0.023    0.999
1930-1939   0.027    -0.98     0.85     0.05     1.29    0.045    0.999
1940-1949   0.032    -1.18     1.10    -0.11     1.62    0.084    0.995
1950-1954   0.040    -1.35     0.37    -0.03     1.40    0.079    1.000
1955-1959   0.046    -1.17     0.15    -0.03     1.18    0.064    1.000
1960        0.010    -1.11     0.11    -0.06     1.12    0.013    0.998
1961        0.013    -1.12    -0.05     0.02     1.12    0.016    1.000
1962        0.013    -1.12    -0.05     0.02     1.12    0.016    1.000
1963        0.013    -1.12    -0.05     0.02     1.12    0.016    1.000
1964        0.018    -1.05    -0.23     0.04     1.07    0.021    0.998
1965        0.018    -1.05    -0.23     0.04     1.07    0.021    0.998
1966        0.018    -1.05    -0.23     0.04     1.07    0.021    0.998
1967        0.021    -0.92    -0.21     0.05     0.94    0.019    0.998
1968        0.024    -0.81    -0.31     0.06     0.87    0.018    0.995
1969        0.027    -0.77    -0.33     0.04     0.83    0.019    0.998
1970        0.031    -0.74    -0.32     0.06     0.81    0.020    0.995
1971        0.033    -0.63    -0.36     0.02     0.73    0.018    0.999
1972        0.034    -0.45    -0.43     0.05     0.63    0.014    0.994
1973        0.035    -0.25    -0.44    -0.00     0.51    0.009    1.000
1974        0.034    -0.13    -0.39    -0.03     0.41    0.006    0.996
1975        0.034     0.08    -0.32    -0.03     0.33    0.004    0.990
1976        0.034     0.22    -0.32    -0.05     0.39    0.005    0.983
1977        0.033     0.34    -0.26    -0.08     0.44    0.006    0.971
1978        0.032     0.53    -0.18    -0.08     0.56    0.010    0.980
1979        0.033     0.67    -0.12    -0.07     0.68    0.015    0.990
1980        0.032     0.82    -0.07    -0.04     0.82    0.022    0.998
1981        0.033     0.91     0.01    -0.10     0.92    0.028    0.990
1982        0.032     0.98     0.07    -0.06     0.98    0.031    0.996
1983        0.033     1.07     0.14    -0.05     1.08    0.038    0.998
1984        0.033     1.12     0.15    -0.02     1.13    0.041    1.000
1985        0.033     1.12     0.18     0.04     1.14    0.042    0.999
1986        0.033     1.21     0.23     0.06     1.24    0.051    0.998
1987        0.034     1.19     0.24     0.06     1.22    0.050    0.998
1988        0.035     1.20     0.23     0.07     1.22    0.052    0.996
1989        0.036     1.32     0.22     0.14     1.34    0.065    0.990
TABLE 32.12
Weights, loadings, distances, contributions and precisions from CFA applied to Table 32.10

Category                  w_j      l_1      l_2      l_3      δ_j      γ_j      π_j
Women in chemistry       0.008     0.99     0.72     0.66     1.39    0.015    0.770
Men in chemistry         0.068    -1.46     1.20    -0.03     1.89    0.244    1.000
Women in other fields    0.213     1.69     0.18    -0.02     1.70    0.617    1.000
Men in other fields      0.711    -0.38    -0.18     0.00     0.42    0.124    1.000
The transformed table Z referred to above has elements

z_ij = (1/δ) (x_ij x_++ / (x_i+ x_+j) − 1)     with i = 1, ..., n and j = 1, ..., p     (32.54)

and where r is the rank of Z. The rank r is at most equal to the smaller of n − 1 and p − 1. Hence, in the present application to a 35×4 contingency table we can obtain at most three distinct latent vectors. For practical reasons we have divided the transformed data (eq. (32.7)) by the global distance of chi-square δ (eq. (32.24)). This type of normalization ensures that the global interaction in the data equals unity.
The three latent vectors account for respectively 86, 13 and 1% of the interaction. The next two columns in Tables 32.11 and 32.12 show the distances δ_i and δ_j of rows and columns from the origin of space and their contributions γ_i and γ_j to the interaction:
δ_i² = Σ_k s_ik²     and     γ_i = w_i δ_i²     with i = 1, ..., n     (32.55)

δ_j² = Σ_k l_jk²     and     γ_j = w_j δ_j²     with j = 1, ..., p
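The distances and contributions of the column categories can be checked directly from the weights and loadings listed in Table 32.12; a short numpy sketch (ours), whose small discrepancies with the tabulated values are due to rounding of the loadings:

```python
import numpy as np

# Weights and generalized loadings of the four categories (Table 32.12)
w = np.array([0.008, 0.068, 0.213, 0.711])
Lmat = np.array([[ 0.99,  0.72,  0.66],    # women in chemistry
                 [-1.46,  1.20, -0.03],    # men in chemistry
                 [ 1.69,  0.18, -0.02],    # women in other fields
                 [-0.38, -0.18,  0.00]])   # men in other fields

delta = np.sqrt((Lmat ** 2).sum(axis=1))   # distances from the origin, eq. (32.55)
gamma = w * delta ** 2                     # contributions to the interaction
print(delta.round(2))                      # ~[1.39 1.89 1.70 0.42]
print(gamma.round(3))                      # ~[0.015 0.243 0.615 0.126]
```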
The final column in Tables 32.11 and 32.12 lists the precisions π_i and π_j with which the rows and columns are represented in the plane spanned by the first two latent vectors:

π_i = (Σ_{k=1..2} s_ik²) / δ_i²     with i = 1, ..., n     (32.56)

π_j = (Σ_{k=1..2} l_jk²) / δ_j²     with j = 1, ..., p

Fig. 32.8. CFA biplot computed from the data in Table 32.10. Circles represent years and squares identify the four educational categories. The centre of the plot is represented by a small cross. The coordinates of the years and the categories are contained in Tables 32.11 and 32.12. Factor scaling coefficients were defined as α = β = 1.

Figure 32.8 shows the biplot constructed from the first two columns of the scores matrix S and from the loadings matrix L (Tables 32.11 and 32.12). This biplot corresponds with the exponents α = 1 and β = 1 in the definition of scores and loadings (eq. (32.41)). It is meant to reconstruct distances between rows and between columns. The rows and columns are represented by circles and squares, respectively. Circles are connected in the order of the consecutive time intervals. The horizontal and vertical axes of this biplot are in the direction of the first and second latent vectors, which account respectively for 86 and 13% of the interaction between rows and columns. Only 1% of the interaction is in the direction perpendicular to the plane of the plot. The origin of the frame of coordinates is indicated
by a small cross (+) near the centre of the biplot. As can be seen from the precisions π_i in Table 32.11, all years are well-represented in the plane of the biplot. The precisions π_j in Table 32.12, however, show that the category of women in chemistry is not so well-represented in the biplot, as its precision amounts only to 0.770.

It is readily evident that the most dominant (horizontal) latent variable reflects a contrast between women and men. From 1966 onwards there appears a sustained increase of the proportion of women doctorates. The increase is most prominent during the seventies, as evidenced by the relatively large distances between adjacent time intervals from 1971 to 1980. The second (vertical) latent variable is defined by a contrast between chemistry and the other fields. Contrast is to be understood here as a difference between profiles. Initially there is a progressive decrease in the share of chemistry doctorates when compared to those in other fields. Around 1973, however, the trend reverses and the proportion of chemical degrees rises slowly, but steadily.

The correspondences between rows and columns are evidenced by the positions of the circles and squares with respect to the origin of the plot. Those points that are closest to the origin are most similar to the average profile. Those that are further away show specific differences with respect to the average profile. Circles and squares that moved toward the border of the biplot, and in the same direction, possess a positive correspondence. They seem to attract each other. Those that moved in opposite directions demonstrate a negative correspondence. They seem to repel each other. For example, the category of chemistry doctorates awarded to men exhibits a positive correspondence with the early years, and at the same time a negative correspondence with the more recent years. A mechanical analogy of forces of attraction and repulsion between a circle and a square is appropriate here. One should refrain, however, from judging distances between circles and squares. As stated before, their closeness is not a measure of their correspondence; only the distances from the origin and their angular distance matter.

In the case when one of the two measurements of the contingency table is divided into ordered categories, one can construct a so-called thermometer plot. On this plot we represent the ordered measurement along the horizontal axis and the scores of the dominant latent vectors along the vertical axis. The solid line in Fig. 32.9 displays the prominent features of the first latent vector which, in the context of our illustration, is called the women/men factor. It clearly indicates a sustained progress of the share of women doctorates from 1966 onwards. The dashed line corresponds with the second latent vector, which can be labelled as the chemistry/other fields factor. This line shows initially a decline of the share of chemistry and a slow but steady recovery from 1973 onwards. The successive decline and rise are responsible for the horseshoe-like appearance of the pattern of points representing
Fig. 32.9. Thermometer plot representing the scores of the first and second component of a CFA applied to Table 32.10. The solid line denotes the first component which accounts for the women/men contrast in the data. The broken line corresponds with the second component which reveals a contrast between chemistry and other fields.
the rows in the biplot of Fig. 32.8. This phenomenon is also called the Guttman effect [2]. The effect is due to the fact that a linearly varying series of scores or loadings on factor 1 (e.g. 1, 2, 3) can be uncorrelated or orthogonal to a parabolic series on factor 2 (e.g. -30, 0, 10).

For the purpose of comparison, we also discuss briefly the biplot constructed from the CFA using the exponents α = 0.5 and β = 0.5 (Fig. 32.10). Such a display is meant to reconstruct the values in the transformed contingency table Z by projections of points representing rows upon axes representing columns (or vice versa), cf. eq. (32.46), where:

S = A Λ^(1/2)     and     L = B Λ^(1/2)     (cf. eq. (32.41))

This type of biplot does not reproduce distances or angles accurately, especially when λ_1 and λ_2 are very distinct. It can be shown that the distortion of distances that occurs in the vertical direction is proportional to the square root of λ_1/λ_2. The advantage of this type of biplot is that it allows us to construct bipolar axes for any
Fig. 32.10. CFA biplot computed from the data in Table 32.10. Factor scaling coefficients were defined as α = β = 0.5. This definition allows us to draw bipolar axes through the four educational categories, showing the contrast between women and men (horizontally) and between chemistry and other fields (vertically).
pair of categories. For example, the more horizontal axes represent the contrasts between women and men. The more vertical axes display contrasts between chemistry and the other fields. By perpendicular projection of the circles upon any bipolar axis one obtains a relative ordering of the time intervals according to a particular contrast. In the case of the women/men contrast in other fields (the slightly positively sloping axis) we first note a somewhat stable situation until 1966, when the contrast evolves rapidly in favour of women doctorates. Similarly, the chemistry/other fields contrast in women (the more vertical axis at the right) shows a strong regression of chemistry until about 1975. In later years the situation seems to have stabilized.

Both types of symmetric displays exhibited in Figs. 32.9 and 32.10 have their merits. They are called symmetric because they produce equal variances in the scores and in the loadings. In the case when α = β = 1, we obtain that the variances along the horizontal and vertical axes are equal to the eigenvalues λ² associated with the dominant latent vectors. In the other case, when α = β = 0.5, the variances are found to be equal to the singular values λ.
32.7 Log-linear model

32.7.1 Historical introduction

The log-linear model (LLM) is closely related to correspondence factor analysis (CFA). Both methods pursue the same objective, i.e. the analysis of the association (or correspondence) between the rows and columns of a contingency table. In CFA this can be obtained by means of double-closure of the data; in LLM this is achieved by means of double-centring of the logarithmic data. According to Andersen [12] early applications of LLM are attributed to the Danish sociologist Rasch in 1963 and to Andersen himself. Later on, the approach has been described under many different names, such as spectral map analysis [13,14] in studies of drug specificity, as logarithmic analysis in the French statistical literature [15] and as the saturated RC association model [16]. The term log-bilinear model has been used by Escoufier and Junca [17]. In Chapter 31 on the analysis of measurement tables we have described the method under the name of log double-centred principal components analysis. Here we develop the method specifically from the point of view of contingency tables and within the context of weighted metrics. We will show that LLM differs from CFA only in the type of preprocessing that is applied to the contingency table. The results of both approaches are often similar when there are no extreme contrasts in the data.

32.7.2 Algorithm

We assume that X is a contingency table in which all elements are positive. We are also given a vector of weights (or masses) w_n for the rows and another set of weights w_p for the columns of the table. For convenience, we assume that all weight coefficients are normalized to unit sum. These weights can be defined proportionally to the marginal sums of the table, although in LLM this is not a strict requirement. One may assign the weights according to the objective of the analysis, i.e. by giving a more prominent weight to the more relevant rows or columns than to the others, or by assigning identical weights to rows, to columns or to both rows and columns. The weight vectors define the weighted metrics of the row- and column-spaces:

W_n = diag(w_n)     and     W_p = diag(w_p)
Preprocessing of the contingency table involves the taking of (natural) logarithms:

y_ij = log(x_ij)     with i = 1, ..., n and j = 1, ..., p

followed by double-centring:

z_ij = y_ij − ȳ_i. − ȳ_.j + ȳ_..

where ȳ_i., ȳ_.j and ȳ_.. represent the row-, column- and global means in the weighted metrics W_n and W_p:

ȳ_i. = Σ_j w_j y_ij     or     ȳ_n = Y W_p 1_p = Y w_p

ȳ_.j = Σ_i w_i y_ij     or     ȳ_p = Y^T W_n 1_n = Y^T w_n     (32.57)

ȳ_.. = Σ_i Σ_j w_i w_j y_ij     or     ȳ_.. = 1_n^T W_n Y W_p 1_p = w_n^T Y w_p
where the sum vectors 1_n and 1_p have been defined before in eqs. (32.15) and (32.16). From this point on, the analysis is identical to that of CFA. Briefly, this involves a generalized singular vector decomposition (SVD) of Z using the metrics W_n and W_p, such that:

Z = A Λ B^T     (cf. eq. (32.50))

where Λ is the r×r diagonal matrix of singular values, A is the n×r matrix of generalized latent vectors for the rows, B is the p×r matrix of generalized latent vectors for the columns, and where r is the rank of Z. The generalized latent vectors A and B are orthogonal in the weighted metrics W_n and W_p, which implies that:

A^T W_n A = B^T W_p B = I_r

For the same reason as for double-closure, double-centring always reduces the rank of the data matrix by one, as a result of the introduction of a linear dependence among the rows and columns of the data table. Biplots can be constructed in the same way as in the case of CFA, by defining scores S and loadings L (cf. eq. (32.41)):

S = A Λ^α     and     L = B Λ^β

The choice of α = β = 1 will reproduce distances between rows and between columns of Z. The choice α = β = 0.5 allows one to reconstruct the values and contrasts of Z by perpendicular projections of points, representing rows, upon axes representing columns (or vice versa). With these two choices for α and β, the analysis is symmetrical with respect to rows and columns.
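A minimal sketch of this preprocessing and decomposition (the function name and layout are ours; here the weights are simply taken proportional to the marginal sums, which, as noted above, is not a strict requirement in LLM):

```python
import numpy as np

def llm_biplot_coordinates(X, alpha=1.0, beta=1.0):
    """Log-linear model: weighted double-centring of log(X) followed by a
    generalized SVD in the metrics W_n, W_p (cf. eqs. (32.50) and (32.57))."""
    wn = X.sum(axis=1) / X.sum()             # row weights
    wp = X.sum(axis=0) / X.sum()             # column weights
    Y = np.log(X)                            # all elements of X must be positive
    yi = Y @ wp                              # weighted row means
    yj = wn @ Y                              # weighted column means
    ym = wn @ Y @ wp                         # weighted global mean
    Z = Y - yi[:, None] - yj[None, :] + ym   # double-centred table

    U, sv, Vt = np.linalg.svd(np.diag(np.sqrt(wn)) @ Z @ np.diag(np.sqrt(wp)))
    r = int(np.sum(sv > 1e-12))              # double-centring reduces the rank by one
    A = np.diag(1 / np.sqrt(wn)) @ U[:, :r]
    B = np.diag(1 / np.sqrt(wp)) @ Vt[:r, :].T
    S = A @ np.diag(sv[:r] ** alpha)         # generalized scores
    L = B @ np.diag(sv[:r] ** beta)          # generalized loadings
    return S, L, sv[:r]
```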
The conventions of the LLM biplot are similar to those defined for the CFA biplot. Circles represent the row-categories and squares denote the column-categories of the contingency table. When the areas of circles and squares are made proportional to the marginal sums of the table, then the areas confer a notion of size or importance of the row- and column-categories that are represented. The positions of the circles and squares from the origin (which appears as a small cross near the centre) express the degree of difference or contrast of the corresponding rows and columns with respect to their average profiles. Rows or columns that are represented close to the origin show little contrast with their average profiles. Those that are further away will have greater contrasts in their profiles. The direction in which circles and squares move from the origin, in the same or in opposite directions, is an expression of their positive or negative correspondence. The way interaction is expressed in the LLM biplot is identical to that of the CFA biplot. We can also define bipolar axes through two selected row-categories or through two selected column-categories which form the poles of bipolar axes. Additionally, these axes can be calibrated according to the ratios of the two categories that define the axes. Perpendicular projection of the centres of the circles upon any bipolar axis produces an (approximate) reading of the corresponding ratios, as has been explained in Chapter 31. The approximation depends on the precision with which the points are represented in the plane of the plot and is limited by the visual interpolation on logarithmic scales.

LLM can also be defined as an expansion of a contingency table X using the generalized latent vectors in A and B and the singular values in Λ, by analogy with eq. (32.53):

x_ij = (x̄_i. x̄_.j / x̄_..) exp(Σ_k λ_k a_ik b_jk)     with i = 1, ..., n and j = 1, ..., p     (32.58)

where x̄_i., x̄_.j and x̄_.. are the geometrical row-, column- and global means of X. If the contrasts in the data are small, then the exponents in the above expression will be small also. In that case we can further expand the exponential function, which produces:

x_ij ≈ (x̄_i. x̄_.j / x̄_..) (1 + Σ_k λ_k a_ik b_jk)

and which is similar to the expansion in eq. (32.53) obtained with CFA.
32.7.3 Application

The LLM approach described above has been applied to the 35×4 contingency Table 32.10 after some modification. In this case, we have replaced the pooled data in the first five rows by the corresponding average annual values. Further, columns 1 and 2 have been combined with columns 3 and 4 in order to produce the categories of women in all fields and of men in all fields. The reason for the change is our objective to produce an analysis of the ratios between chemistry and all fields rather than between chemistry and the other fields. In this analysis, weight coefficients for rows and for columns have been defined as constants. They could have been made proportional to the marginal sums of Table 32.10, but this would weight down the influence of the earlier years, which we wished to avoid in this application.

As with CFA, this analysis yields three latent vectors which contribute respectively 89, 10 and 1% to the interaction in the data. The numerical results of this analysis are very similar to those in Table 32.11 and, therefore, are not reproduced here. The only notable discrepancies are in the precision of the representation of the early years up to 1972, which is less than in the previous application, and in the precision of the representation of the category of women chemists, which is better than in the previous analysis by CFA (0.960 vs 0.770). The overall interpretation of the LLM biplot of Fig. 32.11 is the same as obtained with the CFA biplot of Fig. 32.10. The first (horizontal) latent variable seems to be associated primarily with the women/men contrast, while the second (vertical) latent variable is mostly associated with the chemistry/all fields contrast. Thermometer plots, which represent the scores of the various time intervals as a function of time, are similar to those in Fig. 32.9. They are not reproduced here as they point to the same remarkable events, i.e. the sustained rise of the proportion of women since 1966 and the recovery of the share of chemistry in 1973.

In Fig. 32.11 the women/men ratio in 1980 is estimated visually to be 0.190 in chemistry and 0.430 in all fields. The exact ratios as computed from Table 32.10 are 0.198 and 0.435, respectively. The chemistry/all fields ratio in 1970 is visually estimated at 0.090 for men and at 0.041 for women. The exact values from Table 32.10 are 0.080 and 0.046, respectively.

The question may be asked why there should be different methods when they provide more or less the same kind of information. One answer is that not all contingency tables will yield results that agree as well as in the case we have studied here. When there are very large contrasts, the results of CFA and LLM tend to disagree. This occurs when there are several zeroes in the table, which in LLM have to be replaced by small positive values. In that case it may pay off to analyse the same contingency table with different methods, each of which may reveal different aspects of the multivariate patterns [18]. Furthermore, LLM can be
Fig. 32.11. Log-linear model (LLM) biplot computed from the data in Table 32.10. Conventions are the same as in Fig. 32.10. The areas of circles (representing years) and of squares (representing categories) are made proportional to the row- and column-totals in Table 32.10.
applied to tables of heterogeneous data, i.e. data that have been recorded with different units, while CFA can only be applied to tables of homogeneous data, unless the heterogeneous scales of measurement are subdivided into discrete categories.

References

1. B.S. Everitt, The Analysis of Contingency Tables. Chapman and Hall, London, 1977.
2. M.J. Greenacre, Theory and Applications of Correspondence Analysis. Academic Press, London, 1984.
3. A. Agresti, Categorical Data Analysis. Wiley, New York, 1990.
4. M.G. Kendall and A. Stuart, The Advanced Theory of Statistics. Vol. II. Ch. Griffin, London, 1960.
5. M.O. Hill, Reciprocal averaging: an eigenvector method of ordination. J. Ecol., 61 (1973) 237-249.
6. J.-P. Benzecri, L'Analyse des Données. Vol. II, L'Analyse des Correspondances. Dunod, Paris, 1973.
7. J.-P. Benzecri, Histoire et Préhistoire de l'Analyse des Données. Dunod, Paris, 1982.
8. M.O. Hill, Correspondence analysis: a neglected multivariate method. Appl. Statist., 23 (1974) 340-355.
9. K.R. Gabriel, The biplot graphic display of matrices with application to principal components analysis. Biometrika, 58 (1971) 453-467.
10. O.M. Kvalheim, Interpretation of direct latent-variable projection methods and their aims and use in the analysis of multicomponent spectroscopic and chromatographic data. Chemom. Intell. Lab. Syst., 4 (1988) 11-25.
11. K.G. Everett and W.S. Deloach, Chemistry doctorates awarded to women in the United States. A historical perspective. J. Chem. Educ., 68 (1991) 545-547.
12. E.B. Andersen, Discussion of paper by L.A. Goodman. Int. Stat. Review, 54 (1986) 271-272.
13. P.J. Lewi, Spectral mapping, a technique for classifying biological activity profiles of chemical compounds. Arzneim. Forsch. (Drug Res.), 26 (1976) 1295-1300.
14. P.J. Lewi, Spectral map analysis. Analysis of contrasts, especially from log-ratios. Chemom. Intell. Lab. Syst., 5 (1989) 105-116.
15. J.B. Kasmierczak, Analyse logarithmique. Deux exemples d'application. Rev. Statist. Appliquée, 33 (1985) 13-24.
16. L.A. Goodman, Some useful extensions of the usual correspondence analysis approach and the usual log-linear models approach in the analysis of contingency tables. Int. Stat. Review, 54 (1986) 243-309.
17. Y. Escoufier and S. Junca, Least squares approximation of frequencies or their logarithms. Int. Stat. Rev., 54 (1986) 279-283.
18. A. Thielemans, P.J. Lewi and D.L. Massart, Similarities and differences among multivariate display techniques, illustrated by Belgian cancer mortality distribution data. Chemom. Intell. Lab. Syst., 3 (1988) 277-300.
Chapter 33
Supervised Pattern Recognition

33.1 Supervised and unsupervised pattern recognition

In Section 17.9 a method, called linear discriminant analysis, was introduced and applied to derive a rule which would discriminate between wines from three origins. To start with, about 100 wine samples of known origin were analyzed: 8 variables (or features) were determined for each. One wanted to know if, in some way, these results could be used to derive a procedure to determine the origin of new samples. In pattern recognition terminology, this question was rephrased as: use the learning or training objects (i.e. those with known origin) to derive a classification rule which allows one to classify new objects of unknown origin in one of the three known classes, based on the values of the features of the new object. This is called supervised pattern recognition or supervised learning. Mathematically, this means that one needs to assign portions of an 8-dimensional space to the three classes. A new sample is then assigned to the class which occupies the portion of space in which the sample is located.

Supervised pattern recognition is distinct from unsupervised pattern recognition. In the latter one applies essentially clustering methods (Chapter 30) to classify objects into classes that are not known beforehand. In supervised pattern recognition, one knows the classes and has to decide in which of those an object should be classified. Supervised pattern recognition techniques essentially consist of the following steps.
1. Selection of a training or learning set. This consists of objects of known classification for which a certain number of variables are measured.
2. Feature selection, i.e. the selection of variables that are meaningful for the classification and elimination of those that have no discriminating (or, for certain techniques, no modelling) power. This step is discussed further in Section 33.3.
3. Derivation of a classification rule, using the training set. This is the subject of Section 33.2.
4. Validation of the classification rule, using an independent test set. This is described in more detail in Section 33.4.
Many books have been published about pattern recognition; one of these is directed towards chemometrics [1].
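The four steps can be made concrete with a small sketch. The code below uses scikit-learn purely for illustration, with randomly generated placeholder data standing in for a real training set; it is not a method prescribed by the text:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))        # placeholder: 100 objects, 8 features
y = rng.integers(0, 3, size=100)     # placeholder: known labels for 3 classes

# Step 1: select a training set and keep an independent test set aside
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 2 (feature selection) is omitted here; see Section 33.3.

# Step 3: derive the classification rule from the training set
rule = LinearDiscriminantAnalysis().fit(X_train, y_train)

# Step 4: validate the rule on the independent test set
print("fraction correctly classified:", rule.score(X_test, y_test))
```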
33.2 Derivation of classification rules

33.2.1 Types of classification rules

There are many types of pattern recognition which essentially differ in the way they define classification rules. In this section, we will describe some of the approaches, which we will then develop further in the following sections. We will not try to develop a classification of pattern recognition methods but merely indicate some characteristics of the methods that are found most often in the chemometric literature and some differences between those methods.

A first distinction which is often made is that between methods focusing on discrimination and those that are directed towards modelling classes. Most methods explicitly or implicitly try to find a boundary between classes. Some methods such as linear discriminant analysis (LDA, Sections 33.2.2 and 33.2.3) are designed to find explicit boundaries between classes while the k-nearest neighbours (k-NN, Section 33.2.4) method does this implicitly. Methods such as SIMCA (Section 33.2.7) put the emphasis more on similarity within a class than on discrimination between classes. Such methods are sometimes called disjoint class modelling methods. While the discrimination oriented methods build models based on all the classes concerned in the discrimination, the disjoint class modelling methods model each class separately.

LDA and several other supervised methods focus on finding optimal boundaries between classes: their first goal is to discriminate. In Section 17.9 we explained also with an example from clinical chemistry that canonical variates can be determined and plotted to discriminate between classes and that this is a way of performing LDA. The example concerned the thyroid gland. People whose thyroid gland functions normally are called euthyroid (EU) and patients whose thyroid gland is too active or not active enough are called, respectively, hyperthyroid (HYPER) or hypothyroid (HYPO). Clinicians want to make a distinction between the three classes. This can be done using five chemical determinations, among which e.g. serum thyroxine or thyroid-stimulating hormone. This is clearly a multivariate (five-dimensional) situation and LDA is applied. Allocation of patients to one of the classes can be achieved in a more formal way by determining the centroid of each group and drawing a boundary half-way between the two centroids [2]. This is shown in Fig. 33.1.

Figure 33.1 is typical of many situations in clinical chemistry: it shows a tight normal group (the EU group) and spreading out from it much more disperse
Fig. 33.1. Canonical variate plot for three classes with different thyroid status. The boundaries are obtained by linear discriminant analysis [2].
abnormal groups (the HYPO and HYPER groups). In fact, this kind of picture is also found in many non-clinical situations too (see, for instance, the air pollution situation of Fig. 17.10). If one now determines boundaries as described in the preceding paragraph, namely by situating linear boundaries half-way between the centroids of adjacent classes, some patients of the more disperse abnormal classes are classified as members of the more condensed class. This classification problem can then be solved better by developing more suitable boundaries. For instance, using so-called quadratic discriminant analysis (QDA) (Section 33.2.3) or density methods (Section 33.2.5) leads to the boundaries of Fig. 33.2 and Fig. 33.3, respectively [3,4]. Other procedures that develop irregular boundaries are the nearest neighbour methods (Section 33.2.4) and neural nets (Section 33.2.9). One of the problems with discrimination-oriented methods is that we need to classify each object in one of the given classes. It is, however, quite possible that an
Fig. 33.2. As Fig. 33.1 but with boundaries obtained by quadratic discriminant analysis [3].
object should not be classified in any of these classes. Returning to the wine example, where a classification between wines from three given origins was wanted, we must realize that the sample submitted for classification may, in reality, belong to none of the given origins, but to a fourth one. However, using the discrimination-oriented methods, we will classify it necessarily in one of the three given regions. A different approach to supervised pattern recognition can then be useful. This consists of making a separate model of each class. Objects which fit the model for a class are considered part of it and objects which do not fit are classified as non-members. In discrimination terms, we could say that the class model discriminates between membership and non-membership of a certain class. In statistical terms, we can state that these methods perform outlier tests. The conceptually simplest model, which for reasons explained later is called UNEQ, is based on the multivariate normal distribution. Suppose we have carried
Fig. 33.3. As Fig. 33.1 but with boundaries obtained by a density method [4].
out two tests, x_1 and x_2, with which we want to describe the health of a patient. Only healthy patients are investigated. In Fig. 33.4, we could say that the ellipse describing the 95% confidence limit for a bivariate normal distribution can be considered as a model of the class of healthy patients. Those within its limits are considered healthy and those outside would be considered non-members of the healthy class. The bivariate normal distribution is therefore a model of the healthy class. In three dimensions (Fig. 33.5), the model takes the shape of an ellipsoid and in m dimensions, we must imagine an m-dimensional hyperellipsoid. In the figure, two classes are considered and we now observe that four situations can be encountered when classifying an object, namely:
(a) the object is part of class K,
(b) the object is part of class L,
(c) the object is not a member of class K or L: it is an outlier, and
Fig. 33.4. Ninety-five percent confidence limit for a bivariate distribution as class envelope.
Fig. 33.5. Class envelopes in three dimensions as derived from the three-variate normal distribution.
(d) if K and L overlap the object can be a member of both classes K and L: it is situated in a region of doubt. UNEQ is applied only when the number of variables is relatively low. For more variables, one does not work with the original variables, but rather with latent variables. A latent variable model is built for each class separately. The best known such method is SIMCA. We also make a distinction between parametric and non-parametric techniques. In the parametric techniques such as linear discriminant analysis, UNEQ and SIMCA, statistical parameters of the distribution of the objects are used in the derivation of the decision function (almost always a multivariate normal distribution
is assumed). The most important disadvantage of parametric methods is that to apply the methods correctly statistical requirements must be fulfilled. The non-parametric methods such as nearest neighbours (Section 33.2.4), density methods (Section 33.2.5) and neural networks (Section 33.2.9 and Chapter 44) are not explicitly based on distribution statistics. The most important advantage of the parametric methods is that probabilities of correct classification can be more easily estimated than with most non-parametric methods.

33.2.2 Canonical variates and linear discriminant analysis

LDA is the best studied method of pattern recognition. It was originally proposed by Fisher [2] and is applied very often in chemometrics. Applications can be found for instance in the classification of Eucalyptus oils based on gas-chromatographic data [6], the automatic recognition of substance classes from GC/MS [7], the recognition of tablets and capsules with different dosages with the use of NIR spectra [8] and in the already cited clinical chemical example (see Section 33.1). It appears that there are several ways of deriving essentially the same methodology. This may be confusing and, following a short article by Fearn [9], we will try to explain the different approaches. A detailed overview is found in the book by McLachlan [10].

Let us first consider two classes K and L in a bivariate space (x_1, x_2). Figure 33.6a shows the objects in this space. In Fig. 33.6b bivariate probability ellipses are drawn representing the normal (bivariate) probability distributions to which the objects belong. Since there are two classes, there are two such ellipses. Basically, an object will be classified in the class for which it has the highest probability. In Fig. 33.6b, object A is classified in class K because it has a (much) higher probability in K than in L.

In Fig. 33.6c, an additional ellipse is drawn for each class. These ellipses both represent the same probability level in their respective classes; they touch in point O half-way between the two class centres. Line a is the tangent to the two ellipses in point O. Any point to the left of it has a higher probability to belong to K and to the right it is more probable that it belongs to L. Line a can be used as a boundary, separating K from L. In practice, we would prefer an algebraic way to define the boundary. For this purpose, we define line d, perpendicular to a. One can project any object or point on that line. In Fig. 33.6c this is done for point A. The location of A on d is given by its score on d. This score is given by:

D = w_0 + w_1 x_1 + w_2 x_2     (33.1)
When working with standardized data w_0 = 0. The coefficients w_1 and w_2 are derived in a way described later, such that D = 0 in point O, D > 0 for objects belonging to L and D < 0 for objects of K. This then is the classification rule.
Fig. 33.6. (a) Two classes K and L to be discriminated; (b) confidence limits around the centroids of K and L; (c) the iso-probability confidence limits touch in O; a is a line tangential to both ellipses; d is the optimal discriminating direction; A is an object.
216
• • •
• •
• •
• •
• •
• •
• •
• •
• •
•
X X • • X
X
X X X
X X
X X X X
^^
Fig. 33.7. A univariate classification problem.
until the optimal discriminating direction is found (d in Fig. 33.6c). This rotation is determined by the values of Wj and W2 in eq. (33.1). These weights depend on several characteristics of the data. To understand which ones, let us first consider the univariate case (Fig. 33.7). Two classes, K and L, have to be distinguished using a single variable, jCj. It is clear that the discrimination will be better when the distance between Xj^^ and x^j (i.e. the mean values, or centroids, of jc, for classes K and L) is large and the width of the distributions is small or, in other words, when the ratio of the squared difference between means to the variance of the distributions is large. Analytical chemists would be tempted to say that the resolution should be as large as possible. When we consider the multivariate situation, it is again evident that the discriminating power of the combined variables will be good when the centroids of the two sets of objects are sufficiently distant from each other and when the clusters are tight or dense. In mathematical terms this means that the between-class variance is large compared with the within-class variances. In the method of linear discriminant analysis, one therefore seeks a linear function of the variables, D, which maximizes the ratio between both variances. Geometrically, this means that we look for a line through the cloud of points, such that the projections of the points of the two groups are separated as much as possible. The approach is comparable to principal components, where one seeks a line that explains best the variation in the data (see Chapter 17). The principal component line and the discriminant function often more or less coincide (as is the case in Fig. 33.8a) but this is not necessarily so, as shown in Fig. 33.8b. Generalizing eq. (33.1) to m variables, we can write: D = w'^x + Wo
(33.2)
where it can be shown [10] that the weights w are determined for a two-class discrimination as w^ =(x, - X 2 ) ' ' S-' and
(33.3)
217
Xi4 o
o
a) ^ PC.DF
Xi4
b)
Fig. 33.8. Situation where principal component (PC) and linear discriminant function (DF) are essentially the same (a) and very different (b).
Wo = - - ( ^ 1 - ^ 2 ) ' ^ S U x i + X 2 )
(33.4)
In eq. (33.3) and (33.4) Xj and X2 ^re the sample mean vectors, that describe the location of the centroids in m-dimensional space and S is the pooled sample variance-covariance matrix of the training sets of the two classes. The use of a pooled variance-covariance matrix implies that the variancecovariance matrices for both populations are assumed to be the same. The consequences of this are discussed in Section 33.2.3. Example: A simple two-dimensional example concerns the data from Table 33.1 and Fig. 33.9. The pooled variance-covariance matrix is obtained as [K^K + L^L]/(ni + ^2 - 2), i.e. by first computing for each class the centred sum of squares (for the diagonal elements) and the cross-products between variables (for the other
218 1 —
lO
1
1
\
1
•y—I
o
-
14
-
n ^
o
12
o
^
^
o
^
10
^
o ^
o
8
o
-
)K
^ o
6
o
o
)K
~
)i(
-
X
-
4
)K
-
2
1
{
1
1
10
12
14
x1 Fig. 33.9. LDA applied to the data of Table 33.1; n is a new object to be classified.
elements), then summing the two matrices and dividing each element by n_1 + n_2 − k (here 10 + 10 − 2 = 18). As an example we compute the cross-term s_12 (which is equal to s_21). This calculation is performed in Table 33.2. In the same way we can compute the diagonal elements, yielding

S = ( 2.78    3.78
      3.78   10.56 )

and

S⁻¹ = (  0.70   −0.25
        −0.25    0.18 )

Since

(x̄_1 − x̄_2) = ( 6 − 11 , 9 − 7 )^T = ( −5 , 2 )^T,
TABLE 33.1
Example data set for linear discriminant analysis

Class 1                       Class 2
Object    x1    x2            Object    x1    x2
1          8    15            11        11    11
2          7    12            12         9     5
3          8    11            13        11     8
4          5    11            14        12     6
5          7     9            15        13    10
6          4     8            16        14    12
7          6     8            17        10     7
8          4     5            18        12     4
9          5     6            19        10     5
10         6     5            20         8     2
mean       6     9            mean      11     7
TABLE 33.2
Computation of the cross-product term in the pooled variance-covariance matrix for the data of Table 33.1

Class 1                        Class 2
(8-6)(15-9) = 12               (11-11)(11-7) = 0
(7-6)(12-9) = 3                (9-11)(5-7) = 4
(8-6)(11-9) = 4                (11-11)(8-7) = 0
(5-6)(11-9) = -2               (12-11)(6-7) = -1
(7-6)(9-9) = 0                 (13-11)(10-7) = 6
(4-6)(8-9) = 2                 (14-11)(12-7) = 15
(6-6)(8-9) = 0                 (10-11)(7-7) = 0
(4-6)(5-9) = 8                 (12-11)(4-7) = -3
(5-6)(6-9) = 3                 (10-11)(5-7) = 2
(6-6)(5-9) = 0                 (8-11)(2-7) = 15
Sum class 1 = 30               Sum class 2 = 38

Degrees of freedom: 10 + 10 - 2 = 18. Cross-product term: (30 + 38)/18 = 3.78
we obtain:

w_1 = −4.00, w_2 = 1.62, w_0 = 21.08

and

D = 21.08 − 4.00 x_1 + 1.62 x_2

We can now classify a new object n. Consider an object with x_1 = 9 and x_2 = 13. For this object

D = 21.08 − 4.00 × 9 + 1.62 × 13 = 6.14

Since D > 0, it is classified as belonging to class 1.

For two classes, Fisher arrived at similar results for the equations given above by considering LDA as a regression problem. A response factor y, indicating class membership, is introduced: y = −1 for all objects belonging to class K and y = +1 for all objects belonging to class L. We then obtain the regression equation for y = y(x_1, x_2). It is shown that, when there is an equal number of objects in K and L, the same w values are obtained. If the number is not the same, then w_1 and w_2 are still the same but w_0 changes.

So far, we have described only situations with two classes. The method can also be applied to K classes. It is then sometimes called descriptive linear discriminant analysis. In this case the weight vectors can be shown to be the eigenvectors of the matrix:

A = W⁻¹ B     (33.5)

where W is the within-group sum of squares and cross-products matrix and B is the between-groups sum of squares and cross-products matrix. As described in [11] this leads to non-symmetrical eigenvalue problems.

33.2.3 Quadratic discriminant analysis and related methods

There is still another approach to explain LDA, namely by considering the Mahalanobis distance (see Chapter 30) to a class. All these approaches lead to the same result. The Mahalanobis distance is the distance to the centre of a class taking correlation into account and is the same for all points on the same probability ellipse. For equally probable classes, i.e. classes with the same number of training objects, a smaller Mahalanobis distance to class K than to class L means that the probability that the object belongs to class K is larger than that it belongs to L.
The Mahalanobis distance representation will help us to have a more general look at discriminant analysis. The multivariate normal distribution for m variables and class K can be described by

f(x) = (2π)^(-m/2) |Γ_K|^(-1/2) exp(−(1/2) D²_MK)     (33.6)

with D²_MK the Mahalanobis distance to class K:

D²_MK = (x − μ_K)^T Γ_K⁻¹ (x − μ_K)     (33.7)

where μ_K and Γ_K are the population mean vector and variance-covariance matrix of K, respectively. They can be estimated by the sample parameters x̄_K and C_K. From these equations, one can derive the following classification rule: classify object u of unknown class in the class K for which D²_MK,u is minimal, given

D²_MK,u = (x_u − x̄_K)^T C_K⁻¹ (x_u − x̄_K)     (33.8)
where x_u is the vector of x values describing object u. This equation is applied when the a priori probability of the classes is the same. When this is not so, an additional term has to be added. When all C_K are considered equal, this means that they can be replaced by S, the pooled variance-covariance matrix, which is the case for linear discriminant analysis. The discrimination boundaries then are linear and D²_MK,u is given by

D²_MK,u = (x_u − x̄_K)^T S⁻¹ (x_u − x̄_K)     (33.9)
Friedman [12] introduced a Bayesian approach; the Bayes equation is given in Chapter 16. In the present context, a Bayesian approach can be described as finding a classification rule that minimizes the risk of misclassification, given the prior probabilities of belonging to a given class. These prior probabilities are estimated from the fraction of each class in the pooled sample:
π_K = n_K / N

where π_K is the prior probability that an object belongs to class K, n_K is the number of objects in the training set for class K and N is the total number of objects in the training set. One computes D²_MK,u as

D²_MK,u = (x_u − x̄_K)^T C_K⁻¹ (x_u − x̄_K) + ln|C_K| − 2 ln π_K     (33.10)

and classifies u in the class for which this value is smallest.
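A compact sketch of this rule, written as a small function; the function name, the dictionary layout and the toy structure are ours:

```python
import numpy as np

def qda_assign(xu, classes):
    """Classify object xu into the class with the smallest value of
    D2 = (x - xk)' Ck^-1 (x - xk) + ln|Ck| - 2 ln(pi_k)   (eq. 33.10)."""
    n_total = sum(len(Xk) for Xk in classes.values())
    best, best_score = None, np.inf
    for name, Xk in classes.items():
        xk = Xk.mean(axis=0)
        Ck = np.cov(Xk, rowvar=False)          # class variance-covariance matrix
        pik = len(Xk) / n_total                # prior probability of class k
        d = xu - xk
        score = d @ np.linalg.inv(Ck) @ d + np.log(np.linalg.det(Ck)) - 2 * np.log(pik)
        if score < best_score:
            best, best_score = name, score
    return best
```

When all classes contain the same number of training objects, the term −2 ln π_K is identical for every class and does not influence the assignment.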
Fig. 33.10. Situations with unequal variance-covariance: (a) unequal variance, (b) unequal covariance.
Equation (33.10) is applied in what is called quadratic discriminant analysis (QDA). The equations can be shown to describe a quadratic boundary separating the regions where D²_MK,u is minimal for the classes considered. As stated earlier, LDA requires that the variance-covariance matrices of the classes being considered can be pooled. This is only so when these matrices can be considered to be equal, in the same way that variances can only be pooled when they are considered equal (see Section 2.1.4.4). Equal variance-covariance means that the 95% confidence ellipsoids have an equal volume (variance) and orientation in space (covariance). Figure 33.10 illustrates situations of unequal variance or covariance. Clearly, Fig. 33.1 displays unequal variance-covariance, so that one must expect that QDA gives better classification, as is indeed the case (Fig. 33.2). When the number of objects is smaller than the number of variables m, the variance-covariance matrix is singular. Clearly, this problem is more severe for QDA (which requires m < n_K) than for LDA, where the variance-covariance matrix is pooled and therefore the number of objects N is the sum of all objects
over all classes. It follows that both QDA and LDA have advantages: QDA is less subject to constraints in the distribution of objects in space and LDA requires fewer objects than QDA. Friedman [12] has also shown that regularised discriminant analysis (RDA), a form of discriminant analysis intermediate between QDA and LDA, has advantages compared to both: it is less subject to constraints without requiring more objects. The method has been used in chemometrics, e.g. for the classification of seagrass [13] or pharmaceutical preparations [14].

33.2.4 The k-nearest neighbour method

A mathematically very simple classification procedure is the nearest neighbour method. In this method one computes the distance between an unknown object u and each of the objects of the training set. Usually one employs the Euclidean distance D (see Section 30.2.2.1) but for strongly correlated variables, one should prefer correlation based measures (Section 30.2.2.2). If the training set consists of n objects, then n distances are calculated and the lowest of these is selected. If this is D_ul, where u represents the unknown and l an object from learning class L, then one classifies u in group L. A three-dimensional example is given in Fig. 33.11. Object u is closest to an object of the class L and is therefore considered to be a member of that class. In a more sophisticated version of this technique, called the k-nearest neighbour method (k-NN method), one selects the k nearest objects to u and applies a majority rule: u is classified in the group to which the majority of the k objects belong. Figure 33.12 gives an example of a 3-NN method. One selects the three nearest neighbours (A, B and C) to the unknown u. Since A and B belong to L, one
Fig. 33.11. 1-NN classification of the unknown u.
Fig. 33.12. 3-NN classification of the unknown u.
classifies u in category L. The choice of k is determined by optimization: one determines the prediction ability with different values of k. Usually it is found that small values of k (3 or 5) are to be preferred. The method has several advantages, the first being its mathematical simplicity, which does not prevent it from yielding classification results as good as and often better than the much more complex methods discussed in other sections of this chapter. Moreover, it is free from statistical assumptions, such as normality of the distribution of the variables. This does not mean that the method is not subject to any problem. One such problem is that the method is sensitive to gross inequalities in the number of objects in each class. Figure 33.13 gives an example. The unknown is classified into the class with largest membership, because in the zone of overlap between classes more of its members are present. In fact, the unknown is closer to the centre of the other class, so that its classification is at least doubtful. This can be overcome by not using a simple majority criterion but an alternative one, such as "classify the object in the larger class K if for k = 10 at least 9 neighbours (out of 10) belong to K, otherwise classify the test object in the smaller class L". The selection of k and the alternative criterion value should be determined by optimization [15].
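A minimal k-NN classifier along these lines, using Euclidean distances and a simple majority vote (all names are ours):

```python
import numpy as np
from collections import Counter

def knn_classify(xu, X_train, y_train, k=3):
    """Assign xu to the class holding the majority among its k nearest
    training objects (Euclidean distance)."""
    d = np.linalg.norm(X_train - xu, axis=1)     # distances to all training objects
    nearest = np.argsort(d)[:k]                  # indices of the k nearest neighbours
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```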
Fig. 33.13. A situation which necessitates classification of the unknown u by alternative k-NN criteria.
The nearest neighbour method is often applied, with, in view of its simplicity, surprisingly good results. An example where k-NN performs well in a comparison with neural networks and SIMCA (see further) can be found in [16].
33.2.5 Density methods

In density or kernel methods one imagines a potential field around the objects of the learning set. For this reason these methods have also been called potential methods. A variant for clustering was described in Section 30.3.3. One starts with the selection of a potential function. Many functions can be used for this purpose, but for practical reasons it is recommended that a simple one such as a triangular or a Gaussian function is selected. The function is characterized by its width. This is important for its smoothing behaviour (see below). Figure 33.14 shows a Gaussian function for a class K in a one-dimensional space. The cumulative potential function is determined by adding the heights of the individual potential functions in each position along the x axis. The figure shows that the cumulative function constitutes a continuous line which is never zero within a class. This is done separately for each class.
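A sketch of such a Gaussian potential function, summed over a class and normalized by the class size, together with a simple two-class decision based on it; the class data, the smoothing parameter h and all names are hypothetical:

```python
import numpy as np

def mean_potential(x, train, h=1.0):
    """Mean Gaussian potential of a class at position x (univariate case):
    the individual potentials are summed and divided by the class size."""
    return np.exp(-0.5 * ((x - train) / h) ** 2).mean()

# Two hypothetical training classes measured on a single variable
K = np.array([1.0, 1.5, 2.0, 2.2, 2.8])
L = np.array([4.5, 5.0, 5.5, 6.1, 6.4])

xu = 3.0
cls = "K" if mean_potential(xu, K) > mean_potential(xu, L) else "L"
print(cls)   # the class with the larger potential value at xu
```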
Fig. 33.14. Density estimate for a test set using normal potential functions (univariate case).
Fig. 33.15. Classification of an unknown object u. f(K) and f(L) indicate the potential functions for classes K and L.
By dividing the cumulative potential function of a class by the number of samples contributing to it, one obtains the (mean) potential function of the class. In this way, the potential function assumes a probabilistic character and, therefore, the density method permits probabilistic classification. The classification of a new object u into one of the given classes is determined by the value of the potential function for that class in u. It is classified into the class which has the largest value. A one-dimensional example is given in Fig. 33.15. Object u is considered to belong to K, because at the location of u the potential value of K is larger than that of L. The boundary between two classes is given by those positions where the potentials caused by these two classes have the same value. The boundaries can assume irregular values as shown in Fig. 33.3. One of the disadvantages of the method is that one must determine the smoothing parameter by optimisation. When the smoothing parameter is too small (Fig. 33.16a) many potential functions of a learning class do not overlap with each other, so that the continuous surface of Fig. 33.15 is not obtained. A new object u may then have a low membership value for a class (here class K) although it clearly belongs to that class. An excessive smoothing parameter leads to a too flat surface (Fig. 33.16b), so that discrimination becomes less clear. The major task of the
Fig. 33.16. Influence of the smoothing parameter on the potential surface of a class: (a) smoothing parameter too small; (b) too large.
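As an illustration, the sketch below (assuming Gaussian potential functions and an arbitrary smoothing parameter h; the one-dimensional data are invented) computes the mean potential of two classes at the position of an unknown, assigns the unknown to the class with the largest value, and shows how the potential changes when h is chosen too small or too large.

```python
import numpy as np

def class_potential(x, X_class, h):
    """Mean Gaussian potential of the class members at position x (univariate).

    h is the smoothing parameter (the width of the individual potential
    functions); dividing by the number of objects gives the mean potential.
    """
    return np.mean(np.exp(-0.5 * ((x - X_class) / h) ** 2) / (h * np.sqrt(2 * np.pi)))

# hypothetical one-dimensional training data for classes K and L
x_K = np.array([1.0, 1.3, 1.8, 2.1, 2.4])
x_L = np.array([3.6, 3.9, 4.2, 4.8, 5.1])

u, h = 2.9, 0.5                         # unknown object and smoothing parameter
fK = class_potential(u, x_K, h)
fL = class_potential(u, x_L, h)
print('f(K) = %.3f, f(L) = %.3f -> class %s' % (fK, fL, 'K' if fK > fL else 'L'))

# a too small h gives spiky, non-overlapping potentials; a too large h flattens them
for h in (0.05, 0.5, 5.0):
    print(h, class_potential(u, x_K, h), class_potential(u, x_L, h))
```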
Advantages of these methods are that no a priori assumptions about distributions are necessary and that probabilistic decisions can be taken more easily than with k-NN. In chemometrics, the method was introduced under the name ALLOC [17,18]. The methodology was described in detail in a book by Coomans and Broeckaert [19]. The method was developed further by Forina and coworkers [20,21].

33.2.6 Classification trees

In Section 18.4, we explained that inductive expert systems can be applied for classification purposes and we refer to that section for further information and example references. It should be pointed out that the method is essentially univariate. Indeed, one selects a splitting point on one of the variables, such that it achieves the "best" discrimination, the "best" being determined by, e.g., an entropy function. Several references are given in Chapter 18. A comparison with other methods can be found, for instance, in an article by Mulholland et al. [22]. Additionally, Breiman et al. [23] developed a methodology known as classification and regression trees (CART), in which the data set is split repeatedly and a binary tree is grown. The way the tree is built leads to the selection of boundaries parallel to certain variable axes. With highly correlated data, this is not necessarily the best solution, and non-linear methods or methods based on latent variables have been proposed to perform the splitting. A combination of PLS (as a feature reduction method; see Sections 33.2.8 and 33.3) and CART was described by
Yeh and Spiegelman [24]. Very good results were also obtained by using simple neural networks of the type described in Section 33.2.9 to derive a decision rule at each branching of the tree [25]. Classification trees have been used relatively rarely in chemometrics, but it seems that in general [26] their performance is comparable to that of the best pattern recognition methods.

33.2.7 UNEQ, SIMCA and related methods

As explained in Section 33.2.1, one can prefer to consider each class separately and to perform outlier tests to decide whether a new object belongs to a certain class or not. The earliest approaches introduced in chemometrics were called SIMCA (soft independent modelling of class analogy) [27] and UNEQ [28]. UNEQ can be applied when only a few variables must be considered. It is based on the Mahalanobis distance from the centroid of the class. When this distance exceeds a critical distance, the object is an outlier and therefore not part of the class. Since for each class one uses its own covariance matrix, it is somewhat related to QDA (Section 33.2.3). The situation described here is very similar to that discussed for multivariate quality control in Chapter 20. In eq. (20.10) the original variables are used. This equation can therefore also be used for UNEQ. For convenience it is repeated here:

D_i^2 = (\mathbf{x}_i - \bar{\mathbf{x}}_K)^T \mathbf{C}^{-1} (\mathbf{x}_i - \bar{\mathbf{x}}_K)   (33.11)
where D_i^2 is the squared Mahalanobis distance between object i and the centroid \bar{\mathbf{x}}_K of the class K, and C is the variance-covariance matrix of the n training objects defining class K (see also eq. (33.7)). When D_i^2 becomes too large for a certain object, this means that it is no longer considered to be part of the class. The Mahalanobis distance follows the Hotelling T^2-distribution. The critical value t^2_{crit} is defined as:

t^2_{crit} = \frac{m(n-1)(n+1)}{n(n-m)} \, F_{(\alpha; m, n-m)}   (33.12)

UNEQ requires a multivariate normal distribution and can be applied only when the ratio objects/variables is sufficiently high (e.g. 3). When the ratio is less, as also explained in Chapter 20, one can measure distances in the principal component (PC) space instead of the original space, and in the pattern recognition context this is usually either necessary or preferable.
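A minimal sketch of the UNEQ test of eqs. (33.11) and (33.12) is given below; the class data are simulated, and the tabulated F-value is taken from SciPy.

```python
import numpy as np
from scipy.stats import f

def uneq_member(x_new, X_class, alpha=0.05):
    """Test whether x_new belongs to the class modelled by X_class (UNEQ).

    Uses the squared Mahalanobis distance to the class centroid (eq. 33.11)
    and the Hotelling T2 critical value of eq. (33.12).
    """
    n, m = X_class.shape
    centre = X_class.mean(axis=0)
    C = np.cov(X_class, rowvar=False)            # variance-covariance matrix of the class
    d = x_new - centre
    D2 = d @ np.linalg.inv(C) @ d                # squared Mahalanobis distance
    t2_crit = m * (n - 1) * (n + 1) / (n * (n - m)) * f.ppf(1 - alpha, m, n - m)
    return D2, t2_crit, D2 <= t2_crit

rng = np.random.default_rng(1)
X_K = rng.multivariate_normal([0, 0, 0], np.diag([1.0, 0.5, 2.0]), size=30)
print(uneq_member(np.array([0.5, 0.2, -1.0]), X_K))   # ordinary class member
print(uneq_member(np.array([5.0, 3.0, 6.0]), X_K))    # clear outlier
```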
Fig. 33.17. SIMCA: (a) step 1 in a 1 PC model; (b) step 1 in a 2 PC model; (c) step 2 in a 1 PC model; (d) step 2 in a 2 PC model.
In SIMCA one applies latent variables instead of the original variables. It, too, can be viewed as a variant of the quadratic discriminant rule. SIMCA, the original version of which [27] was published in 1976, starts by determining the number of principal components or eigenvectors needed to describe the structure of the training class (Section 31.5). Usually cross-validation (Section 31.5.3) is preferred. Let us call the number of eigenvectors retained r*. If r* = 1, this means that (see Fig. 33.17) all data are considered to be modelled by a one-dimensional model (a line), for r* = 2 by a two-dimensional model (a plane), etc. The residuals of the training class towards such a model are assumed to follow a normal distribution with a residual standard deviation
s = \sqrt{\sum_i \sum_j e_{ij}^2 \, / \, [(m - r^*)(n - r^* - 1)]}   (33.13)

The residuals from the model can be computed from the scores on the non-retained eigenvectors, i.e. the scores t_{ij} on the eigenvectors r* + 1 to r, with r = min((n - 1), m). Then

s = \sqrt{\sum_i \sum_{j=r^*+1}^{r} t_{ij}^2 \, / \, [(r - r^*)(n - r^* - 1)]}   (33.14)
If care is not taken about the way s is obtained, SIMCA has a tendency to exclude more objects from the training class than necessary. The s-value should be determined by cross-validation. Each object in the training set is then predicted, using the r*-dimensional PCA model obtained for the other (n - 1) training set objects. The (residual) scores obtained in this way for each object are used in eq. (33.14) [30]. A confidence limit is obtained by defining a critical value of the (Euclidean) distance towards the model. This is given by

s_{crit} = \sqrt{F_{crit}} \, s   (33.15)
F_{crit} is the tabulated one-sided value for (r - r*) and (r - r*)(n - r* - 1) degrees of freedom. The s_{crit} is used to determine the boundary (the cylinder) around the PC1 line in Fig. 33.17c and the planes around the PC1, PC2 plane in Fig. 33.17d. Objects with s < s_{crit} belong to class K, otherwise they do not. To predict whether a new object x_{new} belongs to class K one verifies whether it falls within the cylinder (for a one-dimensional model), between the limiting planes (for a two-dimensional model), etc. Suppose the following r*-dimensional PC model was obtained:

X_K = T_K L_K^T + E_K   (33.16)

with X_K the centred X-matrix for class K, T_K the (un-normed) score matrix (n x r*), where T_K = U_K \Lambda_K with U_K the normed score matrix and \Lambda_K the singular value matrix, L_K the loading matrix (m x r*) and E_K the matrix of residuals (n x m). For a new object x_{new} one first determines the scores using eq. (33.17):

t_{new}^T = (x_{new} - \bar{x}_K)^T L_K   (33.17)
The Euclidean distance from the model is then obtained, similarly to eq. (33.14), as:

s_{new} = \sqrt{\sum_{j=r^*+1}^{r} t_{new,j}^2 \, / \, (r - r^*)}   (33.18)
If s_{new} < s_{crit}, then the new object belongs to class K, otherwise it does not. A discussion concerning the number of degrees of freedom can be found in [31]. This article also compares SIMCA with several other methods. A useful tool in the interpretation of SIMCA is the so-called Coomans plot [32]. It is applied to the discrimination of two classes (Fig. 33.18). The distance from the model for class 1 is plotted against that from model 2. On both axes, one indicates the critical distances. In this way, one defines four zones: class 1, class 2, overlap of class 1 and 2, and neither class 1 nor class 2. By plotting objects in this plot, their classification is immediately clear. It is also easy to visualize how certain a classification is. In Fig. 33.18, object a is very clearly within class 1, object b is on the border of that class but is not close to class 2 and object c clearly belongs to neither class. The first versions of SIMCA stop here: it is considered that a new object belongs to the class if it fits the r*-dimensional PC model.
Fig. 33.18. The Coomans plot. Object a belongs to class 1, object b is a borderline class 1 object, object c is an outlier towards the two classes.
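The first SIMCA step can be sketched as follows; the class data are simulated, the model works in the original variable space (so that the residuals correspond to eq. (33.13) with r = m), and r* and the confidence level are simply fixed by hand rather than obtained by cross-validation.

```python
import numpy as np
from scipy.stats import f

def simca_model(X_K, r_star):
    """PCA model of one class: mean, loadings L (m x r*) and residual s (eq. 33.13)."""
    n, m = X_K.shape
    mean = X_K.mean(axis=0)
    Xc = X_K - mean
    U, sv, Vt = np.linalg.svd(Xc, full_matrices=False)
    L = Vt[:r_star].T                          # retained loadings
    E = Xc - Xc @ L @ L.T                      # residuals towards the r*-dimensional model
    s = np.sqrt((E ** 2).sum() / ((m - r_star) * (n - r_star - 1)))
    return mean, L, s, n, m

def simca_test(x_new, model, alpha=0.05):
    """Distance of a new object to the class model and the critical limit."""
    mean, L, s, n, m = model
    r_star = L.shape[1]
    t_new = (x_new - mean) @ L                 # scores on the retained PCs (eq. 33.17)
    e = (x_new - mean) - t_new @ L.T           # residual part of the object
    s_new = np.sqrt((e ** 2).sum() / (m - r_star))
    df1, df2 = m - r_star, (m - r_star) * (n - r_star - 1)
    s_crit = s * np.sqrt(f.ppf(1 - alpha, df1, df2))   # eq. (33.15)
    return s_new, s_crit, s_new <= s_crit

rng = np.random.default_rng(2)
scores = rng.normal(size=(25, 2))
loadings = rng.normal(size=(2, 8))
X_K = scores @ loadings + 0.05 * rng.normal(size=(25, 8))   # a two-component training class

model = simca_model(X_K, r_star=2)
print(simca_test(X_K[0], model))               # member of the class
print(simca_test(X_K[0] + 1.0, model))         # shifted object, likely rejected
```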
However, one can also consider that objects that fit the PC model but, in that model, are far from the members of the training class are also outliers. Therefore, a second step was added to the original version of SIMCA by closing the tube or the box (Fig. 33.17). This was originally done by treating each PC in a univariate way. The limits were situated on each of the r* PCs:

t_{max} = \max(t_K) + 0.5 \, s_t  and  t_{min} = \min(t_K) - 0.5 \, s_t   (33.19)
where max(t_K) is the largest among the scores of the training objects of class K on the PC considered and s_t is the standard deviation of the scores along that PC. SIMCA has inspired several related methods, such as DASCO [33] and CLASSY [34,35]. The latter has elements of the potential methods and SIMCA, while the former starts with the extraction of principal components, as in SIMCA, but then follows a quadratic discriminant rule. SIMCA has been applied very often and with much success in chemometrics. Examples are food authentication [36], or pharmaceutical identifications such as the recognition of excipients from their near infrared spectra [37] or of blister-packed tablets using near infrared spectra [38]. Environmental applications have been published by many authors [39,40], for instance by Kvalheim et al. [41,42]. Lavine et al. [43] apply it to fuel spills from high-speed gas chromatograms, and compare SIMCA with DASCO and RDA. Chemometricians often consider SIMCA the preferred supervised pattern recognition method for all situations. However, this is not evident. When the accent is on discrimination, discrimination-oriented methods should be used. The testing procedure underlying SIMCA and other outlier tests has the disadvantage that one has to set a confidence level, α. If the data are normally distributed, α% (e.g. 5%) of objects belonging to the class will be considered as not belonging to it. When applying discriminating methods, such as LDA, to a discrimination-oriented classification, this misclassification problem can be avoided. As explained already, SIMCA can be applied as an outlier test, similarly to the multivariate QC tests referred to earlier. Fearn et al. [44] have described certain properties of SIMCA in this respect and compared it with some alternatives.

33.2.8 Partial least squares

In Section 33.2.2 we showed how LDA classification can be described as a regression problem with class variables. As a regression model, LDA is subject to the problems described in Chapter 10. For instance, the number of variables should not exceed the number of objects. One solution is to apply feature selection or reduction (see Section 33.3) and the other is to apply methods such as partial least squares (PLS, see Chapter 35) [45]. When there are two classes to be discriminated, PLS1 is applied, which means that there is one dependent class variable y, which for each object has a value 0 or 1. When there are more classes, PLS2 is applied. The class variable then becomes a vector of class variables, one for each class, with a value of 1 for the class to which the object belongs and zeros for all other classes. Suppose that there are 4 classes and that a certain object belongs to class 2; then for that object y^T = [0 1 0 0]. One might be tempted to use PLS1 with a class variable that can take the values 1, 2, 3 and 4. However, this would imply an ordered relationship between the four classes, such that the distance between class 3 and 1 is twice that between class 2 and 1.
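The class coding described above can be sketched as follows; the labels are hypothetical, and the resulting dummy matrix Y would serve as the dependent block in a PLS2 regression (Chapter 35).

```python
import numpy as np

def class_dummy_matrix(labels):
    """Code class labels as a 0/1 dummy matrix Y for PLS2 classification.

    Each column corresponds to one class; an object of class 2 out of 4
    gets the row [0 1 0 0]. Using a single 1/2/3/4 variable instead would
    wrongly impose an order and a spacing on the classes.
    """
    classes = sorted(set(labels))
    Y = np.zeros((len(labels), len(classes)))
    for i, lab in enumerate(labels):
        Y[i, classes.index(lab)] = 1.0
    return Y, classes

labels = ['A', 'C', 'B', 'A', 'D', 'B']
Y, classes = class_dummy_matrix(labels)
print(classes)
print(Y)
# a predicted row of Y is assigned to the class with the largest predicted value
```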
33.2.9 Neural networks

A more recently introduced technique, at least in the field of chemometrics, is the use of neural networks. The methodology will be described in detail in Chapter 44. In this chapter, we will only give a short and very introductory description to be able to contrast the technique with the others described earlier. A typical artificial neuron is shown in Fig. 33.19. The isolated neuron of this figure performs a two-stage process to transform a set of inputs into a response or output. In a pattern recognition context, these inputs would be the values for the variables (in this example limited to only two, x1 and x2) and the response would be a class variable, for instance y = 1 for class K and y = 0 for class L. The inputs x1 and x2 are linked to the neuron by weights. These weights are determined by training the neuron with a set of training objects, but we will consider in this chapter that this has already been done. In the first stage a weighted sum of the x-values is made, Z = w1x1 + w2x2.
Fig. 33.19. An artificial neuron. The inputs are weighted and summed according to Z = w1x1 + w2x2; Z is transformed by comparison with T and leads to a 0/1 value for y.
Fig. 33.20. Output of the artificial neuron with values w1 = 1, w2 = 2, T = 1.
In the second stage, Z is transformed with the aid of a transfer function. For instance, it can be compared to a threshold value: if Z > T, then y = 1, and otherwise y = 0. For the neuron of Fig. 33.20, with w1 = 1, w2 = 2 and T = 1, all combinations of x1 and x2 on or above the line shown yield Z > T and therefore lead to an output y1 = 1 (i.e. the object is class K), all combinations below it to y1 = 0. The procedure described here is equivalent to a method called the linear learning machine, which was one of the first supervised pattern recognition methods to be applied in chemometrics. It is further explained, including the training phase, in Chapter 44. Neurons are not used alone, but in networks in which they constitute layers. In Fig. 33.21 a two-layer network is shown. In the first layer two neurons are each linked to the two inputs, x1 and x2. The upper one is the one we already described, the lower one has w1 = 2, w2 = 1 and also T = 1. It is easy to understand that for this neuron the output y2 is 1 on and above line b in Fig. 33.22a and 0 below it. The outputs of the neurons now serve as inputs to a third neuron, constituting a second layer. Both have weight 0.5 and T for this neuron is 0.75. The output of this neuron is 1 if Z = 0.5 y1 + 0.5 y2 > 0.75 and 0 otherwise. Since y1 and y2 have as possible values 0 and 1, the condition Z > 0.75 is fulfilled only when both are equal to 1, i.e. in the dashed area of Fig. 33.22b. The boundary obtained is now no longer straight, but consists of two pieces. This network is only a simple demonstration network. Real networks have many more nodes, transfer functions are usually non-linear, and it will be intuitively clear that boundaries of a very complex nature can be developed. How to do this, and applications of supervised pattern recognition, are described in detail in Chapter 44, but it should be stated here that excellent results can be obtained.
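The demonstration network of Figs. 33.21 and 33.22 can be written out explicitly as follows, using the weights and thresholds quoted above.

```python
import numpy as np

def threshold_neuron(inputs, weights, T):
    """Weighted sum followed by a hard threshold transfer function."""
    return 1 if np.dot(inputs, weights) > T else 0

def two_layer_network(x1, x2):
    # layer 1: two neurons, each connected to both inputs
    y1 = threshold_neuron([x1, x2], [1.0, 2.0], T=1.0)   # the neuron of Fig. 33.20
    y2 = threshold_neuron([x1, x2], [2.0, 1.0], T=1.0)   # second neuron (line b)
    # layer 2: one neuron combining y1 and y2 with weights 0.5 and threshold 0.75,
    # so the final output is 1 only when both first-layer outputs are 1
    return threshold_neuron([y1, y2], [0.5, 0.5], T=0.75)

for x1, x2 in [(1.0, 1.0), (0.2, 0.6), (0.9, 0.1), (0.1, 0.2)]:
    print((x1, x2), two_layer_network(x1, x2))
```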
Fig. 33.21. A two-layer neural network.
Fig. 33.22. (a) Intermediate (y1 and y2) outputs of the neural network of Fig. 33.21; (b) final output of the neural network.
The similarity in approach to LDA (Section 33.2.2) and PLS (Section 33.2.8) should be pointed out. Neural classification networks are related to neural regression networks in the same way that PLS can be applied both for regression and classification and that LDA can be described as a regression application. This can be generalized: all regression methods can be applied in pattern recognition. One must expect, for instance, that methods such as ACE and MARS (see Chapter 11) will be used for this purpose in chemometrics.
33.3 Feature selection and reduction

One can, and sometimes must, reduce the number of features. One way is to combine the original variables in a smaller number of latent variables such as principal components or PLS functions. This is called feature reduction. The combination of PCA and LDA is often applied, in particular for ill-posed data (data where the number of variables exceeds the number of objects), e.g. Ref. [46]. One first extracts a certain number of principal components, deleting the higher-order ones and thereby reducing to some degree the noise, and then carries out the LDA. One should however be careful not to eliminate too many PCs, since in this way information important for the discrimination might be lost. A method in which both are merged in one step and which sometimes yields better results than the two-step procedure is reflected discriminant analysis. The Fourier transform is also sometimes used [14], and this is also the case for the wavelet transform (see Chapter 40) [13,16]. In that case, the information is included in the first few Fourier coefficients or in a restricted number of wavelet coefficients. In feature selection one selects from the m variables a subset that seems to be the most discriminating. Feature selection therefore constitutes a means of choosing sets of optimally discriminating variables and, if these variables are the results of analytical tests, this consists, in fact, of the selection of an optimal combination of analytical tests or procedures. One way of selecting discriminating features is to compare the means and the variances of the different variables. Variables with widely different means for the classes and small intraclass variance should be of value and, for a binary discrimination, one therefore selects those variables i for which the expression

(\bar{x}_{iK} - \bar{x}_{iL}) / \sqrt{s_{iK}^2 + s_{iL}^2}   (33.20)
is maximal. It should be noted that, in this way, we select the individually best variables. As the correlation between variables is not taken into account, this means that one does not necessarily select the best combination of variables. Most of the supervised pattern recognition procedures permit the carrying out of stepwise selection, i.e. the selection first of the most important feature, then of the second most important, etc. One way to do this is by prediction using e.g. cross-validation (see next section), i.e. we first select the variable that best classifies objects of known classification but that are not part of the training set, then the variable that most improves the classification already obtained with the first selected variable, etc. The result for the linear discriminant analysis of the EU/HYPER classification of Section 33.2.1 is that with all 5 or 4 variables a selectivity of 91.4% is obtained and for 3 or 2 variables 88.6% [2]. Selectivity is used here as a measure of classification success. It is applied in the sense of Chapter
16 because of the medical context. Selectivity is the number of true negatives divided by the sum of true negatives (the EU-cases that are classified as such) and false positives (the EU-cases that are classified wrongly as HYPER). Of course, one should also consider the sensitivity (Chapter 16), which was done in the article, but, for simplicity's sake, will not be discussed here. For the HYPER/EU discrimination, the elimination of successively two, or more variables leads to the expected result that, since there is less information, the classification is less successful. On the other hand, for the HYPO/EU discrimination, a less evident result is obtained. Five variables yield a selectivity of 80.0%, 4 of 83.0%, 3 of 86.7% and 2 of 96.7%. A smaller number of tests leads to an improvement in the classification results. One concludes that the variables eliminated either have no relevance to the discrimination considered, and therefore only add noise, or else that the information present in the eliminated variables was redundant or correlated with respect to the retained variables. Another approach requires the use of Wilks' lambda. This is a measure of the quality of the separation, computed as the determinant of the pooled within-class covariance matrix divided by the determinant of the covariance matrix for the whole set of samples. The smaller this is, the better and one selects variables in a stepwise way by including those that achieve the highest decrease of the criterion. In SIMCA, we can determine the modelling power of the variables, i.e. we measure the importance of the variables in modelling the class. Moreover, it is possible to determine the discriminating power, i.e. which variables are important to discriminate two classes. The variables with both low discriminating and modelling power are deleted. This is more a variable elimination procedure than a selection procedure: we do not try to select the minimum number of features that will lead to the best classification (or prediction rate), but rather eliminate those that carry no information at all. It should be stressed here that feature selection is not only a data manipulation operation, but may have economic consequences. For instance, one could decide on the basis of the results described above to reduce the number of different tests for a EU/HYPO discrimination problem to only two. A less straightforward problem with which the decision maker is confronted is to decide how many tests to carry out for a EU/HYPER discrimination. One loses some 3% in selectivity by eliminating one test. The decision maker must then compare the economic benefit of carrying out one test less with the loss contained in a somewhat smaller diagnostic success. In fact, he carries out a cost-benefit analysis. This is only one of the many instances where an analytical (or clinical) chemist may be confronted with such a situation.
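The univariate criterion of eq. (33.20) can be sketched as follows; the two-class data are simulated, and only the individual ranking of the variables is obtained, not the best combination.

```python
import numpy as np

def fisher_weights(X_K, X_L):
    """Variance-weighted mean difference of eq. (33.20) for every variable."""
    num = np.abs(X_K.mean(axis=0) - X_L.mean(axis=0))
    den = np.sqrt(X_K.var(axis=0, ddof=1) + X_L.var(axis=0, ddof=1))
    return num / den

rng = np.random.default_rng(3)
X_K = rng.normal(0.0, 1.0, size=(40, 5))
X_L = rng.normal(0.0, 1.0, size=(40, 5))
X_L[:, 1] += 2.0          # variable 1 (0-based) discriminates well
X_L[:, 3] += 0.5          # variable 3 discriminates weakly

w = fisher_weights(X_K, X_L)
print(np.round(w, 2))
print('ranking (best first):', np.argsort(w)[::-1])
# note: variables are ranked individually; correlated variables may be redundant
```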
33.4 Validation of classification rules

In the training or learning step, one develops a decision model (a rule) which allows classification of the unknown samples to be carried out. The decision model of Fig. 33.6 consists of line a and the classification rule is that objects to the right of it are assigned to class L and objects to the left to class K. Once a decision rule has been obtained, it is still necessary to demonstrate that it is a good one. This can be done by observing how successful it is at classifying known samples (test set). One distinguishes between recognition and prediction ability. The recognition (or classification) ability is characterized by the percentage of the members of the training set that are correctly classified. The prediction ability is determined by the percentage of the members of the test set correctly classified by using the decision functions or classification rules developed during the training step. When one only determines the recognition ability, there is a risk that one will be deceived into taking an overoptimistic view of the classification result. It is therefore also necessary to verify the prediction ability. Both recognition and prediction ability are usually expressed as (correct) classification rate, although other possibilities exist (see Section 33.3). The situation is very similar to that in regression (see Section 10.3.4), where we validated the regression model by looking how well it modelled the objects that were included in the calibration set (goodness- or lack-of-fit) and new samples (prediction performance). This analogy should not surprise, since regression and classification are both modelling methods. Validation in pattern recognition is therefore similar to validation in multivariate calibration and the reader should refer to Chapter 36 for more details. The ideal situation is when there are enough samples available to create separate (independent) training and test sets. When this is not possible, an artifice is necessary. The prediction ability is determined by developing the decision model on a part of the training set only and using the other part as a mock test set. Often this is repeated a few times until all training samples have been used as test samples. If several objects at a time are considered as test samples, this is called a resampling or (internal) cross-validation method, k-fold cross-validation or jackknife method; when only one sample at a time is removed from the training set, it is called a leave-one-out procedure. If the training set consists of 20 objects, a jackknife method could be carried out as follows. We first delete objects 1-6 from the training set and develop the classification rules with the remaining objects 7-20. Then, we consider 1-6 as the test set, classify them with the rules obtained on objects 7-20 and note how many objects were classified correctly. The whole procedure is then repeated after replacing first objects 1-6 in the training set but deleting 7-12. The latter objects are classified using the classification rule developed on a training set consisting of objects 1-6 and 13-20. Finally, a training set consisting of 1-12 is used and objects 13-20 serve as test set and are classified. The percentage of successes on the three runs together is then called the prediction ability.
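The resampling scheme just described can be written as a generic internal cross-validation loop; the nearest-class-mean rule used below is only a placeholder for any of the decision rules of Section 33.2, and the block sizes need not be exactly equal.

```python
import numpy as np

def nearest_mean_classifier(X_train, y_train, X_test):
    """Placeholder decision rule: assign each test object to the nearest class mean."""
    classes = np.unique(y_train)
    means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    d = ((X_test[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

def prediction_ability(X, y, n_blocks=3, classify=nearest_mean_classifier):
    """Internal cross-validation: each block serves once as a mock test set."""
    n = len(y)
    blocks = np.array_split(np.arange(n), n_blocks)   # e.g. 20 objects -> blocks of 7/7/6
    correct = 0
    for test_idx in blocks:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        y_pred = classify(X[train_idx], y[train_idx], X[test_idx])
        correct += np.sum(y_pred == y[test_idx])
    return 100.0 * correct / n                        # prediction ability in %

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(2, 1, (10, 2))])
y = np.array(['K'] * 10 + ['L'] * 10)
print(prediction_ability(X, y, n_blocks=3))
```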
In general, it is found that prediction ability is somewhat less good than recognition ability. If the prediction and the recognition ability are substantially different, this means that the decision rules depend too much on the actual objects in the training set: the solution obtained is not stable and should therefore not be trusted. Many other subjects are important to achieve successful pattern recognition. To name only two, it should be investigated to what extent outliers are present, because these can have a profound influence on the quality of a model, and to what extent clusters occur in a class (e.g. using the index of clustering tendency of Section 30.4.1). When clusters occur, we must wonder whether we should not consider two (or more) classes instead of a single class. These problems also affect multivariate calibration (Chapter 36) and we have discussed them to a somewhat greater extent in that chapter.

References

1. R.G. Brereton, ed., Multivariate Pattern Recognition in Chemometrics. Elsevier, Amsterdam, 1992.
2. D. Coomans, M. Jonckheer, D.L. Massart, I. Broeckaert and P. Blockx, The application of linear discriminant analysis in the diagnosis of thyroid diseases. Anal. Chim. Acta, 103 (1978) 409-415.
3. D. Coomans, I. Broeckaert, M. Jonckheer and D.L. Massart, Comparison of multivariate discrimination techniques for clinical data - Application to the thyroid functional state. Meth. Inform. Med., 22 (1983) 93-101.
4. D. Coomans, I. Broeckaert and D.L. Massart, Potential methods in pattern recognition. Part 4. Anal. Chim. Acta, 132 (1981) 69-74.
5. R. Fisher, The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7 (1936) 179-188.
6. P.J. Dunlop, C.M. Bignell, J.F. Jackson and D.B. Hibbert, Chemometric analysis of gas chromatographic data of oils from Eucalyptus species. Chemom. Intell. Lab. Systems, 30 (1995) 59-67.
7. K. Varmuza, F. Stangl, H. Lohninger and W. Werther, Automatic recognition of substance classes from data obtained by gas chromatography, mass spectrometry. Lab. Automation Inf. Manage., 31 (1996) 221-224.
8. A. Candolfi, W. Wu, S. Heuerding and D.L. Massart, Comparison of classification approaches applied to NIR-spectra of clinical study lots. J. Pharm. Biomed. Anal., 16 (1998) 1329-1347.
9. T. Fearn, Discriminant analysis. NIR News, 4 (5) (1993) 4-5.
10. G. McLachlan, Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York, 1992.
11. M.G. Kendall and A.S. Stuart, The Advanced Theory of Statistics, Vol. 3. Ch. Griffin, London, 1968.
12. J.H. Friedman, Regularized discriminant analysis. J. Am. Stat. Assoc., 84 (1989) 165-175.
13. Y. Mallet, D. Coomans and O. de Vel, Recent developments in discriminant analysis on high dimensional spectral data. Chemom. Intell. Lab. Systems, 35 (1996) 157-173.
14. W. Wu, Y. Mallet, B. Walczak, W. Penninckx, D.L. Massart, S. Heuerding and F. Erni, Comparison of regularized discriminant analysis, linear discriminant analysis and quadratic discriminant analysis, applied to NIR data. Anal. Chim. Acta, 329 (1996) 257-265.
15. D. Coomans and D.L. Massart, Alternative k-nearest neighbour rules in supervised pattern recognition. Part 2. Probabilistic classification on the basis of the kNN method modified for direct density estimation. Anal. Chim. Acta, 138 (1982) 153-165.
16. E.R. Collantes, R. Duta, W.J. Welsh, W.L. Zielinski and J. Brower, Preprocessing of HPLC trace impurity patterns by wavelet packets for pharmaceutical fingerprinting using artificial neural networks. Anal. Chem., 69 (1997) 1392-1397.
17. J.D.F. Habbema, Some useful extensions of the standard model for probabilistic supervised pattern recognition. Anal. Chim. Acta, 150 (1983) 1-10.
18. D. Coomans, M.P. Derde, I. Broeckaert and D.L. Massart, Potential methods in pattern recognition. Anal. Chim. Acta, 133 (1981) 241-250.
19. D. Coomans and I. Broeckaert, Potential Pattern Recognition in Chemical and Medical Decision Making. Wiley, Chichester, 1986.
20. M. Forina, C. Armanino, R. Leardi and G. Drava, A class-modelling technique based on potential functions. J. Chemom., 5 (1991) 435-453.
21. M. Forina, S. Lanteri and C. Armanino, Chemometrics in Food Chemistry. Topics Curr. Chem., 141 (1987) 93-143.
22. M. Mulholland, D.B. Hibbert, P.R. Haddad and P. Parslov, A comparison of classification in artificial intelligence, induction versus a self-organising neural networks. Chemom. Intell. Lab. Systems, 30 (1995) 117-128.
23. L. Breiman, J.H. Friedman, R.A. Olshen and C.J. Stone, Classification and Regression Trees. Wadsworth, Pacific Grove, CA, 1984.
24. C.H. Yeh and C.H. Spiegelman, Partial least squares and classification and regression trees. Chemom. Intell. Lab. Systems, 22 (1994) 17-23.
25. A. Sankar and R. Mammone, A fast learning algorithm for tree neural networks. In: Proc. 1990 Conf. on Information Sciences and Systems, Princeton, NJ, 1990, pp. 638-642.
26. D.H. Coomans and O.Y. de Vel, Pattern analysis and classification. In: J. Einax (ed.), The Handbook of Environmental Chemistry, Vol. 2, Part G. Springer, 1995, pp. 279-324.
27. S. Wold, Pattern recognition by means of disjoint principal components models. Pattern Recogn., 8 (1976) 127-139.
28. M.P. Derde and D.L. Massart, UNEQ: a disjoint modelling technique for pattern recognition based on normal distribution. Anal. Chim. Acta, 184 (1986) 33-51.
29. I. Frank and J.M. Friedman, Classification: oldtimers and newcomers. J. Chemom., 3 (1989) 463-475.
30. R. De Maesschalck, A. Candolfi, D.L. Massart and S. Heuerding, Decision criteria for SIMCA applied to near infrared data. Chemom. Intell. Lab. Syst., in prep.
31. H. van der Voet and P.M. Coenegracht, The evaluation of probabilistic classification methods. Part 2. Comparison of SIMCA, ALLOC, CLASSY and LDA. Anal. Chim. Acta, 209 (1988) 1-27.
32. D. Coomans, I. Broeckaert, M.P. Derde, A. Tassin, D.L. Massart and S. Wold, Use of a microcomputer for the definition of multivariate confidence regions in medical diagnosis based on clinical laboratory profiles. Comp. Biomed. Res., 17 (1984) 1-14.
33. I. Frank, DASCO: a new classification method. Chemom. Intell. Lab. Syst., 4 (1988) 215-222.
34. H. van der Voet, P.M.J. Coenegracht and J.B. Hemel, New probabilistic versions of the SIMCA and CLASSY classification methods. Part 1. Theoretical description. Anal. Chim. Acta, 192 (1987) 63-75.
35. H. van der Voet and D.A. Doornbos, The improvement of SIMCA classification by using kernel density estimation. Part 1. Anal. Chim. Acta, 161 (1984) 115-123; Part 2. Anal. Chim. Acta, 161 (1984) 125-134.
36. M. Forina, G. Drava and G. Contarini, Feature selection and validation of SIMCA models: a case study with a typical Italian cheese. Analusis, 21 (1993) 133-147.
37. A. Candolfi, R. De Maesschalck, D.L. Massart, P.A. Hailey and A.C.E. Harrington, Identification of pharmaceutical excipients using NIR spectroscopy and SIMCA. J. Pharm. Biomed. Anal., in prep.
38. M.A. Dempster, B.F. MacDonald, P.J. Gemperline and N.R. Boyer, A near-infrared reflectance analysis method for the non-invasive identification of film-coated and non-film-coated, blister-packed tablets. Anal. Chim. Acta, 310 (1995) 43-51.
39. D. Scott, W. Dunn and S. Emery, Pattern recognition classification and identification of trace organic pollutants in ambient air from mass spectra. J. Res. Natl. Bur. Stand., 93 (1988) 281-283.
40. E. Saaksjarvi, M. Khaligi and P. Minkkinen, Waste water pollution modelling in the southern area of Lake Saimaa, Finland, by the SIMCA pattern recognition method. Chemom. Intell. Lab. Systems, 7 (1989) 171-180.
41. O.M. Kvalheim, K. Øygard and O. Grahl-Nielsen, SIMCA multivariate data analysis of blue mussel components in environmental pollution studies. Anal. Chim. Acta, 150 (1983) 145-152.
42. B.G.J. Massart, O.M. Kvalheim, F.O. Libnau, K.I. Ugland, K. Tjessem and K. Bryne, Projective ordination by SIMCA: a dynamic strategy for cost-efficient environmental monitoring around offshore installations. Aquatic Sci., 58 (1996) 121-138.
43. B.K. Lavine, H. Mayfield, P.R. Kromann and A. Faruque, Source identification of underground fuel spills by pattern recognition analysis of high-speed gas chromatograms. Anal. Chem., 67 (1995) 3846-3852.
44. B. Mertens, M. Thompson and T. Fearn, Principal component outlier detection and SIMCA: a synthesis. Analyst, 119 (1994) 2777-2784.
45. L. Ståhle and S. Wold, Partial least squares analysis with cross-validation for the two-class problem: a Monte Carlo study. J. Chemometrics, 1 (1987) 185-196.
46. A.F.M. Nierop, A.C. Tas and J. van der Greef, Reflected discriminant analysis. Chemom. Intell. Lab. Syst., 25 (1994) 249-263.
Chapter 34
Curve and Mixture Resolution by Factor Analysis and Related Techniques

34.1 Abstract and true factors

In Chapter 31 we stated that any data matrix can be decomposed into a product of two other matrices, the score and loading matrix. In some instances another decomposition is possible, e.g. into a product of a concentration matrix and a spectrum matrix. These two matrices have a physical meaning. In this chapter we explain how a loading or a score matrix can be transformed into matrices to which a physical meaning can be attributed. We introduce the subject with an example from environmental chemistry and one from liquid chromatography. Let us suppose that dust particles have been collected in the air above a city and that the amounts of p constituents, e.g. Si, Al, Ca, ..., Pb, have been determined in these samples. The elemental compositions obtained for n (e.g. 100) samples, taken over a grid of sampling points, can be arranged in a data matrix X (Fig. 34.1). Each row of the table represents the elemental composition of one of the samples. A column represents the amount of one of the elements found in the sample set. Let us further suppose that there are two main sources of dust in the neighbourhood of the sampled area, and that the particles originating from each source have a specific concentration pattern for the elements Si to Pb. These concentration patterns are described by the vectors s1 and s2. For instance the dust in the air may originate from a power station and from an incinerator, each having a specific concentration pattern, s_k^T = [Si_k, Al_k, Ca_k, ..., Pb_k] with k = 1, 2. Obviously, each sample in the sampled area contains particles from each source, but in a varying proportion. Some of the samples mainly contain particles from the power station and less from the incinerator. Other samples may contain an equal amount of particles of each source. In general, one can say that the composition x_i of any sample i of dust is a linear combination of the two source patterns s1 and s2, given by x_i = c_i1 s1 + c_i2 s2. In this expression c_i1 gives the contribution of the first source and c_i2 the contribution of the second dust source in sample i. For all n samples these contributions can be arranged in an n×2 matrix C, giving X = C S^T, where S is the p×2 matrix of the source patterns. If the concentration patterns of the
Fig. 34.1. The principle of factor analysis.
elements (Si to Pb) of the two dust sources are known, the relative contribution C of each source in the samples is estimated by solving the equation X = C S^T by multiple linear regression (see Chapter 10), provided that the concentration patterns are sufficiently different and that the number of sources nc is less than or equal to the number of measured elements p: C = X S (S^T S)^{-1}. Matters become more complex when the concentration patterns of the sources are not known. It becomes even more complicated when the number and the origin of the potential sources of the dust are unknown. In this case the number of sources and the concentration patterns of each source have to be estimated from the measured data table X. This operation is called factor analysis. In the terminology of factor analysis, the two sources of dust in our example are called factors, and the concentration patterns of the compounds in each source are called factor loadings. In Chapters 17 and 31 we explained that a matrix X can be decomposed by SVD into a product of three matrices, the two matrices of singular vectors U and V and a diagonal matrix of singular values Λ, such that:

X = U Λ V^T    (n×p) = (n×p)(p×p)(p×p)   (34.1)
where n is the number of samples (rows) and p is the number of variables (columns). Often the decomposition is given as a product of only two matrices:

X = T V^T    (n×p) = (n×p)(p×p)   (34.2)
where T (= UΛ) is the matrix of scores, proportional to their 'size', and V is the loading matrix. The columns of V are colloquially denoted as the principal components of X. Each row of the original data matrix X can be represented as a linear combination of all PCs in the loading matrix. The multiplicative factors (or regressors) of that linear combination for a particular row x_i of X are given by the corresponding row t_i in the score matrix: x_i = t_i V^T. X can be reconstructed within the measurement error E by taking the first nc PCs:

X = T* V*^T + E    (n×p) = (n×nc)(nc×p) + (n×p)
The symbol * means that the first nc significant columns of V and T are retained. The number of significant principal components, nc, which is the pseudo-rank of X, is usually unknown. Methods for estimating the number of components are discussed in Section 31.5. In the case of city air pollution caused by two sources of dust, one expects to find two significant PCs, PC1 and PC2, which are the two first rows of V^T. One could conclude that this decomposition yielded the requested information: each row of X is represented by a linear combination of two source profiles, PC1 and PC2. The question arises then whether these profiles represent the true concentration profiles of the sources. Unfortunately, the answer is no! In Chapter 29, we explained that PC1 and PC2 are obtained under some specific constraints, i.e. PC1 is calculated under the constraint that it describes the maximum variance of X. The second loading vector PC2 also describes the maximum variance in X, but now in the direction orthogonal to PC1. These constraints are not necessarily valid for the true factors. It is very improbable that the true source profiles are orthogonal. Therefore the PCs are called abstract factors. The aim of factor analysis is to transform these abstract factors into real factors. PCA is a purely mathematical operation, using no other information than that the rows in X can be described by a linear combination of a number of linearly independent vectors. However, factor analysis usually requires the formulation of additional constraints to find a solution. These constraints are defined by the characteristics of the system being investigated. In this example, a constraint could be that all concentrations should be non-negative. Before going into more detail, a second example is given.
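The two situations discussed above can be illustrated numerically as follows; the source patterns, contributions and noise level are invented. When S is known, C follows from the regression C = X S(S^T S)^{-1}; when it is not, an SVD only yields abstract, orthogonal factors.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, nc = 8, 50, 2                                    # elements, samples, sources

S = np.abs(rng.normal(1.0, 0.5, size=(p, nc)))         # two hypothetical source patterns
C_true = np.abs(rng.normal(1.0, 0.6, size=(n, nc)))    # contributions of each source
X = C_true @ S.T + 0.01 * rng.normal(size=(n, p))      # measured data matrix

# if the source patterns S are known, the contributions follow by regression
C_hat = X @ S @ np.linalg.inv(S.T @ S)
print('max error in C:', np.abs(C_hat - C_true).max())

# if S is unknown, SVD only yields abstract (orthogonal) factors T and V
U, sv, Vt = np.linalg.svd(X, full_matrices=False)
T = U * sv                                             # scores
print('singular values:', np.round(sv[:4], 2))         # two dominant values -> two sources
print('abstract loadings differ from S:', np.round(Vt[:2], 2))
```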
Fig. 34.2. UV-visible spectra of mixtures of fluoranthene and chrysene (see Fig. 34.3 for the pure spectra).
Suppose that during the elution of two compounds from an HPLC one measures n (= 15) UV-visible spectra at p (= 20) wavelengths. Because of the Lambert-Beer law, all measured spectra are linear combinations of the two pure spectra. Together they form a 15×20 data matrix. For example, the UV-visible spectra of mixtures of two polycyclic aromatic hydrocarbons (PAH) given in Fig. 34.2 are linear combinations of the pure spectra shown in Fig. 34.3. These mixture spectra define a data matrix X, which can be written as the product of a 15×2 concentration matrix C with the 2×20 matrix S^T of the pure spectra:

X = C S^T    (n×p) = (n×nc)(nc×p)   (34.3)

where n (= 15) is the number of mixture spectra, nc (= 2) is the number of species, and p (= 20) is the number of wavelengths. The rows of X are mixture spectra and the columns are chromatograms at the p = 20 wavelengths. Here, columns as well as rows are linear combinations of pure factors, in this example pure row factors, being the pure spectra, and pure column factors, being the pure elution profiles. Any data matrix can be considered in two spaces: the column or variable space (here, wavelength space) in which a row (here, a spectrum) is a vector in the multidimensional space defined by the column variables (here, wavelengths), and the row space (here, retention time space) in which a column (here, a chromatogram) is a vector in the multidimensional space defined by the row variables (here, elution times). This duality of the multivariate spaces has been discussed in more detail in Chapter 29. Depending on the chosen space, the PCs of the data matrix
Fig. 34.3. UV-visible spectra of two polyaromatic hydrocarbons (PAHs), fluoranthene and chrysene.
have a different meaning. In wavelength space the eigenvectors of X^T X represent abstract spectra, and in retention time space the eigenvectors of X X^T are abstract chromatograms. Irrespective of the chosen space, by decomposing matrix X with a PCA as many significant principal components should be found as there are chemical species in the mixtures. The decomposition in the wavelength space, for a system with two compounds, is given by:

X = T* V*^T + E    (n×p) = (n×2)(2×p) + (n×p)   (34.4)
By decomposing the HPLC data matrix of spectra shown in Fig. 34.2 according to eq. (34.4), a matrix V* is obtained containing the two significant columns of V. Evidently the loading plots shown in Fig. 34.4 do not represent the two pure spectra, though each mixture spectrum can be represented as a linear combination of these two PCs. Therefore, these two PCs are called abstract spectra. Equations (34.3) and (34.4) show a decomposition of the data matrix X in two ways: the first is a decomposition in real factors, a product of a matrix S^ of the spectra with a matrix C of concentration profiles, and the second is a decomposition in abstract factors T* and V*^. By factor analysis one transforms V*^ in eq. (34.4) into S^ in eq. (34.3). The score matrix T* gives the location of the spectra in the space defined by the two principal components. Figure 34.5 shows a scores plot thus obtained with a clear structure (curve). The cause of this structure is explained in Section 34.2.1.
Fig. 34.4. The two first principal components of the data matrix of the spectra given in Fig. 34.2.
Fig. 34.5. Score plot (PC1 score vs PC2 score) of the mixture spectra given in Fig. 34.2.
The decomposition into elution profiles and spectra may also be represented as:

X^T = S C^T    (p×n) = (p×2)(2×n)

The corresponding decomposition by a principal components analysis gives:

X^T = P* Q*^T + E    (p×n) = (p×nc)(nc×n) + (p×n)
where the rows in Q*^ now represent abstract elution profiles. It should be noted that the score matrix P* has another meaning than T* in eq. (34.4). It represents here the location of the chromatograms in the factor space Q*. It should also be noted that Q is equivalent to U in eq. (34.1) and that P is equivalent to AV^ in eq. (34.1). Here, too, one wants to transform abstract elution profiles Q*^ into real elution profiles C^ by factor analysis. The result of a PC A carried out in this retention time space is given in Fig. 34.6. The two first PCs clearly have the appearance of elution profiles, but are not the true elution profiles. Because elution profiles have a much smoother appearance than spectra, which may have a very irregular form, abstract elution profiles are sometimes easier to interpret than abstract spectra. For instance, one can easily derive the positions of the peak maxima and also distinguish significant PCs from those which represent noise. The scores plot (Fig. 34.7) in which the chromatograms are plotted in the space defined by the wavelengths is less easily interpreted than the corresponding scores plot of the spectra shown in Fig. 34.5. The plot in Fig. 34.7 does not reveal any structure because the consecutive chromatograms in X^ follow the irregular pattern of the absorptivity coefficients as a function of the wavelength. Therefore, if the aim of factor analysis is to transform PCs into real factors (by one of the methods explained in this chapter) one prefers the retention time space, because this yields loadings which are the easiest to interpret. On the other hand, if the aim of the analysis is to detect structure in the scores plot, the wavelength space is preferred. In this space regions of pure spectra (selective parts of the chromatograms), or regions where only binary mixtures are present, are more easily detected. The above considerations form the basis of the HELP procedure explained in Section 34.3.3. In some cases a principal components analysis of a spectroscopic- chromatographic data-set detects only one significant PC. This indicates that only one chemical species is present and that the chromatographic peak is pure. However, by the presence of noise and artifacts, such as a drifting baseline or a nonlinear response, conclusions on peak purity may be wrong. Because the peak purity assessment is the first step in the detection and identification of an impurity by factor analysis, we give some attention to this subject in this chapter.
Fig. 34.6. The two first principal components of a LC-DAD data set in the retention time space.
Fig. 34.7. Score plot (PC1 score vs PC2 score) of chromatograms in the retention time space.
Basically, we make a distinction between methods which are carried out in the space defined by the original variables (Section 34.4) or in the space defined by the principal components. A second distinction we can make is between full-rank methods (Section 34.2), which consider the whole matrix X, and evolutionary methods (Section 34.3) which analyse successive sub-matrices of X, taking into account the fact that the rows of X follow a certain order. A third distinction we make is between general methods of factor analysis which are applicable to any data matrix X, and specific methods which make use of specific properties of the pure factors.
34.2 Full-rank methods

34.2.1 A qualitative approach

Before going into detail about various specific methods to estimate pure factors, we qualitatively describe how pure factors can be derived from the principal components. This is illustrated with a data matrix of the two-component HPLC example discussed in the previous section. When discussing the scores plot (Fig. 34.5) we mentioned that the scores showed some structure. From the origin, two straight lines depart which are connected by a curved line. In Chapter 17 we explained that these straight lines coincide with the pure spectra present in the pure elution time zones. The distance from the origin is a measure for the 'size' of the spectrum. The curved part represents the zone where two compounds co-elute. Going through the curve starting from the origin we find pure spectra of compound 1 in increasing concentrations, then mixtures of compounds 1 and 2, followed by the spectra of compound 2 in decreasing concentration, back to the origin. The angle between the two lines is a measure for the correlation between the two pure spectra. If the spectra are uncorrelated (i.e. very dissimilar) the two lines are orthogonal. At high correlations the angle becomes very small. The two lines define the directions of the pure factors in the PC1-PC2 space. In this simplified situation no factor analysis is needed to find the pure factors. From the scores plot we also observe that the pure factors make angles α1 and α2 with PC1 and PC2, respectively. Conversely, one could also find the pure factors by rotating PC1 over an angle α1 and PC2 over an angle α2. Finding these angles is the purpose of factor analysis. When pure spectra are present in the data set, finding these angles is quite straightforward. Therefore, several factor analysis methods aim at finding the purest rows (spectra) or purest columns (wavelengths) in the data set, which is discussed in Section 34.4. When no pure row or column is available for one of the factors, we cannot directly derive the rotation angles from the scores plot, because the straight line segments are missing. In this case we need to make some
assumptions about the pure factors in order to estimate the rotation angles. Obvious assumptions in chromatography are the non-negativity of the absorbances and of the concentration profiles. If no constraints can be formulated to estimate the rotation angles, one must rely on abstract rotation procedures. An example of this type of rotation is the Varimax method of Kaiser [1], which is explained in Section 34.2.3.

34.2.2 Factor rotations

A row of a data matrix can be interpreted as a point in the space defined by its column variables:
X = \begin{bmatrix} x_1 & y_1 \\ x_2 & y_2 \\ \vdots & \vdots \\ x_n & y_n \end{bmatrix}
For instance, the first row of the matrix X defines a point with the coordinates (x1, y1) in the space defined by the two orthogonal axes x^T = (1 0) and y^T = (0 1). Factor rotation means that one rotates the original axes x^T = (1 0) and y^T = (0 1) over a certain angle θ. With orthogonal rotation in two-dimensional space both axes are rotated over the same angle. The distance between the points remains unchanged.
Fig. 34.8. Orthogonal rotation of the (x,y) axes into (k,l) axes.
Rotated axes are characterized by their position in the original space, given by the vectors k^T = [\cos\theta \; -\sin\theta] and l^T = [\sin\theta \; \cos\theta] (see Fig. 34.8). In PCA or FA, these axes fulfil specific constraints (see Chapter 17). For instance, in PCA the direction of k is the direction of the maximum variance of all points projected on this axis. A possible constraint in FA is maximum simplicity of k, which is explained in Section 34.2.3. The new axes (k,l) define another basis of the same space. The position of the vector [x_i y_i] is now [k_i l_i] relative to these axes. Factor rotation involves the calculation of [k_i l_i] from [x_i y_i], given a rotation angle with respect to the original axes. Suppose that after rotation the matrix X is transformed into a matrix F:

F = \begin{bmatrix} k_1 & l_1 \\ k_2 & l_2 \\ \vdots & \vdots \\ k_n & l_n \end{bmatrix}

Then the following relationship exists between [k_i l_i] and [x_i y_i]:

k_i = x_i \cos\theta - y_i \sin\theta
l_i = x_i \sin\theta + y_i \cos\theta

or in matrix notation (for all vectors i):

\begin{bmatrix} k_1 & l_1 \\ k_2 & l_2 \\ \vdots & \vdots \\ k_n & l_n \end{bmatrix} = \begin{bmatrix} x_1 & y_1 \\ x_2 & y_2 \\ \vdots & \vdots \\ x_n & y_n \end{bmatrix} \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}

which gives:

F = X R   (34.4)

with

R = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}
Columns and rows of R are orthogonal with a norm equal to one. Therefore, R defines a rotation, for which R R^T = R^T R = I.
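A small numerical illustration of this rotation, for an arbitrary angle:

```python
import numpy as np

def rotation_matrix(theta):
    """2 x 2 orthogonal rotation matrix R with R @ R.T = I."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s],
                     [-s, c]])

X = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])              # three points in the (x, y) plane
R = rotation_matrix(np.deg2rad(35))
F = X @ R                               # coordinates relative to the rotated axes
print(F)
print(np.allclose(R @ R.T, np.eye(2)))  # True: R is orthogonal, distances are preserved
```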
The aim of factor analysis is to calculate a rotation matrix R which rotates the abstract factors (V) (principal components) into interpretable factors. The various algorithms for factor analysis differ in the criterion used to calculate the rotation matrix R. Two classes of rotation methods can be distinguished: (i) rotation procedures based on general criteria which are not specific for the domain of the data and (ii) rotation procedures which use specific properties of the factors (e.g. non-negativity).

34.2.3 The Varimax rotation

In the previous section we have seen that axes defined by the column variables can be rotated. It is also possible to rotate the principal components. Instead of rotating the axes which define the column space of X, we rotate here the significant PCs in the sub-space defined by V*^T:

F = V*^T R    (nc×p) = (nc×p)(p×p)   (34.5)
The columns of V* are the abstract factors of X which should be rotated into real factors. The matrix V*^T is rotated by means of an orthogonal rotation matrix R, so that the resulting matrix F = V*^T R fulfils a given criterion. The criterion in Varimax rotation is that the rows of F obtain maximal simplicity, which is usually denoted as the requirement that F has maximum row simplicity. The idea behind this criterion is that 'real' factors should be easily interpretable, which is the case when the loadings of the factor are grouped over only a few variables. For instance the vector f_1^T = [0 0 0 0.5 0.8 0.33] may be easier to interpret than the vector f_2^T = [0.1 0.3 0.1 0.4 0.4 0.75]. It is more likely that the simple vector is a pure factor than the less simple one. Returning to the air pollution example, the simple vector f_1^T may represent the concentration profile of one of the pollution sources which mainly contains the three last constituents. A measure for the simplicity of a vector f_i is the variance of the squares of its p elements, which should be maximized [2]:

Simp = var(f_i^2) = \frac{1}{p} \sum_{j=1}^{p} (f_{ij}^2 - \overline{f_i^2})^2

To illustrate the concept of simplicity of a vector we calculate the simplicity of f_1 and f_2:

f_1    f_1^2    f_2    f_2^2
0      0        0.1    0.01
0      0        0.3    0.09
0      0        0.1    0.01
0.5    0.25     0.4    0.16
0.8    0.64     0.4    0.16
0.33   0.11     0.75   0.56

var(f_1^2) = 0.053        var(f_2^2) = 0.035
The simplicity of a matrix is the sum of the simplicities of its rows. Thus the simplicity of a matrix F with the rows f_1^T and f_2^T as given before is 0.088. f_1 and f_2 are called varivectors. By a varimax rotation the matrix V*^T is rotated by means of an orthogonal rotation matrix R, so that the simplicity of the resulting matrix F = V*^T R is maximal. Several algorithms have been proposed [2] for the calculation of R. Let us suppose that a PCA of X with four variables yields the following two significant principal components: PC1: [0.1023 -0.767 0.153 0.614] and PC2: [0.438 0.501 0.626 0.407]. The simplicity of the matrix with the rows PC1 and PC2 is equal to 0.0676. A varimax rotation of this matrix yields the varivectors f_1^T = [-0.167 -0.916 -0.233 0.269] and f_2^T = [0.417 -0.030 0.601 0.685] at an angle of 35 degrees with the PCs, and a corresponding simplicity equal to 0.1486. One can see that varivector f_1 is mainly directed along variable 2, which is probably a pure factor. The other variables (1, 3 and 4) load high on varivector f_2. Therefore, they belong to the second pure factor. Anyway, because the varivectors are simpler than the original PCs one can safely conclude that they resemble the pure factors better. Several applications of varimax rotation in analytical chemistry have been reported.
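For two components, a varimax-type rotation can be sketched by simply scanning the rotation angle and retaining the angle that maximizes the summed row simplicity; the loadings below are those of the numerical example above, and the brute-force scan replaces the dedicated algorithms of Ref. [2].

```python
import numpy as np

def simplicity(f):
    """Variance of the squared elements of a loading vector (row simplicity)."""
    f2 = f ** 2
    return np.mean((f2 - f2.mean()) ** 2)

def varimax_2d(V, n_angles=3601):
    """Scan rotation angles and keep the one maximizing the matrix simplicity."""
    best_angle, best_simp = 0.0, -np.inf
    for theta in np.linspace(0.0, np.pi / 2, n_angles):
        c, s = np.cos(theta), np.sin(theta)
        F = np.array([[c, -s], [s, c]]) @ V      # rotated pair of loading vectors
        simp = simplicity(F[0]) + simplicity(F[1])
        if simp > best_simp:
            best_angle, best_simp = theta, simp
    return best_angle, best_simp

PC = np.array([[0.1023, -0.767, 0.153, 0.614],
               [0.438,   0.501, 0.626, 0.407]])  # the two PCs of the example
angle, simp = varimax_2d(PC)
print('rotation angle: %.1f degrees' % np.degrees(angle))
print('simplicity before %.4f, after %.4f'
      % (simplicity(PC[0]) + simplicity(PC[1]), simp))
```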
Fig. 34.9. Varimax rotated principal components given in Fig. 34.6.
PAHs introduced in Section 34.1. A PCA applied on the transpose of this data matrix yields abstract 'chromatograms' which are not the pure elution profiles. These PCs are not simple as they show several minima and/or maxima coinciding with the positions of the pure elution profiles (see Fig. 34.6). By a varimax rotation it is possible to transform these PCs into vectors with a larger simplicity (grouped variables and other variables near to zero). When the chromatographic resolution is fairly good, these simple vectors coincide with the pure factors, here the elution profiles of the species in the mixture (see Fig. 34.9). Several variants of the varimax rotation, which differ in the way the rotated vectors are normalized, have been reviewed by Forina et al. [2]. 34.2.4. Factor rotation by target transformation factor analysis (TTFA) In the previous section, factors were searched by rotating the abstract factors into pure factors, obeying a number of constraints. In some cases, however, one may have a collection of candidate pure factors, e.g., a set of UV-visible or Mass spectra of chemical compounds. Having measured a data matrix of mixture spectra one could investigate whether compounds present in the mixture match with compounds available in the data base of pure spectra. In that situation one could first estimate the pure spectra from the mixture spectra and thereafter compare the obtained spectra with the spectra in the data base. Alternatively, one could identify the pure spectra by solving equation X = C S^ for C where S is the set of candidate spectra to be tested and X contains the mixture spectra. All non-zero rows of C indicate the presence of the spectrum in the corresponding row in S. This method fails when S does not contain all pure spectra present in the mixtures. Moreover, this procedure becomes unpractical when the number of candidate spectra is very large and the whole data base has to be checked. Furthermore, when the spectra to be checked are quite similar, the calculation of (S^S)~^ becomes unstable, leading to large errors in C and the indication of wrong spectra [3]. By target transformation factor analysis (TTFA), each candidate spectrum, called target is tested individually on its presence in the mixtures [4,5]. Here, targets are tested in the space defined by the significant principal components of the data matrix. Therefore, TTFA begins with a PCA of the data matrix X of the measured spectra. The principle of TTFA can be explained in an algebraic as well as geometrical way. We start with the algebraic approach. Because according to eq. (34.4) any row of X can be written as
x_i = t_i V*^T
(1×p)   (1×nc)(nc×p)
Each mixture spectrum is a linear combination of the nc significant eigenvectors. Equally, the pure spectra are linear combinations of the first nc PCs. A target
spectrum taken from the library can be tested on this property. If the test passes, the spectrum or target may be one of the pure factors. How this is done is explained below. The first step is to calculate the scores t_in of the target spectrum in (the input target) by solving the equation:
in = t_in V*^T
(1×p)   (1×nc)(nc×p)

giving t_in = in V*(V*^T V*)^-1 = in V*
These scores give the linear combination of the PCs that provides the best estimation (in a least squares sense) of the target spectrum. How good that estimation is can be evaluated by calculating the sum of squares of the residuals between the re-estimated or output target (obtained from the scores) and the input target. The output target out is equal to t_in V*^T. The overall expression for TTFA therefore becomes:

out = in V* V*^T    (34.6)
If the difference between out and in (||out - in||) can be explained by the variance of the noise, the test passes and the target is possibly one of the pure factors.
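A minimal numerical sketch of this test (Python/NumPy, with simulated data; not taken from the cited references) is given below: the target is projected with eq. (34.6) and the size of the residual ||out − in|| decides whether the target is accepted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated mixture data: 20 spectra (rows) x 50 wavelengths, two underlying components
S = np.abs(rng.normal(size=(2, 50)))           # pure spectra (hypothetical)
C = np.abs(rng.normal(size=(20, 2)))           # concentration profiles (hypothetical)
X = C @ S + 1e-3 * rng.normal(size=(20, 50))   # mixture spectra with a little noise

# Significant loadings V* of X (columns = the nc significant PCs)
nc = 2
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Vstar = Vt[:nc].T                              # p x nc

def ttfa_residual(target, Vstar):
    """out = in V* V*T (eq. 34.6); return ||out - in|| for the TTFA test."""
    out = target @ Vstar @ Vstar.T
    return np.linalg.norm(out - target)

print(ttfa_residual(S[0], Vstar))                         # small: pure spectrum accepted
print(ttfa_residual(np.abs(rng.normal(size=50)), Vstar))  # large: random target rejected
```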
Fig. 34.10. Simulated LC-DAD data-set of the separation of three PAHs (spectra 4, 5 and 6 in Fig. 34.11) (individual profiles and the sum).
Otherwise, the test fails. In principle there are no limitations on the number of compounds in the mixture. Let us illustrate the method with an LC-UV-visible data set of three overlapping peaks (Fig. 34.10). The TTFA results obtained with six candidate spectra, including the pure spectra, are given in Fig. 34.11. The assignment of the right spectra by an evaluation of the correlation coefficients between input and output target is obvious. Strasters et al. [6] applied this method to data sets obtained by HPLC-DAD and concluded that the pure factors can be very well recovered even at very low chromatographic resolutions (Rs < 0.02). A critical step in PCA-based methods is the determination of the number of significant PCs, which indicates the number of pure factors which have to be searched. A geometrical representation of target FA is as follows. Suppose a data set with spectra of a two-component system, measured at p wavelengths. All spectra are located on a plane defined by the two PCs, PC1 and PC2 (see Fig. 34.12, where p = 3). An input target in is a vector in the original p-dimensional space. A target (e.g. in1) which lies in the plane PC1-PC2 is a linear combination of the two eigenvectors. Other targets (e.g. in2) are not. Whether a target lies in the plane defined by the PCs is tested by projecting the target on the plane (or space) defined by the eigenvectors. For a target which lies in the plane defined by the eigenvectors the input
Fig. 34.12. Geometrical representation of TTFA: target in1 is accepted; target in2 is rejected.
and output targets are the same. For any other target (e.g. in2), the projected target out2 differs from the original one. If that difference is significantly greater than the noise, then this target is not one of the factors. Figure 34.12 also shows that the projected vector out2 is obtained by rotating one of the PCs. It also illustrates that another way to express the difference between the input target and output target is the angle between these two vectors. This angle is related to the correlation coefficient between the two vectors (Chapter 9).
34.2.5 Curve resolution based methods

In the previous methods no prior knowledge of the factors was used to estimate the pure factors. However, in many situations such knowledge is available. For instance, all factors are non-negative and all rows of the data matrix are non-negative linear combinations of the pure factors. These properties can be exploited to estimate the pure factors. One of the earliest approaches is curve resolution, developed by Lawton and Sylvestre [7], which was applied to two-component systems. Later on, several adaptations have been proposed to solve more complex systems [8-10].

34.2.5.1 Curve resolution of two-factor systems

In their fundamental paper on curve resolution of two-component systems, Lawton and Sylvestre [7] studied a data matrix of spectra recorded during the elution of two constituents. One can decide either to estimate the pure spectra (and derive from them the concentration profiles) or the pure elution profiles (and derive from them the spectra) by factor analysis. Curve resolution, as developed by Lawton and Sylvestre, is based on the evaluation of the scores in the PC-space. Because the scores of the spectra in the PC-space defined by the wavelengths have a clearer structure (e.g. a line or a curve) than the scores of the elution profiles in the PC-space defined by the elution times, curve resolution usually estimates pure spectra. Thereafter, the pure elution profiles are estimated from the estimated pure spectra. Because no information on the specific order of the spectra is used, curve resolution is also applicable when the spectra are not arranged in a specific sequence. As explained before, the scores of the spectra can be plotted in the space defined by the two principal components of the data matrix. The appearance of the scores plot depends on the way the rows (spectra) and the columns have been normalized. If the spectra are not normalized, all spectra are situated in a plane (see Fig. 34.5). From the origin two straight lines depart, which are connected by a curved line. We have already explained that the straight line segments correspond with the pure spectra, which are located in the wings of the elution bands (selective retention time zones).
Fig. 34.13. Score plot (t1 vs t2) of the spectra given in Fig. 34.2 after normalisation. The points A and B are the purest spectra in the data set. The points A' and B' are the spectra at the boundaries of the non-negativity constraint.
The curved line segment corresponds with the mixture spectra. When no pure spectra are present, no straight lines are observed, which makes it very difficult to extract the pure spectra from this scores plot. Such plots are the basis of the HELP method to assess peak purity in liquid chromatography, which is discussed in Section 34.3.3. When the spectra are normalized to a sum equal to one, the scores plot given in Fig. 34.5 becomes much simpler. All spectra are situated on a straight line (see Fig. 34.13). This is caused by the fact that by normalization of the rows the dimensionality of the data is reduced by one (because the spectra are not mean centred, we still find two significant PCs). The spectra on this line are arranged in a certain order. At both ends are the two spectra with the smallest contamination from the other compound: the purest spectrum of the first compound (A) and the purest spectrum of the second compound (B). In fact, these two spectra are the two purest rows in X, which are the best available estimators of the two pure spectra. These estimates can be improved. First, we know that the normalized pure spectra are located on the line passing through the line segment (A-B) of the mixture spectra. Thus the pure spectra should be found somewhere on the extrapolated ends of the line. The question arises how far one should extrapolate. At this point one needs constraints to bound the solution. The first constraint is that the absorbances are non-negative. At a given critical point (say A') located on the line, this constraint may be violated. This point has to be found. Let us explain how this is done with an example.
Suppose that 15 spectra have been measured at 20 wavelengths and that after normalization to a sum equal to one they are compiled in a data matrix. Suppose also that by PCA the following score and loading matrices are obtained:
T =
 0.340   0.073
 0.339   0.071
 0.339   0.068
 0.338   0.064
 0.336   0.058
 0.333   0.049
 0.330   0.035
 0.324   0.027
 0.317  -0.008
 0.309  -0.038
 0.300  -0.072
 0.289  -0.107
 0.279  -0.140
 0.271  -0.169
 0.265  -0.191

V =
 0.372   0.160
 0.609   0.391
 0.148  -0.208
 0.218  -0.450
 0.314  -0.583
 0.402  -0.070
 0.088  -0.230
 0.101  -0.307
 0.058  -0.063
 0.101  -0.005
 0.100   0.023
 0.206   0.144
 0.139   0.110
 0.239   0.212
 0.016  -0.007
 0.010  -0.0108
 0.0035  0.0006
 0.002  -0.0009
 0.0004 -0.001
 0.000   0.000                                        (34.7)
The scores and the loadings are plotted in Figs. 34.13 and 34.4. The fact that all absorbances of the pure spectra s1 and s2 should be non-negative means that the linear combination of the two PCs, s = t1 v1 + t2 v2, should be non-negative for each element s_k of s, i.e.:

t1 v1k + t2 v2k ≥ 0   for all k = 1 to p    (34.8)
Because the elements of v1 all have the same sign, contrary to the elements of v2 which have mixed signs, the ratios t2/t1 satisfying eq. (34.8) are found in the interval [m, n], where m is equal to
-min_k (v1k / v2k)   (for v2k > 0)    (34.9)

and n is equal to

min_k (-v1k / v2k)   (for v2k < 0),

with t1 > 0.
The boundaries m and n define two lines in the (v1, v2) plane: the m-line t2 = m t1 and the n-line t2 = n t1. The intersection of these two lines with the line defined by the mixture spectra gives two points A' and B' which are the estimates of the pure spectra. The intervals (A-A') and (B-B') define the solution bands between which the pure spectra are situated. For the example discussed here, the points A' and B' are found as follows. First, we calculate m and n by dividing all elements of v1 by the corresponding elements of v2, which gives: (v11/v21; v12/v22; v13/v23; ...; v1p/v2p) = (2.32; 1.55; -0.711; -0.484; -0.538; -5.74; -0.382; -0.329; -0.92; -20.2; 4.34; 1.43; 1.26; 1.13; -2.28; -0.93; 5.8; -2.0; -0.4; -). The value 1.13 is the smallest positive value and -0.329 has the smallest absolute value of all negative numbers. Therefore, m = -1.13 and n = 0.329, leading to -1.13 ≤ t2/t1 ≤ 0.329; the m-line is t2 = -1.13 t1 and the n-line is t2 = 0.329 t1. In Fig. 34.13 the points A' and B' are situated at the intersections of the m- and n-lines with the line defined by the mixture spectra. The scores are respectively equal to (0.3620, 0.1187) and (0.2415, -0.273). According to eq. (34.8) the spectra corresponding to these scores are s1 = 0.2415 v1 - 0.273 v2 and s2 = 0.3620 v1 + 0.1187 v2, which are plotted in Fig. 34.14. It should be borne in mind that the two spectra given in Fig. 34.14 are estimates of the pure spectra which exactly fulfil the constraints.
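The bounds m and n of eq. (34.9) are easily computed once v1 and v2 are known. The sketch below (Python/NumPy) uses the loadings of eq. (34.7) and reproduces m = -1.13 and n = 0.329; it is only an illustration of this calculation, not of the full curve-resolution procedure.

```python
import numpy as np

# First and second loading vectors v1, v2 of eq. (34.7) (20 wavelengths)
v1 = np.array([0.372, 0.609, 0.148, 0.218, 0.314, 0.402, 0.088, 0.101, 0.058,
               0.101, 0.100, 0.206, 0.139, 0.239, 0.016, 0.010, 0.0035, 0.002,
               0.0004, 0.0])
v2 = np.array([0.160, 0.391, -0.208, -0.450, -0.583, -0.070, -0.230, -0.307,
               -0.063, -0.005, 0.023, 0.144, 0.110, 0.212, -0.007, -0.0108,
               0.0006, -0.0009, -0.001, 0.0])

# Eq. (34.9): bounds on t2/t1 that keep s = t1*v1 + t2*v2 non-negative (t1 > 0)
ratio = v1 / np.where(v2 != 0, v2, np.nan)   # v1k/v2k, ignoring v2k = 0
m = -np.nanmin(ratio[v2 > 0])                # = -1.13  (m-line: t2 = m*t1)
n = np.nanmin(-ratio[v2 < 0])                # =  0.329 (n-line: t2 = n*t1)
print(m, n)
```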
Fig. 34.14. Estimated pure spectra at the points A' and B', with respectively the scores (t_A'1, t_A'2) and (t_B'1, t_B'2).
Fig. 34.15. Solution band for the spectrum of fluoranthene.
Two other estimates of the pure spectra are the purest mixture spectra A and B. When these are plotted in the same figure (see Figs. 34.15 and 34.16) a good impression is obtained of the remaining uncertainty (or solution bands) in the spectra. The widths of the solution bands depend on the similarity of the spectra and on the concentration ratios spanned by the mixtures. In HPLC with DAD these concentration ratios depend on the
Fig. 34.16. Solution band for the spectrum of chrysene.
chromatographic overlap and on the concentration ratio of the two components. The effect of these two factors has been discussed in the literature [11]. The solution bands can be somewhat narrowed by taking into consideration a second constraint, namely that the concentrations obtained by solving eq. (34.3) should be non-negative. By combining eqs. (34.3) and (34.4), it follows that each measured spectrum x_i can be represented as:

x_i = c_i1 s1 + c_i2 s2 = c_i1 (t_A'1 v1 + t_A'2 v2) + c_i2 (t_B'1 v1 + t_B'2 v2)

where (t_A'1, t_A'2) and (t_B'1, t_B'2) are the scores of the two spectra meeting the non-negativity criterion. Equally, each mixture spectrum is a linear combination of the two first principal components:

x_i = t_i1 v1 + t_i2 v2

Therefore, for all spectra i = 1,...,n the scores t_i1 and t_i2 are given by:

t_i1 = c_i1 t_A'1 + c_i2 t_B'1   and   t_i2 = c_i1 t_A'2 + c_i2 t_B'2

By solving these two equations for c_i1 and c_i2 and by applying the condition that c_i1 ≥ 0 and c_i2 ≥ 0, we find [7] that the scores of the first pure spectrum (t_A'1 and t_A'2) should fulfil the condition:
Fig. 34.17. Non-negativity constraints (m- and n-line) applied on the scores of the normalized chromatograms.
t_A'2 / t_A'1 ≥ max_i (t_i2 / t_i1) = m'

and that the scores of the second pure spectrum (t_B'1 and t_B'2) should fulfil the condition:

t_B'2 / t_B'1 ≤ min_i (t_i2 / t_i1) = n'
If these conditions are not fulfilled by the scores (t_A'1, t_A'2) and (t_B'1, t_B'2), the pure spectra should be adapted by taking the intersections of the line defined by the mixture spectra with the lines t2 = m' t1 and t2 = n' t1. In our example the above conditions are met and therefore the solution bands are not affected. We can also estimate the pure column factors (here the elution profiles). Therefore, we take the transpose of X (rows are now chromatograms), normalize all rows to a sum equal to one, and calculate the PCs. These PCs represent abstract chromatograms, as is shown in Fig. 34.6. The scores of the chromatograms are again situated on a straight line (Fig. 34.17). The purest elution profiles are at the two ends of the line segment. By applying eq. (34.9) one obtains the m- and n-lines defined by the constraint that the elution profiles should be non-negative (see Fig. 34.17):
Fig. 34.18. Elution profiles estimated by curve resolution.
t2 = -0.5103 t1 (m-line)
t2 = 0.1909 t1 (n-line)

The intersection of these two lines with the line defined by the mixture chromatograms gives the scores of the 'pure' elution profiles, (0.360, 0.070) and (0.325, -0.166), resulting in the chromatograms given in Fig. 34.18. Knowing the pure elution profiles, the spectra are easily calculated by solving eq. (34.3).

34.2.5.2 Curve resolution of three-factor systems

Having derived a solution for two-component systems, we could try and extend this solution to three-component systems. A PCA of a data set of spectra of three-component mixtures yields three significant eigenvectors and a score matrix with three scores for each spectrum. Therefore, the spectra are located in a three-dimensional space defined by the eigenvectors. For the same reason as explained for the two-component system, after normalization the ternary spectra are found on a surface with one dimension less than the number of compounds, in this case a plane. If the ternary mixtures span the whole concentration range, i.e. from 0 to 100% for each component, some of the mixtures contain only one or two compounds. As explained before, the binary mixtures should be found on straight lines. In this
Fig. 34.19. Three-component spectra (normalized) projected in the space defined by the two first PCs.
case, three straight lines should be found, representing the binary mixtures of the compounds (A,B), (A,C) and (B,C), respectively. These three lines define a triangle. Ternary mixtures are located inside the triangle. The corners of the triangle represent the pure spectra, and on each side are the binary mixtures (see Fig. 34.19). The problem of finding the pure spectra (or pure factors) in a three-component (or three-factor) system is to first estimate the position of the lines on which the binary spectra are situated, followed by an estimate of the intersections (corner points). If no spectra of binary mixtures are in the data set, these lines are not directly observed and have to be estimated. In a similar way as for the two-component case, these lines and corner points are found by applying constraints on the spectra (non-negativity of the absorbances) and on the resulting concentrations. Details of this method are found in Ref. [9]. Clearly, the factor analysis of three-component systems by curve resolution is much more complex than it is for the two-component case. In fact, the generalisation of the method to systems with more than three components appears to be impractical.

34.2.6 Factor rotation by iterative target transformation factor analysis (ITTFA)

Iterative target transformation factor analysis (ITTFA) is an extension of TTFA and has been introduced by Hopke et al. [12] in environmetrics and by Gemperline [13,14] and Vandeginste et al. [15] in chromatography. The idea behind ITTFA is
Fig. 34.20. ITTFA: projection of the input vector in1 in the PC-space gives out1. A new input target in2 is obtained by adapting out1 to specific constraints. in2 projected in the PC-space gives out2.
that, in the absence of good candidate targets to be tested by TTFA, one defines an initial target, which is gradually improved until the TTFA test passes. The first step of ITTFA is to project an input target on the eigenvector space defined by the data matrix of the mixtures (Fig. 34.20), for example in1 into out1. Next, the projected vector is inspected on its likelihood to be a pure factor. By examining out1, we may find that some loadings do not fulfil specific requirements; e.g. if the factor should represent an elution profile, then all values should be non-negative and no shoulders or secondary maxima should be present. Instead of rotating out1 in the PC1-PC2 plane, we adapt the loadings of this vector in such a way that it obeys all imposed constraints; i.e. all negative values are replaced by zeros and, in the case of chromatography for instance, all secondary peaks are cut away. This step is schematically shown in Fig. 34.21. The consequence of this adaptation is that the vector out1 is lifted from the plane and rotated over a small angle, giving in2. By this procedure we hope to obtain a result that is closer to one of the true factors than in1 was. In the following step the vector in2 is considered as a new target, which in its turn is submitted to a TTFA. As a result, a second projected output target out2 is obtained. The overall result is a rotation of out1 into out2, as is schematically shown in Fig. 34.20. The procedure of inspection and adaptation of the loadings is then repeated until the difference between output and input targets converges to zero, indicating that a stable solution is obtained. This solution is one of the pure factors. To find a second factor, the procedure is repeated with a new initial target. The main condition for success is that good constraints can be formulated for adapting the targets and that good initial targets can be found.
Fig. 34.21. Example of the adaptation of projected targets.
In summary, ITTFA comprises the following steps:
(1) Calculate the significant PCs from the data matrix.
(2) Choose an initial target in1.
(3) Project in1 in the space defined by the eigenvectors by applying eq. (34.6); an output target out1 is obtained.
(4) Evaluate the correlation between the input and output target; if this correlation is larger than a specified value, the procedure has converged to a factor. Repeat the procedure with another initial target, until all pure factors are estimated.
(5) Otherwise, adapt the projected target by applying constraints. This gives a new target to be tested. Return to step 3.
Let us explain the above steps of ITTFA in more detail for data sets obtained in HPLC with DAD.
(1) Taking into consideration the fact that initial targets have to be formulated and that the projected targets have to be inspected and adapted, we estimate the pure elution profiles instead of the pure spectra. Therefore, the PCA is carried out in the elution time space and the resulting principal components represent abstract elution profiles (see Fig. 34.6). The factors we are looking for are the elution profiles of the different compounds. The first abstract factor closely resembles one
of the elution profiles. The other one also has a smooth shape, but with negative values because it is orthogonal to the first one.
(2) Several procedures have been proposed to select a target. The first one is the so-called needle search [13,14,16]. This procedure starts the ITTFA with as many needle targets (i.e., all elements of the target are zero except one, which has a value equal to one) as there are variables. Each needle is projected once into the space defined by the eigenvectors, and the correlation between input and output target is evaluated. The nc (number of significant eigenvectors) best targets are retained to start the iterations. A second procedure is to apply first a VARIMAX rotation on the eigenvectors (see Section 34.2.3). Because elution profiles are present in specific windows of the data matrix and because they are almost orthogonal, a VARIMAX rotation yields good initial profiles for each factor, indicating at least the position of the peaks.
(3) Targets are adapted by replacing all negative values by zero. Furthermore, all secondary peaks or shoulders are removed. Some typical examples are given in Fig. 34.21. Because all elution profiles are normalized to a sum equal to one, the resulting factors are also normalized. The overall result of ITTFA is therefore a set of peaks with equal peak area or height. Depending on the kind of information wanted, either the spectra or the elution profiles, ITTFA is concluded by a terminating step. The spectra are obtained by solving eq. (34.3), where X is the original, unnormalized data set and C contains the normalized elution profiles. After the spectra are obtained, we can denormalize the elution profiles by dividing them by the norms of the spectra. This is easily seen from the fact that X = C K S, where C and S are respectively the elution profiles and spectra, normalized to a sum equal to one, and K is a diagonal matrix with on its diagonal the norms of the spectra. From C and X we obtain KS. The lengths of the spectra KS are equal to the diagonal elements of K. Multiplication of C by K in its turn gives CK, the denormalized elution profiles. An example of the results obtained by ITTFA is given in Figs. 34.22 and 34.23. The solid lines in the two figures are the estimated spectra and elution profiles; the dotted lines are the true spectra and profiles. The ability to estimate spectra and elution profiles by ITTFA depends on the similarity of the spectra, the overlap of the elution profiles and the relative intensity of the signal. By simulation, Strasters et al. [6] evaluated the quality of the spectra as a function of the peak overlap and spectrum similarity. The quality of the spectra was expressed as the correlation coefficient between the true and estimated spectrum. As can be seen from Fig. 34.24, spectra of good quality are obtained for a chromatographic resolution Rs > 0.45. The quality of the elution profiles, expressed as the accuracy of the peak areas, has been evaluated by Vandeginste et al. [11].
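A schematic implementation of the ITTFA loop, using simulated data and only a non-negativity constraint (a sketch, not the full procedure of refs. [13-15]), could look as follows; whether the iterations recover the pure profile depends on the overlap and on the constraints used, as discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated LC-DAD data: 40 elution times x 30 wavelengths, two co-eluting compounds
t = np.arange(40.0)
C = np.column_stack([np.exp(-0.5 * ((t - 15) / 4.0) ** 2),
                     np.exp(-0.5 * ((t - 22) / 4.0) ** 2)])   # elution profiles
S = np.abs(rng.normal(size=(2, 30)))                          # pure spectra (hypothetical)
X = C @ S + 1e-3 * rng.normal(size=(40, 30))

# PCA in the elution-time space: the left singular vectors are abstract elution profiles
nc = 2
U, sv, Vt = np.linalg.svd(X, full_matrices=False)
E = U[:, :nc]                                  # 40 x nc

def ittfa(target, E, n_iter=50):
    """Alternate projection (TTFA step) and constraint adaptation (non-negativity)."""
    for _ in range(n_iter):
        out = E @ (E.T @ target)               # project the target on the PC space
        target = np.clip(out, 0.0, None)       # adapt: set negative values to zero
        target /= target.sum()                 # keep the profile normalized to unit sum
    return out

needle = np.zeros(40)
needle[15] = 1.0                               # needle target at the first peak maximum
profile = ittfa(needle, E)                     # iterates towards a non-negative profile
print(np.round(profile, 3))
```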
Fig. 34.22. Spectra estimated by ITTFA compared to the true spectra.
Fig. 34.23. Individual elution profiles and the total profile estimated by ITTFA compared to the true profiles.
Fig. 34.24. Quality (correlation coefficient between true and estimated spectra) of the spectra estimated by ITTFA, depending on the similarity of the spectra and the chromatographic resolution.
One of the early applications of ITTFA was developed by Hopke et al. in the area of geochemistry [12]. By measuring the elemental composition in the cross-section of two lava beds, Hopke was able to derive the elemental composition of the two lava bed sources. This application is very similar to the air dust case given in the introduction of this chapter (see Section 34.1).
34.3 Evolutionary and local rank methods

Methods for evolutionary rank analysis are explained and discussed in this section. The different approaches of evolutionary rank analysis have in common that the two-way data structure is analyzed piece-wise to locally reveal the presence of the analytes. A reference to the review of evolutionary methods by Toft et al. is included in the additional recommended reading list at the end of this chapter.

34.3.1 Evolving factor analysis (EFA)

As previously discussed, data sets can in principle be factor-analyzed column-wise or row-wise. The fact that the rows follow a certain logical sequence may be exploited for finding the pure row factors. We illustrate this on the four-component chromatogram given in Fig. 34.25. The compounds are present in well-defined time windows, e.g. compound A in window t1-t5, compound B in window t2-t6, compound C in window t3-t7 and compound D in window t4-t8. The ordering provides additional information and this can be exploited to estimate the pure spectra, from which the respective concentration profiles are then estimated. Therefore, sub-matrices are formed by adding rows to an initial top sub-matrix T0 (top down), or by adding rows to a bottom sub-matrix B0 (bottom up). In our example the rank of these matrices increases from one to four, as is schematically shown in Fig. 34.25. By analysing these ranks as a function of the number of added rows, time windows are derived where one, two, three, etc. significant PCs are present. Such an analysis of the rank as a function of the number of rows of X included in the principal components analysis is the principle of evolving principal components analysis (EPCA), which is the first step of evolving factor analysis (EFA) developed by Maeder et al. [17-19]. As explained in Chapter 17, an eigenvalue is associated with each principal component. This eigenvalue expresses the amount of variance described by the eigenvector. By determining the eigenvalues or variances associated with each PC for each sub-matrix and plotting these eigenvalues (or the log of the eigenvalues) as a function of the number of rows ni included in the sub-matrix, a typical plot, dependent on the studied system, is obtained. Figure 34.26 shows the plot obtained for a HPLC-DAD data set. This
Fig. 34.25. Time windows in which compounds are present in a composite 4-component peak, with the ranks of the data matrices formed by adding rows to a top matrix T0 (top down) or to a bottom matrix B0 (bottom up).
figure can be interpreted as follows. For each sub-matrix a new significant eigenvector appears each time a new compound is introduced in the spectra, thus at t = t1, t2, t3 and t4. This is observed as an increase of the corresponding eigenvalue. From this plot it is not yet possible to derive the compound windows, as this plot indicates the appearance of a new compound but not its disappearance. Therefore, a second EPCA is carried out, but now in the reversed order, i.e. one starts with the initial matrix B0 and adds rows bottom up. A similar plot (see Fig. 34.27) is now obtained, but in the reversed order. New factors appear at t = t8, t7, t6 and t5. In the forward direction this means that components disappear at t = t5, t6, t7 and t8. Because the widths of elution bands in HPLC in a narrow region of retention times are more or less equal, the compound which appears first in the spectra should also be the first one to disappear. Therefore, the compound windows are found by connecting the line indicating the first appearing compound in Fig. 34.26 with the last appearing compound in Fig. 34.27. Both figures can be combined in a single figure (see Fig. 34.28), from which the concentration windows can be reconstructed as indicated.
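The forward and backward eigenvalue plots of Figs. 34.26 and 34.27 can be generated along the following lines (Python/NumPy, simulated data; a sketch of the principle only).

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated HPLC-DAD matrix: 60 spectra (rows) x 40 wavelengths, three compounds
t = np.arange(60.0)[:, None]
C = np.exp(-0.5 * ((t - np.array([20.0, 28.0, 36.0])) / 4.0) ** 2)  # elution profiles
S = np.abs(rng.normal(size=(3, 40)))                                # pure spectra
X = C @ S + 1e-3 * rng.normal(size=(60, 40))

def evolving_eigenvalues(X, max_pc=4):
    """Eigenvalues of the sub-matrix X[:i] for i = 2..n (forward evolving PCA)."""
    n = X.shape[0]
    ev = np.full((n, max_pc), np.nan)
    for i in range(2, n + 1):
        s = np.linalg.svd(X[:i], compute_uv=False)
        ev[i - 1, :min(max_pc, s.size)] = s[:max_pc] ** 2
    return ev

forward = evolving_eigenvalues(X)         # a line rises where a new compound appears
backward = evolving_eigenvalues(X[::-1])  # read in reverse: where compounds disappear
print(np.log10(forward[-1, :]))           # eigenvalues of the complete data matrix
```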
Fig. 34.26. Eigenvalues when adding rows to T0 (forward evolving PCA).
Fig. 34.27. Eigenvalues when adding rows to B0 (backward evolving PCA).
Fig. 34.28. Reconstructed concentration profiles from the combination of Figs. 34.26 and 34.27.
Fig. 34.29. Construction of a sub-matrix containing only the zero rows of c_i.
Tauler and Casassas [20] applied this technique to reconstruct the concentration profiles in equilibria studies. One could consider these bell-shaped profiles of the eigenvalues as a first rough estimate of the concentration profiles C in eq. (34.3). It is now possible to calculate the pure spectra S in an iterative way by
solving eq. (34.3) first for S (by a least-squares method). Because C is only a first approximation, the spectra obtained have negative values, which can be set to zero. In a next step the corrected spectra are used to solve eq. (34.3) for C. These steps are repeated until S and C are stable. This is the principle of alternating regression, also called alternating least squares, introduced by Karjalainen [21]. This iterative resolution method has been successfully applied to many different analytical data, including chromatographic examples (see recommended additional reading). An alternative and faster method estimates the pure spectra in a single step. The compound windows derived from an EPCA are used to calculate a rotation matrix R by which the PCs are transformed into the pure spectra: X = C S^T = T* R R^-1 V*^T. Consequently, C = T* R, where T* is the score matrix of X. Focusing on a particular component c_i (the ith column of the matrix C) one can write c_i = T* r_i, where r_i is the ith column of R. Because compound i is not present in the shaded areas of c_i (see Fig. 34.29), the values of c_i in these areas are zero. This allows us to calculate the rotation vector r_i. Therefore, all zero rows of c_i are combined into a new column vector c_i^0, and the corresponding rows of T* are combined into a new matrix T*^0. As a result one obtains:

c_i^0 = T*^0 r_i = 0
(34.10)
The procedure is schematically shown in Fig. 34.29. Equation (34.10) represents a homogeneous system of equations with the trivial solution r_i = 0. Because component i is absent in the concentration vector, this component does not contribute to the matrix T*^0. As a consequence the rank of T*^0 is only nc - 1, and a non-trivial solution can therefore be calculated. The value of one element of r_i is arbitrarily chosen and the other elements are calculated by a simple regression [17]. Because the solution depends on the initially chosen value, the size (scale) of the true factors remains undetermined. By repeating this procedure for all columns c_i, one obtains all columns of R, the entire rotation matrix.
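The alternating regression step described above can be sketched as follows (Python/NumPy; a minimal illustration assuming a crude window-based initial estimate of C, not the full EFA implementation of refs. [17-19,21]).

```python
import numpy as np

def alternating_regression(X, C0, n_iter=100):
    """Iteratively solve X = C S^T for S and C with non-negativity constraints."""
    C = C0.copy()
    for _ in range(n_iter):
        # Solve X = C S^T for S^T in the least-squares sense, then clip negatives
        St, *_ = np.linalg.lstsq(C, X, rcond=None)
        St = np.clip(St, 0.0, None)
        # Solve X^T = S C^T for C^T, then clip negatives
        Ct, *_ = np.linalg.lstsq(St.T, X.T, rcond=None)
        C = np.clip(Ct.T, 0.0, None)
    return C, St.T

# Example with simulated data (two compounds, 50 spectra x 30 wavelengths)
rng = np.random.default_rng(3)
t = np.arange(50.0)[:, None]
C_true = np.exp(-0.5 * ((t - np.array([18.0, 30.0])) / 5.0) ** 2)
S_true = np.abs(rng.normal(size=(30, 2)))
X = C_true @ S_true.T + 1e-3 * rng.normal(size=(50, 30))

# Crude initial windows (e.g. from an evolving PCA) as a first guess of C
C0 = np.zeros_like(C_true)
C0[5:30, 0] = 1.0
C0[20:45, 1] = 1.0
C_hat, S_hat = alternating_regression(X, C0)
print(C_hat.max(axis=0))   # recovered (scaled) elution profiles
```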
34.3.2 Fixed-size window evolving factor analysis (FSWEFA)

When the rows of a data matrix follow a certain pattern, e.g. the appearance and disappearance of compounds as a function of time, a fixed-size window EFA is applicable. This is the case, for instance, for data sets generated by hyphenated measurement techniques such as HPLC with DAD. Fixed-size window EFA [22] can be applied for detecting the presence of minor compounds (< 1%) and for the resolution of a data set into its components (pure spectra and elution profiles).
Fig. 34.30. Principle of fixed-size window evolving factor analysis.
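A minimal sketch of the principle shown in Fig. 34.30 (Python/NumPy, simulated data): a window of seven consecutive spectra is moved over the data set and the eigenvalues are recorded at each window position.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated LC-DAD peak (80 times x 40 wavelengths) with a 1% impurity shifted in time
t = np.arange(80.0)
C = np.column_stack([np.exp(-0.5 * ((t - 40) / 5.0) ** 2),
                     0.01 * np.exp(-0.5 * ((t - 48) / 5.0) ** 2)])
S = np.abs(rng.normal(size=(2, 40)))
X = C @ S + 1e-4 * rng.normal(size=(80, 40))

def fixed_size_window_efa(X, width=7, max_pc=3):
    """Eigenvalues of a PCA computed in a window of `width` consecutive spectra
    that is moved over the data set (one eigenvalue-line per column)."""
    n = X.shape[0]
    ev = np.full((n - width + 1, max_pc), np.nan)
    for i in range(n - width + 1):
        s = np.linalg.svd(X[i:i + width], compute_uv=False)
        ev[i, :min(max_pc, s.size)] = s[:max_pc] ** 2
    return ev

lines = fixed_size_window_efa(X)
print(np.log10(lines[35:55, :2]))   # the second eigenvalue-line rises where the impurity elutes
```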
Fig. 34.31. Eigenvalue plot obtained by fixed-size moving window FA for a pure peak.
Contrary to EFA, which calculates a PCA of a sub data matrix to which rows are added, in fixed-size window EFA a small window of rows is selected which is moved over the data set (see Fig. 34.30). Typically, a window of seven consecutive spectra is used. At each new position of the window a PCA is calculated and the eigenvalues associated with each PC are recalculated and plotted as a function of the position of the window. This yields a number of eigenvalue-lines. Figure 34.31 shows the eigenvalue-lines obtained for a simulated pure LC-DAD peak. In the baseline zones (null spectra) all eigenvalue-lines are noisy horizontal lines. In the selective retention time regions (one component present) the eigenvalue-line associated with the first PC follows the appearance and disappearance of the
Fig. 34.32. Detectability of a minor compound by fixed-size window evolving factor analysis.
compound. The other eigenvalue-lines remain noisy horizontal lines. From the variations observed for eigenvalue-lines representing noise one can derive a critical value to decide on the significance of an eigenvalue. Figure 34.32 demonstrates the capabilities of this method for the detection of a minor compound which is partly separated from the main constituent. Impurities as low as 1% are detected when the chromatographic resolution is > 0.3. Eigenstructure tracking analysis (ETA), developed by Toft and Kvalheim [23], is very similar to FSWEFA. Here, one starts with a window of size 2 and the size is increased by one until a singular value representing noise is obtained, i.e. the window size is one higher than the number of co-eluting compounds.

34.3.3 Heuristic evolving latent projections (HELP)

Heuristic evolving latent projections (HELP) [24], which was introduced in Chapter 17, concentrates on the evaluation of the results of a local principal components analysis (LPCA) performed on a part of the data matrix. Both the scores and the loadings obtained by a local PCA are inspected for the presence of structures which indicate selective retention time regions (pure spectra) and selective wavelengths. Because the data are not pre-treated (normalization or any other type of pretreatment), the score plots obtained are very similar to the score plots previously discussed in Section 34.2.1. Straight lines departing from the origin in the scores plot indicate pure spectra. HELP can be conducted in the wavelength space and in the retention time space. Purity checks and the search for selective retention time
Fig. 34.33. Simulated three-component system: (a) spectra (b) elution profiles and (c) scores plot obtained by a global PCA with HELP.
regions are usually conducted in the wavelength space. Selective wavelengths are found in the retention time space, and if present yield directly the pure elution profiles. The method also provides what is called a data-scope, which zooms in on a particular part of the data set. The functioning of the data-scope is illustrated with a simulated three-component system given in Figs. 34.33a and b. The scores plot (Fig. 34.33c) obtained by a global PCA in wavelength space shows the usual line structures. In this case the data-scope technique is applied to evaluate the purity of the up-slope and down-slope elution zones of the peak. Therefore, data-scope performs a local PCA on the up-slope and down-slope regions of the data. In Fig. 34.34 the first three local PCs with the associated eigenvalues are given for four retention time regions: (a) a zero component region (Fig. 34.34a), (b) selective chromatographic regions at the up-slope side (Fig. 34.34b), (c) and at the
Fig. 34.34. The three first principal components obtained by a local PCA: (a) zero-component region, (b) up-slope selective region, (c) down-slope selective region, (d) three-component region. The spectra included in the local PCA are indicated in the score plot and in the chromatogram.
down-slope side (Fig. 34.34c), and (d) a three-component region in the centre of the peak profile (Fig. 34.34d). The selected regions of the data set are indicated in the scores plot (wavelength space) and in the chromatogram. The loading patterns of the three PCs obtained for the suspected selective chromatographic regions clearly show structure in the first principal component (Fig. 34.34b,c). The significance of the second principal component should be confirmed by separating the variation originating from this component from the variation due to experimental or instrumental errors. This is done by comparing the second eigenvalue in the suspected selective region with the first eigenvalue in the zero-component region (see Table 34.1). In the first instance, the rank of the investigated local region was established by means of an F-test. For the up-slope
Fig. 34.34 continued. (b) Up-slope selective region.
TABLE 34.1
Local rank analysis of suspected selective regions in a LC-DAD data set (see Fig. 34.34)

Eigenvalues
PC    Zero-compound region    Suspected selective region (up-slope)    Suspected selective region (down-slope)
1     0.01033                 0.48437                                  0.54828
2     0.00969                 0.0090                                   0.0081
3     0.00860                 0.00481                                  0.00569
Fig. 34.34 continued. (c) Down-slope selective region.
region F = 0.01033/0.00900 = 1.148 and for the down-slope region F = 0.01033/ 0.00881 = 1.172. The number of degrees of freedom is equal to the number of rows included in the local rank analysis times the number of wavelengths. Both values are below the critical value, indicating that the two selected regions are pure. As the number of spectra used in the zero-component region increases, the critical F value tends towards 1 and one risks setting the detection limit too high. To avoid that problem, a non-parametric test was introduced [24]. On the other hand, when F exceeds the critical F-value, one should be careful in concluding that this variation is due to a chemical compound. By the occurrence of artifacts, such as non-linearities, baseline drift and heteroscedasticity of the noise, the second principal component may falsely show structure [25,26] and lead to a false positive
Fig. 34.34 continued. (d) Three-component region.
identification of a component. This may seriously complicate the estimation of pure factors when applying the methods explained in this chapter. This aspect is discussed in more detail in Section 34.6. To overcome these problems, the HELP technique includes a procedure to detect artifacts. For instance, baseline points should be located in the origin of the scores plot. If this is not the case for a matrix consisting of baseline spectra taken from the left and right hand side of the peak of interest, a drift may be present. The effectiveness of a baseline correction (e.g. subtraction of the baseline) can be monitored by repeating the local PCA.
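The local-rank comparison of Table 34.1 can be mimicked with a few lines of code (a sketch; X and the row ranges of the zero-component and suspected selective regions are assumed to be available, and the F ratio is formed exactly as in the worked example above).

```python
import numpy as np

def local_eigenvalues(X, rows, max_pc=3):
    """Eigenvalues of a local PCA on the selected rows (retention times) of X."""
    s = np.linalg.svd(X[rows], compute_uv=False)
    return s[:max_pc] ** 2

def purity_f_ratio(X, zero_rows, suspect_rows):
    """F ratio as in the worked example in the text: first eigenvalue of the
    zero-component region divided by the second eigenvalue of the suspected
    selective region (e.g. F = 0.01033 / 0.00900 for the up-slope region)."""
    ev_zero = local_eigenvalues(X, zero_rows)
    ev_suspect = local_eigenvalues(X, suspect_rows)
    return ev_zero[0] / ev_suspect[1]

# Usage sketch (X and the index ranges are hypothetical):
# F_up = purity_f_ratio(X, slice(0, 10), slice(25, 32))
# F_up is then compared with the critical F-value, as discussed in the text.
```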
34.4 Pure column (or row) techniques

Pure variables are fully selective for one of the factors. This means that only one pure factor contributes to the values of that variable. When the pure variables or selective wavelengths for each factor are known, the pure spectra can be calculated in a straightforward manner from the mixture spectra by solving:

X = P S^T
(34.11)
where X is the matrix of the mixture spectra, S is the matrix of the unknown pure spectra and P is a sub-matrix of X which contains the nc pure columns (one selective column per factor). The pure spectra are then estimated by a least squares procedure:

S^T = (P^T P)^-1 P^T X

The pure variable technique can be applied in the column space (wavelength) as well as in the row space (time). When applied in the column space, a pure column is one of the column factors. In LC-DAD this is the elution profile of the compound which contains that selective wavelength in its spectrum. When applied in the row space, a pure row is a pure spectrum, measured in a zone where only one compound elutes.

34.4.1 The variance diagram (VARDIA) technique

In Section 34.2 we explained that factor analysis consists of a rotation of the principal components of the data matrix under certain constraints. When the objects in the data matrix are ordered, i.e. the compounds are present in certain row-windows, then the rotation matrix can be calculated in a straightforward way. For non-ordered spectra with three or fewer components, solution bands for the pure factors are obtained by curve resolution, which starts with looking for the purest spectra (i.e. rows) in the data matrix. In this section we discuss the VARDIA [27,28] technique, which yields clusters of pure variables (columns) for a certain pure factor. The principle of the VARDIA technique is schematically shown in Fig. 34.35. Let us suppose that v1 and v2 are the two first principal components of a data matrix with p variables. The plane defined by v1 and v2 is positioned in the original space of the p variables (see Fig. 34.35). For simplicity only the variables x1, x2 and x3 are included in the figure. The angle between the axes x1 and x2 is small because these two variables are correlated, whereas the axis x3 is almost orthogonal to x1 and x2, indicating a very low correlation. The loadings v1 and v2 are the projections of all variables on these principal components. A high loading for a variable on v1 means
Fig. 34.35. Principle of the VARDIA technique demonstrated on a two-component system. f is rotated in the v1-v2 plane. The variables x1, x2 and x3 are projected on f during the rotation. In this figure f is oriented in the direction of x1 and x2.
that the angle of this variable axis with v1 is small. Conversely, a small loading for the variable on v1 means that the angle with this variable is large. Instead of projecting the variables on the principal components, one can also project them on any other vector f located in the space defined by the PCs which makes an angle Γ with v1. Vector f is defined as f = v1 cos Γ + v2 sin Γ. The loading of variable j on this vector is:

f_j = v1j cos Γ + v2j sin Γ
(34.12)
The larger the loading f_j of variable j on f, the smaller the angle between f and this variable. This means that f lies in the direction of that variable; f may lie in the direction of several correlated variables (x1 and x2 in Fig. 34.35), or a cluster of variables. Because correlated variables are assumed to belong to the same pure factor, f could be one of the pure factors. Thus by rotating f in the space defined by the principal components and observing the loadings during that rotation, one may find directions in which variables cluster in the multivariate space. Such a cluster belongs to a pure factor. This is illustrated with the LC-DAD data discussed in Section 34.2.5.1. The two principal components found for this data matrix are given in eq. (34.7). By way of illustration, the loadings of the variables 2 and 15 on f are calculated as a function of the angle Γ of f with v1, in steps of 10 degrees, according to eq. (34.12); they are tabulated in Table 34.2.
TABLE 34.2
The loadings of variables 2 and 15 during the rotation of f away from v1 (see eq. (34.7) for the values of v1 and v2)

Rotation angle    Variable 2    Variable 15
10                 0.668         0.014
20                 0.706         0.013
30                 0.723         0.010
40                 0.718         0.008
50                 0.691         0.005
60                 0.643         0.002
70                 0.576        -0.001
80                 0.491        -0.004
90                 0.391        -0.007
100                0.279        -0.010
110                0.159        -0.012
120                0.034        -0.014
130               -0.092        -0.016
140               -0.215        -0.017
150               -0.332        -0.017
160               -0.438        -0.017
170               -0.532        -0.017
180               -0.609        -0.016
190               -0.668        -0.014
200               -0.706        -0.013
210               -0.723        -0.010
220               -0.718        -0.008
230               -0.691        -0.005
240               -0.643        -0.002
250               -0.576         0.001
260               -0.491         0.004
270               -0.391         0.007
280               -0.279         0.010
290               -0.159         0.012
300               -0.034         0.014
310                0.092         0.016
320                0.215         0.017
330                0.332         0.017
340                0.438         0.017
350                0.532         0.017
360                0.609         0.016
From this table one can see that the highest correlation between f and variable 2 is found when f makes an angle of 30 degrees (and, in the negative direction, 210 degrees) with the first principal component. By the same reasoning we find that f is maximally correlated with variable 15 when f makes an angle of 340 degrees with v1. The next question concerns the purity of these variables, and for this we need a purity criterion. Because the loading of variable 2 at 30 degrees (f_2,30) is equal to 0.7229, which is much larger than f_15,340 = 0.017, one may conclude that variable 2 is probably purer than variable 15. A measure for the purity of a variable is obtained [27] by comparing the loading f_j of variable j on f with the length of the vector obtained by projecting variable j on the v1-v2 plane. The length of variable j projected in the PC-space is equal to sqrt(v1j^2 + v2j^2). If

f_j ≥ sqrt(v1j^2 + v2j^2) cos(β/2)

one concludes that the variable j is pure. The cosine term introduces a small projection window from -β/2 to +β/2. Windig [27] recommends a value of 10 degrees for β. In order to decide on the purity of the factor f as a whole, the purity of the loadings of all variables should be checked. At 30 degrees, where f = 0.866 v1 + 0.5 v2, the variables 2 and 12 (see Table 34.3) fulfil the purity criterion. The sum of squares of all variable loadings fulfilling the purity criterion is plotted in a so-called variance diagram. At 30 degrees, this is the value (f_2,30^2 + f_12,30^2). This is the principle of the VARDIA technique, in which var(Γ) is plotted as a function of Γ, with
var(Γ) = Σ_j f_j^2    (34.13)
where the sum runs over all f_j for which f_j ≥ sqrt(v1j^2 + v2j^2) cos(β/2), with f_j = v1j cos Γ + v2j sin Γ (Γ is the rotation angle). The variance diagram obtained for the example discussed before is quite simple. Clusters of pure variables are found at 30 degrees (var = 0.5853) and at 300 degrees (var = 0.4868) (see Fig. 34.36). The distance from the centre of the diagram to each point is proportional to the variance value. Neighbouring points are connected by solid lines. All values were scaled in such a way that the highest variance is full scale. As can be seen from Fig. 34.36, two clusters of pure variables are found.
TABLE 34.3
VARDIA calculation at 30 degrees

 j    v1       v2       f_30      sqrt(v1^2+v2^2) cos(5π/180)    f_30^2
 1    0.372    0.160    0.402     0.403                          -
 2    0.609    0.391    0.723*    0.721                          0.523
 3    0.148   -0.208    0.024     0.254                          -
 4    0.218   -0.450   -0.036     0.498                          -
 5    0.314   -0.583   -0.019     0.660                          -
 6    0.402   -0.070    0.313     0.406                          -
 7    0.088   -0.230   -0.039     0.245                          -
 8    0.101   -0.307   -0.066     0.322                          -
 9    0.058   -0.063    0.019     0.085                          -
10    0.101   -0.005    0.085     0.101                          -
11    0.100    0.023    0.098     0.102                          -
12    0.206    0.144    0.250*    0.250                          0.063
13    0.139    0.110    0.175     0.177                          -
14    0.239    0.212    0.313     0.318                          -
15    0.016   -0.007    0.010     0.017                          -
16    0.010   -0.0108   0.003     0.015                          -
17    0.0035   0.0006   0.003     0.004                          -
18    0.002   -0.0009   0.001     0.002                          -
19    0.0004  -0.001   -0.0002    0.001                          -
20    0.000    0.000    0.000     0.000                          -

VAR(30)                                                          0.586

* Loadings fulfilling the purity criterion f_30 ≥ sqrt(v1^2 + v2^2) cos(5π/180).
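The variance diagram of eqs. (34.12)-(34.13) is compact enough to sketch directly (Python/NumPy, using the loadings of eq. (34.7)); because the tabulated loadings are rounded, borderline variables may classify slightly differently than in the published diagram, but the clusters appear near 30 and 300 degrees as in Fig. 34.36.

```python
import numpy as np

def vardia(v1, v2, beta_deg=10.0, step_deg=10.0):
    """var(Gamma) of eq. (34.13): sum of squared loadings of the 'pure' variables
    while the test vector f is rotated through the v1-v2 plane."""
    beta = np.radians(beta_deg)
    angles = np.radians(np.arange(0.0, 360.0, step_deg))
    length = np.sqrt(v1**2 + v2**2)              # length of variable j in the plane
    var = []
    for g in angles:
        f = v1 * np.cos(g) + v2 * np.sin(g)      # eq. (34.12)
        pure = f >= length * np.cos(beta / 2.0)  # purity criterion
        var.append(np.sum(f[pure] ** 2))
    return np.degrees(angles), np.array(var)

# v1, v2 as in eq. (34.7) (20 wavelengths)
v1 = np.array([0.372, 0.609, 0.148, 0.218, 0.314, 0.402, 0.088, 0.101, 0.058,
               0.101, 0.100, 0.206, 0.139, 0.239, 0.016, 0.010, 0.0035, 0.002,
               0.0004, 0.0])
v2 = np.array([0.160, 0.391, -0.208, -0.450, -0.583, -0.070, -0.230, -0.307,
               -0.063, -0.005, 0.023, 0.144, 0.110, 0.212, -0.007, -0.0108,
               0.0006, -0.0009, -0.001, 0.0])

angles, var = vardia(v1, v2)
top = np.argsort(var)[::-1][:4]
print(list(zip(angles[top].astype(int), np.round(var[top], 3))))
# the largest variances occur in two angular clusters (around 30 and around 300 degrees)
```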
The spectra at 30 degrees and 300 degrees (see Figs. 34.37 and 34.38) are in good agreement with the pure spectra given in Fig. 34.3. This demonstrates that the procedure yields quite a good estimate of the pure spectra. The rotation of a factor in the space defined by two eigenvectors is straightforward and, therefore, the method is well suited to solve two-component systems. Three-component systems can also be solved after all rows (spectra) have been normalized. Windig [27,28] also proposed procedures to handle more complex systems with more than two components. For these, several variance diagrams have to be combined. Instead of observing the variance of the loadings of a factor being rotated in the space defined by the eigenvectors, one can also plot the loadings themselves and guide the rotation iteratively to a factor that obeys the constraints imposed by the system
Fig. 34.36. Variance diagram obtained for the data in Fig. 34.2.
Fig. 34.37. Spectrum (f) at an angle of 30 degrees with v1 (in the v1-v2 plane).
Fig. 34.38. Spectrum (f) at an angle of 300 degrees with v1 (in the v1-v2 plane).
under investigation, such as non-negativity of signals and concentrations. The difficulty is to design a procedure that converges to the right factors in a few steps.

34.4.2 Simplisma

Windig et al. [29] proposed an elegant method, called SIMPLISMA, to find the pure variables, which does not require a principal components analysis. It is based on the evaluation of the relative standard deviation (s_j / x̄_j) of the columns j of X. This yields a so-called standard deviation spectrum. A large relative standard deviation indicates a high purity of that column. For instance, variable 1 in Table 34.4 has the same value (0.2) for the two pure factors and is therefore very impure: by mixing the two pure factors no variation is observed in the value of that variable. On the other hand, variable 2 is pure for factor 1 and varies between 0.1 and 0.4 in the mixtures. In order to avoid that wavelengths with a low mean intensity obtain a high purity value, the relative standard deviation is truncated by introducing a small offset value δ, leading to the following expression for the purity p_j of a variable (column j):

p_j = s_j / (x̄_j + δ)
The purity plotted as a function of the wavelength j gives a purity spectrum. The calculation of a purity spectrum is illustrated by the following example.
TABLE 34.4
Determination of pure variables from mixture spectra

Mixtures ratio       Spectra
                     1      2       3       4       5       6
0.5/0.5              0.2    0.25    0.45    0.35    0.45    0.40
0.2/0.8              0.2    0.10    0.30    0.32    0.48    0.64
0.4/0.6              0.2    0.20    0.40    0.34    0.46    0.48
0.6/0.4              0.2    0.30    0.50    0.36    0.44    0.32
0.8/0.2              0.2    0.40    0.60    0.38    0.42    0.16
Stand. dev.          0      0.118   0.118   0.0224  0.0224  0.1789
Average              0.2    0.25    0.45    0.35    0.45    0.40
Rel. stand. dev.     0      0.4472  0.2622  0.0640  0.0498  0.4473
Pure 1               0.2    0.5     0.7     0.4     0.4     0.0
Pure 2               0.2    0.0     0.2     0.3     0.5     0.8
Estimated pure 1     0.4    1.0     1.4     0.8     0.8     0
Estimated pure 2     0.25   0       0.25    0.375   0.625   1.0
Suppose that spectra have been obtained of a number of mixtures (see Table 34.4). A plot of the relative standard deviation spectrum together with the pure spectra (see Fig. 34.39) shows that a high relative standard deviation coincides with a pure variable and that a small relative standard deviation coincides with an impure variable (impure in the sense of non-selective). Pure variables either belong to the same spectrum, e.g. because it contains several selective wavelengths, or belong to spectra of different compounds. This has to be sorted out by considering the correlation between the pure variable columns of the data matrix. If the pure variables are not positively correlated (here variables 2 and 6) one can conclude that these pure variables belong to different compounds. In this example the correlation between columns 2 and 6 is -1.0, which indicates that the pure variables belong to different compounds. In practice, however, one has to follow a more complex procedure to find the successive pure variables belonging to different species and to determine the number of compounds [29].
Fig. 34.39. Mixture spectra and the corresponding relative standard deviation spectrum.
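A sketch of the SIMPLISMA purity calculation and the subsequent resolution via eq. (34.11), using the six-wavelength example of Table 34.4 (Python/NumPy; the offset δ is set to zero for simplicity):

```python
import numpy as np

# Mixture spectra of Table 34.4 (rows: mixtures, columns: wavelengths 1-6)
X = np.array([[0.2, 0.25, 0.45, 0.35, 0.45, 0.40],
              [0.2, 0.10, 0.30, 0.32, 0.48, 0.64],
              [0.2, 0.20, 0.40, 0.34, 0.46, 0.48],
              [0.2, 0.30, 0.50, 0.36, 0.44, 0.32],
              [0.2, 0.40, 0.60, 0.38, 0.42, 0.16]])

def purity(X, offset=0.0):
    """SIMPLISMA purity p_j = s_j / (mean_j + offset) for every column j."""
    return X.std(axis=0, ddof=1) / (X.mean(axis=0) + offset)

p = purity(X)
print(np.round(p, 3))            # columns 2 and 6 have the highest purity

# Resolve the pure spectra from the two pure columns (eq. 34.11): X = P S^T
P = X[:, [1, 5]]                 # pure variables: wavelengths 2 and 6
St, *_ = np.linalg.lstsq(P, X, rcond=None)
print(np.round(St, 3))           # rows: estimated pure spectra (up to normalization)
```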
Once the pure variables have been identified, the data set can be resolved into the pure spectra by solving eq. (34.11). For the mixture spectra in Table 34.4 this gives:

| 0.2  0.25  0.45  0.35  0.45  0.40 |   | 0.25  0.40 |
| 0.2  0.10  0.30  0.32  0.48  0.64 |   | 0.10  0.64 |
| 0.2  0.20  0.40  0.34  0.46  0.48 | = | 0.20  0.48 | S^T
| 0.2  0.30  0.50  0.36  0.44  0.32 |   | 0.30  0.32 |
| 0.2  0.40  0.60  0.38  0.42  0.16 |   | 0.40  0.16 |

giving the results s1^T = [0.4 1.0 1.4 0.8 0.8 0] and s2^T = [0.25 0.0 0.25 0.375 0.625 1.0]. The estimated spectra are the pure spectra except for a normalization factor. As we indicated before, the pure(st) columns of an HPLC-DAD data matrix are an
estimate of the pure column factors. These are the elution profiles which are the least contaminated by the other compounds. An alternative procedure is to evaluate the purity of the rows i instead of the purity of the columns j:

p_i = s_i / (x̄_i + δ)
The rows with the highest purities are estimates of the row factors, i.e. the purest spectra in the data set, which are afterwards refined by alternating regression.

34.4.3 Orthogonal projection approach (OPA)

The orthogonal projection approach (OPA) [30] is an iterative procedure to find the pure or purest spectra (rows) in a data matrix. In HPLC, a pure spectrum coincides with a zone in the retention time where only one solute elutes. OPA can also be applied to find the pure or purest chromatograms (columns) in a data matrix. A pure chromatogram indicates a selective wavelength, i.e. a pure variable. We illustrate the method for the problem of finding the purest spectra. The procedure for finding pure chromatograms, or selective wavelengths, is fully equivalent and is obtained by applying the procedure to the transpose of the data matrix. A basic assumption of OPA is that the purest spectra are mutually more dissimilar than the corresponding mixture spectra. Therefore, OPA uses a dissimilarity criterion to find the number of components and the corresponding purest spectra. Spectra are sequentially selected, taking into account their dissimilarity. The dissimilarity d_i of spectrum i is defined as the determinant of a dispersion matrix Y_i. In general, the matrices Y_i consist of one or more reference spectra and the spectrum measured at the ith elution time:

d_i = det(Y_i^T Y_i)
(34.14)
A dissimilarity plot is then obtained by plotting the dissimilarity values d_i as a function of the retention time i. Initially, each matrix Y_i consists of two columns: the reference spectrum, which is the mean (average) spectrum (normalised to unit length) of matrix X, and the spectrum at the ith retention time. The spectrum with the highest dissimilarity value is the least correlated with the mean spectrum, and it is the first spectrum selected, x_s1. Then, the mean spectrum is replaced by x_s1 as reference in the matrices Y_i (Y_i = [x_s1 x_i]), and a second dissimilarity plot is obtained by applying eq. (34.14). The spectrum most dissimilar to x_s1 is selected (x_s2) and added to the matrices Y_i. Therefore, for the determination of the third dissimilarity plot Y_i contains three columns, [x_s1 x_s2 x_i], i.e., the reference spectra and the spectrum at the ith retention time.
Fig. 34.40. Normalized concentration profiles of a minor and main compound for a system with 0.2% of prednisone; chromatographic resolution is 0.8.
In summary, the selection procedure consists of three steps: (1) compare each spectrum in X with all spectra already selected by applying eq. (34.14); initially, when no spectrum has been selected, the spectra are compared with the average spectrum of matrix X; (2) plot the dissimilarity values as a function of the retention time (dissimilarity plot); and (3) select the spectrum with the highest dissimilarity value by including it as a reference in the matrices Y_i. The selection of spectra is finished when the dissimilarity plot shows a random pattern; it is then considered that there are as many compounds as there are selected spectra. Once the purest spectra are available, the data matrix X can be resolved into its spectra and elution profiles by alternating regression, explained in Section 34.3.1. By way of illustration, let us consider the separation of 0.2% prednisone in etrocortysone, eluting with a chromatographic resolution equal to 0.8 [30] (Fig. 34.40). The dissimilarity of each spectrum with respect to the mean spectrum is plotted in Fig. 34.41a. Two clearly differentiated peaks with maxima around times 46 and 63 indicate the presence of at least two compounds. In this case, the
Fig. 34.41. Dissimilarity of each spectrum with respect to (a) the mean spectrum, (b) spectrum at time 46 and (c) spectra at times 46 and 63, for the system of Fig. 34.40.
dissimilarity of the spectrum at time 46 is slightly higher than that of the spectrum at time 63, and it is the first spectrum selected. Each spectrum is then compared with the spectrum at time 46 and the dissimilarity is plotted versus time (Fig. 34.41b). The spectrum at time 63 has the highest dissimilarity value and, therefore, it is the second spectrum selected. The procedure is continued by calculating the dissimilarity of all remaining spectra with respect to the two already selected spectra, which is plotted in Fig. 34.41c. As one can see, the dissimilarity values are about 1000 times smaller than the smallest values obtained so far. Moreover, no peak is observed in the plot. This leads to the conclusion that no third component is present in the data. A comparison of the performance of FSWEFA, SIMPLISMA and OPA on a real data set of LC-FTIR spectra containing three complex clusters of co-eluting compounds is given in Ref. [31]. An alternative method, key-set factor analysis, which looks for a set of purest rows, called the key set, has been developed by Malinowski [32].
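A sketch of the OPA selection loop based on eq. (34.14) (Python/NumPy, simulated two-component data; the stopping rule based on the randomness of the dissimilarity plot is not implemented here):

```python
import numpy as np

def dissimilarity_plot(X, references):
    """d_i = det(Y_i^T Y_i) with Y_i = [references, x_i] (eq. 34.14), for every row i."""
    d = np.empty(X.shape[0])
    for i, x in enumerate(X):
        Y = np.column_stack(references + [x])
        d[i] = np.linalg.det(Y.T @ Y)
    return d

def opa_select(X, n_select):
    """Sequentially select the mutually most dissimilar (purest) spectra."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)             # unit-length spectra
    refs = [Xn.mean(axis=0) / np.linalg.norm(Xn.mean(axis=0))]    # start: mean spectrum
    selected = []
    for _ in range(n_select):
        d = dissimilarity_plot(Xn, refs)
        i = int(np.argmax(d))
        selected.append(i)
        if len(selected) == 1:
            refs = [Xn[i]]        # replace the mean spectrum by the first selection
        else:
            refs.append(Xn[i])
    return selected

# Simulated two-component elution (40 times x 25 wavelengths)
rng = np.random.default_rng(5)
t = np.arange(40.0)[:, None]
C = np.exp(-0.5 * ((t - np.array([15.0, 24.0])) / 4.0) ** 2)
S = np.abs(rng.normal(size=(2, 25)))
X = C @ S + 1e-6 * rng.normal(size=(40, 25))
print(opa_select(X, 2))   # indices of the two purest spectra, one in each selective region
```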
298
34.5 Quantitative methods for factor analysis The aim of all the foregoing methods of factor analysis is to decompose a data-set into physically meaningful factors, for instance pure spectra from a HPLC-DAD data-set. After those factors have been obtained, quantitation should be possible by calculating the contribution of each factor in the rows of the data matrix. By ITTFA (see Section 34.2.6) for example, one estimates the elution profiles of each individual compound. However, for quantitation the peak areas have to be correlated to the concentration by a calibration step. This is particularly important when using a diode array detector because the response factors (absorptivity) may considerably vary with the compound considered. Some methods of factor analysis require the presence of a pure variable for each factor. In that case quantitation becomes straightforward and does not need a multivariate approach because full selectivity is available. In this section we focus on methods for the quantitation of a compound in the presence of an unknown interference without the requirement that this interference should be identified first or its spectrum should be estimated. Hyphenated methods are the main application domain. The methods we discuss are generalized rank annihilation method (GRAM) and residual bilinearization (RBL). 34.5.1 Generalized rank annihilation factor analysis (GRAFA) In 1978, Ho et al. [33] published an algorithm for rank annihilation factor analysis. The procedure requires two bilinear data sets, a calibration standard set X^ and a sample set X„. The calibration set is obtained by measuring a standard mixture which contains known amounts of the analytes of interest. The sample set contains the measurements of the sample in which the analytes have to be quantified. Let us assume that we are only interested in one analyte. By a PCA we obtain the rank R^ of the data matrix X^ which is theoretically equal to 1 + n^ where Az- is the number of interfering compounds. Because the calibration set contains only one compound, its rank R^ is equal to one. In the next step, the rank is calculated of the difference matrix X^ = X„ - kX^. For any value of/:, the rank of X^ is equal to 1 + n, except for the case where k is exactly equal to the contribution of the analyte to the signal. In this case the rank of X^ is R^ - 1. Thus the concentration of the analyte in the unknown sample can be found by determining the /:-value for which the rank of X^^ is equal to 7?^ - 1. The amount of the analyte in the sample is then equal to kc^ where c^ is the concentration of the analyte in the standard solution. In order to find this /:-value Ho et al. proposed an iterative procedure which plots the eigenvalues of the least significant PC of X^ as a function oik. This eigenvalue becomes minimal when k exactly compensates the signal of the analyte in the sample. For other /^-values the signal is under- or
299
Fig. 34.42. RAFA plot of the least significant eigenvalue as a function of ^ (see text for an explanation
ofk).
overcompensated which results in a higher value of the EV. An example of such a plot is given in Fig. 34.42. When several analytes have to be determined, this procedure needs to be repeated for each analyte. Because this algorithm requires that a PCA is calculated for each considered value of ^, RAFA is computationally intensive. Sanchez and Kowalski [34] introduced generalized rank annihilation factor analysis (GRAFA).
300
More than one analyte can be quantified simultaneously in the presence of interfering compounds. The required measurements are identical to RAFA: a data matrix X^ of the unknown sample and a calibration matrix with the analytes X^. 34.5.2 Residual bilinearization (RBL) In order to apply residual bilinearization [35] at least two data sets are needed: Xy which is the data set measured for the unknown sample and X^ which is the data matrix of a calibration standard, containing the analyte of interest. In the absence of interferences these two data matrices are related to each other as follows: X^ = bX^-^R
(34.15)
b is 3. coefficient which relates the concentration of the analyte in the unknown sample to the concentration in the calibration standard, where c^ = bc^. R is a residual matrix which contains the measurement error. Its rows represent null spectra. However, in the presence of other (interfering) compounds, the residual matrix R is not random, but contains structure. Therefore the rank of R is greater than zero. A PCA of R, after retaining the significant PCs, gives: R = T*V*'r + E
(34.16)
By combining eqs. (34.15) and (34.16) we obtain: X„ =fcX,+ r V*^ + E
(34.17)
By RBL the regression coefficient b is calculated by minimizing the sum of squares of the elements in E. Because the rank of R in eq. (34.16) is unknown, the estimation of b from eq. (34.17) should be repeated for an increasing number of principal components included in V*^. Schematically the procedure proceeds as follows: (1) Start with an initial estimate b^ofb', (2) Calculate R = X, - fcoX,; (3) Determine the rank of R and decompose R into T*V*^ + E; (4) Obtain a new estimate b^ofbby solving X^ = bX^ + f *Y*T f^j. ^^-^^which T*V*^ is the result ofstep 3. This yields: Z7i = (X J X,)-'Xj ( X , - T V ) ; (5) Repeat steps (2) to (4) after substituting b^ with by The iteration process is stopped after b converged to a constant value. If na analytes are quantified simultaneously, data matrices of standard samples are measured for each analyte separately. These matrices X^j, X^2' •••' ^s na ^^^ collected in a three-way data matrix Xg of the size nxpxna, where n is the number of spectra in X^j,..., X^ „3, p is the number of wavelengths and na is the number of analytes. The basic equation for this multicomponent system is given by:
301
X, = X,b + R
(34.18)
where X,. is the three-way matrix of calibration data and b is a vector of regression coefficients related to the unknown concentrations by c^ = c^ b^. How to perform matrix operations on a three-way table is discussed in Section 31.17. The procedure is then continued in a similar way as for the one-component case. Eq. (34.18) is solved for b iteratively by substituting R = T*V*^ + E as explained before. Because the concentrations c^ are known, the three-way data matrix X^ measured for the standard samples can be directly resolved in its elution profiles and spectra by Parafac [36] explained in Section 31.8.3. References to other methods for the decomposition of three-way multicomponent profiles are included in the list of additional recommended reading. 34.5.3 Discussion In order to apply RBL or GRAFA successfully some attention has to be paid to the quality of the data. Like any other multivariate technique, the results obtained by RBL and GRAFA are affected by non-linearity of the data and heteroscedasticity of the noise. By both phenomena the rank of the data matrix is higher than the number of species present in the sample. This has been demonstrated on the PCA results obtained for an anthracene standard solution eluted and detected by three different brands of diode array detectors [37]. In all three cases significant second eigenvalues were obtained and structure is seen in the second principal component. A particular problem with GRAFA and RBL is the reproducibility of the retention data. The retention time axes should be perfectly synchronized. Small shifts of one time interval (thus the /th spectrum in X^ corresponds with the /+lth spectrum in X^) already introduce major errors (> 5%) when the chromatographic resolution is less than 0.6. The results of an extensive study on the influence of these factors on the accuracy of the results obtained by GRAFA and RBL have been reported in Ref. [37]. Although some practical applications have been reported [38,39], the lack of robustness of RBL and GRAFA due to artifacts mentioned above has limited their widespread application in chromatography. 34.6 Application of factor analysis for peak purity check in HPLC In pharmaceutical analysis the detection of impurities under a chromatographic peak is a major issue. An important step forward in the assessment of peak purity was the introduction of hyphenated techniques. When selecting a method to perform a purity check, one has the choice between a global method which considers a whole peak cluster (from the start to the end of the peak), and evolutionary methods, which consider a window of the peak cluster, which is
302
usually moved over the cluster. All global methods, except PCA, usually apply a stepwise approach, e.g. SIMPLISMA, OPA and HELP. HELP is a very versatile tool for a visual inspection and exploration of the data. Several complications can be present, such as heteroscedastic noise, sloping baseline, large scan time and non-linear absorbance [40]. This may lead to the overestimation of the number of existing compounds. The presence of heteroscedastic noise and non-linearities have an important effect on all PCA based methods, such as EFA and FSWEFA. Non-zero and sloping baselines have a critical effect in SIMPLISMA, HELP and FSWEFA. In any case it is better to correct for the baseline prior to the application of any multivariate technique. Baseline correction can be done by subtracting a linear interpolation of the noise spectra before and after the peak, or by row-centring the data [40]. Most analytical instruments have a restricted linear range and outside that range Beer's law no longer holds. Non-linear absorbance indicates the presence of more compounds in all the approaches discussed in this chapter. In some cases it is possible to detect a characteristic profile indicating the presence of non-linearities. In any case the best remedy is to keep the signal within the linear range. A non-linearity may also be introduced because the DAD needs about 10-50 ms to measure a whole spectrum. During that time the concentration of the eluting compound(s) may change significantly. The most sensitive methods for the detection of small amounts of impurities eluting at low chromatographic resolutions, OPA and HELP, are also the ones most affected by these non-linearities. If the scan time is known, a partial correction is possible. EFA, FSWEFA and ETA, which belong to the family of evolutionary methods are somewhat less performing for purity checking. They may also flag impurities due to the heteroscedasticity of the noise and non-linearity of the signal. For a more detailed discussion we refer to Ref. [40].
34.7 Guidance for the selection of a factor analysis method The first step in analysing a data table is to determine how many pure factors have to be estimated. Basically, there are two approaches which we recommend. One starts with a PCA or else either with OPA or SIMPLISMA. PCA yields the number of factors 2ind the significant principal components, which are abstract factors. OPA yields the number of factors and the purest rows (or columns) (factors) in the data table. If we suspect a certain order in the spectra, we preferentially apply evolutionary techniques such as FSWEFA or HELP to detect pure zones, or zones with two or more components. Depending on the way the analysis was started, either the abstract factors found by a PCA or the purest rows found by OPA, should be transformed into pure factors. If no constraints can be formulated on the pure factors, the purest rows
303
(spectra) found by OPA cannot be improved. On the contrary, a PCA can either be followed by a Varimax rotation or by constructing a variance diagram which yields factors with the greatest simplicity. If constraints can be formulated on the pure factors, a PCA can be followed by a curve resolution under the condition that only two compounds are present. OPA (or SIMPLISMA or FSWEFA) can be followed by alternating regression to iteratively estimate the pure row-factors (spectra) and pure column-factors (elution profiles). In a similar way, the varimax and vardia factors can be improved by alternating regression. The success of ITTFA in finding pure factors depends on its convergence to a pure factor by a stepwise application of constraints on the solution, which has been demonstrated on elution profiles. However, it then requires a PCA in the retention time space. Although the decomposition of a data table yields the elution profiles of the individual compounds, a calibration step is still required to transform peak areas into concentrations. Essentially we can follow two approaches. The first one is to start with a decomposition of the peak cluster by one of the techniques described before, followed by the integration of the peak of the analyte. By comparing the peak area with those obtained for a number of standards we obtain the amount. One should realize that the decomposition step is necessary because the interfering compound is unknown. The second approach is to directly calibrate the method by RAFA, RBL or GRAFA or to decompose the three-way table by Parafac. A serious problem with these methods is that the data sets measured for the sample and for the standard solution should be perfectly synchronized. References 1. 2. 3.
4. 5. 6.
7. 8.
9.
H.F. Kaiser, The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23 (1958) 187-200. M. Forina, C. Armanino, S. Lanteri and R. Leardi, Methods of Varimax rotation in factor analysis with applications in clinical and food chemistry. J. Chemom., 3 (1988) 115-125. J.K. Strasters, H. A.H. Billiet, L. de Galan, B.G.M. Vandeginste and G. Kateman, Evaluation of peak-recognition techniques in liquid chromatography with photodiode array detection. J. Chromatog., 385 (1987) 181-200. E.R. Malinowski and D. Howery, Factor Analysis in Chemistry. Wiley, New York, 1980. P.K. Hopke, Target transformation factor analysis. Chemom. Intell. Lab. Syst., 6 (1989) 7-19. J.K. Strasters, H.A.H. Billiet, L. de Galan, B.G.M. Vandeginste and G. Kateman, Reliability of iterative target transformation factor analysis when using multiwavelength detection for peak tracking in liquid-chromatographic separation. Anal. Chem., 60 (1988) 2745-2751. W.H. Lawton and E A. Sylvestre, Self modeling curve resolution. Technometrics, 13 (1971) 617-633. B.G.M. Vandeginste, R. Essers, T. Bosman, J. Reijnen and G. Kateman, Three-component curve resolution in HPLC with multiwavelength diode array detection. Anal. Chem., 57 (1985) 971-985. O.S. Borgen and B.R. Kowalski, An extension of the multivariate component-resolution method to three components. Anal. Chim. Acta, 174 (1985) 1-26.
304 10. 11.
12.
13.
14. 15.
16.
17. 18.
19. 20.
21.
22.
23.
24.
25.
26.
A. Meister, Estimation of component spectra by the principal components method. Anal. Chim. Acta, 161 (1984) 149-161. B.G.M. Vandeginste, F. Leyten, M. Gerritsen, J.W. Noor, G. Kateman and J. Frank, Evaluation of curve resolution and iterative target transformation factor analysis in quantitative analysis by liquid chromatography. J. Chemom., 1 (1987) 57-71. P.K. Hopke, D.J. Alpert and B.A. Roscoe, FANTASIA — A program for target transformation factor analysis to apportion sources in environmental samples. Comput. Chem., 7 (1983) 149-155. P.J. Gemperline, A priori estimates of the elution profiles of the pure components in overlapped liquid chromatography peaks using target factor analysis. J. Chem. Inf. Comput. Sci., 24 (1984)206-212. P.J. Gemperline, Target transformation factor analysis with linear inequality constraints applied to spectroscopic-chromatographic data. Anal. Chem., 58 (1986) 2656-2663. B.G.M. Vandeginste, W.Derks and G. Kateman, Multicomponent self modelling curve resolution in high performance liquid chromatography by iterative target transformation analysis. Anal. Chim. Acta, 173 (1985) 253-264. A. de Juan, B. van den Bogaert, F. Cuesta Sanchez and D.L. Massart, Application of the needle algorithm for exploratory analysis and resolution of HPLC-DAD data. Chemom. Intell. Lab. Syst.,33(1996) 133-145. M. Maeder, Evolving factor analysis for the resolution of overlapping chromatographic peaks. Anal. Chem., 59 (1987) 527-530. H. Gampp, M. Maeder, C.J. Meyer and A.D. Zuberbuhler, Calculation of equilibrium constants from multiwavelength spectroscopic data. Ill Model-free analysis of spectrophotometric and ESR titrations. Talanta, 32 (1985) 1133-1139. M. Maeder and A.D. Zuberbuhler, The resolution of overlapping chromatographic peaks by evolving factor analysis. Anal. Chim. Acta, 181 (1986) 287-291. R.Tauler and E. Casassas, Application of principal component analysis to the study of multiple equilibria systems — Study of Copper(II) salicylate monoethanolamine, diethanolamine and triethanolamine systems. Anal. Chim. Acta, 223 (1989) 257-268. E.J. Karjalainen, Spectrum reconstruction in GC/MS. The robustness of the solution found with alternating regression, in: E.J. Karjalainen (Ed.), Scientific Computing and Automation (Europe). Elsevier, Amsterdam, 1990, pp. 477-488. H.R. Keller and D.L. Massart, Peak purity control in liquid-chromatography with photodiode array detection by fixed size moving window evolving factor analysis. Anal. Chim. Acta, 246 (1991)379-390. J. Toft and O.M. Kvalheim, Eigenstructure tracking analysis for revealing noise patterns and local rank in instrumental profiles: application to transmittance and absorbance IR spectroscopy. Chemom. Intell. Lab. Syst., 19 (1993) 65-73. O.M. Kvalheim and Y.-Z. Liang, Heuristic evolving latent projections — resolving 2-way multicomponent data. 1. Selectivity, latent projective graph, datascope, local rank and unique resolution. Anal. Chem., 64 (1992) 936-946. M.J.P. Gerritsen, H. Tanis, B.G.M. Vandeginste and G. Kateman, Generalized rank annihilation factor analysis, iterative target transformation factor analysis and residual bilinearization for the quantitative analysis of data from liquid-chromatography with photodiode array detection. Anal. Chem., 64 (1992) 2042-2056. H.R. Keller and D.L. Massart, Artifacts in evolving factor analysis-based methods for peak purity control in liquid-chromatography with diode array detection. Anal. Chim. Acta, 263 (1992) 21-28.
305 27. 28. 29.
30. 31.
32. 33.
34. 35. 36 37.
38.
39. 40.
W. Windig and H.L.C. Meuzelaar, Nonsupervised numerical component extraction from pyrolysis mass spectra of complex mixtures. Anal. Chem., 56 (1984) 2297-2303. W. Windig and J. Guilement, Interactive self-modeling mixture analysis. Anal. Chem., 63 (1991) 1425-1432. W. Windig, C.E. Heckler, FA. Agblevor and R.J. Evans, Self-modeling mixture analysis of categorized pyrolysis mass-spectral data with the Simplisma approach. Chemom. Intell. Lab. Syst., 14 (1992) 195-207. F.C. Sanchez, J. Toft, B. van den Bogaert and D.L. Massart, Orthogonal projection approach applied to peak purity assessment. Anal. Chem., 68 (1996) 79-85. F. C. Sanchez, T. Hancewicz, B.G.M. Vandeginste and D.L. Massart, Resolution of complex liquid chromatography Fourier transform infrared spectroscopy data. Anal. Chem., 69 (1997) 1477-1484. E.R. Malinowski, Obtaining the key set of typical vectors by factor analysis and subsequent isolation of component spectra. Anal. Chim. Acta, 134 (1982) 129-137. C.N. Ho, G.D. Christian and E.R. Davidson, Application of the method of rank annihilation to quantitative analysis of multicomponent fluorescence data from the video fluorometer. Anal. Chem., 52 (1980) 1108-1113. E. Sanchez and B.R. Kowalski, Generalized rank annihilation factor analysis. Anal. Chem., 58 (1986) 496-499. J. Ohman, P. Geladi and S. Wold, Residual bilinearization. Part I Theory and algorithms. J. Chemom., 4 (1990) 79-90. A.K. Smilde, Three-way analysis. Problems and prospects. Chemom. Intell. Lab. Syst., 15 (1992) 143-157. M.J.P. Gerritsen, N.M. Faber, M. van Rijn, B.G.M. Vandeginste and G. Kateman, Realistic simulations of high-performance liquid-chromatographic ultraviolet data for the evaluation of multivariate techniques. Chemom. Intell. Lab. Syst., 2 (1992) 257-268. E. Sanchez, L.S. Ramos and B.R. Kowalski, Generalized rank annihilation method. I. Application to liquid chromatography-diode array ultraviolet detection data. J. Chromatog., 385 (1987) 151-164. L.S. Ramos, E. Sanchez and B.R. Kowalski, Generalized rank annihilation method. II Analysis of bimodal chromatographic data. J. Chromatog., 385 (1987) 165-180. F. C. Sanchez, B. van den Bogaert, S.C. Rutan and D.L. Massart, Multivariate peak purity approaches. Chemom. Intell. Lab. Syst., 34 (1996) 139-1171.
Additional recommended reading Books E.R. Malinowski and D.G. Howery, Factor Analysis in Chemistry, 2nd Edn. Wiley, New York, 1992. R. Coppi and S. Bolasco (Eds.), Multiway Data Analysis. North-Holland, Amsterdam, 1989.
Articles Target transformation factor analysis: P.K. Hopke, Tutorial: Target transformation factor analysis. Chemom. Intell. Lab. Syst., 6 (1989) 7-19.
306 Rank annihilation factor analysis: C.N. Ho, G.D. Christian and E.R. Davidson, Application of the method of rank annihilation to fluorescent multicomponent mixtures of polynuclear aromatic hydrocarbons. Anal. Chem., 52 (1980) 1071-1079. J. Ohman, P. Geladi and S. Wold, Residual bilinearization, part 2: AppHcation to HPLC-diode array data and comparison with rank annihilation factor analysis. J. Chemom., 4 (1990) 135-146. Evolutionary methods: H.R. Keller and D.L. Massart, Evolving factor analysis. Chemom. Intell. Lab. Syst., 12 (1992) 209-224. F. Cuesta Sanchez, M.S. Khots, D.L. Massart and J.O. De Beer, Algorithm for the assessment of peak purity in liquid chromatography with photodiode-array detection. Anal. Chem., 285 (1994) 181-192. J. Toft, Tutorial: Evolutionary rank analysis applied to multidetectional chromatographic structures. Chemom. Intell. Lab. Syst., 29 (1995) 189-212. Three-way methods: B. Grung and O.M. Kvalheim, Detection and quantitation of embedded minor analytes in three-way multicomponent profiles by evolving projections and internal rank annihilation. Chemom. Intell. Lab. Syst., 29 (1995) 213-221. B. Grung and O.M. Kvalheim, Rank mapping of three-way multicomponent profiles. Chemom. Intell. Lab. Syst., 29 (1995) 223-232. R. Tauler, A.K. Smilde and B.R. Kowalski, Selectivity, local rank, three-way data analysis and ambiguity in multivariate curve resolution. J. Chemom., 9 (1995) 31-58. Simplisma: W. Windig and G. Guilment, Interactive self-modeling mixture analysis. Anal. Chem., 63 (1991) 1425-1432. W. Windig and D.A. Stephenson, Self-modeling mixture analysis of second-derivative near-infrared spectral data using the Simplisma approach. Anal. Chem., 64 (1992) 2735-2742. Alternating least squares method: R. Tauler, A.K. Smilde, J.M. Henshaw, L.W. Burgess and B.R. Kowalski, Multicomponent determination of chlorinated hydrocarbons using a reaction-based chemical sensor. 2 Chemical speciation using multivariate curve resolution. Anal. Chem., 66 (1994) 3337-3344. R. Tauler, A. Izquierdo-Ridorsa, R. Gargallo and E. Casassas, Application of a new multivariate curve-resolution procedure to the simultaneous analysis of several spectroscopic titrations of the cupper (II) — polyiosinic acid system. Chemom. Intell. Lab. Syst., 27 (1995) 163-174. S. Lacorte, D. Barcelo and R. Tauler, Determination of traces of herbicide mixtures in water by on-line solid-phase extraction followed by liquid chromatography with diode-array detection and multivariate self-modelling curve resolution. J. Chromatog. A 697 (1995) 345-355.
307
Chapter 35
Relations between measurement tables 35.1 Introduction Studying the relationship between two or more sets of variables is one of the main activities in data analysis. This chapter mainly deals with modelling the linear relationship between two sets of multivariate data. One set holds the dependent variables (or responses), the other set holds the independent variables (or predictors). However, we will also consider cases where such a distinction cannot be made and the two data sets have the same status. Each set is in the usual objects X measurements format. There is a choice of techniques for estimating the model, all closely related to multiple linear regression (see Chapter 10). Roughly, the model found can be used in two ways. One usage is for a better understanding of the system under investigation by an interpretation of the model results. The other usage is for the future prediction of the dependent variable from new measurements on the predictor variables. Examples of problems amenable to such multivariate modelling are legion, e.g. relating chemical composition to spectroscopic or chromatographic measurements in analytical chemistry, studying the effect of structural properties of chemical compounds, e.g. drug molecules, on functional behaviour in pharmacology or molecular biology, linking flavour composition and sensory properties in food research or modelling the relation between process conditions and product properties in manufacturing. The present chapter provides an overview of the wide range of techniques that are available to tackle the problem of relating two sets of multivariate data. Different techniques meet specific objectives: simply identifying strong correlations, matching two multi-dimensional point configurations, analyzing the effects of experimental factors on a set of responses, multivariate calibration, predictive modelling, etc. It is important to distinguish the properties of these techniques in order to make a balanced choice. As an example consider the data presented in Tables 35.1-35.4. These tables are extracted from a much larger data base obtained in an international cooperative study on the sensory aspects of olive oils [1]. Table 35.1 gives the mean scores for 16 samples of olive oil with respect to six appearance attributes given by a Dutch sensory panel. Table 35.2 gives similar scores for the same samples as judged by a
308 TABLE 35.1 Olive oils: mean scores for appearance attributes from Dutch sensory panel Sample
ID
Yellow
Green
Brown
Olossy
Transp
Syrup
1
Gil
21.4
73.4
10.1
79.7
75.2
50.3
2
G12
23.4
66.3
9.8
77.8
68.7
51.7
3
012
32.7
53.5
8.7
82.3
83.2
45.4
4
013
30.2
58.3
12.2
81.1
77.1
47.8
5
022
51.8
32.5
8.0
72.4
65.3
46.5
6
131
40.7
42.9
20.1
67.7
63.5
52.2
7
132
53.8
30.4
11.5
77.8
77.3
45.2
8
132
26.4
66.5
14.2
78.7
74.6
51.8
9
133
65.7
12.1
10.3
81.6
79.6
48.3
10
142
45.0
31.9
28.4
75.7
72.9
52.8
11
S51
70.9
12.2
10.8
87.7
88.1
44.5 42.3
12
S52
73.5
9.7
8.3
89.9
89.7
13
S53
68.1
12.0
10.8
78.4
75.1
46.4
14
S61
67.6
13.9
11.9
84.6
83.8
48.5
15
S62
71.4
10.6
10.8
88.1
88.5
46.7
16
S63
71.4
10.0
11.4
89.5
88.5
47.2
Mean
50.9
33.5
12.3
80.8
78.2
48.0
Std. dev.
19.5
23.5
5.1
6.2
8.3
3.1
British panel. Note that the sensory attributes are to some extent different. Table 35.3 gives some information on the country of origin and the state of ripeness of the olives. Finally, Table 35.4 gives some physico-chemical data on the same samples that are related to the quality indices of olive oils: acid and peroxide level, UV absorbance at 232 nm and 270 nm, and the difference in absorbance at wavelength 270 nm and the average absorbance at 266 nm and 274 nm. Given these tables of multivariate data one might be interested in various relationships. For example, do the two panels have a similar perception of the different olive oils (Tables 35.1 and 35.2)? Are the oils more or less similarly scattered in the two multidimensional spaces formed by the Dutch and by the British attributes? How are the two sets of sensory attributes related? Does the
309 TABLE 35.2 Olive oils: mean scores of appearance attributes from British sensory panel Sample
ID
Bright
Depth
Yellow
Brown
Green
1
Gil
33.2
76.8
24.4
50.9
56.8
2
G12
40.9
76.7
28.3
39.4
61.4
3
G12
44.1
70.0
33.6
35.9
52.4
4
G13
51.4
65.0
37.1
28.3
52.1
5
G22
63.6
47.2
58.1
17.9
36.9
6
131
42.4
67.3
41.6
41.1
34.7
7
132
60.6
51.1
58.0
20.3
33.5
8
132
71.7
42.7
69.9
17.7
21.6
9
133
41.7
74.7
28.1
42.8
51.9
10
142
48.3
68.7
44.7
57.4
16.4
11
S51
78.6
34.3
82.5
9.4
18.7
12
S52
84.8
25.0
85.9
3.1
16.2
13
S53
85.3
26.3
86.7
2.3
17.9
14
S61
81.4
34.5
80.2
8.3
18.2
15
S62
88.4
27.7
87.4
4.7
14.7
16
S63
88.4
29.7
86.8
3.4
16.3
Mean
62.8
51.1
58.3
23.9
32.5
Std.dev.
19.8
19.9
24.4
18.5
17.2
country of origin or the state of ripeness affect the sensory characteristics (Tables 35.1 and 35.3)? Can we possibly predict the sensory properties from the physicochemical measurements (Tables 35.1 and 35.4)? An important aspect of all methods to be discussed concerns the choice of the model complexity, i.e., choosing the right number of factors. This is especially relevant if the relations are developed for predictive purposes. Building validated predictive models for quantitative relations based on multiple predictors is known as multivariate calibration. The latter subject is of such importance in chemometrics that it will be treated separately in the next chapter (Chapter 36). The techniques considered in this chapter comprise Procrustes analysis (Section 35.2), canonical correlation analysis (Section 35.3), multivariate linear regression
310 TABLE 35.3 Country of origin and state of ripeness of 16 olive oils. The last 4 columns contain the same information in the form of a coded design matrix Sample
ID
Country
Ripeness
Spain
Unripe
Overripe
1
Gil
Greece
unripe
0
1
0
2
G12
Greece
normal
0
0
0
Greece
3
G12
Greece
normal
0
0
0
4
G13
Greece
overripe
0
0
1
5
G22
Greece
normal
0
0
0
6
131
Italy
unripe
0
0
1
0
7
132
Italy
normal
0
0
0
0 0
8
132
Italy
normal
0
0
0
9
133
Italy
overripe
0
0
0
1
10
142
Italy
normal
0
0
0
0
11
S51
Spain
unripe
0
1
0
12
S52
Spain
normal
0
0
0
13
S53
Spain
overripe
0
0
1
14
S61
Spain
unripe
0
1
0
15
S62
Spain
normal
0
0
0
16
S63
Spain
overripe
0
0
1
(Section 35.4), reduced rank regression (Section 35.5), principal components regression (Section 35.6), partial least squares regression (Section 35.7) and continuum regression methods (Section 35.8).
35.2 Procrustes analysis 35.2.1 Introduction Procrustes analysis is a method for relating two sets of multivariate observations, say X and Y. For example, one may wish to compare the results in Table 35.1 and Table 35.2 in order to find out to what extent the results from both panels agree, e.g., regarding the similarity of certain olive oils and the dissimilarity of others. Procrustes analysis has a strong geometric interpretation. The
311
TABLE 35.4 Physico-chemical quality parameters of the 16 olive oils Sample
ID
Acidity
Peroxide
K232
K270
DK
1
Gil
0.73
12.70
1.900
0.139
2
G12
0.19
12.30
1.678
0.116
-0.004
3
012
0.26
10.30
1.629
0.116
-0.005
4
013
0.67
13.70
1.701
0.168
-0.002
5
022
0.52
11.20
1.539
0.119
-0.001
6
131
0.26
18.70
2.117
0.142
0.001
7
132
0.24
15.30
1.891
0.116
0.000
8
132
0.30
18.50
1.908
0.125
0.001
9
133
0.35
15.60
1.824
0.104
0.000
10
142
0.19
19.40
2.222
0.158
-0.003
11
S51
0.15
10.50
1.522
0.116
-0.004
12
S52
0.16
8.14
1.527
0.103
-0.002
13
S53
0.27
12.50
1.555
0.096
-0.002 -0.003
0.003
14
S61
0.16
11.00
1.573
0.094
15
S62
0.24
10.80
1.331
0.085
-0.003
16
S63
0.30
11.40
1.415
0.093
-0.004
Mean
0.31
13.25
1.709
0.118
-0.002
Std. dev.
0.18
3.35
0.249
0.024
0.002
observations (objects) are envisioned as points in a high-dimensional variable space. The objective is to find the transformation such that the configuration of points in X-space best matches the corresponding point configuration in F-space. Not all transformations are allowed: the internal configuration of the objects should be preserved. Procrustes analysis treats the two data sets symmetrically: there is no essential difference between either transforming Y to match X or applying the reverse transformation to X so that it best matches Y. One may also apply a transformation to each so that they meet halfway. In the sequel we consider the transformation of X to the target Y. We will assume that X and Y have the same number of variables. If this condition is not met one is at liberty to add the required number of columns, with zeros as entries, to the smaller data set (so-called "zero padding").
312
little bear
Great Bear
OVERALL ROTATION (PCA)
^
^
^
\
d
Fig. 35.1. The stages of Procrustes analysis.
We will explain the mechanics of Procrustes analysis by optimal matching of the two stellar configurations Great Bear and Little Bear. For ease of presentation we work with the 2D-configuration as we see it from the earth (Fig. 35.1a) and we ignore that the actual configuration is 3-dimensional. First X and Y are column mean-centred, so that their centroids m^ = X^IM and my = Y^IM are moved to the origin (Fig. 35.1b, translation step). This column centering is an admissible transformation since it does not alter the distances between objects within each
313
data set. The next step is a reflection (Fig. 35.1c). Again this is a transformation which leaves distances between objects unaltered. The following step is a rotation, which changes the orientation, but not the internal structure of the configuration (Fig. 35.Id). When this best match is found one is at liberty to rotate all configurations equally. This will not affect the match but it may yield an overall orientation that is more appealing (Fig. 35.le). Finally, by taking the mean position for each star, one obtains an average configuration, often called the consensus, that is representative for the two separate configurations (Fig. 35.If). The major problem is to find the rotation/reflection which gives the best match between the two centered configurations. Mathematically, rotations and reflections are both described by orthogonal transformations (see Section 29.8). These are linear transformations with an orthonormal matrix (see Section 29.4), i.e. a square matrix R satisfying R^R = RR^ = I, or R^ = R"^ When its determinant is positive R represents a pure rotation, when the determinant is negative R also involves a reflection. The best match is defined as the one which minimizes the sum of squared distances between the transformed Z-objects and the corresponding objects in the target configuration given by Y. The Procrustes problem then is equivalent to minimizing the sum of squares of the deviations matrix E = Y - XR, assuming both X and Y have been column-mean centered. This looks like a straightforward least-squares regression problem, Y = XR + E, but it is not since R is restricted to be an orthogonal rotation/reflection matrix. Using a shorthand notation for a matrix sum of squares, ||E|p = YJ^'^J = tr(E^E), we may state the Procrustes optimization problem as: min 11 Y - XR IP R
subject to R'^R = RR^ = 1
(35.1)
Using elementary properties of the trace of a matrix (viz. tr(A + B) = tr(A) + tr(B) and tr(AB) = tr(BA), see Section 29.4) we may write: II Y - XR IP = tr((Y - XR)'r(Y - XR)) = tr(YTY - R^X'^Y - Y^XR + R'^X^XR) = = tr(YTY) - 2tr(Y^XR) + tr(RTX'rXR)
(35.2)
The first term on the right-hand side represents the total sum of squares of Y, that obviously does not depend on R. Likewise, the last term represents the total sum of squares of the transformed X-configuration, viz. XR. Since the rotation/reflection given by R does not affect the distance of an object from the origin, the total sum of squares is invariant under the orthogonal transformation R. (This also follows from tr(R'^X^XR) = tr(X^XRRT) = tT(X^XI) = tT{X^X).) The only term then in eq. (35.2) that depends on R is tr(Y'^XR), which we must seek to maximize.
314
Let the SVD of V^X be given by V^X = Q D W ^ , with D being the diagonal matrix of singular values. The properties of singular vector decomposition (SVD, Section 29.6) tell us that, among all possible orthonormal matrices, Q and W are the ones that maximize tr(Q'^Y'^XW). Since tr(Q'^Y'^XW) = trCY^XWQ^), it follows that R = WQ^ is the rotation/reflection which maximizes tr(Y^XR), and hence it minimizes the squared distances, ||Y - XR |p (eq. 35.2) between X and Y. Given this optimal Procrustes rotation applied to X, one may compute an average configuration Z as (Y + XR)/2. Usually, this is followed by a principal component analysis (Section 31.1) of the average Z. The rotation matrix V, obtained as the matrix of eigenvectors of Z^Z, is then applied each to Y, XR and Z. It must be emphasized that Procrustes analysis is not a regression technique. It only involves the allowed operations of translation, rotation and reflection which preserve distances between objects. Regression allows any linear transformation; there is no normality or orthogonality restriction to the columns of the matrix B transforming X. Because such restrictions are released in a regression setting Y = XB will fit Y more closely than the Procrustes match Y = XR (see Section 35.3). 35.2.2 Algorithm Summarizing, the Procrustes matching problem for two configurations X and Y can be solved with the following algorithm: Column-centering: SVD: Rotate X: Average: PCAofZ: Final rotation:
X <— X-l^nix and Y
For more details, also for the case of determining individual isotropic scaling factors, see Refs. [2,3]. 35.2.3 Discussion In the field of structural chemistry Procrustes analysis has been used to compare the geometries of similar structures from different studies: e.g. the conformations of a molecule crystallized in two crystalline forms, or the 3D-configuration of the main chain of two proteins, or the geometries of differently substituted compounds. Figure 35.2a shows the 3D-configuration of two comparable amino-acid sequences taken from related protein fragments obtained from different structural studies. After Procrustes matching the two structures are overlaid (Fig. 35.2b) revealing both the close overall similarity and the local deviations.
315
AA, /-O-
--A
Fig. 35.2. Comparable fragments of different protein fragments, before (a) and after (b) matching.
Let us now compare the sensory data of Table 35.1 (Dutch panel, X) and Table 35.2 (British panel, Y) using Procrustes analysis. Since the British panel has one variable less than the Dutch panel we add a column of zeros to ensure that the two data sets have the same size (16x6). This so-called 'zero padding' is allowed since it does not affect the distances between the objects. Another form of data preprocessing that we apply is the overall scaling of the British scores which have a larger total variance. After shrinking these data by a factor 0.73, both data sets have the same total variance (= 1038). Figure 35.3 shows the object scores of the combined data, obtained after matching, projected on the principal axes of the
316
20
C D d
10 to
b
(0
a o c
" rflt I
^
a
•
%
M
-10^
-20\ 1
-1
1 — • — I — ' — \ — '
50
-40
-30
-20
-10
0
p-
10
—^
20
r — ^
30
I
I
40
r -
50
principal axis 1 Fig. 35.3. Scatter plot of 16 olive oils scored by two sensory panels (Dutch panel: lower case; British panel: upper case). The combined data are shown after Procrustes matching and projection onto the principal plane of the average configuration.
consensus configuration. This scatter plot clearly shows that the relative positions of the 16 olive oil samples by the two panels compare very well. It is also clear that samples from different countries, especially Greece and Spain, are well separated. As to the interpretation of the various attributes one may plot the correlations of the individual attributes with the two principal dimensions (Fig. 35.4). This plot shows, for example, that indeed the attributes 'yellow' and 'green' (and to a lesser extent 'brown') are similarly perceived by the two panels. It also shows that the second dimension is better represented by the attributes from the Dutch panel. Practically all attributes lie close to the circle of radius 1. This means that a large fraction of their variance is explained by dimensions 1 and 2, indicating that there is little correlation with the higher dimensions. Hence, the common twodimensional representation provides an adequate summary of the data of both panels. Procrustes analysis has been generalized in two ways. One extension is that more than two data sets may be considered. In that case the algorithm is iterative. One then must rotate, in turn, each data set to the average of the other data sets. The cycle must be repeated until the fit no longer improves. Procrustes analysis of many data sets has been applied mostly in the field of sensory data analysis [4]. Another extension is the application of individual scaling to the various data sets in order to improve the match. Mathematically, it amounts to multiplying all entries in a data set by the same scalar. Geometrically, it amounts to an expansion (or
317 1.0
BRIC YELLJOW
Fig. 35.4. Correlations of sensory attributes with principal axes of average configuration after Procrustes matching (Dutch panel: lower case; British panel: upper case).
shrinkage) that is isotropic, i.e. equal in all directions. While it does affect the absolute distances between objects, it does not affect their relative distances nor the angles involving three objects. In Chapter 38 we give an example of such a Generalized Procrustes Analysis (GPA) where it is used to find a 'best average' configuration of different food products coming from individual assessments by different panellists. Procrustes analysis has been little exploited in typical chemometrical areas, since it is less common that one only wants to investigate the similarity of two comparable data tables. One is more often interested in a predictive regression-type linear relation between two data tables. The data tables involved then may be of quite different origin, e.g. sensory and chemical. 35.3 Canonical correlation analysis 35.3.1 Introduction Canonical Correlation Analysis (CCA) is perhaps the oldest truly multivariate method for studying the relation between two measurement tables X and Y [5]. It generalizes the concept of squared multiple correlation or coefficient of determination, R^. In Chapter 10 on multiple linear regression we found that R^ is a measure for the linear association between a univariate y and a multivariate X. This R^ tells how much of the variance of y is explained by X: R^ = y^ yly^y = llylP/llylF. Now, we extend this notion to a set of response variables collected in the multivariate data set Y.
318 TABLE 35.5 Two small bivariate data sets. A. Raw data X
Y
•^1
^2
^'i
yi
8
11
10
8
8
8
9
7
9
12
6
11
10
10
1
14
9
13
3
0
2
10
7
9
11
5
12
5
2
11
2
7
B. Correlation matrix
Xl
yi
yi
Xi
X2
1
0.86
-0.55
0.53 0.62
X2
0.86
1
-0.46
yi
-0.55
-0.46
1
-0.93
yi
0.53
0.62
-0.93
1
For example, let us take a look at the data of Table 35.5a. This table shows two very simple data sets, X and Y, each containing only two variables. Is there a relationship between the two data sets? Looking at the matrix of correlation coefficients (Table 35.5b) we find that the so-called intra-set (or within-set) correlations are strong: r(jCi,jC2) = +0.86 and r(y^,y2) = -0.93. The inter-set (or between-set) correlations, however, are relatively low: r{x,,y,) = -0.55,
rix.^y^) = +0.53,
rix^^y,) = -0.46,
r(x2,y2) = +0.62.
The squared inter-set correlation coefficients vary from 0.21 to 0.38. Thus, only some 20% to 40% of the variance of the individual variables can be explained by one of the variables from the other data set. At a first glance these low inter-set correlations do not indicate a strong relation between the two data tables. In
319
principle, however, these figures can be improved if we consider multiple regression of each variable on all variables from the other data set. We then find the following squared multiple correlation coefficients (7?^-values): R\x^\yi,y2) = 0.3l, R\x2\y^.y2) = 0AS, R\y^\x^,X2) = 030, R^(y2Kx2) = 039 Here, the notation R'^(y^\x^,X2) stands for the squared multiple correlation coefficient (or coefficient of determination) of the multiple regression of y^ on x^ and X2. The improvement is quite modest, suggesting once more that there is only a weak (linear) relation between the two sets of data. We can go one step further, however. Each of the above multiple regression relations is between a single variable (response) of one data set and a linear combination of the variables (predictors) from the other set. Instead, one may consider the 'multiple-multiple' correlation, i.e. the correlation of a linear combination from one set with a linear combination of the other set. Such linear combinations of the original variables are variously called factors, components, latent variables, canonical variables or canonical variates (also see Chapters 9,17, 29, and 31). As an example, let us construct a factor u = Yq from the F-variables defined through its weight vector q. For example, let q = [0.8, -0.6]^, i.e. u = 0.8yi-0.6y2. A vector of factor scores u can be calculated then by applying these weights, 0.8 and -0.6, to the columns y^ and y2, respectively, in Table 35.5 and adding the results. Multiple regression of this newly formed variable u on Xj and X2 yields R^ = 0.66. This is already a considerable improvement. The natural question then is: Which linear combination of F-variables yields the highest R^ when regressed on the X-variables in a multiple regression? Canonical correlation analysis answers this question. The particular linear combinations of the X- and F-variables achieving the maximum correlation are the so-called first canonical variables, say t^ = Xwj and Uj = Yqj. The vectors of coefficients Wj and q^ in these linear combinations are the canonical weights for the X-variables and 7-variables, respectively. For the data of Table 35.5 they are found to be Wj = [0.583, -0.561]^ and qi = [0.737,0.731]^. The correlation between these first canonical variables is called the first canonical correlation, pj. This maximum correlation turns out to be quite high: pj = 0.95 {R^ = 0.90), indicating a strong relation between the first canonical dimensions of X and Y. The next pair of canonical variates, t2 and U2 also has maximum correlation p2, subject, however, to the condition that this second pair should be uncorrelated to the first pair, i.e. t^ t2 = u^ U2 = 0. For the example at hand, this second canonical correlation is much lower: p2 = 0.55 (R^ = 0.31). For larger data sets, the analysis goes on with extracting additional pairs of canonical variables, orthogonal to the previous ones, until the data table with the smaller number of variables has been
320
transformed completely into equally many orthogonal canonical variables. It turns out that the interset correlations of canonical variables of different dimensions are also orthogonal, i.e. ij u^ = 0 for / ^j. Thus, we see that CCA forms a canonical analysis, namely a decomposition of each data set into a set of mutually orthogonal components. A similar type of decomposition is at the heart of many types of multivariate analysis, e.g. PCA and PLS. Under the assumption of multivariate normality for both populations the canonical correlations can be tested for significance [6]. Retaining only the significant canonical correlations may allow for a considerable dimension reduction. 35.3.2 Algorithm Computationally, canonical correlation analysis can be implemented using the following steps, where it is assumed that the data X and Y are mean-centered. X = Ux Sx V^
< SVD of X >
(35.5a)
Y = Uy Sy V^
< SVD of Y >
(35.5b)
R = (U J UY) = W*DQ*'^
< SVD of interset PC correlation matrix>
T = (n-iy^^ UxW*
< X canonical scores>
(35.7a)
U = (n-iy^^ UyQ*
< r canonical scores>
(35.7b)
W = (n-iy^^ VxSx^W*
<Xcanonical weights>
(35.8a)
Q = (n-iy^^ VYSY^ Q*
< ycanonical weights>
(35.8b)
T = XW
< X canonical scores>
(35.9a)
U = YQ
< Y canonical scores>
(35.9b)
P = X'^T(T'^T)-^
< X canonical structure>
(35.1 Oa)
C = Y^U(U^U)-^
< y canonical structure
(35.10b)
(35.6)
Equations 35.5a and b represent the singular value decomposition of the original data tables, giving the new sets, Ux and Uy, of unit-length orthogonal (orthonormal) variables. From these the matrix R is calculated as U x Uy (eq. 35.6). R is the correlation matrix between the principal components of X and those of Y, because of the equivalence of PCs and (left) singular vectors. Singular value decomposition of R yields the canonical weight vectors W* and Q* applicable to Ux and Uy, respectively. The singular values obtained are equal to the canonical
321
correlations p^. Instead of a single SVD of R one may apply a spectral decomposition (Section 29.6) of RR^ giving eigenvectors W* and a spectral decomposition of R^R giving eigenvectors Q*, the eigenvalues corresponding to the squared canonical correlations. The canonical variables are now obtained as in eqs. (35.7a,b). The factor {n-\f^ is included to ensure that the canonical variables have unit variance. Back-transformation to the centred X- and F-variables yields the sets of canonical weights collected in matrices W and Q, respectively (eqs. 35.8a,b). Applying these weights to the original variables again yields the canonical variables (eqs. 35.9a,b). Regressing the X-variables and F-variables on their corresponding canonical variables gives the loading matrices P and C (eqs. 35.10a,b)) which appear in the canonical decomposition: X = TP'^ = T(T'rT)-iT^X and Y = UC^ = U(U^U)"^U^Y. The loadings, defining the original mean-centred variables in terms of the orthogonal canonical variables, are better suited for interpretation than the weights. Each row of P (or C) corresponds to a variable and tells how much each canonical variable contributes to (or "loads" on) this variable. In case the X-variables and the F-variables are also scaled to unit variance, P and C contain the intra-set correlations between the original variables and the canonical variables (so-called structure correlations', see Table 35.6). It should be appreciated that canonical correlation analysis, as the name implies, is about correlation not about variance. The first step in the algorithm is to move from the original data matrices X and Y, to their singular vectors, Ux and Uy, respectively. The singular values, or the variances of the PCs of X and Y, play no role. 353.3 Discussion Let us take a closer look at the analysis of the data of Table 35.5. In Table 35.6 we summarize the correlations of the canonical variates and also their correlations with the original variables. The high value of the first canonical correlation (pi = 0.95) suggests a strong relationship between the two data sets. However, the canonical variables tj and Uj are only strongly related to each other, not with the original variables (Table 35.6). On the other hand, the second set of canonical variates t2 and U2 are strongly related to their original variables, but not to each other (p2 = 0.55). Thus, the analysis yields a pair of strongly linked, but uninteresting factors and a pair of more interesting factors, which are weakly related, however. A major limitation to the value of CCA thus already has become apparent in the example shown. There is no guarantee that the most important canonical variable t, (or Ui) is highly correlated to any of the individual variables of X (or Y). It is possible then for the first canonical variable tj of X to be strongly correlated with Uj, yet to have very little predictive value for Y. In terms of principal components
322 TABLE 35.6 Canonical structure: correlations between the original variables (jc, y) and their canonical variates (/, u).
X
Y
^1
h
"i
"2
^1
0.0476
0.9989
0.0451
0.5522
X2
0.5477
0.8367
0.5187
0.4625
>'J
-0.0042
-0.5528
-0.0045
-1.0000
yi
0.3532
0.5129
0.3729
0.9279
u
1.0000
0.0000
0.9470
0.0000 0.5528
h
0.0000
1.0000
0.0000
"i
0.9470
0.0000
1.0000
0.0000
"2
0.0000
0.5528
0.0000
1.0000
(Chapter 17): only the minor principal components of X and Y happen to be highly correlated. It is questionable whether such a high correlation is then of much interest. This dilemma of choosing between high correlation and large variance presents a major problem when analyzing the relation between measurements tables. The regression techniques treated further on in this chapter address this dilemma in different ways. A second limitation of CCA is that it cannot deal in a meaningful way with data tables in 'landscape mode', i.e. wide data tables having more variables than objects. This severely limits the importance of CCA as a general tool for multivariate data analysis in chemometrics, e.g. when X represents spectral data. As the name implies CCA analyses correlations. It is therefore insensitive to any rescaling of the original variables. This advantage is not shared with most other techniques discussed in this chapter. As in Procrustes analysis X and Y play entirely equivalent roles in canonical correlation analysis: there is no distinction in terms of dependent variables (or responses) versus independent variables (or predictors, regressors, etc.). This situation is fairly uncommon. Usually, the X and Y data are of a different nature and one is interested in understanding one set of data, say Y, in the light of the information contained in the other data set, X. Rather than exploring correlations in a symmetric X<->Y relation, one is searching for an asymmetric regression relation X->Y explaining the dependent F-variables from the predictor X-variables. Thus, the symmetrical nature of CCA limits its practical importance. In the following sections we will discuss various asymmetric regression methods where the goal is to fit the matrix of dependent variables Y by linear combination(s) of the predictor variables X.
323
35.4 Multivariate least squares regression 35.4.1 Introduction In this section we will distinguish multivariate regression from multiple regression. The former deals with a multivariate response (Y), the latter with the use of multiple predictors (X). When studying the relation between two multivariate data sets via regression analysis we are therefore dealing with multivariate multiple regression. Perhaps the simplest approach to studying the relation between two multivariate data sets X and Y is to perform for each individual univariate variable y^ {k = 1,..., m) a separate multiple (i.e. two or more predictor variables) regression on the Z-variables. The obvious advantage is that the whole analysis can be done with standard multiple regression programs. A drawback of this approach of many isolated regressions is that it does not exploit the multivariate nature of Y, viz. the interdependence of the y-variables. Genuine multivariate analysis of a data table Y in relation to a data table X should be more than just a collection of univariate analyses of the individual columns of Y! One might suspect that fitting all 7-variables simultaneously, i.e. in one overall multivariate regression, might make a difference for the regression model. This is not the case, however. To see this, let us state the multivariate (i.e. two or more dependent variables) regression model as: [yi' y2' — Ym] = X [bj, b2,..., b j + [ei, t^^..., e j
(35.11)
which shows explicitly the various responses, y^ (/ = 1, 2, ..., m), as well as the vector of regression coefficients b^, and the residual vector, e^, corresponding to each response. This model may be written more compactly as: Y = XB + E (35.12) where Y is the nxm data set of responses, X the nxp data set of regressors, B the pxm matrix of regression coefficients and E the nxm error matrix. Each column of Y, B and E corresponds to one of the m responses, each column of X and each row of B to one of the p predictor variables and each row of Y, X, and E to one of the n observations. The total residual sum of squares, taken over all elements of E, achieves its minimum when each column e^ separately has minimum sum of squares. The latter occurs if each (univariate) column of Y is fitted by X in the least-squares way. Consequently, the least-squares minimization of E is obtained if each separate dependent variable is fitted by multiple regression on X. In other words: the multivariate regression analysis is essentially identical to a set of univariate regressions. Thus, from a methodological point of view nothing new is added and we may refer to Chapter 10 for a more thorough discussion of theory and application of multiple regression.
324
35.4.2 Algorithm The solution for the regression parameters can be adapted in a straightforward manner from eq (10.6), viz. b = (X^X)"^ X^y, giving: B = {X^Xr'X^Y
(35.13)
Y = X B = X (X^X)-^ X^Y
(35.14)
In eqs. (35.13) and (35.14) X may include a column of ones, when an intercept has to be fitted for each response, giving (p+l x m) B. Otherwise, X and Y are supposed to be mean centered, and (pxm) B does not contain a column of intercepts. The geometric meaning of Eq. (35.14) is that the best fit is obtained by projecting all responses orthogonally onto the space defined by the columns of X, using the orthogonal projection matrix X(X^X)"^X^ (see Section 29.8). 35.4.3 Discussion A major drawback of the approach is felt when the number of dependent variables, m, is large. In that case there is an equally large number of separate analyses and the combined results may be hard to summarize. When the number of predictor variables, p, is very large, e.g. when X represents spectral intensities at many wavelengths, there is also a problem. In that case X^X cannot be inverted and there is no unique solution for B. Both in the case of large m or large p some kind of dimension reduction is called for. We will therefore not discuss the multivariate regression approach further, since this chapter focuses on truly multivariate methods, taking the joint variation of variables into account. All other methods discussed in this chapter provide such a dimension reduction. They search for the most "interesting" directions in F-space and/or "interesting" directions in X-space that are linearly related. They differ in the optimizing criterion that is used to discover those interesting directions.
35.5 Reduced rank regression 35.5.7 Introduction Reduced rank regression (RRR), also known as redundancy analysis (or PCA on Instrumental Variables), is the combination of multivariate least squares regression and dimension reduction [7]. The idea is that more often than not the dependent F-variables will be correlated. A principal component analysis of Y might indicate that A {A « m) PCs may explain Y adequately. Thus, a full set of m
325
separate multiple regressions as in unconstrained multivariate regression (Section 35.4) contains a fair amount of redundancy. To illustrate this we may look for A particular linear combinations of X-variables that explain most of the total variation contained in Y. For simplicity let us start with A=l. When the K-variables have equal variance, this boils down to finding a single component in X-space, say tj = Xwj, that maximizes the average 7?^. This average /?^ is the mean of the individual /?^-values resulting from all regressions of the individual y-variables with the X-component tj. Since now all 7-variables are estimated by the same regressor tj, all fitted K-variables are proportional to this predictor and, consequently, they are all perfectly correlated. In other words, the rank of the fitted y-matrix is, of necessity, 1. Hence the name reduced rank regression. Of course, this rank-1 restriction may severely affect the quality of the fit when the effective dimensionality A of Y is larger, 1 < A < m. Thus, we may look for a second linear combination of X-variables, X,^ = Xw2, orthogonal to tp such that the multivariate regression of Y on tj and i^ further maximizes the amount of variance explained. This process may be continued until Y can be sufficiently well approximated by regression on a limited set of Z-components, T = [tj, i^^ ..., t J . Since each y-variable is fitted by a linear combination of the A X-components, each Xcomponent itself being a linear combination of the predictor variables, the Yvariables can finally be expressed as a linear combination of the X-variables. It should be noted that when the same number of A-components is used as there are K-variables, i.e. A = m, we can no longer speak of reduced rank regression. The solution then has become entirely equivalent with unconstrained multivariate regression. The question of how many components to include in the final model forms a rather general problem that also occurs with the other techniques discussed in this chapter. We will discuss this important issue in the chapter on multivariate calibration. An alternative and illuminating explanation of reduced rank regression is through a principal component analysis of Y, the set of fitted y-variables resulting from an unrestricted multivariate multiple regression. This interpretation reveals the two least-squares approximations involved: projection (regression) of Y onto X, followed by a further projection (PCA) onto a lower dimensional subspace. 35.5.2 Algorithm The interpretation also suggests the following simple computational implementation of reduced rank regression. Step 1. Multivariate least squares regression of Y on X (compare Section 35.4): Y = X(X^X)-iX^Y
(35.15)
326
Equation (35.15) represents the projection of each 7-variable onto the space spanned by the X-variables, i.e. each F-variable is replaced by its fit from multiple regression on X. Step 2. Next one appUes an SVD (or PCA) to centered Y, denoted as Y*(= Y - I m ^): Y* = U S V ^
(35.16)
Step 3. Dimension (rank) reduction by only retaining A major components to approximate Y*. This gives the RRR fit: Y*,A] = ^A]StAiVtIj
(35.17)
Step 4. The RRR model coefficients are then found by a multivariate linear regression of the RRR fit, Yj^j = (Y*j^j + In"** Y ) ^^ original X, which should have a column of ones: BRRR
= ( X ^ ) - ' X\Y\^^+
l„m^)
(35.18)
35.5.3 Discussion A major difference between reduced rank regression and canonical correlation analysis or Procrustes analysis is that RRR is a regression technique, with different roles for Y and X. It is an appropriate method for the simultaneous prediction of many correlated K-variables from a common set of X-variables through a few X-components. Since reduced rank regression involves a PCA of Y, its solution depends on the choice of scale for the 7-variables. It does not depend on the scaling of the X-variables. The reduction to a few factors may help to prevent overfitting and in this manner it stabilizes the estimation of the regression coefficients. However, the most important factor determining the robustness of any regression solution is the design of the regressor data. When the X-variables are highly correlated we still have no guarantee that unstable minor X-factors are avoided in the regression. In that case, and certainly when X is not of full rank, one may consider to base the regression on all but the smallest principal components of X. The ill-conditioning problem does not occur in the following example. 35.5.4 Example Let us try to relate the (standardized) sensory data in Table 35.1 to the explanatory variables in Table 35.3. Essentially, this is an analysis-of-variance problem. We try to explain the effects of two qualitative factors, viz. Country and Ripeness, on the sensory responses. Each factor has three levels: Country = {Greece, Italy,
327
Spain} and Ripeness = {Unripe, Normal, Overripe}. Since not all combinations out of the complete 3x3 block design are duplicated, there is some unbalance making the design only nearly orthogonal. We treat this multivariate ANOVA problem as a regression problem, coding the regressors as indicated in Table 35.3 and omitting the Italy column and Normal column to avoid ill-conditioning of X. Some of the results are collected in Table 35.7. Table 35.7a shows that some sensory attributes can be fitted rather well by the RRR model, especially 'yellow' and 'green' (/?^« 0.75), whereas for instance 'brown' and 'syrup' do much worse (/?2 ^ 0.40). These fits are based on the first two PCs of the least-squares fit (Y). The PCA on the OLS predictions showed the 2-dimensional approximation to be very good, accounting for 99.2% of the total variation of Y. The table shows the PC weights of the (fitted) sensory variables. Particularly the attributes 'brown', and to a lesser extent 'syrup', stand out as being different and being the main contributors to the second dimension. TABLE 35.7 (a) Basic results of the reduced rank regression analysis. The columns PCI and PC2 give the weights of the PCA model of Y (OLS fitted Y). The columns /?2 (in %) show how well Y and Y are fitted by the first two principal components of Y. /-variable
PCl(Y)
PC2(Y)
RK%
/?2(Y)
Yellow
0.50
-0.35
99.9
77
Transp
0.44
+0.06
99.2
53
Glossy
0.44
+0.24
99.9
58
Green
-0.47
+0.40
99.6
73
Brown
-0.16
-0.73
98.2
41
Syrup
-0.34
-0.35
97.3
41
Overall R^:
80.8
18.4
99.2
57
(b) The columns PCI and PC2 give the X-weights of the PCA model of Y (OLS fitted Y). The columns /?2 (in %) show how well the X variables are fitted by the first two principal components of Y. X-variable
PCI
PC2
R^
Intercept
-0.97
-0.86
-
Greece
-0.17
+1.91
99
Spain
+3.23
+1.03
94
Unripe
-0.88
-0.35
4
Overripe
+0.14
-0.13
7
328
The two principal axes can also be defined as linear combinations of the explanatory variables. This is given in Table 35.7b. The larger coefficients for the Country variables when regressing the PCs on the four predictor variables show that the country of origin is strongly related to the most predictive principal dimensions and that the state of ripeness is not. This also appears from the fact that the Country variables can be fitted very well (high R^) by the first two PCs, in contrast to the low R^ values for the ripeness variables. In other words, the country of origin is the dominant factor affecting the appearance of olive oils whereas the state of ripeness has little effect. The first predictive dimension mainly represents a contrast between the olive oils of Spanish origin versus non-Spanish origin and to a much lesser extent a contrast between unripe versus the (over)ripe olives. The second predictive factor, which is mostly used to fit the 'brown' sensory attribute, represents a contrast between Italy and Greece, with Spain in the middle. Fig. 35.5 summarizes the relationships between samples (objects), predictor variables and dependent variables. The objects are plotted as standardized scores (first two columns of (n-iy^^V), the variables as loading vectors, taken from X'^U and Y^U, respectively, scaled to fit on the graph. For a thorough treatment of biplotting the results of rank reduced multivariate regression models, see Ref [8]. By combining the coefficients in the two parts of the table one can express each sensory attribute in terms of the explanatory factors. Note that the above regression
Fig. 35.5. Biplot of reduced rank regression model showing objects, predictors and responses.
329
model is defined in terms of binary regressor variables, which indicate the presence or absence of a condition. Italian olive oils, for example, are defined as not Greek and not Spanish, and the variables indicating the country of origin, 'Greece' and 'Spain', are both set to 0. For example: Yellow = 0.50*PC1 - 0.35*PC2 = 0.5*(-0.97-0.17* 'Greece' + ...)-0.35*(-0.86 + 1.91 * 'Greece' + ...) = -0.22-0.74*'Greece' + 1.25* 'Spain' - 0.26* 'Unripe' + 0.19* 'Overripe' For an unripe Spanish olive oil this works out as: Yellow = -0.22 - 0.74*0 + 1.25*1 - 0.26*1 + 0.19*0 = 0.77. Since the sensory data were standardized one needs to multiply by the standard deviation (19.5) and to add the average (50.9) to arrive at a prediction in original units, viz. 50.9+0.77*19.5 ^ 66.
35.6 Principal components regression 35.6.1 Introduction In principal components regression (PCR) first a principal component analysis (Chapters 17 and 31) is performed on X, then the 7-variables are regressed on these PCs of X. PCR also combines the two steps of regression and dimension reduction. Compared with reduced rank regression the order of these two basic steps is reversed. The major difference, however, is that the dimension reduction pertains to the predictor set X, and not to the dependent variables. In PCR, therefore, the definition of Z-components is determined prior to the regression analysis, the F-variables not playing a role at this stage. As in the other approaches PCR modelling proceeds factor by factor, the number of factors A to be determined by some model validation procedure (Chapter 36 on Multivariate Calibration). 35.6.3 Algorithm The computational implementation of principal components regression is very straightforward. Step 1. First, carry out an SVD (or PC A) on centered X: X = U S V^ Step 2. Multivariate least squares regression of Y on the major A principal components, using either the unit-norm singular vectors U^^j, or the principal components T^^j = XV^^^ = Uf^jS^^^:
330
The equation represents the projection of each K-variable onto the space spanned by the first A PCs of X. Step 4. The PCR model coefficient matrix, pxm Bp^R, can be obtained in a variety of equivalent ways: BpcR = (X^X)-' X^Y = V,^,Sf^, U,^^, Y = V,^,(T,I,T,^,)-'T,X, Y The vector of intercepts is obtained as: b = (my - nix ^PCR)^35.6.3 Discussion The PCR approach has many attractive features. First of all there is the aspect of a prior dimension reduction of the data set (measurement table) X. Using PCA this is done in such a way as to maintain the maximum amount of information. The neglected minor components are supposed to contain noise that is in no way relevant for the relation with Y. Another advantage is that the principal components are orthogonal (uncorrelated). This greatly simplifies the multiple regression of the y-variables, allowing the effect of the individual principal components to be assessed independently. The chief advantage is that the major principal components have, by definition, large variance. This leads to a stable regression as the variance of an estimated regression coefficient is inversely proportional to the variance of the regressor (si = s^ /Z(A:? - Jc"); see Section 8.2.4.1) The orthogonality of the principal components has the advantage that the effect of the various PCs are estimated independently: multiple regression becomes equivalent with a sequence of separate regressions of the response(s) on the individual PCs. The fact that the X-components are chosen on the basis of representing X rather than Y does not only have advantages. It also gives rise to a major concern. What if a minor component happens to be important for the regression? And what is the use of a major principal component if it is not related to Y? The answer to the latter question is simple: it is of little use but it does not harm either. The problem of discarding minor X-components that possibly are highly correlated to Y is more severe. One way to address this problem is to include the minor components in the regression if they are really needed. That is, one should go on adding principal components in the regression model until Y is fitted well, provided such a model also passes the (cross-)validation procedure (Section 36.10). Another strategy that is gaining popularity is to enter the principal components in a different order than the standard order of descending variance (PCI, PC2,...). Rather than this top-down procedure, one may apply variable selection: one starts with the principal component that is most highly correlated with the
331
responses, then move to the PC with the second highest correlation, etc. The only thing that is needed is to compute the correlation coefficients of the PCs with each response. For a univariate y the PCs may then be ranked according to their descending (squared) correlation coefficient. By applying this forward selection procedure one ensures that highly correlating PCs are not overlooked. For multivariate response data Y one should compute for each PC an average index of its importance for all 7-variables together, e.g. the average squared correlation or the total variance explained (I I Yl P = 11 u J Yl P) by that PC. Comparative studies have shown that the latter method of PCR frequently performs better, i.e. it gives good predictive models with fewer components [9]. Since principal components regression starts with a PCA of X, its solution depends on the particular scaling chosen for the X-variables. It does not depend on the scaling of the 7-variables. When the maximum number of factors are used the regression model becomes equivalent with multivariate regression. There is no special multivariate version of principal component regression: each K-variable is separately regressed on the set of X-components. One might also consider regressing the major PCs of Y or of Y (Eq. 35.14) on the PCs of X.
35.7 Partial least squares regression The purpose of Partial Least Squares (PLS) regression is to find a small number A of relevant factors that (/) are predictive for Y and (//) utilize X efficiently. The method effectively achieves a canonical decomposition of X in a set of orthogonal factors which are used for fitting Y. In this respect PLS is comparable with CCA, RRR and PCR, the difference being that the factors are chosen according to yet another criterion. We have seen that PCR and RRR form two extremes, with CCA somewhere in between. RRR emphasizes the fit of Y (criterion ii). Thus, in RRR the Xcomponents t- preferably should correlate highly with the original 7-variables. Whether X itself can be reconstructed ('back-fitted') from such components t^ is of no concern in RRR. With standard PCR, i.e. top-down PCR, the emphasis is initially more on the X-side (criterion /) than on the F-side. CCA emphasizes the importance of correlation; whether the canonical variates t and u account for much variance in each respective data set is immaterial. Ideally, of course, one would like to have the best of all three worlds, i.e. when the major principal components of X (as in PCR) and the major principal components of Y (as in RRR) happen to be very similar to the major canonical variables (as in CCA). Is there a way to combine these three desiderata — summary of X, summary of Y and a strong link between the two — into a single criterion and to use this as a basis for a compromise method? The PLS method attempts to do just that.
332
PLS has been introduced in the chemometrics literature as an algorithm with the claim that it finds simultaneously important and related components of X and of Y. Hence the alternative explanation of the acronym PLS: Projection to Latent Structure. The PLS factors can loosely be seen as modified principal components. The deviation from the PCA factors is needed to improve the correlation at the cost of some decrease in the variance of the factors. The PLS algorithm effectively mixes two PCA computations, one for X and one for Y, using the NIPALS algorithm. It is assumed that X and Y have been column-centred as usual. The basic NIPALS algorithm can best be demonstrated as an easy way to calculate the singular vectors of a matrix, viz. via the simple iterative sequence (see Section 31.4.1): t = Xw
(35.19)
wocX^t
(35.20)
for X, and u = Yq
(35.21)
q oc Y^u
(35.22)
for Y. The «:-symbol is used here to imply that the resultant vector has to be normalized, i.e. w^w = q^q = 1. In eq. (35.19) t represents the regression coefficients of the rows of X regressed on w. Likewise, w in eq. (35.20) is proportional to the vector of regression coefficients obtained by regressing each column (variables) of X on the score vector t. This iterative process of criss-cross regressions is graphically illustrated in Fig. 35.6. Iterating eq. (35.19) and eq. (35.20) leads w to converge to the first eigenvector of X^X. One may easily verify this by substituting eq. (35.19) into eq. (35.20), which yields w oc X^t«: X^Xw, the defining relation for an eigenvector. Similarly, t is proportional to an eigenvector of XX^. It can be shown that the eigenvectors w and t are the dominant eigenvectors, i.e. the ones corresponding to the largest eigenvalue. Thus, w and (normalized) t form the first pair of singular vectors of X. Likewise, q and (normalized) u are the dominant eigenvectors of Y^Y and YY^, W
X Fig. 35.6. Principle of SVD/NIPALS algorithm.
333
w
u
T Fig. 35.7. Principle of PLS/NIPALS algorithm.
respectively, or the first pair of singular vectors of Y. Once this first pair of singular vectors is determined one extracts this dimension by fitting t to X (or u to Y) and proceeding with the matrix Ej (or Fj) of residuals. Using the residual matrix Ej (or Fj) and the basic NIPALS algorithm one may find the pair of dominant singular vectors which in fact is the second pair of singular vectors of the starting matrix X (or Y). The process is repeated until the starting matrix is fully depleted. Instead of separately calculating the principal components for each data set, the two iterative sequences are interspersed in the PLS-NIPALS algorithm (see Fig. 35.7): Wcx: X^U
(35.23)
t = Xw
(35.19)
qocY^t
(35.24)
u = Yq
(35.21)
One starts the iterative process by picking some column of Y for u and then repeating the above steps cyclically until consistency. Upon convergence we have w oc X^u oc X^Yq oc X^YY^t oc X^YY^Xw. Thus, w is an eigenvector of X'^YY'^X and, similarly, q is an eigenvector of Y^XX^Y [10]. These matrices are the two symmetric matrix products, viz. (X'^Y)(X'^Y)'^ and (X'^Y)'^(X'^Y), based on the same cross-product matrix (X^Y). Apart from a factor (n - 1), the latter matrix is equal to the matrix of inter-set covariances. Another interpretation of the weight vectors w and q in PLS is therefore as the first pair of singular vectors of the CO variance matrix X^Y. As we found in Chapter 29 this first pair of singular vectors forms the unique pair of normalized weight vectors that maximizes the expression w^(X^Y)q = (Xw)'^(Yq) = t^u. Up to a factor (n - 1), the latter inner product equals the covariance of the two score vectors t = Xw and u = Yq. This then leads to the following important interpretation of the PLS factors: t = Xw and u = Yq are chosen to maximize their covariance [10,11].
334
Let us take a closer look at this covariance criterion. A covariance involves three terms (see Section 8.3): cow(i,u) = s,s^r,^
(35.25)
or, taking the square, cov(t,u)2 = var(t) var(u) r,l
(35.26)
Thus, the PLS covariance criterion capitalizes on precisely the three links that connect two sets of data via their latent factors: (i) the X-factor t should have appreciable variance, var(t); (ii) similarly, the K-factor u should have a large variance, var(u), and (iii) the two factors t and u should be strongly related also (high r^). Of the three aspects inherent to the covariance criterion (35.26), CCA just considers the so-called inner relation between t and u as expressed by r^, RRR entirely neglects the var(t) aspect, whereas PCR emphasizes this var(t) component. One might maintain that PLS forms a well-balanced compromise between the methods treated thus far. PLS neither emphasizes one aspect of the X<->Y relation unduly, nor does it completely neglect any. The covariance criterion as such suggests a symmetrical situation, X and Y playing equivalent roles. In fact, up to here, there is little difference with Procrustes analysis which also utilizes the singular vectors of the covariance matrix (Section 35.2). The difference is that in PLS the first X-factor, say tj, is now used as a regressor to fit both the X-block and the K-block: X(=Eo) = t,pT + E ,
(35.27a)
Y(=:Fo) = t , c ^ + F i
(35.27b)
Here, the loading vector pj contains the coefficients of the separate univariate regressions of the individual X-variables on tj. The/^ element of Pj, py, represents the regression coefficient of X: regressed on ti'-Pij = ^J^j f tJ ty The full vector of loadings becomes p, = E^tj / t ^ t j . Similarly, Cj contains the regression coefficients relating tj to the K-variables: Cj =FQ t^/tjt^. The residuals of these regressions are collected in residual matrices Ej and Fj: E,=E,-i,pJ (35.28a) F, =Fo-t,c;^
(35.28b)
A second PLS factor t2 is extracted in a similar way maximizing the covariance of linear combinations of the residual matrices Ej and Fj. Subsequently, Ej and Fj are regressed on t2, yielding new residual matrices Ej and F2 from which a third PLS
335
factor t3 is computed, and so on. If one does not limit the number of factors, the process automatically stops when the Z-matrix has been fully depleted. This occurs when the number of factors A equals the rank of X, i.e. A = mm(n - 1, /?). As for PCR, such a full-rank PLS model is entirely equivalent with multivariate regression on the original X-variables. The PLS algorithm is relatively fast because it only involves simple matrix multiplications. Eigenvalue/eigenvector analysis or matrix inversions are not needed. The determination of how many factors to take is a major decision. Just as for the other methods the 'right' number of components can be determined by assessing the predictive ability of models of increasing dimensionality. This is more fully discussed in Section 36.5 on validation. Let us now consider a new set of values measured for the various X-variables, collected in a supplementary row vector x*. From this we want to derive a row vector y* of expected F-values using the predictive PLS model. To do this, the same sequence of operations is followed transforming x* into a set of factor scores {^1*, ^2, ..., ^^ } pertaining to this new observation. From these r*-scores y * can be estimated using the loadings C. Prediction starts by equating yo to the mean (my) for the training data and removing the mean m^ from x* giving CQ : y 0 = my
6*0 = X* - n i x
Then we compute the score of the new observation x* on the first PLS dimension and from that we calculate an updated prediction (y\) and we remove the first dimension from CQ giving e\: r, =eo' w,
^1 =^o-h
Pi
This sequence is repeated for dimension 2:
62 —Cj — ^2 P 2
and so on.
336
Alternatively, one may obtain predicted values directly as y*o =niY + (x*-mx)'^BpLs using the matrix of regression coefficients BpLs, as estimated by the PLS method. It may be shown that a closed expression for these coefficients can be obtained from the weights and loadings matrices [12]: BpLs = W(P^W)-'CT-. 35.7.2 NIPALS-PLS Algorithm Here we summarize the steps needed to compute the PLS model 1. 2. 3. 4. 5.
E = X-lml/n
F =Y E<-E*diag(l/std(E)) F ^ F * diag(l/std(Y)) ssqX = sum_of_squares(E) 6. ssqY = sum_of_squares(F) 7. for a = 1 to A 8. Set u = first column of F 9. Iterate until convergence 10. w = E^u, w <— w/(w^w)"^ <X factor weights> 11. t = Ew <X factor scores> 12. c = FMt^t) 13. u = FTc/(c^c)"2 End 14. 15. p = E'^t/(t'^t) <X factor loadings> 16. E<-E-tp^ 17. F <- F-tc^ 18. r2x = (t'rt)(p^p)/ssqX <X sum of squares accounted for by t> r2y = (t'^t)(c^c)/ssqY 19. 20. Save results for current factor w, t, c, p, u, r2x, r2y into W, T, C, P, U, R2X, R2Y 21. End 22 BpLs = W(P^W)'CT^
Slightly different implementations of the above PLS-NIPALS algorithm exist. They mostly differ in the chosen normalization of w, t or p (here Iwl = 1). This is not an important issue, but it may be a cause of confusion when comparing results from different (software) implementations. That the normalization is of no real importance can be seen as follows. Let us say we choose to multiply the weight
337
vector w by some scalar constant. Then, the length of the score vector t increases by the same factor. In contrast, the lengths of the loading vectors p and c decrease by that factor. Since the loadings appear in combination with scores (e.g. tp^, tc^, (t^t)(p^p), and (tTt)(c^c)) or with weights (W(P'rW)-i C^), the chosen multiplication factor has no effect. A slight alternative to the above NIPALS algorithm is replacing the iteration loop (line 8-14) by: 8'. 9\
q
10.
W = E^U, W
11. 12.
t = Ew c = FV(ft)
(F'^EE'^F)
This formulation is attractive since it gives the weight vector q accurately and (usually) very fast. The latter holds true since more often than not the number of response variables (m) is not that large, hence the matrix F^EE^F (mxm) is usually quite small. This is especially true when Y is univariate. One may verify that in this situation Y and F become vectors,say y and f, hence the loading 'vector' c = F'^t/(t'^t) becomes a scalar, c = f^t/(t'^t), and so do the 'matrix' F'^EE'^F (=f^EE^f) and its eigenvector: q=^=l. The NIPALS procedure also becomes very efficient when there is a single response y as it requires only one passage through the iteration steps (lines 9-14). The distinction between multivariate and univariate Y is an important one and the two situations are distinguished by the notation PLS2 and PLSl, respectively. In this chapter, which is about relating two multivariate data tables, the emphasis is on PLS2, and the case of univariate y is not considered. PLSl regression is considered further in Chapter 36. 35.7.3 Discussion One may wonder why it is necessary to take the variance of predictors t^ into account, when the goal, after all, is not to fit X but to fit and predict Y. There is a good reason, however, to prefer a fit based on a large-variance component t above an equally good fit based on a component of small variance. This ties in with the intuitive appeal that a regression relation should preferably be based on a regressor varying over as wide a range as possible. Or, more precisely, the variance of an estimated regression parameter is inversely proportional to the variance of the regressor (Section 8.4 and 10.4). As an example we try to model the relation between the sensory data of Table 35.1 and the instrumental measurements of Table 35.4. The PLS analysis results are shown in Table 35.8. The first PLS dimension loads about equally high on
338
S52
^?eaB2 S61 S53 G12
133
U
132
1
322
G13 G11 132
-2
012
142 131 T
-4
'
1
'
-
2
0
2
4
t1 Fig. 35.8. Plot of inner relation (u versus t) for PLS dimension 1.
peroxide, K232 and K270 and much less on DK and Acidity. As to the K-variables all variables are about equally well approximated from tj. Thus the first PLS factor (tj) tries to model some average overall appearance feature (Ui) of the F-block. There is a good linear relation between the first X-factor and the first F-factor (Fig. 35.8). One notices that this dimension separates the Spanish olive oils from the other nationalities. The second dimension captures a different feature. It chiefly loads on acidity and a little bit on peroxide and DK. This second PLS factor t2 loads mainly on the *brown' variable and slightly on 'green'. A scatter plot of U2 versus t2 again reveals a fairly good correlation (0.69) and, this time, a contrast between the Greek and the non-Greek oils (Fig. 35.9). While a plot of u-scores versus t-scores shows the strength of the linear relation, a scatter plot of the t-scores for the first few dimensions gives a view of the arrangements of the objects in multivariate space. From Fig. 35.10 one concludes, for example, that there is a weak clustering of objects according to the country of origin. Figure 35.11 shows the pattern of loadings of the predictors and the dependent variables. One observes that the first PLS factor is associated with a contrast between glossy, transparency and yellowness on the one hand and green, syrupy and brown on the other. One also sees that the UV measurements K232 and K270 are most relevant for this dimension, or that K232 is more associated with 'syrupy' and 'brown', whereas K270 and DK are more associated with the attribute 'green'. The various projections obtained via PLS regression such as the ones shown in Figs. 35.8-11 are a powerful tool in interpreting the structure and the relationship of the two datasets.
339
G11 G12
1^
U
G13 022
52 1
ssg l3iS61^
142 -l
,—1—r-
-3
1
— , — . — 1 — , —
1
-2
0
1
2
3
t2 Fig. 35.9. Plot of inner relation (u versus t) for PLS dimension 2.
G11 G13
132
k322
^
133^ 132
G12
131 -2
^
^
^
U,
142
-3
-2
-1
0 tl
Fig. 35.10. Scatter plot of samples in first two PLS dimensions.
340 u . a. ACIDITY
0.6
d i m
2
DK green
0.3-
K270
glossy
0.0
transpar yellow
^gsyrupy
-0.3
PI :ROXIDE brown
-0.6 1
—
'
—
•
—
1
—
-0.6-0.3
'
—
'
—
—
0.0
,
—
,
—
,
—
.
0.3
—
.
—
1
—
.
0.6
—
,
—
f
J
0.9
dim 1 Fig. 35.11. Loadings of the two dominant PLS factors on the X-variables (lower case) and 7-variables (upper case).
35.7.4 Alternative PLS algorithms The algorithm described above is the classical PLS algorithm as developed by Wold and Martens in the eariy 1980s. There is a choice of different algorithms all deriving the PLS model. Most of these algorithms aim at increasing the efficiency of the algorithm when the number of objects (n) or the number of variables (p) is very large. Rather than working with the X- and F-matrices (or their residual offspring) these algorithms work with cross-product matrices of smaller dimension. An overview and comparison of these algorithms has been given in Ref. [13]. The SIMPLS algorithm [14] was devised as a statistically /nspired modification of the PLS algorithm and led to a PLS model that is slightly different from the classical PLS/NIPALS algorithm when the response is multivariate (PLS2). SIMPLS solves the covariance optimization in the same way as CCA solves the correlation optimization problem and PCA solves the variance optimization problem, viz. as a set of constrained optimization problems. An outstanding feature of SIMPLS is that the weight matrix, called R to avoid confusion with the NIPALS/PLS weight matrix W, applies to the original data. The restriction is that any newly formed PLS factor t^ = Xr^ should be orthogonal to its predecessors tb = Xr^, \
341 TABLE 35.8 Summary of PLS results
OL
X
Y
PLS dimension
PLS dimension
W_
(2)
(3)
(4)
(5)
-21 -53 -56 -51 -30
77 -44 -22 16 35
11 -20 2 59 -76
53 61 -25 -36 -38
-24 31 -75 47 21
Yellow Green Brown Glossy Transpar Syrupy
-246 -507 -545 ^88 -394
814 -368 -271 128 349
107 -174 -30 625 -755
497 659 -361 -292 -349
-241 316 -756 471 218
Yellow Green Brown Glossy Transpar Syrupy
-1969 2504 -556 724 -708 620 1205 -221 1106 -1533 1725 1708 333 1483 26 -2617 -878 -499 -843 -539 -760 -1797 -554 -965 -615 -106 -1024 2676 -1905 1171 1411 -459 647 1770 144 -218 1029 -73 -548 1542 -513 -281 2215 62 -505 1822 39 77
-356 -54 -55 282 130 -254 -206 260 390 62 -387 -985 267 -129 391 645
-332 -105 -385 323 64 176 -36 350 -275 -189 296 115 -5 -134 218 -81
(2)
(3)
(4)
(5)
39 -36 -39 44 41 -42
-29 40 -79 11 -2 -32
-66 55 47 13 6 -1
5 -16 25 -12 -22 91
66 -40 16 22 11 -55
375 -343 -377 423 395 -405
-199 266 -526 76 -14 -218
-337 280 243 70 31 -9
43 -136 209 -102 -181 734
541 -328 139 188 90 -456
Weights Acidity Peroxide K232 K270 DK Loadings Acidity Peroxide K232 K270 DK Scores Gil G12 G12 G13 G22 131 132 132 133 142 S51 S52 S53 S61 S62 S63
524 253 -701 -1595 1662 -2075 782 1726 1103 -2969 319 1204 1482 -722 -1085 389 -359 -797 1138 837 -675 594 -620 643 -204 464 -280 -3220 -957 -491 283 701 449 -729 -1422 764 -138 -1191 -2018 606 867 -162 -1498 -258 1040 -2643 -2423 138 338 1041 2328 -278 -311 -725 997 3172 987 235 -453 -1179 694 -495 -750 285 283 1158 -869 -112 701 -269 522 2110 -721 40 164 2093 -766 -99 493 646
% VAF Acidity Peroxide K232 K270 Dk
17.6 74.8 86.2 69.1 45
95 90.6 94.8 71 59.3
95.8 92.7 94.8 97.5 97.8
99.7 99.5 96.9 98.8 99.7
100 100 100 100 100
Yellow Green Brown Glossy Transpar Syrupy
40.9 34.3 41.3 51.9 45.4 47.6
45.6 42.6 73.7 52.6 45.4 53.2
53.3 47.9 77.7 52.9 45.5 53.2
53.3 48.2 78.4 53.1 46 61.6
54.9 48.8 78.5 53.2 46.1 62.7
Average
58.5
82.1
95.7
98.9
100
Average
43.6
52.2
55.1
56.8
57.4
342
vector r^ to be orthogonal to prior loading vectors p^, since t J t^ = rJ X \ = rj p^/( 111^). In this way one avoids the deflation step of X for each separate dimension. It turns out that NIPALS/PLS2 does not truly maximize the covariance criterion after the first dimension whereas SIMPLS does. This leads to a slight difference between the NIPALS/PLS2 model and the SIMPLS model. From a practical point of view the difference is negligible. Once the weight matrix R is found, one can immediately find the regression model.
35.8 Continuum regression methods We have seen that PLS regression (covariance criterion) forms a compromise between ordinary least squares regression (OLS, correlation criterion) and principal components regression (variance criterion). This has inspired Stone and Brooks [15] to devise a method in such a way that a continuum of models can be generated embracing OLS, PLS and PCR. To this end the PLS covariance criterion, cov(t,y) = Sf Sy r^, is modified into a criterion T = s^^^^'^^Sy r^. (For simplicity, we assume a univariate response, in which case u = y.) For a = 0 the variance of t does not play a role, since s^ = I. Maximizing Tis equivalent with maximizing the correlation r^, and a = 0 corresponds to ordinary least squares regression. For a = 1/2, T = ^J^Y^'rv == cov(t,y), which is the PLS criterion. For a -> 1 we have that (x/(l - a ) -> oo and Tis dominated by the s^ component, i.e. the variance of t should be maximal. Hence, a = 1 corresponds to PCR. By varying a between 0 and 1 we can generate a continuum of regression models, including OLS (a = 0), PLS(a = 1/2) and PCR(a = 1) as special cases. One hopes that by optimizing the additional parameter a through cross-validation better models may be obtained. The method has been aptly called continuum regression (CR). When Y, is multivariate the method is called jomr continuum regression (JCR)[16]. The limit a = 0 then corresponds to reduced rank regression. Since the algorithmic implementation of (joint) continuum regression is rather cumbersome we discuss two simpler alternatives. Principal covariates regression (PCovR) is a technique that recently has been put forward as a more flexible alternative to PLS regression [17]. Like CCA, RRR, PCR and PLS it extracts factors t from X that are used to estimate Y. These factors are chosen by a weighted least-squares criterion, viz. to fit both Y and X. By requiring the factors to be predictive not only for Y but also to represent X adequately, one introduces a preference towards the directions of the stable principal components of X. The easiest way to envision PCovR is as a PCA of the supermatrix Z obtained by adjoining Y and X, Z = [YIX]. Here, Y is the least squares fit of Y by X as given by eq. (35.14). By including a relative weighing of the two matrices Y and X
343
constituting Z = [(1 -a)YlaX], one may, to some extent, control the outcome of the analysis. If the importance of X is downgraded (a -^ 0), Y dominates and the analysis becomes entirely equivalent to RRR. In the opposite case (a -^ 1), X will dominate and the analysis becomes equivalent to PCR. When Y and X are scaled to equal "size", i.e. sum of squares, they are weighted equally and the results resemble those of PLS. The main advantage of PCovR lies in its flexibility of being able to move between the two extremes. It releases the data analyst from making a choice between the various techniques, letting the data determine the best predictive model. One might also consider it an advantage of PCovR compared to PLS that it employs the familiar least-squares criterion. PCovR also allows for a nice graphical presentation of the dependent and independent variables in conjunction through a biplot of the augmented Z data table. Figures 35.12-14 show such biplots for the sensory and physico-chemical data. Figure 35.12 shows the RRR extreme, optimally displaying the fitted sensory variables (total Y variance accounted/or by two dimensions, VAF = 54%, i.e. 95% of the maximum attainable percentage 57%) and the explanatory variables (VAF = 77%) projected on this PCs-of-Y plane. Figure 35.13 shows the PCR extreme, optimally displaying the explanatory variables (VAF = 82%) and the sensory variables (VAF = 51%, or 89% of maximum) projected on this PCs-of-X plane. Figure 35.14 forms the PCovR compromise, accounting for 53% of the total Y-variance (94% of maximum) and 81% of the total X-variance. These results and the accompanying graphs show once again that the difference between the various methods is 15%
] ^^m G13C
GREEN
\
GS22C
S52C
Q12P
GLOSSY
80%
»2iErSfflS62C Y^mjp::^afi^^^^^* J
.^^^^
^^ I32P
VEILOIV
\BR6WN 1
-3
1
,
1
1
fJ
0 PCI
Fig. 35.12. Scores, X-loadings and F-loadings for the two dominant RRR dimensions.
344
23%
P C 2
59%
I YELLOW^ -1
PEROX?DE / BROWN 142 -2
-1 PCI
Fig. 35.13. Scores, X-loadings and K-loadings for the two dominant PCR dimensions.
20 %
G22 d i m
S52
IMQSSY
G12 ^ YELLOW IPEROXYD^ BROWN 1142 —r-1 dim 1 (66%) Fig. 35.14. Scores, X-loadings and K-loadings for the two dominant PCovR dimensions.
345
marginal in this case. The reason is that the principal components of X and those of Y are already highly correlated, i.e. both sets of data can be explained by the same latent variables. Finally, another alternative to continuum regression has been put forward by Wise and de Jong [18]. Their continuum power-PLS (CP-PLS) method modifies the matrix X = USV^ into X^^^ = US^^^V^, i.e. the singular values are raised to a certain power y = oc/(l - a) and a modified ('powered') predictor matrix is constructed. Then one applies PLS regression and the results are back-transformed to the original X matrix. Again by changing a from 0 to 1 (or y from 0 to «>) the model changes from RRR via PLS to PCR.
35.9 Concluding remarks Different techniques have been presented that can be used to relate two measurements tables. Are we now in a position to answer the obvious question: Which one is to be preferred? As expected there is no clear cut answer to this question. It all depends on the objective of the comparison and the nature of the data. Some general conclusions can be drawn, however. Procrustes analysis stands apart from the other techniques in that it is not a regression technique. Its main use lies in the comparison of two or more data sets which are supposed to be essentially similar apart from basic transformations such as translation, rotation and reflection. Procrustes analysis finds those transformations needed to match the sets of data. Multivariate regression is only useful when both the number of dependent variables and the number of regressors are not too large. It does not provide an interpretation in terms of latent variables. All other regression techniques discussed apply some form of dimension reduction and become essentially equivalent to multivariate regression when they are carried through to the maximum number of factors. Canonical correlation analysis is not to be recommended when the interest lies in understanding or predicting Y from X. However, when the aim is to discover dimensions in Y and X that are very 'similar', and this in itself is of substantial interest, then CCA can be a very useful tool. Reduced rank regression is especially useful when the Y data are highly collinear and the X data are not. Hence it is especially useful when X is designed and Y is multivariate. It is of little use when the Y data are nearly orthogonal. Principal component regression is to be preferred when the X data are highly collinear. It provides little advantage when the predictor variables are nearly orthogonal. Partial least squares should do well under most circumstances. When the X data are orthogonal, e.g. when coming from a designed experiment, PLS becomes equivalent to reduced rank regression. Principal covariates regression is in many ways quite comparable to PLS. It adds some additional flexibility, covering RRR and PCR as extreme cases, as does
346
continuum regression. When we are in the lucky circumstance that the major dimensions of X and Y are strongly related, then it will be hard to miss this by any technique: all will reveal this strong relation. On the other extreme side, when there is no relation whatsoever between the two data tables, all techniques will arrive to the same negative conclusion. The similarities and differences between the regression methods have been investigated in various instances (see e.g. [19]). The relation between CCA, PLS, SIMPLS and RRR has been discussed by Bumham, Viveros and MacGregor [20]. When X is an orthogonal design matrix PLS2 becomes equivalent with RRR [21]. Breiman and Friedman [22] have devised a method, called Curds-and-Whey, that uses the canonical variables and combines ideas from canonical correlation analysis and ridge regression (Section 10.6). A thorough discussion of various multivariate regression techniques and their application to QSAR problems is given by Jonathan and Stone [23]. A good collection of papers on the applicability of PLS regression to QSAR problems can be found in Ref [24]. Total Least Squares (TLS, [25]) is a technique popular with numerical analysts and engineers. It is a multivariate generalization of errors-in-variables regression or orthogonal distance regression (cf Section 8.2.11), where measurement errors in both X and Y are explicitly taken into account.
References 1. 2. 3. 4. 5. 6. 7. 8. 9.
10. 11. 12.
C. Peri, A. Garrido and A. Lopez, eds., The sensory and nutritional quality of virgin olive oil. Grasas y Aceitas, 45, (Symposium Issue) (1994) 1-81. J.C. Gower, Generalized Procrustes analysis. Psychometrika, 40 (1975) 33-51. J.M.F. ten Berge, Orthogonal Procrustes rotation for two or more matrices. Psychometrika, 42 (1977)267-276 G.B. Dijksterhuis, Procrustes analysis in sensory research, Ch. 7 in: Multivariate analysis of data in sensory science (T. Naes and E. Risvik, eds), Elsevier, Amsterdam (1996). H. Hotelling, Relationships between two sets of variates. Biometrika, 28 (1936) 321-327. T.W. Anderson, Introduction to Multivariate Statistical Analysis, 2nd ed., Wiley, New York (1984). P.T. Davies and M.K.S. Tso, Procedures for reduced-rank regression, Appl. Stat., 31 (1982) 244^255. C.J.F. ter Braak, Interpreting canonical correlation analysis through biplots of structure correlations and weights. Psychometrika, 55 (1990) 519-531. Y.L. Xie and J.H. Kalivas, Evaluation of principal component selection methods to form a global prediction model by principal component regression. Anal. Chim. Acta (1997) 348, 19-27. A. Hoskuldsson, PLS regression methods. J. Chemom., 2 (1988) 211-228. I.E. Frank, Intermediate least-squares regression method. Chemom. Intell. Lab. Syst., 1 (1987) 233-242. H. Martens and T. Naes. Multivariate Calibration. Wiley, Chichester, 1989.
347 13. 14. 15.
16. 17. 18. 19. 20. 21. 22. 23. 24. 25.
S. de Jong, Comparison of PLS algorithms. Chemom. Intell. Lab. Syst. (1998) S. de Jong, SIMPLS: an alternative approach to partial least squares regression. Chemom. Intell. Lab. Syst., 18 (1993) 251-263. M. Stone and R.J. Brooks, Continuum regression; cross-validated sequentially constructed prediction embracing ordinary least sqaures, partial least squares, and principal component regression. J. Roy. Stat. Soc. B52 (1990) 237-269. R.J. Brooks and M. Stone, Joint Continuum Regression for multiple predictands. J. Am. Stat. Assoc, 89 (1994) 1374-1377. S. de Jong and H.A.L. Kiers, Principal covariates regression. Chemom. Intell. Lab. Syst., 14 (1992) 155-164. S. de Jong and B.M. Wise, J. Chemom., 12, (1998). I.E. Frank and J.H. Friedman, A statistical view of some chemometrics regression tools. Technometrics, 35 (1993) 109-135. A. Bumham, R. Viveros, J.F. MacGregor, Frameworks for latent variable multivariate regression. J. Chemom., 10 (1996) 3 1 ^ 6 . 0. Langsrud and T. Nses, On the structure of PLS in orthogonal designs. J. Chemom., 9 (1995) 483-487. L. Breiman and J.H. Friedman, Predicting multivariate responses in multiple linear regression. J. Roy. Stat. Soc. B59 (1997) 3-37. M. Stone and P. Jonathan, Statistical thinking and technique for QSAR and related studies. J. Chemom. 7 (1993) 455-475; 8 (1994) 1-20, 303. S. Wold, in: QSAR: Chemometric Methods in Molecular Design, ed. H. van de Waterbeemd, VCH, Weinheim, 1995 (especially Ch 3.2, Ch 4.4, Ch 5.1, Ch 5.2). S. Van Huff el and J. Vandewalle, The Total Least Squares Problem: Computational Aspects and Analysis. SIAM, Philadelphia, PA, 1991.
Additional recommended reading R. Gittins, Canonical Analysis. A Review with Applications in Ecology. Springer-Verlag, Berlin, 1985. M. Tenenhaus, La Regression PLS — Theorie et Pratique. Editions Technip, Paris, 1998.
This Page Intentionally Left Blank
349
Chapter 36
Multivariate calibration 36.1 Introduction Multivariate calibration is the collective term used for the development of a quantitative model for the reliable prediction of properties of interest (y^, y^,..., y^) from a number of predictor variables (jCj, x^,..., x^). As an example, one may think of the spectroscopic analysis of a mixture in order to measure the concentration of one or more of its constituents. The goal of calibration, whether multivariate or not, is to replace a measurement of the property of interest by one that is cheaper, or faster, or better accessible, yet sufficiently accurate. Developing the calibration model includes stating the objective of the study, designing the experiment, choosing the type of model, estimating its parameters and the final stage of assessing the precision of the predictions. Multivariate calibration is set apart from univariate calibration because it involves more than one predictor variable (p> I). This gives rise to new opportunities, e.g. using a complete spectrum as a predictor rather than the signal at a single 'best' wavelength. Using the entire spectral information may, in principle, lead to better predictions. However, it also opens the possibility that essentially non-informative spectral regions are included in the model through chance correlations in the calibration set. Having more than one predictor admits the estimation of more than one dependent property (^ > 1) and correction for undesired covariates (interferences). In fact, the latter prospect is a major motivation for doing the multivariate measurements. In univariate calibration it is not possible to correct for interferences without additional information. With additional information, i.e. with multivariate data, the opportunity arises to separate the information relevant for the respective properties from non-relevant variation or random noise. The set of possible dependent properties and independent predictor variables, i.e. the number of possible applications of predictive modelling, is virtually boundless. A major application is in analytical chemistry, specifically the development and application of quantitative predictive calibration models, e.g. for the simultaneous determination of the concentrations of various analytes in a multicomponent mixture where one may choose from a large arsenal of spectroscopic methods (e.g. UV, IR, NIR, XRF, NMR). The emerging field of process analysis.
350
i.e. the analysis of chemical systems or processes using multiple sensors, depends heavily on the applicability of multivariate calibration models for the quantitative monitoring of the systems or processes of interest. Particularly, the application of near-infrared spectroscopy to analyze samples needing little or no pretreatment has found wide-spread use in the chemical and food industries [1]. The technique is also used to characterize product properties related to composition, e.g. octane number of gasolines, iodine value of fats and oils or crystallinity of polymers. Applications outside the field of analytical chemistry are, for example, the prediction of biochemical or pharmacological properties from structural parameters (QSAR) [2], the understanding of sensory profiles from physico-chemical data in food research [3], and the modelling of environmental data [4]. The ultimate goal of multivariate calibration is the indirect determination of a property of interest (y) by measuring predictor variables (X) only. Therefore, an adequate description of the calibration data is not sufficient: the model should be generalizable to future observations. The optimum extent to which this is possible has to be assessed carefully: when the calibration model chosen is too simple (underfitting) systematic errors are introduced, when it is too complex (overfitting) large random errors may result {cf. Section 10.3.4). In many applications the goal of predictive modelling is not a detailed understanding of the relation between dependent and independent variables. Ability to interpret the model, therefore, is not a requirement per 5^. This should not preclude the exploitation of available background knowledge on the problem at hand during calibration modelling. A model that can be sensibly interpreted certainly adds value and confidence to the calibration result. Section 36.2 deals with the types of regression models that are commonly used in multivariate calibration. The treatment of the subject will mostly focus on multi-component analysis, i.e. predicting the analyte concentrations in a mixture from some multi-channel responses. We will gradually add complexity to the analytical chemical problem (unknown pure spectra, interferents, less samples than channels, etc) which automatically calls for different calibration methods. These topics are better addressed after modelling aspects have been discussed. Section 36.3 deals with the important issue of validation and assessing the predictive performance. Section 36.4 treats procedures preceding the actual modelling stage, viz. design and data pretreatment and outlier detection. We conclude in Section 36.5 with briefly mentioning some recent developments, such as wavelength selection and calibration standardization. We will also discuss non-linear modelling as an extension to the mainstream linear calibration setting.
351
36.2 Calibration methods Let us assume that we have collected a set of calibration data (X, Y), where the matrix X (nxp) contains thep > 1 predictor variables (columns) measured for each of n samples (rows). The data matrix Y (nxq) contains the q variables which depend on the X-data. The general model in calibration reads Y = / ( X ; 0 ) + EY
(36.1)
where / is some multivariate function, 0 is a vector of parameters and Ey is the error. We will mainly discuss linear calibration functions, i.e./is a linear function of the parameters. The case of non-linear regression modelling is treated for the univariate case in Chapter 11. Non-linear multivariate modelling is briefly dealt with in the concluding Section 36.5 on recent developments. There are several good reasons to focus on linear models. Theory may indicate that a linear relation is to be expected, e.g. Lambert-Beer's law of the linear relationship between concentration and absorbance. Even when a linear relation does not hold strictly it can be a sufficiently good local approximation. Finally, one may try and find a transformation of the individual variables (e.g. a logarithmic transformation), in order to obtain an acceptable linear model for the transformed variables. Thus, we simplify eq. (36.1) to Y = X B +EY nxq
nxp pxq
(36.2)
nxq
where B is the matrix of regression coefficients. Each column of B contains the regression coefficients pertaining to the corresponding F-variable, i.e. column of Y. So far, we have deliberately refrained from using the term 'response'. In conventional statistical theory on calibration, the F-variables are regarded as random variables depending on the X-variables. The dependent F-variables respond to a change in the independent X-variables, not the other way around. In the traditional controlled calibration experiment the X-variables are designed. They are under full control and can be set accurately at the values prescribed by the experimental design, e.g. calibration standards of known composition can be prepared. In the setting of multi-component analysis, the measured response Y is the collection of n spectra measured at q wavelengths, which depend on the concentrations of the/? absorbing analytes in the n different calibration standards. The multivariate regression model B then would consist of the molar absorbances for each analyte at each wavelength. In other words B would contain the pure spectra. This is the classical approach to multicomponent analysis (c/. Section 10.7).
352
In an informal way one may summarize eq. (36.2) as: spectrum = function (composition) Given the pure spectra, or more generally given the estimated model B, it is an easy task to predict a spectrum y^^^ knowing the composition x^^^ of a new sample. However, our preoccupation is with the opposite application: given a newly measured spectrum y^^^, what is the most likely mixture composition x^^^; and, how precise is the estimate? Thus, eq. (36.2) is necessary for a proper estimation of the parameters B, but we have to invert the relation y =j{\) = xB into, say, x = g{y) for the purpose of making future predictions about x (concentration) given y (spectrum). We will treat this case of controlled calibration using classical least squares (CLS) estimation in Section 36.2.1. Often, it is not quite feasible to control the calibration variables at will. When the process under study is complex, e.g. a sewage system, it is impossible to produce realistic samples that are representative of the process and at the same time optimally designed for calibration. Often, one may at best collect representative samples from the population of interest and measure both the dependent properties Y and the predictor variables X. In that case, both Y and X are random, and one may just as well model the concentrations X, given the observed Y. This case oi natural calibration (also known as random calibration) is compatible with the linear regression model X = Y B + Ex
(36.3)
It should be understood that B in eq. (36.3) differs from that in eq. (36.2), as we shall not introduce new symbols for every possible new model. One may recapitulate eq. (36.3) loosely as composition = function (spectrum) The model of eq. (36.3) has the considerable advantage that X, the quantity of interest, now is treated as depending on Y. Given the model, it can be estimated directly from Y, which is precisely what is required in future application. For this reason one has also employed model (36.3) to the controlled calibration situation. This case of inverse calibration via Inverse Least Squares (ILS) estimation will be treated in Section 36.2.3 and has been treated in Section 8.2.6 for the case of simple straight line regression. We will see that CLS and ILS calibration modelling have limited applicability, especially when dealing with complex situations, such as highly correlated predictors (spectra), presence of chemical or physical interferents (uncontrolled and undesired covariates that affect the measurements), less samples than variables, etc. More recently, methods such as principal components regression (PCR, Section 17.8) and partial least squares regression (PLS, Section 35.7) have been
used to deal with such conditions. The main feature of these methods is that the spectral data are projected onto a low-dimensional subspace of the q wavelengths (predictors), retaining nearly all relevant information. These methods are discussed in Sections 36.2.3 and 36.2.4. Finally, in Section 36.2.5, we briefly discuss some other approaches to the multivariate linear calibration problem.

36.2.1 Classical least squares

From now on, we adopt a notation that reflects the chemical nature of the data, rather than the statistical nature. Let us assume one attempts to analyze a solution containing p components using UV-VIS transmission spectroscopy. There are n calibration samples ('standards'), hence n spectra. The spectra are recorded at q wavelengths ('sensors'), digitized and collected in an n×q matrix S. The information on the known concentrations of the chemical constituents in the calibration set is stored in an n×p matrix C. Each column of C contains the concentrations of one of the p analytes, each row the concentrations of the analytes for a particular calibration standard. The calibration standards must have been well-chosen. In the simplest case it suffices to measure the spectra of the respective single-component solutions. This presumes that pure samples can be made, that the whole range of concentrations is relevant and that one is convinced that the relation is linear over the entire concentration range. Normally, not all of these conditions will be fulfilled, although they may apply, for example, to the case of re-calibrating an instrument for a system that has already been amply studied. Usually, we would have to set up a design for measuring a number of mixtures spanning the range of interest. Assuming Lambert-Beer's law to hold, we may write

S = C K + E_S    (36.4)

where S and E_S are n×q, C is n×p and K is p×q. Eq. (36.4) states that every spectrum (row of S) is a concentration-weighted summation (row of C) of the pure spectra (rows of K). Alternatively, each column j of S, s_j, containing the intensities at wavelength j for the n calibration samples, is a linear combination of the columns of C, the weights being given by column j of K. Setting Y = S, X = C and B = K shows that eq. (36.4) is the equivalent of the multivariate linear statistical model of eq. (36.2). The pure spectra K take the role of regression coefficients, which can be estimated by classical least-squares regression of the measured calibration spectra (S) on the known calibration matrix C (see Chapter 10 on multiple linear regression for univariate y and Section 35.2 on multiple regression for multivariate Y):

\hat{K} = (C^T C)^{-1} C^T S    (36.5)
TABLE 36.1
Concentrations and digitized spectra (×100) of four calibration samples (compare Fig. 36.1a). For each of the four calibration standards the table lists the concentrations of the three analytes and the digitized spectrum at ten spectral channels.
Note that eq. (36.5) is a collection of many univariate multiple regression models: for each wavelength j the multiple regression of the corresponding spectral 'channel', i.e. s_j, on the concentration matrix C yields a vector of regression coefficients, \hat{k}_j (the jth column of \hat{K}). For K to be estimable, C^T C must be invertible, i.e. the number of calibration standards should at least be as large as the number of analytes. It is clearly not possible to obtain, directly or indirectly, say 3 pure spectra from recording the spectra of just 1 or 2 standards of known composition. In practice, the condition n ≥ p, or more precisely rank(C) = p, is hardly a restriction. As an example, we give in Table 36.1 data on calibration spectra at 10 wavelengths of 4 calibration standards for 3 analytes. The corresponding spectra are shown in Fig. 36.1a at a 10-fold higher resolution. The pure spectra calculated according to eq. (36.5), using all 100 wavelengths, are displayed in Fig. 36.1b.

Fig. 36.1. (a) Spectra of four different calibration standards with three analytes of known composition; (b) Extracted pure spectra for the three analytes; (c) Regression weights for estimating concentrations of the three analytes; (d) New spectrum resolved into three spectral contributions from pure analytes.

In many cases, one may measure spectra of solutions of the pure components directly, and the above estimation procedure is not needed. For the further development of the theory of multicomponent analysis we will therefore abandon the hat-notation in K. Given the pure spectra, i.e. given K (p×q), one may try and estimate the vector of concentrations c_new (p×1) of a new sample from its measured spectrum s_new (q×1):

s_new^T = c_new^T K    (36.6)

Transposing both sides to column vectors gives a set of q linear equations in p unknowns:

s_new = K^T c_new    (36.7)

with s_new of size q×1, K^T of size q×p and c_new of size p×1. The concentration vector c_new can be estimated by least-squares regression of the q absorbances (s_new) on those of the pure components (rows of K), giving
\hat{c}_new = (K K^T)^{-1} K s_new    (36.8)

or

\hat{c}_new^T = s_new^T B_CLS    (36.9)
Here, B_CLS = K^T (K K^T)^{-1} is the final q×p matrix of regression coefficients for converting a spectral measurement into concentration estimates. For K K^T (p×p) to be invertible K should be of full rank. A first requirement for this is that p ≤ q, i.e. the number of analytes should not exceed the number of wavelengths. The regression weights for the three analytes of our example are shown in Fig. 36.1c; each of these regression models has the property that it gives an estimate close to zero when applied to the pure spectra of the other two analytes. Figure 36.1d gives the (simulated) spectrum of a supposedly unknown mixture and its reconstruction in terms of concentration-weighted pure spectra. The estimated composition (34.7, 16.0, 64.1) reflects the true composition (35.0, 15.0, 65.0) well. The greatest error occurs for analyte 2, which is clearly overestimated at the cost of analytes 1 and 3. Such a result is in line with the spectral characteristics of the analytes, viz. the fact that the spectrum of analyte 2 strongly overlaps with those of 1 and 3 (see Fig. 36.1b). The attraction of the CLS approach is that the whole spectral domain is used for estimating each constituent. Using redundant information has an effect equivalent to replicated measurement and signal averaging, hence it improves the precision of the concentration estimate. This is true if the entire spectral region contains information with respect to the constituents, which need not be the case. A further improvement in precision is possible by taking the error structure of S into account using Generalized Least Squares (GLS) regression [5,6]. The error of the spectral signal is not necessarily constant over the range of wavelengths, e.g. it may increase with the signal itself. In this situation of heteroscedasticity a more efficient estimator than the one given by eq. (36.9) for c_new is

\hat{c}_new = (K V^{-1} K^T)^{-1} K V^{-1} s_new    (36.10)
where V (q×q) is the error covariance matrix. It may be estimated from the matrix of residuals E_S:

E_S = S - C K    (36.11)

\hat{V} = E_S^T E_S / (n - 1)    (36.12)
This GLS estimator is akin to inverse variance-weighted regression discussed in Section 8.2.3. Again there is a limitation: V can be inverted only when the number of calibration samples is larger than the number of predictor variables, i.e. spectral wavelengths. Thus, one either has to work with a limited set of selected wavelengths or one must apply other solutions which have been proposed for tackling this problem [5]. The CLS method hinges on accurately modelling the calibration spectra as a weighted sum of the spectral contributions of the individual analytes. For this to work the concentrations of all the constituents in the calibration set have to be known. The implication is that constituents not of direct interest should be modelled as well and their concentrations should be under control in the calibration experiment. Unexpected constituents, physical interferents, non-linearities of the spectral responses or interaction between the various components all invalidate the simple additive, linear model underlying controlled calibration and classical least squares estimation.
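To make the CLS workflow concrete, the following is a minimal NumPy sketch of eqs. (36.5), (36.9) and (36.8); it is an illustration only, not code from the handbook, and the synthetic matrices C, K_true and S merely stand in for real calibration data with the dimensions defined above.

import numpy as np

# n calibration standards, p analytes, q wavelengths (illustrative sizes only)
rng = np.random.default_rng(0)
C = rng.uniform(0.0, 1.0, size=(6, 3))        # hypothetical known concentrations (n, p)
K_true = rng.uniform(0.0, 1.0, size=(3, 100)) # hypothetical pure spectra (p, q)
S = C @ K_true + 0.01 * rng.standard_normal((6, 100))  # measured calibration spectra (n, q)

# Eq. (36.5): estimate the pure spectra by regressing S on C
K_hat = np.linalg.solve(C.T @ C, C.T @ S)     # (p, q)

# Eq. (36.9): matrix converting a spectrum into concentration estimates
B_cls = K_hat.T @ np.linalg.inv(K_hat @ K_hat.T)   # (q, p)

# Eq. (36.8): predict the composition of a new sample from its spectrum
s_new = np.array([0.35, 0.15, 0.65]) @ K_true      # simulated unknown mixture
c_new = s_new @ B_cls
print(np.round(c_new, 3))

The GLS variant of eq. (36.10) would simply insert the inverse error covariance between the factors, at the cost of having to estimate and invert V.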
36.2.2 Inverse least squares

In inverse calibration one models the properties of interest as a function of the predictors, e.g. analyte concentrations as a function of the spectrum. This reverses the causal relationship between spectrum and chemical composition and it is geared towards the future goal of estimating the concentrations from newly measured spectra. Thus, we write

C = S P + E_C    (36.13)
where P (q×p) is a matrix of regression coefficients. Each column of P contains a set of q weights for a particular analyte. Multiplying the q weights of the jth column of P with the corresponding spectral intensities and summing the results yields a concentration estimate for the jth analyte. The P-matrix is chosen to fit best, in a least-squares sense, the concentrations in the calibration data. This is called inverse regression, since usually we fit a random variable prone to error (y) by something we know and control exactly (x). The least-squares estimate \hat{P} is given by

\hat{P} = (S^T S)^{-1} S^T C    (36.14)
It is only possible to invert S^T S (q×q) if the matrix of calibration spectra S is of full rank q. As a consequence, the number of calibration samples should at least be as large as the number of wavelengths, n ≥ q. This precludes the use of the full spectrum and poses the problem of selecting the 'best' wavelengths. The variable selection methods discussed in Chapter 10 (forward selection, stepwise regression, all possible subsets regression; cf. Section 10.3.3) or genetic algorithms (Chapter 27) can be used to determine a good set of wavelengths. A major concern when making this selection is the correlation between the absorbances at the various wavelengths for the calibration set. Highly dependent predictors inflate the variance of the regression coefficients P, leading to unreliable predictions (cf. Section 10.5), particularly if the prediction samples move out of the region covered by the calibration samples. The PCR and PLS methods provide a way out of the problem of wavelength selection and the associated collinearity problem. The advantage of the inverse calibration approach is that we do not have to know all the information on possible constituents, analytes of interest and interferents alike. Nor do we need pure spectra, or enough calibration standards to determine those. The columns of C (and P) only refer to the analytes of interest. Thus, the method can work in principle when unknown chemical interferents are present. It is then of utmost importance that such interferents are present in the calibration samples. A good prediction model can only be derived from calibration data that are representative for the samples to be measured in the future.
With this proviso, a new unknown sample with spectrum s_new can be directly translated into a concentration estimate:

\hat{c}_new^T = s_new^T B_ILS    (36.15)

where the matrix of prediction vectors B_ILS (q×p) has already been established in the calibration step:

B_ILS = \hat{P} = (S^T S)^{-1} S^T C    (36.16)
It should be noted that B_ILS is just a collection of separate multiple regression models. Each column b_j of B_ILS is associated with a particular analyte and follows from a multiple regression of the corresponding column c of C on the predictor matrix of spectra S.

36.2.3 Principal components regression

The application of principal components regression (PCR) to multivariate calibration introduces a new element, viz. data compression through the construction of a small set of new orthogonal components or factors. Henceforth, we will mainly use the term 'factor' rather than 'component' in order to avoid confusion with the chemical components of a mixture. The factors play an intermediary role as regressors in the calibration process. In PCR the factors are obtained as the principal components (PCs) from a principal component analysis (PCA) of the predictor data, i.e. the calibration spectra S (n×q). In Chapters 17 and 31 we saw that any data matrix can be decomposed ('factored') into a product of (object) score vectors T (n×r) and (variable) loadings P (q×r). The number of columns in T and P is equal to the rank r of the matrix S, usually the smaller of n and q. It is customary and advisable to do this factoring on the data after column-centering. This allows one to write the mean-centered spectra S_0 as:
S_0 = S - 1_n \bar{s}^T = t_1 p_1^T + t_2 p_2^T + ... + t_r p_r^T    (36.17)
where 1_n is an n×1 vector of ones and \bar{s} (q×1) is the average spectrum over all calibration spectra. The PC score vectors t_a (a = 1, 2, ..., r), i.e. the columns of T, are arranged in order of diminishing importance, i.e. descending variance. We will choose to normalize the loadings p_a so that the importance of each dimension is reflected in the length (norm) of the corresponding score vector. This corresponds to the choice α = 1 and β = 0, as discussed in Chapter 31, when factoring a data matrix. The scores T are obtained as weighted sums of the original columns of S_0. The weights are given by the eigenvectors W (q×r) of the cross-product matrix S_0^T S_0 associated with the r positive eigenvalues:

T = S_0 W    (36.18)
In PCA, the weights W and the normalized loadings P are identical (cf. Chapters 31 and 35). For any other data decomposition, e.g. PLS, weights and loadings differ. The main idea of PCA is to approximate the original data, spectra in our case, by a small number (A < r) of factors, discarding the minor factors:

S ≈ 1_n \bar{s}^T + T* P*^T    (36.19)
The suffix * in T* (n×A) and P* (q×A) indicates that only the first A columns of T and P are used, A being much smaller than n and q. In principal component regression we use the PC scores as regressors for the concentrations. Thus, we apply inverse calibration of the property of interest on the selected set of factor scores:

C = 1_n \bar{c}^T + T* Q* + E_C    (36.20)
The regression coefficients Q* follow immediately by least-squares estimation:

\hat{Q}* = (T*^T T*)^{-1} T*^T C_0    (36.21)
where C_0 = C - 1_n \bar{c}^T is the matrix of concentrations, expressed as deviations from the mean composition \bar{c}. Substituting eq. (36.18) into eq. (36.20) gives

\hat{C} = 1_n \bar{c}^T + S_0 W* \hat{Q}*    (36.22)
Prediction of a new sample with spectrum s_new is now straightforward:

\hat{c}_new^T = \bar{c}^T + (s_new - \bar{s})^T B_PCR    (36.23)
where

B_PCR = W* \hat{Q}*    (36.24)
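Taken together, eqs. (36.17)-(36.24) amount to a short computation. The following is a minimal NumPy sketch under the notation above (an illustration, not code from the handbook); the function names are hypothetical.

import numpy as np

def pcr_fit(S, C, A):
    """Fit a PCR model with A factors; S is (n, q), C is (n, p)."""
    s_mean, c_mean = S.mean(axis=0), C.mean(axis=0)
    S0, C0 = S - s_mean, C - c_mean
    # SVD of the centered spectra: S0 = U diag(sv) Vt, so W = Vt.T and T = S0 W
    U, sv, Vt = np.linalg.svd(S0, full_matrices=False)
    W_star = Vt[:A].T                      # (q, A) weights (= normalized loadings)
    T_star = S0 @ W_star                   # (n, A) orthogonal factor scores, eq. (36.18)
    # Eq. (36.21): T*^T T* is diagonal (squared singular values), so its inverse is trivial
    Q_star = np.diag(1.0 / (sv[:A] ** 2)) @ T_star.T @ C0
    B_pcr = W_star @ Q_star                # eq. (36.24)
    return s_mean, c_mean, B_pcr

def pcr_predict(s_new, s_mean, c_mean, B_pcr):
    # Eq. (36.23)
    return c_mean + (s_new - s_mean) @ B_pcr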
Notice that the matrix \hat{Q}* represents a multivariate regression model. Each column of \hat{Q}* contains a multiple regression model of the (mean-centered) concentrations of an analyte on the orthogonal regressors in T*. Because of this orthogonality, T*^T T* is diagonal and its inversion is trivial (viz. inverting the individual diagonal elements). Thus, the calculation of \hat{Q}* and B_PCR is a simple matter once the PCA, giving T and W, is done. One should also appreciate that eq. (36.23) has the form of a multiple regression model, with the regression coefficients B = B_PCR having been computed via the detour of PCR, since S^T S is not invertible. PCR combines aspects of both CLS and ILS. In common with ILS it is based on the direct calibration of the property of interest from the multivariate predictor, irrespective of the direction of the causal relation. Contrary to ILS and in common with CLS it can use all predictor information even when there are many more predictors than samples. There are no problems of collinearity since the regressors in T* are uncorrelated. There is no problem of having too many variables (q > n), since we have made the step from the q original variables, possibly hundreds of wavelengths, to just a few PCs (dimension reduction). There is no problem of selecting predictor variables: all are represented in the PCs via the weights in W* that define T*. The only problem is choosing A, the number of PCs to be used for the calibration model. This important decision is given separate attention in Section 36.3.

As an example we give in Fig. 36.2 a set of near-infrared calibration spectra of a food product along with the varying content of a major component that we must call X for proprietary reasons. The food product is measured as such in reflection mode. We see that there is little variation in the spectra.

Fig. 36.2. Raw data: near-infrared spectra of a food product with different X-content.

Figure 36.3 gives the mean spectrum (a) and the 'spectrum' of standard deviations (b). Comparing the two, we observe bands at roughly the same wavelengths, e.g. at wavelengths 1450 and 1880, indicating that the variation in intensity is largest at the peaks.

Fig. 36.3. Summary of spectral information: (a) average spectrum, (b) standard deviation spectrum.

Figure 36.4 shows the loadings for the major 4 PCs (cf. Section 17.6.2). Notice that a loadings vector has an entry for each wavelength, so it can be drawn as a spectrum-like graph. One observes that the loadings plot of PC1 more or less reflects the standard deviation spectrum (Fig. 36.3b), and that practically all individual loadings are positive. The implication is that PC1 gives positive weight
to each wavelength and that the resulting PC1 scores more or less reflect the total intensity of the spectra (cf. Section 31.3, where we also found that the first PC represents overall size when the measurements are positively correlated). The loadings of PC2, PC3, etc. show more complex patterns. Negative parts enter since the different loadings spectra are orthogonal, i.e. the inner products of the loading vectors are zero. One may also interpret the loadings spectra of the higher PCs in terms of contrasts between different spectral regions.

Fig. 36.4. Spectral loadings for the first four principal components.

Plots such as Fig. 36.4 are helpful in interpreting the nature of the PCs from a spectroscopic point of view. They also can aid in detecting artefacts in the data. For example, a spike in one of the spectra will show up as a spike in one or more of the loadings plots at the 'faulty' wavelength. Figure 36.5 shows the scores plots for PC2 vs. PC1 (a) and PC4 vs. PC3 (b). Such plots are useful in indicating a possible clustering of samples in subsets or the presence of influential observations. Again, a spectrum with a spike may show up as an outlier for that sample in one of the scores plots.

Fig. 36.5. Score plot of the samples in PC space: (a) scatterplot of PC2-scores vs. PC1-scores, (b) scatterplot of PC4-scores vs. PC3-scores.

If outliers are indicated, one should try and identify the cause of the outlying behaviour. Only when a satisfactory explanation is found can the outlier be safely omitted. In practice, one will
also remove the outlier solely on the basis of its extreme deviation. The most difficult cases arise when suspect outliers are only mildly deviating and there is no good explanation for the outlying behaviour. In our example, there are no such remarkable outlying features in the score plots. Robust methods for PCA and PCR have been developed which are less vulnerable to deviating observations (cf. Section 36.4.3). Finally, one may plot the X-content against any of the PC scores (Fig. 36.6). In this case we observe a relationship of the amount of X with PC2 and with PC3.

Fig. 36.6. Scatter plot of dependent variable (X-content) versus the first four principal components.

The amount of spectral variance explained by the PCs is shown in Fig. 36.7. It would appear that the first four PCs account for practically all variation (99.2%). Thus, a model with A = 4 PCs will capture most of the spectral variation and, hopefully, most of the correlation with the X-content.

Fig. 36.7. Percentage variance of X-content explained by the principal components from spectral data. Individual percentages (bars) are shown as well as cumulative percentages (circles).

The estimated model (cf. eq. 36.20) is

\hat{c} = 50.3 + 1.53 t_1 + 10.7 t_2 - 19.1 t_3 + 1.2 t_4    (36.25)
where the constant term represents the average X-content in the calibration set, \bar{c} = 50.3, and the regression coefficients correspond to the slopes in Fig. 36.6. If we multiply the weights (= loadings) given in Fig. 36.4 with these corresponding slopes and add up the result we arrive at the regression model shown in Fig. 36.8. This is the graphical version of eq. (36.24): it gives the weights b_PCR in the model \hat{c} = \bar{c} 1_n + S_0 b_PCR, which describes how to convert spectral measurements into best-fitting estimates of the X-content, based on 4 PCs as an intermediary.

Fig. 36.8. Regression coefficients obtained from PCR model.

Figure 36.9 gives the fitted versus the observed X-content.

Fig. 36.9. Final calibration showing fitted versus measured values.

We chose the number of PCs in the PCR calibration model rather casually. It is, however, one of the most consequential decisions to be made during modelling. One should take great care not to overfit, i.e. use too many PCs. When all PCs are used one can fit exactly all measured X-contents in the calibration set. Perfect as it may look, it is disastrous for future prediction. All random errors in the calibration set and all interfering phenomena have been described exactly for the calibration set and have become part of the 'predictive' model. However, all one needs is a description of the systematic variation in the calibration data, not the
random variation. As a model becomes more complex, e.g. using more PCs in PCR calibration, the noise term starts to dominate the added systematic contribution. Procedures to establish the optimum complexity, i.e. choosing the best-predictive calibration model from a range of increasingly complex models, are given in Section 36.3. Thus far we have discussed the traditional method of applying PCR, where the PCs are chosen in order of decreasing variance (top-down approach). Another approach is to calculate the correlation of each PC with y and to enter the PCs in order of their decreasing squared correlation. Since the PC scores are uncorrelated, this is equivalent to applying multiple regression with variable selection on the set of PCs using forward selection (cf. Section 10.3.3). Several studies, e.g. Refs. [7,8], have shown that, with respect to prediction accuracy, this correlation-PCR approach performs no worse and often better than the traditional top-down approach. Correlation-PCR often gives simpler models with a smaller number of PCs than top-down PCR. The added advantage of such parsimonious models is that they are easier to interpret.
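A hedged sketch of this correlation-based ordering (an illustration, not code from the handbook) assumes that the full score matrix T and the centered response y0 are available, e.g. from the PCR sketch given earlier:

import numpy as np

def order_pcs_by_correlation(T, y0):
    """Return PC indices sorted by decreasing squared correlation with the centered response y0."""
    r = np.array([np.corrcoef(T[:, a], y0)[0, 1] for a in range(T.shape[1])])
    return np.argsort(r ** 2)[::-1]

# top-down PCR uses the first A columns of T; correlation-PCR instead uses the A
# columns with the largest squared correlation, e.g. T[:, order_pcs_by_correlation(T, y0)[:A]].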
36.2.4 Partial least squares regression

The procedure for PLS calibration is very akin to PCR calibration. It also proceeds via a small set of orthogonal factors constructed from the predictor variables. The main difference with PCR lies in the way the factors are determined. This has been discussed in Chapter 35. In PLS regression the factor is not solely determined by the spread (variance) of the predictor data; the correlation of the PLS factor with the variable to be predicted also plays a role. A high-variance factor that is not at all correlated to the dependent property is given less weight by PLS from the outset. In PCR, such a factor at first seems important on account of its high variance. However, it is downweighted in the final model through the small regression coefficient q_a for that factor (see, for example, the low coefficient for PC1 in the PCR regression model of eq. (36.25)). Since in PLS regression such factors are avoided, PLS models often have fewer factors than PCR models built on the same data. This aspect of parsimony is sometimes seen as an advantage of PLS over traditional PCR. Correlation-PCR behaves like PLS in this respect. Another difference between PLS and PCR is that the loadings P, even when normalized to unit length, differ from the weights W. Qualitatively, however, e.g. with regard to the sign pattern, the difference often is not large. Usually one plots the loadings since they have a simpler interpretation. In Chapter 35 we showed that the loadings P are the regression coefficients obtained from regressing the predictors (spectral signals at each wavelength) on the PLS factor scores. One may consider the loadings (columns of P) as abstract spectra from which the measured spectra can be reconstructed. A high loading p_jk implies that the spectral region around the jth wavelength has a strong contribution from the kth factor. PLS calibration of a multicomponent system can be performed in two different ways. One may do a separate regression for each analyte. Such univariate (in y) regressions are called PLS1 regressions. One may also model the various analytes collectively in one and the same multivariate PLS2 regression model. Which of these two approaches should be chosen? The use of PLS2 regression has a few advantages. Firstly, there is one common set of PLS factors T for all analytes. This simplifies interpretation and enables a simultaneous graphical inspection. Secondly, when the analyte concentrations are strongly correlated one may expect on theoretical grounds that the PLS2 model is more robust than separate PLS1 models. This is especially true when the Y matrix is closed, as for compositional data. Finally, when the number of analytes is large the development of a single PLS2 model is much quicker than the development of many separate PLS1 models. Practical experience, however, indicates that PLS1 calibration usually performs equally well or better in terms of predictive accuracy. Thus, when the ultimate requirement of the calibration study is to enable the best possible predictions, a separate PLS1 regression for each analyte is advised.
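For illustration only — a sketch of one common (NIPALS-style) formulation of PLS1, not necessarily the exact algorithm of Chapter 35 — a model with A factors can be built from column-centered data X0 (n×q) and y0 (n×1) as follows.

import numpy as np

def pls1_fit(X0, y0, A):
    """PLS1 regression on centered X0 (n, q) and y0 (n,); returns the coefficient vector b (q,)."""
    X, y = X0.copy(), y0.copy()
    W, P, Q = [], [], []
    for _ in range(A):
        w = X.T @ y                      # weight: covariance of each wavelength with y
        w /= np.linalg.norm(w)
        t = X @ w                        # factor scores
        p = X.T @ t / (t @ t)            # X-loadings
        q = y @ t / (t @ t)              # inner regression coefficient
        X = X - np.outer(t, p)           # deflate X
        y = y - q * t                    # deflate y
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    # regression vector for centered data: b = W (P^T W)^{-1} q
    return W @ np.linalg.solve(P.T @ W, Q)

Prediction for a new spectrum then follows as \hat{y} = \bar{y} + (s_new - \bar{s}) b, in analogy with eq. (36.23).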
36.2.5 Other linear methods

There are many more methods that are well suited for calibration of collinear data. Latent root regression [9,10], principal covariates regression [11], total least squares [12], and reduced rank regression [13] are related to PCR and PLS in that a calibration model is based on orthogonal factors derived from the predictors (spectra). However, a comparative evaluation of these techniques has not yet been published. Continuum regression [14,15] provides a continuum of models which can be viewed as interpolations between three 'anchor' models, viz. PCR, PLS (more specifically SIMPLS, Section 35.7.4), and MLR (or reduced rank regression). It still has to be established whether there is a need for a continuum of models filling the gaps between these three anchors. Ridge regression is a technique that has been specially devised for dealing with multicollinear data (cf. Section 10.6). It replaces the OLS regression estimator (X^T X)^{-1} X^T y by (X^T X + kI)^{-1} X^T y. The addition of the constant k, the ridge parameter, to the diagonal of X^T X has the effect of artificially decreasing the dependence among the X-predictors. As a result it leads to regression coefficients that are slightly biased towards the low side (shrinkage property). This small systematic error is, however, more than compensated by the beneficial effect of the greatly reduced variance of the estimates. The value of k can be identified from the graph of the regression parameters as a function of k, namely as the value where the regression parameters start to stabilize. In a comparative simulation study it has been shown that ridge regression performs as well as PCR or PLS, all of them outperforming MLR with forward variable selection [16]. Other comparative studies also have shown that often the results obtained with various prediction methods are very similar. When each method is applied carefully it turns out that there is no overall superior technique. Thus, it is better to get well acquainted with one or two methods and to apply these in a professional way, rather than applying different techniques which are known only superficially. Even multiple linear regression (MLR, Chapter 10) has regained some of its lost respectability in the field of multivariate calibration. In combination with modern wavelength selection procedures (Section 36.5.1) and a safeguard against overfitting, well-performing models can be derived that are based on a small number of wavelengths.

Generalized standard addition method

In Section 8.2.8 we have discussed the standard addition method as a means to quantitate an analyte in the presence of unknown matrix effects (cf. Section 13.9). While the matrix effect is corrected for, the presence of other analytes may still interfere with the analysis. The method can be generalized, however, to the simultaneous analysis of p analytes. Multiple standard additions are applied in order to determine the analytes of interest using many (q > p) analytical sensors. It
is assumed that the responses are a linear function of the analyte concentrations. The calibration model reads

R = (1_n c_0^T + ΔC) K + E    (36.26)

where R (n×q) is the set of responses measured on the calibration samples according to the 'design' matrix ΔC (n×p), containing the added concentrations. The unknown concentration is given by the vector c_0 (p×1) and K (p×q) is the unknown matrix of sensitivities of the q responses with respect to the p different analytes. One may eliminate the unknown concentration vector c_0 by subtracting the response vector r_0 corresponding to the original sample, without standard additions, from each row of the response matrix R, giving ΔR, and at the same time subtracting the unknown c_0^T from the matrix (1_n c_0^T + ΔC). The sensitivities K can be estimated from this reduced system of equations using multiple linear regression of the corrected responses (ΔR) on the multiple standard additions (ΔC). Given the estimate \hat{K} = ((ΔC)^T (ΔC))^{-1} (ΔC)^T ΔR, one may estimate the unknown concentration as:

\hat{c}_0 = (\hat{K} \hat{K}^T)^{-1} \hat{K} r_0    (36.27)
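As a hedged illustration of these two steps (not code from the handbook; the function name and arrays are hypothetical), the GSAM estimate can be computed directly from the added concentrations and the measured responses:

import numpy as np

def gsam_estimate(delta_C, R, r0):
    """Estimate the unknown concentrations c0 (p,) from the added concentrations
    delta_C (n, p), the responses R (n, q) and the response r0 (q,) of the original sample."""
    delta_R = R - r0                         # subtract the original response from each row
    # multiple linear regression of delta_R on delta_C gives the sensitivities K (p, q)
    K_hat, *_ = np.linalg.lstsq(delta_C, delta_R, rcond=None)
    # eq. (36.27): least-squares estimate of the original concentrations
    return np.linalg.solve(K_hat @ K_hat.T, K_hat @ r0)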
More efficient estimation methods exist than the simple method described here [17]. The generalized standard addition method (GSAM) shares the strong points (e.g. correction for interferences) and weak points (e.g. error amplification because of the extrapolation involved) of the simple standard addition method [18].

36.3 Validation

Many choices have to be made before a calibration model can be developed. First one has to choose between the methods discussed in Section 36.2. Each of these methods involves the choice of the dimensionality A, a 'meta-parameter' that may greatly affect the predictive performance of the calibration model. In ILS calibration one has the problem of selecting a limited number of predictive wavelengths. If one chooses the method of Brown, Denham and Spiegelman for wavelength selection (cf. Section 36.5.1) one still has to choose a confidence level α which governs the number of wavelength channels. A way to make such choices is to build the models from the calibration or training set and see which of the models gives the best predictions on a new set of data, the test set. The obvious criterion for assessment is the average size of the prediction error, as expressed by the prediction error sum-of-squares, PRESS ([19], Section 10.3.4), or the root-mean-square (rms) value of the prediction error, RMSPE:

PRESS = Σ_i (\hat{c}_i - c_i)^2    (36.28)

RMSPE = (PRESS / n_t)^{1/2}    (36.29)
The summation in eq. (36.28) extends over all n_t samples in the test set. Among the various model options one chooses the one having minimum PRESS. It is still common usage to regard the model thus chosen as a validated model and RMSPE as a proper estimate of future prediction errors. However, the procedure described above only helped to choose a final model, and the prediction error RMSPE estimated in this manner will be optimistically biased. The so-called test set played the role of a second training set for estimating the meta-parameter(s). It is therefore better to refer to this second dataset as the monitoring set, used to fine-tune the meta-parameter(s). Given the chosen model one should establish its performance by testing the model on a new independent set of data. The RMSPE value obtained for this new set of data, truly the test set, gives a fair impression of the predictive capability of the model provided that the training data, monitoring data and test data are randomly sampled from the same population. Should the test result be disappointing and lead to amendments of the model, this new calibration model should be tested against wholly new data, etc. In practice the strategy of employing separate sets of calibration (training) and validation (test) data often cannot be applied as it requires a large number of samples. It is quite common that the number of samples is limited and resort is taken to resampling methods or internal cross-validation (cf. Sections 10.3.4 and 33.4). Here, one sets aside part of the data and builds a model with the rest of the data. The data not included are then used for assessing the PRESS dependence. This process is repeated for various splits of the original data such that all samples have been left out once. The PRESS values are accumulated over the various splits. At the end one chooses the model having minimum overall PRESS. Perhaps the most common approach is to perform n calibration steps leaving out one observation at a time (leave-one-out procedure, LOO). As an example, Fig. 36.10 shows the cross-validation RMSPE result for (traditional) PCR and PLS models of the X-content.

Fig. 36.10. Prediction error (RMSPE) as a function of model complexity (number of factors) obtained from leave-one-out cross-validation using PCR (o) and PLS (*) regression.

For the PLS model the minimum RMSPE (1.8%) occurs at a dimensionality A = 5. For PCR the minimum lies at A = 10 (RMSPE = 2.0%). This minimum is very shallow and one might trade in a few factors for a simpler, and probably more robust, model with about the same prediction error (A = 8, RMSPE = 2.2%). In all, the model choice in this example is fairly clear-cut: PLS regression with 5 factors. If one is willing to accept a somewhat larger prediction error (RMSPE = 2.5%) a parsimonious 3-factor PLS model suffices. Since one knows that the minimum RMSPE value is optimistically biased it is good advice to prefer simpler models with slightly higher RMSPE values. The one-factor model for PCR performs no better than a zero-factor model, i.e. using no spectral information at
all. Apparently, the first PC-factor captures an important source of spectral variation that has no predictive value. Leaving out one object at a time represents only a small perturbation to the data when the number (n) of observations is not too low. The popular LOO procedure has a tendency to lead to overfitting, giving models that have too many factors and a RMSPE that is optimistically biased. Another approach is k-fold cross-validation, where one applies k calibration steps (5 < k < 15), each time setting a different subset of (approximately) n/k samples aside. For example, with a total of 58 samples one may form 8 subsets (2 subsets of 8 samples and 6 of 7), each subset tested with a model derived from the remaining 50 or 51 samples. In principle, one may repeat this k-fold cross-validation a number of times using a different splitting [20]. Van der Voet [21] advocates the use of a randomization test (cf. Section 12.3) to choose among different models. Under the hypothesis of equivalent prediction performance of two models, A and B, the errors obtained with these two models come from one and the same distribution. It is then allowed to exchange the observed errors, e_iA and e_iB, that are associated with the two models for the ith sample. In the randomization test this is actually done in half of the cases. For each object i the two residuals are swapped or not, each with a probability 0.5. Thus, for all objects in the calibration set about half will retain the original residuals, for the other half they are exchanged. One now computes the error sum of squares for each of the two sets of residuals, and from that the ratio F = SSE_A/SSE_B. Repeating the process some 100-200 times yields a distribution of such F-ratios, which serves as a reference distribution for the actually observed F-ratio. When for instance the observed ratio lies in the extreme higher tail of the simulated distribution one may
be confident that model B is significantly better than model A. This randomization test is generally applicable to choosing among different models, hence it can be used to determine e.g. the lowest complexity of PCR or PLS models that is significantly superior to simpler models. Another approach to model validation is the application of the bootstrap. Here, we only give a short and qualitative description; for more details about the bootstrap methodology in general see [22], and for a recent application to multivariate calibration see [23]. The analogy with cross-validation is that the observed data are resampled many times. The difference is that one does not split the data into two subsets, but applies resampling with replacement. As an example, suppose there are 100 objects in the calibration set. This set is regarded as a population. One may draw 100 samples from this population with replacement, i.e. some objects will not be selected at all, many only once, some twice or more. With this artificial data set one builds a model. This process is repeated many times, hence it is computer intensive. For each model one computes a quantity of interest, e.g. a prediction error. From the distribution of these prediction errors one may derive an average value as well as a measure of its uncertainty. By monitoring the average prediction error as a function of the number of PCR or PLS factors one may determine an optimal model complexity. This average prediction error can be used to compute confidence limits around future predictions.
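As a hedged illustration of the leave-one-out procedure for choosing the model complexity (a sketch, not code from the handbook), RMSPE can be computed for a range of factor counts using any fit/predict pair, such as the pcr_fit and pcr_predict sketches given earlier:

import numpy as np

def loo_rmspe(S, c, A, fit, predict):
    """Leave-one-out RMSPE for a model with A factors; fit/predict follow the earlier sketches."""
    n = len(c)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        model = fit(S[keep], c[keep].reshape(-1, 1), A)
        errors[i] = predict(S[i], *model)[0] - c[i]
    return np.sqrt(np.mean(errors ** 2))     # eqs. (36.28)-(36.29) with n_t = n

# one would then choose the smallest A whose RMSPE lies close to the minimum over, say, A = 1..15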
36.4 Other aspects

36.4.1 Calibration design

It should be appreciated that classical theory of design of experiments (Chapters 21-26), based on the linear model estimated by least squares regression, cannot be applied directly to the problem of multivariate calibration. One reason is that one may not know precisely which (interfering) factors are at play, let alone that one could control these. Another reason is that one may regard the spectral space as the space in which to position the chemical samples. The number of dimensions (wavelengths) is generally much larger than the number of experimental units (chemical samples) and the linear model cannot be estimated by ordinary least squares regression. There are two points of view to take into account when setting up a training set for developing a predictive multivariate calibration model. One viewpoint is that the calibration set should be representative for the population for which future predictions are to be made. This will generally lead to a distribution of objects in experimental space that has a higher density towards the center, tailing out to the boundaries. Another consideration is that it is better to spread the samples more or
less evenly over the experimental region. One way to generate such a space-filling design is by applying the Kennard-Stone algorithm (cf. Section 24.4.2). Generally, this will put less weight on the center of the experimental region and more on the extremes, which should give more precise estimates. Fearn [24] gives an interesting discussion of the two choices, the main message being that when the number of calibration objects is limited the latter choice may be preferable. One is not always in a position to choose the samples. In that case it is never wise to discard samples from a given calibration set for the sole sake of making the distribution of calibration objects in the region of interest more uniform. When a linear model does not hold over the full range of calibration samples there are two options. One may apply a non-linear regression method to the complete set of data (cf. Section 36.5.3) or one may split the experimental region into smaller subregions and estimate a separate linear model for each. Naes and Isaksson [25] use fuzzy clustering for splitting the training sets into smaller subsets with improved linearity in each group.

36.4.2 Data pretreatment

Data pretreatment is an important issue. Proper preprocessing of the data can be very instrumental in developing better predictive models. Although it is true that the modelling process in multivariate calibration may accommodate interferences and irrelevant artefacts, careful data preprocessing often turns out to be more effective [26]. General guidelines on how to preprocess data are hard to give since this depends very much on the specific application at hand (e.g. which type of data? spectroscopic? which technique?) and on the nature of the samples in question. It goes without saying that any pretreatment of the data has to be applied in an identical manner to the calibration data, the test data and future new data. A very basic form of pretreatment is (column) mean centering. It corresponds to modelling the variation around the mean, i.e. the deviation from the mean response is directly related to deviations from the mean for the predictors. Mean centering is so common that it is often not even considered as a form of data pretreatment. Without further action this may give the variables with a high variance an undue influence on the model, except for MLR which is not sensitive to such scaling. Autoscaling is a form of pretreatment that is recommendable when the predictor variables are of a different nature and not measured on the same scale. Standardizing all variables to the same variance can be seen as a democratic manoeuvre giving all variables an equal chance to influence the model. With spectroscopic data a popular form of data pretreatment is to correct for varying baseline slopes by regressing each spectrum against wavelength number and continuing the calibration with the residuals. Quadratic regression has also been used as a means for detrending spectra. Pre-smoothing may be applied to the
spectral data to get rid of uncontrolled random noise. Various options are available to implement such smoothing (cf. Chapter 40): moving window (box car) averaging, Fourier filtering, Savitzky-Golay smoothing. A side effect of smoothing is that spectral resolution may be lost. A very common pre-treatment of spectral data is to convert the spectra to first- (or second-) derivative form [27]. This has the effect of removing any offset and constant slope (or curvature). Applying second derivatives has the advantage of sharpening peaks and resolving overlapping bands to some extent, although it also introduces spurious satellite peaks. Derivatization in general amplifies the noise in the data (Section 40.5.5). As a remedy to the latter drawback one may carry out the derivatization in combination with some degree of smoothing, for example employing Savitzky-Golay filtering (Section 40.5.2.3). An effective preprocessing method is the use of standard normal variates (SNV). This type of standardization boils down to considering each spectrum x_i as a set of q observations and calculating their z-scores:

z_i = (x_i - \bar{x}) / s    (36.30)
It has the effect of removing an overall offset by subtracting the mean spectral reading \bar{x} and it corrects for differences affecting the overall variation. In various settings it has been found to be an effective preprocessing method. Another popular form of data pre-processing with near-infrared data is the application of the Multiplicative Scatter Correction (MSC, [28]). It is well known that the particle size distribution of non-homogeneous powders has an overall effect on the spectrum, raising all intensities as the average particle size increases. Individual spectra x_i are approximated by a general offset plus a multiple of a reference spectrum z:

x_i = a_i + b_i z + e_i    (36.31)
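A minimal NumPy sketch of both corrections (an illustration of eqs. (36.30) and (36.31), not code from the handbook; the function names are hypothetical) could be:

import numpy as np

def snv(X):
    """Standard normal variate: z-score each spectrum (row of X) individually, eq. (36.30)."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)

def msc(X, z=None):
    """Multiplicative scatter correction, eq. (36.31); the reference z defaults to the mean spectrum."""
    if z is None:
        z = X.mean(axis=0)
    X_corr = np.empty_like(X, dtype=float)
    for i, x in enumerate(X):
        b, a = np.polyfit(z, x, deg=1)          # simple linear regression of x_i on z
        X_corr[i] = (x - (a + b * z)) / b       # residual from the fit, divided by the slope b_i
    return X_corr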
The offset a_i and the multiplication constant b_i are estimated by simple linear regression of the ith individual spectrum on the reference spectrum z. For the latter one may take the average of all spectra. The deviation e_i from this fit carries the unique information. This deviation, after division by the multiplication constant, is used in the subsequent multivariate calibration. For the above correction it is not mandatory to use the entire spectral region. In fact, it is better to compute the offset and the slope from those parts of the wavelength range that contain no relevant chemical information. However, this requires spectroscopic knowledge that is not always available. A special type of data pre-treatment is the transformation of the data into a smaller number of new variables. Principal components analysis is a natural example and we have treated it in Section 36.2.3 as PCR. Another way to summarize a spectrum in a few terms is through Fourier analysis. McClure [29] has shown how a NIR
spectrum recorded at 1700 channels can be well approximated by a Fourier series comprising only 100 terms. The error amounted to less than 0.01%, yet the spectra can be compressed and stored in 6% of the original space. A calibration model based on some 10 selected Fourier terms proved to give results that were superior to using the full raw data.

36.4.3 Outliers

The presence of outliers may have a detrimental effect on the quality of a calibration model. Therefore, the identification of outliers is an important part of the modelling process. Outliers can come in different guises. One speaks of high-leverage observations when the predictor data for a calibration object deviate strongly from the rest. Such outliers in X-space may fit well to the model ('good' outliers) or not ('bad' outliers). When the predictor data are not abnormal for an object, but the object fits poorly to the model, then one speaks of a high-residual observation (outlier in the y-direction). Another class is formed by the influential observations. These are observations that have a demonstrably large impact on the model estimates. When such observations are discarded from the calibration set a significantly different model with different predictions is obtained. How do we identify outliers and how should we treat them once found? Basically there are two approaches: either apply diagnostics for detecting outliers or use robust estimation methods. Which diagnostics to use depends on the regression method employed, but some tools are universally applicable. It is always useful to make residual plots. These may reveal strongly deviating observations or some remaining structure that should not appear in a good residual plot. A common way of inspecting the data is to draw a PCA score plot, which may show samples deviating from the bulk of the samples. Formally, one can compute the Mahalanobis distance or the leverage (cf. Sections 8.2.6 and 20.7), and use that as an indication of outlying behaviour. Similarly, a loading plot may reveal wavelengths which show deviating behaviour. One may also use the PLS scores and loadings during the modelling stage. With MLR one can use such measures as Cook's distance (cf. Section 10.9) or the variance inflation factor (cf. Section 10.5) for influential observations or variables. A complicating factor with the diagnosis of outliers is the phenomenon of masking. This refers to the fact that an individual observation may not be recognized as outlying, because it is part of a cluster of outlying objects. Only when the complete cluster of deviating outliers is removed from the calibration set does one recognize their severe influence on the calibration model [30]. Identification of outliers is not a straightforward process. Even when observations have been diagnosed as outlying, one should not automatically discard these, certainly not when the evidence is not overwhelming. Ideally, one should
always try to find additional physical or chemical evidence that something is 'wrong' with such samples before deciding to remove them. An alternative to the detection and removal of outliers is to employ robust regression methods (cf. Sections 12.1.5 and 12.3). With such methods outlying objects are automatically identified and downweighted in the regression modelling. Robust modifications of popular multivariate regression procedures have been reported for PCR [31] and PLS [32]. Walczak [33] has described a method, the evolution program (EP), where a clean subset of the data is generated and the remaining observations are tested relative to this clean subset. As the name implies, the method is based on the idea of natural evolution, similar to the better-known genetic algorithm. The EP approach allows one to build robust models in the presence of multiple multivariate outliers and can be applied usefully in combination with PCR and PLS regression. Other robust methods for identifying multivariate outliers are given in e.g. Refs. [34,35].
36.5 New developments

36.5.1 Feature selection

Brown et al. [36] published an interesting, simple method for selecting wavelengths in NIR calibration. The method boils down to ranking the wavelengths in order of diminishing squared correlation (R^2) with the analyte concentration. The model is then built using the first m wavelengths, j = 1...m, from this ordered list, as a weighted average of the m simple regression models corresponding to these wavelengths. The idea is to minimize the confidence interval for future predictions. As m increases, so does the total signal-to-noise ratio (the sum of the F-ratios, or t-values, of the m separate, simple regressions), but so does the model complexity. The optimum number of wavelengths m is established by minimizing the ratio χ^2(m; α)/ΣF_j, where χ^2(m; α) is the critical value of a chi-square distribution with m degrees of freedom at the chosen level of confidence and ΣF_j is the summed signal-to-noise ratio defined above. Having found the selection of wavelengths one may proceed using any of the aforementioned regression methods, e.g. ILS or GLS. Many other new developments are under way in this area of variable selection. Wavelength selection can also be done by forward selection using covariance rather than correlation as a criterion (Intermediate Least Squares [37,38]). One may see this as the PLS analogue of forward selection in MLR. Recently, genetic algorithms (cf. Chapter 27) have also been applied to the problem of finding small sets of predictive wavelengths among the legion of candidate wavelengths [39]. The challenge with the application of such methods is not to fall into the trap of overfitting. Another major problem is that of chance correlations: with the large set
of predictors encountered in spectral data it is not at all unlikely to select wavelengths which have no real predictive power but happen to contribute to the overall correlation with the response in the calibration set. An alternative approach to variable selection is the elimination of so-called uninformative variables. These are variables that have no better predictive power than artificial random variables added to the data [40].

36.5.2 Transfer of calibration models

The development of a calibration model is a time-consuming process. Not only have the samples to be prepared and measured, but the modelling itself, including data pre-processing, outlier detection, estimation and validation, is not an automated procedure. Once the model is there, changes may occur in the instrumentation or other conditions (temperature, humidity) that require recalibration. Another situation is where a model has been set up for one instrument in a central location and one would like to distribute this model to other instruments within the organization without having to repeat the entire calibration process for all these individual instruments. One wonders whether it is possible to translate the model from one instrument (old or parent or master, A) to the others (new or children or slaves, B). Several approaches have been investigated recently to achieve this multivariate calibration transfer. All of these require that a small set of transfer samples is measured on all instruments involved. Usually, this is a small subset of the larger calibration set that has been measured on the parent instrument A. Let Z indicate the set of spectra for the transfer set, X the full set of spectra measured on the parent instrument, and a suffix A or B the instrument on which the spectra were obtained. The oldest approach to the calibration transfer problem is to apply the calibration model b_A, developed for the parent instrument A using a large calibration set (X_A), to the spectra of the transfer set obtained on each instrument, i.e. Z_A and Z_B. One then regresses the predictions \hat{y}_A (= Z_A b_A) obtained for the parent instrument on those for the child instrument \hat{y}_B (= Z_B b_A), giving

\hat{y}_A = a + b \hat{y}_B + e    (36.32)
This yields an estimate for the bias (intercept) a and slope b needed to correct predictions \hat{y}_B from the new (child) instrument that are based on the old (parent) calibration model b_A. The virtue of this approach is its simplicity: one does not need to investigate in any detail how the two sets of spectra compare; only the two sets of predictions obtained from them are related. The assumption is that the same type of correction applies to all future prediction samples. Variations in conditions that may have a different effect on different samples cannot be corrected for in this manner.
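A hedged sketch of this prediction-correction approach (illustrative only; the arrays are assumed to hold the transfer-set spectra of both instruments and the parent model b_A):

import numpy as np

def slope_bias_correction(Z_A, Z_B, b_A):
    """Estimate the intercept a and slope b of eq. (36.32) from the transfer spectra."""
    y_A, y_B = Z_A @ b_A, Z_B @ b_A          # predictions with the parent model
    b, a = np.polyfit(y_B, y_A, deg=1)       # regress parent predictions on child predictions
    return a, b

# a corrected prediction for a new spectrum s_B measured on the child instrument:
#   y_corrected = a + b * (s_B @ b_A)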
All other approaches try and relate the child spectra to the parent spectra. In the patented method of Shenk and Westerhaus [41], in its simplest form, one first applies a wavelength correction and then a correction for the absorbance. Each wavelength channel i of the parent instrument is linked to a nearby wavelength channel j(i) in the child instrument, namely the one to which it is maximally correlated. Then, for each pair of wavelengths, i for the parent and j(i) for the child, a simple linear regression is carried out, linking the pair of measured absorbances:

z_A,i = a_i + b_i z_B,j(i)    (36.33)
In this way the child spectrum is transformed into a spectrum as if measured on the parent instrument. In a more refined implementation one establishes the highest correlating wavelength channel through quadratic interpolation and, subsequently, the corresponding intensity at this non-observed channel through linear interpolation. In this way a complete spectrum measured on the child instrument can be transformed into an estimate of the spectrum as if it were measured on the parent instrument. The calibration model developed for the parent instrument may be applied without further ado to this spectrum. The drawback of this approach is that it is essentially univariate. It cannot deal with complex differences between dissimilar instruments. In the direct standardization introduced by Wang et al. [42] one finds the transformation needed to transfer spectra from the child instrument to the parent instrument using a multivariate calibration model for the transformation matrix: Z_A = Z_B F. The transformation matrix F (q×q) translates spectra Z_B that are actually measured on the child instrument B into spectra \hat{Z}_A that appear as if they were measured on instrument A. Predictions are then obtained by applying the old calibration model b_A to these simulated spectra \hat{Z}_A:

\hat{y} = \hat{Z}_A b_A = Z_B F b_A    (36.34)

giving

b_B = F b_A    (36.35)
as the transferred calibration model that applies directly to spectra measured on instrument B. Either PCR or PLS2 regression has been used for establishing the proper transformation F. Notice that for each channel of the estimated spectrum the full spectrum of instrument B is used. In piecewise direct standardization (PDS) one uses for each frequency (column of F) only the local information of
neighbouring wavelengths in the transfer spectra Z_B, employing a window of wavelengths (columns of Z_B) centered around the wavelength (column of Z_A) of current interest. In mathematical terms: one imposes a band structure on the transformation matrix F. The span of the neighbourhood region and the number of PCs have to be optimized via cross-validation. Applications of this PDS technique have proved successful [43]. The choice of the transfer subset is critical to the success of calibration transfer. The transfer samples should span the region of interest and can be chosen on the basis of extreme PC or PLS factor scores. Improved results can be obtained by a better coverage of the calibration range, for example by using a formal design algorithm such as that of Kennard and Stone (Section 24.4.2) for the selection of the transfer set. Forina [44] applies PLS regression both for estimating the calibration model on the one instrument and for modelling the relation between the two sets of spectra. Alternative methods or suggestions for improving existing methods continue to be reported. Good reviews on the theory and practice of the transfer of calibration models are found in Refs. [45,46].

36.5.3 Non-linear methods

In recent years there has been much activity to devise methods for multivariate calibration that take non-linearities into account. Artificial neural networks (Chapter 44) are well suited for modelling non-linear behaviour and they have been applied with success in the field of multivariate calibration [47,48]. A drawback of neural net models is that interpretation and visualization of the model is difficult. Several non-linear variants of PCR and PLS regression have been proposed. Conceptually, the simplest approach towards introducing non-linearity in the regression model is to augment the set of predictor variables (x_1, x_2, ...) with their respective squared terms (x_1^2, x_2^2, ...) and, optionally, their possible cross-product terms (x_1 x_2, ...). Since the number of predictors grows appreciably, PCR or PLS regression is called for. A non-linear variant of PLS employing splines for the inner relation between y and the t-scores has been proposed that has some analogy with neural nets. However, in multivariate calibration this splines-PLS approach [49] has not yet met with success. In fact, using a quadratic regression model based on factor scores from PCA can be just as effective [50]. One may employ linear PLS regression as a first step and then proceed with these PLS scores in a quadratically extended regression model (LQ-PLS, [51]). Locally weighted regression (LWR), an approach combining elements of PCA or PLS, weighted regression and local modelling, has been more successful [52]. In this approach one starts with a transformation of the spectra into a few PC scores. The spectrum of any new sample is transformed to the same PC space and a small set of similar spectra from the calibration set is determined using the Mahalanobis distance as a criterion for
Multiple linear regression is then used to relate the response y to the PC scores for this small local set, and this local model is used to estimate, by interpolation, the unknown response for the new sample. The number of PC dimensions and the number of neighbours are determined through cross-validation. More elaborate extensions of this approach take not only spectral similarity but also the estimated chemical similarity into account [53]. An interesting study comparing a variety of modern non-linear methods is given in Ref. [54].
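In outline, the LWR prediction step can be sketched as follows. The fragment below is written in Python with numpy; the use of an ordinary Euclidean distance in score space, the SVD-based score computation and all variable names are simplifying assumptions for illustration, not the implementation of Ref. [52].

```python
import numpy as np

def lwr_predict(X_cal, y_cal, x_new, n_pc=3, n_neighbours=10):
    """Predict y for one new spectrum by local regression in PC space (sketch)."""
    # Project calibration spectra onto the first n_pc principal components.
    X_mean = X_cal.mean(axis=0)
    Xc = X_cal - X_mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_pc].T                      # loadings (q x n_pc)
    T_cal = Xc @ P                       # calibration scores
    t_new = (x_new - X_mean) @ P         # scores of the new spectrum

    # Select the most similar calibration samples (Euclidean distance here;
    # Ref. [52] uses the Mahalanobis distance).
    d = np.linalg.norm(T_cal - t_new, axis=1)
    local = np.argsort(d)[:n_neighbours]

    # Local multiple linear regression of y on the scores.
    A = np.column_stack([np.ones(len(local)), T_cal[local]])
    b, *_ = np.linalg.lstsq(A, y_cal[local], rcond=None)
    return b[0] + t_new @ b[1:]

# Small simulated example: 50 calibration spectra of 100 channels.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 100))
y = X[:, :3] @ np.array([1.0, -0.5, 0.2]) + 0.01 * rng.normal(size=50)
print(lwr_predict(X, y, X[0]))
```

In practice the number of components and neighbours would be chosen by cross-validation, as described above.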
References

1. K.I. Hildrum, T. Isaksson, T. Næs and A. Tandberg, Near Infra-red Spectroscopy: Bridging the Gap between Data Analysis and NIR Applications. Ellis Horwood, New York, 1992.
2. H. van de Waterbeemd, ed., QSAR: Chemometric Methods in Molecular Design. VCH, Weinheim, 1995.
3. T. Næs and E. Risvik (Editors), Multivariate Analysis of Data in Sensory Science, Data Handling in Science and Technology Series. Elsevier, Amsterdam, 1996.
4. P.K. Hopke and X.-H. Song, The chemical mass balance as a multivariate calibration problem. Chemom. Intell. Lab. Syst., 37 (1997) 5-14.
5. P.J. Brown, Measurement, Calibration and Regression. Clarendon Press, Oxford, 1993.
6. T. Næs, Progress in multivariate calibration, pp. 52-60, in Ref. [1].
7. J.M. Sutter, J.H. Kalivas and P.M. Lang, Which principal components to utilize for principal component regression. J. Chemometr., 6 (1992) 217-225.
8. Y.L. Xie and J.H. Kalivas, Evaluation of principal component selection methods to form a global prediction model by principal component regression. Anal. Chim. Acta, 348 (1997) 19-27.
9. R.F. Gunst and R.L. Mason, Regression Analysis and its Application: A Data-Oriented Approach. Marcel Dekker, New York, 1980.
10. E. Vigneau, D. Bertrand and E.M. Qannari, Application of latent root regression for calibration in near-infrared spectroscopy. Comparison with principal component regression and partial least squares. Chemom. Intell. Lab. Syst., 35 (1996) 231-238.
11. S. de Jong and H.A.L. Kiers, Principal covariates regression. Chemom. Intell. Lab. Syst., 14 (1992) 155-164.
12. S. Van Huffel and J. Vandewalle, The Total Least Squares Problem: Computational Aspects and Analysis. SIAM, Philadelphia, PA, 1991.
13. P.T. Davies and M.K.S. Tso, Procedures for reduced-rank regression. Appl. Stat., 31 (1982) 244-255.
14. M. Stone and R.J. Brooks, Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares, and principal component regression. J. Roy. Stat. Soc., B52 (1990) 237-269.
15. R.J. Brooks and M. Stone, Joint continuum regression for multiple predictands. J. Am. Stat. Assoc., 89 (1994) 1374-1377.
16. I.E. Frank and J.H. Friedman, A statistical view of some chemometrics regression tools. Technometrics, 35 (1993) 109-135.
17. R. Sundberg, Interplay between chemistry and statistics, with special reference to calibration and the generalized standard addition method. Chemom. Intell. Lab. Syst., 4 (1988) 299-305.
18. M.A. Sharaf, D.L. Illman and B.R. Kowalski, Chemometrics. Wiley, New York, 1986.
19. D.M. Allen, The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16 (1974) 125-127.
20. M. Forina, G. Drava, R. Boggia, S. Lanteri and P. Conti, Validation procedures in near-infrared spectrometry. Anal. Chim. Acta, 295 (1994) 109-118.
21. H. van der Voet, Comparing the predictive accuracy of models using a simple randomization test. Chemom. Intell. Lab. Syst., 25 (1994) 313-323.
22. B. Efron and R. Tibshirani, An Introduction to the Bootstrap. Wiley, New York, 1993.
23. R. Wehrens and W. van der Linden, Bootstrapping principal regression models. J. Chemom., 11 (1997) 157-172.
24. T. Fearn, Flat or natural? A note on the choice of calibration samples, pp. 61-66 in Ref. [1].
25. T. Næs and T. Isaksson, Splitting of calibration data by cluster analysis. J. Chemometr., 5 (1991) 49-65.
26. O.E. de Noord, The influence of data preprocessing on the robustness and parsimony of multivariate calibration models. Chemom. Intell. Lab. Syst., 23 (1994) 65-70.
27. W.R. Hruschka, Data analysis: wavelength selection methods, pp. 35-55 in: P.C. Williams and K. Norris, eds., Near-infrared Reflectance Spectroscopy. Am. Cereal Assoc., St. Paul, MN, 1987.
28. P. Geladi, D. McDougall and H. Martens, Linearization and scatter-correction for near-infrared reflectance spectra of meat. Appl. Spectrosc., 39 (1985) 491-500.
29. F. McClure, Analysis using Fourier transforms, in: Handbook of Near-Infrared Analysis, D.A. Burns and E.W. Ciurczak, eds. Dekker, New York, pp. 181-224 (1992).
30. A.C. Atkinson, Masking unmasked. Biometrika, 73 (1986) 533-541.
31. I.N. Wakeling and H.J.H. MacFie, A robust PLS procedure. J. Chemom., 6 (1992) 189-198.
32. B. Walczak and D.L. Massart, Robust PCR as outliers detection tool. Chemom. Intell. Lab. Syst., 27 (1995) 41-54.
33. B. Walczak, Outlier detection in bilinear calibration. Chemom. Intell. Lab. Syst., 29 (1995) 63-73.
34. A. Singh, Outliers and robust procedures in some chemometric applications. Chemom. Intell. Lab. Syst., 33 (1996) 75-100.
35. A.S. Hadi, A modification of a method for the detection of outliers in multivariate samples. J. Roy. Stat. Soc., B56 (1994) 393-396.
36. P.J. Brown, C.H. Spiegelman and M.C. Denham, Chemometrics and spectral frequency selection. Phil. Trans. R. Soc. Ser. A, 337 (1991) 311-322.
37. I.E. Frank, Intermediate least squares regression method. Chemom. Intell. Lab. Syst., 1 (1987) 232-242.
38. A. Höskuldsson, The H-principle in modelling with applications to chemometrics. Chemom. Intell. Lab. Syst., 14 (1992) 139-153.
39. D. Jouan-Rimbaud, D.L. Massart, R. Leardi, et al., Genetic algorithms as a tool for wavelength selection in multivariate calibration. Anal. Chem., 67 (1995) 4295-4301.
40. V. Centner, D.L. Massart, O.E. de Noord, S. de Jong, B.G.M. Vandeginste and C. Sterna, Elimination of uninformative variables for multivariate calibration. Anal. Chem., 68 (1996) 3851-3858.
41. J.S. Shenk and M.O. Westerhaus, US Patent No. 4866644, Sept. 12, 1989.
42. Y.D. Wang, D.J. Veltkamp and B.R. Kowalski, Multivariate instrument standardization. Anal. Chem., 63 (1991) 2750-2756.
43. Z.Y. Wang, T. Dean and B.R. Kowalski, Additive background correction in multivariate instrument standardization. Anal. Chem., 67 (1995) 2379-2385.
44. M. Forina, G. Drava, C. Armanino, et al., Transfer of calibration function in near-infrared spectroscopy. Chemom. Intell. Lab. Syst., 27 (1995) 189-203.
45. O.E. de Noord, Multivariate calibration standardization. Chemom. Intell. Lab. Syst., 25 (1995) 85-97.
46. E. Bouveresse and D.L. Massart, Standardisation of near-infrared spectrometric instruments. Vib. Spectrosc., 11 (1996) 3-15.
47. B.J. Wythoff, Backpropagation neural networks — a tutorial. Chemom. Intell. Lab. Syst., 18 (1993) 115-155.
48. C. Borggaard and H.H. Thodberg, Optimal minimal neural interpretation of spectra. Anal. Chem., 64 (1992) 545-551.
49. S. Wold, Non-linear partial least squares modelling. II. Spline inner relation. Chemom. Intell. Lab. Syst., 14 (1992) 71-84.
50. S.D. Oman, T. Næs and A. Zube, Detecting and adjusting for nonlinearities in calibration of near-infrared data using principal components. J. Chemom., 7 (1993) 195-212.
51. S. Wold, N. Kettaneh-Wold and B. Skagerberg, Nonlinear PLS modeling. Chemom. Intell. Lab. Syst., 7 (1989) 53-65.
52. T. Næs, T. Isaksson and B. Kowalski, Locally weighted regression in NIR analysis. Anal. Chem., 62 (1990) 664-673.
53. Z.Y. Wang, T. Isaksson and B.R. Kowalski, New approach for distance measurement in locally weighted regression. Anal. Chem., 66 (1994) 249-260.
54. S. Sekulic, B.R. Kowalski, Z.Y. Wang, et al., Nonlinear multivariate calibration methods in analytical chemistry. Anal. Chem., 65 (1993) A835-A845.
Additional recommended reading

Books

H. Martens and T. Næs, Multivariate Calibration. Wiley, Chichester, 1989.
J.H. Kalivas and P.M. Lang, Mathematical Analysis of Spectral Orthogonality. Dekker, New York, 1994.
Articles

K.R. Beebe and B.R. Kowalski, An introduction to multivariate calibration and analysis. Anal. Chem., 59 (1987) 1007A-1017A.
B.R. Kowalski and M.B. Seasholtz, Recent developments in multivariate calibration. J. Chemometrics, 5 (1990) 129-145.
H. Martens and T. Næs, Multivariate Calibration by Data Compression, Chapter 4 in Ref. [1].
P.J. Brown, Multivariate Calibration. J. Roy. Stat. Soc., B44 (1982) 287-321.
K. Faber and B.R. Kowalski, Propagation of measurement errors for the validation of predictions obtained by principal component regression and partial least squares. J. Chemom., 11 (1997) 181-238.
D.M. Haaland, Multivariate Calibration Methods Applied to the Quantitative Analysis of Infrared Spectra, Chapter 1 in: Computer-Enhanced Analytical Spectroscopy, Volume 3, edited by P.C. Jurs. Plenum Press, New York, 1992.
Chapter 37
Quantitative Structure-Activity Relationships (QSAR)

37.1 Extrathermodynamic methods

Today it is a well-accepted view that the biological and therapeutic activity of a drug depends upon its physicochemical and conformational (or steric) properties. The former include the ability of a molecule to cross membranes of cells or to be taken up by fatty material, the distribution of electric charges within the molecule, its capacity to form hydrogen bonds with other molecules, etc. The latter relate to the nature of the atoms that make up the molecule, the distances between atoms and the angles formed by the chemical bonds between them. The search for quantitative relations between chemical structure and biological activity is the subject of Quantitative Structure-Activity Relationships (QSAR), the purpose of which is to explain why a given drug produces its particular effect, and ultimately to predict the effect of newly synthesized chemical compounds [1,2].

Quantitative structure-property relationships have been known for a long time in organic chemistry, such as the regular increase in boiling temperature in homologous series of alkanes (e.g. methane, ethane, propane, etc.) as a function of the number of carbon atoms. It was natural, therefore, to suspect that similar regularities exist between chemical structure and biological activity. The first observation of a quantitative structure-activity relationship was made independently by Meyer and Overton around 1900 [3,4]. They found that the potency of anaesthetics depended upon their lipophilicity, i.e. their tendency to dissolve in oil rather than in water. This discovery is generally regarded as the starting point of QSAR methodology. It pointed toward a physicochemical interaction between a drug and the biological materials of the organism in which the drug is to exert its desired effect.

A crucial element of QSAR is the concept of the biological receptor, which emerged around the same time that Meyer and Overton made their historical observation. The concept states that a drug molecule must interact in a highly specific way with certain proteins that play a key role in the production of the desired effect. For example, if a pathological condition is due to an abnormally high activity of a certain enzyme or receptor protein, then one may attempt to block this enzyme or receptor by means of a chemical compound that binds specifically to it. This is the case with naloxone which is a specific antagonist
of the opioid receptor and which is used as an antidote for poisoning by morphine and related narcotics. The stereospecificity of the drug-receptor interaction was discovered in 1894 by Fischer in his study of the cleavage of glycosides by yeast [5]. He devised the still famous 'lock and key' paradigm. A drug acts as a key which can turn the receptor on or off, and which must satisfy narrow constraints (although the key may be flexible and the lock may appear to be somewhat plastic). The actual term receptor was coined in 1913 by Ehrlich, the discoverer of salvarsan (arsphenamine) which was regarded as the 'magic bullet' for the treatment of syphilis [6]. The part of the drug molecule that interacts with the receptor is called the pharmacophore. Compounds that share the same pharmacophore are considered to be similar with respect to the biological activity that is triggered by their receptor. In practice, however, the therapeutic benefit may vary considerably within a series of similar compounds due to differences in absorption, secretion, metabolization, uptake by fatty tissues, transport through membranes, toxicity, etc. The design of drugs is both an art and a science, but rational approaches, such as QSAR, play an ever increasing role in the generation of new lead compounds and the optimization of existing leads. The foundation of QSAR as a practical tool of drug design took place around 1964 with the introduction of two so-called extrathermodynamic methods. One of these methods is based on so-called linear free energy relationships (LFER) between biological activities of structurally related (congeneric) drugs and the physicochemical properties of the chemical substituents on a common parent molecule. For example, benzoic acid may be regarded as a parent compound which can be substituted at the ortho, meta and para positions by chlorine, bromine, amine, hydroxyl, acetyl, etc. The latter are referred to as the substituent groups. The ortho, meta and para locations with respect to the carboxyl group in the parent benzoic acid molecule are called the substituent positions. This approach is also referred to as Hansch analysis. Another extrathermodynamic method is based on the additivity of the contributions to the biological activity by various substituent groups at multiple substituent positions, and is called Free-Wilson analysis. Both the Hansch and Free-Wilson methods make use of multiple linear regression. These approaches allow us to determine the combination of substituents that provide maximal activity in a series of structurally related molecules. These developments have been extensively reviewed by Martin [7] and Kubinyi [8]. Multivariate chemometric techniques have subsequently broadened the arsenal of tools that can be applied in QSAR. These include, among others, Multivariate ANOVA [9], Simplex optimization (Section 26.2.2), cluster analysis (Chapter 30) and various factor analytic methods such as principal components analysis (Chapter 31), discriminant analysis (Section 33.2.2) and canonical correlation analysis (Section 35.3). An advantage of multivariate methods is that they can be applied in
supervised and unsupervised modes. They can handle series of structurally unrelated compounds and some of them can be used in the case when multiple biological activities are obtained. As we will see later on, a drug may act on a wide variety of receptors and hence may provide a typical spectrum of biological activities. Spectral map analysis is a factor analytic method which makes use of this property (Section 31.3.5). Partial Least Squares (PLS) regression (Section 35.7) is one of the more recent advances in QSAR which has led to the now widely accepted method of Comparative Molecular Field Analysis (CoMFA). This method makes use of local physicochemical properties such as charge, potential and steric fields that can be determined on a three-dimensional grid that is laid over the chemical structures. The determination of steric conformation, by means of X-ray crystallography or NMR spectroscopy, and the quantum mechanical calculation of charge and potential fields are now performed routinely on medium-sized molecules [10]. Modern optimization and prediction techniques such as neural networks (Chapter 44) also have found their way into QSAR. Much attention is devoted today to automatic searching of large libraries of chemical compounds in order to find interesting chemical structures that possess some similarity to known lead compounds. Current emphasis is on searching of three-dimensional structures within libraries of hundreds of thousands of compounds [11]. Finally, our knowledge of proteins, specifically of enzymes and receptors, has greatly increased by the use of techniques from biotechnology. It is now possible in many cases to isolate and clone these macromolecules in pure form and to determine their primary amino acid sequence, as well as their exact three-dimensional structure. This has led to so-called rational drug design by which drugs are designed by means of computer modelling rather than by the empirical (serendipitous) method of trial-and-error. In the light of these considerations one may regard QSAR as a multidisciplinary field of chemometrics. Statistics, optimization, pattern recognition and information technology converge in QSAR with chemistry, physicochemistry, biology, biochemistry and biotechnology.

The start of QSAR coincides with the development of linear free energy relationships (LFER) between biological activities and the physicochemical properties of molecules. This development is generally referred to as the extrathermodynamic approach in QSAR for reasons that will become apparent later on in this section. A vast amount of literature has appeared on the subject since 1964. The content of this section has been largely inspired by a recent review of the subject by Kubinyi [8].

As we have remarked above, biological activity is the result of the binding of a drug D onto a receptor protein (or enzyme) P which results in the drug-receptor complex DP. The strength of the binding and, hence, the magnitude of the effect can be expressed by means of the change in Gibbs' free energy ΔG between the
free D and P on the one hand and the bound DP on the other hand. According to classical thermodynamics, the free energy release ΔG is made up of an enthalpic and an entropic term [12]:

ΔG = ΔH - TΔS
(37.1)
where ΔH is the change in enthalpy, ΔS is the change in entropy and T represents the absolute temperature. The above relation shows that the strength of the binding between drug and receptor increases with the release of enthalpy and with the gain in entropy (i.e. decrease of order) in the drug-receptor complex. The free energy ΔG can also be related to the equilibrium constant K between the free and the bound state of drug and receptor:

ΔG = -2.303 RT log K
(37.2)
where R represents the gas constant. The relation shows that if the reaction between drug and receptor proceeds in the direction of the formation of the drug-receptor complex (K > 1), then a decrease of free energy will occur (ΔG < 0).

Interactions of a drug with a receptor (or enzyme) are of a reversible non-covalent nature. The free energy released by drug-receptor interactions is smaller by one to two orders of magnitude than that of a covalent bond. The interactions take place mainly by means of electrostatic, hydrophobic and dispersion forces. Electrostatic interactions between charged groups are enthalpic. Hydrogen bonding is a weak electrostatic interaction which takes place between electron-donating and electron-accepting groups, such as amine, ketone, hydroxyl groups, etc. Hydrophobic interactions are mostly entropic. They can displace water molecules that are bound to the drug molecule or to the receptor surface. This diminishes the ordering of the water molecules and, hence, increases entropy, which in turn lowers the free energy of the drug-receptor complex. Dispersion forces are attractive when a drug molecule closely approaches the receptor surface. They are repulsive when the van der Waals radii of drug and receptor molecules tend to overlap. The nature of dispersion forces is mainly enthalpic.

The question that now arises is how to define the biological (or therapeutic) activity of a drug. There are good reasons for defining biological activity as the logarithm of the reciprocal of the effective dose (or concentration) that is needed in order to produce a well-defined biological effect. This is in accordance with the Weber-Fechner psychophysical law, which states that a biological response is an approximately linear function of the logarithm of the physical stimulus [13]. Furthermore, experimental observations also show that the doses or concentrations that are tolerated by living organisms follow a log-normal distribution. The reciprocal of the effective dose is required, since the more active or potent drugs require a lower
dose in order to produce the desired effect. Usually, an effective dose is defined as the dose which produces the required effect in half of the samples or subjects that have been tested. In classical pharmacology on living subjects this produces the so-called median effective dose (ED50). In biochemical pharmacology on isolated receptors, results are often reported in the form of median inhibitory concentrations (IC50). If we combine the previous observations with eq. (37.2), we obtain an expression which relates the biological activity (log 1/C) of a drug to the equilibrium constant (K) of the drug-receptor interaction:

log 1/C = a log K
(37.3)
where a denotes a proportionality constant. The next step in the development of the extrathermodynamic approach was to find a suitable expression for the equilibrium constant in terms of physicochemical and conformational (steric) properties of the drug. Use was made of a physicochemical interpretation of the dissociation constants of substituted aromatic acids in terms of the electronic properties of the substituents. This approach had already been introduced by Hammett in 1940 [14]. The Hammett equation relates the dissociation constant K of a substituted benzoic acid (e.g. meta-chlorobenzoic acid) to the so-called Hammett electronic parameter σ:

log K = log K_0 + ρσ
(37.4)
where K_0 represents the dissociation constant of unsubstituted benzoic acid and where ρ is a proportionality constant. A distinction is made between σ_m and σ_p, which take different values depending on the meta (m) or para (p) position of the substituent on the parent benzoic acid molecule. On the analogy of the physicochemical relation, one was led to define a biological Hammett equation which related the equilibrium constant of the drug-receptor complex to the electronic σ parameters of the substituents (e.g. chlorine, bromine, methyl, ethyl, hydroxyl, carboxyl, acetyl, etc.) of the drug molecule. Since the equilibrium constant of a drug-receptor complex is reflected by the biological activity, this led to the first extrathermodynamic relationship in QSAR:

log 1/C = b_0 + b_1 σ
(37.5)
where b_0 and b_1 are coefficients that can be derived by means of linear regression (Chapter 8). For the historical reasons discussed above, the relation is also referred to as a linear free energy relationship (LFER). A fundamental assumption of the approach is that contributions to σ from several substituent groups on the same parent compound are additive. The additivity assumption also holds for the more general Hansch model that will be discussed below.
37.1.1 Hansch analysis

It soon became apparent that the biological Hammett equation produced an unsatisfactory fit between the biological activity of a set of chemical analogs and the experimentally obtained electronic σ values of the substituent groups. Hansch [15,16] first proposed to extend the equation with a term which accounted for the hydrophobic interaction. The latter can be characterized by determining the partition of the drug between oil (octanol) and water. The ratio of the concentrations of the drug in the two phases at equilibrium is called the partition coefficient P, and the decimal logarithm of P is referred to as the lipophilicity. Lipophilicity measures the tendency of a drug to dissolve into fatty material. It is not exactly the same as hydrophobicity, which measures the ability of a drug to displace or expel water molecules at a binding site. Lipophilicity, however, is often a good approximation to hydrophobicity. This led to the Hansch equation, which reads in its simplest form:

log 1/C = b_0 + b_1 σ + b_2 log P
(37.6)
or, when electrostatic interactions can be neglected:

log 1/C = b_0 + b_1 log P
(37.7)
where b_0, b_1 and b_2 are coefficients that can be determined by means of multiple linear regression (see Chapter 10). The Hansch model can be expressed in a general form:

y = Xb + e
(37.8)
where X represents the matrix of independent parameters (extended by a unit column-vector in order to account for the constant term), y is the vector of observed biological activities, e contains the residuals between observed and computed activities, and b is the vector of regression coefficients. The above Hansch equations are also generally referred to as linear free energy relationships (LFER) as they are derived from the free energy concept of the drug-receptor complex. They also assume that biological activity is linearly related to the electronic and lipophilic contributions of the various substituents on the parent molecule.

A typical Hansch analysis has been applied to the 50% inhibitory concentrations (IC50) of oxidative phosphorylation of 11 doubly substituted salicylanilides (Table 37.1), as reported by Williamson and Metcalf [17]. Multiple linear regression leads to the following model:

log 1/IC50 = -2.190 + 0.708 σ + 0.348 log P
TABLE 37.1
Lipophilicity (log P), Hammett electronic parameter (σ) and inhibitory concentration (IC50) for oxidative phosphorylation of 11 doubly substituted salicylanilides [17]. The two substitution positions are labeled A and B.

#    A                     B               log P    σ       IC50
1    5-Cl-3-(4-Cl-C6H4)    4'-Cl           4.68     0.463   2.512
2    5-Cl                  2'-Cl-4'-NO2    2.12     1.724   2.042
3    5-Cl-3-C6H5           3',4'-Cl2       4.79     0.836   1.549
4    5-Cl-3-C6H5           2',5'-Cl2       4.55     0.836   1.445
5    5-Cl-3-C6H5           2',4',5'-Cl3    5.48     1.063   0.646
6    5-Cl-3-C6H5           3'-Cl-4'-NO2    4.36     1.879   0.617
7    5-Cl-3-(4-Cl-C6H4)    2'-Cl-5'-NO2    4.98     1.173   0.263
8    5-Cl-3-(4-Cl-C6H4)    2'-Cl-4'-CN     4.58     1.091   0.219
9    5-Cl-3-(4-Cl-C6H4)    2'-Cl-5'-CF3    5.93     0.878   0.200
10   5-Cl-3-(4-Cl-C6H4)    2',4',5'-Cl3    6.41     1.063   0.173
11   5-Cl-3-t-Bu           2'-Cl-4'-NO2    4.00     1.527   0.170
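A minimal sketch of how such a Hansch regression can be computed is given below. It uses Python with numpy (an assumption for illustration, not the tool used in the original study), with the σ, log P and IC50 columns transcribed from Table 37.1.

```python
import numpy as np

# Columns of Table 37.1 as transcribed above (IC50 on the original scale).
log_p = np.array([4.68, 2.12, 4.79, 4.55, 5.48, 4.36, 4.98, 4.58, 5.93, 6.41, 4.00])
sigma = np.array([0.463, 1.724, 0.836, 0.836, 1.063, 1.879, 1.173, 1.091, 0.878, 1.063, 1.527])
ic50  = np.array([2.512, 2.042, 1.549, 1.445, 0.646, 0.617, 0.263, 0.219, 0.200, 0.173, 0.170])

y = np.log10(1.0 / ic50)                                    # biological activity log 1/IC50
X = np.column_stack([np.ones_like(sigma), sigma, log_p])    # unit column + predictors

b, res, rank, sv = np.linalg.lstsq(X, y, rcond=None)        # multiple linear regression
y_hat = X @ b
s = np.sqrt(np.sum((y - y_hat) ** 2) / (len(y) - X.shape[1]))   # residual std. deviation
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# If the data are transcribed correctly, b should be close to the
# coefficients (-2.190, 0.708, 0.348) quoted in the text.
print(b, s, r2)
```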
The residual standard deviation of the regression (s) equals 0.346, the coefficient of determination (R²) is 0.548 and the F-statistic amounts to 4.840 with 2 and 8 degrees of freedom (df), which is significant at the 0.05 level of significance (p). All coefficients of the regression are significant at the 0.05 level of probability, which lends evidence to the importance of both electrostatic (σ) and hydrophobic (log P) interactions. The coefficient of determination (R²) is small, however, and the error term is relatively large. This suggests that predicted IC50 values are expected to deviate by more than a factor of two from the observed values. Inspection of the residuals indicates that there may be two outliers (compounds 8 and 11) in this data set that do not fit well to the proposed Hansch model.

Frequently, the relationship between biological activity and log P is curved and shows a maximum [18]. In that case, quadratic and non-linear Hansch models have been proposed [19]. The parabolic model is defined as:

log 1/C = b_0 + b_1 log P + b_2 (log P)²
(37.9)

which reaches a maximum at (log P)_0 = -b_1/(2 b_2).
The bilinear model takes the form:

log 1/C = b_0 + b_1 log P + b_2 log(b_3 P + 1)
(37.10)
or, equivalently:

log 1/C = b_0 + b_1 log P + b_2 log(b_3 · 10^(log P) + 1)

which reaches a maximum at log P_0 = log(-b_1/(b_3 (b_1 + b_2))). Note that the lipophilicity parameter log P is defined as a decimal logarithm. The parabolic equation is only non-linear in the variable log P, but is linear in the coefficients. Hence, it can be solved by multiple linear regression (see Section 10.8). The bilinear equation, however, is non-linear in both the variable P and the coefficients, and can only be solved by means of non-linear regression techniques (see Chapter 11). It is approximately linear with a positive slope (b_1) for small values of log P, while it is also approximately linear with a negative slope (b_1 + b_2) for large values of log P. The term bilinear is used in this context to indicate that the QSAR model can be resolved into two linear relations for small and for large values of P, respectively. This definition differs from the one which has been introduced in the context of principal components analysis in Chapter 17.

A non-linear Hansch model has been applied to the bactericidal concentrations (C) of 17 doubly substituted phenols (Table 37.2) which have been reported by Klarmann et al. [20]. By means of multiple linear regression we obtain the parabolic Hansch model of eq. (37.9):

log 1/C = -3.474 + 2.298 log P - 0.225 (log P)²

(s = 0.178, R² = 0.933, F = 97.1 with df = 2 and 14, p < 0.0001).
The coefficients of the regression are all highly significant (p < 0.0001) and the fit of the model to the observed data is shown in Fig. 37.1. Using non-linear regression we obtain the bilinear Hansch model for the bactericidal activities of the 17 phenol analogs (Table 37.2):

log 1/C = -1.552 + 1.011 log P - 3.331 log(0.00244 P + 1)

with s = 0.179. In this case, the bilinear model of eq. (37.10) fits the data as well as the parabolic one. A possible reason for the lack of improvement is that the part of the model which accounts for the higher values of log P is not well covered by the data (Fig. 37.1). The parabolic model yields an optimal value for log P of 5.10, while the optimum of the bilinear model is found at 5.18.

Hansch analysis marked the breakthrough of QSAR. The method was soon extended with additional parameters with the aim of improving the fit between biological and physicochemical data and for the prediction of drugs with optimal activity.
TABLE 37.2
Lipophilicity (log P) and bactericidal concentrations (C) of 17 doubly substituted phenols [20]

#    R            R'          log P    log 1/C
1    H            Cl          2.39     0.81
2    Methyl       Cl          2.89     1.34
3    Ethyl        Cl          3.39     1.73
4    Propyl       Cl          3.89     2.26
5    Butyl        Cl          4.39     2.52
6    Amyl         Cl          4.89     2.63
7    sec-Amyl     Cl          4.69     2.23
8    Cyclohexyl   Cl          4.90     2.25
9    Heptyl       Cl          5.89     2.51
10   Octyl        Cl          6.39     1.83
11   Cl           H           2.15     0.50
12   Cl           Methyl      2.65     0.91
13   Cl           Ethyl       3.15     1.35
14   Cl           Propyl      3.65     1.86
15   Cl           Butyl       4.15     2.20
16   Cl           Amyl        4.65     2.23
17   Cl           tert-Amyl   4.33     2.00
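The parabolic fit reported above can be reproduced with a few lines of code. The sketch below uses Python with numpy (an assumption, not the original computation) and the log P and log 1/C columns transcribed from Table 37.2; the bilinear model of eq. (37.10) would require a non-linear optimizer and is not sketched here.

```python
import numpy as np

# Columns of Table 37.2 as transcribed above.
log_p = np.array([2.39, 2.89, 3.39, 3.89, 4.39, 4.89, 4.69, 4.90, 5.89, 6.39,
                  2.15, 2.65, 3.15, 3.65, 4.15, 4.65, 4.33])
y = np.array([0.81, 1.34, 1.73, 2.26, 2.52, 2.63, 2.23, 2.25, 2.51, 1.83,
              0.50, 0.91, 1.35, 1.86, 2.20, 2.23, 2.00])

# Parabolic Hansch model: log 1/C = b0 + b1 log P + b2 (log P)^2.
# np.polyfit returns the highest-order coefficient first.
b2, b1, b0 = np.polyfit(log_p, y, 2)
log_p_opt = -b1 / (2 * b2)        # location of the maximum of eq. (37.9)

# If the transcription is correct, the coefficients should be close to
# (-3.474, 2.298, -0.225) and the optimum close to log P = 5.10.
print(b0, b1, b2, log_p_opt)
```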
Measures of lipophilicity are mostly derived from partition coefficients P and from chromatographic retention times [21]. Rekker [22] derived a method for computing lipophilicity 'de novo', i.e. from first principles as opposed to experimentally, which improved on the experimental method used by Hansch, especially in the case of aliphatic compounds. Electronic parameters include Hammett's σ_m and σ_p which we have already discussed, the (inductive) field and resonance parameters F and R which have been derived from spectroscopic measurements by Swain and Lupton [23], the dipole moment and various quantum mechanical properties which relate to the distribution of electrons in the molecule. The Swain and Lupton parameters F and R have been shown to be linear combinations of Hammett's σ derived from meta and para substituents of benzoic acid. It is claimed that the field and resonance parameters are less correlated than Hammett's electronic parameters.
Fig. 37.1. Quadratic Hansch model fitted to the bactericidal activities (log 1/C) of the doubly substituted phenols in Table 37.2 as a function of lipophilicity (log P) [20].
Steric parameters account for the geometric properties of the compounds. Among these we only cite the steric bulk factor E_s of Taft [24], the molecular weight MW and the five STERIMOL variables which were proposed by Verloop [25] to describe the size and shape of the molecules. Molar refractivity MR appears to be a parameter which correlates with lipophilicity and steric parameters. Connectivity indices have been introduced by Kier and Hall [26] for the description of the topological (graph) structure of the molecules. A connectivity index is a single number which expresses how atoms are arranged in a molecule. There are several different indices, each of which expresses a particular feature of the molecule, such as its degree of branchedness, the presence of double and triple bonds, of non-carbon atoms (apart from hydrogen), of aromatic rings, etc. Finally, indicator variables can be included in the Hansch equation in order to account for the presence or absence of specific chemical functions. The use of indicator variables will be discussed more amply in the following section, which deals with Free-Wilson analysis.

Extensive data bases are now available which list lipophilicity, molar refractivity, electronic and steric values for a wide collection of substituents [27,28]. Starting from a particular parent compound one computes the value of a physicochemical parameter (e.g. lipophilicity) for a given drug by adding the contributions
of all the substituents that have been introduced on the parent compound. Hence, for a particular analog derived from an unsubstituted parent compound we obtain:

log P = Σ_i (log P)_i
(37.11)
σ = Σ_i σ_i
(37.12)
where the summation index i extends over all substituents on the parent molecule. Computed values of log P are often represented by the symbol π, in order to distinguish them from the empirical values which have been derived from partition coefficients or from chromatographic retention times.

A difficulty with Hansch analysis is to decide which parameters and functions of parameters to include in the regression equation. This problem of selection of predictor variables has been discussed in Section 10.3.3. Another problem is due to the high correlations between groups of physicochemical parameters. This is the multicollinearity problem which leads to large variances in the coefficients of the regression equations and, hence, to unreliable predictions (see Section 10.5). It can be remedied by means of multivariate techniques such as principal components regression and partial least squares regression, applications of which are discussed below. The fundamental assumption of Hansch analysis is that substituent values are additive. This implies that substituents are mutually independent, i.e. that the effect of a substitution group at one position in the parent molecule is independent of substitution groups at other positions. The assumption of additivity is violated, for example, when hydrogen bonding occurs between two adjacent substituents.

37.1.2 Free-Wilson analysis

The second extrathermodynamic method that we discuss here differs from Hansch analysis by the fact that it does not involve experimentally derived substitution constants (such as σ, log P, MR, etc.). The method was originally developed by Free and Wilson [29] and has been simplified by Fujita and Ban [30]. The subject has been extensively reviewed by Martin [7] and by Kubinyi [8]. The method is also called the 'de novo' approach, as it is derived from first principles rather than from empirical observations. The underlying idea of Free-Wilson analysis is that a particular substituent group at a specific substitution site on the molecule contributes a fixed amount to the biological activity (log 1/C). This can be formulated in the form of the linear relationship:
log 1/C = b_0 + Σ_j Σ_i b_ij x_ij
(37.13)
where b_0 represents the biological activity of the reference compound. The index j ranges between 1 and the number of substitution positions p, while the index i varies between 1 and the number of possible substituents n_j at position j, minus 1. The coefficient b_ij in eq. (37.13) denotes the contribution to the activity by the ith substituent group at the jth substitution position on the reference compound, and x_ij is the corresponding indicator variable. The latter equals one if the ith substituent group is present at the jth position, and zero otherwise. Note that there is a distinction between the parent compound in Hansch analysis and the reference compound in Free-Wilson analysis. The parent compound is by definition the unsubstituted compound (with hydrogen atoms at each substitution site), while the reference compound can be chosen arbitrarily from the set of compounds. The substituent groups that appear in the reference compound do not appear in the Free-Wilson model that has been defined above. The number n of independent variables x_ij is defined by:

n = Σ_j (n_j - 1)
(37.14)
Hence, the number of compounds must at least be equal to n. Table 37.3 shows the complete table of eight indicator variables for 10 triply substituted tetracyclines [31] that have been tested for bacteriostatic activity (1/Z), which is defined here as the ratio of the number of colonies grown with a substituted and with the unsubstituted tetracycline. In this application we have three substitution positions, labelled U, V and W. The number of substituents at the three sites equals 2, 3 and 3, respectively. Arbitrarily, we chose the compound with substituents H, NO2 and NO2 at the sites U, V and W as the reference compound. This leads to a reduction of the number of indicator variables from eight to five, as shown in Table 37.4. The solution of the Free-Wilson model can be obtained directly by means of multiple regression:

log 1/Z = -0.603 - 0.364 I(CH3) + 0.060 I(Cl) + 0.026 I(Br) + 1.129 I(NH2) + 0.480 I(CH3CONH)

(s = 0.347, R² = 0.832, F = 3.96
with df = 5 and 4, p = 0.10).
In the above expression the indicator variable I(X) takes the value 0 or 1, depending upon the absence or presence of the substituent X in a particular compound. The overall result of the regression is not significant at the 0.05 level of probability. This may be due to the unfavorable proportion of the number of compounds to the number of parameters in the regression equation (10 to 6). Only the indicator variable for substituent NH2 at position W in the tetracycline molecule reaches significance (p = 0.02). This can be confirmed by looking at Table 37.4, which shows that all compounds with the NH2 substituent at W possess high activity, while the remaining compounds reached only low activities.
TABLE 37.3
Complete indicator table and bacteriostatic activities of 10 triply substituted tetracyclines. The three substitution positions are labeled U, V and W [31].

#    U:H   U:CH3   V:NO2   V:Cl   V:Br   W:NO2   W:NH2   W:CH3CONH   1/Z
1    1     0       1       0      0      1       0       0           0.60
2    1     0       0       1      0      1       0       0           0.21
3    1     0       0       0      1      1       0       0           0.15
4    1     0       0       1      0      0       1       0           5.25
5    1     0       0       0      1      0       1       0           3.20
6    1     0       1       0      0      0       1       0           2.75
7    0     1       1       0      0      0       1       0           1.60
8    0     1       1       0      0      0       0       1           0.15
9    0     1       0       0      1      0       1       0           1.40
10   0     1       0       0      1      0       0       1           0.75
It is not surprising, therefore, that the Free-Wilson equation indicates that the W site in the molecule is the most critical one for obtaining bacteriostatic activity.

A Free-Wilson analysis can also be performed by means of analysis of variance (ANOVA) instead of multiple regression (Chapter 6). The interpretation of the results obtained by the two methods is essentially the same. In a broad sense, one may include the Free-Wilson equation within the class of linear free energy relationships (LFER). It is also subject to the assumption of additivity of the contributions to the biological activity by substituent groups at different substitution sites. The assumption requires, for example, that there is no hydrogen bonding interaction between the various substitution groups.

The interpretation of the result of a Free-Wilson analysis is somewhat different from that of a Hansch analysis. The coefficients in the Hansch model represent absolute contributions to the biological activity of a compound from the various types of interactions with the receptor (lipophilic, electronic, steric, etc.).
TABLE 37.4
Reduced indicator table and bacteriostatic activities 1/Z of 10 triply substituted tetracyclines, as derived from Table 37.3. The compound with substitution groups H, NO2 and NO2 at the positions U, V and W is taken as the reference compound.

#    U:CH3   V:Cl   V:Br   W:NH2   W:CH3CONH   1/Z
1    0       0      0      0       0           0.60
2    0       1      0      0       0           0.21
3    0       0      1      0       0           0.15
4    0       1      0      1       0           5.25
5    0       0      1      1       0           3.20
6    0       0      0      1       0           2.75
7    1       0      0      1       0           1.60
8    1       0      0      0       1           0.15
9    1       0      1      1       0           1.40
10   1       0      1      0       1           0.75
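A minimal sketch of the Free-Wilson regression on the reduced indicator table is given below, again in Python with numpy; the tool choice and the use of base-10 logarithms of the 1/Z column are assumptions consistent with the text, not a reproduction of the original computation.

```python
import numpy as np

# Reduced indicator matrix of Table 37.4:
# columns I(CH3), I(Cl), I(Br), I(NH2), I(CH3CONH).
X_ind = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 1],
], dtype=float)
activity = np.array([0.60, 0.21, 0.15, 5.25, 3.20, 2.75, 1.60, 0.15, 1.40, 0.75])  # 1/Z

y = np.log10(activity)                       # response log 1/Z
A = np.column_stack([np.ones(len(y)), X_ind])
b, *_ = np.linalg.lstsq(A, y, rcond=None)

# b[0] is the intercept; b[1:] are the contributions of CH3, Cl, Br, NH2 and
# CH3CONH relative to the reference compound (H, NO2, NO2).  With a correct
# transcription these should be close to the values quoted in the text
# (-0.603, -0.364, 0.060, 0.026, 1.129, 0.480).
print(b)
```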
As we have indicated above, non-significant parameters can be dropped in the Hansch equation. In the Free-Wilson model, one estimates relative contributions of the various substituents with respect to an arbitrarily chosen reference compound. In our illustration we find that the contribution of NH2 at site W exceeds the contribution of CH3CONH at the same site by an amount of 1.129 - 0.480 = 0.649. We can also state that any compound with NH2 at site W is on average 10^0.649, or 4.46, times more potent than a comparable compound with CH3CONH at that site (all other substituent groups at sites U and V being the same). In order to make such interpretations, one cannot drop non-significant terms from the Free-Wilson model. In the extreme case when a particular coefficient of the regression equation is zero, this would only imply that the contribution of the corresponding substituent group is equal to the contribution of the substituent group at the same site in the reference compound [8].

The Free-Wilson analysis provides more site-specific information than a Hansch analysis. It is recommended to carry out a Free-Wilson analysis first in order to obtain an idea of the importance of the substituent groups and of the sensitivity of the substitution sites. This type of analysis can be regarded as being qualitative, as it points to the important pharmacophores in the molecule. The information thus obtained may guide the selection of the appropriate physicochemical, topological
and indicator variables that are to be included in the Hansch model. The latter approach can be considered as quantitative, as it makes use of experimentally or computationally derived parameters, such as π, σ, E_s, etc. The goodness of fit of the Free-Wilson model to the observed biological activities also indicates whether the additivity assumption is supported by the data. A comparison between the Hansch and Free-Wilson models has been made by Craig [32].
37.2 Principal components models

Attention has already been drawn to correlations that occur between the various physicochemical and conformational descriptors used in QSAR. Since Hansch analysis involves multiple linear regression, these correlations may lead to unreliable predictions of biological activity. In the design of chemical synthesis one also wishes to sample the space of the variables in a uniform way such as to produce the greatest possible variety of compounds. For these reasons, use has been made of so-called Craig plots [33] in which compounds are plotted according to two selected variables (e.g. σ and log P). Such plots have been useful for the optimization of biological activity by means of sequential Simplex optimization (Section 26.2) [34] or response surface methodology (Section 24.5). Due to the rapid proliferation of physicochemical, quantum mechanical and conformational variables, the need was soon felt to apply multivariate methods and pattern recognition techniques in QSAR. Cluster analysis was introduced by Kowalski [35] in QSAR as a tool for studying the correlation structure of the variables and for establishing similarities between the compounds. Factor analysis has been used at an early stage in QSAR by Franke [36] for analyzing the correlation structure of the variables (see also Chapter 34). This was followed by applications of principal components analysis and biplot representations of rectangular (compounds × properties) tables (see Chapters 17 and 31). Spectral map analysis (SMA) is one such technique which has been introduced by Lewi [37] for the study of drug-test specificities in pharmacological screening (Section 31.3.5). Principal components analysis, cluster analysis, multivariate regression involving multiple dependent variables and other multivariate techniques have been integrated by Mager [9] into structure-activity relationships in combination with multivariate bioassay. This approach, which is called rational empiricism, holds the middle ground between purely rational and purely empirical drug design.

In the following sections we propose typical methods of unsupervised learning and pattern recognition, the aim of which is to detect patterns in chemical, physicochemical and biological data, rather than to make predictions of biological activity. These inductive methods are useful in generating hypotheses and models which are to be verified (or falsified) by statistical inference. Cluster analysis has
initially been used in QSAR for the detection of patterns in physicochemical and biological data [38]. These clusters can often be displayed visually in a two- or three-dimensional plot of principal components.

37.2.1 Principal components analysis

A table of correlations between seven physicochemical substituent parameters for 90 chemical substituent groups has been reported by Hansch et al. [39]. The parameters include lipophilicity (log P), molar refractivity (MR), molecular weight (MW), Hammett's electronic parameters (σ_m and σ_p), and the field and resonance parameters of Swain and Lupton (F and R). Figure 37.2 displays the result of applying principal components analysis (PCA) to the correlations in Table 37.5 [40]. The PCA has been applied to the column-standardized data, as explained in Section 31.3.3. The horizontal and vertical axes of the loading plot represent the first two components, which account for 46 and 31%, respectively, of the correlations in the data. A third component, representing about 20% of the correlations, is indicated by means of variable shading of the symbols. The line segments that join the representations of the parameters have been obtained by single linkage clustering [38]. The graphical result shows a clear distinction between the four electronic parameters (σ_m, σ_p, F and R), on the one hand, and the other parameters (log P, MR and MW), on the other hand. As has been mentioned before, the two Hammett σ parameters are more closely correlated than the field and resonance parameters F and R of Swain and Lupton. As the third component is mostly determined by F and R, the actual distance between them is larger than appears from the two-component diagram of Fig. 37.2. Molar refractivity (MR) has already been characterized as being related to lipophilicity (such as log P) and steric parameters (such as MW), and this is confirmed by the principal components analysis.

The method of PCA can be used in QSAR as a preliminary step to Hansch analysis in order to determine the relevant parameters that must be entered into the equation. Principal components are by definition uncorrelated and, hence, do not pose the problem of multicollinearity. Instead of defining a Hansch model in terms of the original physicochemical parameters, it is often more appropriate to use principal components regression (PCR), which has been discussed in Section 35.6. An alternative approach is by means of partial least squares (PLS) regression, which will be more amply covered below (Section 37.4).

Principal components analysis can also be used in the case when the compounds are characterized by multiple activities instead of a single one, as required by the Hansch or Free-Wilson models. This leads to the multivariate bioassay analysis which has been developed by Mager [9]. By way of illustration we consider the physicochemical and biological data reported by Schmutz [41] on six oxazepines and six thiazepines.
TABLE 37.5
Correlations between 7 physicochemical substituent parameters obtained from 90 substituent groups [39]

         log P    σ_m      σ_p      F        R        MR       MW
log P    1.000   -0.278   -0.153   -0.323    0.049    0.442    0.364
σ_m               1.000    0.884    0.959    0.468   -0.119    0.232
σ_p                        1.000    0.716    0.827   -0.041    0.263
F                                   1.000    0.200   -0.153    0.187
R                                            1.000    0.067    0.221
MR                                                    1.000    0.829
MW                                                             1.000

Fig. 37.2. Principal components loading plot of 7 physicochemical substituent parameters, as obtained from the correlations in Table 37.5 [39,40]. The horizontal and vertical axes account for 46 and 31%, respectively, of the correlations. Most of the residual correlation is along the perpendicular to the plane of the diagram. The line segments define clusters of parameters that have been computed by means of cluster analysis.
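As a sketch of how loadings such as those in Fig. 37.2 can be computed, the fragment below performs an eigenvalue decomposition of the correlation matrix of Table 37.5 in Python with numpy. It reproduces the general procedure (PCA on standardized variables), not necessarily the exact scaling or orientation used for the published plot.

```python
import numpy as np

labels = ["log P", "sigma_m", "sigma_p", "F", "R", "MR", "MW"]
# Upper triangle of Table 37.5; the full matrix is completed by symmetry.
C = np.array([
    [1.000, -0.278, -0.153, -0.323,  0.049,  0.442,  0.364],
    [0.000,  1.000,  0.884,  0.959,  0.468, -0.119,  0.232],
    [0.000,  0.000,  1.000,  0.716,  0.827, -0.041,  0.263],
    [0.000,  0.000,  0.000,  1.000,  0.200, -0.153,  0.187],
    [0.000,  0.000,  0.000,  0.000,  1.000,  0.067,  0.221],
    [0.000,  0.000,  0.000,  0.000,  0.000,  1.000,  0.829],
    [0.000,  0.000,  0.000,  0.000,  0.000,  0.000,  1.000],
])
C = np.triu(C) + np.triu(C, 1).T          # symmetric correlation matrix

eigval, eigvec = np.linalg.eigh(C)        # eigenvalues in ascending order
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

explained = eigval / eigval.sum()         # fraction of correlation per component
loadings = eigvec * np.sqrt(eigval)       # component loadings

# If Table 37.5 is transcribed correctly, the first three components should
# account for roughly 46, 31 and 20% of the correlation, as stated in the text.
for name, l1, l2 in zip(labels, loadings[:, 0], loadings[:, 1]):
    print(f"{name:8s}  PC1 = {l1:6.3f}  PC2 = {l2:6.3f}")
print("explained:", np.round(explained[:3], 2))
```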
These compounds are classified as neuroleptics (major tranquillizers) which are designed for the treatment of psychosis. The common chemical structure is shown in the insert of Table 37.6, where the dummy element X represents either O or S. The oxazepines and thiazepines are also substituted at the position R in one of the two phenyl rings by means of 6 different substituent groups.
TABLE 37.6
Physicochemical and biological parameters obtained from 6 oxazepine (X = O) and 6 thiazepine (X = S) neuroleptics. These include Hammett's electronic parameter (σ_m), lipophilicity (log P) and its squared value, and the activities (log 1/ED50) in two pharmacological tests in rats [41].

#    X    R         σ_m      log P     (log P)²   Apomorphine   Catalepsy
1    O    CH3      -0.070     0.487    0.237      -0.117        -0.332
2    O    SCH3      0.151     0.597    0.356       0.069        -0.550
3    O    Cl        0.373     0.744    0.554       0.707        -0.033
4    O    CN        0.564    -0.322    0.104       0.680         0.428
5    O    SO2CH3    0.595    -1.279    1.636       0.680        -1.176
6    O    NO2       0.706     0.340    0.116       0.760         0.836
7    S    CH3      -0.116     0.524    0.275      -0.755        -1.176
8    S    SCH3      0.151     0.597    0.356      -0.170        -1.094
9    S    Cl        0.373     0.744    0.554       0.122        -0.305
10   S    CN        0.558    -0.285    0.081       0.707         0.184
11   S    SO2CH3    0.600    -1.242    1.543       0.707        -1.012
12   S    NO2       0.706     0.376    0.141       1.239        -0.360
The substituent parameters σ_m, log P and the square of log P are shown in Table 37.6 together with the biological activities (log 1/ED50) in pharmacological tests which measure the ability to inhibit the stereotyped effects produced by apomorphine and the ability to produce catalepsy (abnormal waxy postures) in rats. These two tests are considered to be good animal models for testing the neuroleptic effects of compounds. The biplot in Fig. 37.3 has been constructed from the factor scores of the 12 compounds and the factor loadings of the five physicochemical and biological variables [42,43]. (The biplot graphic technique is explained in Section 31.2.)
Fig. 37.3. Principal components biplot showing the positions of 6 substituted oxazepine (O) and 6 substituted thiazepine (S) neuroleptics with respect to three physicochemical parameters and two biological activities [41,43]. The data are shown in Table 37.6. The thiazepine analogs are represented by means of filled symbols. The horizontal and vertical components represent 50 and 39%, respectively, of the variance in the data.
It is the result of a principal component analysis of the column-standardized data in Table 37.6. The cosines of the angular distances between the dashed line segments represent the correlations between the corresponding parameters. The horizontal and vertical axes account for 50 and 39%, respectively, of the variance in the data. As a consequence, the parameters are displayed upon or close to the unit circle around the origin of the plot. The activities of the catalepsy and apomorphine tests are shown to be weakly correlated. These tests seem to express different types of neuroleptic activity. The electronic parameter σ_m correlates strongly with the apomorphine test and less with catalepsy. Log P has little influence on catalepsy and is negatively associated with inhibition of apomorphine. Conversely, (log P)² has little association with apomorphine activity, but correlates negatively with catalepsy. Hence, a model for cataleptic effects would include σ_m and (log P)². A model for apomorphine inhibition would have to consider σ_m and log P. All compounds appear to be arranged within a narrow band on the biplot, with the exception of those that are substituted by SO2CH3 at the R position.
Looking at the positions of the individual R-substituents on the plot we observe an ordering within a tight band by increasing electronegativity: CH3, SCH3, Cl, CN and NO2. This is the order that has been imposed by the σ electronic parameter. Overall, the distinction between oxazepines and thiazepines appears to be small. The SO2CH3 substituent forms a marked exception and is highly influential because of an extremely low log P value and a low cataleptic effect. Sometimes, however, exceptions to the general rule may be more interesting than the cases that comply with it. Neuroleptic compounds which inhibit the stereotyped effects of apomorphine without producing cataleptic effects are highly sought after. Unfortunately, the SO2CH3 analogs would be poor candidates for further testing because of their low lipophilicity (log P), which would probably lead to their rapid elimination from the body and hence to a short duration of action.

37.2.2 Spectral map analysis

For a detailed description of spectral map analysis (SMA), the reader is referred to Section 31.3.5. The method has been designed specifically for the study of drug-receptor interactions [37,44]. The interpretation of the resulting spectral map is different from that of the usual principal components biplot. The former is symmetric with respect to rows and columns, while the latter is not. In particular, the spectral map displays interactions between compounds and receptors. It shows which compounds are most specific for which receptors (or tests) and vice versa. This property will be illustrated by means of an analysis of data reporting on the binding affinities of various opioid analgesics to various opioid receptors [45,46]. In contrast with the previous approach, this application is not based on extrathermodynamic properties, but is derived entirely from biological activity spectra.

Binding studies are performed on isolated receptors that are labeled by means of a specific radioactive marker. Compounds that are tested may compete with the specific marker at the binding site of the receptor. The fraction of the marker that is displaced by the test compound is a measure of the affinity of the latter for that receptor. Usually, an active compound will show a spectrum of affinities for several related receptors. The aim of the multivariate analysis is to map the spectrum of affinities, hence the name spectral mapping. At the time when the report was published, pharmacologists had identified two opioid receptors, which were labeled μ and δ. The former binds specifically to opioid analgesics such as morphine and its synthetic derivatives, the latter binds to derivatives of the so-called endogenous opioid peptides, in particular to enkephalins and endorphins. Strong evidence for another opioid receptor, named κ, was available and it had already been identified in brain tissues. The data in Table 37.7 report on the results of 26 narcotic agonists and antagonists in four binding assays. The four radioactive markers comprised dihydromorphine (DHM, a μ-agonist), an enkephalin analog (DADLE, a δ-agonist), a ketazocine derivative (EKC, a κ-agonist) and the opioid antagonist naloxone (NAL).
TABLE 37.7
Inhibition constants (K_i) of 26 narcotic analgesic agonists and antagonists as obtained in 4 receptor binding tests (EKC, DHM, DADLE and NAL) [45].

#    Compound                    EKC       DHM    DADLE    NAL
1    Morphine                    696.0     1.3     26.1    3.5
2    Phenazocine                   9.1     0.2      2.6    0.8
3    Levorphanol                  47.0     0.2      4.5    0.7
4    Etorphine                     4.8     0.1      0.9    0.2
5    Dihydromorphine            6012.0     1.0     16.6    1.3
6    D-Ala, D-Leu-Enkephalin     113.0     8.3      1.3    5.7
7    Met-Enkephalin               66.0     3.6      2.8    4.6
8    Leu-Enkephalin              120.0     8.1      3.1   11.6
9    Leu-Enkephalin, Arg-Phe    4430.0     4.0      3.0    4.2
10   FK-33824                   6329.0     0.4      4.9    1.0
11   LY-127623                  3924.0     0.7      2.5    1.2
12   Beta-Endorphin             1835.0     0.3      0.3    0.8
13   SKF 10074                   241.0     1.4      5.0    1.1
14   Ketazocine                    7.3     7.9      6.6   12.7
15   Ethylketocyclazocine          4.7     4.1      7.6    7.5
16   MR 2034                       5.7     0.3      0.5    0.2
17   Buprenorphine                 2.0     3.1      3.6    1.6
18   Lofentanil                    2.7     0.7      1.2    0.4
19   Butorphanol                   7.7     0.3      1.6    0.3
20   Cyclazocine                   5.0     0.2      1.1    0.2
21   Levallorphan                  4.0     0.2      3.3    0.7
22   Nalbuphine                   38.0     2.4     21.2    2.4
23   Pentazocine                 116.0     3.7     19.8   13.5
24   Nalorphine                   55.0     0.7     13.9    3.7
25   Naloxone                     11.1     1.4     16.5    1.0
26   Naltrexone                    9.8     0.1      5.0    0.5
The data are expressed as inhibition constants of the drug-receptor interaction, K_i, which are proportional to the corresponding 50% inhibitory concentrations, IC50. The four receptors are identified by squares on the spectral map of Fig. 37.4. By convention, the areas of the squares are made proportional to the average affinities of the 26 compounds for the corresponding receptors.
Fig. 37.4. Spectral map of the 26 opioid agonists and antagonists in 4 receptor binding tests, as described by Table 37.7 [45,46]. Circles refer to the compounds. Squares represent the binding tests. Areas of circles and squares are proportional to the marginal mean affinities in the table. The lines that join the three poles (DHM, DADLE and EKC) of the map represent axes of contrast between the μ-, δ- and κ-opioid receptors. The horizontal and vertical components represent 18 and 79%, respectively, of the interaction in the data.
proportional to the average affinities of the 26 compounds for the corresponding receptors. The horizontal and vertical components represent 18 and 79%, respectively, of the interaction in the data. The term interaction is defined here as the variance that remains after double-centering of the data (Section 31.3.5).

The spectral map shows three distinct poles of specificity. These are respectively the μ-receptor (DHM), the δ-receptor (DADLE) and the κ-receptor (EKC). The naloxone receptor (NAL) appears to be strongly correlated with the μ-receptor (DHM) and, hence, provides little additional information. In spectral map analysis, correlation between variables, as well as similarity between compounds, is evidenced by the proximity of their corresponding symbols. The lines drawn through the three poles of the map represent bipolar axes of contrast. A contrast is defined in this context as a log ratio or, equivalently, as a difference of logarithms. For example, the horizontal axis through the μ- and δ-receptors defines the μ/δ contrast. Compounds that project on the right side of this axis bind more specifically to the μ-receptor, while those that project on the left side possess more
specificity for the δ-receptor. The larger the distance between two poles, the greater is the contrast between the corresponding receptors. The areas of the circles are proportional to the average affinity of the corresponding compounds for the four labeled receptors. It appears from the spectral map that the κ-receptor is a highly specific receptor which produces strong contrasts in binding affinities of opioid analgesics. The contrast is most evident in ketazocine, ethylketocyclazocine and buprenorphine, which possess much more affinity for the κ-receptor than for the two others. The contrast is also strong with dihydromorphine, beta-endorphin, an enkephalin analog and two experimental compounds (LY and FK), which have little or no affinity for the κ-receptor.
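The arithmetic behind such a spectral map can be sketched in a few lines of Python. The fragment below is only an illustration, not the implementation used by the authors: it applies the log double-centering described above to a small excerpt of the Ki values of Table 37.7 and factorizes the resulting interaction matrix by a singular value decomposition; any weighting by marginal means used in full spectral map analysis is omitted here.

    import numpy as np

    # A few rows of Ki values (nM) in the column order EKC, DHM, DADLE, NAL
    # (excerpt of Table 37.7, used here only to illustrate the computation).
    K = np.array([
        [696.0, 1.3, 26.1, 3.5],    # morphine
        [113.0, 8.3, 1.3, 5.7],     # D-Ala,D-Leu-enkephalin
        [4.7, 4.1, 7.6, 7.5],       # ethylketocyclazocine
        [11.1, 1.4, 16.5, 1.0],     # naloxone
    ])

    # Log affinities (log of the reciprocal Ki), then double-centering:
    # subtract row and column means and add back the grand mean.
    Y = -np.log10(K)
    Y = Y - Y.mean(axis=1, keepdims=True) - Y.mean(axis=0, keepdims=True) + Y.mean()

    # SVD of the interaction matrix; the first two components span the spectral map.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    compounds = U[:, :2] * s[:2]    # circles on the biplot
    receptors = Vt[:2, :].T         # squares on the biplot
    print("percent interaction per component:",
          np.round(100 * s**2 / (s**2).sum(), 1))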
37.2.3 Correspondence factor analysis

Correspondence factor analysis (CFA) is most appropriate when the data represent counts of contingencies, or when there are numerous true zeroes in the table (i.e. when zero means complete absence of a contingency, rather than a small quantity which has been rounded to zero [47]). A detailed description of the method is found in Section 32.3.6.

Correspondence factor analysis has been applied to activity spectra obtained from a screening test on rats (Table 37.8). One of the objectives of the test was to differentiate between morphinomimetics (opioid agonists) and neuroleptics [48]. The intensities of six observations were scored on a scale from 0 (absence) to 6 (very pronounced). These include prostration, temperature increase and decrease, pupil diameter increase and decrease, and palpebral (eye) opening decrease. The objective of the analysis was to find criteria by which a differential classification of screening compounds could be made on the basis of these six observations. Since the data consist of scores and since numerous true zeroes appear in the table, CFA is the method of choice [49].

The resulting biplot is presented in Fig. 37.5, the horizontal and vertical axes of which account for 40 and 31%, respectively, of the interaction in the data. Squares represent the observations, circles refer to compounds. Areas of circles and squares are proportional to the marginal sums of rows and columns in the table. The interpretation of this biplot is analogous to that of the spectral map. The distance between two circles is related to a contrast between the corresponding compounds (Section 32.5.3). On the biplot, we have marked neuroleptics by filled symbols and morphinomimetics by open symbols. It appears that neuroleptics and morphinomimetics can be differentiated on the basis of these six observations. Only spiperone forms an exception, as this specific neuroleptic is wrongly classified with the morphinomimetics. The distance between two squares is indicative of high contrasting power between the corresponding observations.
TABLE 37.8
Data obtained from a screening test on rats, differentiating between morphinomimetic (opioid analgesic) and neuroleptic compounds [48]. The six observations are scored on a 6-point scale ranging from absent (0) to highly pronounced (6). Compounds 1 to 13 are morphinomimetics; compounds 14 to 26 are neuroleptics.

#    Compound          Prostration  Temperat. increase  Temperat. decrease  Pupil diam. increase  Pupil diam. decrease  Palpebral decrease  Total
1    Propoxyphene      0            3                   1                   3                     0                     0                   7
2    Morphine          0            0                   4                   2                     0                     2                   8
3    Methadone         0            0                   6                   6                     0                     3                   15
4    Pethidine         0            1                   4                   6                     0                     2                   13
5    Codeine           0            4                   0                   3                     0                     1                   8
6    Dextromoramide    0            0                   6                   4                     0                     3                   13
7    Anileridine       0            2                   3                   4                     0                     2                   11
8    Phenoperidine     0            2                   5                   2                     0                     3                   12
9    Etonitazine       0            0                   6                   6                     0                     0                   12
10   Phenazocine       0            2                   4                   4                     0                     2                   12
11   Piritramide       0            2                   4                   4                     0                     3                   13
12   Fentanyl          0            1                   6                   6                     0                     3                   16
13   Bezitramide       0            0                   2                   2                     0                     1                   5
14   Chlorpromazine    6            0                   6                   5                     0                     6                   23
15   Perphenazine      6            0                   6                   0                     0                     6                   18
16   Haloperidol       0            0                   6                   0                     3                     6                   15
17   Trifluoperazine   6            0                   3                   0                     0                     6                   15
18   Chlorprothixene   5            0                   6                   6                     0                     6                   23
19   Spirilene         0            0                   3                   0                     0                     5                   8
20   Pimozide          0            0                   5                   0                     1                     5                   11
21   Pipamperone       2            0                   6                   0                     1                     6                   15
22   Sulpiride         0            0                   0                   0                     1                     2                   3
23   Benperidol        4            0                   6                   6                     0                     6                   22
24   Spiperone         0            3                   0                   3                     0                     5                   11
25   Oxypertine        6            0                   6                   0                     1                     6                   19
26   Thioridazine      1            0                   6                   6                     0                     6                   19
     Total             36           20                  110                 78                    7                     96                  347
Fig. 37.5. Biplot obtained from correspondence factor analysis of the data in Table 37.8 [43]. Circles refer to compounds. Squares relate to observations. Areas of circles and squares are proportional to the marginal sums of the rows and columns in the table. The horizontal and vertical components represent 40 and 31%, respectively, of the interaction in the data.
Temperature decrease and increase of pupil diameter point toward morphinomimetic activity. Prostration, pupil diameter decrease and palpebral opening decrease are predictive of neuroleptic activity. Temperature decrease is the least contrasting observation, as it appears to be close to the origin of the biplot, which is marked by a small cross. The origin is the neutral or average point and is devoid of any contrast. As can be seen from the table, most compounds score highly for temperature decrease and, hence, its differentiating ability is low. Three observations stand out as prominent poles of the plot, namely temperature increase, prostration and pupil diameter increase. This might suggest that the screening test could be simplified by including only these three highly specific observations. Unfortunately, these observations are also highly insensitive, as the average scores of the 26 compounds on them are very low.
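The computation behind such a correspondence factor analysis can be sketched as follows. This is only a sketch of a standard correspondence-analysis calculation applied to a small excerpt of Table 37.8; it is not the authors' implementation.

    import numpy as np

    # Excerpt of the score table (rows = compounds, columns = the six observations).
    N = np.array([
        [0, 3, 1, 3, 0, 0],   # propoxyphene (morphinomimetic)
        [0, 0, 4, 2, 0, 2],   # morphine
        [6, 0, 6, 5, 0, 6],   # chlorpromazine (neuroleptic)
        [0, 0, 0, 0, 1, 2],   # sulpiride
    ], dtype=float)

    P = N / N.sum()            # correspondence matrix
    r = P.sum(axis=1)          # row masses
    c = P.sum(axis=0)          # column masses

    # Standardized residuals; their SVD gives the CFA axes.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)

    # Principal coordinates of rows (compounds) and columns (observations).
    F = (U * s) / np.sqrt(r)[:, None]
    G = (Vt.T * s) / np.sqrt(c)[:, None]
    print("inertia per axis (%):", np.round(100 * s**2 / (s**2).sum(), 1))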
37.3 Canonical variate models

While principal components models are used mostly in an unsupervised or exploratory mode, models based on canonical variates are often applied in a supervised way for the prediction of biological activities from chemical, physicochemical or other biological parameters. In this section we discuss briefly the methods of linear discriminant analysis (LDA) and canonical correlation analysis (CCA). Although there has been an early awareness of these methods in QSAR [7,50], they have not been widely accepted. More recently they have been superseded by the successful introduction of partial least squares analysis (PLS) in QSAR. Nevertheless, these early pattern recognition techniques paved the way for the introduction of modern chemometric approaches.

37.3.1 Linear discriminant analysis

Linear regression, such as used in the Hansch and Free-Wilson methods, seeks to predict the actual value of biological activities from chemical, physicochemical, quantum mechanical and biological parameters of a set of compounds. In linear discriminant analysis (LDA) one only attempts to predict the class membership of the compounds from the above mentioned parameters (Section 17.9). The discriminant equation is obtained from a learning or training set of compounds with known class membership and applied to newly synthesized compounds for the prediction of their classification. In the case of two classes (e.g. morphinomimetics and neuroleptics) the discriminant equation can be obtained directly by means of linear regression (Chapter 10) or principal components regression (Section 35.6). In the general case of LDA with more than two classes one makes use of canonical variate analysis (Section 33.2.2). The result can be displayed graphically in the form of a biplot, which allows one to determine visually which parameters discriminate between which classes of compounds.

Linear discriminant analyses have been carried out using the extrathermodynamic parameters that normally appear in Hansch equations (log P and its squared value, σ, Es, etc.). The method can also be applied to biological spectra in order to discriminate between pharmacological classes of compounds that are submitted to screening tests. One of the assumptions of LDA is that the classes of compounds are normally distributed in the multivariate predictor space and that their multivariate equidensity (hyper)ellipsoids are similarly oriented and shaped. This is a stringent assumption which, in the strictest form, is seldom satisfied in practice. Another problem of LDA occurs when a highly discriminating parameter contributes little to the total variance in the data. These drawbacks can be alleviated by applying partial least squares (PLS) regression, which is the subject of Section 35.7.
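As a minimal sketch of such a classification, the fragment below trains a linear discriminant on a handful of compounds and predicts the class of a new one. The descriptor values and class labels are invented for the purpose of illustration and are not data from the text.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # Hypothetical training set: columns could be log P, (log P)^2 and sigma.
    X = np.array([
        [1.2, 1.44,  0.10],
        [2.3, 5.29,  0.23],
        [0.8, 0.64, -0.15],
        [3.1, 9.61,  0.52],
        [2.8, 7.84,  0.45],
        [1.0, 1.00, -0.05],
    ])
    y = np.array(["morphinomimetic", "neuroleptic", "morphinomimetic",
                  "neuroleptic", "neuroleptic", "morphinomimetic"])

    lda = LinearDiscriminantAnalysis()
    lda.fit(X, y)

    # Predicted class membership for a newly synthesized compound.
    print(lda.predict([[2.0, 4.00, 0.30]]))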
37.3.2 Canonical correlation analysis

Canonical correlation analysis (CCA) is applicable in QSAR when a set of compounds is described, on the one hand, by a table of chemical and physicochemical parameters (X), such as occurs in Hansch or Free-Wilson models, and, on the other hand, by a table of multiple biological observations (Y). The method produces several pairs of canonical axes (see Section 35.3 for a more thorough discussion of CCA). Each pair of axes can be interpreted as a direction in the space of the chemical and physicochemical parameters which correlates maximally with a conjugated direction in the space of the biological variables. A pair of canonical axes is characterized by its canonical correlation coefficient, which is equal to the product-moment correlation between the projections of the compounds on the corresponding canonical axes [51]. The result of CCA can be displayed graphically in the form of two conjugated biplots. Each biplot is spanned by the dominant canonical axes, such that one can visually observe which chemical or physicochemical parameters correlate best with which biological variables [40].

Canonical correlation analysis has also been used for correlating different experimental settings for the same set of compounds, such as obtained in animal pharmacology, in biochemical binding studies and in clinical observations on humans. A drawback of the method is that highly correlating canonical variables may contribute little to the variance in the data. A similar remark has been made with respect to linear discriminant analysis. Furthermore, CCA does not possess a direction of prediction, as it is symmetrical with respect to X and Y. For these reasons it is now replaced by two-block or multi-block partial least squares analysis (PLS), which bears some similarity with CCA without having its shortcomings.

37.4 Partial least squares models

Partial least squares analysis (PLS) has rapidly found applications in QSAR since its introduction by Wold et al. [52]. The reader is referred to Sections 35.7 and 36.2.4 for a thorough discussion of PLS.

37.4.1 PLS regression and CoMFA

The PLS regression model can be expressed in the form:

y = Xb + e   (37.15)
where X represents the column-centered matrix of independent physicochemical variables, y is the centered vector of dependent biological activities, b contains the coefficients of regression and e is the vector of residuals.
As an illustration of PLS regression (PLS1) we reconsider the inhibitory potencies of oxidative phosphorylation of 11 doubly substituted salicylanilides [17] in Table 37.1. An extended Hansch model is defined by the linear free energy relation:

\log(1/IC_{50}) = b_0 + b_1\,\sigma + b_2\,\log P + b_3\,(\log P)^2 \qquad (37.16)
The most predictive PLS regression model for these data makes use of two PLS-components:

\log(1/IC_{50}) = -1.772 + 0.678\,\sigma + 0.168\,\log P + 0.020\,(\log P)^2, \qquad s = 0.313,\; R^2 = 0.537
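In code, a two-component PLS1 fit of this kind can be sketched as follows. The fragment is only an illustration: the actual salicylanilide data of Table 37.1 are not reproduced here, so a small synthetic data set with columns σ, log P and (log P)² and a noise-free response is used instead, and scikit-learn's PLSRegression stands in for the PLS1 algorithm.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    # Synthetic predictors in the spirit of eq. (37.16); not the data of Table 37.1.
    sigma = np.array([0.23, 0.37, 0.52, 0.66, 0.78, 0.91])
    logP  = np.array([3.1, 3.6, 4.0, 4.4, 4.9, 5.3])
    X = np.column_stack([sigma, logP, logP**2])
    y = -1.772 + 0.678 * sigma + 0.168 * logP + 0.020 * logP**2  # synthetic log 1/IC50

    # Two PLS components, the number selected by the PRESS criterion in the text.
    pls = PLSRegression(n_components=2)
    pls.fit(X, y)
    print(np.round(pls.coef_.ravel(), 3))  # regression coefficients on the original scale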
The selection of the number of PLS-components to be included in the model was done according to the PRESS criterion (Section 36.3). Note that the result is comparable to the one which we obtained earlier by means of the simple Hansch analysis (Section 37.1.1). Hence, in this case, there is no obvious benefit to include a quadratic term of log P in the model. It has been shown that PLS regression fits better to the observed activities than principal components regression [53]. The method is non-iterative and, hence, is relatively fast, even in the case of very large matrices. The most prominent application of PLS regression is in comparative molecular field analysis (CoMFA) developed by Cramer et al. [54]. While the previous approaches deal with global properties of the whole compound or at best with those of substituent groups, CoMFA is based on electrostatic, hydrophobic and steric properties (called fields) at individual points of the molecules. One must, therefore, imagine that a fine three-dimensional grid is laid over the individual molecules and that the properties are computed at each point of the grid. This produces for each molecule a large vector of independent parameters, the dimension of which is theoretically equal to the number of properties times the number of grid points. The latter amounts usually to several thousands of points, depending on the fineness of the grid. CoMFA makes use of PLS regression to compute a relation between the molecular fields of the compounds X and their biological activities y. It can be understood that the independent matrix X is extremely wide, as the number of variables exceeds the number of compounds (often by a factor as large as 100). This causes the matrix X to be highly multicollinear. For these reasons, PLS regression appears to be the method of choice for this application. We briefly outline the procedure used in CoMFA below. A group of compounds is selected which possess a common three-dimensional pharmacophore, i.e. a configuration of chemical groups that presumably interact with the target receptor or enzyme. The compounds are then aligned in space by translation and rotation, for example on the basis of their similarity with a known
active compound which serves as a common template. A similar alignment procedure is applied in Procrustes analysis, in which several data tables are related to one another (Section 35.2). After alignment, the 3-D structures of the molecules conform maximally with one another, such that they can be overlaid. In the next step, a three-dimensional grid is spanned over the set of aligned and overlaid molecules. A typical grid of 5×2×2 nm³ will theoretically generate 2 500 000 lattice points at a grid spacing of 0.02 nm. In the following step, one computes the electrostatic, hydrophobic and steric fields at each of the lattice points. It has been mentioned already that the electrostatic and steric fields are of an enthalpic nature, while the hydrophobic field is entropic. Hence, these three properties give a comprehensive description of the local extrathermodynamic properties of the molecule. It is assumed that some of these local fields of the compounds interact with specific (unknown) chemical groups of the target receptor or enzyme. PLS regression is then applied in order to determine which fields at which lattice points are responsible for the observed biological activities of the molecules. Once this relationship is obtained, predictions of biological activity can be obtained for newly synthesized compounds.

The same assumptions apply to CoMFA as to ordinary Hansch analysis. These are additivity of effects and the availability of structurally similar (congeneric) molecules. The method does not account for pharmacokinetic effects, such as distribution, elimination, transport and metabolization. A prospective drug may appear to bind well to the receptor or enzyme, but may not reach the target site due to undesirable pharmacokinetic properties [8].

37.4.2 Two-block PLS and indirect QSAR

The methods based on relationships between chemical and physicochemical descriptors, on the one hand, and biological activities, on the other hand, usually involve a series of structurally related compounds. A more general approach makes use of biological descriptors which can be applied to a wider variety of compounds. The underlying idea is that the binding activities of compounds to proteins (such as enzymes, antibodies and receptors) are indirectly related to their electronic, hydrophobic and steric interactions. The proteins 'sniff out' the molecular fields of the compounds, so to speak, and the binding that results depends on the binding energies that are released. If a compound is screened in a battery of closely related proteins, then the spectrum of binding activities provides an indirect chemical and physicochemical description of the molecular structure. Often, it is more economical to determine binding spectra in vitro from biochemical assays than to perform experiments in vivo on animals. It is feasible to screen large libraries of compounds in robotized batteries of binding assays, using standard panels of related proteins [55]. Given a test set of compounds, it is
possible to derive a relationship between their already established standard binding spectra and a particular pharmacological activity which is the object of investigation, such as substance P-like activity in rats. The relationship can then be used to predict the pharmacological activity of newly synthesized compounds from their standard binding spectra. The fundamental assumption of this indirect QSAR approach is that the panel of proteins is capable of recognizing the pharmacophore which is responsible for the particular pharmacological effect. The panel of proteins must possess affinity for the molecules, and yet must have enough specificity in order to detect differences among them. If the assumption is met, then the binding spectra can be very useful, as they reflect only those molecular fields that are relevant to the interactions of proteins with the pharmacophore.

PLS regression appears as the method of choice in the case when only a single biological activity is to be predicted (Section 35.7). In the case of multiple activities it is possible to predict the pharmacological spectrum from the biochemical binding spectrum by means of two-block PLS (or PLS2). The latter can be regarded as a generalization of PLS regression (or PLS1) in which the dependent vector y is replaced by the matrix of dependent activities Y. The two-block PLS model can be expressed by analogy with the PLS regression model (Section 37.4.1):

Y = XB + E   (37.17)
where B is now the matrix of regression coefficients and E represents the matrix of residuals.

An illustration of indirect QSAR is the prediction of the in-vivo pharmacological spectra of neuroleptics from their spectra of in-vitro binding to radioactively labeled receptors [56]. The pharmacological spectra are defined by the ED50 (mg/kg) values obtained in five observations of the ATN test in rats [48]. The combined ATN test includes inhibition of agitation and stereotypy induced by apomorphine (A), inhibition of seizures and tremors provoked by tryptamine (T) and prevention of mortality resulting from a high dose of norepinephrine (N). The results obtained from 17 reference neuroleptics represent the dependent Y block (Table 37.9).

The biochemical binding spectra of the same 17 neuroleptics are defined by their inhibition constants Ki (nM) of receptors in the brain that have been radioactively labeled by four specific ligands [57]. The latter comprise haloperidol and apomorphine, which are specific for dopamine receptors, spiperone, which is specific for serotonin receptors, and the compound WB4101, which is specific for norepinephrine (alpha-adrenergic) receptors. (Remember that Ki values are proportional to IC50 values, the proportionality constant being the same for each receptor.) The data constitute the independent X block (Table 37.10).

The preprocessing steps that have been applied to the X and Y blocks are the same as those of spectral map analysis. First, logarithms have been taken of the reciprocals
TABLE 37.9
Pharmacological results (ED50 in mg/kg) obtained in rats from 17 reference neuroleptic compounds in the combined ATN test [48].

Compound          Agitation  Stereotypy  Seizures  Tremors  Mortality
Benperidol        0.01       0.01        0.33      0.33     1.34
Chlorpromazine    0.25       0.29        0.88      0.58     1.54
Chlorprothixene   0.17       0.38        0.51      0.77     0.51
Clothiapine       0.10       0.13        0.38      1.54     3.10
Clozapine         6.15       10.70       3.07      4.66     14.10
Droperidol        0.01       0.01        0.51      0.67     0.77
Fluphenazine      0.06       0.06        0.58      1.77     2.33
Haloperidol       0.02       0.02        0.77      5.00     5.00
Penfluridol       0.22       0.33        160.00    160.00   160.00
Perphenazine      0.04       0.04        0.51      1.34     3.08
Pimozide          0.05       0.05        40.00     40.00    40.00
Pipamperone       3.07       5.35        0.58      2.03     6.15
Promazine         3.07       4.66        9.33      18.70    4.06
Sulpiride         21.40      21.40       320.00    320.00   1280.00
Thioridazine      4.06       5.35        10.70     28.30    1.16
Thiotixene        0.22       0.22        21.40     320.00   4.06
Trifluoperazine   0.04       0.06        1.77      4.67     40.00
of the ED50 and Ki values, and the resulting activities were double-centered. The biplots of Figs. 37.6 and 37.7 are constructed from the first two PLS components, and display the pharmacological and biochemical spectra, respectively. The reading rules of these biplots are the same as those of the spectral map which have been explained above (Section 37.2.2). Briefly, circles represent neuroleptic compounds and squares refer to the tests. Areas of circles are proportional to the potencies of the compounds, while areas of squares are proportional to the sensitivities of the tests. Potencies and sensitivities have been defined from the marginal totals of the data tables. Compounds and tests that possess mutual specificity are found at a distance and in the same direction with respect to the center of the plot.

The biplots show which binding tests are predictive for which pharmacological tests. Binding to the haloperidol and apomorphine labeled receptors corresponds with inhibition of agitation and stereotypy in rats, binding to the spiperone receptor
TABLE 37.10
Biochemical results (Ki in nM) obtained in binding assays of radioactively labeled brain receptors [57] from the same 17 neuroleptics shown in Table 37.9.

Compound          Haloperidol  Apomorphine  Spiperone  WB4101
Benperidol        0.4          0.2          6.6        2.3
Chlorpromazine    49.6         4.5          20.2       1.7
Chlorprothixene   11.0         5.6          3.3        1.0
Clothiapine       15.7         5.6          6.0        14.4
Clozapine         156.0        56.0         15.7       7.3
Droperidol        0.8          0.4          4.1        0.8
Fluphenazine      6.2          2.2          32.8       8.9
Haloperidol       1.2          1.3          48.0       3.1
Penfluridol       9.9          8.9          232.0      363.0
Perphenazine      3.9          3.6          33.0       91.3
Pimozide          1.2          1.4          32.8       41.0
Pipamperone       124.0        16.0         5.0        46.0
Promazine         99.0         38.0         328.0      2.5
Sulpiride         31.3         44.7         26043.0    1000.0
Thioridazine      15.7         8.9          36.0       3.2
Thiotixene        2.5          2.2          9.2        12.9
Trifluoperazine   3.9          2.2          41.2       20.4
corresponds with inhibition of seizures and tremors, and, finally, binding to the alpha-adrenergic receptor (labeled by WB4101) matches with prevention of mortality. These triangular patterns indicate a high specificity of the in-vivo tests for three receptors (dopamine, serotonin and norepinephrine) that are thought to be involved in psychotic manifestations such as hallucinations, delusions, mania, extreme agitation, anxiety, etc. These three receptors form distinct poles on the pharmacological and biochemical biplots of Figs. 37.6 and 37.7. There also is agreement between the positions of the compounds on the two biplots. Haloperidol and pimozide appear as specific inhibitors of the dopamine receptor, promazine and thioridazine are shown to be specific blockers of the norepinephrine receptor, and, finally, pipamperone is the only compound that exhibits high specificity for the serotonin receptor. A comparison of the two biplots (Figs. 37.6 and 37.7) reveals which binding assays are predictive for which pharmacological effects. It also shows that activity spectra in animals can be reliably predicted from binding profiles. The biplots can also be analyzed from a chemical point of view by
Fig. 37.6. PLS biplot obtained from the pharmacological data in Table 37.9, after log double-centering and analysis by two-block PLS [56]. Circles represent 17 reference neuroleptic compounds, squares denote tests. Areas of circles and squares are proportional to the potencies of the compounds and the sensitivities of the tests, respectively. The three poles of the map are labeled Dopamine, Serotonin and Norepinephrine. Reproduced with permission of E.J. Karjalainen.
Fig. 37.7. PLS biplot derived from the biochemical binding data in Table 37.10, after log double-centering and analysis by two-block PLS [56]. The reading rules are the same as those indicated with Fig. 37.6. Reproduced with permission of E.J. Karjalainen.
correlating structural properties with positions on the maps. It appears that compounds attracted by the dopamine pole are chiefly of the butyrophenone and diphenylbutyl type, while those attracted by the adrenergic pole belong to the phenothiazine class. In a similar way as with PLS regression, one may determine the optimal number of components that must be included for obtaining the most reliable predictions, using cross-validation and minimization of PRESS (Section 36.3). Two-block PLS can be extended to multi-block PLS for the prediction of drug activities from several independent predictor blocks [58]. For example, one may attempt to predict clinical effects (the ultimate goal of drug design) from all available information (pharmacological, biochemical, pharmacokinetic, toxicological, etc.). Multi-block PLS has been used to predict clinical scores for various antischizophrenic effects [59] from in-vivo and in-vitro data [56]. The analysis showed that only a single component of clinical effects could be reliably predicted from animal and receptor studies, while all higher order clinical and biological components showed only weak correlations.
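A two-block PLS calculation of the kind described above can be sketched as follows. The fragment is only an illustration: random stand-ins with the same shapes as Tables 37.9 and 37.10 replace the actual data, scikit-learn's PLSRegression (which accepts a multivariate Y block and thus performs PLS2) stands in for the algorithm, and the log double-centering mirrors the preprocessing described in the text.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    # Stand-ins for the X block (Ki values) and Y block (ED50 values); eq. (37.17): Y = XB + E.
    rng = np.random.default_rng(0)
    X_raw = rng.uniform(0.1, 100.0, size=(17, 4))    # 17 compounds x 4 binding assays
    Y_raw = rng.uniform(0.01, 100.0, size=(17, 5))   # 17 compounds x 5 in-vivo tests

    def log_double_center(A):
        """Log of the reciprocals followed by double-centering, as in spectral map analysis."""
        L = -np.log10(A)
        return L - L.mean(axis=1, keepdims=True) - L.mean(axis=0, keepdims=True) + L.mean()

    X = log_double_center(X_raw)
    Y = log_double_center(Y_raw)

    pls2 = PLSRegression(n_components=2)
    pls2.fit(X, Y)
    Y_hat = pls2.predict(X)           # fitted pharmacological spectra
    scores = pls2.transform(X)        # compound coordinates for a biplot
    print(np.round(scores[:3], 2))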
37.5 Other approaches

In this chapter we have only addressed a selected number of topics and for lack of space we have left out many others. Cluster analysis has played a larger role in QSAR than appears from our overview. This technique is an established QSAR tool in the recognition or classification of known patterns [38,60] as well as in the cognition or detection of novel patterns [61]. Neural networks have been introduced in QSAR for non-linear Hansch analyses. The Perceptron, which is generally considered a forerunner of neural networks, was developed by the Russian school of Rastrigin and coworkers [62] within the context of QSAR. The learning machine is another prototype of neural network, which was introduced in QSAR by Jurs et al. [63] for the discrimination between different types of compounds on the basis of their properties. Decision trees for optimal drug design strategies have been proposed by Topliss [64] and by Purcell et al. [65].

The determination of the three-dimensional conformation of molecules is an important aspect of QSAR; it can be obtained from X-ray crystallography [66], NMR spectroscopy or, in the case of small molecular fragments, from quantum-mechanical calculations [67,68]. A currently active topic is the study of receptor proteins and enzymes, for which databases with crystallographic information are now becoming available. Computer modelling of the active sites of receptors and enzymes is an important tool in rational drug design. Principal components and cluster analysis can be applied to the primary
sequences of these proteins in order to derive classifications and phylogenetic relationships [69].

Of great importance is the searching of very large data bases (with several thousands of compounds) for molecular structures that bear similarity to a known lead compound. The standard approach is to start from connectivity tables which describe the two-dimensional (graph) structure of the molecules in an unambiguous way. In the case of a molecule with n atoms, the n×n connectivity table defines the type of each atom (on the main diagonal) and the type of bond between each pair of atoms (on the off-diagonal positions). Similarity indices between compounds can be computed by various techniques, ranging from global connectivity indices [70] to atom-to-atom mapping [11]. Alternatively, molecular data bases may be searched for compounds that are as widely dissimilar as possible, in order to form standard panels of screening compounds. Principal components analysis has been used to reduce a large number of structural indices to a small set of independent descriptors that can be more easily submitted to pattern recognition techniques [71].

As we have stated in the introduction to this chapter and as appears from this overview, a wide variety of chemometric methods converges in QSAR, which plays a key role in the design of novel and improved drugs.

References

1. A. Burger (Ed.), Medicinal Chemistry. Wiley, New York, 1970.
2. E.J. Ariens (Ed.), Drug Design, Vols I-X. Academic Press, New York, 1971-1980.
3. H. Meyer, Zur Theorie der Alkoholnarkose I. Welche Eigenschaft der Anesthetica bedingt ihre narkotische Wirkung? Arch. Exp. Pathol. Physiol., 42 (1899) 109-118.
4. E. Overton, Osmotic properties of cells in the bearing on toxicology and pharmacy. Zeitschr. Physik. Chem., 22 (1897) 189-209.
5. E. Fischer, Einfluss der Configuration auf die Wirkung der Enzyme. Ber. Dtsch. Chem. Ges., 27 (1894) 2985-2993.
6. Dorland's Illustrated Medical Dictionary, 26th edn. Saunders, Philadelphia, PA, 1981.
7. Y.C. Martin, Quantitative Drug Design. A Critical Introduction. Marcel Dekker, New York, 1978.
8. H. Kubinyi, QSAR: Hansch Analysis and Related Approaches. VCH, Weinheim, 1993.
9. P.P. Mager, The Masca model of pharmacochemistry: II. Rational empiricisms in the multivariate analysis of opioids. In: Drug Design (E.J. Ariens, Ed.), Vol. X. Academic Press, New York, 1980, pp. 343-401.
10. J.W. McFarland, Comparative Molecular Field Analysis (CoMFA) of anticoccidial triazines. J. Med. Chem., 35 (1992) 2543-2550.
11. C. Pepperrell, Three-Dimensional Chemical Similarity Searching. Research Studies Press (J. Wiley), Taunton, UK, 1994.
12. A. Miklavc, D. Kocjan, J. Mavri, J. Koller and D. Hadzi, On the fundamental difference in thermodynamics of agonist and antagonist interactions with β-adrenergic receptors and the mechanism of entropy-driven binding. Biochem. Pharmacol., 40 (1990) 663-669.
13. G.T. Fechner, Elemente der Psychophysik, 1907. Reprinted in: Elements of Psychophysik (D.H. Davis, Ed.). Holt, Rinehart and Winston, New York, 1966.
14. L.P. Hammett, Physical Organic Chemistry. McGraw-Hill, New York, 1940.
15. C. Hansch and T. Fujita, ρ-σ-π Analysis. A method for the correlation of biological activity and chemical structure. J. Am. Chem. Soc., 86 (1964) 1616-1626.
16. C. Hansch and W.J. Dunn III, Linear relationships between lipophilic character and biological activity of drugs. J. Pharmaceut. Sci., 61 (1972) 1-19.
17. R.L. Williamson and R.L. Metcalf, Salicylanilides: a new group of active uncouplers of oxidative phosphorylation. Science, 158 (1967) 1694-1695.
18. C. Hansch and J.M. Clayton, Lipophilic character and biological activity of drugs II: The parabolic case. J. Pharmaceut. Sci., 62 (1973) 1-21.
19. J.W. McFarland, On the parabolic relationship between drug potency and hydrophobicity. J. Med. Chem., 13 (1970) 1192-1196.
20. E. Klarmann, V.A. Shtemov and L.W. Gates, The alkyl derivatives of halogen phenols and their bactericidal action. I. Chlorophenols. J. Am. Chem. Soc., 55 (1933) 2576-2589.
21. L. Buydens, D.L. Massart and P. Geerlings, Gas chromatographic behaviour and pharmacological activity of neuroleptica. Anal. Chim. Acta, 174 (1985) 237-244.
22. R.F. Rekker and R. Mannhold, Calculation of Lipophilicity. The Hydrophobic Fragmental Constant Approach. VCH, Weinheim, Germany, 1992.
23. C.G. Swain and E.C. Lupton, Field and resonance components of substituent effects. J. Am. Chem. Soc., 90 (1968) 4328-4337.
24. R.W. Taft, Separation of polar, steric and resonance effects in reactivity. In: Steric Effects in Organic Chemistry (M.S. Newman, Ed.). Wiley, New York, 1956, pp. 556-575.
25. A. Verloop, The STERIMOL Approach to Drug Design. Marcel Dekker, New York, 1987.
26. L.B. Kier and L.H. Hall, Molecular Connectivity in Structure-Activity Analysis. Wiley, New York, 1986.
27. C. Hansch and A. Leo, Substituent Constants for Correlation Analysis in Chemistry and Biology. Wiley, New York, 1979.
28. G. Grassy, R. Lahana and Oxford Molecular Ltd., TSAR, Tools for Structure-Activity Relationships, User's Guide, Issue 3. Proprietary software developed by Oxford Molecular Ltd., Oxford, UK, 1993.
29. S.M. Free and J.W. Wilson, A mathematical contribution to structure-activity relationships. J. Med. Chem., 7 (1964) 395-399.
30. T. Fujita and T. Ban, Structure-activity study of phenethylamines as substrates of biosynthetic enzymes of sympathetic transmitters. J. Med. Chem., 14 (1971) 148-152.
31. J.L. Spencer, J.J. Hlavka, J. Petisi, H.M. Krazinski and J.H. Boothe, 6-Deoxytetracyclines, V. 7,9-Disubstituted Products. J. Med. Chem., 6 (1963) 405-407.
32. P.N. Craig, Comparison of the Hansch and Free-Wilson approaches to structure-activity correlation. In: Biological Correlations — The Hansch Approach (R.F. Gould, Ed.). Advances in Chemistry Series, No. 114. American Chemical Society, Washington DC, 1972, pp. 115-129.
33. P.N. Craig, Interdependence between physical parameters and selection of substituent groups for correlation studies. J. Med. Chem., 14 (1971) 680-684.
34. F. Darvas, Application of the sequential simplex method in designing drug analogs. J. Med. Chem., 17 (1974) 799-804.
35. B.R. Kowalski and C.F. Bender, The application of pattern recognition to screening prospective anticancer drugs. J. Am. Chem. Soc., 96 (1974) 916-918.
36. R. Franke, S. Dove and R. Kuehne, Hydrophobicity and hydrophobic interactions, 1. On the physical nature of aromatic hydrophobic substituent constants. Eur. J. Med. Chem., 14 (1979) 363-374.
37. P.J. Lewi, Spectral mapping, a technique for classifying biological activity profiles of chemical compounds. Arzneim.-Forsch. (Drug Res.), 26 (1976) 1205-1300.
38. B.R. Kowalski and C.F. Bender, Pattern recognition. A powerful approach to interpreting chemical data. J. Am. Chem. Soc., 94 (1972) 5632-5639.
39. C. Hansch, Quantitative approaches to pharmacological structure-activity relationships. In: Structure-Activity Relationships (C.J. Cavallito, Ed.), Vol. 1. Pergamon, Oxford, 1973, pp. 75-165.
40. P.J. Lewi, Multivariate data analysis in structure-activity relationships. In: Drug Design (E.J. Ariens, Ed.), Vol. X. Academic Press, New York, 1980.
41. J. Schmutz, Neuroleptic piperazinyl-dibenzo-azepines. Arzneim.-Forsch. (Drug Res.), 25 (1975) 712-719.
42. K.R. Gabriel, The biplot graphical display of matrices with application to principal component analysis. Biometrika, 58 (1971) 453-467.
43. P.J. Lewi, Multidimensional data representation in medicinal chemistry. In: Chemometrics. Mathematics and Statistics in Chemistry (B.R. Kowalski, Ed.). Reidel, Dordrecht, 1984, pp. 351-376.
44. P.J. Lewi, Spectral mapping of drug-test specificities. In: Advanced Computer-Assisted Techniques in Drug Discovery (H. van de Waterbeemd, Ed.). VCH, Weinheim, Germany, 1994, pp. 219-253.
45. P.L. Wood, S.E. Charleson, D. Lane and R.L. Hudgin, Multiple opiate receptors: differential binding of mu, kappa and delta agonists. Neuropharmacology, 20 (1981) 1215-1220.
46. G. Calomme and P.J. Lewi, Multivariate analysis of structure-activity data. Spectral map of opioid narcotics in receptor binding. Actual. Chim. Therap., S.11 (1984) 121-126.
47. J.-P. Benzecri, Analyse des Données. Analyse des Correspondances. Dunod, Paris, 1973.
48. C.J.E. Niemegeers and P.A.J. Janssen, A systematic study of the pharmacology of DA-antagonists. Life Sci., 24 (1979) 2201-2216.
49. M.J. Greenacre, Correspondence Analysis in Practice. Academic Press, London, 1993.
50. R. Franke, Optimierungs-Methoden in der Wirkstoff-Forschung. Quantitative Struktur-Wirkungs-Analyse. Akademie-Verlag, Berlin, 1980.
51. T.W. Anderson, Introduction to Multivariate Statistical Analysis. Wiley, New York, 1984.
52. S. Wold, C. Albano, W.J. Dunn III, K. Esbensen, S. Hellberg, E. Johansson and M. Sjöström, Pattern recognition: finding and using patterns in multivariate data. In: Food Research and Data Analysis (H. Martens and H. Russwurm Jr., Eds.). Applied Science, London, 1983, p. 147.
53. S. de Jong, PLS fits closer than PCR. J. Chemom., 7 (1993) 551-557.
54. R.D. Cramer III, D.E. Patterson and J.D. Bunce, Comparative Molecular Field Analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J. Am. Chem. Soc., 110 (1988) 5959-5967.
55. L.M. Kauvar, D.L. Higgins, H.O. Villar, J.R. Sportman, A. Engqvist-Goldstein, R. Bukar, K.E. Bauer, H. Dilley and D.M. Rocke, Predicting ligand binding to proteins by affinity fingerprinting. Chem. Biol., 2 (1995) 107-118.
56. P.J. Lewi, B. Vekemans and L.M. Gypen, Partial least squares (PLS) for the prediction of real-life performance from laboratory results. In: Scientific Computing and Automation (Europe) (E.J. Karjalainen, Ed.). Elsevier, Amsterdam, 1990, pp. 199-209.
57. J.E. Leysen, Review of neuroleptic receptors: specificity and multiplicity of in-vitro binding related to pharmacological activity. In: Clinical Pharmacology in Psychiatry (E. Usdin, S. Dahl, L.F. Gram and O. Lingjaerde, Eds.). MacMillan, London, 1982, pp. 35-62.
58. L.E. Wangen and B.R. Kowalski, A multiblock partial least squares algorithm for investigating complex chemical systems. J. Chemom., 3 (1988) 3-10.
59. J. Bobon, D.P. Bobon, A. Pinchard, J. Collard, T.A. Ban, R. De Buck, H. Hippius, P.A. Lambert and O.A. Vinar, A new comparative physiognomy of neuroleptics: a collaborative clinical report. Acta Psych. Belg., 72 (1972) 542-554.
60. C. Hansch, S.H. Unger and A.B. Forsythe, Strategy in drug design. Cluster analysis as an aid in the selection of substituents. J. Med. Chem., 16 (1973) 1212-1222.
61. D.L. Massart and L. Kaufman, The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis. Wiley, New York, 1983.
62. S.A. Hiller, V.E. Golender, A.B. Rosenblit, L.A. Rastrigin and A.B. Glaz, Cybernetic methods of drug design. 1. Statement of the problem — The Perceptron approach. Comp. Biomed. Res., 6 (1972) 411-421.
63. P.C. Jurs, B.R. Kowalski and T.L. Isenhour, Computerized learning machines applied to chemical problems. Molecular formula determination from low resolution mass spectrometry. Anal. Chem., 41 (1969) 21-27.
64. J.G. Topliss, Application of operational schemes for analog synthesis in drug design. J. Med. Chem., 15 (1972) 1001-1011.
65. W.P. Purcell, G.E. Bass and J.M. Clayton, Strategy of Drug Design: A Guide to Biological Activity. Wiley, New York, 1973.
66. J.P. Tollenaere, H. Moereels and L.A. Raymaekers, Atlas of Three-Dimensional Structure of Drugs. Elsevier, Amsterdam, 1979.
67. L.B. Kier, Molecular Orbital Theory in Drug Research. Academic Press, New York, 1971.
68. P.S. Portoghese, In: Molecular and Quantum Pharmacology (E.D. Bergman and B. Pullman, Eds.). Reidel, Dordrecht, 1974, pp. 352-353.
69. P.J. Lewi and H. Moereels, Receptor mapping and phylogenetic clustering. In: Advanced Computer-Assisted Techniques in Drug Discovery (H. van de Waterbeemd, Ed.). VCH, Weinheim, Germany, 1994, pp. 131-162.
70. L.B. Kier and L.H. Hall, Molecular Connectivity in Structure-Activity Analysis. Wiley, New York, 1986.
71. S.C. Basak, V.R. Magnuson, G.J. Niems and R.R. Regal, Determining structural similarity of chemicals using graph-theoretic indices. Discr. Appl. Math., 19 (1988) 17-44.
Chapter 38
Analysis of Sensory Data

38.1 Introduction

The determination and analysis of sensory properties plays an important role in the development of new consumer products. Particularly in the food industry sensory analysis has become an indispensable tool in research, development, marketing and quality control. The discipline of sensory analysis covers a wide spectrum of subjects: physiology of sensory perception, psychology of human behaviour, flavour chemistry, physics of emulsion break-up and flavour release, testing methodology, consumer research, statistical data analysis. Not all of these aspects are of direct interest for the chemometrician. In this chapter we will cover a few topics in the analysis of sensory data. General introductory books are e.g. Refs. [1-3].

There are four main types of data that frequently occur in sensory analysis: pair-wise differences, attribute profiling, time-intensity recordings and preference data. We will discuss in what situations such data arise and how they can be analyzed. Especially the analysis of profiling data and the comparison of such data with chemical information calls for a multivariate approach. Here, we can apply some of the techniques treated before, particularly those of Chapters 35 and 36.

38.2 Difference tests

38.2.1 Triangle test

Let us consider a product developer who is trying to improve the taste of an existing product. The first question one could ask (and should ask before continuing) with the new product is: does the new product taste different from the old product? If trained panellists cannot establish a significant difference, it is hardly justifiable to do consumer tests, let alone launch the product on the market. A standard overall difference test is the triangle test (Fig. 38.1). In such a test one presents three samples, in no particular order, which should be tasted. Two out of the three samples are identical (e.g. the existing product, as a control) and the task is to identify the odd sample (the new product). If enough panellists correctly
Fig. 38.1. Triangle test: two similar products and one different product are presented; the assessor has to indicate the product that is different.
recognize the dissimilar sample one knows that a sensory difference exists. Table 38.1 gives the critical values for different panel sizes and different significance levels. For example, with 27 panellists (n = 27) one concludes that there is a difference (at the α = 5% significance level) when at least 14 correct responses are obtained. One should be aware that the conclusion of 'no significant difference detected' may not be interpreted as 'no difference present'. If the panel is small the triangle test has low power, i.e. there is a high probability that real differences may go unnoticed (see Sections 4.7 and 5.3). Therefore, a sizeable panel of at least 25 assessors is recommended. When the assessors have been selected and trained for their task the panel size may be somewhat smaller (at least 20). In general, economizing on the number of assessors for sensory tests may be a false economy: choosing a small inexpensive panel may result in great losses due to missing real opportunities for introducing improved products with increased market share.

38.2.2 Duo-trio test

In the duo-trio test one presents to each panellist (or 'subject') an identified reference sample, followed by two coded samples, one of which matches the reference sample (Fig. 38.2). The subjects are asked to indicate which of the two coded samples matches the reference. If enough correct replies are obtained the two coded samples are perceived as different. Table 38.2 gives the critical values
Fig. 38.2. Duo-trio test: two different products are presented; the assessor has to indicate which of these two is similar to a third product.
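Under the null hypothesis of no perceptible difference, the number of correct responses follows a binomial distribution with success probability 1/3 for the triangle test and 1/2 for the duo-trio test, so critical values of the kind collected in Tables 38.1 and 38.2 can be computed directly. The following Python fragment is a minimal sketch of that computation; it reproduces the worked examples quoted in the text (14 correct out of 27 for the triangle test, 20 out of 30 for the duo-trio test at α = 5%), although published tables may differ in occasional borderline cases.

    from scipy.stats import binom

    def min_correct(n, alpha, p):
        """Smallest k such that P(X >= k) <= alpha when guessing, X ~ Binomial(n, p)."""
        for k in range(n + 1):
            if binom.sf(k - 1, n, p) <= alpha:
                return k
        return None

    # Triangle test: guessing probability 1/3; duo-trio test: 1/2.
    print(min_correct(27, 0.05, 1/3))   # critical value for a 27-member triangle panel
    print(min_correct(30, 0.05, 1/2))   # critical value for a 30-member duo-trio panel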
TABLE 38.1
Table of critical values for the triangle test for differences. For each number of assessors n (n = 1, ..., 100), the body of the table gives the minimum number of correct identifications required to conclude that a difference exists at the significance levels α = 10%, 5%, 1% and 0.1%. For example, for n = 25 the critical values are 12, 13, 15 and 17, and for n = 27 they are 13, 14, 16 and 18, respectively.
TABLE 38.2
Table of critical values for the duo-trio test. For each number of assessors n (n = 1, ..., 100), the body of the table gives the minimum number of correct responses required to conclude that a difference exists at the significance levels α = 10%, 5%, 1% and 0.1%. For example, for n = 30 the critical values are 20, 20, 22 and 24, respectively.
for various panel sizes and significance levels. For example, with a panel of n = 30 panellists one would need at least 20 correct answers to conclude that there is a significant (α = 5%) difference.

38.2.3 Paired comparisons

In paired comparison tests two different samples are presented and one asks which of the two samples has 'most' of the sensory property of interest, e.g. which of two products has the sweetest taste (Fig. 38.3). The pairs are presented in random order to each assessor and preferably tested twice, reversing the presentation order on the second tasting session. Fairly large numbers (>30) of test subjects are required. If there are more than two samples to be tested, one may compare all possible pairs ('round robin'). Since the number of possible pairs grows rapidly with the number of different products, this is only practical for sets of three to six products. By combining the information of all paired comparisons for all panellists one may determine a rank order of the products and determine significant differences.

For example, in a paired comparison one compares three food products: (A) the usual freeze-dried form, (B) a new freeze-dried product, (C) the new product, not freeze-dried. Each of the three pairs is tested twice by 13 panellists in two different presentation orders: A-B, B-A, A-C, C-A, B-C, C-B. The results are given in Table 38.3. The results of such multiple paired comparison tests are usually analyzed with Friedman's rank sum test [4] or with more sophisticated methods, e.g. the one using the Bradley-Terry model [5]. A good introduction to the theory and applications of paired comparison tests is David [6]. Since Friedman's rank sum test is based on less restrictive, ordering assumptions it is a robust alternative to two-way analysis of variance, which rests upon the normality assumption.

For each panellist (and presentation) the three products are scored, i.e. a product gets a score 1, 2 or 3 when it is preferred twice, once or not at all, respectively. The rank scores are summed for each product i. One then tests the hypothesis that this result could be obtained under the null hypothesis that there is no difference between the three products and that the ranks were assigned randomly. Friedman's test statistic for this reads
Fig. 38.3. Paired comparison test: two different products are presented and the assessor has to indicate the one that has most of a specified attribute (e.g. 'Which of the two samples is the sweetest, A or B?').
TABLE 38.3
Results of pairwise comparison of three products in two presentation orders by 13 panellists.

Paired comparison     Frequency    Ranking
                                   A    B    C
A>B, A>C, B>C         7            1    2    3
A>B, A>C, B<C         6            1    3    2
A<B, A>C, B>C         5            2    1    3
A>B, A<C, B>C         2            2    2    2
A<B, A>C, B<C         1            2    2    2
A>B, A<C, B<C         4            2    3    1
A<B, A<C, B<C         1            3    2    1
T = \frac{12 \sum_{i=1}^{t} R_i^2}{b\,t\,(t+1)} - 3\,b\,(t+1) \qquad (38.1)
where t is the number of treatments (or products), b is the number of blocks (panellist × session) and Ri is the sum of the ranks for the ith product (treatment; i = 1, 2, 3). In our example, t = 3, b = 13 × 2 = 26, and the rank sums for products A, B and C are R1 = 7×1 + 6×1 + 5×2 + 2×2 + 1×2 + 4×2 + 1×3 = 40, R2 = 57, and R3 = 59, respectively. This gives T = 12 × (40² + 57² + 59²)/(26 × 3 × 4) − 3 × 26 × 4 = 8.38. This result should be compared to a chi-square distribution with (t − 1) = 2 degrees of freedom. Since 8.38 > 5.99 (the 5% critical level of a χ² distribution with 2 df; see Table 5.4), we conclude from this test that there exists a difference between the three products.

The rank sums in Friedman's analysis provide a scale on which to position the different samples. Large differences on this scale indicate a significant product effect. In the example a difference between A on the one hand and B and C on the other is clearly indicated, the difference between B and C being negligible.
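The computation of eq. (38.1) for this example is easily verified; the short sketch below uses the rank sums derived above and reproduces the value T = 8.38 together with the 5% critical level of the chi-square distribution.

    import numpy as np
    from scipy.stats import chi2

    # Rank sums from the worked example (Table 38.3): t = 3 products, b = 26 blocks.
    t, b = 3, 26
    R = np.array([40, 57, 59])

    T = 12 * np.sum(R**2) / (b * t * (t + 1)) - 3 * b * (t + 1)
    critical = chi2.ppf(0.95, df=t - 1)
    print(round(T, 2), round(critical, 2), T > critical)   # 8.38, 5.99, True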
Notice that in comparisons such as these sometimes slight inconsistencies in the results can be obtained. In two cases A was considered better than B, and B better than C, yet C was judged superior to A. This inconsistency, or non-transitivity, is known as Simpson's or de Condorcet's paradox. In this particular case it can perfectly well be attributed to random variation: assessors who are not sure about their conclusion are forced to make a choice, which then can only be a random guess. It is possible, however, to obtain results which are conflicting and statistically significant at the same time: A < B and B < C, but C < A. This situation may occur when the attribute to be assessed in the comparisons is open to different interpretations. Actually, this is a case of multicriteria decision making (see Chapter 26) and it may be impossible to rank the three products unambiguously along a single scale.

One way out of this situation is to reconsider the testing design, improve the instructions to the assessors, rephrase the question asked in a way less prone to misinterpretation, etc. Another way out is to adapt the analysis of the data, allowing for additional dimensions. In the next section we show how the paired comparison of a set of products (objects, items) can give the basic input for such a multidimensional scaling.
38.3 Multidimensional scaling

Let us consider an experiment in which n products are compared. One judges each pair of products and gives a score for the dissimilarity of the two products. For example, one might give a score of 1, 2, 3, 4 or 5 to indicate "No", "Little", "Intermediate", "Much", or "Very Much" dissimilarity. The more two products differ, the larger will be their dissimilarity score. The dissimilarity can therefore be interpreted as a distance. All pairwise comparisons may be collected into an n×n symmetric dissimilarity or distance matrix. Such a matrix of pairwise distances can also be computed when one knows the coordinates of the products in some space. The problem is now reversed: knowing the distances (dissimilarities), can we find a configuration of the products in low-dimensional space such that it reproduces the given distances to a good approximation? Multidimensional scaling (MDS) is the collective name for a set of methods and algorithms to find this low-dimensional configuration from a distance matrix.

A simple example may serve to introduce the subject. Table 38.4 gives the distances between a few major European cities measured by the approximate flight time.

TABLE 38.4
Table of approximate flying times (hours) between nine major European cities

From/To:      At    Lo    Ma    Mu    Pa    Ro    St    Vl    Wa
Athens        0     5     4.5   3     4     1.5   6     4.5   4
London        5     0     2.5   2     1.5   2.5   2.5   1     2.5
Madrid        4.5   2.5   0     2.5   2     2.5   6.5   2.5   5.5
Munich        3     2     2.5   0     1.5   1.5   3.5   1.5   1.5
Paris         4     1.5   2     1.5   0     2     3.5   1     2.5
Rome          1.5   2.5   2.5   1.5   2     0     5     3     2
Stockholm     6     2.5   6.5   3.5   3.5   5     0     2     1.5
Vlaardingen   4.5   1     2.5   1.5   1     3     2     0     2
Warszaw       4     2.5   5.5   1.5   2.5   2     1.5   2     0
In so-called classical or metric MDS one applies Principal Coordinate Analysis introduced in Section 31.6. This is equivalent to finding the eigenvectors of the double-centered squared distance matrix Δ, with elements δij:

\delta_{ij} = -\tfrac{1}{2}\left(d_{ij}^2 - \bar{d}_{i.}^2 - \bar{d}_{.j}^2 + \bar{d}_{..}^2\right) \qquad (38.2)
Here, dij is the dissimilarity of objects i and j, di. is the mean of row i, d.j the mean of column j, and d.. the overall mean of the matrix of dissimilarities. When the solution is found one hopes that a sufficiently good approximation can be obtained in a low-dimensional space, preferably a two-dimensional space for easy graphical display.

Figure 38.4 shows the map obtained by displaying the scores for each city along the principal axes 1 and 2. One should be aware that flying times are only an approximate indication of the true distance, since the average flying speed may differ between different connections. Yet, one notices that the configuration of the cities is quite close to reality. One may well imagine that a similar analysis for cities from different continents would give a poor representation in two dimensions. Here, a three-dimensional representation would be much more adequate. With sensory data the best dimensionality of the solution cannot be guessed beforehand and is itself an analysis result of major interest. In practice one must carry out the MDS analysis for different choices of the dimensionality. The final choice is guided by goodness-of-fit criteria, interpretability of the solution and a desire to keep the results as simple as possible.
Fig. 38.4. Two-dimensional map of European cities derived from the flight time data of Table 38.4 using classical (metric) MDS.
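To make the computation of eq. (38.2) concrete, the following sketch (Python, assuming numpy is available) double-centers the squared flight-time matrix of Table 38.4 and extracts the first two principal coordinates; this is only an illustrative reconstruction, not the software used for Fig. 38.4.

```python
import numpy as np

# Flight times (hours) from Table 38.4; order:
# Athens, London, Madrid, Munich, Paris, Rome, Stockholm, Vlaardingen, Warsaw
D = np.array([
    [0,   5,   4.5, 3,   4,   1.5, 6,   4.5, 4  ],
    [5,   0,   2.5, 2,   1.5, 2.5, 2.5, 1,   2.5],
    [4.5, 2.5, 0,   2.5, 2,   2.5, 6.5, 2.5, 5.5],
    [3,   2,   2.5, 0,   1.5, 1.5, 3.5, 1.5, 1.5],
    [4,   1.5, 2,   1.5, 0,   2,   3.5, 1,   2.5],
    [1.5, 2.5, 2.5, 1.5, 2,   0,   5,   3,   2  ],
    [6,   2.5, 6.5, 3.5, 3.5, 5,   0,   2,   1.5],
    [4.5, 1,   2.5, 1.5, 1,   3,   2,   0,   2  ],
    [4,   2.5, 5.5, 1.5, 2.5, 2,   1.5, 2,   0  ],
])

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
Delta = -0.5 * J @ (D ** 2) @ J            # double-centered matrix of eq. (38.2)

eigval, eigvec = np.linalg.eigh(Delta)     # eigenvalues in ascending order
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

# Coordinates of the cities along the first two principal axes (cf. Fig. 38.4)
coords = eigvec[:, :2] * np.sqrt(eigval[:2])
print(np.round(coords, 2))
```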
Fig. 38.5. Two-dimensional map of European cities obtained from non-metric MDS (using ranked distances) applied to the flight time data of Table 38.4.
Even when the input data are much less accurate a reasonable configuration can be recovered by MDS. For example, replacing the distances by their rank numbers and then applying non-metric MDS gives the configuration displayed in Fig. 38.5, which still represents the actual distance configuration quite well. In non-metric MDS the analysis takes into account the measurement level of the raw data (nominal, ordinal, interval or ratio scale; see Section 2.1.2). This is most relevant for sensory testing where often the scale of scores is not well-defined and the differences derived may not represent Euclidean distances. For this reason one may rank-order the distances and analyze the rank numbers with, for example, the popular method and algorithm for non-metric MDS that is due to Kruskal [7]. Here one defines a non-linear loss function, called STRESS, which is to be minimized:
$$\mathrm{STRESS} = \sqrt{\frac{\sum_i \sum_j \left(\delta_{ij} - f(d_{ij})\right)^2}{\sum_i \sum_j f(d_{ij})^2}} \qquad (38.3)$$
Minimizing this function is equivalent to finding a low-dimensional configuration of points which has Euclidean object-to-object distances δ_ij as close as possible to some transformation f(·) of the original distances or dissimilarities, d_ij. Thus, the model distances δ_ij are not necessarily fitted to the original d_ij, as in classical MDS, but to some admissible transformation of the measured distances. For example, when the transformation is a general monotonic transformation it preserves the
ordering of the dissimilarities, which may still be in accordance with the measurement level (ordinal). There are many extensions and adaptations of multidimensional scaling methods [8], which are beyond the scope of this book. In the computer science and pattern recognition literature one finds closely related procedures under the name of non-linear mapping [9].

Example

Twenty-four different types of bread were assessed visually by 12 assessors. The samples differed in colour, size, shape, aeration and filling. The assessors were asked to indicate on a line scale the difference in appearance between photographs of two samples. A tick mark on the left end of the line scale indicates no difference, a tick mark on the right end a large difference. The distances of the tick marks from the left end were measured for each of the pairs and for each panellist. The distances were averaged over all panellists and the average distances obtained were ranked and analyzed using Kruskal's non-metric multidimensional scaling technique. A 4-dimensional configuration satisfactorily approximated the original rank-distances. Figure 38.6 shows a 3D-graph of the MDS analysis. The graph reveals a clear clustering into six groups. These could be labelled as white bread, wheat bread, currant bread and pastry bread, and two intermediate forms. The largest perceived differences are associated with differences in colour (dim 1), filling (dim 2) and aeration (dim 3). The other dimensions, related to size and shape differences, played a minor role in discriminating the various products.
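A minimal sketch of such a non-metric MDS run is given below (Python, assuming scikit-learn is available; exact keyword arguments may differ between versions). The matrix avg_dissim is a hypothetical stand-in for the panel-averaged line-scale dissimilarities, since the original bread data are not reproduced here.

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical panel-averaged dissimilarities for 24 bread samples,
# generated here from random stand-in sample properties for illustration.
rng = np.random.default_rng(0)
props = rng.random((24, 5))
avg_dissim = np.sqrt(((props[:, None, :] - props[None, :, :]) ** 2).sum(-1))

# Non-metric (Kruskal) MDS: only the rank order of the dissimilarities is used.
nmds = MDS(n_components=4, metric=False, dissimilarity="precomputed",
           n_init=10, random_state=0)
config = nmds.fit_transform(avg_dissim)

print("stress value reported by scikit-learn:", round(nmds.stress_, 3))
print("configuration shape:", config.shape)      # (24, 4), cf. the 4-D bread solution
```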
Fig. 38.6. Three-dimensional configuration according to a non-metric multidimensional scaling applied to 24 types of bread differing in appearance, as assessed by 12 panellists.
38.4 The analysis of Quantitative Descriptive Analysis profile data

Quantitative Descriptive Analysis (QDA) is the term used to designate the characterization of samples in a reproducible way. The QDA technique is a much used tool in sensory analysis. QDA is quantitative in the sense that reproducible scores for sensory attributes can be obtained that are useful in describing the samples tested. In principle, our interest is not in the scores of individual panellists, but in the panel scores as a whole, hence we take the mean score of the panel as the reading of the sensory 'instrument'. (Of course, the individual behaviour of panellists is of interest in monitoring the quality of individual panellists during training and later as a control.) The panellists are chosen because of their expert sensing abilities, which enable them to perceive small differences between samples. They are in no way chosen to represent some target group of consumers. The task of a QDA panel is to 'measure', to describe. It should be stressed that a QDA panel records; it gives no judgement on the acceptability of a product. The latter should be done by larger consumer panels which are selected to represent the target population, not on account of their skills in sensory perception. The main use of a QDA panel is to provide the product developer with a reliable instrument for characterizing the sensory properties of a certain type of product. The panel consists of carefully selected subjects that have gone through a thorough training. Aspects considered in the selection of QDA panellists are: good tasting abilities, odour identification, odour and taste recall, creative use of language, fitting in the group, no product aversion, availability and general health. The training stage of a QDA panel largely consists of generating a long list of attribute names that cover all perceptible characteristics of the product category concerned. The next task is to find agreement about the definition of the attributes in panel discussions. Then one seeks to reduce the set of attributes by grouping similar attributes into homogeneous clusters. Finally, one tests for the repeatability of each of the attributes as used by the individual panellists. Even after reducing the number of attributes originally generated the resulting set of attributes can be quite large: from 10 up to some 100 attributes. In many ways one may consider a well-trained QDA panel as an analytical instrument that measures the intensity of certain sensory attributes. For example, it has a certain precision that can be measured, it may show drift, it may require readjustment and its application is bound to strict protocols. A common way to record the intensity is by marking an unstructured line segment that has a left anchor ("absent") and a right anchor ("very strong"). Usually the score is directly fed into the computer and recorded as a number between 0 ("absent") and 100 ("very strong"). A typical way to represent the results of a sensory panel session is in the form of profiles. Figure 38.7 shows the panel-average profiles for two products. Here, the scores on different attributes are plotted in one graph with the
Fig. 38.7. Example of average sensory profile for two food products as obtained in a QDA panel. For five attributes (*) the difference is statistically significant (95% confidence level).
attributes arranged along one axis and the other axis providing the scale for the scores. Attribute scores belonging to the same product are joined, thus giving a multivariate profile of the sensory characteristics of the product. The analysis of QDA panel data is usually done by means of analysis of variance (see Chapter 6) on the individual attributes. With the individual data it is arguable whether analysis of variance is a proper method with its assumption of normality. In a two-way ANOVA lay-out with "Product" and "Panellist" as main factors one can test the main effects against the Product*Panellist interaction (see Section 6.8.2). The latter interaction can be tested against the pure error when the testing has been repeated in different sessions. Ideally, one would like this interaction to be non-significant. Absence of a significant Product*Panellist interaction means that the panellists have similar views on the relative ordering of the products. A practical consequence is that it is less of a problem when not all panellists are present at each tasting session. With panel mean scores the assumption of normality is much more easily satisfied. For such panel average scores one may, for example, test for a difference between products when the samples have been tasted at two or more sessions. In Fig. 38.7 the attributes for which there is a significant difference between the two products have been indicated. When there are many samples and many attributes the comparison of profiles becomes cumbersome, whether graphically or by means of analysis of variance on all the attributes. In that case, PCA in combination with a biplot (see Sections 17.4 and 31.2) can be a most effective tool for the exploration of the data. However, it does not allow for hypothesis testing. Figure 38.8 shows a biplot of the panel-average QDA results of 16 olive oils and 7 appearance attributes. The biplot of the
Fig. 38.8. Biplot of panel-average QDA profiles (attributes) for 16 olive oil products.
column-centered data shows the approximate positions of the products and the attributes in sensory space. The 2D-approximation of Fig. 38.8 is a fairly good one, since it accounts for 83% of the variance. Noteworthy in the biplot is the presence of one distinct sample, G13, characterized by a particularly low transparency, perhaps due to the presence of "particles". The plot, which is based on a factoring using α = 0 and β = 1 (see Section 31.2), also gives a geometrical view of the correlation between the various attributes. For example, the attributes "brown" and "green" are highly correlated and negatively correlated with "yellow". The attributes "transparency" and "particles" are also strongly negatively correlated. Instead of using the principal component axes one might define two factors (see Section 34.2), obtained by a slight rotation over 20 degrees, which are associated with the particles/transparency contrast and the green/yellow contrast, respectively. The attributes "syrup" and "glossy" are not well represented in this 2-dimensional projection.
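A compact sketch of such a biplot factoring is given below (Python with numpy). The matrix X is a hypothetical stand-in for the column-centered table of panel-mean scores (products × attributes); with α = 0 and β = 1 the products are represented by the left singular vectors and the attributes by the loadings scaled by the singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(16, 7))                  # stand-in: 16 products x 7 attributes
X = X - X.mean(axis=0)                        # column-centering

U, s, Vt = np.linalg.svd(X, full_matrices=False)

alpha, beta = 0.0, 1.0                        # factoring used for Fig. 38.8
row_markers = U[:, :2] * s[:2] ** alpha       # product markers
col_markers = Vt.T[:, :2] * s[:2] ** beta     # attribute markers

explained = 100 * (s[:2] ** 2).sum() / (s ** 2).sum()
print(f"variance accounted for by the 2D biplot: {explained:.1f}%")
```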
38.5 Comparison of two or more sensory data sets

When many data sets for the same set of products (objects) are available it is of interest to look for the common information and to analyze the individual deviations. When the panellists in a sensory panel test a set of food products one might be interested in the answer to many questions. How are the products positioned, on the average, in sensory space? Are there regions which are not well
covered, which may provide opportunities for new products? How are the attributes related? Are some panellists deviating significantly from the majority, signalling a need for retraining? When comparing different panels, is there a consensus among the panels? Can one compare the results of different panels across (cultural, culinary) borders? After all, descriptive panels should be more or less objective! Can one compare the results of panels (panellists) which have tested the same samples, but have used different attributes? Can this then be used to interrelate the different sets of attributes? A powerful technique which allows one to answer such questions is Generalized Procrustes Analysis (GPA). This is a generalization of the Procrustes rotation method to the case of more than two data sets. As explained in Chapter 36, Procrustes analysis applies three basic operations to each data set with the objective of optimizing their similarity, i.e. of reducing their distance. Each data set can be seen as defining a configuration of its rows (objects, food samples, products) in a space defined by the columns (sensory attributes) of that data set. In geometrical terms the (squared) distance between two data sets equals the sum over the squared distances between the two positions (one for data set XA and one for XB) for each object. The first operation in Procrustes analysis is to shift the center of gravity of each configuration of data points to the origin. This is a geometric way of saying that each attribute is mean-centered. In the next step the data configurations are rotated and possibly reflected to further reduce the distances between corresponding objects. When more than one data set is involved, each data set is rotated in turn to the mean configuration of the other data sets. In the third step the data sets are shrunk or stretched by a scaling factor in order to increase their match. For each data set its scaling factor applies equally to all its attributes; thus the data configuration of objects shrinks or expands by an equal amount in every direction. The process of alternating rotations and scalings is repeated until convergence to stable and optimally close configurations. We now give a qualitative discussion of the use and interpretation of the GPA method applied to sensory data. The most common application of GPA in sensory analysis is the comparison of the test results from different panellists. The interest may simply be to find the 'best' average configuration of the samples. For example, Fig. 38.9 shows the results of a Procrustes analysis of 7 cheese products measured in triplicate by a QDA panel of 8 panellists using 18 attributes describing odour, flavour, appearance and texture. The graph shows the average position of the 7 cheeses after optimal Procrustes transformation of the 8 × 3 = 24 different data sets. This final average GPA configuration is often referred to as the consensus configuration. The term 'consensus' is somewhat misleading, as the final GPA configuration is merely the result of an averaging process; it is not the result of some group discussion between the panellists! In principle, the 7 products span a
Fig. 38.9. Consensus plot showing the relative positions of 7 cheese products (A-G) as assessed by a panel in 3 sessions. Triangles for each product indicate the three sessions. Differences between the three sessions are much smaller than differences between products.
6-dimensional space. In this case a two-dimensional projection onto the first two principal components of the GPA average configuration is a good enough approximation, accounting for 83% of the total variation. The triangles in the plot around each product indicate the average positions obtained at the three sessions. Clearly, the difference among the products exceeds the within-product between-session variability. Therefore, an interpretation of the results with regard to the products can be meaningful. Conclusions which can be drawn from the graph are, for example: A takes a somewhat isolated position, E and F are close, so are C and G, B is an 'average' product, the lower right area is empty, and so on. For the product developer who has background knowledge about the various products such a graphical summary of the sensory properties can be a useful aid in his work. For an interpretation of the principal axes one may draw a correlation plot. This is a plot of the loadings (correlations) of the individual attributes with the principal axis scores. Figure 38.10 shows an example of an international collaborative study involving panels from five different institutes [11]. The aim was to assess the degree of cross-cultural differences in the sensory perception of coffee. The panels characterized eight brands of coffee, each with an independently developed list of attributes. The correlation plot reveals that attributes with the same or a similar name are in general positioned close together. This lends credibility to the 'objectivity' of the QDA technique. Attributes which are close to the circle of radius 1 are well represented by the 2-dimensional space of the first two principal axes. Thus, the correlation plot and the reference to the common configuration are helpful in judging the relations between the various attributes within and between different panels. One may also try to label each principal axis with a name that is
Fig. 38.10. Correlations of attributes with consensus principal axes.
suggested by attributes which are highly (positively or negatively) correlated with that axis. This is not always an easy task. Sometimes it is easier to distinguish main factors that are rotated with respect to the principal axes. One may also analyze which of the individual sets (i.e. panellists) are close to the mean and which are more deviant. For this analysis one determines the residuals for all products and attributes between the mean configuration and the individual data sets after the optimal Procrustes transformation. One then strings out each individual residual data set in the form of a long row vector. These row vectors are collected into a matrix where each row is now associated with a panellist. Performing a PCA on this matrix shows in a score plot the relative position of the individual panellist as a deviation from the mean. Figure 38.11 shows this plot for the 8 panellists of the cheese study. It reveals panellist 8 as being furthest removed from the rest. This panellist perhaps needs additional training. It is not strictly required to use the same attributes in each data set. This allows the comparison of independent QDA results obtained by different laboratories or development departments in collaborative studies. Also within a single panel, individual panellists may work with 'personal' lists of attributes. When the sensory attributes are chosen freely by the individual panellist one speaks of Free Choice Profiling. When each panellist uses such a personal list of attributes, it is likely that
Fig. 38.11. Deviations of the 8 panellists from the consensus, based on a PCA of the residuals × panellist table, where the residuals comprise all products, sessions and attributes.
the number of variables differs from panellist to panellist. In that case it is convenient to add dummy columns filled with zeros so that all panellists have data sets of the same, maximum, size. This so-called zero-padding does not affect the analysis. So far, the nature of the variables was the same for all data sets, viz. sensory attributes. This is not strictly required. One may also analyze data sets of different types (processing conditions, composition, instrumental measurements, sensory variables). However, regression-type methods are better suited for linking such diverse data sets, as explained in the next section.
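Before turning to such regression methods, a minimal sketch of the GPA loop described above may help to make the procedure concrete (Python, assuming numpy and scipy are available). It is a simplified illustration, not a full GPA implementation: the data are hypothetical, the scaling step is kept deliberately crude, and any zero-padding to a common number of columns is assumed to have been done beforehand.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def gpa(datasets, n_iter=20):
    """Crude generalized Procrustes analysis of a list of (n_objects x p) arrays."""
    # Step 1: translation -- center each configuration at the origin
    X = [Xk - Xk.mean(axis=0) for Xk in datasets]
    init_total = sum((Xk ** 2).sum() for Xk in X)
    for _ in range(n_iter):
        mean = np.mean(X, axis=0)                      # current consensus configuration
        new_X = []
        for Xk in X:
            R, _ = orthogonal_procrustes(Xk, mean)     # Step 2: rotation/reflection
            Xk = Xk @ R
            s = np.trace(Xk.T @ mean) / np.trace(Xk.T @ Xk)
            new_X.append(s * Xk)                       # Step 3: isotropic scaling
        # keep the total sum of squares fixed so the configurations do not shrink away
        total = sum((Xk ** 2).sum() for Xk in new_X)
        X = [Xk * np.sqrt(init_total / total) for Xk in new_X]
    return np.mean(X, axis=0), X

# Hypothetical example: three panellists scoring 7 products on 4 attributes
rng = np.random.default_rng(2)
true = rng.normal(size=(7, 4))
panellists = [true @ rng.normal(size=(4, 4)) + 0.1 * rng.normal(size=(7, 4))
              for _ in range(3)]
consensus, matched = gpa(panellists)
print(np.round(consensus, 2))
```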
38.6 Linking sensory data to instrumental data

The objective of relating sensory measurements to instrumental measurements is twofold. A first objective is to obtain a better understanding of the sensory attributes. One should realize that such a goal can usually only be met in part, since sensory perception is a highly complex process. The instrumental measurements too may be the result of complex processes. For example, the force recorded with an Instron instrument when compressing a food sample depends in an intricate way on its flow behaviour and breaking properties, which themselves are determined by the sample's internal structure. A second goal of relating the two types of measurements is that instrumental measurements may eventually replace the sensory panel. The driving force behind this second objective is that instrumental measurements are cheaper. Little success has been achieved in this area so far, due to the complexity of human sensory perception.
When relating instrumental measurements to sensory data one should focus on QDA-type data. Hedonic (or 'liking') scores and preference data are generally not well suited for comparisons with instrumental measurements, since there usually will not be a linear relationship. A simple example is saltiness, let us say, of a soup. A QDA panel can be used to 'measure' saltiness as a function of salt concentration. Over a small concentration range the response may be approximately linear. At higher concentrations the response may flatten off and in analogy with an analytical instrument one may consider that the panel is then performing outside its linear range. With preference testing the nature of the non-linearity is quite different. One does not measure the saltiness per se, but the condition that is best liked. Liking scores will show an optimum at some intermediate level of saltiness, so that the salty taste is neither too weak nor too strong. A table of correlations between the variables from the instrumental set and variables from the sensory set may reveal some strong one-to-one relations. However, with a battery of sensory attributes on the one hand and a set of instrumental variables on the other hand it is better to adopt a multivariate approach, i.e. to look at many variables at the same time taking their intercorrelations into account. An intermediate approach is to develop separate multiple regression models for each sensory attribute as a linear function of the physical/chemical predictor variables. Example Beilken et al. [12] have applied a number of instrumental measuring methods to assess the mechanical strength of 12 different meat patties. In all, 20 different physical/chemical properties were measured. The products were tasted twice by 12 panellists divided over 4 sessions in which 6 products were evaluated for 9 textural attributes (rubberiness, chewiness, juiciness, etc.). Beilken et al. [12] subjected the two sets of data, viz. the instrumental data and the sensory data, to separate principal component analyses. The relation between the two data sets, mechanical measurements versus sensory attributes, was studied by their intercorrelations. Although useful information can be derived from such bivariate indicators, a truly multivariate regression analysis may give a simpler overall picture of the relation. In recent years the application of techniques such as PLS regression to link the block of sensory variables to the block of predictor variables has become popular. PLS regression is well suited to data sets with relatively few objects and many highly correlated variables. It provides an analysis in terms of a few latent variables that often allows a meaningful interpretation and an effective graphical summary. When we analyze the data of Beilken et al. with PLS2 regression (see Section 35.7) a two-dimensional model is found to account for 65-90% of the variance of the sensory attributes, with the exception of the attributes juicy and greasy which cannot be modelled well with this set of explanatory variables.
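A sketch of such a PLS2 analysis is given below (Python, assuming scikit-learn is available). The matrices X and Y are hypothetical stand-ins of the same sizes as the instrumental (12 × 20) and sensory (12 × 9) tables, since the original data of Beilken et al. are not reproduced here.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(12, 20))      # instrumental measurements (products x variables)
Y = rng.normal(size=(12, 9))       # panel-mean sensory attributes

pls = PLSRegression(n_components=2, scale=True)
pls.fit(X, Y)

scores = pls.x_scores_             # product positions (cf. Fig. 38.12)
x_load = pls.x_loadings_           # instrumental loadings (cf. Fig. 38.13)
y_load = pls.y_loadings_           # sensory loadings (cf. Fig. 38.14)

# Percentage of variance of each sensory attribute explained by the 2-component model
Y_hat = pls.predict(X)
r2 = 1 - ((Y - Y_hat) ** 2).sum(0) / ((Y - Y.mean(0)) ** 2).sum(0)
print(np.round(100 * r2, 1))
```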
Fig. 38.12. Scores of products (meat patties) on the first two PLS dimensions.
Figure 38.12 shows the position of the twelve meat patties in the space of the first two PLS dimensions. Such plots reveal the similarity of certain products (e.g. C and D, or E and G) or the extreme position of some products (e.g. A or I or L). Figure 38.13 shows the loadings of the instrumental variables on these PLS factors and Fig. 38.14 the loadings of the sensory attributes. The plot of the products in the
Fig. 38.13. Loadings of predictor (instrumental) variables on the first two PLS dimensions.
Fig. 38.14. Loadings of dependent (sensory) variables on the first two PLS dimensions.
space of the PLS dimensions bears a fair resemblance to the PCA score plot based on the sensory variables only. As a consequence, the PLS loading plot of the sensory variables (Fig. 38.14) gives a picture similar to that of a PCA loading plot. The associations between the variables within each set are immediately apparent from Fig. 38.13 or Fig. 38.14. For example, "hardness", "hrdxchv", "R-punch" and "CP-peak" are all highly correlated and indicate the firmness of the product. As another example, "tensile" strength is positively correlated with the amount of protein and negatively correlated with "moisture" and "fat". It would require an exhaustive inspection of the 20 × 9 correlation table to obtain conclusions about the relationships between variables of the two sets similar to those derived from simply comparing or overlaying Figs. 38.13 and 38.14.
38.7 Temporal aspects of perception

In the foregoing we loosely talked about the intensity of a sensory attribute for a given sample, as if the assessors perceive a single (scalar) response. In reality, perception is a dynamic process, and a very complex one. For example, when a food product is taken into the mouth, the product disintegrates, emulsions are broken, and flavours are released and transported from the mouth to the olfactory (smell) receptors in the nose. Measuring these processes, analyzing and interpreting the results and, eventually, controlling them is of importance to the food
Fig. 38.15. Example of time-intensity (TI) curves.
manufacturer. There are many ways to study the temporal aspects of sensory perception. Experimentally, many methods have been developed to measure so-called time-intensity curves or TI-curves. A currently popular methodology is to use a slide-wire potentiometer or a computer mouse and to feed the data directly into the computer. The sort of curve that is obtained is shown in Fig. 38.15. Typically, one may characterize such a curve by a number of parameters, such as time-to-maximum-intensity, maximum intensity, time of decay and total area. The way to average such curves over panellists in order to derive a panel-average TI-curve is not trivial. Geometrical averaging in both the intensity and the time direction may help to best preserve shape. Separate analysis of variance of the characteristic parameters of the average curves can be used to assess the differences between the products. One may also try to fit a parametric function to each individual curve, for example a combination of two exponential functions (see Chapters 11 and 39). The curves are then characterized by their best-fitting parameters, and these are compared in a subsequent analysis. Another method is to leave the curves as they are and to analyze the whole set of curves by PCA. It would seem natural to consider each curve as a multivariate observation and the intensities at equidistant points in time as the variables. Since there is a natural zero point (t = 0, I = 0) in these measurements, it makes sense in this case not to center each curve around its mean intensity, but to analyze the raw intensity data. Also, since the different maximum intensities may be related to the concentration of the bitter component, it would be imprudent to scale the curves to a common standard (e.g. maximum, mean intensity, area). However, one might consider a log transformation allowing for the nature (ratio scale) of the intensity scale. Figures 38.16 to 38.18 show the result of such an uncentered PCA applied to a set of TI-curves obtained by 9 panellists [13]. The perceived intensity of bitterness
Fig. 38.16. Loading curves (PC1, non-centered PCA) for 4 bitter solutions.
of four solutions, caffeine and tetrahop at two concentration levels, was recorded as a function of time. The analysis is applied separately to each of the four bitter solutions. The loading plots for PC1 and PC2 are shown in Figs. 38.16 and 38.17. Notice that PC1 (Fig. 38.16) has little structure: it represents an equal weighting over most of the time axis. In fact the PC1 loading plot very much resembles the average curve for each product. This is a common outcome with non-centered PCA. The loading plot for PC2 (Fig. 38.17) has a more distinct structure. Since it has a negative part it does not represent a particular type of intensity curve; PC2 affects the shape of the curve. One notices in Fig. 38.17 a distinct difference in shape between the two tetrahop solutions and the two caffeine solutions. This interpretation of PC1 as a size component and PC2 as a contrast component is a familiar phenomenon in principal component analysis of data (see Chapter 31). A different interpretation is obtained if one considers a rotation of the PCs. Rotation of PC1 and PC2 does give more interpretable curves: PC1 + PC2 gives a curve that rises steeply and decays rapidly, representing a fast perception, whereas PC1 - PC2 gives a curve that starts to rise much more slowly, reaching its maximum much later, with a longer lasting perception. The score plot of the panellists in the space of PC1 and PC2 is shown in Fig. 38.18, for one of the four products. A high score along PC1, e.g. panellist 6, implies that the panellist gave overall high intensity scores. A high score on PC2
Fig. 38.17. Loading curves (PC2, non-centered PCA) for 4 bitter solutions.
Fig. 38.18. Score plot (PC2 v. PC1) based on non-centered PCA of TI-curves from 9 panellists for a bitter (caffeine) solution.
(e.g. panellist 9) implies a TI-curve with a relatively fast rise and an early peak, in contrast to a low (negative) score on PC2 (e.g. panellist 1), which implies a TI-curve with a relatively slow rise and a late peak. There have also been attempts to describe the temporal aspects of perception from first principles, the model including the effects of adaptation and integration of perceived stimuli. The parameters in the specific analytical model derived were estimated using non-linear regression [14]. Another recent development is to describe each individual TI-curve, f_i(t), i = 1, 2, ..., n, as derived from a prototype curve, S(t). Each individual TI-curve can be obtained from the prototype curve by shrinking or stretching the (horizontal) time axis and the (vertical) intensity axis, i.e. f_i(t) = a_i S(b_i t). The least squares fit is found in an iterative procedure, alternately adapting the parameter sets {a_i, b_i} for i = 1, 2, ..., n and the shape of the prototype curve [15].
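As an illustration of the uncentered PCA route, the sketch below (Python with numpy) decomposes a raw intensity matrix without column-centering. The TI-curves are synthetic stand-ins generated from two exponentials, since the data of [13] are not reproduced here.

```python
import numpy as np

# Synthetic TI-curves for 9 panellists: I(t) = A (exp(-k2 t) - exp(-k1 t)), k1 > k2
t = np.linspace(0, 90, 91)
rng = np.random.default_rng(4)
curves = np.array([
    a * (np.exp(-k2 * t) - np.exp(-k1 * t))
    for a, k1, k2 in zip(rng.uniform(40, 80, 9),
                         rng.uniform(0.3, 0.6, 9),
                         rng.uniform(0.03, 0.08, 9))
])                                    # shape: 9 panellists x 91 time points

# Uncentered PCA: SVD of the raw intensity matrix (no column-centering)
U, s, Vt = np.linalg.svd(curves, full_matrices=False)

pc1_loading = Vt[0]                   # roughly the average curve shape, up to sign (cf. Fig. 38.16)
pc2_loading = Vt[1]                   # a contrast that modifies the curve shape (cf. Fig. 38.17)
scores = U[:, :2] * s[:2]             # panellist scores (cf. Fig. 38.18)
print(scores.round(1))
```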
38.8 Product formulation

An important task of the food technologist is to optimize the ingredients (composition) or the processing conditions of a food product in order to achieve maximum acceptability. In practice this often has to be done under constraints of cost and of limited ranges of composition or processing conditions. Techniques such as response surface methodology (Chapter 24) and mixture designs (Chapter 25) are effective in formulation optimization. It is very often the case that the sensory perception of a product is not a simple linear function of the ingredients involved. A logarithmic function (Weber-Fechner law) or a power law (Stevens function) often describes the relation between perceived intensity (I) and concentration (c):

$$I = a \log(c) \qquad \text{(Weber-Fechner law)} \qquad (38.4)$$

$$I = a\,c^{n} \qquad \text{(Stevens' law)} \qquad (38.5)$$

The acceptance is generally a non-linear function of perceived intensity. A simple example is the salt level in a soup, which clearly has a level of maximum acceptability between too weak and too salty a taste. The experimental designs discussed in Chapters 24-26 for optimization can also be used for finding the product composition or processing condition that is optimal in terms of sensory properties. In particular, central composite designs and mixture designs are much used. The analysis of the sensory response is usually in the form of a fully quadratic function of the experimental factors. The sensory response itself may be the mean score of a panel of trained panellists. One may consider such a trained panel as a sensitive instrument for measuring the perceived intensities that are useful in describing the sensory characteristics of a food product.
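A sketch of fitting such a fully quadratic model in two factors is given below (Python with numpy). The design points and responses are hypothetical, standing in for, e.g., coded fat and corn syrup levels and a panel-mean sweetness score.

```python
import numpy as np

# Hypothetical central-composite-like design in two coded factors
x1 = np.array([-1, -1, 1, 1, 0, 0, 0, -1.41, 1.41, 0, 0])    # e.g. fat level
x2 = np.array([-1, 1, -1, 1, 0, 0, 0, 0, 0, -1.41, 1.41])    # e.g. corn syrup level
y  = np.array([40, 55, 45, 62, 58, 57, 59, 38, 50, 44, 66])  # panel-mean response

# Fully quadratic model: y = b0 + b1*x1 + b2*x2 + b11*x1^2 + b22*x2^2 + b12*x1*x2
X = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 2))

# The fitted coefficients define the response surface whose contours can be drawn
# (and overlaid for several attributes) as in Fig. 38.19.
```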
Example

Figure 38.19 shows the contour plots of the foaming behaviour, the uniformity of air cells and the sweetness of a whipped topping based on peanut milk with varying corn syrup and fat concentrations [16]. Clearly, fat is the most important variable determining foam (Fig. 38.19A), whereas the corn syrup concentration determines sweetness (Fig. 38.19C). It is rather the rule than the exception that more than one sensory attribute is needed to describe the sensory characteristics of a product. An effective way to make a final choice is to overlay the contour plots associated with the response surfaces for the various attributes. If one indicates in each contour plot which regions are preferred, then in the overlay a window region of products with acceptable properties is left (see Fig. 38.19D and Sections 24.5 and 26.4). In the
Fig. 38.19. Contour plots of foam (A), uniformity of air cells (B) and sweetness (C) as a (fully quadratic) function of the levels of fat and corn syrup. An overlay plot (D) shows the region of overall acceptability.
case of this example, products with >135 g fat and <175 g corn syrup per kg would give acceptable foaming behaviour, uniformity of air cells and sweetness. One might also construct a composite desirability function and optimize this composite response (see Section 26.4 on multi-criteria decision making). Sometimes, the best settings for the various responses are conflicting and there is no region where all properties are acceptable. In that case one has either to relax the acceptability limits for the various responses or to change the product concept.

References

1. M. Meilgaard, G.V. Civille and B.T. Carr, Sensory Evaluation Techniques, 2nd edn. CRC Press, Boca Raton, FL, 1987.
2. H.J.H. MacFie, Data Analysis in Flavour Research: Achievements, Needs and Perspectives, in: Flavour Science and Technology, M. Martens, G.A. Dalen and H. Russwurm Jr (Editors). Wiley, 1987.
3. H. Stone and J.L. Sidel, Sensory Evaluation Practices, 2nd edn. Academic Press, San Diego, 1993.
4. S. Siegel, Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, Tokyo, 1956.
5. R.A. Bradley, Science, statistics, and paired comparisons. Biometrics, 32 (1976) 213-232.
6. H.A. David, The Method of Paired Comparisons. Charles Griffin, London, 1963.
7. J.B. Kruskal, Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29 (1964) 1-27, 115-129.
8. S.S. Schiffman, M.L. Reynolds and F.W. Young, Introduction to Multidimensional Scaling: Theory, Methods and Applications. Academic Press, New York, 1981.
9. J.W. Sammon, A nonlinear mapping for data structure analysis. IEEE Trans. Comput., C-18 (1969) 401-409.
10. ABC Executive Flight Planner, June 1994. Reed Telepublishing, Dunstable, 1994.
11. S. de Jong, J. Heidema and H.C.M. van der Knaap, Generalized Procrustes analysis of coffee brands tested by five European sensory panels. Food Qual. Pref., 9 (1998) 111-114.
12. S. Beilken, L.M. Eadie, I. Griffiths, P.N. Jones and P.V. Harris, Assessment of the textural quality of meat patties. J. Food Sci., 56 (1991) 1465-1470.
13. G. Dijksterhuis, Principal component analysis of TI-curves: three methods compared. Food Qual. Pref., 5 (1994) 121-127.
14. P. Overbosch and S. de Jong, A theoretical model for perceived intensity in human taste and smell. Physiol. Behav., 45 (1989) 607-613.
15. G. Dijksterhuis and P. Eilers, Modelling time-intensity curves using prototype curves. Food Qual. Pref., 8 (1997) 131-140.
16. A. Abdullah, A.V.A. Resurreccion and L.R. Beuchat, Formulation and evaluation of a peanut milk based whipped topping using response surface methodology. Lebensm. Wiss. Technol., 26 (1993) 162-166.
Additional recommended reading

G.E. Arteaga, E. Li-Chan, M.C. Vazquez-Arteaga and S. Nakai, Systematic experimental designs for product formula optimization. Trends in Food Science Technol., 5 (1994) 243-253.
J.B. Kruskal and M. Wish, Multidimensional Scaling. SAGE, Newbury Park, CA, 1991.
P. Lea, T. Naes and M. Rødbotten, Analysis of Variance for Sensory Data. Wiley, London, 1997.
D.H. Lyon, M.A. Francombe, T.A. Hasdell and K. Lawson, Guidelines for Sensory Analysis in Product Development and Quality Control. Chapman and Hall, London, 1990.
H. Martens and H. Russwurm (Editors), Food Research and Data Analysis. Applied Science, London, 1983.
M. O'Mahony, Sensory Evaluation of Foods. Statistical Methods and Procedures. Marcel Dekker, New York, 1986.
T. Naes and E. Risvik (Editors), Multivariate Analysis of Data in Sensory Science. Data Handling in Science and Technology Series, Elsevier, Amsterdam, 1996.
J.R. Piggott (Editor), Sensory Analysis of Foods. Elsevier, London, 1984.
J.R. Piggott, Statistical Procedures in Food Research. Elsevier, London, 1987.
Chapter 39
Pharmacokinetic Models

Introduction

Rescigno and Segre [1], in one of the first significant textbooks on the subject, defined pharmacokinetics as "the study of drug concentrations in different tissues and the construction of models suited to interpret such phenomena". We can distinguish four aspects of pharmacokinetics, namely absorption, distribution, biotransformation and excretion of a drug. The term drug includes here any naturally occurring or synthetic chemical compound. Although it is highly desirable to deliver a drug to the site where it must exert its effect, this is not generally possible, except in those cases where the site of treatment is readily accessible (such as the skin). When topical (local) treatment is precluded, the drug must be carried to the target site via the circulatory or lymphatic system. To this end it must first be absorbed by the blood of the central circulatory system. Absorption is extremely rapid and complete when a drug is delivered directly to the circulation by means of an intravenous injection. Slower and occasionally less efficient ways of absorption are via the mucous membranes, by oral intake, by injection into muscle or fatty tissues, or by means of patches which are applied to the skin. Pharmaceutical technology continuously tries to explore novel ways of administration in order to deliver a drug under optimal conditions. Important considerations in this respect are the solubility of the compound, the speed of release, and the susceptibility to biological degradation. Once the substance has been absorbed, it is distributed by the blood towards the various tissues of the body. If the drug is highly lipophilic, it may be rapidly deposited into fatty tissues and slowly released thereafter. Some drugs are readily adsorbed onto the membranes of cells, especially those of the red blood cells; others may adsorb onto circulating proteins. The plasma of the blood, however, must carry the drug to the target tissue. If the treatment of a disease calls for a rapid and burst-like delivery to the target, then it is undesirable to have a large fraction of the drug stored away in various buffers. On the other hand, when a slow and progressive release is required, temporary storage in fatty tissues may be an advantage. Some tissues possess membranes which are highly selective for the passage of specific chemical substances. The blood-brain barrier and the placenta are among these.
Excretion is the process by which a substance leaves the body. The most common ways are via the kidneys and via the gut. Renal excretion is favored by water-soluble compounds that can be filtered (passively by the glomeruli) or secreted (actively by the tubuli) and that are collected into urine. Fecal excretion is followed by more lipid substances that are excreted from the liver into the bile, which is collected in the gut and passed out by the feces. Other routes of excretion are available through the skin and the lungs. As long as a compound circulates in the body it is at risk of being degraded by enzymes. This process is called biotransformation or metabolism. Many of the biotransformation processes occur in the liver which contains a wide variety of enzymes of the cytochrome P-450 family, all geared towards the breakdown of compounds that are foreign to the body (so-called xenobiotics). Metabolic breakdown may also occur in plasma by circulating enzymes or even in the vicinity of the target site. The pathway of biodegradation can be very complex and often results in several metabolites, each of which is distributed throughout the body and excreted in its own particular way. Some metabolites are neutral, others may be harmful, still others may possess a therapeutic effect. In some cases, a prodrug is delivered into the system which is inert by itself but which is metabolized near the target site into the active compound. Such a strategy is highly effective and least harmful to the system. In all these instances where biotransformation plays a relevant role, it will be necessary to study the kinetics of the major and most reactive metabolites in addition to those of the original compound. Pharmacokinetics is closely related to pharmacodynamics, which is a recent development of great importance to the design of medicines. The former attempts to model and predict the amount of substance that can be expected at the target site at a certain time after administration. The latter studies the relationship between the amount delivered and the observable effect that follows. In some cases the observable effect can be related directly to the amount of drug delivered at the target site [2]. In many cases, however, this relationship is highly complex and requires extensive modeling and calculation. In this text we will mainly focus on the subject of pharmacokinetics which can be approached from two sides. The first approach is the classical one and is based on so-called compartmental models. It requires certain assumptions which will be explained later on. The second one is non-compartmental and avoids the assumptions of compartmental analysis. Although the methods that are discussed in this chapter deal explicitly with the disposition of drugs in animals and humans, their scope is much wider. In general, these methods can be applied to study the transport of substances within parts of a system provided that these transports can be described by zero or first order kinetics. This applies, for example, when the rate of change of the amount in one part of the system depends linearly on the amounts present in all the various parts of the system. Applications are found commonly in first order chemical reactions,
uptake and washout of tracers, radioactive and fluorescent decay, etc. In a broader context, kinetic models find application in population dynamics, ecology, meteorology, epidemiology, etc. Non-linear models are of special importance in these fields. Although their formulation is often only slightly more involved than that of their linear counterparts, the complexity of their solutions has only recently been explored. They lie at the origin of fractal geometry and deterministic chaos. Fractal geometry arises when a system does not evolve along smooth paths, such as the stream lines in laminar flow, but follows highly complex patterns, such as the eddies in turbulent flow. The latter system appears to behave chaotically, yet obeys constraints which force it to remain under the influence of so-called attractors. The dimensionality of an attractor in a seemingly chaotic system is characterized by a fractional number rather than by an integer one, hence the name fractal [3].
39.1 Compartmental analysis

The origin of compartmental analysis is usually credited to Teorell [4], although some of the concepts had already been developed by physicists and physiologists in the eighteenth century in relation to the transfer of heat and the washout of tracer substances. Compartmental analysis can be applied when two major assumptions are satisfied. First, one must be able to decompose the system or organism into identifiable compartments which must be sufficiently homogeneous. The latter means that a substance is rapidly and uniformly mixed within each compartment. Very often, however, a compartment is to be considered as a lumped system, which groups apparently heterogeneous parts, for example all of the body except the blood compartment. The degree of lumping depends on the properties of the compounds and on the level of detail of the study. The second assumption requires that transport between compartments is governed by first order differential equations with constant coefficients of the general type:

$$\frac{dX_i}{dt} = \sum_{j \ne i} k_{ji} X_j - \left(\sum_{j \ne i} k_{ij}\right) X_i \qquad (39.1)$$
where Xi is the amount of drug within compartment i at time t, kij is the transfer constant for transport from compartment i to compartment j (in reciprocal time units) and where n represents the number of compartments. An illustration of transfer constants in the case of a two-compartment system is given in Fig. 39.1. The equation above states that the net change of the content of any compartment i equals the sum of all inflows from the other compartments minus the sum of all outflows towards the other compartments. Such a relation is also referred to as a mass balance differential equation, which is well-known in chemical engineering. In the
Fig. 39.1. Two-compartment open model composed of a central (plasma) compartment and a target (skin) compartment. This model assumes that a drug is delivered rapidly into plasma from which it is either exchanging with the target organ or eliminated by excretion or metabolism.
case of n compartments, the model is defined by n linear differential equations in n variables (Xi) with constant coefficients kij. The latter are the model parameters, which usually have to be estimated from observations of the concentrations of the substance in one or more compartments of the system at various times after the start of the experiment. The way in which these parameters are computed is explained below. These computed coefficients can be used in model simulations. They are most meaningful, however, when they can be used for the quantification of biological and physiological phenomena, such as the rate of clearance of the substance from the body via various routes of excretion and metabolism. In Fig. 39.1 we have represented a simple two-compartment open model for a substance which is injected intravenously. The model applies when the drug is mainly confined to the central compartment from which it is also directly excreted. Such a model is called 'open' as it allows for elimination of a substance from one or more of the compartments. It is also assumed here that a fraction of the drug penetrates the layers of the skin and that its biological or therapeutic effect is related to the concentration in this compartment. Such a crude model may be applicable, for example, to water-soluble antibiotics. Figure 39.2 shows a compartmental model for the same drug but on a more detailed level. Here we take into account that the drug is specifically excreted in the kidneys, that it is metabolized in the liver and that it accumulates in adipose (fatty) tissues. This type of compartmental model is called mammillary, which is characteristic of a central compartment that exchanges with all peripheral ones. In a catenary model, compartments are coupled one after the other in a chain-like fashion, as in a cascade of radioactive decay of isotopes. An even more detailed compartmental model is given by Fig. 39.3. In this case, the drug is administered orally. Therefore, it must be absorbed from the gut lumen into the gut tissue, from where it is carried by the hepatic
Fig. 39.2. Multicompartment model which, in addition to Fig. 39.1, takes into account that the drug is buffered in adipose (fatty) tissue, excreted by the kidneys and metabolized in the liver.
circulation to the liver. In this case, a large fraction of the drug may be metabolized during its first pass through the liver. Some of the drug (and its metabolites) is returned to the gut lumen via excretion in the bile. These substances are excreted by the gut into the feces or can re-enter the system through the gut tissue. The remaining part is exchanged between the liver and the plasma fluid. From the plasma, some part of the drug can be excreted through the kidneys, it can be stored in adipose tissues or it can be adsorbed on the membranes of red cells. Eventually, the drug penetrates the skin where it must produce its therapeutic benefits. This is an example of a mixed mammillary-catenary model [5]. An additional problem arises when the exchange processes are rate-limited. This may be caused by enzymes that become saturated when all their active sites are occupied by the drug, or it may be due to adsorbing proteins that have a limited binding capacity. In such cases, one obtains a type of Michaelis-Menten kinetics of the form:

$$V = \frac{dX}{dt} = \frac{V_{\max}\,X}{K_m + X} \qquad (39.2)$$
where V is the rate of the process, Vmax represents the maximal rate when the process is completely saturated, and Km denotes the Michaelis constant. A rate-limited process introduces a non-linearity which is incompatible with the usual
Fig. 39.3. Multicompartment model which, in addition to Fig. 39.2, models absorption of a drug via the gut, excretion via the bile and adsorption to the membranes of red cells.
methods of compartmental analysis. In that case, one cannot produce analytical solutions for the time course of the concentrations in the various compartments. In some cases one may assume that the amounts X are much smaller than Km, in which case the relation reduces to a first order linear term. Alternatively, if one can assume that X is much larger than Km, then this leads to a zero order term, which means that the rate is constant throughout. In some elementary cases, one can linearize the model by means of an appropriate transformation, as will be explained in the last section. If none of these applies, one must resort to non-linear methods of analysis, which are outside the scope of this chapter (see Chapter 11). In the following subsections we will deal systematically with the basic aspects of linear compartmental systems.
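As a numerical illustration of such a linear system, the sketch below (Python, assuming numpy and scipy are available; the transfer constants are hypothetical values) integrates the two-compartment open model of Fig. 39.1 by means of the matrix exponential, X(t) = exp(Kt) X(0).

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical transfer constants (per minute) for the model of Fig. 39.1
k12, k21, k1e = 0.05, 0.02, 0.01    # plasma->skin, skin->plasma, elimination

# System matrix of the linear mass-balance equations (39.1):
#   dX1/dt = -(k12 + k1e) X1 + k21 X2
#   dX2/dt =   k12 X1        - k21 X2
K = np.array([[-(k12 + k1e), k21],
              [k12,         -k21]])

X0 = np.array([10.0, 0.0])          # 10 mg injected into plasma at t = 0

for t in (0, 30, 60, 120, 240):     # minutes
    Xt = expm(K * t) @ X0           # analytical solution X(t) = exp(Kt) X(0)
    print(f"t = {t:4d} min   plasma = {Xt[0]:6.3f} mg   skin = {Xt[1]:6.3f} mg")
```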
39.1.1 One-compartment open model for intravenous administration

Figure 39.4a represents schematically the intravenous administration of a dose D into a central compartment from which the amount of drug Xp is eliminated with a transfer constant kpe. (The subscript p refers to plasma, which is most often used as the central compartment and which exchanges a substance with all other compartments.) We assume that mixing of the dose D, which is rapidly injected into a vein, with the blood is almost instantaneous. By taking blood samples at regular time intervals one can determine the time course of the plasma concentration Cp in the central compartment. This is also illustrated in Fig. 39.4b. The initial concentration Cp(0) at the time of injection can be determined by extrapolation (as will be indicated below). The elimination pool is a hypothetical compartment in which the excreted drug is collected. At any time the amount excreted Xe must be equal to the initial dose D minus the content of the plasma compartment Xp, hence:

$$X_e = D - X_p \qquad (39.3)$$
©
®
DoseD Cp<0)
x„.Cp Plasma Compartment 1^ >r
Xe D-
Elimination Pool
Fig. 39.4. (a) One-compartment open model with single-dose intravenous injection of a dose D. The transfer constant of elimination (excretion and metabolism) is kp^. (b) Time course of the plasma concentration Cp and of the contents in the elimination pool Xe.
456
- ^ = -^pe^P
(39.4)
with the initial condition that Xp(0) = D. The differential equation can be solved straightforwardly, yielding an exponential function with respect to time: Xp(0 = /)e"^^
(39.5)
and also X,(r) = D ( l - e " ^ ' ) Note that the solution is in terms of amounts X^ rather than observed plasma concentrations Cp. The conversion from amounts to concentrations is defined by means of: Cp(0 = - ^ ^p
= ^ e - V ^ =Cp(0)e-V
(39.6)
^p
where V^ represents the volume of distribution of the drug in plasma. The volume of distribution should not be confounded with the physical plasma volume (which is about four liters in adult humans). The former may exceed the latter by several orders of magnitude. This occurs when there is a large binding affinity of the drug molecules with membranes of red blood cells or with circulating proteins. In that case, only a minute fraction of the drug is recovered in the plasma, the larger part being buffered reversibly on membranes and proteins. We can derive the volume of distribution Vp from the dose D and from the observed plasma concentration at time zero Cp(0): V = - ^ P Cp(0)
(39.7)
The half-life time ty2 of a drug is an important pharmacokinetic parameter. In this simple model it can be obtained immediately from the solution in eq. (39.6): - ^ — = Cp(0)e"'-^-
(39.8)
from which we derive that: ,,.-f^
(39.9)
457
In a one-compartment model, the half-life time ty2 is inversely proportional to the transfer constant of elimination k^^. The longer it takes for a drug to be removed from the system, the longer is the period during which it can exert its therapeutic activity. A convenient way to estimate the half-life of a substance is to replot the observed plasma concentrations (Fig. 39.5a) on a decimal logarithmic scale versus time (Fig. 39.5b). Fitting a straight line on this semilogarithmic plot produces the extrapolated plasma concentration Cp(0) from the vertical intercept of the line. The transfer constant of elimination is determined from the slope of the line: log Cp(0 = log Cp(0) - 0.4343 k^,t = log B - s^t
(39.10)
with intercept log B = log Cp(0) and slope s^ = 0.4343 k^, which follows directly from the solution of the model defined by eq. (39.6). (Decimal logs are assumed throughout this chapter, unless specified otherwise.) The half-life time ty2 of the drug can also be estimated graphically from the semilogarithmic plot of Fig. 39.5b by determining the abscissa of the point on the fitted line which corresponds with an ordinate of Cp(0)/2. The concentration-time curve can be integrated numerically and yields the so-called area under the curve (AUC): AUC= j C^{t)At
(39.11)
which, in the case of a one-compartment model, can be worked out analytically: Cp(0) AUC = - ^ = ^pe
D
(39.12)
*^^p ^pe
The AUC is a measure of bioavailability, i.e. the amount of substance in the central compartment that is available to the organism. It takes a maximal value under intravenous administration, and is usually less after oral administration or parenteral injection (such as under the skin or in muscle). In the latter cases, losses occur in the gut and at the injection sites. The definition also shows that for a constant dose Z), the area under the curve varies inversely with the rate of elimination /Cpg and with the volume of distribution Vp. Figure 39.6 illustrates schematically the different cases that can be obtained by varying the volume of distribution Vp and the rate of elimination k^^ both on linear and semilogarithmic diagrams. These diagrams show that the slope (time course) of the curves are governed by the rate of elimination and that elevation (amplitude) of the curve is determined by the volume of distribution.
458
(a)
Cp
^
/.\
cA 1
-0.0051171
Cp(t) = 50.3e
Cp(0) —50-1
40 -^ \
30 H
2
\
\ Cp(0)
Cp(0)
1 20 H
10 -\
0 -1 0
'^__^ 100
, 200
,—
1
300
400
' 1
1
500
600
n*
1
700
800
t (min)
log Cp(t)= 1.701-0.005117 t
Arctan SQ —I— 200
300
—1— 500
—T
600
700
800
t (min) Fig. 39.5. (a) Plot of plasma concentration Cp (ng T^) versus time t. Cp(0) is the extrapolated initial concentration at time zero. At the half-life time tm the plasma concentration is half that of Cp(0). (b) Semilogarithmic plot of plasma concentration Cp (|ig 1"^) versus time t. The intercept B of the fitted line is the plasma concentration Cp at time 0. The slope s^ is proportional to the transfer constant of elimination kr^e.
459
®
®
Fig. 39.6. (a) Time courses of plasma concentration Cp in the one-compartment open model for intravenous injection, with different contingencies for the transfer constant of elimination kp^ and the volume of distribution Vp. (b) Time courses of plasma concentration Cp as in panel (a) on semilogarithmic plots.
From the AUC and knowing the dose, one can immediately derive from eq. (39.12) an important pharmacokinetic parameter which is the clearance Cl^ of the drug from the plasma:
P
AUC
- ^ p ^pe
(39.13)
Clearance is defined as the fraction of the volume of distribution Vp that is cleared of the drug per unit of time. In the case of elimination from the kidneys, the clearance provides a measure for the effectiveness of renal elimination with respect to the drug under study.
460
Example We consider the following synthetic plasma concentrations which are supposedly obtained after intravenous injection of 10 mg of a drug: r(min) 2 5 10 20 30 60 90 120 240 360 480 720 Cp(|LigH) 48.1 47.5 43.9 40.4 36.2 23.8 17.3 13.1 2.9 1.0 0.3 0.1 The data are also represented in Fig. 39.5a and have been replotted semilogarithmically in Fig. 39.5b. Least squares linear regression of log C^ with respect to time t has been performed on the first nine data points. The last three points have been discarded as the corresponding concentration values are assumed to be close to the quantitation limit of the detection system and, hence, are endowed with a large relative error. We obtained the values of 1.701 and 0.005117 for the intercept log B and slope s^, respectively. From these we derive the following pharmacokinetic quantities: Initial plasma concentration: Rate constant of elimination: Half-life time: Plasma volume of distribution: Area under the curve: Plasma clearance:
Cp(0) = 5 = 50.3 L | Lg V^ k^ = 5p/0.4343 = 0.0118 min"' r,^ = 0.693/A:p^ = 58.8 min V^ = DIB = 199 1 AUC = DIVJc^ - 4.268 mg min 1"' Cl^ = D/AVC = 2.34 1 min'
The synthetic data have been derived from a theoretical one-compartment model with the following settings of the parameters: Dose: Plasma volume of distribution: Initial plasma concentration: Half-life time: Rate constant of elimination: Area under the curve: Plasma clearance:
/)=10mg V^ = 200 1 C (0) = D/V = 50 L | Lg l' t^,^ = 60 min k^ = 0.693/ty^ = 0.01155 min' AUC = 4.329 mg min 1"' CI =2.31 1 min"' p
The synthetic data have been obtained by adding random noise with standard deviation of about 0.4 \ig 1"^ to the theoretical plasma concentrations. As can be seen, the agreement between the estimated and the computed values is fair. Estimates tend to deteriorate rapidly, however, with increasing experimental error. This phenomenon is intrinsic to compartmental models, the solution of which always involves exponential functions.
461
39.1.2 Two-compartment catenary model for extravascular administration This model is representative for the conditions described in the previous section, except for the mode of administration which can be oral, rectal or parenteral by means of injection into muscle, fat, under the skin, etc. (Fig. 39.7). In addition to the central plasma compartment, the model involves an absorption compartment to which the drug is rapidly delivered. This may be to the gut in the case of tablets, syrups and suppositories or into adipose, muscle or skin tissues in the case of injections. The transport from the absorption site to the central compartment is assumed to be one-way and governed by the transfer constant k^^ (Fig. 39.7a). The linear differential model for this problem can be defined in the following way:
®
©
DoseD D
Extravascular Compartment "ap
or""" Plasma Compartment
V
t Elimination Pool
Fig. 39.7. (a) Two-compartment catenary model for extravascular (oral or parenteral) administration of a single dose D which is completely absorbed. The transfer constant of absorption isfcap-(b) Time courses of the amount in the extravascular compartment Xa, the concentration in the plasma compartment Cp and the content in the elimination pool X^.
462
"dT"" "p^" j^
- ^ap ^ a
(39.14)
^pe ^ p
X, = D - ( X , +Xp) with the initial conditions: X,(0) = D
and
Xp(0) == 0
where X^, X^ and Xj, represent the quantities in the absorption and central plasma compartments, and in the elimination pool, respectively. The analytic solution of the system of differential equations in eq. (39.14) can be written as follows:
k ;"
Xp(0 = g
(39.15) (e-V-e-V)
The plasma function can be rewritten in terms of concentrations C^ rather than amounts of substance X^\ n C(t) = — ^
V
k k
*^p '^ap
"^
-k
(e'"-' - e " ' - ^ )
(39.16)
'^pe
The content X^ in the absorption compartment varies with time according to the solution of the one-compartment model which has been described in the previous section. This time course is represented by a single negative exponential function. Initially, the content is equal to the dose D and at a sufficiently long time after administration, the content becomes zero. The concentration C^ in the central compartment is initially zero, reaches a maximum and eventually becomes zero again. The content X^ of the elimination pool starts from 0 and asymptotically approaches the dose D at prolonged times (Fig. 39.7b). The solution for the central plasma compartment gives rise to three distinct cases depending on the difference between the transfer constants for absorption and elimination. Case 1 When /:^p > k^^, the time course of the plasma concentration C^ is dominated by the rate of elimination which is the slower of the two. This is desirable when a drug must be delivered as rapidly as possible to the plasma compartment, for example in the relief of acute pain. At sufficiently large times following administration, the
463
transient effect of absorption will have decayed such that one obtains approximately from eq. (39.16): r ( 0 - —
^"^
e-V
(39.17)
Case 2 If A:^p < /^pe, the time course of C^ is dominated by the slower rate of absorption. This is desirable when a drug is to be delivered over a prolonged period of time, for example in the relief of chronic pain. When a sufficiently large period of time has elapsed, the transient effect of elimination has decayed and the solution for the plasma compartment in eq. (39.16) becomes approximately: Co(0-^.
\
p ^pe
e-V
(39.18)
^ap
Case 3 In the theoretical situation when k^^ = k^^ the solution for C^ degenerates since both the numerator and denominator vanish and the expression becomes indeterminate. The limiting solution must then be derived from a rearrangement of the expression of C^ in eq. (39.16): p V >'
y^ P
ap
lim ^ap=V
(39.19) ^ap ~" ^pe
Differentiation of the numerator and denominator with respect to k - k , leads to: CJt) = ^kte-'^'
(39.20)
^p
For practical purposes we will assume here that k^^ > k^^. A solution of the two-compartment model can be derived by means of non-linear regression between experimental determinations of the plasma concentration C^ and the corresponding sample times t, using the exponential relation which has been derived above. This yields the transfer constants k^^, k^^ and the coefficient from which, knowing £>, the volume of distribution Vp can be obtained. An alternative graphical solution makes use of the biphasic exponential nature of the plasma concentration function in eq. (39.16). At larger time values, when the effect of absorption has decayed, the function behaves approximately as monoexponential. Under these conditions, and after replotting the concentration data on a (decimal) logarithmic scale, one obtains a straight line for the later part of the curve (Fig. 39.8a). This line represents the p-phase of the plasma concentration and is denoted by C^ :
464
(a)
log Cp(t)= 1.745-0.0051661
Arctansp/
^ — ^
—T
200
'\^
I
400
600
500
700
t(min)
(b)
log C_(t)= 1.745-0.0621 t
Arctan s„ /
I
:••;;....
— I —
20
t(min) Fig. 39.8. (a) Semilogarithmic plot of plasma concentration Cp (|ig T^) versus time t. The straight line is fitted to the later part of the curve (slow p-phase), with the exception of points that fall below the quantitation limit. The intercept B of the fitted line is the extrapolated plasma concentration C^ that would have been obtained at time 0 with an intravenous injection. The slope 5p is proportional to the transfer constant of elimination kp^. (b) Semilogarithmic plot of the residual plasma concentration C " (|ig r') versus time r, on an expanded time scale t. The straight line is fitted to the first part of the residual curve (fast a-phase), with the exception of points whose residuals fall below the quantitation limit. The intercept B of the fitted line, is the same as that in panel a. The slope 5a is proportional to the transfer constant of absorption from the extravascular compartment.
465
D logCP(0-log—
k ''
-OA343k^J = logB-s^t
(39.21)
with intercept log B and slope s^ = 0.4343 k^^. From the slope one obtains the transfer constant of elimination k , The next step is to compute the residuals between the observed plasma concentrations and the computed p-phase values, which are again replotted on a logarithmic concentration scale (Fig. 39.8b). The resulting line represents the absorption phase or a-phase of the plasma concentration: logC^(t) = logic (t)-C^(t)) = log
V k -k /p
'^ap '^pey
-0A343k^J 'P
(39.22)
= logB-s^t with the same intercept log B as derived above and slope s^^ = 0.4343 k^^, which yields the absorption transfer constant k^^. The procedure by which a sum of exponential functions is resolved into its component terms is called curve peeling. It allows to include empirical knowledge in the analysis about the number of exponential functions and possible time lag. A comparison of curve peeling with non-linear regression is presented in Section 39.1.6. Knowing the transfer constants k^^, k^^, the dose D and the intercept 5, we can compute the volume of distribution Vp from the expression (eq. (39.18): D
^an
— V k *^p '^ap
^^— = B -k
(39.23)
'^pe
The various effects produced by varying k^^, k^^ and Vp are illustrated in Fig. 39.9. In practice, the half-life time of the drug in the plasma compartment is derived from the p-phase, and is therefore denoted as t\i2:
.r„=5^
(39.24)
^pe
Likewise the half-life time of the drug in the extravascular compartment r "2 '^ given by: ^«2 z . 2 ^
(39.25)
^ap
The area under the curve AUC is obtained by integrating the plasma concentration function between times 0 and infinity. This integral can be obtained analytically from eq. (39.16):
466 |kap large
V , Ljg constant
k-n small ap
®
l^ap' '^e constant
^Vp small
©
Fig. 39.9. Time courses of plasma concentration Cp in a one-compartment model for extravascular administration, with different contingencies of (a) the transfer constant of absorption k^p, (b) the transfer constant of elimination kpc and (c) the volume of distribution Vp.
AUC= j C^(t)dt =
D
(39.26)
^P^e
and has the same form as in the case of the intravenous one-compartment model developed in the previous section. This is not surprising, however, since we know that the AUC is a measure of the bio-availability of the substance to the plasma compartment. In this model, all of the initially administered dose D must enter the
467
central compartment and, hence, the complete dose has an opportunity of exercising its effect. The situation would be different if a fraction of the dose were metabolised or excreted before entering the central plasma compartment. Such a situation will be discussed later on. The AUC is an important pharmacokinetic measure in the study of bio-equivalence for different modes of administration of a drug. The modes can be considered to be biologically equivalent when their AUC s are the same. The clearance of the drug from the plasma is then defined from the dose D and the AUC: C l = ^ - = V^k^, ^
AUC
(39.27)
P P
An important pharmacokinetic parameter is the time of appearance of the maximum t^ of the plasma concentration. This can be derived by setting the first derivative of the plasma concentration function in eq. (39.16) equal to zero and solving for t, which yields: '- -
k
-k
"
k
-k
^ ^ ^
where In means natural logarithm (base e). From this expression, one can see that the time t^^ only depends on the transfer constants of absorption k^^ and elimination k^^. Substitution of this expression in the plasma concentration function yields the value of the peak concentration C^it^J. Knowledge of the peak concentration is important since a drug usually has a minimum threshold for therapeutic effect and a maximum level above which toxic effects become manifest. Differentiation of the plasma concentration function in eq. (39.16) at time zero also yields the initial rate of change of C^ C'iO) = p
fdC^ dt
D y K,
(39.29)
This shows that the initial steepness of the plasma curve depends, for a given dose £), on the volume of distribution Vp and on the transfer constant of absorption k^^. The various characteristics of the plasma concentration curve are also schematically displayed in Fig. 39.9 for different contingencies of/c^p, k^^ and V^, Example The following synthetic plasma concentrations are representative for the time course of the plasma concentration C^ after a single oral administration of a dose of 10 mg:
468 r(min)
2
5
10
20
C_(|igr')
12.3 23.6 35.5 40.7
30
60
90
240
120
37. 1 26.9 19.8 14.0 3.1
360
480
720
1.1
0.1
0.1
- ->
<— p-phase
a-phase
noise
The semilogarithmic plot of the data is given in Fig. 39.8a. On this plot we can readily identify the linear p-phase of the plasma concentrations between 30 and 240 minutes. The last three points (at 360, 480 and 720 minutes) have been discarded because the corresponding plasma values are supposed to be close to the quantitation limit of the detection system. A least squares linear regression has been applied to the data pertaining to the p-phase, yielding the values of 1.745 and 0.005166 for the intercept log B and slope 5p, respectively. Using these results, we can compute the extrapolated plasma concentrations C^ between 0 and 20 minutes. From the latter, we subtract the observed concentrations C^ which yields the concentrations of the a-phase C " : r(min): C^Cngl-^): Cp(|ngH): Cp«(|jgH):
0 55.5 0 55.5
2 54.2 12.3 41.9
5 52.3 23.6 28.7
10 49.3 35.5 13.8
20 43.8 40.7 3.1
The semilogarithmic plot of the a-phase is shown in Fig. 39.8b. Through these data we force a least squares linear regression through the vertical intercept, i.e. the point with coordinates (0, 55.5). The latter is achieved by assigning a very large weight to this single point, for example 100 times the weight of each of the other points. This will guarantee that the regression line will pass through the vertical intercept of the semilogarithmic plot, as required from the theoretical considerations above. The resulting intercept log B and slope s^ are 1.745 and 0.0621, respectively. The pharmacokinetic parameters of the model are then readily derived from the defining equations and the results of the regression: Initial plasma cone, p-phase: Transfer constant of elimination:
C^ (0) = B = 55.5 \lg 1"' k = 5„/0.4343 = 0.0119 min"' pe
p
Half-life time, p-phase: Transfer constant of absorption: Plasma volume of distribution:
t^^ = 0.693/k^ = 58.3 min k^^ = 5^/0.4343 = 0.143 min"' V =k D/B(k - itJ = 196 1
Half-life time, a-phase: Area under the curve: Plasma clearance:
r«2 = 0-693/A:^ = 4.85 min AUC = DIVJc^ = 4.281 mg mini"' Cl^ = D/AVC = 2.34 1 min"'
p
ap
^
ap
pe^
469
Time of peak plasma cone: Initial rate of change of plasma cone:
t^ = (In kjk^)/{k^ - k^) = 19.0 min C p(0) = k^D/V^ = 6.37 |xg min"' V
The synthetic data have been computed from a theoretical two-compartment catenary model with the following parameters: Dose: Plasma volume of distribution: Half-life time of absorption: Transfer constant of absorption: Half-life time of elimination: Transfer constant of elimination: Area under the curve: Plasma clearance:
D = 10 mg :200l .a = 5 min '-111
n=
K-~: 0.693/r «2 = 0.1386 '1/2
= 60 min
=0.01155 AUC = 4.329 mg min \~' CK-= 2.311min"^
K- = 0.693/rf/2
The agreement with the theoretical and the 'experimental' results is still fair. This is due, however, to the rather moderate noise which has been superimposed on the theoretical plasma concentration values. The standard deviation of the random noise amounted to about 0.4 |ig 1~^ The success of a curve peeling operation also depends on the ratio of the transfer constants k^^ and k^^, which should differ at least by a magnitude of about 10. Errors tend to accumulate in the successive steps of a curve peeling process. As a consequence, we obtain that the transfer constant of the faster a-phase has been determined less accurately than the one of the slower p-phase. In general, the fitting of sums of exponential functions is an ill-conditioned problem, the solution of which rapidly deteriorates with increasing noise and with decreasing distinction between the transfer constants. 39,1.3 Two-compartment catenary model for extravascular administration with incomplete absorption The effect of incomplete absorption is that only a fraction F^ of a single-dose D is made available to the central plasma compartment. The solution of the previous model needs, therefore, to be modified by replacing the term D by F.^D. Consequently the area under the curve AUC^ under incomplete extravascular absorption will be smaller than the maximal AUC that results from complete absorption. The latter, as we have seen is equal to the AUC obtained from a single intravenous injection, which we denote by AUCj. These considerations can be summarized as follows:
470
F.'J^
(39.30)
If an oral or parenteral plasma concentration curve is available together with an intravenous one from the same subject(s), then one can derive the absorption fraction F^ from these. 39.1.4 One-compartment open model for continuous intravenous infusion In the previous discussion of the one- and two-compartment models we have loaded the system with a single-dose D at time zero, and subsequently we observed its transient response until a steady state was reached. It has been shown that an analysis of the response in the central plasma compartment allows to estimate the transfer constants of the system. Once the transfer constants have been established, it is possible to study the behaviour of the model with different types of input functions. The case when the input is delivered at a constant rate during a certain time interval is of special importance. It applies when a drug is delivered by continuous intravenous infusion. We assume that an amount D of a. drug is delivered during the time of infusion x at a constant rate k^ (Fig. 39.10). The first part of the mass balance differential equation for this one-compartment open system, for times t between 0 and x, is given by: dXp dt
"^
P'
"
(39.31)
with the initial condition that X^(0) = 0. For times t greater than x, the equation reduces to: dt X, =
=
-k^X P^ P
(39.32)
k^x-X^
with the conditions that the same result Xp(x) is obtained from both parts of the equation. The solution of the first part of the model is obtained by straightforward integration: X ro = — d - e - * - ' ) or equivalently:
(39.33)
471
®
®
Reservoir
1
>
;T
t
?T
t
Cp(t)
Plasma Compartment ^ Xe
f
1
^^^1
Elimination Pool
Fig. 39.10. (a) One-compartment open model for continuous intravenous infusion of a dose D from a reservoir. The rate of infusion is k^p. (b) Time courses of the content in the reservoir Xr, of the plasma concentration Cp and of the content of the elimination pool X^.
CM) =
"rp
(1-
-V )
(39.34)
^P^pe
where Vp is the volume of distribution of plasma. If the infusion is maintained for a sufficiently long time t, one obtains the condition for a steady-state plasma concentration which no longer changes with time. The condition follows immediately fromeq. (39.34), by setting tto infinity: C, =limC„(T) = ^ ^ p e
CL
(39.35)
Note that the steady-state plasma concentration Q^ varies proportionally with the rate of infusion k^ and inversely with the plasma clearance C/p, the definition of
472
which has been given by eq. (39.13). This result allows us to rewrite the solution of the model in eq. (39.34) in the form: Cp(0 = C , 3 ( l - e - V )
(39.36)
or equivalently: log 1 -
C.
= -0A343kt pe
(39.37)
Hence, the slope of the semilogarithmic plot of 1 - C^(t)/C^^ versus time t yields the transfer constant of elimination k^. From the known rate constant of infusion ^^p* the transfer constant of elimination k^ and a graphical estimate of C^^, one can then derive the plasma volume of distribution V^, using the steady-state condition which has been derived above. The transfer constant of elimination k^^ has already been shown to be related to the half-life time of the drug in the plasma compartment (eq. (39.9)): ty2 =
(39.38)
After substitution of this expression in the solution of the model in eq. (39.36) we obtain: Cp(r) = C,,(l-e-^-^^^'^''-)
(39.39)
which shows that the steady-state plasma concentration Q^ is only obtained when the time of infusion T is several times the half-life time ty2When infusion stops at time T, the plasma concentration Cp(T) decays exponentially according to the solution of the one-compartment open model in eq. (39.6): Cp(0 = Cp(T)e-^^'-^^
(39.40)
From a semilogarithmic plot of C^(t) versus r - T, one can again estimate the transfer constant of elimination k^. From the known values of Cp(T), ^^p and ^p^ one also obtains a new estimate of V^. In practice, one will seek to obtain an estimate of the elimination constant k^^ and the plasma volume of distribution V^ by means of a single intravenous injection. These pharmacokinetic parameters are then used in the determination of the required dose D in the reservoir and the input rate constant k^ (i.e. the drip rate or the pump flow) in order to obtain an optimal steady state plasma concentration
473
Example We make use of the previously defined parameters of the one-compartment open system (eq. (39.6)). Transfer constant of elimination:
k = 0.01155 min~^
Plasma volume of distribution:
V^ = 200 1
pe
In addition we assume that a dose Z) of 10 mg is delivered at a constant rate over a time span of 60 minutes. Duration of infusion: T = 60 min Rate of infusion: k^ = 10,000/60 = 166.7 |Lig min"' From the solution of the continuous administration model we can now derive the plasma concentration at the end of infusion T: 1 ftf\ 1
C(x) = • (1 - exp(-0.1155 X 60)) = 72.16 x (1 - 0.5) = 36.1 |ig l'^ P 200x0.01155 ^ This should be compared with the initial plasma concentration of 50 |Lig 1"^ which is obtained when the same dose D is injected as a single bolus. The final concentration C^(i) is also still far from the steady-state concentration C^^ (72.2 |ig r^), since the duration of infusion T is only equal to once the half-life time of the drug ty2 (60 min). 39.1.5 One-compartment open model for repeated intravenous administration We now consider the case where a single-dose D is injected repeatedly at constant time intervals 9. Again we consider the simplest possible situation which is the one-compartment open model with a transport constant for elimination k^^ and a volume of distribution V^ (Fig. 39.11). Administration is supposed to be by intravenous way which implies rapid distribution of the drug in the plasma compartment. Over the course of the study, one will observe peaks and troughs in the plasma concentration Cp. Peaks C ^^"^ occur at times 0, 0, 29, 39, etc. Troughs Cp"^'" are observed at times 9, 29, 39, etc. The time nQ marks the end of the nth cycle and the beginning of the (n+l)th one. Since our system behaves in a linear fashion, the concentration Cp at any time t is simply the sum of the concentration-time courses produced by all preceding injections. In particular, we obtain for the peaks and troughs:
474
®
©
Y Y Plasma Compartment
V Elimination Pool
0
e
29 39
Fig. 39.11. (a) One compartment open model for repeated intravenous injection of the same dose D at constant intervals 9. (b) Time course of the plasma concentration Cp with peaks C ^ ^ and troughs C ™" at multiples of the time interval 9. The peaks and troughs tend asymptotically toward the steady-state values C^^
C ™" (ne) = — (e"*^^ + e"^*^® +... + e"'^®) p
(39.41)
which follows from a consideration of Fig. 39.11 and from the general solution of the one-compartment open model in eq. (39.6). It is readily observed that the expression within brackets for the peak concentration C ^^"^ forms a geometric progression in which the first term equals one and with a common ratio of e~^®. (The common ratio is the factor which results from dividing a term of the progression by its preceding one.) For the sum of the finite progression in eq. (39.41) we obtain after n cycles: f^ max ^ P
(nG) =
-—l-e-^«
(39.42)
475
In the case of a sufficiently large number of cycles n we thus find that the peak and trough values approach their respective steady-state values C^^"^ and C^'"": ^ max ^ ^ max / ^ x ^
^
(39.43)
/ ^ min _ /^ max ^
SS
~
^
SS
XT
p
Usually, one has obtained an estimate for the elimination constant k^^ and the distribution volume Vp from a single intravenous injection. These pharmacokinetic parameters, together with the interval between administrations 9 and the single-dose Z), then allow us to compute the steady-state peak and trough values. The criterion for an optimal dose regimen depends on the minimum therapeutic concentration (which must be exceeded by C^^) and on the maximum safe concentration (which may not be exceeded by C^^^). It is readily seen that the magnitudes of the peak and trough values of C^ are inverse functions of the elimination constant k^^, the interval between successive dosings 9 and the distribution volume Vp. They are also directly proportional to the single dose D, In the limit, when the interval 9 between administrations becomes extremely small in comparison with the elimination half-life 0.693/^pg, the steady-state solutions are reduced to those already derived for a continuous intravenous infusion (eq. (39.35)): C j r ^ C ™ - — —^ = — ^ V k Q V k
(39.44)
since the rate of input k^ can be written in the form: ^rp-f
(39.45)
This property can be derived by means of a series expansion of the exponential function in eq. (39.43) and by neglecting higher order terms in 9. Example We consider again the pharmacokinetic parameters of the one-compartment model for a single intravenous injection (eq. (39.6)). Transfer constant of elimination:
k^ = 0.01155 min"^
Plasma volume of distribution: Single-dose: Interval between dosings:
Vp = 2001 D = 10 mg 9 = 60 min
pe
476
Note that the half-hfe time of the drug in the plasma compartment is equal to 0.693//:pe which amounts, in this case, to exactly 60 min. From the previously derived formula we compute the steady-state peak and trough concentrations in plasma:
c . . . = 25252
'
200 l-exp(-0.01155x60)
=^2_ . ,oo„,1-0.5
C3™" = 100 - 10000 / 200 = 50 |Lig 1-^ In this special case when the time between dosings is equal to the half-life time of the drug, we can deduce that the minimum (steady-state) plasma concentration with repeated dosing is equal to the peak concentration, obtained from a single dose. Under this condition, the corresponding maximum (steady-state) concentration is twice as much as the minimum one. 39.1.6 Two-compartment mammillary model for intravenous administration using Laplace transform This model is an extension of the one-compartment model for intravenous injection (Section 39.1.1) which is now provided with a peripheral buffering compartment which exchanges with the central plasma compartment. Elimination occurs via the central compartment (Fig. 39.12). The model requires the estimation of the plasma volume of distribution Vp and three transfer constants, namely k^^ for elimination and ^p^, /Cbp for exchange between the plasma and buffer compartments. These four parameters can be obtained from the mass balance differential equations: dX. ^p dr dX b dr
(39.46) •^pb^p
^bp^b
with X, = D-(Xp + Xb)
(39.47)
where Xp, X^ and X^ are the amounts of drug in the plasma and buffer compartments and in the elimination pool, respectively, and where D represents the single intravenous dose. These equations form a set of first order linear differential equations with constant coefficients and with initial conditions:
477
®
®
Plasma Compartment ^bp
"^pe y Buffer
Elimination Pool
Fig. 39.12. (a) Two-compartment mammillary model for single intravenous injection of a dose D. The buffer compartment exchanges with plasma with transfer constants /:pb and /:bp- (b) Time courses of the plasma concentration Cp, the content in the buffer compartment X^ and in the elimination pool X^..
Xp(0)=D
and Xb(0) = 0
While the simple linear models in the previous sections have been solved by straightforward integration, the present model (and more complicated ones) are more conveniently solved by means of the Laplace transform. In this section, we will outline only those properties of the Laplace transform that are directly relevant to the solution of systems of linear differential equations with constant coefficients. A more extensive coverage can be found, for example, in the text book by Franklin [6]. The Laplace transform of a time-dependent variable X{i) is denoted by Lap X{t) or x{s) and is defined by means of the definite integral over the positive time domain:
478
Lap X(t) = je-^' X(t)dt = x(s)
(39.48)
0
Since the integral is over time t, the resulting transform no longer depends on t, but instead is a function of the variable s which is introduced in the operand. Hence, the Laplace transform maps the function X(t) from the time domain into the 5-domain. For this reason we will use the symbol x{s) when referring to Lap X(t). To some extent, the variable s can be compared with the one which appears in the Fourier transform of periodic functions of time t (Section 40.3). While the Fourier domain can be associated with frequency, there is no obvious physical analogy for the Laplace domain. The Laplace transform plays an important role in the study of linear systems that often arise in mechanical, electrical and chemical kinetic systems. In particular, their interest lies in the transformation of linear differential equations with respect to time t into equations that only involve simple functions of s, such as polynomials, rational functions, etc. The latter are solved easily and the results can be transformed back to the original time domain. The Laplace transform of a first-order derivative is defined consistently with eq. (39.48) by means of the integral: Lap^^=fe-^d/ dt i0 dt
(39.49)
which can be integrated by parts: dXit) L a p ^ ^ = e-^'X(0 e "^yi) I -(-s) -^-s) { e-"X(t)dt = -X{0)+sLapXit) dt ^0 i
=s
x(s)-X(0)
0
(39.50) provided that X is a bounded function of t. The inverse Laplace transform is denoted by Lap"^: Lap-^jc(5) = X(0
(39.51)
Inverse Laplace transforms have been tabulated for most analytical functions, including power, exponential, trigonometric, hyperbolic and other functions. In this context we require only the inverse Laplace transform which yields a simple exponential: L a p - ^ — ^ = e-^^
(39.52)
479
When we apply these two properties of the Laplace transform to differential equations of our pharmacokinetic model in eq. (39.46), we obtain:
where x^ and x^ are functions of s. After rearrangement of eq. (39.53) we obtain a system of two linear equations in the two unknowns x^ and x^: (5 + fcpb +
fcpe)^p-^bp^b=^
^3^^^^
-^pb-^p+('^ + ^bp)-^b=0
The solution for x^ in eq. (39.54) can be written explicitly in the form: x(s) P
= D— ^^ =D ^ 5 2 + 5 ( a + p) + ap ( s + a ) ( s + P)
(39.55)
where a and P are called the hybrid transport constants. The latter are related to the physical transport constants by means of their sum and product: pb
bp
pe
(39.56)
«P=^pe^bp
If the transfer constants k^^, k^^ and k^^ are known, then the hybrid transfer constants a and P are the roots of the quadratic equation: f - Y(^pb + ^bp + ^pe) + Ke K, = 0
(39.57)
We can derive the plasma concentration function C^ in the 5-domain from: c(s) = ^ ^ P Vp
=— ^^ = — ^ + —^ Vp (jy + a)(jy + P) s + a s + ^
(39.58)
where the coefficients A^ and B^ are obtained by working out the right-hand part of the expression in eq. (39.58) and by equating the corresponding terms of the numerators in its right- and left-hand members. This yields the solution for Ap and fip as functions of a and p:
480
\
-^r and =^ ^"d B„^ P= = T r - ^ ^ Kp a-p " V a-p
(39.59)
The plasma concentration function C^ in the time domain is obtained by applying the inverse Laplace transform to the two rational functions in the expression for c^ in eq. (39.58): Cp(0 = Lap-^ Cp(5) = Ap e-«^ + B^ e'P^
(39.60)
where the coefficients A^, B^, a, P and V^ have to be estimated from the observed concentration-time data. From the initial conditions we must have that: Cp(0) = A p + 5 p = - ^
(39.61)
P
which is satisfied by the above results for A^ and B^. This relationship allows us to derive the volume of distribution V^:
provided that A^ and B^ can be determined. A similar solution can be derived for the contents X^ in the buffer compartment: X,(t) = A^Q-^ +B^c-^^
(39.63)
with the constraint that A^-\- B^ = 0 since ^^(O) must be zero. Usually, the buffer compartment is not accessible and, consequently, the absolute amount of X^ cannot be determined experimentally. For this reason, we will only focus our discussion on the plasma concentration Cp. It is important to know, however, that the time course of the contents in the two compartments is the sum of two exponentials, which have the same positive hybrid transfer constants a and p. The coefficients A and 5, however, depend on the particular compartment. This statement can be generalized to mammillary systems with a large number of compartments that exchange with a central compartment. The solutions for each of n compartments in a mammillary model are sums of n exponential functions, having the same n positive hybrid transfer constants, but with n different coefficients for each particular compartment. (We will return to this property of linear compartmental systems during the discussion of multi-compartment models in Section 39.1.7.) We now turn our attention to the graphical determination of the various parameters of our two-compartmental model, i.e. the plasma volume of distribution Vp,
481
the three transfer constants k^y^, k^^, k^^ and their derived quantities, such as the plasma clearance Cl^ and the bio-availability AUC. By convention, we assume that p is the smaller of the two hybrid transfer constants ((3 < a). If the difference is marked (e.g. by one decade) then we can clearly distinguish between the two exponentially decaying phases of the plasma concentration curve. The a-phase is the fastest and disappears rapidly after administration and the (i-phase is slower and dominates at longer times after administration. Under this circumstance it is possible to apply curve peeling on a semilogarithmic diagram, in the same way as has been done in the case of the two-compartment model with oral or parenteral administration defined by eq. (39.16). First, we determine the p-phase function on a semilog plot by means of linear regression on the later part of the concentration curve for which the time course is most linear (Fig. 39.13a). This yields: logCP(0 = logBp -0.4343pr
(39.64)
This yields numerical values for the intercept log B^ and slope P, which have been defined analytically before. The next step requires the determination of the residual concentrations C p by subtracting C^ from the observed C^. The resulting a-phase function is again obtained by means of linear regression on the earlier part of the time course of the logarithmic plasma concentration (Fig. 39.13b): logC«(0 = l o g ( C p ( 0 - C P ( 0 ) = logAp-0.4343a/
(39.65)
This way, we obtain numerical values for the intercept log A^ and the slope a, which have been defined formally before. Since we have now determined the hybrid constants a and p, we can derive V^ and k^^ from the coefficients A^ and B^ which resulted from our graphical analysis. The ratio of Ap to B^ produces (according to eq. (39.59)): ^ ^ a ^ - ^ ^p
(39.66)
^bp-P
from which the transfer constant ^^p from the buffer to the plasma compartments can be solved:
, -MlM
,39.67)
482
Using this result we now obtain the plasma distribution volume V^ from either A or fip(eq. (39.59)):
"
Ap a - p
"fip
a-p
(39.68)
Finally, we derive the remaining transfer constants k^y, and k^^ from the previously defined sum and product of the hybrid rate constants a and P (eq. (39.56)):
(a)
800
Fig. 39.13. (a) Semilogarithmic plot of the plasma concentration Cp (jiig 1"^) versus time t. The straight line is fitted to the later part of the curve (slow P-phase) with the exception of points that fall below the quantitation limit. The intercept B^ of the extrapolated plasma concentration C^ appears as a coefficient in the solution of the model. The slope is proportional to the hybrid transfer constant p, which is itself a function of the transfer constants of/:pe, itpb and k^p of the model, (b) Semilogarithmic plot of the residual plasma concentration C " (\xg 1"*) on an extended time scale t. The straight line is fitted on the first part of the residual curve (fast a-phase), with the exception of points whose residuals fall below the quantitation level. The intercept Ap appears as a coefficient in the solution of the model. The slope a is proportional to the hybrid transfer constant a, which is itself a function of the transfer constants of the pharmacokinetic model, (c) Semilogarithmic plot of the content X^ (|ig) in the buffer compartment. The rate constants of the exponential functions are the hybrid transfer constants a and p. The coefficients have been computed by means of the y-method (see text), using the results of curve peeling represented in panels a and b.
483
(b)
100
log Cp(t)= 1.542-0.024081
0.5 i
0.1
n— 10
-1
20
1
I
-1—
-1—
30
40
50
60
n 70
80
t (min)
(c)
700
t (min) Fig. 39.13 (continued).
800
484
k
- ^
^bp
(39.69)
^bp=a + P-^bp-^pe The area under the plasma concentration curve UAC results from integration of the sum of exponentials in eq. (39.60) between zero and infinity: AUC=J (Ape-+Bpe-P')dr = ^
+^
=— ^
(39.70)
which is, once again, the maximal bio-availability of the drug. This expression has also been obtained in the case of the one-compartment models for intravenous and for complete extravascular absorption (Sections 39.1.1 and 39.1.2). The plasma clearance Cl^ follows in the usual way from the dose D and the bio-availability AUC (eq. (39.27)): P
AUC
P
P'
Example We assume that the following synthetic plasma concentrations C^ have been obtained after a single intravenous administration at different times t\ /(min) Cp(ngr')
2 45.7
5 38.4
10 31.5
20 22.3
30 60 15.7 11.5
90
120
240
360
480
720
8.9
9.0
5.4
4.2
2.3
1.3
p-phase
a-phase
These data are also shown in the semilogarithmic plot in Fig. 39.13a, which clearly shows two distinct phases. A straight line has been fitted by least-squares regression through the data starting from the observation at 90 minutes down to the last one. This yields the values of 1.086 and -0.001380 for the intercept log B^ and slope 5p, respectively. From these results we have computed the extrapolated p-phase values C^ between 2 and 60 minutes. These have been subtracted from the experimental C^ values in order to yield the a-phase concentrations C " : r(min) Cp(Mgl-') CP(Mgl-')
2 45.7 12.2 33.5
5 38.4 12.0 26.4
10 31.5 11.8 19.7
20 22.3 11.5 10.8
30 15.7 11.1 4.6
60 11.5 10.1 1.4
485
The residual a-phase concentrations C " are shown in the semilogarithmic plot of Fig. 39.13b. Least-squares linear regression of log C " upon time produced 1.524 and -0.02408 for the intercept log A^ and the slope ^^c, respectively. From the curve peeling operation we thus obtained the following intercepts, hybrid transfer constants and half-life times of the a-and P-phases: ^p = 33.4 |ig r^ a = ^^ /0.4343 = 0.0555 min'^ t^2 = 0.693/a = 12.5 min 5p=12.2|Ligr^ p = 5p 70.4343 = 0.00318 rnin"^ /f/2 = 0.693/(3 = 218 min The ratio of the intercepts A^ to B^ indicates that the amplitude or peak value of the a-phase is 2.7 times larger than the one of the p-phase, while the ratio of a to p (or rf/2 to ty2) shows that the a-phase decays at a rate which is 17 times larger than that of the p-phase. Using the formulas derived above we can now compute the pharmacokinetic properties of the model: Transfer constant buffer to plasma: Transfer constant of elimination:
^^^p = (App + i5pa)/(Ap + B^) = 0.0172 min ' k^ = a^/k^^ = 0.0103 min"^
Transfer constant plasma to buffer: Plasma volume of distribution: Area under the curve: Plasma clearance:
^pb = ^ + P ~ ^bp ~ ^pe = 0.0312 min~^ Vp = D/(A^ + 5p) = 219 1 AUC = (AJa) + (Bp/p) = 4.442 mg min 1"' Cl^ = D/AUC = 2.25 1 min"'
The synthetic data have been derived from a theoretical model with the following parameters: Dose: D = 10 mg Plasma volume of distribution: = 2001 Transfer constant of elimination: K-= 0.01155 min"' Transfer constant buffer to plasma: K = 0.040 min"' Transfer constant plasma to buffer: pb = 0.020 min"' AUC = 4.329 mg min 1 Area under the curve: Plasma clearance: ch = 2.311 min"' a == 0.06816 Fast phase hybrid transfer constant: Slow phase hybrid transfer constant: p= : 0.003390
n=
A random noise with standard deviation of 0.4 |ig 1"^ has been added to the theoretical values in order to produce a realistic example. The specifications of the model are in part the same as those used for the one-compartment models which have been discussed above. The major distinction between this model and the
486
former ones is the addition of the buffer compartment which exchanges with the plasma compartment. A comparison between the various results (Figs. 39.5b, 39.8a and 39.13a) clearly shows that the buffer compartment produces a significant delay on the time course of the plasma concentration. At the beginning of this section we have assumed that hybrid rate constants should be larger than p by at least about one order of magnitude. In the case when a is close to p, residual plasma concentrations, after peeling off of the P-phase, may become very small and may even take negative values. Hence, the relative error on the semilogarithmic plot, which leads to the determination of a, will become very prominent. As a consequence, p will be estimated more accurately than a, A^ and ^p. Unfortunately, there is no real good alternative. One may perform a non-linear regression (see Chapter 11) on the plasma concentrations C^ using the graphic determinations of a, p, A^ and B^ as initial estimates for the parameters of the sum of exponential functions of time t. This will tend to spread out the estimation error over the four parameters, which are computed concurrently by the non-linear regression algorithm. In practical applications, however, it may be preferable to obtain the best possible estimate of p, even at the expense of the accuracy of the other parameters. Indeed, the rate constant of the P-phase determines the overall half-life of the substance in the system that is being modelled. In repetitive drug administration, for example, the half-life of the slowest phase (p in this case) determines the schedule of administration (once daily, twice daily, etc.) The determination of residues in tissues after discontinuation of medication is also governed by the slowest decaying phase. To conclude this section on two-compartment models we note that the hybrid constants a and p in the exponential function are eigenvalues of the matrix of coefficients of the system of linear differential equations: ^ pb "^ ^ pe
^ bp
~^pb
^bp
(39.71)
K=
If a and p are referred to by the general symbol y we must have that the following determinant A must be zero: A = |K-Yl| = 0
(39.72)
where I is the 2 x 2 identity matrix. Working out the determinant leads to the characteristic equation of the system (Section 29.6): f - Y(^pb + ^bp + V + ^pe ^bp = 0
(39.73)
The exponential functions, the weighted sums of which determine the time courses in the various departments, can thus be regarded as eigenfunctions of the phar-
487
macokinetic system [7]. The complete solution of this model in terms of the eigenvalues of the matrix of transfer constants is discussed below in Section 39.1.7.2. 39.1.7 Multi-compartment models 39.1.7.1 The convolution method Often, it is required to predict the time course of the plasma concentration from a model with oral administration or with continuous infusion, when only data from a single intravenous injection are available. In this case, the Laplace transform can be very useful, as will be shown from the following illustration. In the catenary model of Fig. 39.14a we have a reservoir, absorption and plasma compartments and an elimination pool. The time-dependent contents in these compartments are labelled X^, X^, X^ and X^, respectively. Such a model can be transformed in the ^-domain in the form of a diagram in which each node represents a compartment, and where each connecting block contains the transfer function of the passage from one node to another. As shown in Fig. 39.14b, the ;X/t)
(D
^ ^ Reservoir
®
©
x,(s)
Xa(t)
\bsorptioi
compartment
Xa(s)
1
•
•
•'ap (s+k^p) (s+kp^)
Xe«
;
'^^> ^\
^
^
x/s)
XpW
^ap
Plasma Compartment
Elimination Pool
>'':' ^ •
s
^\f ^ •
s
>-•
Xe(s)
Fig. 39.14. (a) Catenary compartmental model representing a reservoir (r), absorption (a) and plasma (p) compartments and the elimination (e) pool. The contents Xr, Xa, Xp and X^ are functions of time t. (b) The same catenary model is represented in the form of a flow diagram using the Laplace transforms Xr, Xa and Xp in the ^-domain. The nodes of the flow diagram represent the compartments, the boxes contain the transfer functions between compartments [ 1 ]. (c) Flow diagram of the lumped system consisting of the reservoir (r), and the absorption (a) and plasma (p) compartments. The lumped transfer function is the product of all the transfer functions in the individual links.
488
transfer function from the reservoir to the absorption compartment is l/(s + k^^), from the adsorption to the plasma compartment is k^^(s + ^p^), and from the plasma to the elimination pool is k^Js. Note that the transfer constant from an emitting input node and to a receiving output node appears systematically in the numerator and denominator of the connecting transfer function, respectively. This makes it rather easy to model linear systems, of the catenary, mammillary and mixed types. Parts of the model in the Laplace domain can be lumped by multiplying the transfer functions that appear between the input and output nodes. In Fig. 39.14c we have lumped the absorption and plasma compartments. The resulting Laplace transform of the output jCp(5) is then related to the input jc/5) by means of the transfer function g(s): x^(s) = g(s)x,is)
(39.74)
where 8(s) = -
, ^;;
,
,
(39.75)
This model can now be solved for various inputs to the absorption compartment. In the case of rapid administration of a dose D to the absorption compartment (such as the gut, skin, muscle, etc.), the Laplace transform of the reservoir function is given by: x^s) = D
(39.76)
In this case we obtain the simplest possible expression for the plasma function in the 5-domain: x^(s) = g(s)D = -
/^
;'
(39.77)
The inverse transform X^(t) in the time domain can be obtained by means of the method of indeterminate coefficients, which was presented above in Section 39.1.6. In this case the solution is the same as the one which was derived by conventional methods in Section 39.1.2 (eq. (39.16)). The solution of the twocompartment model in the Laplace domain (eq. (39.77)) can now be used in the analysis of more complex systems, as will be shown below. When the administration is continuous, for example by oral infusion at a constant rate offc^p,the input function is given by: k x^(s) = -^ s
(39.78)
489
and the resulting plasma function becomes: x^(s) = g(s)'^ = -
'"/')''
(39.79)
The inverse Laplace transform can be obtained again by means of the method of indeterminate coefficients. In this case the coefficients A, B and C must be solved by equating the corresponding terms in the numerators of the left- and right-hand parts of the expression: "^^ ^ ^ + _ A _ + _ ^
(39.80)
The inverse transform of the plasma function is then given by: Xp(0 = A + Be"^^^' +Ce"^^'
(39.81)
After substitution of the values of A, B and C we finally obtain the plasma concentration function X^{t) for the two-compartment open system with continuous oral administration: ^,it)
k.
= k^ ^ap
l-e"'-^ ^pe
^pe
1 e -k.j
A
(39.82)
^ap
From the above solution we can now easily determine the steady-state plasma content X^^ after a sufficiently long time t: X^^=-^
(39.83)
^pe
If X^(t) and X^(t) are the input and output functions in the time domain (for example, the contents in the reservoir and in the plasma compartment), then X^{t) is the convolution of X-{t) with G(t), the inverse Laplace transform of the transfer function between input and output: X^(r) =
JG(x)X,(t-x)d%=\G{t-x)X;{x)dx 0
0
= G(trX,it) where the symbol * means convolution, and where T denotes the integration variable.
490
In the 5-domain, convolution is simply the product of the Laplace transform: x^(s) = g{s) x.,(s)
(39.85)
By means of numerical convolution one can obtain X^(t) directly from sampled values of G(t) and X^(t) at regular intervals of time t. Similarly, numerical deconvolution yields X-^(t) from sampled values of G(t) and X^(t). The numerical method of convolution and deconvolution has been worked out in detail by Rescigno and Segre [1]. These procedures are discussed more generally in Chapter 40 on signal processing in the context of the Fourier transform. 39.1.7.2 The y-method Thus far we have only considered relatively simple linear pharmacokinetic models. A general solution for the case of n compartments can be derived from the matrix K of coefficients of the linear differential equations: A:,J ""^12
^21 ^22
^nl
~^n2
K=
(39.86) ~^\n
~^2n
^nn
in which an off-diagonal element k^j represents the transfer constant from compartment / to compartment^, and in which a diagonal element k^^ represents the sum of the transfer constants from compartment / to all n-\ others, including the elimination pool. If compartment / is not connected to say compartmenty, then the corresponding element k^^ is zero. The index 1 is reserved here to denote the plasma compartment. In the previous section we found that the hybrid transfer constants of a twocompartment model are eigenvalues of the transfer constant matrix K. This can be generalized to the multi-compartment model. Hence the characteristic equation can be written by means of the determinant A: A = |K-Yl| = 0
(39.87)
where I is the AI x n identity matrix. In the general case there will be n roots 7-, which are the eigenvalues of the transfer matrix K. Each of the eigenvalues defines a particular phase of the time course of the contents in the n compartments of the model. The eigenvalues are the hybrid transfer constants which appear in the exponents of the exponential function. For example, for the zth compartment we obtain the general solution:
491
^/(^) = X^//^"'^'
(39.88)
where G^j is the coefficient of theyth phase in the ith compartment. The coefficient G,y can be determined from the minors of the determinant A, as shown by Rescigno and Segre [1]:
G,j=xm
^(-iy^u^ A'
(39.89) Jy=yj
where Aj, is the minor of A, which is obtained by crossing out row 1 and column /, and where A' denotes the derivative (dA/dy) of A with respect to y. This general approach for solving linear pharmacokinetic problems is referred to as the y-method. It is a generalization of the approach by means of the Laplace transform, which has been applied in the previous Section 39.1.6 to the case of a two-compartment model. The theoretical solution from the y-method allows us to study the behaviour of the model, provided that the transfer constants in K are known. In the reverse problem, one must estimate the transfer constants in K from an observed plasma concentration curve C^(k). In this case, we may determine the hybrid transfer constants jj and the associated intercepts of the plasma concentration curve Gj • by means of graphical curve peeling or by non-linear regression techniques. Using these experimental results we must then seek to compute the transfer constants from systems of equations, which relate the hybrid rate constants y^ and the associated intercepts Gy to the transfer constants in K. We may also insert the general solution into the differential equation, their derivatives (at time 0) and their integrals (between 0 and infinity) in order to obtain useful relationships between the hybrid transfer constants y^, the intercepts Gy and the model parameters in K. The calculations can be done by computer programs such as PROC MODEL in SAS [8]. Not all linear pharmacokinetic models are computable, however, and criteria for computability have been described [9]. Some models may be indeterminate, yielding an infinity of solutions, while others may have no solution. By way of illustration, we apply the y-method to the two-compartment mammillary model for intravenous administration which we have already seen in Section 39.1.6. The matrix K of transfer constants for this case is defined by means of: ^ pb "^ ^ pe
^ bp
K= ~^pb
^bp
(39.90)
492
and the corresponding characteristic equation can be written in the form: p pe + "^ pb " y
~ ^ bp
A=IK-Yll =
=0
(39.91)
^bp-Y
-pb
The eigenvalues y of the model are the roots a and P of the quadratic equation: Y ' - Y ( / : p b + ^bp + ^ p e ) - ^ b p ^ p c = 0
(39.92)
the sum and product of which are easily derived: U
i2
P
Y,Y2 = a p
bp
pb
PC
^3^^^^
=^bp^pe
The general solution for the plasma compartment is now expressed as follows: Cp(f) = G„e-T''' +G,2e-^=' = Ape-™ +B^e-^
(39.94)
with D
f
(-l)A, A'
'"-''-t^
(39.95)
^Y=Yi=a
and
D f (-l)A, <^12 = ^ p =
V„p V
A' —
^T=72=P
which can be developed into: A =
D a-A:i,p V^2a-(k^, + k,^ + k^,) D
«P
=
P-^bp
Vp2p-(/:pb + ^bp + ^ c )
Dtt-^bp Vp a - P
(39.96)
^P^bp-P Vp a - P
using the expression in eq. (39.93) for the sum of the roots a + P in terms of the transfer constants. The expressions for A^ and Bp are exactly the same as those obtained in Section 39.1.6. In a similar way, we can derive a general solution for the contents in the buffer compartment:
493
X^CO^G^ie-^'^ +G22e-^^^ = A^e-"' +B^Q-^'
(39.97)
with G,,=A,=D
1'^—^ A J
=-D-^
(39.98) a-B
and G22
-B^=D
A'
^D-^"" y Y=Y2=P
^
The time course of X,^ in the buffer compartment is represented in Fig. 39.13c using the theoretical values for D, k^^, a and p of the two-compartment model of Section 39.1.6. At the time t^^ of peak concentration at 46.3 min, about half of the initial dose D, i.e. 5.02 mg, is stored in the buffer.
39.2 Non-compartmental analysis Non-compartmental analysis is a statistical moment theory which has been derived from chemical engineering and has been applied more recently to tracer and drug kinetics [10,11]. Although this approach does not assume that the system can be compartmentalized, it is not free from model assumptions. In fact, its assumptions may be even more stringent than those pertaining to compartmental analysis. In particular, statistical moment theory assumes that transit times of molecules inside a system follow a stochastic distribution, where the nature of the distribution depends on the structure of the system. (See Section 3.2 for a discussion of the moments of a distribution.) For example, if a tracer substance is injected into a body, the tracer particles are offered a choice of different paths within the circulatory system. Some will complete a full cycle through the body in a short time, while others may have been delayed during their transit through various organs. The results of non-compartmental analysis also depend critically on the accuracy of the method of measurement, as will be explained more fully in the example. The simplest non-compartmental parameter that can be obtained from the time course of the plasma concentration is its area under the curve AUC (see also Section 39.1.1):
AUC = J C p(Odr
(39.99)
494
Fig. 39.15. Area under a plasma concentration curve AUC as the sum of a truncated and an extrapolated part. The former is obtained by numerical integration (e.g. trapezium rule) between times 0 and r, the latter is computed from the parameters of a least squaresfitto the exponentially decaying part of the curve (p-phase).
This parameter can be obtained by numerical integration, for example using the trapezium rule, between time 0 and the time T when the last plasma sample has been taken. The remaining tail of the curve (between T and infinity) must be estimated from an exponential model of the slowest descending part of the observed plasma curve (β-phase), as shown in Fig. 39.15. The area under the curve AUC can thus be decomposed into a truncated and an extrapolated part:

AUC = AUC_{0,T} + AUC_{T,\infty} = \int_0^T C_p(t)\,dt + \int_T^{\infty} B e^{-\beta t}\,dt        (39.100)

with

AUC_{0,T} = \sum_{i=1}^{n-1} \tfrac{1}{2}\,[C_p(t_i) + C_p(t_{i+1})]\,(t_{i+1} - t_i)   and   AUC_{T,\infty} = \frac{B}{\beta}\,e^{-\beta T}        (39.101)
The mean residence time MRT of the drug in plasma can be expressed in the form:
MRT = \frac{\int_0^{\infty} t\,C_p(t)\,dt}{\int_0^{\infty} C_p(t)\,dt} = \frac{AUMC}{AUC}        (39.102)
where AUMC is called the area under the moment curve and is defined as the integral of t C_p(t) over time t ranging from 0 to infinity. The expression for MRT in eq. (39.102) defines the mean of the continuous variable t, using the corresponding plasma concentration C_p(t) as a weighting function. From this point of view, the denominator can be regarded as a normalizing constant which ensures that the integral of the weighting function over the whole time domain equals unity. This is consistent with the definition of the mean of a discrete variable in terms of the first moment of its probability density function, as was explained in Section 3.2. For this reason, the graph of t C_p(t) is called the (first) moment curve. The quantity AUMC can be decomposed into an observed and an extrapolated part:

AUMC = AUMC_{0,T} + AUMC_{T,\infty} = \int_0^T t\,C_p(t)\,dt + \int_T^{\infty} t\,B e^{-\beta t}\,dt        (39.103)
The extrapolated area can be expressed analytically by means of integration by parts between times T and infinity:
AUMC_{T,\infty} = \int_T^{\infty} t\,B e^{-\beta t}\,dt = \frac{B}{\beta}\left(T + \frac{1}{\beta}\right) e^{-\beta T}        (39.104)
The truncated part of the integral can be obtained by numerical integration (e.g. by means of the trapezium rule) of the function t C_p(t) between times 0 and T. The mean residence time MRT is an important pharmacokinetic parameter, especially when a substantial fraction of the drug is excreted or metabolized during its first pass through an organ, such as the liver. In the case of a one-compartment open model with single-dose intravenous administration, the mean residence time is simply the inverse of the elimination transfer constant k_pe, since according to the above definition we obtain:

MRT_{iv} = \frac{\int_0^{\infty} t\,C_p(t)\,dt}{\int_0^{\infty} C_p(t)\,dt} = \frac{D \int_0^{\infty} t\,e^{-k_{pe} t}\,dt}{D \int_0^{\infty} e^{-k_{pe} t}\,dt} = \frac{1}{k_{pe}}        (39.105)
since

\int_0^{\infty} t\,e^{-k_{pe} t}\,dt = \frac{1}{k_{pe}^2}   and   \int_0^{\infty} e^{-k_{pe} t}\,dt = \frac{1}{k_{pe}}
The mean residence time MRT can thus be defined as the time it takes for a single intravenous dose to be reduced to 36.8% of its initial value in a one-compartment open model. This follows from the property derived above in eq. (39.105) and from eq. (39.5):

D e^{-k_{pe}\,MRT_{iv}} = D e^{-1} = 0.368\,D        (39.106)

since k_{pe} MRT_{iv} = 1. This relationship shows the analogy with the previously derived half-life time t_{1/2} (in Section 39.1.1), which is the time required for a single intravenous dose to be halved in a one-compartment open system. Since we already derived in eq. (39.9) that:

t_{1/2} = \frac{0.693}{k_{pe}}

we can relate the two parameters by means of:

t_{1/2} = 0.693\,MRT_{iv}        (39.107)
When a single-dose intravenous and an oral (or other extravascular) plasma curve are both available from the same subject(s), one can define the mean absorption time MAT by means of the mean residence time obtained from the intravenous curve MRT_{iv} and the extravascular curve MRT:

MAT = MRT - MRT_{iv}        (39.108)
The parameter MAT is representative of the time that the drug remains unabsorbed [9]. From the mean residence time MRT, the area under the curve AUC and the administered dose D, one can derive the steady-state volume of distribution V_ss of the drug in plasma:
V_{ss} = \frac{D \cdot MRT}{AUC}        (39.109)
In the special case of a one-compartment open model, it can easily be shown that the steady-state volume V_ss is identical to the plasma volume V_p which has been defined before in Section 39.1.1 (eq. (39.12)):

V_{ss} = \frac{D \cdot MRT_{iv}}{AUC} = \frac{D}{AUC\,k_{pe}} = V_p        (39.110)
since in this case MRT is the reciprocal of the transfer constant of elimination k_pe. In the general case, however, V_ss is independent of the elimination of the drug and only depends on the non-compartmental parameters MRT and AUC, for a given dose D. In this respect, one can state that the plasma clearance Cl_p is also a non-compartmental parameter which, for a given dose D, depends only on the area under the curve AUC (cf. eq. (39.13)):

Cl_p = \frac{D}{AUC}
Other non-compartmental parameters that are easily obtainable from a plasma concentration curve are the time of appearance of the maximum t_max and the peak concentration value C_p(t_max). The variance of the residence times VRT is derived from the area under the second moment curve of the plasma concentration, AUSC:

VRT = \frac{\int_0^{\infty} (t - MRT)^2\,C_p(t)\,dt}{\int_0^{\infty} C_p(t)\,dt} = \frac{AUSC}{AUC}        (39.111)

VRT and the second moment function (t - MRT)^2 C_p(t) are related in the same way as MRT and the (first) moment function t C_p(t) in eq. (39.102). The quantity AUSC can be decomposed into an observed and an extrapolated part:

AUSC = AUSC_{0,T} + AUSC_{T,\infty} = \int_0^T (t - MRT)^2\,C_p(t)\,dt + \int_T^{\infty} (t - MRT)^2\,B e^{-\beta t}\,dt        (39.112)
where the mean residence time MRT and the area under the curve AUC have already been defined. An analytical expression can be derived for the extrapolated part of AUSC:

AUSC_{T,\infty} = \int_T^{\infty} (t - MRT)^2\,B e^{-\beta t}\,dt = \frac{B}{\beta}\left[\frac{1}{\beta^2} + \left(T - MRT + \frac{1}{\beta}\right)^2\right] e^{-\beta T}        (39.113)
as can be derived by means of integration by parts between times T and infinity. Application of the trapezium rule to the function (t - MRT)^2 C_p(t) between times 0 and T yields the truncated part of the integral. Finally, we can define the standard deviation of the residence times SRT as the square root of the variance VRT. The quantities AUMC and AUSC can be regarded as the first and second statistical moments of the plasma concentration curve. These two moments have an equivalent in descriptive statistics, where they define the mean and variance, respectively, in the case of a stochastic distribution of frequencies (Section 3.2). From the above considerations it appears that the statistical moment method strongly depends on numerical integration of the plasma concentration curve C_p(t) and of its products with t and (t - MRT)^2. Multiplication by t and (t - MRT)^2 tends to amplify the errors in the plasma concentration C_p(t) at larger values of t. As a consequence, the estimation of the statistical moments critically depends on the precision of the measurement process that is used in the determination of the plasma concentration values. This contrasts with compartmental analysis, where the parameters of the model are estimated by means of least squares regression.

Example

We reconsider the data used previously in Section 39.1.2 in the discussion of the two-compartment system for extravascular administration (e.g. oral, subcutaneous, intramuscular). The data are truncated at 120 minutes in order to obtain a realistic case. It is recalled that these data have been synthesized from a theoretical model and that random noise with a standard deviation of about 0.4 µg l^-1 has been superimposed.

t (min)         0     2     5    10    20    30    60    90   120
C_p (µg l^-1)   0   12.3  23.6  35.5  40.7  37.1  26.9  19.8  14.0
These data can be integrated numerically between times 0 and 120 minutes. The remaining part between 120 minutes and infinity must be extrapolated from the downslope of the curve (β-phase), which can be modelled by means of the exponential function:

C_p(t) = B e^{-\beta t} = 55.5\,e^{-0.0119\,t}
the coefficients of which have been determined graphically in Section 39.1.2 from the experimental data. We now provide the results that can be derived numerically from the data by means of the statistical moments model. The exact values provided by the theoretical model in Section 39.1.2 are added within parentheses.

Dose: D = 10 mg
Intercept of β-phase: B = 55.5 µg l^-1
Hybrid transfer constant of β-phase: β = 0.0119 min^-1
Last sample time: T = 120 min

Truncated area under the curve, by numerical integration: AUC_{0,T} = 3.151 mg min l^-1
Extrapolated area under the curve: AUC_{T,∞} = (B/β) e^{-βT} = 1.118 mg min l^-1
Area under the curve: AUC = AUC_{0,T} + AUC_{T,∞} = 4.270 mg min l^-1 (3.968)

Truncated area under the moment curve, by numerical integration: AUMC_{0,T} = 160.7 mg min^2 l^-1
Extrapolated area under the moment curve: AUMC_{T,∞} = (B/β)(T + 1/β) e^{-βT} = 228.1 mg min^2 l^-1
Area under the moment curve: AUMC = AUMC_{0,T} + AUMC_{T,∞} = 388.8 mg min^2 l^-1 (372.2)

Mean residence time: MRT = AUMC/AUC = 91.1 min (93.8)
Mean absorption time: MAT = MRT - MRT_{iv} = 7.1 min (7.2)
since in this case we find that: MRT_{iv} = 1/k_{pe} = 1/β = 84.0 min (86.6)
Truncated area under the second moment curve, by numerical integration: AUSC_{0,T} = 8472 mg min^3 l^-1
Extrapolated area under the second moment curve: AUSC_{T,∞} = (B/β)[1/β^2 + (T - MRT + 1/β)^2] e^{-βT} = 22168 mg min^3 l^-1
Area under the second moment curve: AUSC = AUSC_{0,T} + AUSC_{T,∞} = 30640 mg min^3 l^-1 (29948)

Variance of residence times: VRT = AUSC/AUC = 7176 min^2 (7548)
Standard deviation of residence times: SRT = VRT^{1/2} = 84.7 min (86.9)
Steady-state volume of distribution: V_{ss} = D·MRT/AUC = 213 l (236)
Plasma clearance: Cl_p = D/AUC = 2.34 l min^-1 (2.52)

It can readily be seen from this example that the contributions of the extrapolated areas to the total areas are relatively more important for the higher order moments. In this example, the contributions are 28, 61 and 72% for AUC, AUMC and AUSC, respectively. Because of this effect, the applicability of the statistical moment theory is somewhat limited by the precision with which plasma concentrations can be observed. The method also requires a careful design of the sampling process, such that both the peak and the downslope of the curve are sufficiently covered.
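For readers who wish to repeat these calculations, a short sketch (Python/NumPy; our own variable names) of the statistical-moment computations on the tabulated data. It follows eqs. (39.100)-(39.113) and should closely reproduce the values listed above:

```python
import numpy as np

# Plasma concentration data, truncated at T = 120 min
t = np.array([0, 2, 5, 10, 20, 30, 60, 90, 120.0])                   # min
Cp = np.array([0, 12.3, 23.6, 35.5, 40.7, 37.1, 26.9, 19.8, 14.0])   # µg/l

D = 10.0e3               # dose in µg
B, beta = 55.5, 0.0119   # intercept (µg/l) and hybrid constant (min^-1) of the β-phase
T = t[-1]

def trapz(y, x):
    # Trapezium rule of eq. (39.101)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

AUC = trapz(Cp, t) + (B / beta) * np.exp(-beta * T)
AUMC = trapz(t * Cp, t) + (B / beta) * (T + 1 / beta) * np.exp(-beta * T)
MRT = AUMC / AUC                                        # mean residence time (min)

AUSC = (trapz((t - MRT) ** 2 * Cp, t)
        + (B / beta) * (1 / beta ** 2 + (T - MRT + 1 / beta) ** 2) * np.exp(-beta * T))
VRT = AUSC / AUC                                        # variance of residence times
Vss = D * MRT / AUC                                     # steady-state volume (l)
Clp = D / AUC                                           # plasma clearance (l/min)

print(f"AUC = {AUC/1000:.3f} mg min/l")                 # about 4.27
print(f"MRT = {MRT:.1f} min, SRT = {np.sqrt(VRT):.1f} min")
print(f"Vss = {Vss:.0f} l, Clp = {Clp:.2f} l/min")
```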
39.3 Compartment models versus non-compartmental analysis Compartmental analysis is the most widely used method of analysis for systems that can be modeled by means of linear differential equations with constant coefficients. The assumption of linearity can be tested in pharmacokinetic studies, for example by comparing the plasma concentration curves obtained at different dose levels. If the curves are found to be reasonably parallel, then the assumption of linearity holds over the dose range that has been studied. The advantage of linear
Fig. 39.16. Paradigm for the fitting of sums of exponentials from a compartmental model (c) to observed concentration data (o), as contrasted with the results of statistical moment analysis (s). (After Thom [13].)
models is that they can be conveniently analysed by means of the Laplace transform and methods that are derived from it, such as convolution and the γ-method, which were discussed above. As we have shown above, pharmacokinetic compartmental analysis requires estimates of the transfer constants k_bp, k_pe and the volume of distribution V_p from experimental plasma concentration curves. This involves the fitting of sums of exponential functions, the number of which is generally equal to the number of compartments in the system (not including the reservoir and the elimination pool). The non-robustness of this problem is well known: relatively small errors may have a large effect on the estimated intercept and slope values. The graphical method of curve peeling on semilogarithmic plots is subject to error propagation from the slower phases to the faster ones, and cannot be recommended when more than three compartments are to be fitted or when the hybrid rate constants (slopes of the regression lines) are not clearly distinct. In that case, non-linear regression can be combined with curve peeling, where the latter then serves as a means for providing initial estimates of the slopes and intercepts. The alternative to compartmental analysis is statistical moment analysis. We have already indicated that the results of this approach strongly depend on the accuracy of the measurement process, especially for the estimation of the higher order moments. In view of the limitations of both methods, compartmental and statistical, it is recommended that both approaches be applied in parallel whenever possible. Each method may contribute information that is not provided by the other. The result of compartmental analysis may fit closely to the data using a model that is inadequate [12]. Statistical moment theory may provide a model which is closer to reality, although being less accurate. The latter point has been made in paradigmatic form by Thom [13] and is represented in Fig. 39.16.
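As an illustration of the combined strategy just described, a minimal sketch (Python with NumPy/SciPy) in which curve peeling on a synthetic biexponential curve supplies starting values that are then refined by non-linear regression. The data, parameter values and cut-off between the two phases are invented for the purpose of the example:

```python
import numpy as np
from scipy.optimize import curve_fit

def biexp(t, A, alpha, B, beta):
    """Sum of two exponentials, the generic form fitted in two-compartment analysis."""
    return A * np.exp(-alpha * t) + B * np.exp(-beta * t)

# Synthetic plasma curve (hypothetical parameters) with a little noise added
rng = np.random.default_rng(0)
t = np.arange(0, 181, 10.0)                                   # min
Cp = biexp(t, A=40.0, alpha=0.08, B=20.0, beta=0.01) + rng.normal(0, 0.4, t.size)

# Curve peeling on the semilogarithmic plot supplies starting values:
# the terminal (beta) phase from the late points, the fast (alpha) phase
# from the residuals of the early points.
tail = t > 80
b_slope, b_int = np.polyfit(t[tail], np.log(Cp[tail]), 1)
beta0, B0 = -b_slope, np.exp(b_int)
resid = Cp[~tail] - B0 * np.exp(-beta0 * t[~tail])
a_slope, a_int = np.polyfit(t[~tail][resid > 0], np.log(resid[resid > 0]), 1)
alpha0, A0 = -a_slope, np.exp(a_int)

# Non-linear least squares refinement of the peeled estimates
popt, _ = curve_fit(biexp, t, Cp, p0=[A0, alpha0, B0, beta0])
print("A, alpha, B, beta =", np.round(popt, 3))
```

The peeled values only need to be rough; their role is to place the non-linear regression in the neighbourhood of the solution.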
Fig. 39.17. Schematic illustration of Michaelis-Menten kinetics in the absence of an inhibitor (solid line) and in the presence of a competitive inhibitor (dashed line). (a) Plot of initial rate (or velocity) V against amount (or concentration) of substrate X. Note that the two curves tend to the same horizontal asymptote for large values of X. (b) Lineweaver-Burk linearized plot of 1/V against 1/X. Note that the two lines intersect at a common intercept on the vertical axis.
39.4 Linearization of non-linear models

A classical non-linear model of chemical kinetics is defined by the Michaelis-Menten equation for rate-limited reactions, which has already been mentioned in Section 39.1.1:

V = -\frac{dX}{dt} = \frac{V_{max}\,X}{K_m + X}        (39.114)
where V represents the rate (or velocity) of the process and X represents the amount (or concentration) of the substrate. The parameters of this model are the maximal rate of the process V_max and the saturation constant K_m. The latter can also be defined as the amount of substrate which produces a rate V equal to half the maximal rate V_max, as can be verified by setting X equal to K_m in the above relation. Usually, one plots the initial rate V against the initial amount X, which produces a hyperbolic curve, such as shown in Fig. 39.17a. The rate and amount at time 0 are larger than those at any later time. Hence, the effects of experimental error and of possible deviation from the proposed model are minimal when the initial values are used. The Michaelis-Menten equation can be linearized by taking reciprocals on both sides of eq. (39.114) (Section 8.2.13), which leads to the so-called Lineweaver-Burk form:
\frac{1}{V} = \frac{K_m}{V_{max}}\,\frac{1}{X} + \frac{1}{V_{max}}        (39.115)
slope = K_m/V_{max}        intercept = 1/V_{max}

The parameters V_max and K_m can be obtained from the intercept and slope of the linear relationship between 1/V and 1/X, as shown in Fig. 39.17b. Least-squares linear regression (Chapter 8) of 1/V upon 1/X, although a simple solution to the problem, suffers from a lack of robustness which is induced by the reciprocal transformation of V and X. Serious heteroscedasticity (or lack of homogeneity of variance) occurs in the differences between observed and computed values of 1/V. Indeed, the variances of these residuals tend to increase with large values of 1/X. (It should be noted that the variance of 1/V is approximately equal to the variance of V divided by the fourth power of V, and V becomes vanishingly small when X tends to zero.) This could be remedied by means of weighted linear regression, in which each point is assigned a weight which is inversely proportional to the corresponding variance. (Heteroscedasticity and weighted regression have been explained in Section 8.2.3.) Unfortunately, the distribution of 1/V at the various levels of 1/X is usually unknown. It has been recommended, therefore, to use distribution-free methods of linear regression [14]. A robust procedure is to compute the intercept and slope of the linear relationship by means of the single median regression method (Section 12.1.5.1). Another linearized form of the Michaelis-Menten equation is defined by:

\frac{X}{V} = \frac{1}{V_{max}}\,X + \frac{K_m}{V_{max}}        (39.116)

slope = 1/V_{max}        intercept = K_m/V_{max}

This variant can be derived from the Lineweaver-Burk form in eq. (39.115) by multiplying both sides by X. From a statistical point of view, it does not seem to have an advantage over the Lineweaver-Burk form [14]. The latter variant, however, can be more easily extended to more complex systems of substrate-enzyme reactions, as will be shown below. In the case of competitive inhibition, the substrate is displaced by a substance which has greater affinity for the enzyme (or receptor protein) than its natural substrate. For example, a competitive inhibitor (or antagonist) will try to occupy the binding sites such that the enzyme is prevented from exerting its normal activity on the substrate. It is assumed here that the binding between inhibitor and
enzyme is reversible. The inhibiting activity depends on the amount (or concentration) Y of the competing substance and on the inhibition constant K_i of the inhibitor-enzyme complex. The potency of a competitive inhibitor is inversely proportional to its K_i value. The relationship between the rate of the enzymatic reaction V and the amount of substrate X can be expressed in a linearized form:

\frac{1}{V} = \left(1 + \frac{Y}{K_i}\right) \frac{K_m}{V_{max}}\,\frac{1}{X} + \frac{1}{V_{max}}        (39.117)

slope = (1 + Y/K_i)\,K_m/V_{max}        intercept = 1/V_{max}

Note the close analogy with the Lineweaver-Burk form of the simple Michaelis-Menten equation. In a diagram representing 1/V against 1/X one obtains a line which has the same intercept as in the simple case. The slope, however, is larger by a factor (1 + Y/K_i), as shown in Fig. 39.17b. Usually, one first determines V_max and K_m in the absence of a competitive inhibitor (Y = 0), as described above. Subsequently, one obtains K_i from a new set of experiments in which the initial rate V is determined for various levels of X in the presence of a fixed amount of inhibitor Y. The slope of the new line can be obtained by means of robust regression. The cases of non-competitive inhibition and even more complex non-linear reaction kinetics will not be discussed further here.

Example

We consider the initial velocities V observed with different substrate concentrations X in a rate-limited enzymatic reaction [15]:

X (mM)        1.25    1.67    2.50    5.00    10.0    20.0
V (mg/min)    0.101   0.130   0.156   0.250   0.303   0.345
Ordinary least squares regression of 1/V upon 1/X produces a slope of 9.32 and an intercept of 2.36. From these we derive the parameters of the simple Michaelis-Menten reaction (eq. (39.115)):

V_max = 0.424 mg/min
K_m = 3.95 mM

We now consider the case of a competitive inhibitor which has been added to the above reaction at the fixed concentration of 40 mM [15]. The following initial velocities of the competitively inhibited Michaelis-Menten process are observed at the same substrate concentrations as above:
X (mM)        1.25    1.67    2.50    5.00    10.0    20.0
V (mg/min)    0.061   0.074   0.106   0.169   0.227   0.313
Ordinary least squares regression of 1/V upon 1/X yields a slope of 17.7 and an intercept of 2.46. Using the previously derived values for V_max and K_m, and setting Y equal to 40, we can derive the inhibition constant of the competitively inhibited Michaelis-Menten reaction (eq. (39.117)): K_i = 44.4 mM.

Non-linear models, such as described by the Michaelis-Menten equation, can sometimes be linearized by a suitable transformation of the variables. In that case they are called intrinsically linear (Section 11.2.1) and are amenable to ordinary linear regression. This way, the use of non-linear regression can be obviated. As we have pointed out, the price for this convenience may have to be paid in the form of a serious violation of the requirement for homoscedasticity, in which case one must resort to non-parametric methods of regression (Section 12.1.5).
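A compact sketch (Python/NumPy; our own variable names) of the two Lineweaver-Burk regressions and of the derivation of V_max, K_m and K_i used in the example above:

```python
import numpy as np

X = np.array([1.25, 1.67, 2.50, 5.00, 10.0, 20.0])          # substrate (mM)
V0 = np.array([0.101, 0.130, 0.156, 0.250, 0.303, 0.345])   # rates, no inhibitor (mg/min)
Vi = np.array([0.061, 0.074, 0.106, 0.169, 0.227, 0.313])   # rates, 40 mM inhibitor
Y = 40.0                                                    # inhibitor concentration (mM)

# Lineweaver-Burk fit of the uninhibited data, eq. (39.115)
slope0, icept0 = np.polyfit(1 / X, 1 / V0, 1)
V_max = 1 / icept0
K_m = slope0 * V_max
print(f"slope = {slope0:.2f}, intercept = {icept0:.2f}")    # about 9.32 and 2.36
print(f"V_max = {V_max:.3f} mg/min, K_m = {K_m:.2f} mM")    # about 0.424 and 3.95

# Fit of the inhibited data; the slope grows by the factor (1 + Y/K_i), eq. (39.117)
slope_i, _ = np.polyfit(1 / X, 1 / Vi, 1)
K_i = Y / (slope_i * V_max / K_m - 1)
print(f"inhibited slope = {slope_i:.1f}, K_i = {K_i:.1f} mM")   # about 17.7 and 44
```

A robust (single median) regression could be substituted for np.polyfit without changing the rest of the script.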
References

1. A. Rescigno and G. Segre, Drug and Tracer Kinetics. Blaisdell, Waltham, MA, 1966.
2. P.J. Lewi, J.J.P. Heykants, F.T.N. Allewijn, J.G.H. Dony and P.A.J. Janssen, On the distribution and metabolism of neuroleptic drugs. Part I: Pharmacokinetics of haloperidol. Drug Res., 20 (1970) 943-948.
3. J. Gleick, Chaos: Making a New Science. Viking, New York, 1987.
4. T. Teorell, Kinetics of distribution of substances administered to the body. I. The extravascular modes of administration. Arch. Int. Pharmacodyn., 57 (1937) 205-225; II. The intravascular modes of administration. Arch. Int. Pharmacodyn., 57 (1937) 226-240.
5. F.F. Farris, R.L. Dedrick, P.V. Allen and J.C. Smith, Physiological model for the pharmacokinetics of methyl mercury in the growing rat. Toxicol. Appl. Pharmacol., 119 (1993) 74-90.
6. P. Franklin, An Introduction to Fourier Methods and the Laplace Transformation. Dover, New York, 1958.
7. P.J. Lewi, Pharmacokinetics and eigenvector decomposition. Arch. Int. Pharmacodyn., 215 (1975) 283-300.
8. G.S. Klonicki, Using SAS/ETS Software for Analysis of Pharmacokinetic Data. SAS Institute Inc., Cary, NC, 1993.
9. P.L. Williams, Structural identifiability of pharmacokinetic models - compartments and experimental design. J. Vet. Pharmacol. Therap., 13 (1990) 121-131.
10. P.R. Mayer and R.K. Brazell, Application of statistical moment theory to pharmacokinetics. J. Clin. Pharmacology, 28 (1988) 481-483.
11. J. Powers, Statistical analysis of pharmacokinetic data. J. Vet. Pharmacol. Therap., 13 (1990) 113-120.
12. K. Zierler, A critique of compartmental analysis. Ann. Rev. Biophys. Bioeng., 10 (1981) 531-562.
13. R. Thom, Structural Stability and Morphogenesis. Benjamin-Cummings, Reading, MA, 1975.
14. I.A. Nimmo and G.L. Atkins, The statistical analysis of non-normal (real?) data. TiBS (Trends in Biochemical Sciences), 4 (1979) 236-239.
15. R.J. Tallarida and R.B. Murray, Manual of Pharmacological Calculations. Springer, New York, 1981, pp. 35-37.
Additional recommended reading:

F. Belpaire and M. Bogaert, Fundamentals of Pharmacokinetics (in Dutch). University of Gent, Heymans Institute, Gent, 1978.
M. Berman, The formulation and testing of models. Ann. N.Y. Acad. Sci., 108 (1963) 182-194.
A. Rescigno, Mathematical foundations of linear kinetics. In: Pharmacokinetics: Mathematical and Statistical Approaches to Metabolism and Distribution of Chemicals and Drugs (J. Eisenfeld and M. Witten, Eds.). North-Holland, Amsterdam, 1988.
J.G. Wagner, Biopharmaceutics and Relevant Pharmacokinetics. Drug Intelligence Publ., Hamilton, IL, 1971.
Chapter 40
Signal Processing

40.1 Signal domains

When measuring a signal, one records the magnitude of the output or the response of a measurement device as a function of an independent variable. For instance, in chromatography the signal of a Flame Ionization Detector (FID) is measured as a function of time. In spectrometry the signal of a photomultiplier or diode array is measured as a function of the wavelength. In a potentiometric titration the potential of an electrode is measured as a function of the added volume of titrant. Time, wavelength and added volume in the above-mentioned examples are the domains of the measurement. A chromatogram is measured in the time domain, whereas a spectrum is measured in the wavelength domain. Usually, signals in these domains are directly translated into chemical information. In spectrometry, for example, peak positions are calculated in the wavelength domain, and in chromatography they are calculated in the time domain. Signals in these domains are directly interpretable in terms of the identity or amount of chemical substances in the sample. In some cases one may consider a measurement domain which is complementary to one of the domains mentioned before. By 'complementary' we mean that each value in the complementary domain contains information on all variables in the other domain. In this terminology, the wavelength and frequency of the radiation emitted by a light source in spectrometry are not complementary domains. One frequency corresponds to a single wavelength (by the relation ν = c/λ) and not to a range of wavelengths. On the contrary, each individual point of an interferogram measured with a Fourier Transform Infra Red (FTIR) spectrometer at a certain displacement δ of one of the mirrors in the Michelson interferometer contains information on all wavelengths in the IR spectral domain, i.e. on the whole IR spectrum. One can switch between two complementary domains by a mathematical operation. The Fourier transform (FT), which is the main subject of this chapter, is such a mathematical operation. We illustrate this with the way an FTIR spectrophotometer works [1]. In FTIR spectroscopy, the radiation from the light source is split into two beams by a beam-splitter. Behind the splitter, the two beams
Fig. 40.1. Mirror assembly of an FTIR Michelson interferometer (source, beam splitter, fixed mirror, movable mirror with its direction of motion, and detector).
are recombined by a mirror before reaching the detector (Fig. 40.1). If the path lengths of both beams are equal, the intensity of the light reaching the detector is equal to the intensity of the light source before splitting. However, when the path lengths of both beams are unequal the measured intensity is lower, because some frequencies of the light are out-of-phase. This is shown in Fig. 40.2 for light consisting of a single frequency. For a path length difference δ equal to zero, the intensity is maximal; for δ = λ/2 the intensity is zero and for δ = λ/5 the intensity is in between. By slowly moving a mirror, the path length difference between the two beams is changed as a function of the position of the mirror. As a result, the light
Fig. 40.2. Interference of two beams with the same frequency (wavelength). The path-length difference between beams (a) and (b) is zero, between (a) and (c) λ/2, between (a) and (d) λ/5; (a+b), (a+c) and (a+d) are the amplitudes after recombination of the beams.
intensity reaching the detector varies according to a pattern which depends on the path length difference and on the absorbance of a sample placed in one of the beams. At each path length difference or displacement δ, frequencies coinciding with δ = (2k + 1)λ/2 are out-of-phase, others for which δ = kλ are in phase, and others for which kλ
40.2 Types of signal processing Transforms are important in signal processing. An important objective of signal processing is to improve the signal-to-noise ratio of a signal. This can be done in the time domain and in the frequency domain. Signals are composed of a deterministic part, which carries the chemical information and a stochastic or random part which is caused by deficiencies of the instrumentation, e.g. shot noise
generated by a photomultiplier and noise generated by a flame ionization detector in chromatography. The presence of such noise hampers the interpretation of the deterministic part of the signal. Therefore, techniques are required to improve the signal-to-noise ratio. This is called signal enhancement. One way to achieve this is by improving the signal before digitization by electronic devices which process the analog signal during the measurement. The discussion of these devices is beyond the scope of this book. Nowadays, signals are digitally stored in computer memory and processed afterwards. A simple way to improve the signal-to-noise ratio is by averaging the signal over a number of data points, e.g. five consecutive data points are averaged and the central point is replaced by that mean value. This method and others are discussed in more detail in Section 40.5.2. However, more complex methods may be necessary to enhance the signal-to-noise ratio. Noise is associated with fluctuations that are fast compared to those of the underlying signal, i.e. the frequency of noise is higher than the frequency of the signal. Therefore, frequencies associated with noise can be distinguished from the frequencies associated with the signal. It is, therefore, possible to selectively eliminate noise in the frequency domain. Fourier transforms and other processing techniques can be used to achieve this. This is described in Section 40.5.3. Signals obtained from measurement devices such as analytical instruments may be distorted. In spectrometry, for instance, the true spectrum can only be obtained by using monochromatic light. Spectrometers, however, emit light with a certain band width. As a consequence, the observed peaks are broadened causing a lower resolution between adjacent peaks. In some cases this broadening of the signal can be removed by applying signal restoration or deconvolution (Section 40.6). In this chapter we discuss techniques both for signal enhancement and signal restoration. Techniques to model or to reconstruct the deterministic part of a digital signal in the presence of noise are discussed in Chapter 41.
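As a simple illustration of such time-domain signal enhancement, a sketch of the five-point moving average just described (Python/NumPy; the test signal and window length are our own choices):

```python
import numpy as np

def moving_average(y, window=5):
    """Replace each point by the mean of a window centred on it.
    The first and last (window // 2) points are left unchanged."""
    half = window // 2
    smoothed = y.astype(float)
    for i in range(half, len(y) - half):
        smoothed[i] = y[i - half:i + half + 1].mean()
    return smoothed

# Noisy test signal: a Gaussian peak plus random noise
x = np.arange(100)
rng = np.random.default_rng(1)
noisy = np.exp(-(x - 50.0) ** 2 / 50.0) + rng.normal(0, 0.1, x.size)

print(np.round(moving_average(noisy)[45:55], 2))   # smoothed values around the peak
```

More refined smoothing and filtering methods are discussed in Sections 40.5.2 and 40.5.3.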
40.3 The Fourier transform

40.3.1 Time and frequency domain

Before discussing the Fourier transform, we will first look in some more detail at the time and frequency domain. As we will see later on, an FT consists of the decomposition of a signal into a series of sines and cosines. We consider first a signal which varies with time according to a sum of two sine functions (Fig. 40.3). Each sine function is characterized by its amplitude A and its period T, which corresponds to the time required to run through one cycle (2π radians) of the sine function. In this example the frequencies are 1 and 3 Hz. The frequency of a sine function can be expressed in two ways: the radial frequency ω (radians per second), which is
Fig. 40.3. A composite sine function (solid line) which is the sum of two sine functions of 1 Hz (dotted line) and 3 Hz (dashed line).
equal to 2π/T, and the number of cycles per second, ν, which is equal to 1/T Hz. For all t which are a multiple of the period T, the amplitude of the sine is zero. The function describing a sum of two sines is given by:

f(t) = A_1 \sin(\omega_1 t) + A_2 \sin(\omega_2 t), or
f(t) = A_1 \sin(2\pi t/T_1) + A_2 \sin(2\pi t/T_2), or
f(t) = A_1 \sin(2\pi \nu_1 t) + A_2 \sin(2\pi \nu_2 t)

Although the sum signal shown in Fig. 40.3 consists of two frequencies ν_1 = 1/T_1 = 1 Hz and ν_2 = 1/T_2 = 3 Hz, this is hardly visible in the time domain. By switching from the time domain to the frequency domain, i.e. by measuring the amplitude of the sine (and cosine) functions present in the signal as a function of the frequency, two distinct signals are found, a pulse at ω_1 = 2πν_1 = 2π and one at ω_2 = 2πν_2 = 6π (Fig. 40.4). The height of the pulses corresponds to the amplitudes A_1 and A_2 of the two sine functions. The radial frequency ω of a periodic function is positive or negative, depending on the direction of rotation of the unit vector (see Fig. 40.5): ω is positive in the counter-clockwise direction and negative in the clockwise direction. From Fig. 40.5a one can see that the amplitudes (A_{-ω}) of a sine at a negative frequency, -ω, with an amplitude A, are opposite to the values A_{+ω} of a sine function at a positive frequency, ω, i.e. A_{-ω} = A sin(-ωt) = -A sin(ωt) = -A_{+ω}. This is a property of an anti-symmetric function. A cosine function is a symmetric function because A_{-ω} = A cos(-ωt) = A cos(ωt) = A_{+ω} (Fig. 40.5b). Thus, positive as well as negative
512 0)
It
I
Q.
E 03
^
1 2n
^
2
^
^~^
3 An
67c
V (CDS)
^^^ —^ frequency CO
Fig. 40.4. The signal of Fig. 40.3 represented in the frequency domain.
Fig. 40.5. (a) Sine function with a radial frequency +ω and -ω. (b) Cosine function with a radial frequency +ω and -ω. (c) Representation of (a) in the frequency domain. (d) Representation of (b) in the frequency domain.
frequencies, ω, may be considered. A cosine function generates two pulses of equal sign at ω and -ω in the frequency domain (Fig. 40.5c). On the other hand, a sine function generates two pulses of opposite sign at these two frequencies. Figures 40.3 and 40.4 are two representations of the same signal, each providing specific information. In the time domain information is available on the overall
amplitude of the signal, whereas in the frequency domain explicit information is obtained on the frequencies present in the signal. In general, each continuous signal can be decomposed into a sum of an infinite number of sine and cosine functions, each with a specific frequency ν and amplitude. In practice, a signal in the time domain is measured during a finite time. When transforming a signal which has been measured during a time t, only sines (and cosines) are considered with a period T which exactly fits a number of times in the measurement time. As a result, the Fourier transform of a finite continuous signal is a discrete function.
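This decomposition is easily demonstrated numerically. The sketch below (Python/NumPy, using the FFT routine discussed in Section 40.3.11) recovers the two frequencies of a composite signal like that of Fig. 40.3; the amplitudes (1 and 0.5 here) are chosen arbitrarily:

```python
import numpy as np

T = 2.0                      # measurement time (s)
n = 256                      # number of samples
t = np.arange(n) / n * T
signal = 1.0 * np.sin(2 * np.pi * 1 * t) + 0.5 * np.sin(2 * np.pi * 3 * t)

coef = np.fft.rfft(signal) / n          # one-sided spectrum, scaled by 1/n
freq = np.fft.rfftfreq(n, d=T / n)      # frequency axis (Hz), spacing 1/(2*Tm)

# Each sine of amplitude A appears as a coefficient of magnitude A/2 at its frequency
for f, c in zip(freq[:10], coef[:10]):
    if abs(c) > 1e-6:
        print(f"{f:.1f} Hz  amplitude {2 * abs(c):.2f}")
```

Only the 1 Hz and 3 Hz bins carry non-negligible coefficients; all other frequencies are absent from the signal.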
40.3.2 The Fourier transform of a continuous signal

Let us consider the antisymmetric block-shaped signal f(t) shown in Fig. 40.6a, with the values -1 and +1, measured during a finite time T_m. In many textbooks the measuring time is defined as 2T_m for practical reasons. In this chapter we also adopt this convention. Usually the origin of the time axis is shifted to the centre of the measurement time, which is then defined from -T_m to +T_m. As a first approximation this block signal f(t) can be represented by a sine function with a period equal to 2T_m, or a frequency ν equal to 1/(2T_m), i.e. f(t) \approx 1\,\sin(2\pi t/2T_m). As can be seen from Fig. 40.6b, this approximation is imperfect and is improved by adding a second sine function. Because the period T of this sine function should exactly fit a number of times in the measurement time 2T_m, only the frequencies 1/(2T_m), 2/(2T_m), 3/(2T_m), ..., n/(2T_m) are considered. In this particular case the fit cannot be improved by adding a sine with a frequency equal to 2/(2T_m). However, including a sine with a frequency equal to 3/(2T_m) and an amplitude 1/3 yields the next best approximation, which is:

f(t) \approx 1\,\sin(2\pi t/2T_m) + \tfrac{1}{3}\,\sin(2\pi t \cdot 3/2T_m)

The approximation is still imperfect and can be improved by adding a third sine function, now with a frequency 5/(2T_m) and an amplitude 1/5, i.e.

f(t) \approx 1\,\sin(2\pi t/2T_m) + \tfrac{1}{3}\,\sin(2\pi t \cdot 3/2T_m) + \tfrac{1}{5}\,\sin(2\pi t \cdot 5/2T_m)

The process of adding sine functions can be continued, giving the following expression:

f(t) \approx \{\sin(2\pi t/2T_m) + \tfrac{1}{3}\sin(3(2\pi)t/2T_m) + \tfrac{1}{5}\sin(5(2\pi)t/2T_m) + ... + \tfrac{1}{2n+1}\sin((2n+1)2\pi t/2T_m)\}

for all integer n → ∞.
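A short numerical check (Python/NumPy) that these partial sums indeed approach the ±1 block signal. Note that the exact Fourier series carries an overall factor 4/π, which is included here so that the sum converges to ±1:

```python
import numpy as np

Tm = 0.5                       # half measurement time, so 2*Tm = 1 s
t = np.linspace(-Tm, Tm, 1001)
block = np.sign(t)             # antisymmetric block signal of Fig. 40.6a

def partial_sum(t, n_terms):
    # Sines with frequencies (2n+1)/(2*Tm) and amplitudes 1/(2n+1)
    s = np.zeros_like(t)
    for n in range(n_terms):
        k = 2 * n + 1
        s += np.sin(k * 2 * np.pi * t / (2 * Tm)) / k
    return s

for n_terms in (1, 3, 10, 100):
    err = np.abs(block - (4 / np.pi) * partial_sum(t, n_terms)).mean()
    print(f"{n_terms:3d} terms: mean abs error {err:.3f}")
```

The error decreases steadily with the number of terms, apart from the well-known overshoot near the discontinuities.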
Fig. 40.6. (a) Block signal measured from -T_m to +T_m. (b) Signal (a) approximated by a sum of one, two and three sine functions. (c) Signal (a) represented in the frequency domain.
By plotting the amplitudes 1, 0, 1/3, 0, 1/5, ..., 1/(2n + 1) at the frequencies ν_1 = 1/(2T_m), ν_2 = 2/(2T_m), ..., ν_n = n/(2T_m), we obtain the Fourier transform of the antisymmetric block signal (Fig. 40.6c). For 2T_m equal to 1 s the frequency ν_1 is 1 Hz, ν_2 = 2 Hz and so on. However, in order to indicate that these frequencies belong to a sine function and not to a cosine function, we also need to calculate the amplitudes at the negative frequencies -ν_1, -ν_2, ..., -ν_n, which are -1, 0, -1/3, 0, -1/5, ..., -1/(2n + 1). The interval (Δν) between two successive frequencies in the plot is equal to (n + 1)/(2T_m) - n/(2T_m) = 1/(2T_m). This leads to the important property that both the lowest frequency and the frequency interval in the Fourier domain are equal to the reciprocal of the measurement time. There is no upper limit to the frequencies which are considered. Because the block signal in Fig. 40.6a is antisymmetric, no cosine functions were needed to fit the block signal. However, in general one needs to include cosine functions as well as sine functions. Therefore, the Fourier transform consists of two parts, the amplitudes of the sine functions and the amplitudes of the cosine functions, which are called Fourier coefficients. Because sine and cosine functions have a zero mean, one should first subtract the mean value from the signal before calculating the FT. In fact the mean value corresponds to a cosine with a zero frequency. This term is called the DC term, in analogy to electronics where DC means Direct Current. Summarizing, any signal f(t), measured from time t = -T_m to t = +T_m, can be decomposed into a sum of sine and cosine functions as follows:
f(t) = \sum_{n=0}^{\infty} \left[ A_n \cos\left(\frac{n 2\pi t}{2T_m}\right) + B_n \sin\left(\frac{n 2\pi t}{2T_m}\right) \right]

Because 2\pi n/(2T_m) = n\omega = \omega_n:

f(t) = \sum_{n=0}^{\infty} [A_n \cos(\omega_n t) + B_n \sin(\omega_n t)] = \frac{1}{2} \sum_{n=-\infty}^{+\infty} [A_n \cos(\omega_n t) + B_n \sin(\omega_n t)]        (40.1)

The Fourier coefficient at zero frequency (n = 0) is equal to (A_0/2) \cos(0 \cdot 2\pi t/2T_m) = (A_0/2) \cos(0) = A_0/2. Equation (40.1) is the expression for what is called the backward or inverse Fourier transform, because it reconstructs the signal f(t) from its Fourier coefficients A_n and B_n (n = -∞ to ∞). In principle, the values of A_n and B_n can be calculated by fitting eq. (40.1) to the signal by a least squares regression. However, the coefficients A_n and B_n are calculated directly [2] by solving the following two equations, which represent the forward Fourier transform:

A_n = \frac{1}{2T_m} \int_{-T_m}^{+T_m} f(t) \cos\left(\frac{n\pi t}{T_m}\right) dt        (n = -\infty, ..., \infty)

B_n = \frac{1}{2T_m} \int_{-T_m}^{+T_m} f(t) \sin\left(\frac{n\pi t}{T_m}\right) dt        (n = -\infty, ..., \infty)        (40.2)

A relationship, known as Euler's formula, exists between a complex number [x + jy] (x is the real part, y is the imaginary part of the complex number, j = \sqrt{-1}) and a sine and cosine function. Many authors and textbooks prefer the complex number notation for its compactness and convenience. By substituting the Euler equations \cos(t) = (e^{jt} + e^{-jt})/2 and \sin(t) = (e^{jt} - e^{-jt})/(2j) in eq. (40.1), a compact complex number notation for the Fourier transform is obtained as follows:

f(t) = \frac{1}{2} \sum_{n=-\infty}^{+\infty} \left[ A_n \frac{e^{jn\pi t/T_m} + e^{-jn\pi t/T_m}}{2} + B_n \frac{e^{jn\pi t/T_m} - e^{-jn\pi t/T_m}}{2j} \right]

Because j^2 = -1, A_n = A_{-n} and B_n = -B_{-n} (in analogy to A_{+\omega} = A_{-\omega} and B_{+\omega} = -B_{-\omega}):

f(t) = \frac{1}{2} \sum_{n=-\infty}^{+\infty} [A_n - jB_n]\, e^{jn\pi t/T_m}        (40.3)

The amplitudes (A_n) of the cosines, therefore, represent the real part of the FT. They are the real Fourier coefficients. The amplitudes (B_n) of the sines are the imaginary part of the FT; these are the imaginary Fourier coefficients. Recalling that the antisymmetric block signal, given in Fig. 40.6a, could be described with a sum of sine functions alone, its FT consists only of imaginary Fourier coefficients. A symmetric signal only contains real Fourier coefficients. As a consequence, setting the imaginary coefficients equal to zero symmetrizes the signal obtained after the inverse transform. Usually, the Fourier coefficients A_n and B_n in eq. (40.2) are combined in a single complex number [A_n - jB_n]:

A_n - jB_n = \frac{1}{2T_m} \int_{-T_m}^{+T_m} f(t) \left[ \cos\left(\frac{n\pi t}{T_m}\right) - j \sin\left(\frac{n\pi t}{T_m}\right) \right] dt = \frac{1}{2T_m} \int_{-T_m}^{+T_m} f(t)\, e^{-jn\pi t/T_m}\, dt        (40.4)
-jnnt
j / ( O e ^- df -7
Equations (40.3) and (40.4) are called the Fourier transform pair. Equation (40.3) represents the transform from the frequency domain back to the time domain, and eq. (40.4) is the forward transform from the time domain to the frequency domain. A closer look at eqs. (40.3) and (40.4) reveals that the forward and backward Fourier transforms are equivalent, except for the sign in the exponent. The backward transform is a summation because the frequency domain is discrete for finite measurement times. However, for infinite measurement times this summation becomes an integral. Summarizing, two complementary representations of a signal have been derived: f(t) in the time domain and [A_n - jB_n] in the frequency domain. The imaginary Fourier coefficients, B_n, represent the frequencies of the sine functions and the real Fourier coefficients, A_n, represent the frequencies of the cosine functions. In the remainder of this chapter we use the shorthand notation F(n) for [A_n - jB_n]. Many signals, such as a Gaussian peak, have a symmetric clock-shaped form about their maximum. If the origin of the time axis is chosen in the centre of symmetry, the FT contains only real Fourier coefficients. If, however, the origin of the measurement axis is put at the start of the peak, the function loses its symmetry about the origin, resulting in an FT with real and imaginary coefficients. This indicates that the location of the origin of the data influences the Fourier transform. The Fourier coefficients can be combined in a so-called power spectrum, which is defined as P_n = \sqrt{A_n^2 + B_n^2}. One should realize that, because the contributions of A_n and B_n are implicitly included in P_n, one cannot back-transform the power spectrum to the time domain. However, the power spectrum gives the total amplitude of the frequencies present in the signal. The Fourier transform may be considered as a special case of the general data transform
F(\nu) = \int f(t)\,K(\nu, t)\,dt
K(ν,t) is called the transform kernel. For the Fourier transform the kernel is the complex exponential e^{-jωt}. Other transforms (see Section 40.8) are the Hadamard, wavelet and Laplace transforms [4].
40.3.3 Derivation of the Fourier transform of a sine

In this section we derive the Fourier transform of a sine function f(t) = sin(ωt) by solving eq. (40.4). Because cos(x) = (e^{jx} + e^{-jx})/2 and sin(x) = (e^{jx} - e^{-jx})/(2j), we can express sin(ωt) in the complex number notation: sin(ωt) = (e^{jωt} - e^{-jωt})/(2j). The Fourier transform of sin(ωt) is:
r
-jnn —
( A „ - ; f i „ ) = — J sin(0)0e
^ df
-T
(40.5) t -inn -
IT _J, 2j 1 y n
J n/
2T
OT 2T JJ 02j1
J 2j
We work out the first term in eq. (40.5) AITI
d/ (0
"
27 _{. 2; 7 co-
rnz
Because j e^dj: = e^ and -J e ^djc = e ^ this term is equal to: +r .^r
1 27
^'"-y)
J e ^
IT
2yy|co-y
nn \
' -&
f
^
nn ^
'
-2;|«-Y
J-r
_
j sin(cor -n%) 2 (oT -nn
In the same way we find that the second term 5„ is equal to j sin((jor -\- nn) 2 (oT -^nn This gives the following expression for the Fourier transform of a sine function:
519
{A,-jBJ
= -
sin(cor - nn) (oT -nn
sin(cor + nn) 0)7 + nn
(40.6)
Only measurement times which are infinite or are a multiple of co are allowed (e.g. CO = InmllT). Consider a finite measurement time IT equal to 2nm/a). For co = 2nn/2T eq. (40.6) becomes:
(A„-jBJ = -^
sin((m - n)n) ( m - n)n
sin((m + n)n) {m+ n)n
All sin(m - n)n are zero for all integer (m - n) except forn = m and for n = -m for which respectively the first and the second term of the above equation are equal to 1. As a result the FT is: A„-jB,
= -j/2(l+0)
= (0-j/2)
and A_„-;B_„ = - ; / 2 ( 0 - l ) = (0+;/2) The frequency corresponding to n = m is co^ = InmllT = co and to n = -m is co_^ = -InmllT = -CO. The FT is thus imaginary, which is in agreement with the fact that a sine is an antisymmetric function. The factor 1/2 is introduced because positive and negative frequencies are considered. Namely, f(0 = sin(coO = 1/2 sin(coO -l/2sin(-co0.
40.3.4 The discrete Fourier transformation Modem signal processing in analytical chemistry is usually performed by computer. Therefore, signals are digitized by taking uniformly spaced samples from the continuous signal, which is measured over a finite time. Assume a signal f(0 which is measured during a time 2T^, starting at a time t^ and ending at a time {t^ + IT^). In accordance with the notation used when discussing continuous signals, we shift the origin to the centre of the measurement time and consider the signal as being measured from a time -T^ to +7^. For the compactness of the expressions we also require that the origin of the data coincides with one of the data points. Therefore, the number of data should be an odd number 2A^ + 1. The sampling interval At is then defined by A^ = IT JIN. As a result, the digitizing process can be represented as: f(0 => f(kAt) with k = -N to -\-N. The value k = -N corresponds to -T^ and the value k = N to T^. This digitization procedure is schematically shown in Figs. 40.7a and 40.7b. The new variable k varies from ~N to+N,
520
f(t)|
f(kAt)
(a)
(b)
V^ ^ I I I I
0
1
2
2
0
2
^
Fig. 40.7. (a) A continuous signalJ{t) measured from to =0 to ^o +2Tn, = 2 seconds, (b) Same signal after digitization,/(/:A0 as a function of/:.
The expressions for the forward and backward Fourier transforms of a data array of2N-\- 1 data points with the origin in the centre point are [3]: Forward: F(n) =
k=+N
1 2A^ + 1
-jnlnk
X / ( M r ) e 2^-1
(40.7)
k=-N
for n = -^max ^^ '^max' ^max corrcspouds to the maximum frequency which is in the ta. The value vah of n^^ is derived in the next section. data. Backward: .
f(kAt) = -
.....
^ jnlnk
X ^(^)e 2^^^
(40.8)
for/: = -M...,+A^ The frequency associated with F(n) is v^. This frequency should be equal to n times the basis frequency, which is equal to 1/(27^) (this is the period of a sine or cosine which exactly fits in the measurement time). Thus v„ = n/(2T^) = n/(2NAt). It should be noted that in literature one may find other conventions for the normalization factor used in front of the integral and summation signs. 40.3.5 Frequency range and resolution As mentioned before, the smallest observable frequency (v^jj,) in a continuous signal is the reciprocal of the measurement time (1/27^). Because only those frequencies are considered which exactly fit in the measurement time, all frequencies should be a multiple of Vj^j^, namely n/2T^ with n = -©o to +oo. As a result the Fourier transform of a continuous signal is discrete in the frequency domain,
521
with an interval (Ai+l)/(2rj - nl{2TJ = l/(2TJ. This frequency interval is called the resolution. Thus the measurement time defines the lowest observable frequency and the resolution. Both are equal to l/2T^. In the case of a continuous signal there is no upper limit for the observable frequencies. For an infinite measurement time the Fourier transform also becomes a continuous signal because 1/27^ -^ 0. On the contrary, when transforming a discrete signal, the observable frequencies are not unlimited. The reason lies in the fact that a sine or cosine function should be sampled at least twice per period in order to be uniquely defined. This sampling frequency is called the Nyquistfrequency. This is illustrated in Fig. 40.8 showing a signal measured in one second and digitized in 4 data points. Clearly, these 4 data points can be fitted in two ways: with a sine with a period equal to 1 s or with a sine with a period equal to 1/3 s. However, the Fourier transform always yields the lowest frequencies with which the data can be fitted, in this case, the sine with a period equal to 1 s. The Nyquist frequency is the upper limit (Vj^ax = 1/(2A0) of the frequencies that can be observed. An increase of the maximally observable frequency in the FT can only be achieved by increasing the sampling rate (smaller A^ of the digitization process. From the values of v^^j^^ and v^^^, one can derive that for a signal digitized in (2A'^ +1) data points, N frequencies are observed, namely: + 1 = 2TJ2At - 1 + 1 = M [(v„ - v^iJ/resolution] + 1 = (l/2At - l/2TJ(l/2TJ The same A^ frequencies are observed in the negative frequency domain as well. In summary, the Fourier transform of a continuous signal digitized in 2A^ + 1 data points returns A^ real Fourier coefficients, A^ imaginary Fourier coefficients and the average signal, also called the DC term, i.e. in total 2A + 1 points. The relationship between the scales in both domains is shown in Fig. 40.9.
(D
Q.
E (0
Fig. 40.8. (a) Sine function sampled at the Nyquist frequency (2 points per period), (b) An under-sampled sine function.
522 33 = (2N+1) data points 1 32 16
1 0 16
At 0
-N
N
^0
A(n) • n
1
N real coefficients
1 v = 1/(2T„)
(+ value at n=0)
ii
\
v = 1/(2 At) Av=1/(2Tm)
B{n)
N Imaginary coefficients
— •
n
1
1
1 v=1/(2T^)
\ \ Av=1/(2T„)
16 v = 1/(2 At)
Fig. 40.9. Relationship between measurement time {2T„,), digitization interval and the maximum and minimal observable frequencies in the Fourier domain.
TABLE 40.1 Signal measured at five time points with Ar = 0.5 s and 2T^ = 2 s Akhi)
Original time scale (seconds)
Time scale with origin shifted to the centre
k
f,=0
t,=-\
-2
A-\) = 2
(2 = 0.5
t2 = -0.5
-1
y(-0.5) = 3
(3=1
^3 = 0
0
./(0) = 4
(4=1.5
r4 = 0.5
1
m5) = 4
u=2
u=\
2
By way of illustration we calculate the FT of the discrete signal listed in Table 40.1. The origin of the time domain has been placed in the centre of the data. The forward transform is calculated as follows (eq. (40.7))
523
Mean value is: 3; A^ = 2; 2A^ + 1 = 5 v„i„= 1/27^ = 0.5 Hz Av = l/2T^ = 0.5 Hz v' max _ = l / 2 A f = 1/1 = 1 Hz l . n = 0:Vo = OHz
-^ k=-2
-^ k=-2
= 1/5(2 + 3 + 4 + 4 +2) = 3 2. n = l:vi = 1/27^ = 0.5 Hz ni)=l/5lf()tA0e--'^''*'5 = = l/5(2e'""'5 +3e'^"'5 + 4e° + AQ-J^"^^ + 2e-''*"'^) = l/5(2cos47i/5 + 2/sin47i/5 + 3cos27t/5 + 3;sin27i/5 + 4 + 4cos27i/5 - 4/sin27i/5 + 2cos47i/5 - 2jsm4n/5) = 1/5 (-1.618 +1.1756/- + 0.9271 + 2.8532/- + 4 + 1.2361 - 3.8042; - 1.618 -1.1756/) = = 1/5(2.93 - 0.95;) = 0.58 - 0.19; 3. n = 2: V2 = V, + Av = IHz = 1/2A? = v^^^ Fi2) = 1/5 I f{kAt)e-J^'^^ = = l/5(2e'^"'5 +3e'""'5 + 4e0 + 4e-''"^5 + 2e-J»^^) = l/5(2cos87i/5 + 2jsm8n/5 + 3cos47C/5 + 3;sin47i/5 + 4 + 4cos4jt/5 - 4;sin47r/5 + 2cos87i/5 - 2/sin87i/5) = 1/5(0.618- 1.9021y--2.4271 + 1.7634/+ 4-3.2361 -2.3511; +0.618 + 1.902;) = l/5(-0.427 + -0.588;-) = - 0.085 - 0.12;
At« = 2 the maximally observable frequency is reached, and the calculation can be stopped. The results are summarized in Fig. 40.10.
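The same numbers can be obtained with a few lines of code. The sketch below (Python/NumPy) evaluates the forward transform of eq. (40.7) directly for the five data points of Table 40.1, assuming the normalization used there, and should reproduce F(0) = 3, F(1) ≈ 0.58 - 0.19j and F(2) ≈ -0.085 - 0.12j:

```python
import numpy as np

f = np.array([2.0, 3.0, 4.0, 4.0, 2.0])    # f(k*dt) for k = -2, ..., +2 (Table 40.1)
k = np.arange(-2, 3)
M = f.size                                  # 2N + 1 = 5 data points

# Forward discrete Fourier transform, eq. (40.7)
for n in range(3):                          # n = 0, 1, 2 (up to the Nyquist frequency)
    Fn = np.sum(f * np.exp(-2j * np.pi * n * k / M)) / M
    print(f"F({n}) = {Fn:.3f}   (nu = {n * 0.5:.1f} Hz)")
```

The loop stops at n = 2 because, as noted above, this corresponds to the maximally observable frequency of 1 Hz.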
524
f(kAt)
0.5 1
1.0 2
1.5 Hz 3 n
0.5 1
1.0 2
1.5 Hz
3
n
Fig. 40.10. Fourier transform of the data points listed in Table 40.1.
40.3.6 Sampling

In the previous section we have seen that the highest observable frequency present in a discrete signal depends on the sampling interval (Δt) and is equal to 1/(2Δt), the Nyquist frequency. This has two important consequences. First, the minimally required digitizing rate (sampling points per second) in order to retain all information in a continuous signal is defined by its maximum frequency. Secondly, the Fourier transform is disturbed if the continuous signal contains frequencies higher than the Nyquist frequency. This disturbance is called aliasing or folding. The principle of folding is illustrated in Fig. 40.11. In this figure two sine functions A and B are shown which have been sampled at the same sampling rate of 16 Hz. Signal A (8 Hz) is sampled with a rate which exactly fits the Nyquist frequency (ν_max), namely 2 data points per period. Signal B (11 Hz) is under-sampled, as it requires a sampling rate of minimally 22 Hz. The frequency of signal B is 3 Hz (= δν) higher than the maximally observable frequency. This does not mean that for signal B no frequencies are observed in the frequency domain. Indeed, from Fig. 40.11 one can see that the data points of signal B can also be fitted with a 5 Hz sine, which is 3 Hz (= δν) lower than ν_max (= 8 Hz). As a consequence, if a signal contains a frequency which is δν higher than the Nyquist frequency, false frequen-
525
t (sec) Fig. 40.11. Aliasing or folding, (a) Sine of 8 Hz sampled at 16 Hz (Nyquist frequency), (b) Sine of 11 Hz sampled at 16 Hz (under-sampled), (c) A sine of 5 Hz fitted through the data points of signal (b).
cies are observed at a frequency which is δν lower than the Nyquist frequency. One should always be aware of the possible presence of 'false' frequencies caused by aliasing. This can easily be checked by changing the sampling rate: true frequencies remain unaffected, whereas 'aliased' frequencies shift to other values. The Nyquist frequency defines the minimally required sampling rate of analytical signals. As an example we take a chromatographic peak with a Gaussian shape given by y(x) = a exp(-x^2/(2s^2)) (a Gaussian function with a standard deviation equal to s, located in the origin, x = 0). The FT of this function is [5]:
/(2/s^))
From this equation it follows that the amplitudes of the frequencies present in a Gaussian peak are normally distributed about a frequency equal to zero with a standard deviation inversely proportional to the standard deviation s of the Gaussian peak. As a consequence, the maximal frequency present in a Gaussian peak is approximately equal to (0 + 3{l/s)). In order to be able to observe that frequency, a sampling rate of at least 6/s is required or 6 data points per standard deviation of the original signal. As the width of the base of a Gaussian peak is approximately 6 times the standard deviation, the sampling rate should be at least 36 points over the whole peak. In practice higher sampling rates are applied in order to avoid aliasing of high noise frequencies and to allow signal processing
526
(a)
0.1
(b)
0.1
-0.1 0 05
-0.2
-0.3 «
64
64
127
127 0.2
0.1
-0.1
-0.2 I
127
Fig. 40.12. (a) FT (real coefficients) of a Gaussian peak located in the origin of the measurements (256 data points). Solid line: wVi = 20; dashed line: wVi = 5 and corresponding maximal frequencies, (b) FT (real and imaginary coefficients) of the same peak shifted by 50 data points.
(see Section 40.5). Figure 40.12a gives the real Fourier coefficients of two Gaussian peaks centered in the origin of the data and half-height widths respectively equal to 20 and 5 data points. As one can read from the figure, the maximum frequency in the narrow peak (dashed line in Fig. 40.12) is about 4 times higher than the maximum frequency in the wider peak (solid line in Fig. 40.12). 40,3.7 Zero filling and resolution In Section 40.3.5 we concluded that the resolution (Av) in the frequency spectrum is equal to the reciprocal of the measurement time. The longer the measurement time in the time domain, the better the resolution is in the frequency domain. The opposite is also true: the longer the measurement time in the frequency domain (e.g. in FTIR or FT NMR), the better is the separation of the peaks in the spectrum after the back-transform to the wavelength or chemical shift domain.
527
Fig. 40.13. Exponentially decaying pulse NMR signal.
In FTIR and FT NMR the amplitude of the measured signal is an exponentially decaying function (Fig. 40.13). One could conclude that in this case continuing the measurements makes little sense. However, it shortens the measurement time and would, therefore, limit the resolution of the spectrum after the back-transform. It is, therefore, common practice to artificially extend the measurement time by adding zeros behind the measured signal, and to consider these zeros as measurements. This is called zero filling. The effect of zero filling is illustrated on the inverse Fourier transform of an exponentially decaying FT NMR signal before (Fig. 40.14a) and after zero filling (Fig.40.14b).One should take into account that if the last measured data point is not close to zero, zero filling introduces false frequencies because of the introduction of a stepwise change of the signal. This is avoided by extrapolating the signal to zero following an exponential function, called apodization. 40.3.8 Periodicity and symmetry In Section 40.3.4 we have shown that the FT of a discrete signal consisting of 2A^ + 1 data points, comprises A^ real, A^ imaginary Fourier coefficients (positive frequencies) and the average value (zero frequency). We also indicated that A^ real and N imaginary Fourier coefficients can be defined in the negative frequency domain. In Section 40.3.1 we explained that the FT of signals, which are symmetrical about the r = 0 in the time domain contain only real Fourier coefficients.
528
6.4 s
1 I
10 Hz
10 Hz
Fig. 40.14. Effect of zero filling on the back transform of the pulse NMR signal given in Fig. 40.12. (a) before zero filling, (b) after zero filling.
whereas signals which are antisymmetric about the r = 0 point in the time domain contain only imaginary Fourier coefficients (sines). This property of symmetry may be applied to obtain a transform which contains only real Fourier coefficients. For example, when a spectrum encoded in 2A^ + 1 = 512 + 1 data points is artificially mirrored into the negative wavelength domain, the spectrum is symmetrical about the origin. The FT of this spectrum consists of 512 real and 512 imaginary Fourier coefficients. However, because the signal is symmetric all imaginary coefficients are zero, reducing the FT to 512 real coefficients (plus one for the mean) of which 256 coefficients at negative frequencies and 256 coefficients at positive frequencies. 40.3.9 Shift and phase A shift or translation of f(0 by /Q results in a modulation of the Fourier coefficients by exp(-j(OtQ). Without shift f(0 is transformed into F(co). After a shift by /Q, f(r - ^o) is transformed into txpi-jcoto) F((o), which results in a modulation (by a cosine or sine wave) of the Fourier coefficients. The frequency CO^Q of the
529
modulation depends on ^Q, the magnitude of the shift. The back transform has the same property. F{n - n^ is back-transformed to f(t)txp(j2nnQt/2N). A shift by A^ data points, therefore, results in f{kAt)Qxp(j2nNk/2N) = f{kAt)cxp(jnk) = f{kAt) (-1)^. This property is often applied to shift the origin of the Fourier domain to the centre of its 2A^ + 1 data array when the software by default places its origin at the first data point. In Section 40.3.6 we mentioned that the Fourier transform of a Gaussian peak positioned in the origin is also a Gaussian function. In practice, however, peaks are located at some distance from the origin. The real output, therefore, is a damped cosine wave and the imaginary output is a damped sine wave (see Fig. 40.12b). The frequency of the damping is proportional to the distance of the peak from the origin. The functional form of the damped wave is defined by the peak shape and is a Gaussian for a Gaussian peak. Inversely, the frequency of the oscillations of the Fourier coefficients contains information on the peak position. The phase spectrum 0{n) is defined as 0(n) = arctan(A(n)/S(n)). One can prove that for a symmetrical peak the ratio of the real and imaginary coefficients is constant, which means that all cosine and sine functions are in phase. It is important to note that the Fourier coefficients A(n) and B(n) can be regenerated from the power spectrum P(n) using the phase information. Phase information can be applied to distinguish frequencies corresponding to the signal and noise, because the phases of the noise frequencies randomly oscillate. 403.10 Distributivity and scaling The Fourier transform is distributive over summation, which means that the Fourier transform of the two individual signals is equal to the sum of the Fourier transforms of the two individual signals F\f^{t) +/2(0] = F\f\if)\ + ^l/iCOlThe enhancement of the signal-to-noise ratio (or filtering) in the Fourier domain is based on that property. If one assumes that the noise n{t) is additive to the signal s{t), the measured signal m{i) is equal to s{t) + n{t). Therefore, F[m{t)\ = F[s{t)\ + F[n{i)], or M(v) = 5(v)+7V(v) Assuming that the Fourier transformed spectra 5(v) and A^(v) contribute at specific frequencies, the true signal, s(t), can be recovered from M(v) after elimination of A^(v). This is csllcd filtering (see further Section 40.5.3) The Fourier transform is not distributive over multiplication:
F[f1(t) f2(t)] ≠ F[f1(t)] F[f2(t)]. It is also easy to show that for a scalar a, F[a f(t)] = a F[f(t)].
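These two properties are easily verified numerically. The following NumPy fragment (an editorial sketch added for illustration; the test arrays and the scalar a are arbitrary) confirms that the discrete Fourier transform is distributive over summation and over multiplication by a scalar, but not over the product of two signals.

```python
import numpy as np

# Numerical check of the distributivity and scaling properties of the
# discrete Fourier transform; the arrays and the scalar are arbitrary.
rng = np.random.default_rng(0)
f1 = rng.normal(size=64)
f2 = rng.normal(size=64)
a = 3.7

print(np.allclose(np.fft.fft(f1 + f2), np.fft.fft(f1) + np.fft.fft(f2)))  # True
print(np.allclose(np.fft.fft(a * f1), a * np.fft.fft(f1)))                # True
# not distributive over multiplication:
print(np.allclose(np.fft.fft(f1 * f2), np.fft.fft(f1) * np.fft.fft(f2)))  # False
```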
40.3.11 The fast Fourier transform

As explained before, the FT can be calculated by fitting the signal with all allowed sine and cosine functions. This is a laborious operation, as it requires the calculation of two parameters (the amplitudes of the sine and cosine functions) for each considered frequency. For a discrete signal of 1024 data points, this requires the calculation of 1024 parameters by linear regression and the calculation of the inverse of a 1024 by 1024 matrix. The FT could also be calculated by directly solving eq. (40.7). For each frequency (2N + 1 values) we have to add 2N + 1 (the number of data points) terms, each of which is the result of a multiplication of a complex number with a real value. The number of complex multiplications and additions is, therefore, proportional to (2N + 1)^2. Even for fast computers this is a considerable task. Therefore, so-called fast Fourier transform (FFT) algorithms have been developed, originally by Cooley and Tukey [6], which are available in many software packages. The number of operations in the FFT is proportional to (2N + 1)log2(2N + 1), permitting considerable savings of calculation time. The transform of a signal digitized over 1024 points now requires about 10^4 operations instead of 10^6, which is about 100 times faster. A condition for applying the FFT algorithm is that the number of data points is a power of 2. The principle of the FFT algorithm can be found in many textbooks (see additional recommended reading). Because the FFT algorithm requires the number of data points to be a power of 2, it follows that the signal in the time domain has to be extrapolated (e.g. by zero filling) or cut off to meet that requirement. This has consequences for the resolution in the frequency domain, as this virtually expands or shortens the measurement time.
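The zero-filling step can be sketched as follows in Python/NumPy (an editorial illustration; numpy.fft itself accepts arbitrary lengths, so the padding below merely mimics the classical power-of-two requirement and refines the frequency grid).

```python
import numpy as np

# Zero filling a signal to the next power of two before an FFT.
t = np.linspace(0.0, 1.0, 1000, endpoint=False)     # 1000 points, not a power of 2
f = np.exp(-0.5 * ((t - 0.5) / 0.02) ** 2)           # a Gaussian peak

n_fft = int(2 ** np.ceil(np.log2(f.size)))           # 1024
F = np.fft.fft(f, n=n_fft)                           # zero filled to 1024 points
freq = np.fft.fftfreq(n_fft, d=t[1] - t[0])
print(n_fft, F.shape, freq[:3])
```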
40.4 Convolution

As a rule, a measurement is an imperfect representation of reality. Noise and other blur sources may degrade the signal. In the particular case of spectrometry a major source of degradation is peak broadening caused by the limited bandwidth of a monochromator. When a spectrophotometer is tuned to a wavelength λ0, neighbouring wavelengths also reach the detector, each with a certain intensity. The profile of these intensities as a function of the wavelength is called the slit function, h(λ). An example of a slit function is given in Fig. 40.15. This slit function is also called a convolution function. Under certain conditions, the shape of the slit function is a triangle symmetrical about λ0. The width at half-height is called the spectral band-width. When measuring a "true" absorbance peak with a half-height width not very much larger than the spectral band-width, the observed peak shape is disturbed.
Fig. 40.15. (a) Slit function (point-spread function) h(λ) for a spectrometer tuned at 304 nm. (b) f(λ) is the true absorbance spectrum of the sample.
The mechanism behind this disturbance is called convolution. Convolution also occurs when filtering a signal with an electronic filter or with a digital filter, as explained in Section 40.5.2. By way of illustration the spectrometry example is worked out. Two functions are involved in the process, the signal f(λ) and the convolution function h(λ). Both functions should be measured in the same domain and should be digitized with the same interval and at the same values (in spectrometry: λ-values). Let us furthermore assume that the spectrum f(λ) and the convolution function h(λ) have a simple triangular shape but with a different half-height width. Let us suppose that the true spectrum f(λ) (absorbance x 100) is known (Fig. 40.15b):
λ1 = 301 nm => f(301) = 0
λ2 = 302 nm => f(302) = 0
λ3 = 303 nm => f(303) = 0
λ4 = 304 nm => f(304) = 5
λ5 = 305 nm => f(305) = 10
λ6 = 306 nm => f(306) = 5
λ7 = 307 nm => f(307) = 0
For all other wavelengths the absorbance is equal to zero.
Let us also consider a convolution function h(λ), also called the point-spread function. This function represents the profile of the intensity of the light reaching the detector when the spectrometer is tuned at wavelength λ. If we assume that the slit function has the triangular form given in Fig. 40.15a and that the spectrometer has been tuned to a wavelength equal to 304 nm, then for a particular slit width the radiation reaching the detector is composed of the following contributions:
25% comes from λ = 304, or the relative intensity is 0.25
18.8% comes from λ = 303 and 305: 0.188
12.5% comes from λ = 302 and 306: 0.125
6.2% comes from λ = 301 and 307: 0.062
The relative intensities sum up to 1. None of the other wavelengths reaches the detector. When tuning the spectrometer to another wavelength, the centre of the convolution function is moved to that wavelength. If we encode the convolution function relative to the set point h(0), then we obtain the following discrete values (normalized to a sum = 1):
h(0) = 0.25
h(-1) = h(+1) = 0.188
h(-2) = h(+2) = 0.125
h(-3) = h(+3) = 0.062
h(-4) = h(+4) = 0
All values from h(-4) to h(-∞), and from h(+4) to h(+∞), are zero. Let us now calculate the signal g(304) which is measured when the spectrometer is tuned to λ = 304 nm:
g(304) = 0.25 f(304) + 0.188 f(305) + 0.188 f(303) + 0.125 f(306) + 0.125 f(302) + 0.062 f(307) + 0.062 f(301)
       = h(0) f(304) + h(1) f(305) + h(-1) f(303) + h(2) f(306) + h(-2) f(302) + h(3) f(307) + h(-3) f(301)
Because h(0) = h(304-304), h(1) = h(305-304), h(2) = h(306-304), h(-1) = h(303-304) and so on, the general expression for g(x) can be written in the following compact notation:

g(x) = Σ f(y) h(y - x)    (40.9)

for all x and y for which f(y) and h(y - x) are defined.
Fig. 40.16. Measured absorbance spectrum for the system shown in Fig. 40.15.
A shorthand notation for eq. (40.9) is

g(x) = f(x) * h(x)    (40.10)
where * is the symbol for convolution. Extension of the convolution to the wavelengths 301 to 307 nm yields the measured spectrum g(x) shown in Fig. 40.16. The broadening of the signal is clearly visible. One should note that signals measured in the frequency domain may also be a convolution of two signals. For instance, the periodic exponentially decaying signal shown in Fig. 40.13 is a convolution of a sine function with an exponential function.
An important aspect of convolution is its translation into the frequency domain and vice versa. This translation is known as the convolution theorem [7], which states that:
- Convolution in the time domain is equivalent to a multiplication in the frequency domain: g(x) = f(x) * h(x) <=> G(ν) = F(ν) H(ν), and
- Convolution in the frequency domain is equivalent to a multiplication in the time domain: G(ν) = F(ν) * H(ν) <=> g(x) = f(x) h(x).
From the convolution theorem it follows that the convolution of the two triangles in our example can also be calculated in the Fourier domain, according to the following scheme:
(1) Calculate F(ν) of the signal f(t).
(2) Calculate H(ν) of the point-spread function h(t).
(3) Calculate G(ν) = F(ν)H(ν): the real (Re) and imaginary (Im) transform coefficients are multiplied according to the multiplication rule of two complex numbers:
Fig. 40.17. Convolution in the time domain of f(t) with h(t) carried out as a multiplication in the Fourier domain. (a) A triangular signal (w1/2 = 3 data points) and its FT. (b) A triangular slit function h(t) (w1/2 = 5 data points) and its FT. (c) Multiplication of the FT of (a) with that of (b). (d) The inverse FT of (c).
Re(G(ν)) = Re(F(ν))Re(H(ν)) - Im(F(ν))Im(H(ν))
Im(G(ν)) = -(Re(F(ν))Im(H(ν)) + Im(F(ν))Re(H(ν)))
(4) Back-transform G(ν) to g(t).
These four steps are illustrated in Fig. 40.17, where two triangles (arrays of 32 data points) are convoluted via the Fourier domain. Because one should multiply Fourier coefficients at corresponding frequencies, the signal and the point-spread function should be digitized with the same time interval. Special precautions are needed to avoid numerical errors, the discussion of which is beyond the scope of this text. However, one should know that when f(t) and h(t) are digitized into sampled arrays of size A and B respectively, both f(t) and h(t) should be extended with zeros to a size of at least A + B. If (A + B) is not a power of two, more zeros should be appended in order to use the fast Fourier transform.
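The scheme can be sketched in a few lines of NumPy (an editorial illustration; the arrays reuse the triangular spectrum and slit function of the worked example, and the centring of h on the wavelength axis is ignored, so the result is simply the linear convolution of the two arrays).

```python
import numpy as np

# Convolution of a spectrum with a slit function carried out as a
# multiplication in the Fourier domain; both arrays are zero filled to
# at least A + B points, rounded up to a power of two.
f = np.array([0, 0, 0, 5, 10, 5, 0], dtype=float)                 # "true" spectrum
h = np.array([0.062, 0.125, 0.188, 0.25, 0.188, 0.125, 0.062])    # slit function

n = int(2 ** np.ceil(np.log2(f.size + h.size)))                   # >= A + B
G = np.fft.fft(f, n) * np.fft.fft(h, n)                           # multiplication rule
g = np.real(np.fft.ifft(G))                                       # back transform

print(np.round(g[:f.size + h.size - 1], 3))
print(np.round(np.convolve(f, h), 3))        # the same result, obtained in the time domain
```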
40.5 Signal processing

40.5.1 Characterization of noise

As said before, there are two main applications of Fourier transforms: the enhancement of signals and the restoration of the deterministic part of a signal. Signal enhancement is an operation for the reduction of the noise, leading to an improved signal-to-noise ratio. By signal restoration, deformations of the signal introduced by imperfections in the measurement device are corrected. These two operations can be executed in both domains, the time and the frequency domain. Ideally, any procedure for signal enhancement should be preceded by a characterization of the noise and of the deterministic part of the signal. Spectrum (a) in Fig. 40.18 is the power spectrum of "white noise", which contains all frequencies with approximately the same power. Examples of white noise are shot noise in photomultiplier tubes and thermal noise occurring in resistors. In spectrum (b), the power (and thus the magnitude of the Fourier coefficients) is inversely proportional to the frequency (amplitude ~ 1/ν). This type of noise is often called 1/f noise (f is the frequency).
Fig. 40.18. Noise characterisation in the frequency domain. The power spectrum |F(ν)| of three types of noise. (a) White noise. (b) Flicker or 1/f noise. (c) Interference noise.
Fig. 40.19. Baseline noise of a UV-Vis detector in HPLC.
It is caused by slow fluctuations of ambient conditions (temperature and humidity), the power supply, vibrations, the quality of chemicals, etc. This type of noise is very common in analytical equipment. An example is the power spectrum (Fig. 40.20) of the noise of a UV-Vis detector in HPLC (Fig. 40.19). In some cases the power spectrum may have peaks at some specific frequencies (see Fig. 40.18c). A very common source of this type of noise is the 50 Hz periodic interference of the power line. In reality most noise is a combination of the noise types described. Spectral analysis is a useful tool for assessing the frequency characteristics of these types of signal.
We first discuss signal enhancement in the time domain, which does not require a transform to the frequency domain. It is noted that all discrete signals should be sampled at uniform intervals.

40.5.2 Signal enhancement in the time domain

In many instances the quality of the signal has to be improved before the chemical information can be derived from it. One of the possible improvements is the reduction of the noise. In principle there are two options: the enhancement of the analog signal by electronic devices (hardware), e.g. an electronic filter, and the manipulation of the signal after digitization by computer, by a so-called digital filter.
Fig. 40.20. The power spectrum of the baseline noise given in Fig. 40.19.
Analytical equipment usually contains hardware to obtain a satisfactory signal-to-noise ratio. For example, in AAS the radiation of the light source is modulated by a light chopper to remove the noise introduced by the flame. The frequency of this chopper is locked into a lock-in amplifier, which passes only signals with a frequency equal to the frequency of the chopper. As a result, noise with frequencies other than the chopper frequency is eliminated, including the 1/f noise. An apparent advantage of digital filters over analog filters is their greater flexibility. When the original data points are stored in computer memory, the enhancement operation can be repeated under different conditions without the need to remeasure the signal. Finally, for the processing of a given data point, data points measured later are also available, which is intrinsically impossible when processing an analog signal. This advantage becomes clear in Section 40.5.2.3. Many instrument manufacturers apply digital devices in their equipment, or supply software to be operated from the PC for post-run data processing of the signal by the user, e.g. a digital smoothing step. To ensure a correct application, the principles and limitations of smoothing and filtering techniques are explained in the following sections.
Fig. 40.21. Averaging of 100 scans of a Gaussian peak.
40.5.2.1 Time averaging

Some instruments scan the measurement range very rapidly and with a great stability. NMR is such a technique. In this case the signal-to-noise ratio can be improved by repeating the scans (e.g. N times) and adding the corresponding data points. As a result, the magnitude of the deterministic part of the signal is multiplied by N, and the standard deviation of the noise by a factor √N (see Chapter 3). Consequently, the signal-to-noise ratio improves by a factor √N. There are clear limitations to this technique. First of all, the repeated scans must be sampled at exactly the same time values. Furthermore, one should be aware that in order to obtain an improvement of the signal-to-noise ratio by a factor √2, a doubling of the measurement time is required. The limiting factor then becomes the stability of the signal. The effect of a 100-fold averaging of a Gaussian peak (standard deviation of the noise is 10% of the peak maximum) is demonstrated in Fig. 40.21. The signal-to-noise ratio is improved without deformation of the signal.
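A minimal NumPy sketch of this effect (added for illustration; the peak shape, noise level and number of scans are arbitrary) is given below.

```python
import numpy as np

# Time averaging: adding N repeated scans improves the signal-to-noise
# ratio by a factor sqrt(N).
rng = np.random.default_rng(1)
t = np.arange(100)
peak = np.exp(-0.5 * ((t - 50) / 5.0) ** 2)          # Gaussian peak, height 1

n_scans = 100
scans = peak + rng.normal(scale=0.10, size=(n_scans, t.size))   # 10% noise per scan
average = scans.mean(axis=0)

print(np.std(scans[0] - peak))   # about 0.10 for a single scan
print(np.std(average - peak))    # about 0.01 after averaging 100 scans
```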
40.5.2.2 Smoothing by moving average

In the accumulation process explained in the previous section, data points collected during several scans and measured at corresponding time values are added. One could also consider accumulating the values of a number of data points in a small segment or window within the same scan. This is the principle of smoothing, which is explained in more detail below.
The simplest form of smoothing consists of using a moving average, which has been introduced in Chapter 7 for quality control. A window with an odd number of data points is defined. The values in that window are averaged, and the central point is replaced by that value. Thereafter, the window is shifted one data point by dropping the last one in the window and including the next (un-smoothed) measurement. The averaging process is repeated until all data points have been averaged. For the smoothing of a given data point, data points are used which were measured after this data point, which is not possible for analog filters. The expression for the moving average of point i is:

g_i = (1/(2m + 1)) Σ f_(i+j)    (summation over j = -m, ..., +m)

where g_i is the smoothed data point i, f_(i+j) is the original data point i + j, and (2m + 1) is the size of the smoothing window. The process of moving averaging (and, as we will see later, the process of smoothing in general) can be represented as a convolution. Consider, therefore, the data points f(1), f(2), ..., f(n). The moving average of point f(5) using a smoothing window of 5 data points is calculated as follows:

              f(1)  f(2)  f(3)  f(4)  f(5)  f(6)  f(7)  f(8) ...
multiply by:    0     0     1     1     1     1     1     0  ...
add and divide by 5:
[0 + 0 + f(3) + f(4) + f(5) + f(6) + f(7) + 0 + ...]/5

If we consider the zeros and ones to form a function h(t), and if we position the origin of that function, h(0), in the centre of the defined window, then the above process becomes:

f(t):   f(1)   f(2)   f(3)   f(4)   f(5)   f(6)   f(7)   f(8)  ...
h(t):   h(-4)  h(-3)  h(-2)  h(-1)  h(0)   h(+1)  h(+2)  h(+3) ...
         0      0      1      1      1      1      1      0    ...
t:      -4     -3     -2     -1      0      1      2      3    ...

giving g(5) = 1/5 Σ f(m)h(m - 5), or in general

g(t) = Σ f(m)h(m - t)/NORM    (40.11)

for all m for which f(m) and h(m - t) are defined. NORM is a factor to keep the integral of the signal constant.
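Equation (40.11) can be tried out directly; the following sketch (an editorial illustration with arbitrary data values) computes a 5-point moving average as a convolution with a block function h and NORM = 5.

```python
import numpy as np

# Moving average written as a convolution with a block function h (eq. 40.11).
f = np.array([2.0, 4.0, 6.0, 8.0, 6.0, 4.0, 2.0, 1.0, 1.0])
h = np.ones(5)                            # h(-2), ..., h(+2) all equal to 1
norm = h.sum()                            # NORM = 5

g = np.convolve(f, h, mode='same') / norm
print(np.round(g, 2))
# the central point g[4] equals (f[2] + f[3] + f[4] + f[5] + f[6]) / 5
print((f[2] + f[3] + f[4] + f[5] + f[6]) / 5)
```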
Equation (40.11) corresponds exactly to the expression of a convolution (eq. (40.9)) introduced earlier, demonstrating that the mechanism of smoothing indeed is equivalent to that of convolution. Thus eq. (40.11) can be rewritten as:
g(t) = f(t) * h(t)
(40.12)
As a consequence, a moving average in the time domain is a multiplication in the Fourier domain, namely: G(v) = F(v) H(v)
(40.13)
where H(ν) is the Fourier transform of the smoothing function. This operation is called filtering and H(ν) is the filter function. Filtering is further discussed in Section 40.5.3. For the moment it is important to realize that smoothing in the time domain has its complementary operation in the frequency domain and vice versa. Besides the improvement of the signal-to-noise ratio, smoothing also introduces two less desired effects, which are illustrated for a Gaussian peak. In Fig. 40.22a we show the effect of the smoothing of this peak with increasingly larger smoothing windows.
Fig. 40.22. Distortion (h/h0) of a Gaussian peak for various window sizes (indicated within parentheses). (a) Moving average. (b) Polynomial smoothing.
As one can see, the peak becomes lower and broader, but remains symmetrical. The peak remains symmetric because of the symmetry of the filter about its central point. Analog filters are by definition asymmetric, because they can only include data points at the left side of the data point being smoothed (see also Section 40.5.2.4 on exponential smoothing). The applicability of moving averaging, and of smoothing in general, depends on the degree of deformation associated with a certain improvement of the signal-to-noise ratio. Figure 40.22a shows the effect of a moving average applied to a Gaussian peak to which white random noise is added, as a function of the window size. Clearly, for window sizes larger than half the half-height peak width, the peak is broadened and the intensity drops. As a result, adjacent peaks may be less well resolved. Peak areas, however, remain unaffected. From Fig. 40.22a one can derive the maximally applicable window size (number of data points) to avoid peak deformation, given the scan speed and the digitisation rate. Suppose, for instance, that the half-height width of the narrowest peak in an IR spectrum is about 10 cm⁻¹, the digitization rate is 10 Hz and the scan rate is 2 cm⁻¹ per second. Hence, the half-height width is digitized in 50 data points. The largest smoothing window introducing only minor disturbances is therefore 25 data points. If the noise is white, the signal-to-noise ratio is improved by a factor of 5. Another unwanted effect of smoothing is the alteration of the frequency characteristics of the noise. This calls for caution. Because low frequencies present in the noise are not removed, the improvement of the signal-to-noise ratio may be limited. This is illustrated in Fig. 40.23, where one can see that after smoothing with a 25-point window low-frequency noise is left. In Section 40.5.3, filtering methods in the frequency domain are discussed, which are capable of removing specific frequencies from the noise.
Fig. 40.23. Polynomial smoothing (noise = N(0,3%)): 5-point; 17-point; 25-point smoothing window and the noise left after smoothing.
40.5.2.3 Polynomial smoothing

The convolution or smoothing function h(t) used in moving averaging is a simple block function. However, one could try and derive somewhat more complex convolution functions giving a better signal-to-noise ratio with less deformation of the underlying deterministic signal. Let us consider a smoothing window with 5 data points. Polynomial smoothing would then consist of fitting a polynomial model through the 5 data points by linear regression. In this case a polynomial with a degree from zero (horizontal line) up to 4 (no degrees of freedom left) can be chosen. In the latter case the model exactly fits the data points (no residuals) and has no effect on the noise and signal. The fit of a model with a degree equal to zero is equivalent to the moving average. Clearly, one should try and find an optimal value for the degree of the polynomial, given the size of the window, the shape of the deterministic signal and the characteristics of the noise. Unfortunately no hard rules are available for that purpose. Therefore, several polynomial models and window sizes should be tried out. The smoothing procedure consists of replacing the central data point in the window by the value obtained from the model, and repeating the fit procedure by shifting the window one data point until the whole signal has been scanned. For a signal digitized over 1000 data points, 996 regressions over 5 points have to be calculated. This would be very impractical and computing intensive. Savitzky and Golay [8] derived convolutes h(t) for each combination of degree of the polynomial and size of the window. The effect of a convolution of a signal with these convolutes is the same as fitting the signal with the corresponding polynomial in a moving window. For instance, for a quadratic model and a 5-point window the convolutes are (see Table 40.2):

h(-2) = -3; h(-1) = 12; h(0) = 17; h(+1) = 12; h(+2) = -3; else h(t) = 0

In order to keep the average signal amplitude unaffected, a scaling factor NORM is introduced, which is the sum of all convolutes, here 35. The smoothing procedure is now

g(t) = Σ f(m)h(m - t) / Σ h(m)

for all m for which f(m) is defined and h(m - t) ≠ 0. The effect of a 5-point, 17-point and 25-point quadratic smoothing of a Gaussian peak with 0.3% noise is shown in Fig. 40.22b. Peaks are distorted as well, but to a lesser extent than for moving averaging.
TABLE 40.2
Convolutes for quadratic and cubic smoothing (adapted from Refs. [8,10])

Points    25    23    21    19    17    15    13    11     9     7     5
 -12    -253
 -11    -138   -42
 -10     -33   -21  -171
 -09      62    -2   -76  -136
 -08     147    15     9   -51   -21
 -07     222    30    84    24    -6   -78
 -06     287    43   149    89     7   -13   -11
 -05     343    54   204   144    18    42     0   -36
 -04     387    63   249   189    27    87     9     9   -21
 -03     422    70   284   224    34   122    16    44    14    -2
 -02     447    75   309   249    39   147    21    69    39     3    -3
 -01     462    78   324   264    42   162    24    84    54     6    12
  00     467    79   329   269    43   167    25    89    59     7    17
  01     462    78   324   264    42   162    24    84    54     6    12
  02     447    75   309   249    39   147    21    69    39     3    -3
  03     422    70   284   224    34   122    16    44    14    -2
  04     387    63   249   189    27    87     9     9   -21
  05     343    54   204   144    18    42     0   -36
  06     287    43   149    89     7   -13   -11
  07     222    30    84    24    -6   -78
  08     147    15     9   -51   -21
  09      62    -2   -76  -136
  10     -33   -21  -171
  11    -138   -42
  12    -253
NORM    5175   805  3059  2261   323  1105   143   429   231    21    35
For a Gaussian peak the window size of a quadratic polynomial smoothing should now be less than 1.5 times the half-height width, compared to the value of 0.5 found for moving averaging. The effect on the noise reduction and on the frequencies left in the noise is comparable to that of the moving average filter. We refer to the work of Enke [9] for a detailed discussion of peak deformation versus signal-to-noise improvement under different circumstances.
Fig. 40.24. Polynomial smoothing: window of 7 data points fitted with polynomials of degrees 0, 1, 2, 3 and 4.
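The use of the convolutes of Table 40.2 can be sketched as follows (an editorial illustration; the test signal is arbitrary, and the reference to scipy.signal.savgol_filter is only a cross-check, not part of the original text).

```python
import numpy as np

# Quadratic/cubic Savitzky-Golay smoothing with the 5-point convolutes of
# Table 40.2: h = (-3, 12, 17, 12, -3), NORM = 35.
y = np.array([0.0, 0.1, 0.4, 1.2, 2.8, 4.1, 2.9, 1.3, 0.5, 0.1, 0.0])
h = np.array([-3.0, 12.0, 17.0, 12.0, -3.0])
norm = h.sum()                                   # 35

smoothed = np.convolve(y, h, mode='same') / norm
print(np.round(smoothed, 3))
# scipy.signal.savgol_filter(y, 5, 2) should give the same values away from the edges.
```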
Generally, polynomial smoothing is preferred over moving averaging, because larger windows are allowed before the signal is deformed. The convolutes h(t), adapted according to Steinier [10], are tabulated in Table 40.2 for several degrees of the smoothing polynomial and window sizes. Polynomial models of an even degree 2n and an odd degree 2n + 1 have the same value for the central point (Fig. 40.24). Therefore, the same convolutes and the same smoothing results are found for a quadratic and a cubic polynomial fit. In the same way as for moving averaging, polynomial smoothing can be represented in the frequency domain as a multiplication (eq. (40.13)). This aspect is further discussed in Section 40.5.4.

40.5.2.4 Exponential smoothing

The principle of exponential averaging has been introduced in Chapters 7 and 20, and is given by the following equation:

x̂_i = (1 - λ)x_i + λx̂_(i-1),    0 < λ < 1    (40.14)
which states that a data point i is smoothed by taking a weighted average of the measurement at time i and the smoothed previous data point at time i - 1. By tuning the weight λ, the smoothing can be varied from no smoothing at all (for λ = 0) to obtaining a constant value (for λ = 1), which remains unchanged regardless of the value of new incoming measurements. The effect of the exponential smoothing process is thus defined by a single parameter. Moreover, the smoothed data points can easily be calculated by hand. This simplicity and versatility are the reasons for the popularity of the filter. Exponential averaging is mainly applied to smooth time series or stochastic processes in which one wants to detect a drift or a deviation from the stationary state. The filter is less appropriate for smoothing analytical signals (e.g. spectra or chromatograms), as it introduces asymmetric deformations of the signal.
TABLE 40.3
Example of exponential smoothing according to x̂_i = (1 - λ)x_i + λx̂_(i-1) (λ = 0.2)

 i    x_i    (1 - λ)x_i + λx̂_(i-1)
 1     3     0.8(3) + 0.2(0)    = 2.4
 2     7     0.8(7) + 0.2(2.4)  = 6.08
 3     4     0.8(4) + 0.2(6.08) = 4.42
 4     6     0.8(6) + 0.2(4.42) = 5.68
 5     1     0.8(1) + 0.2(5.68) = 1.94
 6     8     0.8(8) + 0.2(1.94) = 6.79
 7     5     0.8(5) + 0.2(6.79) = 5.36
 8     2     0.8(2) + 0.2(5.36) = 2.67
 9     9     0.8(9) + 0.2(2.67) = 7.73
10     5     0.8(5) + 0.2(7.73) = 5.55
The algorithm and its simplicity are demonstrated in Table 40.3. The measurements are given in the first column; in the second column the smoothed data points are given for λ = 0.2. The exponential shape of the filter follows directly by elaborating eq. (40.14) for a few consecutive data points (see Table 40.4). From this table we can see that a smoothed data point at time i is a weighted average of all data points measured before, with weights that decay exponentially as λ^d (λ < 1), where d is the distance of a data point from the measurement to be smoothed. Such shapes are also found for electronic filters with a given time constant. The effect of exponential smoothing is visualized in the plot of the x_i and x̂_i values listed in Table 40.3 (Fig. 40.25).
TABLE 40.4
The exponential shape of the moving average filter (x̂_i = (1 - λ)x_i + λx̂_(i-1))

 i   Measurement   Smoothed value
 1   x_1           x̂_1 = (1 - λ)x_1
 2   x_2           x̂_2 = (1 - λ)x_2 + λx̂_1 = (1 - λ)x_2 + λ(1 - λ)x_1
 3   x_3           x̂_3 = (1 - λ)x_3 + λx̂_2 = (1 - λ)x_3 + λ(1 - λ)x_2 + λ²(1 - λ)x_1
 4   x_4           x̂_4 = (1 - λ)x_4 + λx̂_3 = (1 - λ)x_4 + λ(1 - λ)x_3 + λ²(1 - λ)x_2 + λ³(1 - λ)x_1
 5   x_5           x̂_5 = (1 - λ)x_5 + λx̂_4 = (1 - λ)x_5 + λ(1 - λ)x_4 + λ²(1 - λ)x_3 + λ³(1 - λ)x_2 + λ⁴(1 - λ)x_1
Fig. 40.25. Effect of exponential smoothing on the data points listed in Table 40.3 (solid line: original data; dotted line: smoothed data).
Fig. 40.26. Effect of exponential smoothing (λ = 0.6) on a Gaussian peak (w1/2 = 6 data points) (solid line: original data; bold line: smoothed data).
As one can see, the filter introduces a slower response to stepwise changes of the signal, as if it were measured with an instrument with a large response time. Because fluctuations are smoothed, the standard deviation of the signal is decreased, in this example from 2.58 to 1.95. A Gaussian peak is broadened and becomes asymmetric by exponential smoothing (Fig. 40.26).
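The recursion of eq. (40.14) is easily programmed; the sketch below (added for illustration) reproduces the hand calculation of Table 40.3 for λ = 0.2.

```python
import numpy as np

# Exponential smoothing according to eq. (40.14) with lambda = 0.2.
x = np.array([3.0, 7.0, 4.0, 6.0, 1.0, 8.0, 5.0, 2.0, 9.0, 5.0])
lam = 0.2

x_hat = np.zeros_like(x)
previous = 0.0                      # start value used in Table 40.3
for i, xi in enumerate(x):
    x_hat[i] = (1.0 - lam) * xi + lam * previous
    previous = x_hat[i]

print(np.round(x_hat, 2))   # 2.4, 6.08, 4.42, 5.68, 1.94, 6.79, 5.36, 2.67, 7.73, 5.55
```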
40.5.3 Signal enhancement in the frequency domain

Instead of smoothing the data directly in the domain in which they were acquired, the signal-to-noise ratio can also be improved by transforming the signal to the frequency domain and eliminating noise frequencies present in the measurements, after which one returns to the original domain. For instance, the power spectrum of the noise of a flame ionization detector (Fig. 40.27) reveals the presence of two dominant frequencies, namely at 2 and at 10 Hz. By substituting all Fourier coefficients at frequencies higher than 5 Hz by zero, all high-frequency noise is eliminated after back transforming to the time domain. This operation is called filtering, and because in this particular case low frequencies are retained, this filter is called a low-pass filter. Equally, one could define a high-pass filter by setting all low-frequency values equal to zero. Mathematically this operation can be described by the same equation (eq. (40.13)) as derived for polynomial smoothing, namely G(ν) = F(ν)H(ν), where H(ν) is the filter function, which is now defined in the frequency domain. Often used filter functions are:
Low-pass filter: H(ν) = 1 for all ν < ν0, else H(ν) = 0
High-pass filter: H(ν) = 1 for all ν > ν0, else H(ν) = 0
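Such a low-pass filter can be sketched as follows (an editorial illustration; the function name low_pass, the test signal and the cut-off value are chosen arbitrarily).

```python
import numpy as np

# Low-pass filtering in the frequency domain: Fourier coefficients above
# the cut-off frequency v0 are set to zero, then the signal is back-transformed.
def low_pass(signal, dt, v0):
    F = np.fft.fft(signal)
    freq = np.fft.fftfreq(signal.size, d=dt)
    F[np.abs(freq) > v0] = 0.0          # H(v) = 1 for |v| <= v0, else 0
    return np.real(np.fft.ifft(F))

t = np.arange(0, 10, 0.01)
rng = np.random.default_rng(2)
noisy = np.exp(-0.5 * ((t - 5) / 0.5) ** 2) + rng.normal(scale=0.1, size=t.size)
clean = low_pass(noisy, dt=0.01, v0=2.0)     # keep only frequencies below 2 Hz
print(noisy.std(), clean.std())
```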
Fig. 40.27. Power spectrum of the noise of a flame ionization detector.
ν0 is called the cut-off frequency. H(ν) is referred to in this context as the filter transfer function. Many other filter functions can be designed, e.g. an exponential or a trapezoidal function, or a band-pass filter. As a rule, exponential and trapezoidal filters perform better than cut-off filters, because an abrupt truncation of the Fourier coefficients may introduce artifacts, such as the annoying appearance of periodicities in the signal. The problem of choosing filter shapes is discussed in more detail by Lam and Isenhour [11], with references to a more thorough mathematical treatment of the subject. The expression for a band-pass filter is: H(ν) = 1 for ν_min < ν < ν_max, else H(ν) = 0. This filter is particularly useful for removing periodic disturbances of the signal. The effect of a low-pass filter applied to a Gaussian peak is shown in Fig. 40.28 for two cut-off frequencies. The lower the cut-off frequency of the filter, the more noise is removed. However, this increasingly affects the high frequencies present in the signal itself, causing a deformation. On the other hand, the higher the cut-off frequency, the more high-frequency noise is left in the signal. Thus the choice of the cut-off frequency is often a compromise between the noise one wants to eliminate and the deformation of the signal one can accept. The more similar the frequencies of noise and signal are, the more difficult it becomes to improve the signal-to-noise ratio.
Fig. 40.28. Effect of a low-pass filter. (a) Original Gaussian signal. (b) FT of (a). (c) Signal (a) filtered with ν0 = 10. (d) Signal (a) filtered with ν0 = 20.
For this reason it is very difficult to eliminate 1/f noise, because its power increases with lower frequencies.

40.5.4 Smoothing and filtering: a comparison

Filtering and smoothing are related and are in fact complementary. Filtering is more complicated because it involves a forward and a backward Fourier transform. However, in the frequency domain the noise and signal frequencies are distinguished, allowing the design of a filter that is tailor-made for these frequency characteristics. Polynomial smoothing is more or less a trial-and-error operation. It gives an improvement of the signal-to-noise ratio, but the best smoothing function has to be found empirically and there are no hard rules for doing so. However, because of its computational simplicity, polynomial smoothing is the preferred method of many instrument manufacturers. By calculating the Fourier transform of the smoothing convolutes derived by Savitzky and Golay, one can see that polynomial smoothing is equivalent to low-pass filtering. Figure 40.29 shows the Fourier transforms of the 5-point, 9-point, 17-point and 25-point second-order convolutes given in Table 40.2 (the frequency scale is arbitrarily based on 1024 data points sampled at 1 Hz).
Fig. 40.29. Fourier spectrum of second-order Savitzky-Golay convolutes. (a) 5-point. (b) 9-point. (c) 17-point. (d) 25-point (arrows indicate cut-off frequencies).
Another feature of polynomial smoothing is that smoothing and differentiation (first and second derivative) can be combined in a single step, which is explained in Section 40.5.5.

40.5.5 The derivative of a signal

Signals are differentiated for several purposes. Many software packages for chromatography and spectrometry offer routines for determining the peak position and for finding the up-slope and down-slope integration limits of a peak. These algorithms are based on the calculation of the first or second derivative. In NIRA, small differences between spectra are magnified by taking the first or second derivative of the spectra. Baseline drifts are eliminated as well. The simplest procedure to calculate a derivative is to take the difference between two successive data points. However, by this procedure the noise is magnified by several orders of magnitude, leading to unacceptable results. Therefore, the calculation of a derivative is usually linked to a smoothing procedure. In principle one could smooth the data first. This requires a double sweep through the data, the first one to smooth the data and the second one to calculate the derivative. However, the smoothing and differentiation can be combined into a single step. To explain this, we recall the way Savitzky and Golay derived the smoothing convolutes, by moving a window over the data and fitting a polynomial through the data in the window. The central point in the window is replaced by the value of the polynomial. Instead, one may replace it by the value of the first or second derivative of that polynomial in that point. Savitzky and Golay [8] published convolutes (corrected later on by Steinier [10]) for that operation (see Table 40.5). This procedure is the recommended method for the calculation of derivatives. Fig. 40.30 gives the second derivative of two noisy overlapping Gaussian peaks, obtained with a quadratic 7-point smoothed derivative. The two negative regions (shaded areas in the figure) reveal the presence of two peaks.

40.5.6 Data compression by a Fourier transform

Sets of spectroscopic data (IR, MS, NMR, UV-Vis) or other data are often subjected to one of the multivariate methods discussed in this book. One of the issues in this type of calculation is the reduction of the number of variables by selecting a set of variables to be included in the data analysis. The opinion is gaining support that a selection of variables prior to the data analysis improves the results. For instance, variables which are little or not correlated to the property to be modeled are disregarded. Another approach is to compress all variables into a few features, e.g. by a principal components analysis (see Section 31.1). This is called feature reduction.
TABLE 40.5
Convolutes for the calculation of the smoothed second derivative (adapted from Ref. [8])

Points    25    23    21    19    17    15    13    11     9     7     5
 -12      92
 -11      69    77
 -10      48    56   190
 -09      29    37   133    51
 -08      12    20    82    34    40
 -07      -3     5    37    19    25    91
 -06     -16    -8    -2     6    12    52    22
 -05     -27   -19   -35    -5     1    19    11    15
 -04     -36   -28   -62   -14    -8    -8     2     6    28
 -03     -43   -35   -83   -21   -15   -29    -5    -1     7     5
 -02     -48   -40   -98   -26   -20   -44   -10    -6    -8     0     2
 -01     -51   -43  -107   -29   -23   -53   -13    -9   -17    -3    -1
  00     -52   -44  -110   -30   -24   -56   -14   -10   -20    -4    -2
  01     -51   -43  -107   -29   -23   -53   -13    -9   -17    -3    -1
  02     -48   -40   -98   -26   -20   -44   -10    -6    -8     0     2
  03     -43   -35   -83   -21   -15   -29    -5    -1     7     5
  04     -36   -28   -62   -14    -8    -8     2     6    28
  05     -27   -19   -35    -5     1    19    11    15
  06     -16    -8    -2     6    12    52    22
  07      -3     5    37    19    25    91
  08      12    20    82    34    40
  09      29    37   133    51
  10      48    56   190
  11      69    77
  12      92
NORM   26910 17710 33649  6783  3876  6188  1001   429   462    42     7
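By way of illustration (not part of the original text), the sketch below applies the 7-point convolutes of Table 40.5 to a noise-free Gaussian peak; the test signal is arbitrary.

```python
import numpy as np

# Smoothed second derivative with the 7-point convolutes of Table 40.5:
# h = (5, 0, -3, -4, -3, 0, 5), NORM = 42 (quadratic/cubic polynomial).
y = np.exp(-0.5 * ((np.arange(50) - 25) / 4.0) ** 2)      # noise-free Gaussian peak
h = np.array([5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0])
norm = 42.0

second_derivative = np.convolve(y, h, mode='same') / norm
print(np.round(second_derivative[20:31], 4))   # negative between the inflection points
```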
Data may also be compressed by a Fourier transform or by one of the transforms discussed later in this chapter. This compression consists of taking the FT of the data and retaining the first n relevant Fourier coefficients. If the data are symmetrically mirrored about the first data point, the FT consists of only real coefficients, which facilitates the calculations (see Section 40.3.8).
Fig. 40.30. Smoothed second-derivative (window: 7 data points, second-order) according to Savitzky-Golay.
Figure 40.31 shows a spectrum of 512 data points, which is reconstructed from, respectively, the first 2, 4, 8, ..., 256 Fourier coefficients. The effect is more or less comparable to wavelength selection by deleting data points at regular intervals. As a consequence one loses high-frequency information. When the rows of a data table are replaced by the first n relevant Fourier coefficients, the properties of the data table are retained. For instance, the Fourier coefficients of the rows of a two-way data table of mixture spectra remain additive (distributivity property). Similarities and dissimilarities between rows (the objects) are retained as well, allowing the application of pattern recognition [12] and other multivariate operations [13,14].
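A possible implementation of this compression (an editorial sketch; the simulated spectrum and the numbers of retained coefficients are arbitrary, and the matching negative-frequency coefficients are kept so that the reconstruction stays real) is given below.

```python
import numpy as np

# Data compression by a Fourier transform: keep only the first n low-frequency
# complex Fourier coefficients and reconstruct the spectrum from them.
rng = np.random.default_rng(3)
spectrum = np.convolve(rng.normal(size=512), np.ones(25) / 25, mode='same')

def compress(y, n_coeff):
    F = np.fft.fft(y)
    F_kept = np.zeros_like(F)
    F_kept[:n_coeff] = F[:n_coeff]                 # positive frequencies (and the mean)
    F_kept[-n_coeff + 1:] = F[-n_coeff + 1:]       # matching negative frequencies
    return np.real(np.fft.ifft(F_kept))

for n in (4, 16, 64):
    error = np.linalg.norm(spectrum - compress(spectrum, n)) / np.linalg.norm(spectrum)
    print(n, round(error, 3))                      # reconstruction error decreases with n
```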
Fig. 40.31. Data compression by a Fourier transform. (a) A spectrum measured at 512 wavelengths; (b) the spectrum after reconstruction with 2, 4, ..., 256 Fourier coefficients.
40.6 Deconvolution by Fourier transform

In Section 40.4 we mentioned that the distortion introduced by instruments can be modeled by a convolution. Moreover, we demonstrated that noise filtering by either an analog or a digital filter is a convolution process. In some cases the distortion introduced by the measuring device may damage the signal so much that the wanted analytical information cannot be derived from the signal. For instance, in chromatography the peak broadening introduced during the elution process may cause peak overlap and hamper an accurate determination of the peak area. Therefore, one may want to remove the damage mathematically. This process of signal restoration is known as deconvolution or inverse filtering. Deconvolution is the inverse operation of convolution. While convolution is mathematically straightforward, deconvolution is more complicated. It requires an operation in the Fourier
domain and a careful design of the inverse filter. The basic deconvolution algorithm follows directly from eq. (40.13):

F(ν) = G(ν) / H(ν)
where G(ν) is the Fourier transform of the damaged signal, F(ν) is the FT of the recovered signal and H(ν) is the FT of the point-spread function. The back transform of F(ν) gives f(x). Thus a deconvolution requires the following three steps:
(1) Calculate the FT of the measured signal and of the point-spread function to obtain, respectively, G(ν) and H(ν).
(2) Divide G(ν) by H(ν) at corresponding frequency values (according to the rules for the division of two complex numbers), which gives F(ν).
(3) Back-transform F(ν), by which the undamaged signal f(x) is estimated.
The effect of deconvolution applied to a noise-free Gaussian peak is shown in Fig. 40.32a. Unfortunately, as can be seen in Fig. 40.32c, a deconvolution carried out in the presence of noise (s = 1% of the signal maximum) leads to no results at all. This is caused by the fact that two different kinds of damage are present simultaneously, namely signal broadening and noise.
Fig. 40.32. Deconvolution (result in solid line) of a Gaussian peak (dashed line) for peak broadening ((w1/2)psf/(w1/2)G = 1). (a) Without noise. (b) With coloured noise (N(0,1%), Tx = 1.5): inverse filter in combination with a low-pass filter. (c) With coloured noise (N(0,1%), Tx = 1.5): inverse filter without low-pass filter.
The model for the damaged signal, therefore, needs to be expanded to the following expression:

g(x) = f(x) * h(x) + n(x)

The Fourier transform G(ν) of g(x) is given by G(ν) = F(ν)H(ν) + N(ν). Thus the result of applying the inverse filter is:

G(ν)/H(ν) = F(ν) + N(ν)/H(ν)
The unacceptable results of Fig. 40.32c are caused by the term N(ν)/H(ν), which is large for high ν values (high frequencies). Indeed, H(ν) approaches zero for high frequencies whereas the value of N(ν) does not. The influence of the noise can be limited by combining the inverse filter with a low-pass noise filter, which removes all frequencies larger than a threshold value ν0. In this way one can avoid that the term N(ν)/H(ν) inflates to large values (Fig. 40.32b). We observe that the overall procedure consists of two contradictory operations: one which sharpens the signal by removing the broadening effect of the measuring device, and one which increases broadening because noise has to be removed. Consequently, the broadening effect of the measuring device can only be partially removed. In Section 40.7 we discuss other approaches, such as the Maximum Entropy method and the Maximum Likelihood method, which are less sensitive to noise. An essential condition for performing signal restoration by deconvolution is knowledge of the point-spread function (psf) h(x). In some instances h(x) can be postulated or be determined experimentally by measuring a narrow signal having a bandwidth which is at least 10 times narrower than the width of the point-spread function. The effect of deconvolution is very well demonstrated by the recovery of two overlapping peaks from a composite profile (see Fig. 40.33). The half-height width of the psf was 1.25 times the peak width for peak systems (a) and (b). For peak system (c) the half-height widths of the psf and the signal were equal. It is still possible to enhance the resolution when the point-spread function is unknown. For instance, the resolution is improved by subtracting the second derivative g''(x) from the measured signal g(x). Thus the signal is restored by a g(x) - (1 - a)g''(x) with 0 < a < 1. This algorithm is called pseudo-deconvolution. Because the second derivative of any bell-shaped peak is negative between the two inflection points (where the second derivative is zero) and positive elsewhere, the subtraction makes the top higher and narrows the wings, which results in a better resolution (see Fig. 40.30). Pseudo-deconvolution methods can correct for symmetric point-spread functions.
Fig. 40.33. Restoration of two overlapping peaks by deconvolution. Dashed line: measured data. Solid line: after restoration. Dotted line: difference between true and restored signals.
However, for asymmetric point-spread functions, e.g. the broadening introduced by a slow detector response, pseudo-deconvolution is not applicable.
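The combination of the inverse filter with a low-pass filter can be sketched as follows (an editorial illustration; peak widths, the noise level and the cut-off ν0 are arbitrary choices, and the point-spread function is assumed to be known exactly).

```python
import numpy as np

# Inverse filtering combined with a low-pass filter: F(v) = G(v)/H(v) is only
# evaluated for |v| <= v0, so that the term N(v)/H(v) cannot blow up.
def deconvolve(g, psf, dt, v0):
    G = np.fft.fft(g)
    H = np.fft.fft(psf)
    freq = np.fft.fftfreq(g.size, d=dt)
    F = np.zeros_like(G)
    keep = np.abs(freq) <= v0
    F[keep] = G[keep] / H[keep]              # inverse filter inside the pass band
    return np.real(np.fft.ifft(F))

dt = 0.1
t = np.arange(0, 12.8, dt)
true = np.exp(-0.5 * ((t - 6.4) / 0.3) ** 2)

x = (np.arange(t.size) - t.size // 2) * dt
psf = np.exp(-0.5 * (x / 0.3) ** 2)
psf = np.fft.ifftshift(psf / psf.sum())      # point-spread function centred at index 0

rng = np.random.default_rng(4)
measured = np.real(np.fft.ifft(np.fft.fft(true) * np.fft.fft(psf)))
measured += rng.normal(scale=0.005, size=t.size)

restored = deconvolve(measured, psf, dt, v0=1.0)
print(round(true.max(), 2), round(measured.max(), 2), round(restored.max(), 2))
```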
40.7 Other deconvolution methods

In previous sections a signal was enhanced or restored by applying a processing technique to the signal in order to remove damage from the data. Examples of damage are noise and peak broadening. We have also seen that removing noise and restoring peak broadening are two opposite operations. Filtering introduces broadening, whereas peak sharpening introduces noise. In Section 40.6 we mentioned
that the calculation of f(x) from the measured spectrum g(x) by solving g(x) = f(x)*h(x) by deconvolution is a compromise between restoration and filtering. Because of a lack of hard rules, the selection of the noise filter introduces some arbitrariness into the procedure. Another class of methods, such as Maximum Entropy, Maximum Likelihood and Least Squares Estimation, do not attempt to undo damage which is already in the data. The data themselves remain untouched. Instead, information in the data is reconstructed by repeatedly taking revised trial data f(x) (e.g. a spectrum or chromatogram), which are damaged as they would have been measured by the original instrument. This requires that the damaging process which causes the broadening of the measured peaks is known. Thus an estimate ĝ(x) is calculated from a trial spectrum f(x), which is convoluted with a supposedly known point-spread function h(x). The residuals e(x) = g(x) - ĝ(x) are inspected and compared with the noise n(x). Criteria to evaluate these residuals are Maximum Entropy (see Section 40.7.2) and Maximum Likelihood (Section 40.7.1).

40.7.1 Maximum Likelihood

The principle of Maximum Likelihood is that the spectrum f(x) is calculated which has the highest probability to yield the observed spectrum g(x) after convolution with h(x). Therefore, assumptions about the noise n(x) are made. For instance, the noise n_i in each data point i is random and additive with a normal or any other distribution (e.g. Poisson, skewed, exponential, ...) and a standard deviation s_i. In the case of a normal distribution the residual e_i = g_i - ĝ_i = g_i - (f*h)_i in each data point should be normally distributed with a standard deviation s_i. The probability that (f*h)_i represents the measurement g_i is then given by the conditional probability density function P(g_i|f):
exp
2s}
Under the assumption that the noise in point / is uncorrelated with the noise in pointy, the likelihood that (/*/i). for all measurements, /, represents the measured set gj, g2, ..., g^ is the product of all probabilities: (g,-(/*/^),)' 1=1
i=i
5,V27i
2s}
(40.15)
This likelihood function has to be maximized for the parameters in f. The maximization is to be done under a set of constraints. An important constraint is the knowledge of the peak-shapes. We assume that f is composed of many individual
558
Fig. 40.34. Signal restoration by a Maximum Likelihood approach.
peaks of known shape. However, we make no assumption about the number and position of the peaks. Because f is non-linear and contains many parameters to be estimated, the solution of eq. (40.15) is not straightforward and should be calculated in an iterative way by a sequential optimisation strategy. Figure 40.34 shows the kind of resolution improvement one obtains. Under the normality assumption the Maximum Likelihood and the least squares criteria are equivalent. Thus one can also minimize Zef by a sequential optimization strategy [15]. 40.7.2 Maximum Entropy Before going into detail in the meaning of entropy and maximum entropy, the effect of applying this principle for signal enhancement is shown in Fig. 40.35. As can be seen, the effect is a drastic improvement of the signal-to-noise ratio and an enhancement of the resolution. This effect is thus comparable to what is achieved by the Maximum Likelihood procedure and by inverse filtering. However, the Maximum Entropy technique apparently does improve the resolution of the signal without increasing noise. In physical chemistry, entropy has been introduced as a measure of disorder or lack of structure. For instance the entropy of a solid is lower than for a fluid, because the molecules are more ordered in a solid than in a fluid. In terms of probability it means also that in solids the probability distribution of finding a molecule at a given position is narrower than for fluids. This illustrates that entropy has to do with probability distributions and thus with uncertainty. One of the earliest definitions of entropy is the Shannon entropy which is equivalent to the definition of Shannon's uncertainty (see Chapter 18). By way of illustration we
559
(a)
Ctfcefnlwl Shift/ppm
(b)
c
t«
140 ChgrnTcalSKVl/ppm
Fig. 40.35. Signal restoration by the Maximum Entropy approach.
consider two histograms of 20 analytical results, one obtained with a precise method and one obtained with a less precise method (see Table 40.6). On average the method yields a result equal to 100, in the range of 85 to 115. According to Shannon, the uncertainty of the two methods can be expressed by means of: fi=-^Pi^0S2
Pi
i=\
where p^ is the probability to find a value in class /.
TABLE 40.6
Shannon's uncertainty for two probability distributions (a broad and a narrow distribution)

Intervals         85-90   90-95   95-100  100-105  105-110  110-115
Distribution 1      2       3       5        5        3        2
Probability        0.10    0.15    0.25     0.25     0.15     0.10
-log2 p_i          3.32    2.737   2.00     2.00     2.737    3.32
-p_i log2 p_i      0.332   0.411   0.50     0.50     0.411    0.332     H = 2.486

Distribution 2      0       2       8        8        2        0
Probability         0      0.10    0.40     0.40     0.10      0
-log2 p_i                  3.323   1.322    1.322    3.323
-p_i log2 p_i       0      0.332   0.529    0.529    0.332     0        H = 1.722
Application of this equation to the probability distributions given in Table 40.6 shows that H for the less precise method is larger than for the more precise method. Uniform distributions represent the highest form of uncertainty and disorder. Therefore, they have the largest entropy. We now apply the same principle to calculate the entropy of a spectrum (or any other signal). The entropy S of a spectrum given by the vector y is defined as

S = -Σ p_i log p_i    with p_i = |y_i| / Σ|y_i|
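This definition is illustrated by the following sketch (added by the editor; the example signals are arbitrary and mimic the cases of Table 40.7).

```python
import numpy as np

# Entropy of a signal with p_i = |y_i| / sum(|y_i|) and a base-2 logarithm.
def entropy(y):
    p = np.abs(y) / np.abs(y).sum()
    p = p[p > 0]                         # contributions with p_i = 0 are taken as 0
    return float(0.0 - (p * np.log2(p)).sum())

rng = np.random.default_rng(5)
noise = rng.normal(size=10)
spike = np.zeros(10); spike[4] = 1.5
two_spikes = np.zeros(10); two_spikes[4:6] = 1.0

print(round(entropy(noise), 3))   # high, close to log2(10) = 3.32
print(entropy(spike))             # 0.0: a single spike has no entropy
print(entropy(two_spikes))        # 1.0
```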
The entropy of a noise spectrum, with equal probability of measuring a certain amplitude at each wavelength, is maximal. When structure is added to the spectrum, the entropy decreases. Noise is associated with disorder, whereas structure means more order. In order to get a feeling for the meaning of entropy, we calculated the entropy of some typical spectra: a noise spectrum, the same noisy spectrum to which we added a spike, a noise-free spectrum with one spike, and one with two spikes (Table 40.7). As one can see, noise has the highest entropy, whereas a single spike has no entropy at all. It represents the highest degree of order. As indicated before, the maximum entropy approach does not process the measurements themselves. Instead, it reconstructs the data by repeatedly taking revised trial data (e.g. a spectrum or chromatogram), which are artificially corrupted with measurement noise and blur. This corrupted trial spectrum is thereafter compared with the measured spectrum by a χ²-test. From all accepted spectra the maximum entropy approach selects the spectrum f with minimal structure (which is equivalent to maximum entropy). The maximum entropy approach applied for noise elimination consists of the following steps:
TABLE 40.7
Entropy of noise, noise plus a spike at i = 5, a spike at i = 5, and two spikes at i = 5 and 6

       Noise                        Noise + spike                Single spike              Two spikes
 i     y      p_i   -p_i log2 p_i   y      p_i   -p_i log2 p_i   y     p_i  -p_i log2 p_i  y     p_i  -p_i log2 p_i
 1     0.43   0.141  0.398          0.43   0.095  0.324          0     0     0             0     0     0
 2    -0.41   0.135  0.418         -0.41   0.091  0.317          0     0     0             0     0     0
 3     0.34   0.112  0.361          0.34   0.075  0.280          0     0     0             0     0     0
 4     0.49   0.161  0.424          0.49   0.108  0.347          0     0     0             0     0     0
 5     0.01   0.003  0.025          1.50   0.331  0.531          1.5   1     0             1     0.5   0.5
 6     0.25   0.082  0.296          0.25   0.055  0.230          0     0     0             1     0.5   0.5
 7    -0.42   0.138  0.394         -0.42   0.093  0.320          0     0     0             0     0     0
 8    -0.04   0.013  0.081         -0.04   0.009  0.060          0     0     0             0     0     0
 9    -0.28   0.092  0.317         -0.28   0.062  0.250          0     0     0             0     0     0
10    -0.37   0.122  0.370         -0.37   0.082  0.297          0     0     0             0     0     0

Entropy        H = 3.084                   H = 2.956                   H = 0                     H = 1.0
(1) Start the procedure with a trial spectrum f1. If no prior knowledge is available on the spectrum, one starts the iteration process with a structureless noise spectrum. Indeed, in that case there is no evidence to assume a particular structure beforehand. However, prior knowledge may justify introducing some extra structure.
(2) Calculate the variance of the differences d between the measurements and the trial spectrum f1.
(3) Test whether the variance of these differences (s_d²) is significantly different from the variance of the measurement noise (s_n²) by a χ²-test: χ² = (n - 1)s_d²/s_n² (n data points). For large n, χ²_crit ≈ n.
(4) If the trial spectrum is significantly different from the measured spectrum, the trial spectrum is adapted into f2 = f1 + Δf, after which the cycle is repeated from step (2) (see e.g. [16] for the derivation of Δf) until the spectrum meets the χ² criterion.
(5) By repeating steps (1) to (4) with several 'noise' spectra, a set of spectra is obtained which meet the χ² criterion. All these spectra are marked as 'feasible' spectra.
(6) Finally, from the set of 'feasible' spectra the spectrum with the maximum entropy is selected.
The maximum entropy method thus consists of maximizing the entropy under the χ² constraint. An algorithm to maximize the entropy is the so-called Cambridge algorithm [16]. When the maximum entropy approach is used for signal restoration, a step has to be included between steps (1) and (2) in which the trial spectrum is first convoluted (see Section 40.4) with the point-spread function before calculating and testing the differences with the measured spectrum. The entropy of the trial spectrum before convolution is evaluated as usual.
40.8 Other transforms

40.8.1 The Hadamard transform

In Section 40.3.2 we mentioned that the Fourier coefficients A_n and B_n can be calculated by fitting eq. (40.1) to the signal f(t) by a least squares regression. This fit is represented in matrix notation in Fig. 40.36. The vector X represents the measurements, whereas A and B are vectors with, respectively, the real and imaginary Fourier coefficients. The columns of the two matrices are the sine and cosine functions with increasing frequency. These sines and cosines constitute a base of orthogonal functions. This representation also shows the resemblance of a FT to PCA. The measurement vector, which initially contains N features (e.g. wavelengths), is reduced to a vector with n < N features by a projection onto a smaller orthogonal sub-space defined by the n columns in the transform matrix. In PCA these n columns are the n principal components, and in FT these columns are sines and cosines. Depending on the properties of these columns, the scores have a specific meaning, which in the FT are the Fourier coefficients. In theory, any base of orthogonal functions can be selected to transform the data. A base which is related to the cosine and sine functions is a series of orthogonal block signals with increasing frequency (Fig. 40.37). Any signal can be decomposed into a series of block functions, which is called the Hadamard transform [17].
Fig. 40.36. Matrix representation of a Fourier transform.
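By way of illustration (not part of the original text), the sketch below decomposes a signal on the rows of a Hadamard matrix; scipy.linalg.hadamard is used as an assumed helper, and its 'natural' row ordering differs from the sequency ordering suggested by Fig. 40.37, so the retained coefficients are not strictly ordered by frequency.

```python
import numpy as np
from scipy.linalg import hadamard

# Decomposition of a signal on a base of orthogonal block functions
# (the rows of a Hadamard matrix), followed by truncation of the coefficients.
n = 512
t = np.arange(n)
signal = np.exp(-0.5 * ((t - 200) / 15.0) ** 2) + 0.5 * np.exp(-0.5 * ((t - 350) / 25.0) ** 2)

H = hadamard(n)                  # entries +1/-1, orthogonal: H @ H.T = n * I
coeffs = H @ signal / n          # Hadamard coefficients
print(np.allclose(signal, H.T @ coeffs))     # exact reconstruction from all coefficients

truncated = coeffs.copy()
truncated[64:] = 0.0             # keep only 64 of the 512 coefficients
approx = H.T @ truncated
print(round(np.linalg.norm(signal - approx) / np.linalg.norm(signal), 3))
```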
Fig. 40.37. A base of block signals.
Fig. 40.38. Spectrum given in Fig. 40.31 reconstructed with 2, 4, ..., 256 Hadamard coefficients.
For example, the IR spectrum (512 data points) shown in Fig. 40.31a is reconstructed from the first 2, 4, 8, ..., 256 Hadamard coefficients (Fig. 40.38). In analogy to spectrometers which directly measure in the Fourier domain, there are also spectrometers which directly measure in the Hadamard domain. Fourier and Hadamard spectrometers are called non-dispersive. The advantage of these spectrometers is that all radiation reaches the detector, whereas in dispersive instruments (using a monochromator) radiation of a certain wavelength (and thus with a lower intensity) sequentially reaches the detector.

40.8.2 The time-frequency Fourier transform

A common feature of the Fourier and Hadamard transforms is that they describe an overall property of the signal in the measurement range, t = 0 to t = T. However, one may be interested in local features of the signal. For instance, it may well be that at the beginning of the signal the frequencies are much higher than at the end of the signal, as Fig. 40.39 shows. This is certainly true when the signal contains noise and peaks with different peak widths. In Fig. 40.39 there are regions with a high, low and intermediate frequency. One way to detect these local features is to calculate the FT in a moving window of size Tw and to observe the evolution of the Fourier coefficients as a function of the position of the moving window.
Fig. 40.39. Signal with local frequency features.
Fig. 40.40. The moving-window FT principle.
In this example, when the centre of the window coincides with the position of one of the peaks, the low-frequency components are dominant, whereas in an area of noise the high frequencies become dominant. This means that peaks are detected by monitoring the Fourier coefficients as a function of the position of the moving window. The procedure of the moving FT is schematically shown in Fig. 40.40. At each position i of the window (size = Tw) a filter function h(i) is defined, by which the signal f(t) is multiplied before the FT is calculated. In general, this is expressed as follows:

F(ν,a) = F[f(t)h(a)]

where F is the symbol for the FT and a refers to the filter transfer function h(a). For each a, n Fourier coefficients are obtained, which can be arranged in a matrix of A and B coefficients:

A(1,0) ... A(1,n)   B(1,0) ... B(1,n)
A(2,0) ... A(2,n)   B(2,0) ... B(2,n)
  ...       ...       ...       ...
A(a,0) ... A(a,n)   B(a,0) ... B(a,n)
The columns of this matrix contain the time information (amplitudes at a specific frequency as a function of time) and the rows the frequency information.
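A moving-window FT can be sketched as follows (an editorial illustration; the Hanning taper, window size and step are arbitrary choices that are not prescribed by the text).

```python
import numpy as np

# Moving-window Fourier transform: the signal is multiplied by a window
# (filter function) at successive positions, and the Fourier coefficients
# A (real) and B (imaginary) are collected row by row.
def moving_window_ft(signal, window_size, step):
    taper = np.hanning(window_size)              # one possible filter function h
    rows_a, rows_b = [], []
    for start in range(0, signal.size - window_size + 1, step):
        segment = signal[start:start + window_size] * taper
        F = np.fft.rfft(segment)
        rows_a.append(F.real)
        rows_b.append(F.imag)
    return np.array(rows_a), np.array(rows_b)

t = np.arange(0, 10, 0.01)
# high frequency at the start, low frequency at the end
signal = np.where(t < 5, np.sin(2 * np.pi * 8 * t), np.sin(2 * np.pi * 1 * t))
A, B = moving_window_ft(signal, window_size=128, step=64)
print(A.shape, B.shape)      # one row of coefficients per window position
```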
40.8.3 The wavelet transform

Another transform which provides time and frequency information is the wavelet transform (WT). By analogy with the Fourier transform, the WT decomposes a signal into a set of basis functions, called a wavelet basis. In the FT the basis functions are the cosine and sine functions. The wavelet basis is also a function, called the analyzing wavelet. Frequently applied analyzing wavelets are the Morlet and Daubechies wavelets [18], of which the Haar wavelet is a specific member (Fig. 40.41). A series of wavelets is generated by stretching and shifting the wavelet over the data. The shift b is called a translation and the stretching or widening of the basis wavelet by a factor a is called a dilation. Suppose, for instance, that the analyzing wavelet is a function h(t). A series of wavelets h_ab(t) is then generated by introducing a translation b and a dilation a, according to

h_ab(t) = (1/√a) h((t - b)/a)
c)
A
Fig. 40.41. The Haar wavelet (a) and three Daubechies wavelets (b-d).
567
Fig. 40.42. A family of Morlet wavelets with various dilation values.
To transform measurements available in a discrete form a discrete wavelet transform (DWT) is applied. Condition is that the number of data is equal to 2". In the discrete wavelet transform the analyzing wavelet is represented by a number of coefficients, called wavelet filter coefficients. For instance, the first member (smallest dilation a and shift b = 0)of the Haar family of wavelets is characterized by two coefficients Cj = 1 and C2 = 1. The next one with a dilation 2a is characterized by four coefficients: Cj = 1, C2 = 1, C3 = 1 and C4 = 1. Generally, the wavelet member n is characterized by l"" coefficients. The widest wavelet considered is the one for which 2" is equal to N, the number of measurements. The value of n defines the level of the wavelet. For instance, forn = 2 the level 2 wavelet is obtained. For each level, a transform matrix is defined in which the wavelet filter coefficients are arranged in a specific way. For a signal containing eight data points (arranged in a 8x1 column vector) and level 1, the transform matrix has the following form: 0
0
0
C2
0 0
0
0
0
0
0
Cl
C2
0
0
0
0
0
0
c,
C-,
C)
C2
0
0
0 0
0
Cl
0
0
0
Multiplication of this 4x8 transformation matrix with the 8x1 column vector of the signal results in 4 wavelet transform coefficients or N/2 coefficients for a data vector of length A^. For Cj = C2 = C3 = C4 = 1, these wavelet transform coefficients are equivalent to the moving average of the signal over 4 data points. Consequently,
568
these wavelet filter coefficients define a low-pass filter (see Section 40.5.3), and the resulting wavelet transform coefficients contain the 'smooth' information of the signal. For this reason this set of wavelet filter coefficients is called the approximation coefficients and the resulting transform coefficients are the a-components. The transform matrix containing the approximation coefficients is denoted as the G-matrix. In the above example with 8 data points, the highest possible transform level is level 3 (8 non-zero coefficients). The result of this transform is the average of the signal. The level zero (1 non-zero coefficient) returns the signal itself Besides this first set of coefficients, a second set of filter coefficients is defined which is the equivalent of a high-pass filter (see Section 40.5.3) and describes the detail in the signal. The high-pass filter uses the same set of wavelet coefficients, but with alternating signs and in reversed order. These coefficients are arranged in the H-matrix. The H-matrix for the level two transform of a signal with length 8 is: Cl
-C\
0
0
0
0
C2
0
0
0
0
-C\
0 0
0 0
0 0
0 0
0
0
Cl
-C|
0
0
0
0
0
0
C-,
-c
The coefficients in the H-matrix are the detail coefficients. The output of the H-matrix are the ^-components. With Nil detail components and Nil approximation components, we are able to reconstruct a signal of length N. The discrete wavelet transform can be represented in a vector-matrix notation as: a = W'^f
(40.16)
where a contains N wavelet transform coefficients, W is an A^xA^ orthogonal matrix consisting of the approximation and detail coefficients associated to a particular wavelet and f is a vector with the data. The action of this matrix is to perform two related convolutions, one with a low-pass filter G and one with a high-pass filter H. The output of G is referred to as the smooth information and the output of H may be regarded as the detail information. By way of illustration we consider a sequence of a discrete sample of 16 points, taken from Walczak [19], F = [0 0.2079 0.4067 0.5878 0.7431 0.8660 0.9511 0.9945 0.9945 0.9511 0.8660 0.7431 0.5878 0.4067 0.2079 0.0] which is fitted with a Haar wavelet at level a\ We first define the 16x16 matrix W of the wavelet filter coefficients equal to:
569 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 - 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 .
Rows 1-8 are the approximation filter coefficients and rows 9-16 represent the detail filter coefficients. At each next row the two coefficients are moved two positions (shift b equal to 2). This procedure is schematically shown in Fig. 40.43 for a signal consisting of 8 data points. Once W has been defined, the a^ wavelet transform coefficients are found by solving eq. (40.16), which gives: 1 0 0 0 0 0 0 0 /1/2 1 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 1-10 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 -
0 ro.oooo" 0 0 0 0.5878 0 0 0 0.7431 0 0.8660 0 0.9511 1 0.9945 0 0.9945 0 0.9511 0 0.8660 0 0 0.7431 0 0 0.5878 0 0.4067 0 0.2079 1 [o.oooo_ 0.2079 0.4067
" 0.1470" 0.7032 1.1378 1.3757 1.3757 1.1378 0.7032 0.1470 -0.1470 -0.1281 -0.0869 -0.0307 0.0307 ' 0.0869 0.1281 . 0.1470 J
The factor •\ll/ 2 is introduced to keep the intensity of the signal unchanged. The 8 first wavelet transform coefficients are the a or smooth components. The last eight coefficients are the d or detail components. In the next step, the level 2 components are calculated by applying the transformation matrix, corresponding to the a^ level on the original signal. This a^ transformation matrix contains 4 wavelet filter
570
original data detail
approximation rn
m
Ti
rrm
iTTTi
m
1 ,
[,
\,
J
\
1
m
1 iTi
IJ 11FT]
• iJ Fig. 40.43. Waveforms for the discrete wavelet transform using the Haar wavelet for an 8-points long signal with the scheme of Mallat' s pyramid algorithm for calculating the wavelet transform coefficients.
coefficients, which is a doubUng of the width of the wavelet. This wavelet is shifted four positions instead of two at the previous level. In our example this leads to a transform matrix with four approximation rows and four detail rows with 16 elements. Multiplication of this matrix with the 16-points data vector results in a vector with four a and four d components. The level 2 coefficients a^ are equal to: ro.oooo" 02079 O4067
/1/4
1 1 0 0 0 0 0 0 i -1
1 1 0 0 0 0 0 0 1 -1
0 0 0
0 0 0
0 0 0
0 0 0
0 1 0 0
0 1 0 0
0 0 1 -1 0 0 0 0
0 1 0 0 0
0 1 0 0
0 1 -1 0 0 0 0
0 0 1 0
0 0 1 0
0 0 0 0 1 -1 0 0
0 0 1 0
0 0 1 0
0 0 0 0 1 -1 0 0
0 0 0 1
o]
05878 07431 08660 0.9511
0 0 0 1
0 0 0 1
0
0
0
0
09945
0 0 0 0 I -1
0 0 1
0 0
0.9511 O8660 07431
0 0 1 09945
-ij
05878 0.4067 02079
LoooooJ
O6012 1.7773 1.7773 O6012 -0.3922 -0.1682 01682 03933
571
The same result is obtained by multiplying the vector of a coefficients obtained in the previous step with an 8x8 a' level transform matrix:
ViTI
' 0.6012"
0
0
0 0
1 1 0 0 0 0 1 1
0
0 [0.1470' 0 0.7032
1.7773
0
0
0
0
0 1
1.1378
0
0 1
1.3757
0.6012
1 - 1 0 0 0 1 0 0 0 0 0 0
0
0
0
0
0
1.3757
-0.3933
0 0 1 0 0 1
0
1.1378 0.7032
-0.1682
1
1 0
0 0 0
0
0
- 1 0 0 0
1 0
0
- 1 L0.1470_
1.7773
0.1682 _ 0.3933
This is the principle of the pyramidal algorithm developed by Mallat [20], which is computationally more efficient. Continuing the calculations according to this algorithm, the four a components are input to a 4x4 a^ level transformation matrix, giving the level-3 components:
^n/2
1 1 0 0 ] [0.6012" 0 0 1 1 1.7773 1-10
0
1.7773
0 0 1 -ij [0.6012
1.6819" 1.6818 -0.8316 0.8316J
and finally the level-4 coefficients (a^) are calculated according to:
^fU2
1.6819
2.3785
1 - 1 1.6818
0.0000
1
1
Having a closer look at the pyramid algorithm in Fig. 40.43, we observe that it sequentially analyses the approximation coefficients. When we do analyze the detail coefficients in the same way as the approximations, a second branch of decompositions is opened. This generalization of the discrete wavelet transform is called the wavelet packet transform (WPT). Further explanation of the wavelet packet transform and its comparison with the DWT can be found in [19] and [21]. The final results of the DWT applied on the 16 data points are presented in Fig. 40.44. The difference with the FT is very well demonstrated in Fig. 40.45 where we see that wavelet a^ describes the locally fast fluctuations in the signal and wavelet a^ the slow fluctuations. An obvious application of WT is to denoise spectra. By replacing specific WT coefficients by zero, we can selectively remove
572
approximations
details
J ^ ^
2^^.
Fig. 40.44. Wavelet decomposition of a 16-point signal (see text for the explanation).
50
100
150
X'O
250
300
350
400
460
500
QI—•^'^—^^A^-*^ 50
100 150 200 250
50
100 150 200 250
0.2
V
-0.2
123
Hri 128
Fig. 40.45. Wavelet decomposition of a signal with local features.
noise from distinct areas in the signal without disturbing other areas [22]. Mittermayr et al. [23] compared the wavelet filters to Fourier filters and to polynomial smoothers such as the Savitzky-Golay filters. Wavelets have been applied to analyze signals arising from several areas, as acoustics [24], image processing [22], seismics [25] and analytical signals [23,26, 27]. Another obvious application is to use wavelets to detect peaks in a noisy signal. Each sudden change of the signal by the appearance of a peak results in a
573
wavelet coefficient at that position [27]. Recently, it has been shown that signals can be compressed to a fairly small number of coefficients without much loss of information. Bos et al. [26] applied this property to compress IR spectra by a factor of 20 prior to a classification by a neural net. Feature reduction by wavelet transform for multivariate calibration has been studied by Jouan-Rimbaud et al. [28]. References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19.
F.C. Strong III, How the Fourier transform infrared spectrophotometer works. J. Chem. Educ, 56(1979)681-684. R. Brereton, Tutorial: Fourier transforms. Use, theory and applications to spectroscopic and related data. Chem. Intell. Lab. Syst., 1 (1986) 17-31. R.N. Bracewell, The Fourier Transform and its Applications. 2nd rev. ed., McGraw-Hill, New York, 1986. G. Doetsch, Anleitung zum praktischen gebrauch der Laplace-transformation und der Ztransformation. R. Oldenbourg, Munchen 1989. K. Schmidt-Rohr and H.W. Spiess, Multidimensional Solid-state NMR and Polymers. Academic Press, London 1994, pp. 141. J.W. Cooley and J.W. Tukey, An algorithm for the machine calculation of complex Fourier series. Math. Comput., 19 (1965) 297-301. E.G. Brigham, The Fast Fourier Transform. Prentice-Hall, Englewood Cliffs NJ, 1974. A. Savitzky and M.J.E. Golay, Smoothing and differentiating of data by simplified leastsquares procedures. Anal. Chem., 36 (1964) 1627-1639. C.G. Enke and T.A. Nieman, Signal-to-noise ratio enhancement by least-squares polynomial smoothing. Anal. Chem., 48 (1976) 705A-712A. J. Steinier, Y. Termonia and J. Deltour, Comments on smoothing and differentiation of data by simplified least squares procedure. Anal. Chem., 44 (1972) 1906-1909. B. Lam and T.L. Isenhour, Equivalent width criterion for determining frequency domain cutoffs in Fourier transform smoothing. Anal. Chem., 53 (1981) 1179-1182. W. Wu, B. Walczak, W. Pennincks and D.L, Massart, Feature reduction by Fourier transform in pattern recognition of NIR data. Anal. Chim. Acta, 331 (1996) 75-83. L. Pasti, D. Jouan-Rimbaud, D.L. Massart and O.E. de Noord, Application of Fourier transform to multivariate caUbration of Near Infrared Data. Anal. Chim. Acta, 364 (1998) 253-263. W.F. McClure, A. Hamid, E.G. Giesbrecht and W.W. Weeks, Fourier analysis enhances NIR diffuse reflectance spectroscopy. Appl. Spectrosc, 38 (1988) 322-329. L.K. DeNoyer and J.G. Dodd, Maximum Likelihood deconvolution for spectroscopy and chromatography. Am. Lab., 23 (1991) D24-H24. E.D. Laue, M.R. Mayger, J. Skilling and J. Staunton, Reconstruction of phase-sensitive twodimensional NMR-spectra by maximum entropy. J. Magn. Reson., 68 (1986) 14-29. S.A. Dyer, Tutorial: Hadamard transform spectrometry. Chemom. Intell. Lab. Syst., 12 (1991) 101-115. I. Daubechies, Orthonormal bases of compactly supported wavelets. Comm. Pure Appl. Math., 41(1988) 909-996. B. Walczak and D.L.Massart, Tutorial: Noise suppression and signal compression using the wavelet packet transform. Chemom. Intell. Lab. Syst., 36 (1997) 81-94.
574 20. 21. 22. 23. 24. 25. 26. 27. 28.
I. Daubechies, S. Mallat and A.S. Willsky, Special issue on wavelet transforms and multiresolution signal analysis. IEEE Trans. Info Theory, 38 (1992) 529-531. B. Walczak, B. van den Bogaert and D.L. Massart, Application of wavelet packet transform in pattern recognition of near-IR data. Anal. Chem., 68 (1996) 1742-1747. S.G. Nikolov, H. Hutter and M. Grasserbauer, De-noising of SIMS images via wavelet shrinkage. Chemom. Intell. Lab. Syst., 34 (1996) 263-273. C.R. Mittermayr, S.G. Nikolov, H. Hutter and M. Grasserbauer, Wavelet denoising of Gaussian peaks: a comparative study. Chemom. Intell. Lab. Syst., 34 (1996) 187-202. R. Kronland-Martinet, J. Morlet and A. Grossmann, Analysis of sound patterns through wavelet transforms. Int. J. Pattern Recogn. Artif. Intell., 1 (1987) 273-302. P. Goupillaud, A. Grossmann and J. Morlet, Cycle-octave and related transforms in seismic signal analysis. Geoexploration, 23 (1984) 85-102. M. Bos and J. A.M. Vrielink, The wavelet transform for pre-processing IR spectra in the identification of mono- and di-substituted benzenes. Chemom. Intell. Lab. Syst., 23 (1994) 115-122. M. Bos and E. Hoogendam, Wavelet transform for the evaluation of peak intensities in flowinjection analysis. Anal. Chim. Acta, 267 (1992) 73-80. D. Juan-Rimbaud, B. Walczak, R.J. Poppi, O.E. de Noord and D.L. Massart, Application of wavelet transform to extract the relevant component from spectral data for multivariate calibration. Anal. Chem., 69 (1997) 4317-4323.
Additional recommended reading C.K. Chui, Introduction to Wavelets. Academic Press, Boston, 1991. D.N. Rutledge (Ed.), Signal Treatment and Signal Analysis in NMR. Elsevier, Amsterdam, 1996. B.K. Alsberg, A.M. Woodward and D.B. Kell, An introduction to wavelet transforms for chemometricians: a time-frequency approach. Chemom. Intell. Lab. Syst., 37 (1997) 215-239. F. Dondi, A. Betti, L. Pasti, M.C. Pietrogrande and A. FeHnger, Fourier analysis of multicomponent chromatograms — application to experimental chromatograms. Anal. Chem., 65 (1993) 2209-2222.
575
Chapter 41
Kalman Filtering 41.1 Introduction Linear regression and non-linear regression methods to fit a model through a number of data points have been discussed in Chapters 8, 10 and 11. In these regression methods all data points are collected first followed by the estimation of the parameters of a postulated model. The validity of the model is checked by a statistical evaluation of the residuals. Generally, the same weight is attributed to all data points, unless a weighing procedure is applied on the residuals, e.g., according to the inverse of the variance of the experimental error (see Section 8.2.3.2). In this chapter we discuss an alternative method of estimating the parameters of a model. There are two main differences from the regression methods discussed so far. First, the parameters are estimated during the collection of the data. Each time a new data point is measured, the parameters of the model are updated. This procedure is called recursive regression. Because at the beginning of the measurement-estimation process, the model is based on a few observations only, the estimated model parameters are imprecise. However, they are improved as the process of data collection and updating proceeds. During the progress of the measurement-estimation sequence, more data points are measured leading to more precise estimates. The second difference is that the values of the model parameters are allowed to vary during the measurement-estimation process. An example is the change of the concentrations during a kinetics experiment, which is monitored by UV-Vis spectrometry. Multicomponent analysis is traditionally carried out by measuring the absorbance of a sample at a number of wavelengths (at least equal to the number of components which are analyzed) and calculating the unknown concentrations (see Chapter 10). These concentrations are the parameters of the model. The basic assumption is that during the measurement of the absorbances of the sample at the selected wavelengths, the concentrations of the compounds in the mixture do not vary. However, if the measurements are carried out during a kinetics experiment, the concentrations may vary in time, and as a result the composition of the sample varies during the measurements. In this case, we cannot simply estimate the unknown concentrations by multiple linear regression as explained in Chapter 10. In order to estimate the concentrations of the
576
compounds as a function of time during the data acquisition, two models are needed, the Lambert-Beer model and the kinetics model. The Lambert-Beer model relates the measurements (absorbances) to the concentrations. This model is called the measurement model {A =J{a,c)). The kinetics model describes the way the concentrations vary as a function of time and is called the system model (c = J{k,t)). In this particular instance, the system model is an exponential function in which the reaction rate k is the regression parameter. The terms * systems' and 'states' are associated with the Kalman fdter. The system here is the chemical reaction and its state at a given time is the set of concentrations of the compounds at that time. The output of the system model is the state of the system. The measurement and system models fully describe the behaviour of the system and are referred to as the state-space model. Thus the system and measurement models are connected. The parameters of the measurement model (in our example, the concentrations of the reactants and the reaction product) are the dependent variables of the system model, in which the reaction rate is the regression parameter and time is the independent variable. Later on we explain that system models may also contain a stochastic part. It should be noted that this dual definition implies that two sets of parameters are estimated simultaneously, the parameters of the measurement model and the parameters of the systems model. In this chapter we introduce the Kalman filter, with which it is possible to estimate the actual values of the parameters of a state-space model, e.g., the rate of a chemical reaction from the evolution of the concentrations of a reaction product and the reactants as a function of time which in turn are estimated from the measured absorbances. Let us consider an experiment where a flow-through cell is connected to a reaction vessel, in which the reaction takes place. During the reaction, which may be fast with respect to the scan speed of the spectrometer, a spectrum is measured. At r = 0 the measurement of the spectrum is started, e.g., at 320 nm in steps of 2 nm per second. Every second (2 nm) estimates of the concentrations of the compounds are updated by the Kalman filter. When arrived at the end of the spectral range (e.g. 600 nm) we may continue the measurement process as long as the reaction takes place, by reversing the scan of the spectrometer. As mentioned before, the confidence in the estimates of the model parameters improves during the measurement-estimation process. Therefore, we want to prevent a new but deviating measurement to influence the estimated parameters too much. In the Kalman filter this is implemented by attributing a lower weight to new measurements. An important property of a Kalman filter is that during the measurement and estimation process, regions of the measurement range can be identified where the model is invalid. This allows us to take steps to avoid these measurements affecting the accuracy of the estimated parameters. Such a filter is called the adaptive Kalman fdter. An increasing number of applications of the Kalman filter
577
has been published, taking advantage of the formulation of a systems model which describes the dynamic behaviour of the parameters in the measurement model and exploiting the adaptive properties of the filter. In this chapter we discuss the principles of the Kalman filter with reference to a few examples from analytical chemistry. The discussion is divided into three parts. First, recursive regression is applied to estimate the parameters of a measurement equation without considering a systems equation. In the second part a systems equation is introduced making it necessary to extend the recursive regression to a Kalman filter, and finally the adaptive Kalman filter is discussed. In the concluding section, the features of the Kalman filter are demonstrated on a few applications.
41.2. Recursive regression of a straight line Before we introduce the Kalman filter, we reformulate the least-squares algorithm discussed in Chapter 8 in a recursive way. By way of illustration, we consider a simple straight line model which is estimated by recursive regression. Firstly, the measurement model has to be specified, which describes the relationship between the independent variable x, e.g., the concentrations of a series of standard solutions, and the dependent variable, 3;, the measured response. If we assume a straight line model, any response y^ is described by: yi = ^0 + ^1 ^i + ^i
bg and b^ are the estimates of the true regression parameters % and Pj, calculated by linear regression and e^ is the contribution of measurement noise to the response. In matrix notation the model becomes: y. = xj h + e^ where x- is a [2x1] column vector 1
and b is [2x1] column vector
h A. A recursive algorithm which estimates PQ and pj has the following general form: New estimate = Previous estimate + Correction
578
After each new observation, the estimates of the model parameters are updated (= new estimate of the parameters). In all equations below we treat the general case of a measurement model with p parameters. For the straight line model /? = 2. An estimate of the parameters b based ony - 1 measurements is indicated by b(/ - 1). Let us assume that the parameters are recursively estimated and that an estimate b(/ - 1) of the model parameters is available fromj - 1 measurements. The next measurement y(j) is then performed at x(/), followed by the updating of the model parameters to b(/). The first step of the algorithm is to calculate the innovation (I), which is the difference between measured y(j) and predicted response y(j) at x(/). Therefore, the last estimate b(/ - 1) of the parameters is substituted in the measurement model in order to forecast the response y(j), which is measured at x(j):
y(j) = bo(j-l) + b,(j-l)x(j)or >;(/•) = xT(/*)b(/'-l)
(41.1)
The innovation /(/) (not to be confused with the identity matrix I) is the difference between measured and predicted response at x(j). Thus /(/) = y(j) - yij)The value of the innovation is used to update the estimates of the parameters of the model as follows: b(7) = b ( 7 - l ) + k ( ; ) / ( j ) px\
px\
pxl
(41.2)
1x1
Equation (41.2) is the first equation of the recursive algorithm. k(/) is a/7xl vector, called the gain vector. Looking in more detail at this gain vector, we see that it weights the innovation. For k equal to the null vector, b(/) = b(/ - 1), leaving the model parameters unadapted. The larger the k, the more weight is attributed to the innovation and as a consequence to the last observation. Therefore, one may intuitively understand that the gain vector depends on the confidence we have in the estimated parameters. As usual this confidence is expressed in the variancecovariance matrix of the parameters. Because this confidence varies during the estimation process, it is indicated by P(/) which is given by:
nj)-One expects that during the measurement-prediction cycle the confidence in the parameters improves. Thus, the variance-covariance matrix needs also to be updated in each measurement-prediction cycle. This is done as follows [1]: P(7)=P(;-l)-k(;)xT(7)P(7-l) pxp
pxp
pxl
Ixp
pxp
(41.3)
579
Equation (41.3) is the second equation of the recursive filter, which expresses the fact that the propagation of the measurement error depends on the design (X) of the observations. Once the fitting process is complete the square root of the diagonal elements of P give the standard deviations of the parameter estimates. Key factor in the recursive algorithm is the gain vector k, which controls the updating of the parameters as well as the updating of the variance-covariance matrix P. The gain vector k is the most affected by the variance of the experimental error r(j) of the new observation y(j) and the uncertainty PO - 1) in the parameters b(/ - 1). When the elements of PO* - 1) are much larger than r(/), the gain factor is large, otherwise it is small. After each new measurement j the gain vector is updated according to eq. (41.4) k(7) = P ( y - l ) x ( y ) ( x T ( y ) P ( y - l ) x ( ; ) + r ( ; ) ) - i pxl
pxp
pxl
\xp
pxp
px\
(41.4)
1x1
The expression x^(/)P(/ - l)x(/) in eq. (41.4) represents the variance of the predictions, y(j), at the value x(j) of the independent variable, given the uncertainty in the regression parameters P(/). This expression is equivalent to eq. (10.9) for ordinary least squares regression. The term r(j) is the variance of the experimental error in the response y(j). How to select the value of r(j) and its influence on the final result are discussed later. The expression between parentheses is a scalar. Therefore, the recursive least squares method does not require the inversion of a matrix. When inspecting eqs. (41.3) and (41.4), we can see that the variancecovariance matrix only depends on the design of the experiments given by x and on the variance of the experimental error given by r, which is in accordance with the ordinary least-squares procedure. Typically, a recursive algorithm needs initial estimates for b(0) and P(0) to start the iteration process and an estimate of r(j) for all j during the measurementestimation process. When no prior information on the regression parameters is available b(0) is usually set equal to zero. In many textbooks [1], it is recommended to choose for P(0) a diagonal matrix with very large diagonal elements, expressing the large uncertainty in the chosen starting values b(0). As explained before, r(j) represents the variance of the experimental error in observation y(j). Although the choice of P(0) and r(j) introduces some arbitrariness in the calculation, we show in an example later on that the gain vector (k) which fully controls the updating of the estimates of the parameters is fairly insensitive for the chosen values r(/) and P(0). However, the obtained final value of P depends on a correct estimation of r(j). Only when the value of r(j) is equal to the variance of the experimental error, does P converge to the variance-covariance matrix of the estimated parameters. Otherwise, no meaning can be attributed to P. In summary, after each new measurement a cycle of the algorithm starts with the calculation of the new gain vector (eq. (41.4)). With this gain vector the variance-
580
Iteration step
Fig. 41.1. Evolution of the innovation during the recursive estimation process (iteration steps) (see Table 41.1).
covariance matrix (eq. (41.3)) is updated and the estimates of the parameters are updated (eq. (41.2)). By monitoring the innovation sequence, the stability of the iteration process can be followed. Initially, the innovation shows large fluctuations, which gradually fade out to the level expected from the measurement noise (Fig. 41.1). An innovation which fails to converge to the level of the experimental error indicates that the estimation process is not completed and more observations are required. However, P(0) also influences the convergence rate of the estimation process as we show with the calibration example discussed below. By way of illustration, the regression parameters of a straight line with slope = 1 and intercept = 0 are recursively estimated. The results are presented in Table 41.1. For each step of the estimation cycle, we included the values of the innovation, variance-covariance matrix, gain vector and estimated parameters. The variance of the experimental error of all observations y is 25 10"^ absorbance units, which corresponds to r = 25 10"^ au for a l l / The recursive estimation is started with a high value (10^) on the diagonal elements of P and a low value (1) on its off-diagonal elements. The sequence of the innovation, gain vector, variance-covariance matrix and estimated parameters of the calibration lines is shown in Figs. 41.1^1.4. We can clearly see that after four measurements the innovation is stabilized at the measurement error, which is 0.005 absorbance units. The gain vector decreases monotonously and the estimates of the two parameters stabilize after four measurements. It should be remarked that the design of the measurements fully defines the variance-covariance matrix and the gain vector in eqs. (41.3) and (41.4), as is the case in ordinary regression. Thus, once the design of the experiments is chosen
581 TABLE 41.1 Recursive estimation of the parameters BQ and Z?^ of a straight line (see text for symbols; y are the measurements). InitiaUsation: bT(0) = [0 0]; r = 25 10-^; P^^(0) = P2,2(^)= 1000; Fi 2CO) = PiA^) = 1 J
x(/')
y(j)
y(j)
I(j)
h(j)
1
m
1 0.1
0.101
0
0.101
0.09999 0.01010
0.99000 0.09998
2
1 0.2
0.196
0.102
0.094
0.0060 0.9500
3
1 0.3
0.302
0.291
0.011
4
1 0.4
0.403
0.401
5
1 0.5
0.499
6
1 0.6
0.608
7
1 0.7
0.703
0.706
8
1 0.8
0.801
9
1 0.9
10 1 1.0
P(/") 9.8990 -98.99
-98.99 989.9
-1.0000 9.99995
1.37 10"" -8.09 lO""
-8.16 10-" 5.249 10"'
-0.0024 1.0072
-0.7307 5.2029
-6.0 10"' -2.61 10-"
-2.62 10-" 1.303 10"'
-0.002
-0.0032 1.0138
-0.5254 3.0754
3.73 10-' -1.26 10-"
-1.26 10-" 5.06 10^
0.504
-0.005
-0.0012 1.0042
-0.4085 2.0236
2.68 10-' -7.4 10"'
-7.41 10"' 2.49 10""
0.601
0.007
-0.0035 1.0138
-0.334 1.433
2.10 10-' -4.88 10'
-4.89 10"'
-0.003
-0.0025 1.0104
-0.2838 1.0694
1.7 10"' -3.5 10'
-3.5 10"' 8.78 10"'
0.806
-0.005
-0.0014 1.0064
-0.2468 0.8290
1.5 10' -2.6 10'
-2.6 10"' 5.84 10'
0.897
0.904
-0.007
-0.0002 1.0015
-0.2185 0.6618
1.27 10' -2.02 10'
-2.02 10"' 4.08 10"'
1.010
1.002
0.0082 -0.0014 1.0060
-0.1962 0.5408
1.13 10' -1.62 10'
-1.62 10' 2.97 10"'
1.41 10^
(x(j),j = 1, ..., n), one can predict how the variance-covariance matrix behaves during the iterative process and decide on the number of experiments or decide on the design itself. This is further explained in Section 41.3. The relative insensitivity of the estimation process for the initial value of P is illustrated by repeating the calculation with the diagonal elements P(l,l) and P(2,2) of P set equal to 100 instead of equal to 1000 (see Table 41.2). As can be seen from Table 41.2 the gain vector, the variance-covariance matrix and the estimated regression parameters rapidly converge to the same values. Also using unrealistically high values for the experimental error (e.g. r(j) = 1) does not affect the convergence of the gain factors too much as long as the diagonal elements of P remain high. However, we also
582
Iteration number
Fig. 41.2. Gain factor during the recursive estimation process (see Table 41.1). lt+4 1E+2
b 1E+0 \ \
P(2,2)
1E-2
\pair'"---~.^___^ 1E-4
1
8
9
10
Iteration number
Fig. 41.3. Evolution of the diagonal elements of the variance-covariance matrix (P) during the estimation process (see Table 41.1).
observe that P no longer represents the variance-covariance matrix of the regression parameters. If we start with a high r(j) value with respect to the diagonal elements of P(0) (e.g. 1:100), assuming a large experimental error compared to the confidence in the model parameters, the convergence is slow. This is indicated by comparing the innovation sequence for the ratio r(j) to P(0) equal to 1:1000 and 1:100 in Table 41.2. In recursive regression, new observations steadily receive a lower weight, even when the variance of the experimental error is constant (homoscedastic). Consequently, the estimated regression parameters are generally not exactly equal to the values obtained by ordinary least squares (OLS).
583 Q.
i
1
^—"
o 1 Q)
C
"(0 o 0) (A
0.8 0.6
:
0.4
-
•
slope
1 / /
0.2 u
intercept
^ / ^ ,,. 1
1
,1
1
,
1
, 1.
,
1
1
1
10 Iteration number
Fig. 41.4. Evolution of the model parameters during the iteration process (see Table 41.1).
This straight line example exemplifies the fact that by exploiting the recursive approach alternative calibration practices are possible. The standard practice in calibration is to measure calibration standards prior to the analysis of the unknown samples and to construct a calibration line. Thereafter one starts the analysis of a sample batch. To check whether the analysis system performs well, each batch of samples is preceded by a check sample with a known concentration. The results of the check sample are plotted in a control chart (see Chapter 7). When the value exceeds the warning or action limits, one may decide to re-establish the calibration line by remeasuring a new set of calibration standards. This procedure disregards the knowledge on the calibration factors already available from the previous calibration run. By applying the recursive algorithm, it is in principle sufficient to remeasure a limited set of calibration standards and to update the estimates of the calibration factors already available from the previous calibration step. However, as can be seen from the example given in Table 41.1, this would mean that the weight attributed to the new measurements steadily decreases (small gain vector). This leads to the undesirable situation where the calibration factors are hardly updated. However, it should be bome in mind that the reason to decide to recalibrate was that during the measurement of the unknown samples, calibration factors may be changed. As a result, the uncertainty in the calibration factors were judged to be unacceptably large, requiring recalibration. Expressed in terms of a system, the system state (here slope and intercept) has not been observed for some time. Because this system state varies in time (in an unknown deterministic or stochastic way), the uncertainty about the system state increases and the state has to be observed again. This uncertainty is taken into account in the Kalman filter by the system equation, which is discussed in the next section. One of the methods to
584 TABLE 41.2 Influence of initial values of the diagonal values of the variance-covariance matrix P(0) and the variance of the experimental error on the gain vector and the Innovation sequence (see Table 41.1 for the experimental values, y) Pu(0)
r=\
r=25 10-6 kl 1
Pi,i(0) == P2 2(0) = 100
= P2,2<[0)=1000
k2
I
kl
r=l
r = 25 10"^ k2
kl
I
0.99 0.10
0.101
0.98 0.10
0.101
0.99
k2
kl
I
0.11
0.101
k2
0.98 0.11
I 0.101
2
-1.0
10.0
0.094
-0.75 8.31
0.094
-1.0
9.99
0.094
0.00 3.33
0.095
3
-0.73 5.20
0.011
-0.62 4.76
0.035
-0.66
4.99
0.011
-0.33 3.31
0.105
4
-0.53 3.08
0.002
-0.48 2.94
0.011
-0.50
2.99
0.002
-0.37 2.48
0.069
0.0005
-0.40
1.99 -0.045
-0.34 1.80
0.034
-0.33 1.42
-0.010
-0.33
1.42
0.007
-0.30 1.34
0.034
-0.003
-0.28
1.07
-0.001
-0.28
1.07 -0.003
-0.27 1.03
0.016
b
-0.41 2.02
-0.005
6
-0.33
1.43
0.007
7
-0.28
1.07
-0.39
1.98
8
-0.24 0.83
-0.005
-0.25 0.83
-0.003
-0.25
0.83 -0.004
-0.24 0.81
0.009
9
-0.22 0.66
-0.007
-0.22 0.66
-0.006
-0.22
0.66 -0.007
-0.21 0.65
0.003
10
-0.19 0.54
0.008
-0.20 0.54
0.009
-0.20
0.54
-0.19 0.54
0.016
Pi.i
P2.2
Pu
PKI
^2.2
0.008
P2,2
Pu
P2,2
1
9.9
989.9
9.9
989.9
0.988
98.8
1.96
98.8
2
1.4 E-4
5.2 E-3
4.2
166
1.2 E-4
4.9 E-3
1.96
65.5
3
6.0 E-5
1.3 E-3
2.23
47.5
5.7 E-5
1.2 E-3
1.64
32.8
4
3.7 E-5
5.0 E-4
1.47
19.6
3.7 E-5
4.9 E-4
1.27
16.5
5
2.7 E-5
2.5 E-4
1.09
9.89
2.7 E-5
2.5 E-4
1.01
9.01
6
2.1 E-5
1.4 E-4
0.86
5.67
2.1 E-5
1.4 E-4
0.82
5.37
7
1.7 E-5
8.8 E-5
0.71
3.56
1.8 E-5
8.8 E-5
0.69
3.43
8
1.5 E-5
5.8 E-5
0.60
2.37
1.5 E-5
5.9 E-5
0.59
2.31
9
1.3 E-5
4.1 E-5
0.52
1.66
1.3 E-5
4.1 E-5
0.52
1.63
10
1.1 E-5
3.0 E-5
0.47
1.21
1.1 E-5
3.0 E-5
0.46
1.19
assess this uncertainty as a function of time is by monitoring (e.g., in a Shewhart control chart) the value of the calibration factors obtained by a traditional calibration procedure, and modelling them by, e.g., an autoregressive model (see Section 20.4). However, because analytical chemists are unfamiliar with this type of filter, and the recursive approach is not simple, Kalman filters are hardly applied in calibration.
585
41.3 Recursive multicomponent analysis At this point we introduce the formal notation, which is commonly used in literature, and which is further used throughout this chapter. In the new notation we replace the parameter vector b in the calibration example by a vector x, which is called the state vector. In the multicomponent kinetic system the state vector x contains the concentrations of the compounds in the reaction mixture at a given time. Thus x is the vector which is estimated by the filter. The response of the measurement device, e.g., the absorbance at a given wavelength, is denoted by z. The absorbtivities at a given wavelength which relate the measured absorbance to the concentrations of the compounds in the mixture, or the design matrix in the calibration experiment (x in eq. (41.3)) are denoted by h^. Because zij) = zij) + v(/) and zij) = h^(j)x(j - 1 ) , the measurement equation given by eq. (41.1) is transformed into: z{j) = h^j)x(j^ \xp
1x1
l)+v(7) for; = 0,1, 2,... pxl
(41.5)
1x1
where v(j) includes the random measurement error (not to be confounded with r(j) which is the variance of the random measurement error) and the systematic error caused by the bias in model parameters, x(/ - 1). The equations for the variancecovariance update and Kalman gain update are adapted accordingly. The overall sequence of the equations is given in Table 41.3. It should be noted that here the state vector x is time-invariant. TABLE 41.3 Kalman filter algorithm equations for time-invariant system states Initialisation: x(0), P(0) Kalman gain update k ij) = P ( ; - l)h (;) (h'^ (;•) P(y - l)h (;) + K;))"^ px\
pxp pxl
Ixp
(41.6)
pxp pxl
Variance-covariance update PO; = PO"-l)-kWh'^0-)PO'-l) pxp
pxp pxl
(41.7)
Ixp pxp
Measurement zij) State update x(j)=xU-l)-^k(j)(z{j)-h^ (j)xU-l)) pxl pxl 1x1 pxl 1x1 pxl
(41.8)
586 TABLE 41.4 Absorptivities of CI2 and Br2 in chloroform [2] Wavenumber cm-'/103
Absorptivities (>a 03) CI2
22
Absorbance of sample Br2
4.5
168
0.0341
24
8.4
211
0.0429
26
20.0
158
0.0335
28
56
30
0.0117
30
100
32
71
4.7 5.3
0.0110 0.0080
By way of illustration we work out the two-component analysis of chlorinebromine by UV-Vis spectrometry. In Table 41.4 the absorptivities of CI2 and Br2 in chloroform at six wavenumbers are given [2], with the absorbances measured for a supposedly unknown sample with concentrations (CQ^ = 0.1 and c^^. = 0.2). Concentrations of CI2 and Br2 are estimated by recursive regression. With this example we want to demonstrate that the evolution of the estimations of the unknown concentration depends on the design of the experiments, here the set of wavelengths and the order in which these wavelengths are selected. A measure for the precision of the estimation process is the variance-covariance matrix P. As said before, this precision can be evaluated before any measurement has been carried out. The two following wavelength sequences are evaluated — sequence A: 22-32 10^ cm"^ in ascending order, and sequence B: 22, 32, 24, 30, 26, 28 10^ cm~^ The measurement equation for the multicomponent system is given by:
za;=h^(/>o'-i) where z(j) is the predicted absorbance at the 7th measurement, using the latest estimated concentrations x(j - 1) obtained after the (/* - l)th measurement. h^O) contains the absorptivities of CI2 and Br2 at the wavelength chosen for the jth measurement. Step 1. Initialisation (/ = 0) P(0) =
"1000
1
1
1000
r(j) = 1 for ally
x(0) =
587
Step 2. Update of the Kalman gain vector k(l) and variance-covariance matrix P(l) k(l) = P(0)h(l)(hT(l)P(0)h(l) + i r ' P(l) = P(0)-k(l)h'^(l)P(0) This gives for design A:
k(l) =
fiooo L 1
0.0045 1 [0.0045" 1000 1 +1 |[([0.0045 0.168] 1 1000 0.168 1000 [ 0.168
r
[0.1596" ' L 5-744 _
P(l) =
"1000 1
1 ~
1000 1 "0.1596" [0.0045-0.168] 1 1000 _ 5.74^t_
K)00^
999.2 -25.8 -25.8 34.88
Step 3. Predict the value of the first measurement z(l) z(l) = [0.0045 0.168]
To' 0
=0
Step 4. Obtain the first measurement: z(l) = 0.0341. Step 5. Update the predicted concentrations: x(l) = x(0) + k(l) (z(l) - 2(1)) x(l) =
"0" 0
+
"0.16'
[0.0341-0] =
_5.7_
0.0054 0.196
Step 6. Return to step 2. These steps are summarized in Tables 41.5 and 41.6. The concentration estimates should be compared with the true values 0.1 and 0.2 respectively. For design B the results listed in Table 41.7 are obtained. From both designs a number of interesting conclusions follow. (1) The set of selected wavelengths (i.e. the experimental design) affects the variance-covariance matrix, and thus the precision of the results. For example, the set 22, 24 and 26 (Table 41.5) gives a less precise result than the set 22, 32 and 24 (Table 41.7). The best set of wavelengths can be derived in the same way as for multiple linear regression, i.e. the determinant of the dispersion matrix (h^h) which contains the absorptivities, should be maximized.
588 TABLE 41.5 Calculated gain vector k and variance covariance matrix P\ f 11(0) = P2.2(^) = 1000; P^j^O) = ^2.1(0) = 1 > ''= 1 Step
Wavelength
k(«)
P(n)
A:l 1
22
2 3
kl
^2,2
^1.2-^2,1
34.88
-25.88
0.16
5.74
999.2
24
1.16
2.82
995.8
14.73
-34.12
26
9.37
1.061
859.7
12.98
-49.53
4
28
13.17
245.0
11.4
-18.11
5
30
7.11
-0.52
71.4
10.5
-5.6
6
32
3.71
-0.25
52.6
10.4
-4.3
-0.67
TABLE 41.6 Estimated concentrations (see Table 41.5 for the starting conditions) Step
Wavelength
x2 CL
Br.
1
22
0.0054
0.196
2
24
0.0072
0.2004
3
26
0.023
0.2022
4
28
0.0802
0.1993
5
30
0.0947
0.1982
6
32
0.0959
0.198
TABLE 41.7 Calculated gain vector, variance covariance matrix and estimated concentrations (see Table 41.5 for the starting conditions) Step
Wavelength
k{n) k\
x(«)
P(A2)
kl
Pu
^2.2
^1,2-^2,1
JCl
x2
999.2
34.88
-25.88
0.0054
0.196
166.2
34.43
-6.42
0.0825
0.194
166.2
13.81
-6.54
0.0825
0.198
62.6
13.68
-2.86
0.0938
0.197
1
22
0.16
2
32
11.76
3
24
0.016
4
30
6.24
5
26
0.59
1.560
62.1
10.4
-4.1
0.0941
0.198
6
28
2.82
0.069
52.2
10.4
-4.3
0.0955
0.198
5.744 -0.27 2.859 -0.22
589 TABLE 41.8 Concentration estimation with the optimal set of wavelengths (see Table 41.5 for the starting conditions) Step
Wavelength
k(n) k\
x(n)
P(^^) k2
Pli
P22
^1,2-^2,1
x\
x2
1
22
0.16
999.2
34.88
-25.88
0.0054
0.196
2
32
11.76
-0.27
166.2
34.43
-6.42
0.0825
0.1942
3
30
6.244
-0.18
62.6
34.34
-3.42
0.0940
0.1939
5.744
(2) From the evolution of P in design B (Table 41.7), one can conclude that the measurement at wavenumber 24 10^ cm~^ does not really improve the estimates already available after the sequence 22, 32. Equally the measurement at 26 10^ cm~^ does not improve the estimate already available after the sequence 22,32,24 and 30 10^ cm~^. This means that these wavelengths do not contain new information. Therefore, a possibly optimal set of wavenumbers is 22,32 and 30. Inclusion of a fourth wavelength namely at 28 10^ cm~^ probably does not improve the estimates of the concentrations already available, since the value of P converged to a stable value. To confirm this conclusion, the recursive regression was repeated for the set of wavelengths 22, 32 and 30 10^ cm"^ (see Table 41.8). Thyssen et al. [3] developed an algorithm for the selection of the best set of m wavelengths out of n. Instead of having to calculate 10^^ determinants to find the best set of six wavelengths out of 300, the recursive approach only needs to evaluate a rather straightforward equation 420 times. The influence of the measurement sequence on the speed of convergence is well demonstrated for the four-component analysis (Fig. 41.5) of a mixture of aniline, azobenzene, nitrobenzene and azoxybenzene [3]. In the forward scan mode a quick convergence is attained, whereas in the backward scan mode, convergence is slower. Using an optimized sequence, convergence is complete after less than seven measurements (Fig. 41.6). Other methods for wavelength selection are discussed in Chapters 10 and 27.
41.4 System equations When discussing the calibration and multicomponent analysis examples in previous sections, we mentioned that the parameters to be estimated are not necessarily constant but may vary in time. This variation is taken into account by
590
a
1 "
K
M
1 .6
•
"
K
X K M
M
K
X
«""
1.2 •".
X
K *
X
X Z
X
r>
0.8
*-
rzi
xa^ie"" -4
_
AA 1
1
14
1
1
28
1
1
42
1
1
56
1-
>
<
70
—> k
78
—> k
0.8 +
0.4 \
Fig. 41.5. Multicomponent analysis (aniline (jci), azobenzene fe), nitrobenzene fe) and azoxybenzene (JC4)) by recursive estimation (a) forward run of the monochromator (b) backward run (k indicates the sequence number of the estimates; solid lines are the concentration estimates; dotted lines are the measurements z).
591
l.B f
1.2
e.4 I ps..
X2K10 XltUB
M
28
42
56
70
—> k
Fig. 41.6. Multicomponent analysis (see Fig. 41.5) with an optimized wavelength sequence.
the system equation. As explained before, the system equation describes how the system changes in time. In the kinetics example the system equation describes the change of the concentrations as a function of time. This is a deterministic system equation. The random fluctuation of the slope and intercept of a straight line can be described by a stochastic model, e.g., an autoregressive model (see Chapter 20). Any unmodelled system fluctuations left are included in the system noise, w(j). The system equation is usually expressed in the following way: x(/*) = F ( / j - l ) x ( / - l ) + w(/')
(41.9)
where F(/j - 1) is the system transition matrix, which describes how the system state changes from time ti_^ to time /^. The vector w(/) consists of the noise contributions to each of the system states. These are system fluctuations which are not modelled by the system transition matrix. The parameters of the measurement equation, the h-vector and system transition matrix for the kinetics and calibration model are defined in Table 41.9. In the next two sections we derive the system equations for a kinetics and a calibration experiment. System state equations are not easy to derive. Their form depends on the particular system under consideration and no general guidance can be given for their derivation.
592 TABLE 41.9 Definition of the state parameter (x), h-vector and transition matrix (F) for two systems State parameters (x)
h-vector
1. Calibration
Slope and intercept of [ 1 c] where c is the concentthe calibration line at ration of the calibration time t standard measured at time t
2. 1 st order kinetics monitored by Uv-Vis A -> B
Concentrations of A and B at time t
Absorbance coefficients of A and B at the wavelength of the reading at time t
Transition matrix (F) time constant of the the variations of slope and intercept 1st order reaction rate
41.4.1 System equation for a kinetics experiment Let us assume that a kinetics experiment is carried out and we want to follow the concentrations of component B which is formed from A by the reaction: A -> B. For a first-order reaction, the concentrations of A (= jCj) and B (= x^^) as a function of time are described by two differential equations: dxj Idt = -k^x^ djC2/dr = fcjjCj
which can be rewritten in the following recursive form: jc,(r + 1) = (1 - A:i)jci(0 + w,(0
(41.10)
x^{t + 1) = k^x^{t) + JC2(0 + ^2(0 w indicates the error due to the discretization of the differential equations. These two equations describe the concentrations of A and B as a function of time, or in other words, they describe the state of the system, which in vector notation becomes: x ( r + l ) = Fxft) + w(0 with x\t^\)^[x,{t^\)x^{t+\)] w^(r+l) = [vi;i(r+1)^2(^+1)] and the transition matrix equal to: "l-A:,
0"
F
Jr —
k^
1
(41.11)
593
41.4.2 System equation of a calibration line with drift In this section we derive a system equation which describes a drifting calibration line. Let us suppose that the intercept x,(/ + 1) at a time; + 1 is equal to x,(/) at a time j augmented by a value a(/'), which is the drift. By adding a non-zero system noise w„ to the drift, we express the fact that the drift itself is also time dependent. This leads to the following equations [5,6]: x,(7+l) = x,0') + aO') a(/ + l) = a(/) + w„(/ + l) which is transformed into the following matrix notation:
.a(7 + l)
0 1 1 •^i(y) + 0 1 a(7) WaCi+l)
(41.12)
A similar equation can be derived for the slope, where P is the drift parameter: X2(;+l)
.P0"+1).
0
1 ip2(;) .0 iJIpCy).
(41.13)
H ' p ( ; + l)
Equations (41.12) and (41.13) can now be combined in a single system model which describes the drift in both parameters: \x,{i+\)
x^ii+i) a(/+l)
.PO'+l).
10
1 0 " \x\{j)^
0
0 1 U2(;) + 0 0 1 0 «(;•) H'aCy + i) >vp(; + l) 0 0 0 1 ^PO).
0
10
or x(/ + 1) = Fx(/) + w(7' + 1) with
F=
10 10" 0 10 1 0 0 10 0 0 0 1
Xi
andx =
a
.P.
F describes how the system state changes from time tj to tj^^.
(41.14)
594
For the time invariant calibration model discussed in Section 41.2, eq. (41.14) reduces to: "1
OTA:, •^iO')1
0
lJ[jC2lU)}
where x^ = intercept and X2 = slope.
41.5 The Kalman filter 41.5.1 Theory In Sections 41.2 and 41.3 we applied a recursive procedure to estimate the model parameters of time-invariant systems. After each new measurement, the model parameters were updated. The updating procedure for time-variant systems consists of two steps. In the first step the system state x(/ - 1) at time ti_^ is extrapolated to the state x(j) at time tj by applying the system equation (eq. (41.15)) in Table 41.10). At time tj a new measurement is carried out and the result is used to TABLE41.10 Kalman filter algorithm equations Initialisation: x(0), P(0) State extrapolation: system equation x(/!/- 1) = F(/)xO- 11/'- 1) + wO)
(41.15)
Covariance extrapolation P(/V- 1) = F(/-)P(/- II7- 1)F''(/) + Q 0 - 1)
(41.16)
New measurement: z(j) Measurement equation z(J) = h'(j)x(j\j - 1) + v(/') forj = 0„ 1 „ 2,... Covariance update P(/'iy) = P(/"iy - 1) - mh'ijWiiy
-1)
(41.17)
Kalman gain update k(/) = POV- l)h(/)(h''(/-)POV- l)hO) + rO))-' State update x(J\J) = ^(J\J-^) + mKz(j)-h'(j)x(j\j-
D)
(41.18)
(41.19)
595
update the state x(j) (eq. (41.19) in Table 41.10). In order to make a distinction between state extrapolations by the system equation and state updates when making a new observation, a double index (j\j - 1) is used. The index (j\j - 1) indicates the best estimates at time tj, based on measurements obtained up to and including those obtained at point tj^^. Equations (41.15) and (41.19) for the extrapolation and update of system states form the so-called state-space model. The solution of the state-space model has been derived by Kalman and is known as the Kalman filter. Assumptions are that the measurement noise v(j) and the system noise w(;) are random and independent, normally distributed, white and uncorrected. This leads to the general formulation of a Kalman filter given in Table 41.10. Equations (41.15) and (41.19) account for the time dependence of the system. Eq. (41.15) is the system equation which tells us how the system behaves in time (here inj units). Equation (41.16) expresses how the uncertainty in the system state grows as a function of time (here inj units) if no observations would be made. Q(/ - 1) is the variance-covariance matrix of the system noise which contains the variance of w. The algorithm is initialized in the same way as for a time-invariant system. The sequence of the estimations is as follows: Cycle 1 1. Obtain initial values for x(OIO), k(0) and P(OIO) 2. Extrapolate the system state (eq. (41.15)) to x(llO) 3. Calculate the associated uncertainty (eq. (41.16)) P( 110) 4. Perform measurement no. 1 5. Update the gain vector k(l) (eq. (41.18)) using P(IIO) 6. Update the estimate x(llO) to x(lll) (eq. (41.19)) and the associated uncertainty P(lll) (eq. (41.17)) Cycle 2 1. Extrapolate the estimate of the system state to x(2ll) and the associated uncertainty P(2I1) 2. Peform measurement no. 2 3. Update the gain vector k(2) using P(2I1) 4. Update (filter) the estimate of the system state to x(2l2) and the associated uncertainty P(2I2) and so on. In the next section, this cycle is demonstrated on the kinetics example introduced in Sections 41.1 and 41.4. Time-invariant systems can also be solved by the equations given in Table 41.10. In that case, F in eq. (41.15) is substituted by the identity matrix. The system state, x(/), of time-invariant systems converges to a constant value after a few cycles of the filter, as was observed in the calibration example. The system state,
596
x(/), of time-variant systems is obtained as a function of y, for example the concentrations of the reactants and reaction products in a kinetic experiment monitored by a spectrometric multicomponent analysis. 41.5.2 Kalman filter of a kinetics model Equation (41.11) represents the (deterministic) system equation which describes how the concentrations vary in time. In order to estimate the concentrations of the two compounds as a function of time during the reaction, the absorbance of the mixture is measured as a function of wavelength and time. Let us suppose that the pure spectra (absorptivities) of the compounds A and B are known and that at a time t the spectrometer is set at a wavelength giving the absorptivities h^(0- The system and measurement equations can now be solved by the Kalman filter given in Table 41.10. By way of illustration we work out a simplified example of a reaction with a true reaction rate constant equal to k^ =0.1 min"^ and an initial concentration jCi(O) = 1. The concentrations are spectrophotometrically measured every 5 minutes and at the start of the reaction after 1 minute. Each time a new measurement is performed, the last estimate of the concentration A is updated. By substituting that concentration in the system equation x^(t) = x^(0)txp(-k^t) we obtain an update of the reaction rate k. With this new value the concentration of A is extrapolated to the point in time that a new measurement is made. The results for three cycles of the Kalman filter are given in Table 41.11 and in Fig. 41.7. The "c ik (0 I i •G
1?
(Q
g>
1
O C
o (Q
08
C
8
0.6
oC o 0.4 0.2 o
0
_l
I
I
l_
_J
I
\
o
o
o
L_
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 • time
Fig. 41.7. Concentrations of the reactant A (reaction A ^ B) as a function of time (dotted line) (CA = 1, CB = 0); • state updates (after a new measurement), O state extrapolations to the next measurement (see Table 41.11 for Kalman filter settings).
597 TABLE41.il The prediction of concentrations and reaction rate constant by a Kalman filter; (JCJCO) =l,k^=0.l Time
Concentrations
Continuous A
B
Wavelengthy
Discrete A
Absorptivities
Absorbance
A
z
B
z
Estimate of the concentration of A State update State extra(41.19)
B
min-^
1 ^-
polation (41.15)
1)
x(OIO) 1.00
0
1
0
1
0
1
0.90
0.10
0.90
0.10
2
0.82
0.18
0.81
0.19
0.82
->1
10
1
9.1
10
jc(lll)0.91
;c(llO)
0.90
3
0.74
0.28
0.73
0.27
0.74
4
0.67
0.33
0.66
0.34
0.67
5
0.61
0.39
0.59
0.41
6
0.55
0.45
0.53
0.47
0.57
7
0.50
0.50
0.48
0.52
0.52
8
0.45
0.55
0.43
0.57
0.48 0.43
->2
5
2
3.8
3.87
jc(2l2) 0.63
4211)
0.62
9
0.41
0.59
0.39
0.61
10
0.37
0.63
0.35
0.65
11
0.33
0.67
0.31
0.69
0.34
12
0.29
0.71
0.28
0.72
0.31
13
0.27
0.73
0.25
0.75
0.28
14
0.25
0.75
0.23
0.77
15
0.22
0.78
0.21
0.79
16
0.20
0.80
0.18
0.82
0.15
17
0.18
0.82
0.17
0.83
0.13
18
0.16
0.84
0.15
0.85
0.12
19
0.15
0.85
0.14
0.86
0.11
^ 3
2
5
3.97
3.81
A:(3I3) 0.38
jc(3l2)
0.40
0.26 ->4
7
2
3.10
3.15
;c(4l4)0.16
x(4l3)
0.23
0.10
')State extrapolations in italic are obtained by substituting the last estimate of k^ equal to -\n{x(j\j))/t in the system equation (41.15).
dashed line in Fig. 41.7 shows the true evolution of the concentration of the reactant A. For a matter of simplicity we have not included the covariance extrapolation (eq. (41.16)) in the calculations. Although more cycles are needed for a convincing demonstration that a Kalman filter can follow time varying states, the example clearly shows its principle.
598
41.5.3 Kalman filtering of a calibration line with drift The measurement model of the time-invariant calibration system (eq. (41.5)) should now be expanded in the following way: z(j) = h'iDxij - 1) + v(j) for; = 0, 1, 2,...
(41.20)
where h^(/) = [lc(/)0 0] c(j) is the concentration of the analyte in theyth calibration sample
x'{j-l)
= [b, b, a p ]
The model contains four parameters, the slope and intercept of the calibration line and two drift parameters a and p. All four parameters are estimated by applying the algorithm given in Table 41.10. Details of this procedure are given in Ref. [5].
41.6 Adaptive Kalman filtering In previous sections we demonstrated that a major advantage of Kalman filtering over ordinary least squares (OLS) procedures is that it can handle timevarying systems, e.g., a drifting calibration parameter and drifting background. In this section another feature of Kalman filters is demonstrated, namely its capability of handling incomplete measurement models or model errors. An example of an incomplete measurement model is multicomponent analysis in the presence of an unknown interfering compound. If the identity of this interference is unknown, one cannot select wavenumbers where only the analytes of interest absorb. Therefore, solution by OLS may lead to large errors in the estimated concentrations. The occurrence of such errors may be detected by inspecting the difference between the measured absorbance for the sample and the absorbance estimated from the predicted concentrations (see Chapter 10). However, inspection of PRESS does not provide information on which wavelengths are not selective. One finds that the result is wrong without an indication on how to correct the error. Another type of model error is a curving calibration line which is modelled by a straight line model. The size and pattern of the residuals usually indicate that there is a model error (see Chapter 8), after which the calculations may be repeated with another model, e.g., a quadratic curve. The recursive property of the Kalman filter allows the detection of such model deviations, and offers the possibility of disregarding the measurements in the region where the model is invalid. This filter is the so-called adaptive Kalman filter.
599
41.6.1 Evaluation of the innovation Before we can apply an adaptive filter, we should define a criterion to judge the validity of the model to describe the measurements. Such a criterion can be based on the innovation defined in Section 41.2. The concept of innovation, /, has been introduced as a measure of how well the filter predicts new observations:
/(/•)=zij) - h^ox/ -1)=zij) - i(j) where zij) is the jth measurement, x(/ - 1) is the estimate of the model parameters after 7 - 1 observations, and h^(/) is the design vector. Thus /(/) is a measure of the predictive ability of the model. For the calibration example discussed in Section 41.2, x(/ - 1) contains the slope and intercept of the straight line, and h^(/) is equal to [1 c(j)] with c(j) the concentration of the calibration standard for the jth calibration measurement. For the multicomponent analysis (MCA), x(/ - 1 ) contains the estimated concentrations of the analytes after j - I observations, and h^(/) contains the absorptivities of the analytes at wavelength;. It can be shown [4] that the innovations of a correct filter model applied on data with Gaussian noise follows a Gaussian distribution with a mean value equal to zero and a standard deviation equal to the experimental error. A model error means that the design vector h in the measurement equation is not adequate. If, for instance, in the calibration example the model was quadratic, h^ should be [1 c(j) c(/)^] instead of [1 c(j)]. In the MCA example h^(/) is wrong if the absorptivities of some absorbing species are not included. Any error in the design vector h^ appears by a non-zero mean for the innovation [4]. One also expects the sequence of the innovation to be random and uncorrelated. This can be checked by an investigation of the autocorrelation function (see Section 20.3) of the innovation. 41.6.2 The adaptive Kalman filter model The principle of adaptive filtering is based on evaluating the observed innovation calculated at each new observation and comparing this value to the theoretically expected value. If the (absolute) value of the innovation is larger than a given criterion, the observation is disregarded and not used to update the estimates of the parameters. Otherwise, one could eliminate the influence of that observation by artificially increasing its measurement variance r(/), which effectively attributes a low weight to the observation. For a time-invariant system, the expected standard deviation of the innovation consists of two parts: the measurement variance (r(/)), and the variance due to the uncertainty in the parameters (P(/)), given by [4]:
600
OJU) = [r(j) + h^(/*)P(/' - l)mr'
(41.21)
As explained before, the second term in the above equation is the variance of the response, £(/), predicted by the model at the value h(/) of the independent variable, given the uncertainty in the regression parameters P(/ - 1) obtained so far. This equation reflects the fact that the fluctuations of the innovation are larger at the beginning of the filtering procedure, when the states are not well known, and converge to standard deviation of the experimental error when the confidence in the estimated parameters becomes high (small P). Therefore, it is more difficult to detect model errors at the beginning of the estimation sequence than later on. Rejection and acceptance of a new measurement is then based on the following criterion: if |/0)|>3a,(/): reject otherwise: accept Adaptation of the Kalman filter may then simply consist of ignoring the rejected observations without affecting the parameter estimates and covariances. When using eq. (41.21) a complication arises at the beginning of the recursive estimation because the value of P depends on the initially chosen values P(0) and thus is not a good measure for the uncertainty in the parameters. When large values are chosen for the diagonal terms of P in order to speed up the convergence (high P means large gain factor k and thus large influence of the last observation), eq. (41.21) overestimates the variance of the innovation, until P becomes independent of P(0). For short runs one can evaluate the sequence of the innovation and look for regions with significantly larger values, or compare the innovation with r(j). By way of illustration we apply the latter procedure to solve the multicomponent system discussed in Section 41.3 after adding an interfering compound which augments the absorbance at 26 10^ cm"^ with 0.01 au and at 28 10^ cm"^ with 0.015 au. First we apply the non-adaptive Kalman filter to all measurements. The estimation then proceeds as shown in Table 41.12. The above example illustrates the self adaptive capacity of the Kalman filter. The large interferences introduced at the wavelengths 26 and 28 10^ cm~^ have not really influenced the end result. At wavelengths 26 and 28 10^ cm"^ the innovation is large due to the interferent. At 30 10^ cm"^ the innovation is high because the concentration estimates obtained in the foregoing step are poor. However, the observation at 30 10^ cm~^ is unaffected by which the concentration estimates are restored within the true value. In contrast, the OLS estimates obtained for the above example are inaccurate (jCj = 0.148 and X2 = 0.217) demonstrating the sensitivity of OLS for model errors.
601 TABLE 41.12 Non-adaptive Kalman filter with interference at 26 and 28 10-3 cm-^ (see Table 41.5 for the starting conditions) Step 7
Wavelength
0
Innovation
^1
^2
Absorbance
CI2
Br2
Measured^)
Estimated^)
0
0
1
22
0.054
0.196
0.0341
0
0.0341
2
24
0.072
0.2004
0.0429
0.0414
0.0015
3
26
0.169
0.211
0.0435
0.0331
0.0104
4
28
0.313
0.204
0.0267
0.0158
0.0110
5
30
0.099
0.215
0.0110
0.0409
-0.030
6
32
0.098
0.215
0.0080
0.0082
-0.0002
'^Values taken from Table 41.4. 2)Calculated with the absorbtivity coefficients from Table 41.4.
The estimation sequence when the Kalman filter is adapted is given in Table 41.13. This illustrates well the adaptive procedure, which has been followed. At 26 10-^ cm~^ the new measurement is 0.0104 absorbance units higher than expected from the last available concentration estimates (x^ = 0.072 and X2 = 0.2004). This deviation is clearly larger than the value 0.005 expected from the measurement noise. Therefore, the observation is disregarded and the next measurement at 28 10^ cm~^ is predicted with the last accepted concentration estimates. Again, the difference between predicted and measured absorbance (= innovation) cannot be explained from the noise and the observation is disregarded as well. At 30 10^ cm"^ the predicted absorbance using the concentration estimates from the second step is back within the expectations, P and k can be updated leading to new concentration estimates x^ = 0.098 and X2 = 0.1995. Thereafter, the estimation process is continued in the normal way. The effect of this whole procedure is that the two measurements corrupted by the presence of an interferent have been eliminated after which the measurement-filtering process is continued.
41.7 Applications One of the earliest applications of the Kalman filter in analytical chemistry was multicomponent analysis by UV-Vis spectrometry of time and wavelength independent concentrations, which was discussed by several authors [7-10]. Initially, the spectral range was scanned in the upward and downward mode, but later on
602 TABLE 41.13 Adaptive Kalman filter with interference at 26 and 28 cm-' (see Table 41.5 for the starting conditions) Step 7
Wavelength
0 1
22
Innovation
^1
X2
Absorbance
CI2
Br2
Measured')
Estimated^)
0
0
0.054
0.196
0.0341
0
0.0341 0.0015 0.0104
2
24
0.072
0.2004
0.0429
0.0414
3
26
0.0331
28
-
0.0435
4
-
0.0267
0.0100
0.0166
0.00814
-0.002
0.00801
-0.00001
5
30
0.0980
0.1995
0.0110
6
32
0.0979
0.1995
0.0080
')Values taken from Table 41.4. 2)Calculated with absorbtivity coefficients from Table 41.4.
optimal sequences were derived for faster convergence to the result [3]. The measurement model can be adapted to include contributions from linear drift or background [6,11]. This requires an accurate model for the background or drift. If the background response is not precisely described by the model the Kalman filter fails to estimate accurate concentrations. Rutan [12] applied an adaptive Kalman filter in these instances with success. In HPLC with diode array detection, threedimensional data are available. The processing of such data by multivariate statistics has been the subject of many chemometric studies, which are discussed in Chapter 34. Under the restriction that the spectra of the analytes should be available, accurate concentrations can be obtained by Kalman filtering in the presence of unknown interferences [13]. One of the earliest reports of a Kalman filter which includes a system equation is due to Seelig and Blaunt [14] in the area of anodic stripping voltametry. Five system states — potential, concentration, potential sweep rate and the slope and intercept of the non-Faradaic current — were predicted from a measurement model based on the measurement of potential current. Later on the same approach was applied in polarography. Similar to spectroscopic and chromatographic applications, overlapping voltamograms can be resolved by a Kalman filter [15]. A vast amount of applications of Kalman filters in kinetic analysis has been reported [16,17] and the performance has been compared with conventional non-linear regression. In most cases the accuracy and precision of the results obtained from the two methods were comparable. The Kalman filter is specifically superior for detecting and correcting model errors.
603
The Kalman filter is particularly well-suited to monitor the dynamic behaviour of processes. The measurement procedure in itself can be considered to be a system which is observed through the measurement of check samples. One can set up system equations, e.g., a system equation which describes the fluctuations of the calibration factors. Only a few applications exploiting this capability of a Kalman filter have been reported. One of the difficulties is a lack of system models, which describe the dynamic behaviour of an analytical system. Thyssen and coworkers [17] demonstrated the potential of this approach by designing a Kalman filter for the monitoring of the calibration factors. They assembled a so-called self-calibrating Flow Injection Analyzer for the determination of chloride in water. The software of the instrument included a system model by which the uncertainty of the calibration factors was evaluated during the measurement of the unknown samples. When this uncertainty exceeded a certain threshold the instrument decided to update the calibration factors by remeasuring one of the calibration standards. Thyssen [18] also designed an automatic titrator which controlled the addition of the titrant by a Kalman filter. After each addition the equivalence point (the state of the system) was estimated during the titration.
References 1. 2. 3.
4. 5. 6.
7. 8. 9. 10.
11.
D. Graupe, Identification of Systems. Krieger, New York, NY, 1976. Landolt-Bornstein, Zalen Werte und Funktionen. Teil 3, Atom und Molekular Physik. Springer, Berlin 1951. P.C. Thijssen, L.J.P. Vogels, H.C. Smit and G. Kateman, Optimal selection of wavelengths in spectrophotometric multicomponent analysis using recursive least squares. Z. Anal. Chem., 320(1985)531-540. A. Gelb (Ed.), Applied Optimal Estimation. MIT Press, Cambridge, MA, 1974. G. Kateman and L. Buydens, Quality Control in Analytical Chemistry, 2nd Edn. Wiley, New York, 1993. P.C. Thijssen, S.M. Wolfrum, G. Kateman and H.C. Smit, A Kalman filter for calibration,evaluation of unknown samples and quality control in drifting systems: Part 1. Theory and simulations. Anal. Chim. Acta, 156 (1984) 87-101. H.N.J. Poulisse, Multicomponent analysis computations based on Kalman Filtering. Anal. Chim. Acta, 112 (1979) 361-374. C.B.M. Didden and H.N.J. Poulisse. On the determination of the number of components from simulated spectra using Kalman filtering. Anal. Lett., 13 (1980) 921-935. T.F. Brown and S.D. Brown, Resolution of overlapped electrochemical peaks with the use of the Kalman filter. Anal. Chem., 53 (1981) 1410-1417. S.C. Rutan and S.D. Brown, Pulsed photoacoustic spectroscopy and spectral deconvolution with the Kalman filter for determination of metal complexation parameters. Anal. Chem., 55 (1983) 1707-1710. P.C. Thijssen, A Kalman filter for calibration, evaluation of unknown samples and quality control in drifting systems: Part 2. Optimal designs. Anal. Chim. Acta, 162 (1984) 253-262.
604 12.
13.
14. 15.
16. 17.
18.
19.
S.C. Rutan, E. Bouveresse, K.N. Andrew, P.J. Worsfold and D.L. Massart, Correction for drift in multivariate systems using the Kalman filter. Chemom. Intell. Lab. Syst., 35 (1996) 199-211. J. Chen and S.C. Rutan, Identification and quantification of overlapped peaks in liquid chromatography with UV diode array detection using an adaptive Kalman filter. Anal. Chim. Acta, 335(1996) 1-10. P. Seelig and H. Blount, Kalman Filter applied to Anodic Stripping Voltametry: Theory. Anal. Chem., 48 (1976) 252-258. C.A. Scolari and S.D. Brown, Multicomponent determination in flow-injection systems with square-wave voltammetric detection using the Kalman filter. Anal. Chim. Acta, 178 (1985) 239-246. B.M. Quencer, Multicomponent kinetic determinations with the extended Kalman filter. Diss. Abstr. Int. B 54 (1994) 5121-5122. M. Gui and S.C. Rutan, Determination of initial concentration of analyte by kinetic detection of the intermediate product in consecutive first-order reactions using an extended Kalman filter. Anal. Chim. Acta, 66 (1994) 1513-1519. P.C. Thijssen, L.T.M. Prop, G. Kateman and H.C. Smit, A Kalman filter for caHbration, evaluation of unknown samples and quality control in drifting systems. Part 4. Flow Injection Analysis. Anal. Chim. Acta, 174 (1985) 27-40. P.C. Thijssen, N.J.M.L. Janssen, G. Kateman and H.C. Smit, Kalman filter applied to setpoint control in continuous titrations. Anal. Chim. Acta, 177 (1985) 57-69.
Recommended additional reading S.C. Rutan, Recursive parameter estimation. J. Chemom., 4 (1990) 103-121. S.C. Rutan, Adaptive Kalman filtering. Anal. Chem., 63 (1991) 1103A-1109A. S.C. Rutan, Fast on-line digital filtering. Chemom. Intell. Lab. Syst., 6 (1989) 191-201. D. Wienke, T. Vijn and L. Buydens, Quality self-monitoring of intelligent analyzers and sensor based on an extended Kalman filter: an application to graphite furnace atomic absorption spectroscopy. Anal. Chem., 66 (1994) 841-849. S.D. Brown, Rapid parameter estimation with incomplete chemical calibration models. Chemom. Intell. Lab. Syst., 10 (1991) 87-105.
605
Chapter 42
Applications of Operations Research 42.1 An overview Ackoff and Sasieni [1] defined operations research (OR) as "the application of scientific method by interdisciplinary teams to problems involving the control of organized (man-machine) systems so as to provide solutions which best serve the purposes of the organization as a whole". Operations research consists of a collection of mathematical techniques. Some of these are linear programming, integer programming, queuing theory, dynamic programming, graph theory, game theory, multicriteria decision making, and simulation. They often are optimization techniques and are characterized by their combinatorial character: their aim is to find an optimal combination. Typical problems that can be solved are: (1) allocation, (2) inventory, (3) replacement, (4) queuing, (5) sequencing and combination, (6) routing, (7) competition, and (8) search. Several, but not all, of these mathematical methods (e.g. multicriteria decision making. Chapter 26) or problems (the non-hierarchical clustering methods of Chapter 30, which can be treated as allocation models) have been treated earlier. In this chapter, we will briefly discuss the methods that are relevant to chemometricians and have not been treated in earlier chapters yet.
42.2 Linear programming Suppose that a manufacturer prepares a food product by adding two oils (A and B) of different sources to other ingredients. His purpose is to optimize the quality
606
of the product and at the same time minimize cost. The quality parameters are the amount of vitamin A (y^) and the amount of polyunsaturated fatty acids (y2). The cost of a unit amount of oil A is 40, that of B is 25. In this introductory example, we will suppose that the cost of the other ingredients is negligible and does not have to be taken into account. Moreover, we suppose that the volume remains constant by adaptation of the other ingredients. The cost or objective function to be minimized is therefore Z = 4 0 J C , + 25JC2
(42.1)
where jc j is the amount in grams of oil A per litre of product and X2 the amount of oil B. An optimal product contains at least 65 vitamin A units and 40 polyunsaturated units per litre (all numbers in this section have been chosen for mathematical convenience and are not related to real values). More vitamin A or polyunsaturated fatty acids are not considered to have added benefit. Suppose now that oil A contains 30 units of vitamin A per litre and 10 units of polyunsaturated fatty acids and oil B contains 15 units of vitamin A and 25 units of polyunsaturated fatty acid. One can then write the following set of constraints. y^ = 30x^
+ 15A:2>65
3;2=10JCI + 2 5 J C 2 > 4 0
(42.2)
JC, > 0, ;c2 > 0
The line y^=65 = SOx, + 15JC2 is shown in Fig. 42.1. All points on that line or above satisfy the constraint 30JCJ + 15JC2 ^ 65. Similarly, all points lying above line J2 = IOJC, + 25JC2 = 40 satisfy the second constraint of eq. (42.2), while the last constraints limit the acceptable solutions to positive or zero values for x^ and X2. The acceptable region is the shaded area of Fig. 42.1. We can now determine which pairs of (JC,, X2) values yield a particular z. In Fig. 42.1 line z, shows all values for which z = 50. These (JC,, X2) do not belong to the acceptable area. However, we can now draw parallel lines until we meet the acceptable area. This happens in point B with line 12- The coordinates of this point are obtained by solving the set of simultaneous equations 30JC, + 15JC2 = 6 5 l IOJCJ + 2 5 J C 2
=40J
This yields x^ = 41/24, X2 = Will and z = 91.2.
607
Fig. 42.1. Linear programming: oil example.
Let us look at another example (Fig. 42.2) [2]. A laboratory must carry out routine determinations of a certain substance and uses two methods, A and B, to do this. With method A, one technician can carry out 10 determinations per day, with method B 20 determinations per day. There are only 3 instruments available for method B and there are 5 technicians in the laboratory. The first method needs no sophisticated instruments and is cheaper. It costs 100 units per determination while method B costs 300 units per determination. The available daily budget is 14000 units. How should the technicians be divided over the two available methods, so that as many determinations as possible are carried out? Let the number of technicians working with method A be x^ and with method B ^2, and the total number of determinations z; then, the objective function to be maximized is given by: Z=\Ox,+
20A:,
(42.3)
608
Fig. 42.2. Linear programming: laboratory technicians example.
The constraints are: }^i = ^2 < 3
^2 = X, + ^2 < 5
y^ = (10 X 100) jci + (20 X 300) x^ < 14000
(42.4)
X, > 0, jc, > 0 The optimal result obtained in this way is x^ = 3.2, JC2 = 1.8, z = 68. We observe that in both cases - the set of acceptable solutions is convex, i.e. whatever two points one chooses from the set, the line connecting them lies completely within the domain defined by the set; - the optimal solution is one of the comer points of the convex set. It can be shown that this can be generalized to the case of more than two variables. The standard solution of a linear programming problem is then to define the corner points of the convex set and to select the one that yields the best value for the objective function. This is called the Simplex method. The second example illustrates a difficulty that can occur, namely the optimal solution concerns 1.8 technicians working with method B, while one needs an integer number. This can be solved by letting one technician working full time and another four days out of five with this instrument. When this is not practical, the
609
solution is not feasible and one should then apply a related method, called integer programming in which only integer values are allowed for the variable x^. When only binary values are allowed, this is called binary programming. Problems resembling the first example, but much more complex, are often studied in industry. For instance in the agro-food industry linear programming is a current tool to optimize the blending of raw materials (e.g. oils) in order to obtain the wanted composition (amount of saturated, monounsaturated and polyunsaturated fatty acids) or property of the final product at the best possible price. Here linear programming is repeatedly applied each time when the price of raw materials is adapted by changing markets. Integer programming has been applied by De Vries [3] (a short Englishlanguage description can be found in [2]) for the determination of the optimal configuration of equipment in a clinical laboratory and by De Clercq et al. [4] for the selection of optimal probes for GLC. From a data set with retention indices for 68 substances on 25 columns, sets of p probes (substances) (/? = 1, 2,..., 20) were selected, such that the probes allow to obtain the best characterization of the columns. This type of application would nowadays probably be carried out with genetic algorithms (see Chapter 27). The fact that only linear objective functions are possible, limits the applicability of the methodology in chemometrics. Quadratic or non-linear programming are possible however. The former has been applied in the agro-food industry for the determination of the composition of an unknown fat blend from its fatty acid profile and the fatty acid profiles of all possible pure oils [5]. The solution of this problem is searched under constraints of the number of oils allowed in the solution, a minimal or maximal content, or a content range. This problem can be solved by quadratic programming. The objective function is to minimize the squared differences between the calculated and actual fatty acid composition of the oil blend. An attractive feature of the programming approach to solve this type of problems is that it provides several solutions in a decreasing order of value of the objective function. All these methods together are also known as mathematical programming.
42.3 Queuing problems In several chapters we discussed how the quality of the analytical result defines the amount of information which is obtained on a sampled system. Obvious quality criteria are accuracy and precision. An equally important criterion is the analysis time. This is particularly true when dynamic systems are analyzed. For instance a relationship exists between the measurability and the sampling rate, analysis time and precision (see Chapter 20). The monitoring of environmental and chemical processes are typical examples where the management of the analysis time is
610
important. In this chapter we will focus on the analysis time. The time between the arrival of the sample and the reporting of the analytical result is usually substantially longer than the net analysis time. Delays may be caused by a congestion of the laboratory or analysis station or by managerial policies, e.g. priorities between samples and waiting until a batch of a certain size is available for analysis. A branch of Operations Research is the study of queues and the influence of scheduling policies on the formation of queues. Queues in waiting rooms, for instance, or the occupation of beds in hospitals, telephone and computer networks have been extensively studied by queuing theory. On the contrary only a few studies have been conducted on queues and delays in analytical laboratories [6-10]. Despite the fact that several operational parameters can be registered with modem Laboratory Information Management Systems (LIMS), laboratory activities are apparently too complex to be described by models from queuing theory. For the same reason the alternative approach by simulation (see Section 42.4) is not a real management tool for decision support in analytical laboratory management. On the other hand simulation techniques proved to be a useful tool for the scheduling of robots. In this section waiting and queues are discussed in order to provide some basic understanding of general queuing behaviour, in particular in analytical laboratories. This should allow a qualitative forecast of the effect of managerial decisions.
42.3.1 Queuing and Waiting No queues would be formed if no new samples are submitted during the time that the analyst is busy with the analysis of the previous sample. If all analysis times were equally long and if each new sample arrived exactly after the previous analysis is finished, the analytical facility could be utilized up to 100%. On the other hand, if samples always arrive before the analysis of the previous sample is completed, more samples arrive than can be analyzed, causing the queue to grow indefinitely long. Mathematically this means that: AZq = 0, w = 0
when ?i/|Li < 1
and n^ = oo^w = oo when )J\i > 1 with AZq the number of samples waiting in queue, w the waiting time in queue, X the number of samples submitted per unit of time (e.g. a day), and \i the number of samples which can be analyzed per unit of time.
611
Because X = l/lAT, wherelAT is the mean interarrival time, and L | L = 1 / AT, where AT is the mean analysis time
:^/|i = Af/iAf = p p is called the utilization factor of the facility or service station. In reality, the queue size {n^ and waiting time (w) do not behave as a zeroinfinity step function at p = 1. Also at lower utilization factors (p < 1) queues are formed. This queuing is caused by the fact that when analysis times and arrival times are distributed around a mean value, incidently a new sample may arrive before the previous analysis is finished. Moreover, the queue length behaves as a time series which fluctuates about a mean value with a certain standard deviation. For instance, the average lengths of the queues formed in a particular laboratory for spectroscopic analysis by IR, ^H NMR, MS and ^^C NMR are respectively 12, 39, 14 and 17 samples and the sample queues are Gaussian distributed (see Fig. 42.3). This is caused by the fluctuations in both the arrivals of the samples and the analysis times. According to the queuing theory the average waiting time (w) exponentially grows with increasing utilization factor (p) and asymptotically approaches infinity when p goes to 100% (see Fig. 42.4). Figure 42.4 shows the waiting time for the simplest queuing system, consisting of one server, independent analysis and arrival processes, Poisson distributed (see Section 15.3) arrivals (number of samples per day) and exponentially distributed analysis times. In the jargon of queuing theory such a system is denoted by M/M/1 where the two Ms indicate the arrival and analysis processes respectively (M = Markov process) and ' 1' is the number of servers. The number of arrivals follows a Poisson distribution when samples are submitted independently from each other, which is generally valid when the samples are submitted by several customers. The probability of n arrivals in a time interval t is given by: P M ^ ' - ^ ^
(42.5)
where Xt is the average number of arrivals during the interval t. For instance the number of samples (samples/day) submitted to the spectroscopic department mentioned earher [9] can be modelled by Poisson distributions with the means 2.8,7.7, 2.06 and 2.5 samples per day (Fig. 42.5).
612
20
20
60
60
100
140
180
220
- ^ - 1 (days)
Fig. 42.3. Time series of the observed queue lengths (n) in a department for structural analysis, with their corresponding histograms fitted with a Gaussian distribution.
613
utilization factor (p)
Fig. 42.4. The ratio between the average waiting time (w) and the average analysis time (AT) as a function of the utilization factor (p) for a system with exponentially distributed interarrival times and analysis times (M/M/1 system). 1
IR t
c
^^
a.
HNMR
20 15 10
0
1 2
3
4
n^ 5
7
lii 0
8
2
I I I 4
10 12 14 16 18
6
»• n
13
MS
CO
CNMR
25 r
m
>
t>
CO
03
20 V
c: -
•
.
-
^
tl.
0 1
2
3
L
4
5
6
7
8
15 [
0 1
2
3
4
L^
5
6
7
8
Fig. 42.5. The distribution of the probability that n samples arrive per day, observed in a department for structural analysis. (I) observed (•) Poisson distribution with mean [i.
614
The following relationships fully describe an M/M/1 system: - the average queue length (n^) which is the number of samples in queue, excluding the one which is being analysed: ^^=p2/(l-p) - the average waiting time (vv) in queue (excluding the analysis time of the sample): w = ATp/(l-p): ^(1-p) Queuing models also describe the distribution of the waiting times though only for relatively simple queuing systems. Waiting times in an M/M/1 system are exponentially distributed. The probability of a waiting time shorter than a given Wj^gx is given by (see Fig. 42.6): P ( w < w ^ ^ ) = l - p e -^iVVm
,/AT
= l - p e -P^mj^
It means that a large part of the samples (65% for p = 0.7) waits for less than the mean waiting time (w^^ = vv). On the other hand, there is a significant probability (35% for p = 0.7) that a sample has to wait longer than this average. It also means that when the laboratory management wants to guarantee a certain maximal turn around time (e.g. 95% of the samples within w^^J, the mean waiting time should be 27% of w^3^ (p = 0.7). Figure 42.7 shows the waiting time distributions of the samples in the IR, *H NMR, ^^C NMR and MS departments mentioned before. The probability of finding k samples in a queue is:
t/AT Fig. 42.6. Probability that the waiting time is smaller than t (t given in units relative to the average analysis time).
615 cumulative %
100
IP
%
100
1 • <»•
y/^
i
A
lOhl 1
2
4
6 8 10 12 14 16 18 20 Waiting time (days)
0
60 40
2
C NMR
20 (
20
lllllir 4
6 8 10 12 14 16 18 20 Waiting time (days)
13 %
80
/
15h|
0
cumulativ
MS
cumulative %
cumulative % 100
^ _ _ ^ - • 1100
15
10
0
2
4
6 8 10 12 14 16 18 20 Waiting time (days)
i 0
2
r*La 1
4
bl
lllh llllllfrrr^f*..
6 8 10 12 14 16 18 20 Waiting time (days)
Fig. 42.7. Histograms and cumulative distributions of the delays (waiting time + analysis time) in a department for structural analysis. (I) Observed values. (•) Cumulative distribution. (•) Fit witli a theoretical model (not discussed in this chapter).
p(n=k) =
p\l-p)
(42.6)
The observed Gaussian distribution of the queue lengths in the spectroscopic laboratory (see Fig. 42.3) is not in agreement with the theoretical distribution given by eq. (42.6). This indicates that an analysis station cannot be modelled by the simple M/M/1 model. We will return to this point later. Another interesting feature is the dynamic behaviour of a queue, which is expressed by the time constant of the queue. As explained in Chapter 20 the time constant of a time series can be determined from the autocorrelation function which gives the autocorrelation (O) for observations with the same time difference T. The autocorrelation plots and time constants (T) of the queues shown in Fig. 42.3 are given in Fig. 42.8. In general, the time constant of a queue increases with an increasing utilization factor [9], which means that periods of congestion may be relatively long. For instance, for an M/M/1 system 7 = ATp/(l - p)^, giving 7 = 8 AT for p = 0.7. The behaviour of a queue depends mainly on three factors: the arrival pattern of the samples, the analysis time and the applied managerial rules.
616 r(T)
1 . .. HNMR
1
A 08
\
T 06
A 08
\
'
04
06
x^^^x ^
0.2
0
0.4 0.2
^
-0 2 -0.4
IR
1
r(T)
^---^^^
0
^ ^
-0.2
0
2
4
6
8
10
12
14
16
18
20 -0.4
— ^ ^ T (days)
0
2
4
6
8
10
12
14
16
18
20
— ^ - T(days) r{T)
A '
1
^
MS
r(T)
08 0.6
: \
'
\
0.4 0.2
0.4
\
^
\
-0.4
2
0.6
\ \v
\
-0.2
0
^^CNMR
1 0.8
i
4
6
8
y
v.^, 10
14
16
18
\
0 2
\
0
/
-0.2
^-'-^ 12
\
«
-0 4 20
— ^ - T (days)
0
2
4
6
8
10
12
14
16
18
20
^^ t (days)
Fig. 42.8. Autocorrelograms of the queue lengths observed in a department for structural analysis.
As it is impossible to discuss them all, a number of cases has been compiled in Table 42.1 with reference to relevant literature in which the cases are discussed by queuing theory or are simulated (see Section 42.4). An interesting aspect of a queuing system is the distribution of the 'idle' and 'busy' time. Idle time is the time that no samples are waiting and are being analyzed. The time that samples are analyzed is called the busy time. For a system with a utilization factor of 70%, analysts are waiting for samples during 30% of their time, which may very much disturb any laboratory manager. However matters become better acceptable when we consider that 30% overhead activities are more the rule than the exception. On the other hand for an M/M/1 system one can derive that the average duration of the busy time is quite long and that the idle time is split up in many relatively short periods. For instance for a system with a utilization factor of 70% and an analysis time of 1 h the mean busy time is AT/(1 - p) = 3.3 h and the mean idle time is AT/p = 1.4 h. Therefore, analysts will not be able to schedule all overhead activities during the time that no samples are waiting, but will be forced to interrupt the analysis process in favour of overhead activities. Because overhead activities compete with the analysis process, queues are never empty, which is demonstrated by the queues given in Fig. 42.3. Thus the
617 TABLE 42.1 Classification of queuing systems Sample arrival
Analysis time Analysis mode
Management rules
1. Single arrivals [11]
1. Analysis time (single) [11]
1. Queue discipline [11]
a. uniform
a. uniform
a. first-in-first-out (FIFO)
b. Poisson
b. exponential
b. last-in-first-out (LIFO)
c. other distribution
c. other distribution
c. priority to easy samples [10]
2. Batch arrivals [12]
d. interrupted (illness, failure) [9]
(expected analysis time)
a. uniform
e. depends of batchsize
d. random
b. Poisson
2. Batch size [12]
2. Priority rules [9]
c. other
a. fixed
a. absolute classes
3. Batch size [12]
b. distributed
b. depends on waiting
a. fixed
c. minimal size
3. Overhead scheduling [9]
b. distribution
a. random interruption
4. No arrival if n > n^Ht
b. scheduled interruption c. interruption ifn< w^rit
fact that queues are never empty does not necessarily indicate an oversaturated system. 42.3.2 Application in analytical laboratory management The overview given in Table 42.1 demonstrates that queues and waiting can only be studied by queuing theory in a limited number of cases. Specifically the queuing systems that are of interest to the analytical chemist are too complex. However the behaviour of simple queuing systems provides a good qualitative insight in the queuing processes occurring in the laboratory. An alternative approach which has been applied extensively in other fields, is to simulate queuing systems and to support the decision making by the simulation of the effect of the decision. This is the subject of Section 42.4. Before this, we will summarize a number of rules of thumb relevant to the laboratory manager, who may control the queuing process by controlling the input, the analytical process and the resources. (1) Input control: Maximum delays may be controlled by monitoring the work load (Wq AT). When n^ AT > w^^^, or when n^ exceeds a critical value (n^rit)' customers are requested to refrain from submitting samples. Too high a frequency of such warnings indicates insufficient resources to achieve the desired Wj^^x-
618
(2) Priorities do not influence the overall average delay, because vv = avvj + (1 - a)w2, where a is the fraction of samples with a high priority. The values of vvj and W2 depend on a, the kind of priority (see Table 40.2) and p [10]. (3) The effect of collecting batches depends on the shortening of the analysis time by batch analysis and the time needed to collect the batches. (4) By automation one can remove the variation of the analysis time or shorten the analysis time. Although the variation of the analysis time causes half of the delay, a reduction of the analysis time is more important. This is also true if, by reducing the analysis time, the utilization factor would remain the same (and thus n^) because more samples are submitted. Since p = AT / lAT, any measure to shorten the analysis time will have a quadratic effect on the absolute delay (because vv = AT / (lAT - AT)). As a consequence the benefit of duplicate analyses (detection of gross errors) and frequent recalibration should be balanced against the negative effect on the delay. (5) By preference, overhead activities should be scheduled in regular blocks, e.g. at the end of the day [10]. (6) For fixed resources (costs), the sampling rate in process control can be increased to maximal utilization (p = 1) of the available resources without penalty only if samples are taken at regular time intervals (no variation) and there is no variation in the analysis time (automated measuring device). In other situations an optimal sampling rate will be found where the measurability is maximal. Recalling the fertilizer example discussed in Chapter 20 we can derive the optimal sampling rate for the A^ determination by an ion-selective electrode, by substituting the analysis time (10 min) by the delay in the measurability equation. However considering that in process control it is always preferable to analyze the last submitted sample (by eventually skipping the analysis of the waiting samples, because they do not contain additional information), it is obvious that a last-infirst-out (LIFO) strategy should be chosen. 42.4 Discrete event simulation In Section 42.2 we have discussed that queuing theory may provide a good qualitative picture of the behaviour of queues in an analytical laboratory. However the analytical process is too complex to obtain good quantitative predictions. As this was also true for queuing problems in other fields, another branch of Operations Research, called Discrete Event Simulation emerged. The basic principle of discrete event simulation is to generate sample arrivals. Each sample is characterized by a number of descriptors, e.g. one of those descriptors is the analysis time. In the jargon of simulation software, a sample is an object, with a number of attributes (e.g. analysis time) and associated values (e.g. 30 min). Other objects are e.g. instruments and analysts. A possible attribute is a list of the analytical
619
procedures which can be carried out by the analyst or instrument. The more one wants to describe the reaUty in detail, the more attribute-value pairs are added to the objects. For example, if one wants to include down times of the instrument or absence due to illness, such attributes have to be added to the object. An event takes place when the state of the laboratory changes. Examples of events are: - a sample arrival: this introduces a state change because the sample joins the queue; - an analysis is ready: this introduces a state change because the instrument and analyst become idle. With each event a number of actions is associated. For example when a sample arrives, the following actions are taken: - if instrument and analyst are idle and if all other conditions are met (batch size), start the analysis. This implies that the next event 'analysis is ready' is generated, the status of the instrument and analyst is switched to 'busy'. Generate the time of next arrival. - otherwise: register the arrival time in the queue; augment the queue size by one. Generate the time of the next arrival. As events generate other events, the simulation keeps going from event to event until some terminating conditions are met (e.g. the end of the simulation time, or the maximum number of samples has been generated). As one can see a specific programming environment, called object oriented programming [17], is required to develop a simulation model, consisting of object-attribute-value (O-A-V) triplets and rules (see also Section 43.4.2). The little research that has been conducted on the simulation of laboratory systems [9,10,13-15] was primarily focused on the demonstration that it is possible to develop a validated simulation model that exhibits the same behaviour in terms of queues, delays as in reality. Next, such a validated model is interrogated with the question "What if?". For instance, what if: - priorities are changed? - resources are modified? - minimal batch size is increased? In Fig. 42.9 we show the simulation results obtained by Janse [8] for a municipal laboratory for the quality assurance of drinking water. Simulated delays are in good agreement with the real delays in the laboratory. Unfortunately, the development of this simulation model took several man years which is prohibitive for a widespread application. Therefore one needs a simulator (or empty shell) with predefined objects and rules by which a laboratory manager would be capable to develop a specific model of his laboratory. Ideally such a simulator should be linked to or be integrated with the laboratory information management system in order to extract directly the attribute values.
620
CO
T3 CD CO -Q
O
«
8
:^ 2
i*^ ^ ;: E
CM S,
J
o_c5
CM
•^
"w"
c\ ^-^ o |
o c3
O
B
^ -o c
to| rr^
73
•c
^
bO C
u^
CD n
o x^ OS
T-
^ ^? ;g
a
Si CO
(N
:i
c
tiJO
x:
JO)
O
621
Klaessens [14-17] developed a ^laboratory simulator', written in SIMULA, which by a question-answering session assembles the simulation model. SIMULA [18] is a programming environment dedicated to the simulation of queuing systems. KEE [19] offers a graphics-driven discrete event simulator, in which the objects are represented by icons which can be connected into a logical network (e.g. a production line for the manufacturing of electronic devices). Although KEE has proven its potential in many areas, no examples are known of analytical laboratories simulated in KEE. From the few but interesting simulation studies of analytical laboratories, the following conclusions can be drawn. - A condition for discrete event simulation to become a relevant tool for the laboratory manager is the availability of an easy to use simulator with a user friendly user interface. - Simulators are a special kind of expert systems and should be treated as such. They should be used to support the decision making process but not to replace the human creativity. - Simulators should be used in a cyclic fashion in order to improve the reliability of the forecast. By simulating small changes in the laboratory organisation and by comparing the forecasts with the real effects, the simulator is refined until the forecasts are within a certain confidence band. - A particular problem is the number of events that should be simulated before the results are stabilized about a mean value. This problem is comparable to the question of how many runs are required to simulate a Gaussian distribution within a certain precision. Experience shows that at least 1000 sample arrivals should be simulated to obtain reliable simulation results. The sample load (samples/day) therefore determines the time horizon of the simulation, which for low sample loads may be as long as several years. It means also that in practice many laboratories never reach a stationary state which makes forecasting difficult. However, one may assume that on the average the best long term decision will also be the best in the short run. One should be careful to tune a simulator based on results obtained before equilibrium is reached. - Experimental design techniques (see Chapters 21 to 26) should be used to study the relevant factors and interactions. 42.5 A shortest path problem A network or graph consists of a set of points {nodes) connected by lines {edges or links). These links can be one-way (one can go from point A to B, but not vice versa) or two-way. When the edges are characterized by values, it is called a weighted graph. In the usual economic problems for which one applies networks, these values are cost, time, or distance.
622
Graphs have been applied in chemometrics to describe molecules. Indeed, molecules can be represented as nodes (the atoms) connected by bonds (the edges). Because in the present chapter, we want to emphasize the optimization aspect, we refer the reader to the literature e.g. Refs. [20-23]. In this section therefore, a routing problem is considered in which one must go from one node (the origin) to another (the terminal node). There are many ways by which this is possible and the routing problem consists in finding a path that minimizes the sum of the values of the edges that constitute the path. This is easily understood if one supposes that the nodes are towns and the values of the edges between neighbouring towns are the distances between the towns. The routing problem then consists in choosing the shortest way of going from one town to another. This type of problem is called a shortest path or minimal path problem. It is applied here to the optimization of chromatographic separation schemes for multicomponent samples and more particularly to the ion-exchange separation of samples containing several different ions [24]. In this type of application, one usually employs more or less rapid and clear-cut separation steps. This can be explained best by considering the simplest possible case, namely the separation of three ions. A, B, and C. The original situation is that the three ions have been brought together (not separated) on a chromatographic column and the final situation should be that they are eluted and separated from each other. These two situations constitute the initial and the terminal nodes of the network. They are denoted by ABC// and //A/B/C. The elements which remain on the column are given to the left of symbol // and the symbol / means that the ions to the left and to the right of it are separated. There are many ways in which one can go from situation ABC// to situation //A/B/C, as shown in the network in Fig. 42.10. Step 1 There are two possibilities, namely: (a) one can elute one element and retain the other two on the column. This leads to nodes AB//C, AC//B, BC //A; or (b) one can elute two elements and retain the other. This leads to nodes A//BC, B//ACandC//AB. Step 2 (a) Following step la, one of the two remaining ions is eluted. For example, if in the first step A was eluted, one now elutes B or C. In step 2a one can reach the situations A//B/C, B//A/C or C//A/B. (b) Following step lb, two ions are eluted together and are therefore not separated; they have to be adsorbed first on another column. In the mean time, the single ion, left on the first column, can be eluted. Two ions are then adsorbed on a
623 M//BC
A//B/C a8^
//A/B/C a 11
Fig. 42.10. A shortest path problem. See text for symbols.
column and one is eluted. The different possibilities are AB//C, AC//B, BC//A, i.e. the situations reached also after step la. From there one proceeds to step 2a.
Step 3 This follows step 2a. Only one ion remains on the column. It is now eluted, so that the terminal node is reached. These different possibilities and their relationships can be depicted as a directed graph (Fig. 42.10). Directed graphs are graphs in which each edge has a specific direction. To find the shortest path one has to give values to the edges of the graph. As the problem is to find the procedure that permits to carry out the separation in the shortest time possible, these values should be the times necessary to carry out the steps symbolized by the links in the graph or a value proportional to the time. We shall not detail the manner in which these times were derived. Essentially, there are three possibilities: (a) If the separation depicted by a particular link is possible, the time is considered to be equal to the distribution coefficient of the ion which is the slowest to be eluted in this step (the distribution coefficient as defined in ion exchange chromatography is proportional to the elution time). (b) If the separation depicted by a particular link is not possible, a very high value (900) is given to that link. (c) If the link contains a transfer from one column to another, a value corresponding to the estimated time needed for this transfer is given.
624
One may question whether the application of graph theory is really necessary as no doubt a separation such as that described above can be investigated easily without it. However, when more ions are to be separated, the number of nodes grows very rapidly. The calculation of the number of nodes is rather complicated. It is simpler in the special case where transfers from one column to another are not allowed (all ions eluted from the column must be completely separated from all the other ions). In this instance, and for three elements, the following nodes should be considered: ABC//, A//B/C, AB//C, AC//B, BC//A, B//A/C, C//A/B, and //A/B/C. If one considers only the stationary phase (to the left of //), one notes that all the combinations of zero, one, two, and three ions out of three are present. Calling n the total number of ions and p (0 < p < n) the number of ions in a particular combination taken from these n, the total number of combinations is equal to
p=0
where C^ is the symbol used for the number of combinations ofp elements out of a set of A2. This can be shown to be equal to 2^. In this particular instance, a separation scheme for eight elements would contain 256 nodes. For the general case (transfers allowed), no less than 17008 nodes would have to be considered! Even for 4 elements, 38 nodes are obtained and it becomes difficult to consider all of the possibilities without using graph theory. The shortest (cheapest) path in a graph can be found, for example, with a very simple algorithm. Let us suppose that one has to construct a highway from town a^ to a town a^ (Fig. 42.10). There are several possible layouts, which are determined by the towns through which one must pass, and these must be selected from ^2 to a^Q. The values of the links in the resulting graph are given by the estimated costs. The value 900 is a way to indicate that this link is impossible or undesirable. The problem is, of course, to find the cheapest route. A value A | = 0 is assigned to town (node) a, and the value of all the nodes a^ directly linked to a j is computed by using the equation A^ = Aj + /(a,, a J where /(a,, a^) is the length of edge (^j, aj. In this way, we assign the values 3, 5, 900, 0, 900 and 900 to the nodes <22, a^, a^, a^, a^ and a-^, respectively. This procedure is repeated for the nodes a^ linked directly to one of the nodes a^ by using the equation A^ = A^ + /(a„, a^. We continue to do this until a value has been assigned to each of the nodes in the graph. We would, for example, assign the values 15, 1800, 900 and 902 to ag, a^, a^^ and a^, respectively. In the first stage, we assigned a possible value to all of the nodes, but not necessarily the lowest possible value. For example, the value A-j of node a^ is now 900. The value 900 is artificially high, indicating that for practical reasons it is
625
impossible to go from a^ to a^. In the highway example, this could mean a mountain ridge and in the ion-exchange case, it could be a separation that cannot be carried out in a reasonable time. This value is derived here from edge (a^, a^) using the equation Ay = Aj + /(aj, a^). Town a^, however, can also be reached from town ^2- The value of Ay is then given by A2 + /(a2, a^) and is equal to 13; this replaces the original value 900. In this way, all of the nodes are checked until one knows that each town is reached in the cheapest possible way. The optimal path is then found by retracing the steps that led to the final value for A jj. In the graph in Fig. 42.11, this is the path a^, a^, a^, a^^ with a total value of 6. The graph in Fig. 42.10 is the graph obtained for the separation of Ca, Co, and Th on a cation-exchange column. One arrives at this graph by replacing nodes ai-ai 1 by CaThCo//, Ca//Th/Co, Th//Ca/Co, Co//Ca/Th, Ca Th//Co, Ca Co//Th, Th Co//Ca, Ca//Th/Co, Co//Ca/Th, Th//Ca/Co, and //Ca/Th/Co, respectively. The weights of the edges are distribution coefficients obtained from the literature. The conclusion in this particular instance is that one must first reach situation a^, i.e. Ca Th//Co, meaning that one must first elute Co. As the weight of the edge is 0, this means that at least one solvent has been described in the literature that permits one to elute Co with a distribution coefficient of 0 without eluting Ca and Th. The following steps are the elution of Th (distribution coefficient = 2) and Ca (distribution coefficient = 4). As in Section 42.2, we should remark here that at least some of the graph theoretic applications can also be carried out with genetic algorithms.
References 1. 2. 3. 4. 5. 6.
7. 8. 9.
R.I. Ackoff and M.W. Sasieni, Fundamentals of Operations Research. Wiley, New York, 1968. D.L, Massart, A. Dijkstra and L. Kaufman, Evaluation and Optimization of Laboratory Methods and Analytical Procedures. Elsevier, Amsterdam, 1978. T. De Vries, Het klinisch-chemisch laboratorium in economisch perspectief. Stenfert-Kroese, Leiden, 1974. H. De Clercq, M. Despontin, L. Kaufman and D.L. Massart, The selection of representative substances by operations research techniques. J. Chromatogr., 122 (1976) 535-551. S. de Jong and Th.J. R. de Jonge, Computer assisted fat blend recognition using regression analysis and mathematical programming. Fat Sci. Technol., 93 (1991) 532-536. I.K. Vaananen, S. Kivirikko, J. Koskennieni, J. Koskimies, A. Relander, Laboratory Simulator: a study of laboratory activities by a simulation method. Meth. Inform. Med., 13 (1974) 158-168. B. Schmidt, Rechnergesteuerte Strategien zur Probenverteilung in einem klinisch-chemischen Labo. Z. Anal. Chem., 287 (1977) 157-160. T. A.H.M. Janse, G. Kateman, Enhancement of the performance of analytical laboratories by a digital simulation approach. Anal. Chim. Acta, 159 (1984), 181-189. B.G.M. Vandeginste, Strategies in molecular spectroscopic analysis with application of queueing theory and digital simulation. Anal. Chim. Acta, 112 (1979) 253-275.
10. B.G.M. Vandeginste, Digital simulation of the effect of dispatching rules on the performance of a routine laboratory for structural analysis. Anal. Chim. Acta, 122 (1980) 435-454.
11. L. Kleinrock, Queueing Systems, Vol. I: Theory. Wiley, New York, 1976.
12. J.G. Vollenbroek and B.G.M. Vandeginste, Some considerations on batch arrival and batch analysis in analytical laboratories. Anal. Chim. Acta, 133 (1981) 85-97.
13. B. Van de Wijdeven, J. Lakeman, J. Klaessens, B. Vandeginste and G. Kateman, Digital simulation as an aid to sample scheduling in a routine laboratory for liquid chromatography. Anal. Chim. Acta, 184 (1986) 151-164.
14. J. Klaessens, T. Saris, B. Vandeginste and G. Kateman, Expert system for knowledge-based modelling of analytical laboratories as a tool for laboratory management. J. Chemometrics, 2 (1988) 49-65.
15. J. Klaessens, J. Sanders, B. Vandeginste and G. Kateman, LABGEN, expert system for knowledge-based modelling of analytical laboratories. Part 2. Application to a laboratory for quality control. Anal. Chim. Acta, 222 (1989) 19-34.
16. J. Klaessens, B. Vandeginste and G. Kateman, Towards computer-supported management of analytical laboratories. Anal. Chim. Acta, 223 (1989) 205-221.
17. J. Klaessens, L. Van Beysterveldt, T. Saris, B. Vandeginste and G. Kateman, LABGEN, expert system for knowledge-based modelling of analytical laboratories. Part 1. Laboratory organisation. Anal. Chim. Acta, 222 (1989) 1-17.
18. G.M. Birtwisle, O.J. Dahl, B. Myhrhaug and K. Nygaard, SIMULA Begin. Petrocelli/Charter, New York, 1973.
19. C.A. Copenhaver, Knowledge Engineering Environment (KEE) from Intellicorp. Packag. Software Rep., 15 (1989) 5-20.
20. M.J. Randic, On the recognition of identical graphs representing molecular topology. J. Chem. Phys., 60 (1974) 3920-3928.
21. A.T. Balaban, ed., Chemical Applications of Graph Theory. Academic Press, London, 1976.
22. V. Kvasnicka and J. Pospichal, An improved version of the constructive enumeration of molecular graphs with described sequence of valence states. Chemom. Intell. Lab. Systems, 18 (1993) 171-181.
23. L.B. Kier and L.H. Hall, Molecular Connectivity in Structure-Activity Analysis. Research Studies Press, Letchworth, UK, 1986.
24. D.L. Massart, C. Janssens, L. Kaufman and R. Smits, Application of the theory of graphs to the optimalisation of chromatographic separation schemes for multicomponent samples. Anal. Chem., 44 (1972) 2390-2399.
Recommended additional reading
L. Kleinrock, Queueing Systems, Vol. II: Computer Applications. Wiley, New York, 1976.
H.M. Wagner, Principles of Operations Research, with Applications to Managerial Decisions. Prentice-Hall, Englewood Cliffs, NJ, 1975.
S.P. Sanoff and D. Poilevey, Integrated information processing for production scheduling and control. Comput. Integr. Manuf., 4 (1991) 164-175.
Chapter 43
Artificial Intelligence: Expert and Knowledge Based Systems 43.1 Artificial intelligence and expert systems To perform tasks in a chemical laboratory various skills and expertise are required. A considerable amount of experience and intelligence is necessary to apply correctly the theoretical knowledge. A good example concerns the use of chemometrical techniques that are described in previous chapters of this book for the analysis and validation of chemical data. Even when we think (or hope) that we have described these techniques and their application possibilities and limitations to a great extent, it remains difficult to select a suitable technique for a specific problem at hand. This aspect requires a considerable amount of intelligence and expertise. A specialized branch of science, artificial intelligence (AI) investigates to what extent it is possible to implement this expertise in a computer. It is a relatively new science whose research object is "intelligence". Although intelligent behaviour is recognizable by most people, it is at this moment impossible to give a clear definition of intelligence. One of several definitions that have been offered is: "All active structures and processes during a reasoning process". From the beginning of computer science, a small group of people has investigated this topic in order to implement in the computer some kind of intelligent behaviour. There are, however, so many different aspects to be studied that various subdisciplines emerged. Natural language systems, robotics, vision, automatic learning, neural networks and expert systems are examples of topics studied in AI. The first useful results relevant to this book were obtained with expert systems. These are computer programs that emulate an expert's behaviour. This implies that they can help solve problems at an expert level. In the last ten years some successes have been achieved in chemistry with these software products [1-6]. They can be considered as a new generation of tools for the chemical laboratory. When integrated in existing methodology and data processing techniques as described in previous chapters, they can provide substantial benefits. The possibility to incorporate experience in expert systems (see Section 43.2) is the major reason for their success.
As an example, consider the automation efforts for chemical laboratories in the last decades. Chemical laboratories of today are equipped with instruments that, in principle, can run automatically for 24 hours a day. This results in a higher productivity, since more samples can be analysed with an equal technical effort. Decisions about the analysis itself, how many and which samples must be analysed with what method or technique, etc., are still the responsibility of the laboratory personnel. Since experience can be incorporated into expert systems, they can provide significant benefits as decision-supporting tools. Therefore, the main ideas of expert systems and their development are explained in this chapter. More detailed information can be found in the numerous textbooks on expert systems [7-10].
43.2 Expert systems Expert systems are computer programs that are intended to support tasks at an expert level. What makes them different from other programs that compute complicated formulas by means of ingenious algorithms? Certainly, these programs are necessary and valuable tools for the expert; they make it possible to carry out computations instantaneously and reliably. When making decisions in a specialized area there is, however, another important aspect, namely, experience. During the decision-making process, the expert — often without realizing it explicitly — makes choices or excludes possibilities to reach conclusions efficiently. His or her major guides are qualitative rules of thumb, optimized during his/her own learning period and experience with his/her job. A simple example will best illustrate the concept of heuristics. Classification of French wines according to their region of origin is a problem that can be tackled in different ways. Many studies are published based on different pattern recognition techniques on chemical and sensory characteristics of the wines. All these techniques perform well to very well, but they all require many measurements. However, this strategy is totally inefficient and unnecessary when some typical prior knowledge is available. When the sample is offered in its original bottle, the shape and colour of the bottle contain at least part of the required information. To reason with this kind of prior knowledge is difficult in classical pattern recognition techniques. The inclusion of qualitative rules, also called heuristic rules, would certainly enhance the performance of the classical techniques. Sometimes these heuristics are an extension of the theory and sometimes they are experience-based rules, with no apparent theoretical justification, but they simply work most of the time. One way of explaining human reasoning that has influenced the expert system research is that these heuristic rules of thumb are
implicit in the expert's mind as a kind of decision tree that he or she is able to consult and evaluate instantaneously. This is what defines an expert. An expert system is then the implementation of a part of this decision tree in a computer program. By chaining the heuristic rules, the expert system can emulate a part of the decision capacity of the expert. Expert systems are not simulations of human intelligence in general. They are simply practical programs that use heuristics to solve specific problems. Such programs can obviously be useful as decision support tools. Since they are developed to give advice at expert level, ideally they should:
- be able to explain their own reasoning process, answering queries such as 'why' or 'how' certain conclusions have been reached or 'why' certain conclusions have not been drawn;
- be easily modifiable; since by definition expert knowledge is dynamic, expert systems must allow easy implementation of changes;
- allow a certain degree of inexactness; since they are based on heuristics which work most of the time, but not in 100% of the cases, a strategy to cope with uncertainty should be available.
The last feature brings us to what is considered to be one major disadvantage of expert systems. Since they act at an expert level, with a certain level of uncertainty, evaluating them in an exact way, as is possible with algorithmic programs, is impossible. Algorithmic programs should work correctly and reliably in 100% of all cases, and test methods are available to evaluate them. Expert systems, however, are based on heuristics, which fail occasionally. The problem is then how to evaluate expert systems. What is acceptable? The only reasonable way is to compare their performance with that of human experts. Even when they are performing well, eliminating the occurrence of a failure is not possible. This is especially a drawback when expert systems are to be integrated in an automatic environment. One way to alleviate this problem is to include a strong meta-knowledge component in the expert system, e.g., knowledge about its boundaries and limitations in order to recognize cases where failures are likely to occur.
43.3 Structure of expert systems
The typical structure of an expert system is shown in Fig. 43.1. Three basic components are present in all expert systems: the knowledge base, the inference engine and the interaction module (user interface).

Fig. 43.1. General structure of an expert system: knowledge base, inference engine and interaction module.

The knowledge base is the heart of the expert system. It contains the necessary expert knowledge and experience to act as a decision support. Only if this is correct and complete enough can the expert system produce meaningful and useful conclusions and advice. The inference engine contains the strategy to use this knowledge to reach a conclusion. This separation between the knowledge in the knowledge base and the inference engine is an important characteristic of expert systems. An advantage is that changes made in the knowledge base will not corrupt the active component, the inference engine. Another spin-off of this separated architecture is the possibility of multiple uses of the inference engine for different knowledge bases. Expert-system shells are commercially available software programs that contain all elements of Fig. 43.1, except that the knowledge base is empty and must be filled to create specific applications. The interaction module takes care of communication between the user and the expert system. Explanation facilities are an important part of the interface.
43.4 Knowledge representation The knowledge representation must allow an efficient implementation of all the knowledge that is necessary for the problem area; at the same time it must be efficiently usable by the inference strategy. These are often conflicting demands. Finding the right balance is a research subject for AI specialists. Because expert
systems will serve as human decision support tools, they must incorporate all the knowledge of the domain in a natural way. The focus is changed from computer efficiency aspects towards the programmer's or user's comfort. It is thanks to research in this field that the focus of modern computer tools and languages has shifted more to the user than to the machine. In general, a representation method for expert systems should fulfil the following criteria as well as possible:
- incorporate qualitative knowledge in a natural way,
- allow an efficient use of this knowledge by the inference engine,
- allow structuring of the knowledge,
- allow representation of meta-knowledge: knowledge about the system itself, so that it is able to recognize its own boundaries.
The two most important representation schemes are the rule-based scheme and the frame-based or object-oriented scheme.
43.4.1 Rule-based knowledge representation
The most popular representation scheme in expert systems is the rule-based scheme. In a rule-based system the knowledge consists of a number of variables, also called attributes, to which a number of possible values are assigned. The rules are the functions that relate the different attributes with each other. A rule base consists of a number of "If ... Then ..." rules. The "IF" part contains the conditions that must be satisfied for the actions or conclusions in the "THEN" part to be valid. As an example, suppose we want to express in a rule that a compound is unstable in an alkaline solution if it contains an ester function. In a semi-formal way the rule can be written:

IF    the molecule contains an ester function
THEN  it is unstable in an alkaline solution

The syntax for implementing this in an actual expert system is, of course, quite diverse and it is beyond the scope of this book to describe this in more detail. As an example the above-mentioned rule is translated into a realistic syntax. This starts with the definition of the attributes and their possible values. In this example two attributes must be defined:

Attribute            Possible values
ester-function       [yes, no]
alkaline stability   [stable, unstable]

IF    ester-function = 'yes'
THEN  alkaline stability = 'unstable'
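One of the many possible formal translations is a simple data structure. The sketch below (Python, purely illustrative) encodes the attribute definitions and the rule, together with a function that fires the rule when its condition matches the known facts.

# One possible (illustrative) encoding of the attribute definitions and the rule above.
attributes = {
    "ester-function": ["yes", "no"],
    "alkaline stability": ["stable", "unstable"],
}

rule = {
    "if":   {"ester-function": "yes"},            # conditions on attribute values
    "then": {"alkaline stability": "unstable"},   # conclusions asserted when the conditions hold
}

def apply_rule(rule, facts):
    """Return the conclusions of the rule if all its conditions match the known facts."""
    if all(facts.get(attr) == value for attr, value in rule["if"].items()):
        return dict(rule["then"])
    return {}

print(apply_rule(rule, {"ester-function": "yes"}))   # {'alkaline stability': 'unstable'}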
In what follows we will use the semi-formal way of describing rules, since there are multiple equivalent ways to implement them formally. The rules may consist of one or more conditions in the IF part and one or more conclusions in the THEN part. The collection of all the rules in the knowledge base constitutes the expertise present in the system. Typically a medium-sized expert system contains a few hundreds of rules. Rules allow a flexible and modular representation of this expertise. It is important that each rule represents a single step in the reasoning process. In this case rules can be seen as independent chunks of knowledge that can be changed or modified easily without affecting the other rules. When the knowledge domain becomes complex, this condition is almost impossible to fulfil. Larger applications require the possibility of structuring the rules in small rule sets that are easier to manipulate. Rules seemingly have the same format as "IF.. THEN.." statements in any other conventional computer language. The major difference is that the latter statements are constructed to be executed sequentially and always in the same order, whereas expert system rules are meant as little independent pieces of knowledge. It is the task of the inference engine to recognize the applicable rules. This may be different in different situations. There is no preset order in which the rules must be executed. Clarity of the rule base is an essential characteristic because it must be possible to control and follow the system on reasoning errors. The structuring of rules into rule sets favours comprehensibility and allows a more efficient consultation of the system. Because of the natural resemblance to real expertise, rule-based expert systems are the most popular. Many of the earlier developed systems are pure rule-based systems.
43.4.2 Frame-based knowledge representation
Frame-like structures can be used to represent facts, objects and concepts. In this context a frame must be interpreted as a software structure in which all characteristics of an object or a concept are described. The simplest example of frame-like structures are the so-called "object-attribute-value" triplets. Examples of such triplets to describe a chromatographic column are:

Column 1 - Tradename - Lichrosorb
Column 1 - Functionality - C8
Column 2 - Tradename - µ-Bondapak
Column 2 - Functionality - C18
Frames can be seen as structures where all relevant information about an object or a concept is collected. As an example, the relevant information about a column in a chromatographic method can be represented by a dedicated general frame, COLUMN. Separate columns can be represented by so-called instantiations of this frame. Instantiations are copies of the general frame that contain the characteristics of a specific object, in this example the separate columns.

COLUMN
  Tradename: [Spherisorb, Hypersil, Lichrosorb, ...]
  Functionality: [C8, C18, ...]
  Particle size: [Real number] (µm)
  Column length: [Integer number] (cm)
  Batch number: [Integer number]
  Internal diameter: [Real number] (mm)

COLUMN 1
  Tradename: Lichrosorb
  Functionality: C8
  Column length: 30 cm
  Particle size: 4.0 µm
  Batch number: 2347645
  Internal diameter: 4.6 mm

COLUMN 2
  Tradename: µ-Bondapak
  Functionality: C18
  Column length: 30 cm
  Particle size: 5.0 µm
  Batch number: 3459863
  Internal diameter: 4.0 mm
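In a program, such a general frame and its instantiations can be represented directly as data structures. The sketch below is an illustrative Python rendering (dictionaries standing in for frames); it does not reproduce the syntax of any particular expert-system shell.

# Illustrative sketch: a general frame and two instantiations, represented as dictionaries.
COLUMN = {                        # the general frame: attributes and their allowed values/types
    "Tradename":         ["Spherisorb", "Hypersil", "Lichrosorb"],
    "Functionality":     ["C8", "C18"],
    "Particle size":     float,  # micrometres
    "Column length":     int,    # cm
    "Batch number":      int,
    "Internal diameter": float,  # mm
}

column_1 = {                      # an instantiation: a copy filled in for one specific column
    "Tradename": "Lichrosorb", "Functionality": "C8",
    "Column length": 30, "Particle size": 4.0,
    "Batch number": 2347645, "Internal diameter": 4.6,
}

column_2 = {
    "Tradename": "µ-Bondapak", "Functionality": "C18",
    "Column length": 30, "Particle size": 5.0,
    "Batch number": 3459863, "Internal diameter": 4.0,
}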
In these frames all specific columns that are relevant for the reasoning process of the expert system can be described in a structured and comprehensive way. The frame-based and rule-based knowledge representations are both required to represent expertise in a natural way. Therefore, in most expert systems a combination of rule-based and frame-based knowledge representation is used. The rule base, together with the factual and descriptive knowledge represented by means of, e.g., frames, constitutes the knowledge base of the expert system.

43.5 The inference engine

43.5.1 Rule-based inferencing
The inference engine is the active component of the expert system. It contains a strategy to use the knowledge present in the knowledge base to draw conclusions. In a rule-based expert system its major task is to recognize the applicable rules and to decide how they must be combined in order to derive new knowledge that eventually leads to the conclusion. Suppose the following two rules are present in a knowledge base:
IF    a molecule contains a phenyl group
THEN  it is UV-active

and

IF    a molecule is UV-active
THEN  a UV-detector can be used

These two rules can be combined into:

IF    a molecule contains a phenyl group
THEN  a UV-detector can be used.
In this example this combination is straightforward. One has to be aware, however, that a medium-sized expert system easily contains several hundred rules. In addition, several rules can be valid at the same time. The inference engine should also have a strategy for deciding on priorities (conflict resolution). For the combination of rules two alternative strategies are possible: forward chaining (data driven) and backward chaining (goal driven). A forward chaining strategy first checks all available factual knowledge for a given problem. It then compares this with the IF parts of the rules and tries to derive new knowledge that can be used to activate other rules. A backward, goal-oriented strategy starts from a possible solution of the problem and tries to establish and prove all necessary conditions for this solution. If this is not successful, another possibility is tried out, until a solution is found. An example will clarify this. Suppose we have a small expert system that can give advice on a suitable solvent and the rule base consists of the following six rules, here represented in a semi-formal way:

1. IF    the polarity of the compound = not polar
   THEN  the solvent is methanol

2. IF    the stability in an alkaline solution of the compound = not stable
   THEN  the solvent is methanol

3. IF    the polarity of the compound = polar
   AND IF the stability in an alkaline solution of the compound = stable
   THEN  the solvent is an alkaline solution

4. IF    no functional groups are present
   THEN  the polarity of the compound = not polar

5. IF    an ester function is present
   THEN  the stability in an alkaline solution of the compound = not stable

6. IF    a carboxylic functional group is present
   THEN  the polarity of the compound = polar
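The six rules above can also be written directly as a small program. The sketch below is purely illustrative (it is not the formal syntax of any particular expert-system shell): the rules are stored as data, and a naive goal-driven (backward-chaining) engine asks the user for any fact that no rule can derive, in the spirit of the walkthrough that follows.

# Illustrative sketch: the six rules as data plus a naive backward-chaining engine.
RULES = [
    # (conditions, conclusion) as (attribute, value) pairs
    ([("polarity", "not polar")],                               ("solvent", "methanol")),
    ([("alkaline stability", "not stable")],                    ("solvent", "methanol")),
    ([("polarity", "polar"), ("alkaline stability", "stable")], ("solvent", "alkaline solution")),
    ([("functional groups", "absent")],                         ("polarity", "not polar")),
    ([("ester function", "present")],                           ("alkaline stability", "not stable")),
    ([("carboxylic group", "present")],                         ("polarity", "polar")),
]

def prove(attribute, value, facts):
    """Try to establish 'attribute = value'; chain backwards through the rules."""
    if attribute in facts:
        return facts[attribute] == value
    relevant = [r for r in RULES if r[1][0] == attribute]
    if not relevant:                                   # bottom-line fact: query the user
        facts[attribute] = input(f"{attribute} (e.g. present/absent)? ")
        return facts[attribute] == value
    for conditions, (_, concluded_value) in relevant:
        if concluded_value == value and all(prove(a, v, facts) for a, v in conditions):
            facts[attribute] = value                   # all subgoals proven: the rule fires
            return True
    return False

facts = {}
for goal in ("methanol", "alkaline solution"):         # the two possible solutions
    if prove("solvent", goal, facts):
        print("Advised solvent:", goal)
        break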
This is obviously not a complete rule base and, moreover, we have oversimplified the chemical knowledge, but for clarity we will restrict the example to these six (over-simplified) rules. The knowledge base is summarized in Fig. 43.2.

Fig. 43.2. Representation of a small example knowledge base to select a solvent (see text for further explanation).

We will simulate a forward and a backward inference engine to find the solvent for acetylsalicylic acid. The goal-driven or backward strategy starts with a possible solution. In our case the two possible solvents are methanol and an alkaline solution. The inference engine searches the rule base for all rules that decide on the solvent. Rules 1, 2 and 3 are selected on this basis. At this point the inference engine must decide which one to investigate first. The simplest strategy is to investigate the rules in order of appearance in the knowledge base. In our case this means that the inference engine selects Rule 1. In more sophisticated expert systems priority rules can be defined that decide on the order of the rules or rule sets that must be investigated. In our example we will first investigate Rule 1. The THEN part of this rule is valid if the compound is not polar. The IF part of this rule now becomes a new subgoal (SG1: is the polarity of the compound = not polar?) that is treated in a similar way. The inference engine now tries to prove this subgoal by searching other rules that conclude in the THEN part on the polarity of a compound. This is the case for Rules 4 and 6. Rule 4 is first selected. The IF part of
Rule 4 is now the new subgoal (SG2: are there functional groups present?). No rules are found in this rule base that conclude on the presence of functional groups. At this point the inference engine checks whether this information is available in the factual database (e.g. the frames). In our example we assume that there is no such factual database and the working memory is still empty. Only in this case does the inference engine ask the user (through the interaction module) for information about this point. When the user confirms the presence of functional groups in acetylsalicylic acid the inference engine realizes that Rule 4 is not valid. Rule 6 is now investigated as the second and last possibility to conclude on the polarity of the compound. The compound is polar when a carboxylic functional group is present. The presence of a carboxylic function becomes subgoal 3. Since no rules or databases are available to decide on this, the user is again asked for information on this topic. Because acetylsalicylic acid contains a carboxylic function, the inference engine can establish the polarity of the compound: The polarity of acetylsalicylic acid = polar. Based on this conclusion SGI must be rejected. The inference engine is now back to Rule 1 and has to conclude that the path followed was unsuccessful. The process just described is called backtracking. Rule 2 can now be investigated in the same way. SG4 is now the instability of the compound in an alkaline solution. Rule 5 is selected because it concludes in the THEN part on the stability of compounds in an alkaline solution. The presence of an ester function becomes SG5. Since no rules are available that conclude on this issue, the user is again asked for information. Through the interaction module the user now confirms the presence of an ester function in acetylsalicylic acid. This proves SG5 which in turn establishes SG4. We are now back to Rule 2, which can be executed. The second path turned out to be successful. The expert system has come to a conclusion and advises methanol as a suitable solvent. In the same way Rule 3 is investigated. The reader can verify that the inference engine can reject this third path without further information of the user. It is already clear from this small example that aspects such as conflict resolution can have an important influence on the performance of the search strategy. A forward chaining inference engine first searches for rules that make use of characteristics of the molecule. This is the case for Rules 4, 5 and 6. The inference engine checks whether the conditions of the IF parts of these rules can be proven from the knowledge already present in the descriptive part of the knowledge base, e.g. in the frames. In our example there is no such information. In this case the expert system queries the user for information on functional groups, ester functions, and carboxylic acid functions. Rule 4 is selected first (order of appearance). Because there are indeed functional groups present in acetylsalicylic acid, this rule cannot be executed and the inference engine continues with Rule 5. This rule can be executed because the user has confirmed the presence of an esterfunction in acetylsalicylic acid. The stability in an alkaline solution has now been
established as: 'not stable'. The inference engine now looks for rules that use in the IF part the stability in an alkaline solution of the compound = 'not stable'. This is the case for Rule 2 and this rule can be executed, leading to the conclusion, namely to advise methanol as solvent for acetylsalicylic acid. The inference engine continues with Rule 6. Based on the information of the user that a carboxylic group is present, the polarity of the compound can be established as 'polar'. The inference engine looks for rules that use the fact that the compound is polar in the IF part. This is the case for Rule 3. This rule can, however, not be executed since the second IF statement of the rule is not true. Both strategies show advantages and drawbacks. When most information of the bottom line rules (here Rules 4,5 and 6) is present in the knowledge base or can be obtained automatically, e.g. by a database search, then forward chaining is the natural way to proceed. When, however, much information must be queried from the user in this way, the use of the expert system becomes irritating and the system gives the impression to ask for many facts, that are seemingly not used. Backward chaining avoids this drawback. Information is asked only where necessary for the reasoning process. The consultation of such a system creates a closer relationship with the user since following the reasoning path of the system is easier, and questions come in a logical sequence. 43.5.2 Frame-based inferencing 43.5.2.1 Inheritance A typical feature of expert systems that support frames is inheritance. Frames can be organized in a hierarchical structure. They can inherit properties (attributes) from frames that are higher in the hierarchy. The latter are therefore called parent frame and the former child frame. There are many varieties of the inheritance principle. Frames can have only one parent frame (simple inheritance) or may have multiple parent frames (multiple inheritance). All attributes can be inherited (full inheritance) or only a few, selected by the knowledge engineer, may be inherited (partial inheritance) by the child frames. An example of a simple inheritance organization of frames is shown in Table 43.1. The frame 'Organic Compound' is the parent frame. The frames 'Ester' and 'Acids' are child frames of 'Organic Compound'. A typical example of inheritance is instantiation. The frame Acetic acid is a child of 'Acids' and, since no extra attributes are added, it is also an instantiation. The hierarchical structure of the frames defines the relation between them. This hierarchical structure is static, however, and cannot be easily modified. Inheritance is therefore an elegant way to represent taxonomical relations or well-established relations between objects. It is less suited for variables or ill-defined relations between objects.
TABLE 43.1
An example of inheritance

Organic Compound
  Name
  Molecular formula
  Molecular weight

Ester (child of 'Organic Compound')
  Name
  Molecular formula
  Molecular weight
  Stability in alkaline solution

Acids (child of 'Organic Compound')
  Name
  Molecular formula
  Molecular weight
  pKa value

Acetic Acid (child of 'Acids')
  Name: acetic acid
  Molecular formula: C2H4O2
  Molecular weight: 60
  pKa value: 4.7
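The parent-child structure of Table 43.1 maps naturally onto class inheritance in an object-oriented language. The following Python sketch is a loose analogue of the table; only the attribute values of acetic acid are taken from the table, the rest is illustrative.

# Loose Python analogue of the inheritance hierarchy of Table 43.1.
class OrganicCompound:
    def __init__(self, name, molecular_formula, molecular_weight):
        self.name = name
        self.molecular_formula = molecular_formula
        self.molecular_weight = molecular_weight

class Ester(OrganicCompound):                       # child frame: inherits all parent attributes
    def __init__(self, name, molecular_formula, molecular_weight, alkaline_stability):
        super().__init__(name, molecular_formula, molecular_weight)
        self.alkaline_stability = alkaline_stability    # extra attribute of the child frame

class Acid(OrganicCompound):
    def __init__(self, name, molecular_formula, molecular_weight, pka):
        super().__init__(name, molecular_formula, molecular_weight)
        self.pka = pka                                  # extra attribute of the child frame

# An instantiation: a specific object filled in with its own attribute values.
acetic_acid = Acid("acetic acid", "C2H4O2", 60, 4.7)
print(acetic_acid.molecular_formula, acetic_acid.pka)   # inherited and added attributes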
43.5.2.2 Object-oriented programming techniques Another relatively recent feature of frame-based expert systems is the use of object-oriented programming techniques. The frames are considered as independent entities that can exchange information by means of messages that they can send to each other. Which frames can send/receive messages to/from which frames must be defined beforehand by procedures that must be part of each frame. Apart from attributes, procedures are thus defined for each frame. They define what kind of messages can be sent to which other frames and what kind of messages can be received by which frames. A message can be the request to change an attribute, or to activate another procedure of the frame. Procedures can involve external programs to be executed. Similar to attributes, procedures can be inherited by child frames when inheritance is supported. Object-oriented programming enormously enlarges the flexibility of frame-based systems. A disadvantage is, however, that it results often in unclear and intractable reasoning structures. Table 43.2 shows the small rule base of Fig. 43.2, translated in an object-oriented system.
TABLE 43.2
An example of object-oriented programming

Sample
  polarity [yes/no]
  alkaline solution stability [yes/no]
  Procedure 'find polarity':
    Call external program 'polarity.exe'
    Assign value1 to SAMPLE/'polarity'
    Send message to SOLVENT/procedure 'start' = value1
  Procedure 'find alkaline solution stability':
    Call external program 'alkaline solution stability.exe'
    Assign value2 to SAMPLE/'alkaline solution stability'
    Send message to SOLVENT/procedure 'start' = value2

Solvent
  Methanol [yes/no]
  Alkaline solution [yes/no]
  Procedure 'start':
    Send message to SAMPLE: activate procedure 'find polarity'
    Send message to SAMPLE: activate procedure 'find alkaline solution stability'
    if value1 = 'yes' and value2 = 'yes'
      SOLVENT/'Alkaline solution' = 'yes'
    elsif
      SOLVENT/'Methanol' = 'yes'
    endif
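A rough and purely illustrative Python transcription of Table 43.2 is given below; the external programs of the table are replaced by hypothetical placeholder functions, so the sketch only shows how the two objects exchange messages and set each other's attributes.

# Rough, illustrative transcription of Table 43.2: two objects exchanging messages.
def external_polarity_program():          # hypothetical stand-in for 'polarity.exe'
    return "yes"

def external_stability_program():         # hypothetical stand-in for the stability program
    return "no"

class Sample:
    def __init__(self):
        self.polarity = None
        self.alkaline_stability = None

    def find_polarity(self):
        self.polarity = external_polarity_program()
        return self.polarity

    def find_alkaline_stability(self):
        self.alkaline_stability = external_stability_program()
        return self.alkaline_stability

class Solvent:
    def __init__(self):
        self.methanol = "no"
        self.alkaline_solution = "no"

    def start(self, sample):
        value1 = sample.find_polarity()                # message sent to the SAMPLE object
        value2 = sample.find_alkaline_stability()      # message sent to the SAMPLE object
        if value1 == "yes" and value2 == "yes":
            self.alkaline_solution = "yes"
        else:
            self.methanol = "yes"

solvent = Solvent()
solvent.start(Sample())
print(solvent.methanol, solvent.alkaline_solution)     # 'yes' 'no' -> advise methanol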
43.5.3 Reasoning with uncertainty Because the rules or procedures in expert systems are heuristic they are often not well-defined in a logical sense. Nevertheless, they are used to draw conclusions. A conclusion can be uncertain because the truth of the rules deriving it cannot be established with 100% certainty or because the facts or evidence on which the rule is based are uncertain. Some measure of reliability of the obtained conclusions is therefore useful. There are different approaches used in expert systems to model uncertainty. They can be divided into methods that are based on
Bayesian probability theory and methods that are based on fuzzy-set theory. The principles of both theories are explained in Chapter 16 and Chapter 19, respectively. Both approaches have advantages and disadvantages for use in expert systems and it must be emphasized that none of the methods developed up to now is satisfactory [7,11]. The application of Bayesian theory requires knowledge of the probabilities of all occurring situations, which is in practice impossible. In one of the earliest successful expert systems, MYCIN, a medical diagnosis system [12], this problem was circumvented by assigning two numbers to each rule, not to be interpreted as probabilities but as likelihoods: a measure of belief (MB, reflecting the strength of the rule) and a measure of disbelief (MD). These two measures are combined into a certainty factor (CF). There are different ways, loosely based on probability theory, to combine these measures when rules must be chained to arrive at the conclusion. Instead of assigning a certainty factor to a rule or statement, a fuzzy measure can be assigned that is based on a function, sometimes called a credibility function in expert system terminology. All these methods claim to represent the intuitive way an expert deals with uncertainty. Whether this is true remains an open question. No method has yet been evaluated thoroughly. Modelling uncertainty to obtain a reasonable reliability measure for the conclusions remains one of the major unsolved issues in expert system technology. Therefore, it is important that the expert system provides a mechanism to define the boundaries within which it is reasonably safe to accept its conclusions.
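To give an impression of how such certainty bookkeeping can work, the sketch below implements certainty factors in the spirit of MYCIN. The combination formulas shown are common textbook choices and are used here as an assumption; they are not claimed to be the exact rules of any particular system.

# Certainty-factor bookkeeping in the spirit of MYCIN (illustrative formulas only).
def certainty_factor(mb, md):
    """Combine a measure of belief and a measure of disbelief into one certainty factor."""
    return mb - md                       # CF ranges from -1 (certainly false) to +1 (certainly true)

def propagate(cf_rule, cf_evidence):
    """Attenuate the certainty of a rule by the certainty of the evidence it is chained to."""
    return cf_rule * max(0.0, cf_evidence)

def combine(cf1, cf2):
    """Combine two certainty factors that support the same conclusion (both non-negative here)."""
    return cf1 + cf2 * (1.0 - cf1)

cf_rule = certainty_factor(mb=0.8, md=0.1)            # a fairly strong heuristic rule
cf_conclusion = propagate(cf_rule, cf_evidence=0.9)   # the evidence itself is uncertain
print(round(combine(cf_conclusion, 0.4), 2))          # a second rule also supports the conclusion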
43.6 The interaction module The interaction module takes care of the communication with the user. It has essentially the same function as the user interface of other software programs. All queries to and from the expert system are passed through this module. Although it has nothing to do with the quality of the advice, it often determines the degree of acceptance of the system. Most expert systems provide also an interface for the programmer and for the user. The programmer's interface can be a knowledge base editor that allows the introduction of the rules in the correct format in an efficient way. Sometimes an automatic debugger for syntax errors is also provided. The possibility to retrieve or forward information from or to the user is always provided. Expert systems can use question/answer dialogues in a natural-like language, or by menu-driven interfaces or by graphical style interactions. The preferred interface is mainly application-dependent. Another essential component, different from classical software, of the interaction module is the explanation facility subsystem. The simplest provision is the
explanation of a question or concept. This is in most cases easily implemented by means of a text file where some difficult concepts are explained. The explanation system only shows the proper section of this file. If we consider the small example expert system of the previous section, we could add explanations for the concept: functional groups. In most expert systems two types of questions from the user are supported: WHY is certain information required and HOW are conclusions reached. In today's expert systems the answer to the first question is simply the rule that the inference engine is trying to fire. If in our example the user asks WHY the expert system asks whether there are functional groups in the compound, the answer would be a reproduction of Rule 4. The answer to the HOW question is the sequence of rules needed to reach the conclusion. In some cases this is sufficient information. Occasionally, however, the user is actually asking for more theoretical background knowledge of the system. To date, this is not possible, since expert systems are not simulations of the expert's intelligence but rather practical programs that reproduce (part of) the heuristics of the expert. 43.7 Tools The implementation of an expert system can be done using an AI language such as PROLOG or LISP (or a conventional language such as PASCAL, FORTRAN). This is, however, a tedious and time-consuming task. Another approach is to use dedicated tools for building expert systems. These tools facilitate the knowledge acquisition and the implementation. Tools for implementation are commercially available in a large price range. Tools for knowledge acquisition are still in the research phase. Implementation tools differ in the facilities they offer. A major selection criterion is the knowledge representation possibilities the tools offer. Another important issue is the access to external programs. This is an important aspect for chemical knowledge bases since many chemical problems involve a large computational aspect that is best solved by an external, algorithmic program. Expert system shells represent the lower end of the range of tools. Shells are tools that offer only one knowledge representation strategy with only one inferencing strategy (or a restricted second one). Shells typically support the rule-based knowledge representation. The user interface is highly standardized and does not allow user-tailored output. Explanation facilities are restricted to the HOW and WHY type. Access to externals is also standardized to classical packages such as LOTUS 123, Excell or DBase. Due to the absence of flexibility, shells are not suitable for large projects. They are, however, very useful for small test projects or for projects where the expert and the knowledge engineer is one and the same person. Shells run on PC-type hardware.
Other, more advanced tools offer possibilities for combining different knowledge representations and inferencing strategies. They typically offer rules, combined with frames, different inheritance possibilities. The possibility of object-oriented programming is not always present. The user interface is not standardized and can be user tailored. The interface to standard packages is usually available. Knowledge engineering environments offer extended possibilities to combine as well various knowledge representations as inferencing strategies. The user interface includes graphical possibilities. External programs can be integrated using the underlying language of the tool. Knowledge environments require at least workstations as hardware. 43.8 Development of an expert system. The appearance of expert systems to solve practical problems, also in chemistry, started in the eighties. During this period much experience has been acquired through the expected and unexpected problems that arose during such projects. Until now there are only a few commercially available expert systems and this is not likely to change in the near future. This implies that expert systems will be mostly in-house developments. The different steps to consider are: 1. analysis of the application area, 2. definition of the knowledge domain and intended users, 3. definition of knowledge sources, 4. selection of hard- and software, 5. knowledge acquisition, 6. implementation/prototyping, 7. testing/validating, 8. maintenance. 43,8.1 Analysis of the application area Since the development of an expert system is a substantial project, only applications with high returns are worthwhile to be considered. The benefits and the risks must be well evaluated. The pay-off of application areas relates to the required speed and or the required reliability of the task. Complicated tasks requiring high level expertise and that are time-critical in their execution are good candidates. It is more likely that the human experts make errors at these tasks. A decision support system is then a valuable tool. Examples are diagnostics in a nuclear reactor [13] or for medical diagnosis [14]. Another pay-off of the development of an expert system concerns the detection of unknown knowledge gaps. During the knowledge acquisition phase it is necessary to make all heuristic rules
explicit. This process often leads to the conclusion that for some situations no heuristic rules or experience are available. The awareness of these knowledge gaps can prevent unforeseen difficulties. There may be of course also other reasons that justify an expert system project. Obtaining experience in this kind of projects often is a reason or it may be useful for demonstration purposes. 43.8.2 Definition of knowledge domain, sources and tools The next few steps are very similar to those required in any software project. One of the first stages is the clear definition of the knowledge domain. It must be clear which problems the expert system must solve. It is at this stage not the intention to define how this can be done. Clarity and specificity must be the major guides here. Fuzziness at this stage will, more than in classical software projects, have to be paid for later when different interpretations cause misunderstandings. Equally important is the clear definition of the end user(s). An expert system set up as decision support tool for professionals is totally different from an expert system that can be used as a training support for less professional people. The next step is to make sure that the whole knowledge area is covered by appropriate expertise. In most cases the knowledge source is a person with extensive experience in the problem area. The expert will also be responsible for the contents of the knowledge base and the quality of advice. Usually, literature references will serve as a secondary source to find basic or background knowledge. Using literature puts fewer demands on the experts. There are situations where literature can be the main source of knowledge. The knowledge domain must then be stable for a longer period. Still there has to be an expert who is responsible for extracting the relevant information from the literature. After the knowledge domain, the end users and the experts have been defined, a choice must be made for the software environment for the development of the expert system. In addition, the hardware for the development process as well as for the delivery system must be identified. Real application expert systems always require at least a mid-sized tool (see Section 43.7). If possible, the best procedure for selecting a tool is to implement a smaller test knowledge base similar to the final knowledge base in a few different tools. This test knowledge base should be small enough to be implemented quickly in different tools. On the other hand, it must be specific enough to highlight the differences between the tools. 43.8.3 Knowledge acquisition The knowledge base of an expert system contains the relevant knowledge about the specific area of the expert system. For this all necessary heuristics must be
made explicit. Two persons play a pivotal role in this step: the expert and the knowledge engineer. The expert is responsible for the contents of the expert system while the knowledge engineer is responsible for the implementation of the knowledge in the selected tools. Knowledge acquisition is the bottleneck of the development stage. Here expert and knowledge engineer together try to unravel the decision path of the problem solving. All the factual knowledge, the decision rules and rules of thumb must be explicitly defined. This process is quite tedious because experts usually have much trouble in defining why and how certain conclusions are reached. The expert often uses many assumptions that are obvious to him or her. Yet these have to be stated explicitly since nothing is obvious for the computer. It is the task of the knowledge engineer to assist the expert in making all necessary knowledge explicit. Up to now numerous knowledge acquisition techniques have been applied, ranging from interviews of the expert by the knowledge engineer to closely observing the expert in his daily practice. Research is still going on to support this step and to automate it as far as possible. The final aim is to construct a formalized representation of the decision process. Decision trees and structured system analysis are possibilities. Some types of expert systems can derive their own rules from examples. These are described in Chapters 18 and 33. 43.8.4 Implementation After knowledge acquisition has been completed, the next step is implementation. Since in practice explicit knowledge emerges piecewise, it is useful to start implementation as soon as some structured knowledge is available. In doing so, the knowledge engineer has the opportunity to become acquainted with the tool. The expert on the other hand uses the prototype to check whether no misunderstandings exist. This tends to greatly enhance the confidence in the project and to reduce misunderstandings between the knowledge engineer and the expert. An important point that should always be borne in mind during implementation is the future maintainability of the system. 43.8.5 Testing, validation and evaluation The testing phase is important in expert system development. The practical applicability of the expert system will largely depend on this phase. Testing expert systems is different from normal software engineering in a number of ways. First, it is difficult to test exhaustively the full code and all possible paths the reasoning process may follow. Secondly, the nature of expert systems poses some typical problems. Due to their heuristic nature the correctness of the results cannot be easily verified. A certain degree of errors may be acceptable and, moreover, an
answer, different from the expert's, may still be acceptable. Because of these facts it is not to be expected that general test procedures will become available. The practical question that must be answered is whether the expert system is useful in the working environment. One procedure to test an expert system is a validation followed by an evaluation of the expert system. Validation involves verifying whether the system responds in the way the expert has in mind. Evaluation involves verifying whether the expert system meets the requirements of the intended end users. Validation can be performed with a number of carefully selected test cases. Evaluation must be performed in real situations. The users must apply the system in their daily work and check the practical usefulness. 43,8,6 Maintenance Once an expert system is installed in a laboratory, it is important to follow its performance in time. Hidden errors may be detected by longer use of the system. Keeping the system up to date is also important. New knowledge and experience which can improve the performance of the system should be incorporated. The maintenance of expert systems should be foreseen in the project management. Knowledge about the structure and knowledge engineering strategy must be saved as long as the expert system is working,
43.9 Conclusion Expert systems were investigated most intensively during the eighties, and it appeared that the problems for their practical usability were greater than expected. In particular, the time-consuming acquisition phase, the absence of a general test procedure and the lack of a satisfactory method to reason with uncertain knowledge prohibited their real breakthrough. Another typical feature of expert systems that limits their distribution compared with classical software is the fact that they contain very subjective and site-specific knowledge. The utility of this kind of knowledge on a larger scale is, of course, limited. All these problems caused a decline in the development of expert system applications. The phrase 'expert system' became almost taboo. Nevertheless, some expert systems have been developed that are useful on a large scale, particularly in the medical domain. In the chemical domain there are also a few famous expert systems such as DENDRAL [15], while other expert systems in chemistry concern organic synthesis, LHASA [1]. LHASA uses a retrosynthetic approach; starting from the molecule to be synthesized, it gives different possibilities for how that molecule can be synthesized in one step from smaller molecules. The user then selects one of the possibilities and subsequently LHASA continues with the molecules involved
in the selected reaction. The process continues until basic starting molecules such as ethanol are reached. LHASA is still undergoing development to make it more efficient to use [16]. Other applications can be found in spectroscopy, chromatography and AAS; for an overview, see Refs. [2-4,17-22]. Two trends can be observed in the last few years. First, small-scale expert systems are being incorporated in chemical instrumentation under the name decision-support systems. Secondly, in many companies site-specific knowledge is being incorporated in computer programs. For example, the strategy for validation of analytical methods is a crucial issue, especially for accreditation and GLP reasons. The development of a satisfactory strategy for a specific laboratory requires much expertise. In many companies programs are being developed that contain this strategy. While these are often programmed in conventional programming languages, the knowledge acquisition phase is the same as for expert systems [23].

References
1. E.J. Corey, A.K. Long and S.D. Rubenstein, Computer-assisted analysis in organic synthesis. Science, 228 (1985) 408-418.
2. L.M.C. Buydens and P. Schoenmakers (eds.), Intelligent Software for Chemical Analysis. Elsevier, Amsterdam, 1993.
3. M. Peris, An overview of recent expert system applications in analytical chemistry. Crit. Rev. Anal. Chem., 26 (4) (1996) 219-237.
4. C.H. Bryant, A. Adam, D.R. Taylor and R.C. Rowe, A review of expert systems for chromatography. Anal. Chim. Acta, 297 (1994) 317-347.
5. M. Cadisch and E. Pretsch, SpecTool: a knowledge-based hypermedia system for interpreting molecular spectra. Fres. J. Anal. Chem., 344 (1992) 173-177.
6. W. Pennincks, P. Vankeerberghen, D.L. Massart and J. Smeyers-Verbeke, A knowledge-based computer system for the detection of matrix interferences in atomic absorption spectrometric methods. J. Anal. Atom. Spectrom. incl. Atomic Absorption Spectrom. Updates, 10 (3) (1995) 207-214.
7. F. Hayes-Roth, D.A. Waterman and D.B. Lenat (eds.), Building Expert Systems. Addison-Wesley, London, 1983.
8. R. Forsyth and R. Rada, Machine Learning: Applications in Expert Systems and Information Retrieval. Ellis Horwood Series in Artificial Intelligence, Ellis Horwood/Wiley, Chichester, 1986.
9. P.G. Raeth, Expert Systems: A Software Methodology for Modern Applications. IEEE Computer Society Press Reprint Collection, IEEE Computer Society Press, Los Alamitos, CA, 1990.
10. A. Hart, Knowledge Acquisition for Expert Systems. Kogan Page, London, 1986.
11. P. Walley, Measures of uncertainty in expert systems. Artif. Intell., 83 (1) (1996) 1-58.
12. E.H. Shortliffe, Computer-based Medical Consultation: MYCIN. Elsevier, New York, 1976.
13. H.Y. Chung, I.S. Park, S.K. Hur, S.W. Cheon, H.G. Kim and S.H. Chang, Diagnostic strategies for primary-side systems in nuclear power plants. Elect. Power Energy Syst., 14 (4) (1994) 284-297.
14. K.P. Adlassnig, H. Leitich and G. Kolarz, On the applicability of diagnostic criteria for the diagnosis of rheumatoid arthritis in an expert system. Expert Syst. Applic., 13 (1) (1997) 73-79.
15. B.G. Buchanan and E.A. Feigenbaum, Dendral and Meta-Dendral; their application dimension. Artif. Intell., 11 (1978) 5-24.
16. M.A. Ott and J.H. Noordik, Long-range strategies in the LHASA program: the Quinone Diels-Alder transform. J. Chem. Inf. Comput. Sci., 37 (1997) 98-108.
17. P. Van Keerbergen, J. Smeyers-Verbeke and D.L. Massart, Decision support system for run suitability checking and explorative method validation in electrothermal atomic absorption spectrometry. J. Anal. Atomic Spectrom. incl. Atomic Spectrom. Updates, 11 (2) (1996) 149-158.
18. J.E. Matek and G.F. Luger, An expert system controller for gas chromatography automation. Instrum. Sci. Technol., 25 (2) (1997) 107-120.
19. M.E. Elyashberg, Y. Karasev, E.R. Martirosian, H. Thiele and H. Somberg, Expert systems as a tool for the molecular structure elucidation by spectral methods, strategies of solution to the problem. Anal. Chim. Acta, 348 (1997) 443-464.
20. M. De Smet, G. Musch, A. Peeters, L. Buydens and D.L. Massart, Expert systems for the selection of HPLC methods for the analysis of drugs. J. Chromatogr., 485 (1989) 237-253.
21. P.J. Schoenmakers, A. Peeters and R.J. Lynch, Optimization of chromatographic methods by a combination of optimization software and expert systems. J. Chromatogr., 506 (1990) 169-184.
22. M. Mulholland, N. Walker, J.A. van Leeuwen, L. Buydens, H. Hindriks and P.J. Schoenmakers, Expert systems for method development and validation in HPLC. Microchim. Acta, 2 (1991) 493-503.
23. J. Sandberg, B. Wielinga and G. Schreiber, Methods and techniques for knowledge management: what has knowledge engineering to offer? Expert Syst. Applic., 13 (1) (1997) 73-79.
Additional Reading
H.J. Luinge, A knowledge-based system for structure analysis from infrared and mass spectral data. Trends Anal. Chem., 9 (1990) 66-69.
M. Otto, Fuzzy expert systems. Trends Anal. Chem., 9 (1990) 69-72.
J. Klaessens, L. Van Beysterveldt, T. Saris, B. Vandeginste and G. Kateman, LABGEN, expert system for knowledge-based modelling of analytical laboratories. Part 1. Laboratory organization. Anal. Chim. Acta, 222 (1989) 1-17.
K. Janssens and P. van Espen, Implementation of an expert system for the qualitative interpretation of X-ray fluorescence spectra. Anal. Chim. Acta, 184 (1991) 117-132.
B.A. Hohne and T.H. Pierce (eds.), Expert-system Applications in Chemistry. ACS Symp. Ser., Vol. 306, ACS, Washington, DC, 1989.
Chapter 44
Artificial Neural Networks 44.1 Introduction In Chapter 43 the incorporation of expertise and experience in data analysis by means of expert systems is described. The knowledge acquisition bottleneck and the brittleness of domain expertise are, however, the major drawbacks in the development of expert systems. This has stimulated research on alternative techniques. Artificial neural networks (ANN) were first developed as a model of the human brain structure. The computerized version turned out to be suitable for performing tasks that are considered to be difficult to solve by classical techniques. Research on artificial neural networks has developed along two major pathways. First, there are efforts to mimic the nervous system as closely as possible by the artificial networks. The goal of this research is to better understand the functioning of the nervous system. Another school of research is more interested in artificial neural networks as an interesting computing formalism. The focus of the efforts here lies in applying and adapting the concept of artificial neural networks to solve and study practical computational problems. The aim is to exploit this new concept as much as possible, not to better understand the neurophysiological functioning. For chemistry and more specifically for chemometrics, the results of the second approach are of interest. In this book, therefore, only the second approach to neural networks is considered. In what follows whenever neuron or neural network are mentioned, the artificial variant is meant. Many different types of networks have been developed. They all consist of small units, neurons, that are interconnected. The local behaviour of these units determines the overall behaviour of the network. The most common is the multilayer-feed-forward network (MLF). Recently, other networks such as the Kohonen, radial basis function and ART networks have raised interest in the chemical application area. In this chapter we focus on the MLF networks. The principle of some of the other networks are explained and we also discuss how these networks relate with other algorithms, described elsewhere in this book.
44.2 Historical overview The beginning of the research into artificial neural networks is often considered to be 1943 when McCulloch and Pitts published their paper on the functioning of the nervous system [1]. They tried to explain it by small units that are based on mathematical logic and that are interconnected. These units are abstractions of the biological neurons, and their connections. In 1949 Hebb [2] tried to explain some psychological results by means of a learning law for biological synapses. With the introduction of computers it was possible to develop and test artificial neural networks. The first artificial neural networks on computer were developed by Rosenblatt (the perceptron) [3] and by Widrow [4] (the AD ALINE: ADAptive LINear Element). The main difference between them is the learning rule that they use. These simple networks were able to learn and perform some simple tasks. This success stimulated research in the field; the book by Nilsson on linear learning machines [5] summarizes most of the work of this early period. The possibilities of artificial neural networks were believed to be tremendous, and gave rise to unrealistically high expectations by the public at large. At that time (1969) Papert and Minsky showed [6] that many of these expectations could not be fulfilled by the perceptron. Their book had a very negative impact on the research in the field: funding became difficult and publication of papers declined. Apart from some enthusiastic researchers who continued their efforts in the field, research stopped for many years. Some of the investigations continued under the heading adaptive signal processing or pattern recognition. In chemistry the best known example is the linear learning machine which was a popular pattern recognition method (see also Chapter 33). In 1986 the second breakthrough was caused by the publication of a book by Rumelhart in which a learning strategy, the back-propagation, developed earlier by Werbos was proposed [7,8]. This new learning rule allowed the construction of networks which were able to overcome the problems of the perceptron-based networks. They were now able to solve more complicated (non-linear) problems. This breakthrough caused a new interest and up-to-date research is still increasing with encouraging results.
44.3 The basic unit — the neuron

The structural unit of artificial neural networks is the neuron, an abstraction of the biological neuron; a typical biological neuron is shown in Fig. 44.1. Biological neurons consist of a cell body from which many branches (dendrites and axon) grow in various directions. Impulses (external or from other neurons) are received through the dendrites. In the cell body, these signals are sifted and integrated.
Fig. 44.1. A biological neuron. (Adapted from Ref. [80]).
When a certain threshold impulse is reached, a suitable response signal is delivered. The axon transports the response signal to many other neurons through a complex branching system. The contact points between different nerve cells, through which the signals are propagated, are called synapses. The efficiency of the propagation of signals from one cell to another is influenced by many factors (chemical and electrical) and is called the synaptic strength. It is important to note that this basic process is the same for all neurons. This suggests that the connecting network structure determines the overall behaviour of a nervous system. This basic process is represented by the artificial neuron as shown in Fig. 44.2. The incoming signals (the input) are passed to the neuron body, where they are weighted and summed; they are then transformed, by passing through the transfer function (or output function), into the outgoing signal (the output of the neuron).
Fig. 44.2. An artificial neuron; x1...xp are the incoming signals, w1...wp are the corresponding weight factors and F is the transfer function.
The propagation of the signal is determined by the connections between the neurons and by their associated weights. These weights represent the synaptic strengths in the biological neuron. Different types of networks have been developed. The connection pattern is an important differentiating factor, since it determines the information flow through the network. The weights associated with each connection are also essential since, together with the connection pattern, they determine the signal propagation through the network. They are real numbers that determine the strength of the connectivity between the two connected neurons. Each signal that is transported through a connection is multiplied by the associated weight of that connection. Weights can be positive or negative. The appropriate setting of the weights is essential for the proper functioning of the network. They determine the relationship between the input and the eventual output of the network and are therefore considered as the distributed knowledge content of the network. Finding the proper weight setting is achieved in a training phase in which the weights are adapted according to a so-called learning rule. The training of the network is a major phase in the development of artificial neural networks. Different networks require different training procedures, the two major variants being supervised and unsupervised training (see also Chapters 30 and 33).
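To make this weighted-sum-and-transfer mechanism concrete, the following minimal Python sketch (our own illustration, not part of the original text; the function and variable names are hypothetical) computes the output of a single artificial neuron with a sigmoid transfer function.

import numpy as np

def neuron_output(x, w, theta):
    """Output of one artificial neuron: the weighted sum of the inputs,
    corrected for the offset (bias) theta, is passed through a sigmoid
    transfer function."""
    net = np.dot(w, x) - theta           # NET = w1*x1 + ... + wp*xp - theta
    return 1.0 / (1.0 + np.exp(-net))    # sigmoid transfer function

# example with two inputs and arbitrary weights
x = np.array([0.5, 1.2])
w = np.array([0.8, -0.4])
print(neuron_output(x, w, theta=0.1))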
44.4 The linear learning machine and the perceptron network

44.4.1 Principle

The perceptron-like linear networks were the first networks that were developed [3,4]. They are described in an intuitive way in Chapter 33. In this section we explain their working principle as an introduction to that of the more advanced MLF networks. We explain the principle of these early networks by means of the Linear Learning Machine (LLM), since it is the best known example in chemistry. The LLM is a classical supervised pattern recognizer. It tries to find boundaries between classes. In Fig. 44.3 an example of two classes A and B is shown in two dimensions (x1 and x2). Each possible object is thus defined by its values on x1 and x2. The boundary between the two classes A and B is defined by the line, L, described by eq. (44.1):
w1 x1 + w2 x2 = θ

or

w1 x1 + w2 x2 − θ = 0    (44.1)

Fig. 44.3. Two classes in two dimensions and a possible boundary line.
where w1 and w2 are called the weights and θ is the offset; in neural network terminology it is called the bias. The offset or bias is added for generality: if no offset (or bias) is provided, the line is forced to go through the origin. The line L divides the two-dimensional space into two regions (corresponding to the two classes). All points or objects (combinations of x1 and x2) below the line yield a value for the function, which we will call NET(x1,x2) (eq. (44.2)), larger (smaller) than θ; the points above the line yield a value of NET(x1,x2) smaller (larger) than θ; the points on the line satisfy eq. (44.1) and thus yield the value θ.

NET(x1i, x2i) = w1 x1i + w2 x2i = x_i^T w    (44.2)

where x_i^T = [x1i  x2i] and w^T = [w1  w2].
Object i is classified as:

class A, if NET(x1i, x2i) > θ, i.e. NET(x1i, x2i) − θ > 0
class B, if NET(x1i, x2i) < θ, i.e. NET(x1i, x2i) − θ < 0

Another approach to eq. (44.2) is to add an extra dimension to the object vector x_i, on which all objects have the same value. Usually 1 is taken for this extra term. The θ term can then be included in the weight vector, w^T = [w1  w2  −θ]. This is the same procedure as in MLR, where an extra column of ones is added to the X-matrix to accommodate for the intercept (Chapter 10). The objects are then characterized by the vector x_i^T = [x1i  x2i  1]. Equation (44.2) can then be written as:

NET(x_i) = x_i^T w    (44.3)

where x_i^T = [x1i  x2i  1] and w^T = [w1  w2  −θ].
Object i is then classified as class A if NET(x_i) > 0 and as class B if NET(x_i) < 0. The classification of objects is thus based on the threshold value, θ, of NET(x), also called the bias. The procedure can be described by means of a transfer function, F (Fig. 44.4a). The weighted sum of the input values of x is transmitted through a transfer function, a threshold step function, also called a hard delimiter function.
Fig. 44.4. (a) Schematic representation of the LLM. F represents the threshold transfer function: F = sign(NET(x)). (b) On the left: the threshold function for NET(x) = w1x1 + w2x2; on the right: for NET(x) = w1x1 + w2x2 − θ; see text for explanation.
Figure 44.4b shows the threshold function. The classification is based on the output value of the threshold function (+1 for class A; −1 for class B). When the input to the threshold function (i.e. NET(x_i)) exceeds the threshold, the output value, y, of the function is +1; otherwise it is −1. Instead of such a hard delimiter function it is possible to use other transfer functions, such as the threshold logic, also called the semi-linear function (Fig. 44.5a). In this function there is a region where the output value of the transfer function is linearly related to the input value with a slope α. The first point, A, of this region is reached when x1w1 + x2w2 = NET(x1,x2) = θ. The endpoint, B, of the linear region is reached when α(x1w1 + x2w2 − θ) = 1, i.e. when x1w1 + x2w2 = NET(x1,x2) = θ + 1/α. The width of the interval between A and B is thus 1/α. It is, moreover, possible to use non-linear functions. The sigmoidal transfer function (Fig. 44.5b) is the most widely used transfer function in the more advanced MLF networks. It is discussed in more detail in Section 44.5.
Fig. 44.5. (a) The semi-linear function, F = max(0, min(1, α(NET(x) − θ))) with NET(x) = w1x1 + w2x2, and (b) the sigmoid function, F = 1/(1 + e^(−β(NET(x) − θ))).
44.4.2 Learning strategy

The weights, as described in the previous section, determine the position of the boundary that the LLM draws between the classes. The strategy to find these weights is at the heart of the LLM. This procedure is called the learning rule [7,8]. It is a supervised strategy and is based on the learning rule that Hebb suggested for biological neurons [2]. Initially the weights are set randomly. A training set with a number of objects with known classification is presented to the classifier. At each presentation the weights are updated with an amount Δw, determined by the learning rule. The most important variant, used in networks, is the delta rule
developed by Widrow and Hoff (eq. (44.4)) [4]. In this learning strategy the actual output of the neuron is adapted with a term based on δ, the error, i.e. the difference between the desired or target output and the actual output of the neuron for a specific object (eq. (44.4)):

Δw ∝ x_i δ_i,  with  δ_i = d_i − a_i    (44.4)
where d_i is the desired output and a_i is the actual output. In the linear learning machine this rule is applied as follows:
1. present an object i with input vector x_i and apply eq. (44.2);
2. if the classification is correct, the weights are left unchanged;
3. if the classification is wrong, the weights are updated according to eq. (44.4); δ is taken such that the updated weights yield the current output but with an opposite sign, thus yielding the correct classification for object i (see eq. (44.5));
4. go to step 1.

In the LLM the (scalar) desired output value (x_i^T w_new) is defined as the negative of the actual (wrong) output value:

x_i^T w_new = −x_i^T w_old    (44.5)

The trivial solution (w_new = −w_old) is not interesting since it defines the same boundary line. A non-trivial solution is found by the following procedure:

Δw ∝ δ_i x_i

or

δ_i = η (x_i^T Δw) / (x_i^T x_i),  with  Δw = w_new − w_old

and by using eq. (44.5)
TABLE 44.1
Example of LLM procedure
Initial weight vector w0^T = [1, −1, 0]; η = 1

Object  x^T        Class  Result x^T w  OK?  New w^T
Cycle 1
1       (1,2,1)    A      −1            Yes  —
2       (3,2,1)    A      +1            No   w1^T = (0.57, −1.28, −0.14)
3       (−1,0,1)   B      −0.71         No   w2^T = (−0.14, −1.26, 0.57)
4       (−1,2,1)   B      −1.86         No   w3^T = (−0.76, −0.05, 1.19)
Cycle 2
1       (1,2,1)    A      +0.33         No   w4^T = (−0.87, −0.27, 1.08)
2       (3,2,1)    A      −2.07         Yes  —
3       (−1,0,1)   B      +1.95         Yes  —
4       (−1,2,1)   B      +1.41         Yes  —
Cycle 3
1       (1,2,1)    A      −0.33         Yes  —
2       (3,2,1)    A      −2.07         Yes  —
3       (−1,0,1)   B      +1.95         Yes  —
4       (−1,2,1)   B      +1.41         Yes  —

All objects are correctly classified after three cycles; final w^T = (−0.87, −0.27, 1.08).
w_new = w_old − (2η x_i^T w_old / (x_i^T x_i)) x_i    (44.6)
When η is set to 1, the new weights yield the opposite NET-value for the object i. This mirroring effect can sometimes yield undesirably large weight changes; therefore η is usually set to a value < 1. In Table 44.1 a calculation on a two-class example is given. In Fig. 44.6 the boundaries found by the LLM are shown.
Fig. 44.6. The successive boundaries (w0 to w4) as found by the LLM during the training procedure described in Table 44.1. The crosses indicate the positions of the objects (1 to 4); their class membership (A or B) is given between parentheses.
As an example, the weight adaptation in the first cycle, caused by the wrong classification of the second object, can be calculated using eq. (44.6):

w_new = [1  −1  0]^T − (2 · 1 · ([3 2 1][1 −1 0]^T) / ([3 2 1][3 2 1]^T)) [3 2 1]^T
      = [1  −1  0]^T − (2/14) [3 2 1]^T
      = [0.57  −1.28  −0.14]^T
44.4.3 Limitations

In Fig. 44.7 an example of a classification problem is shown that cannot be solved by the simple perceptron-like networks. It is known as the "exclusive or" (XOR) problem. No single boundary can be found that yields a correct classification into the two classes A and B for all objects.
Fig. 44.7. The XOR classification problem (see also Table 44.2).
TABLE 44.2
A two-layered network of Fig. 44.8a to solve the XOR problem

Class  x1  x2  hu1 = sign(x1 + x2 − 0.5)  hu2 = sign(x1 + x2 − 1.5)  Final output = sign(hu1 − hu2 − 0.5)
B      0   0   −1                          −1                         −1
A      0   1   +1                          −1                         +1
A      1   0   +1                          −1                         +1
B      1   1   +1                          +1                         −1
The fact that perceptron-like networks were only able to solve simple linearly separable problems was first criticized by Minsky and Papert [6] and marked the end of the first period of interest in neural network research. To solve the XOR problem it must be possible to find two boundary lines. For this an additional layer of neurons is necessary (see Fig. 44.8a). The two additional neurons in the hidden layer each determine one boundary line. By combining these lines it is possible to obtain the correct classification. In Fig. 44.8a an example of a combination of weights and thresholds is given that yields the correct classification. The first hidden unit determines 'line 1' and the other one determines 'line 2' in Fig. 44.8b. In Table 44.2 the result of applying the different hard delimiter transfer functions of the two hidden units and the output unit is summarized. The output values of the transfer functions of the two hidden units, hu1 and hu2, are given in the third and fourth column. The two objects of class A yield the same results for hu1 and hu2; the objects of class B still yield different results. In Fig. 44.8c the output values of the hidden units for the different objects are plotted.
Fig. 44.8. (a) The structure of the neural network for solving the XOR classification problem of Fig. 44.7. (b) The two boundary lines as defined by the hidden units in the input space (x1,x2). (c) Representation of the objects in the space defined by the output values of the two hidden units (hu1,hu2) and the boundary line defined in this space by the output unit. The two objects of class A are at the same location.
It is the output unit that determines a third line, 'line 3' in Fig. 44.8c, which yields the correct classification (see Table 44.2).
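The network of Fig. 44.8a is small enough to write out directly. The following Python sketch (our own illustration, not from the original text) uses hard delimiter transfer functions with the weights and thresholds of Table 44.2 and reproduces its last column.

import numpy as np

def hard_delimiter(v):
    """Threshold transfer function: +1 if the net input is positive, -1 otherwise."""
    return np.where(v >= 0, 1, -1)

def xor_network(x1, x2):
    hu1 = hard_delimiter(x1 + x2 - 0.5)     # hidden unit 1 (line 1)
    hu2 = hard_delimiter(x1 + x2 - 1.5)     # hidden unit 2 (line 2)
    return hard_delimiter(hu1 - hu2 - 0.5)  # output unit (line 3)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_network(x1, x2))      # -1 for class B, +1 for class A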
44.5 Multilayer feed-forward (MLF) networks

44.5.1 Introduction

Unfortunately, no learning rule was available that could train such two-layered networks. The major problem was that, while it is straightforward to define the target output value for the output neuron, it is difficult to define the desired output value of the hidden neuron layers. When Rumelhart [7,8] published the back-propagation learning rule in 1986 to train these networks, MLF networks became the most popular and most widely used artificial neural networks in many application fields, including chemistry [9-11]. Sometimes they are called back-propagation networks, after the learning rule. Besides the additional layer, an important difference with the earlier described perceptron-like networks is the use of the sigmoidal transfer function. The introduction of this transfer function is essential for the back-propagation learning rule and it is also responsible for the non-linear properties of the MLF networks (see Sections 44.5.4 and 44.5.5). As an extension of perceptron-like networks, MLF networks can be used for non-linear classification tasks. They can, however, also be used to model complex non-linear relationships between two related series of data: descriptor or independent variables (X matrix) and their associated predicted or dependent variables (Y matrix). Used as such, they are an alternative to other numerical non-linear methods. Each row of the X-data table corresponds to an input or descriptor pattern. The corresponding row in the Y matrix is the associated desired output or solution pattern. A detailed description can be found in Refs. [9,10,12-18].

44.5.2 Structure

The general structure is shown in Fig. 44.9. The units are ordered in layers. There are three types of layers: the input layer, the output layer and the hidden layer(s). All units from one layer are connected to all units of the following layer. The network receives the input signals through the input layer. Information is then passed to the hidden layer(s) and finally to the output layer, which produces the response of the network. There may be zero, one or more hidden layers; networks with one hidden layer make up the vast majority. The number of units in the input layer is determined by p, the number of variables in the (n×p) matrix X. The number of units in the output layer is determined by q, the number of variables in the (n×q) matrix Y, the solution pattern.
Fig. 44.9. General structure of an MLF network: (a) without bias and (b) with a bias neuron.
Just as in the perceptron-like networks, an additional column of 'ones' is added to the X matrix to accommodate for the offset or bias. This is sometimes explicitly depicted in the structure (see Fig. 44.9b). Notice that an offset term is also provided between the hidden layer and the output layer. The choice of h, the number of units in the hidden layers, or hidden units, is discussed in Section 44.5.8.
44.5.3 Signal propagation

The signal propagation in MLF networks is similar to that of the perceptron-like networks described in Section 44.4.1. For each object, each unit in the input layer is fed with one variable of the X matrix and each unit in the output layer is intended to provide one variable of the Y table. The values of the input units are passed unchanged to each unit of the hidden layer. The propagation of the signal from there on can be summarized in three steps.

1. Each hidden unit, j, receives the signals from the p units of the previous layer, the input layer. From these signals the net input is calculated:

NET_j(x_i) = x_i^T w_j    (44.7)

where x_i^T = [x1i ... xpi  1] and w_j^T = [wj1 ... wjp  −θ]

w_j is the weight vector associated with the hidden unit j, including θ, the bias term. x_i is the input vector, including an extra 'one' to accommodate for the bias, that characterizes object i. Since the input units are pass-through units, x_i is also the input vector for each of the hidden units.

2. The net input, NET_j, is then passed to the transfer function that transforms it into the output signal of the unit. Different transfer functions may be used, the most common non-linear one being the sigmoidal function (Fig. 44.5b):

output_j = o_j = sf(NET_j) = 1 / (1 + e^(−NET_j))    (44.8)

3. The output units receive the weighted output signals of the h hidden units. The weighted sums are calculated and passed through the transfer function to yield the final output of the network:

NET_k(o_i) = o_i^T w_k

where w_k^T = [wk1 ... wkh  −θ] and o_i^T is the input vector to the output units. It contains the output signals of the h hidden units plus an additional 'one' to take the bias term of the output units into account.

The whole signal propagation is shown in Fig. 44.10. As can be seen, the signals are only forwarded from one layer to the next layer. There is no signal propagation within a layer or to a previous layer. This explains the term feed-forward networks.
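A sketch of one forward pass through an MLF network with a single hidden layer is given below (our own illustration, not part of the original text; the linear output layer and all names are assumptions). The bias terms are handled, as in the text, by appending an extra 'one' to the input and hidden-layer vectors.

import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def mlf_forward(x, W_hidden, W_output):
    """Forward pass: W_hidden has shape (h, p+1) and W_output shape (q, h+1);
    the last column of each weight matrix contains the bias term (-theta)."""
    x_aug = np.append(x, 1.0)                # extra 'one' for the hidden-layer bias
    o_hidden = sigmoid(W_hidden @ x_aug)     # steps 1 and 2: net input and sigmoid
    o_aug = np.append(o_hidden, 1.0)         # extra 'one' for the output-layer bias
    return W_output @ o_aug                  # step 3, here with linear output units

# example: 3 input variables, 2 hidden units, 1 output
rng = np.random.default_rng(0)
W_h = rng.uniform(-0.3, 0.3, size=(2, 4))
W_o = rng.uniform(-0.3, 0.3, size=(1, 3))
print(mlf_forward(np.array([0.1, 0.5, -0.2]), W_h, W_o))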
Fig. 44.10. An example of signal propagation in an MLF network with transfer function 1/(1 + e^(−NET(x))).
44.5.4 The transfer function

44.5.4.1 Role of the transfer function

The sigmoidal transfer function is the most common one in MLF networks. This function has been developed as an adaptation of the hard delimiter transfer function of the perceptron-like networks. The back-propagation learning rule (see Section 44.5.5) requires a transfer function for which a derivative exists over the whole domain. The transfer function has two important roles in the signal propagation process. It 'squashes' the output value of the units between 0 and 1, preventing it from becoming infinitely large, and it is also responsible for the non-linear modelling properties that make MLF networks so useful.
44.5.4.2 Transfer function of the output units

There are basically two alternatives for the transfer function of the output units: the sigmoidal function and the linear function. The sigmoidal function is preferred in (supervised) classification tasks because of its squashing property. The desired output of the network in a classification task is typically 1 or 0 (belonging or not to a certain class). The shape of the sigmoidal function takes care that the output values are squashed into the range [0-1]. When the MLF network is used to model non-linear relationships, this squashing of the output values is not necessary and is even harmful for the performance of the network. For modelling tasks a simple linear transfer function for the output units performs best.

44.5.4.3 Transfer function in the hidden units

The transfer function of the hidden units in MLF networks is always a sigmoid or related function. As can be seen in Fig. 44.5b, θ represents the offset and has the same function as in the simple perceptron-like networks. β determines the slope of the transfer function. It is often omitted in the transfer function since it can implicitly be adjusted by the weights. The main function of the transfer function is modelling the non-linearities in the data. In Fig. 44.11 it can be seen that there are five different response regions in the sigmoidal function:
Fig. 44.11. Different response regions in a sigmoid function. For explanation, see text.
Fig. 44.12. (a) and (b): The output value (y) of the MLF network as a function of the input value (x) with two different weight settings. (c) and (d): The contour plots of the output value of an MLF network in the specified input space with two different weight settings.
1. NET_j < A: the response approximates 0
2. A < NET_j < B: the response is non-linear, convex
3. B < NET_j < C: the response varies almost linearly with NET_j
4. C < NET_j < D: the response is non-linear, concave
5. NET_j > D: the response is maximal (approximates 1)

The region from A to D is called the dynamic range. Regions 2 and 4 constitute the most important difference with the hard delimiter transfer function in perceptron networks. These regions, rather than the near-linear region 3, are most important since they assure the non-linear response properties of the network.
Fig. 44.12 continued (b).
It may seem surprising at first sight that such a simple non-linear unit suffices to successfully model complex non-linear relationships. One must not forget, however, that it is the proper combination (weight setting) of several of those sigmoid functions that assures the overall modelling power. In Fig. 44.12a an example of a combination of the sigmoidal functions of two hidden units is given, such that a Gaussian-like relationship between input and output is modelled. Changing the weights yields Fig. 44.12b. It is easy to imagine that in such a way complex relationships can be modelled with more sigmoids. In this view an MLF network can be seen as a kind of automatic spline fitter (Chapter 11). It is automatic because the combination of basis functions and the knot setting are done automatically during the weight and bias setting in the training phase by means of the learning rule (Section 44.5.5).
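The effect illustrated in Fig. 44.12a can be mimicked with a few lines of Python (our own sketch; the offsets below are illustrative assumptions, not the weight settings actually used for the figure): the difference of two shifted sigmoids of one input already yields a Gaussian-like bump.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def two_sigmoid_bump(x, theta1=-2.5, theta2=2.5):
    """Combination (difference) of two shifted sigmoids of a single input x,
    giving a Gaussian-like relationship between input and output."""
    return sigmoid(x - theta1) - sigmoid(x - theta2)

x = np.linspace(-10, 10, 9)
print(np.round(two_sigmoid_bump(x), 3))   # small at the extremes, large around x = 0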
Fig. 44.12 continued (c).
When the MLF is used for classification, its non-linear properties are also important. In Fig. 44.12c the contour map of the output of a neural network with two hidden units is shown. It shows clearly that non-linear boundaries are obtained. Totally different boundaries are obtained by varying the weights, as shown in Fig. 44.12d. For modelling as well as for classification tasks, the appropriate number of transfer functions (i.e. the number of hidden units) thus depends essentially on the complexity of the relationship to be modelled and must be determined empirically for each problem. Other functions, such as the tangens hyperbolicus function (Fig. 44.13a), are also sometimes used. In Ref. [19] the authors came to the conclusion that in most cases a sigmoidal function describes non-linearities sufficiently well; only in the presence of periodicities in the data is this type of function not adequate.
Fig. 44.12 continued (d).
In that case a sinusoidal transfer function (Fig. 44.13b) is necessary. In a special kind of MLF network the so-called radial basis function is used as the transfer function. We describe these networks in Section 44.6.

44.5.5 Learning rule

In MLF networks the proper setting of the weights is optimized by supervised training. The weights are optimized by means of a number of example input patterns together with their associated desired output patterns. During the training session the weights are adapted according to the learning rule.
Fig. 44.13. The tangens hyperbolicus and the sine function.
The most common learning algorithm is the back-propagation learning rule [7,8]. The weight updates are, as in the delta rule, based on the difference between the actual and the desired output of the network. The weight updating can be done after each training example that is offered to the network, or it can be done after all training examples have been seen once. The two procedures are not strictly equivalent. Since the former is applied most, this procedure is explained here. It can be summarized as follows:
1. Initialize the weights with small random values in a range around 0 (e.g. −0.3 to +0.3).
2. For each output unit, j, calculate the output value with the current weight setting and the error, E_j, based on the difference between this value and the target or desired output value:
E_j = ½ (d_j − o_j)²    (44.9)
E_j is determined by the weights (through o_j, which is a function of NET_j, see eq. (44.8)). Note that this error is in fact the same as the error term used in a usual least-squares procedure.
3. Carry out the weight adaptation of the output neurons:

Δw_hj = η δ_j o_h    (44.10)
w_hj is the weight between the hth hidden unit and the jth output unit; η is the learning rate, a positive constant between 0 and 1; o_h is the output of the hth hidden unit and δ_j is a term based on the error. In the back-propagation strategy the weight adaptation is made in the direction that minimizes the error. The back-propagation algorithm is thus essentially a gradient-based optimization method. Therefore, the gradient of the error as a function of the weights must be calculated. In Fig. 44.14 this is shown for one of the weights. For the output units this gradient minimization yields eq. (44.11). This explains why the transfer function must have a derivative over the whole response domain.

δ_j = (d_j − o_j) d(sf(NET_j))/d(NET_j) = (d_j − o_j) sf(NET_j) [1 − sf(NET_j)]    (44.11)

sf(NET_j) represents the value of the sigmoidal function for NET_j.
Fig. 44.14. An example of an error surface of an MLF network as a function of one weight value. The gradient determines the change of the weight in the next iteration.
4. Calculate the adaptation of the weights to the hidden layer. The desired output and thus the error is not known directly. In the back-propagation strategy it is calculated from the accumulated errors of the output neurons. It can be shown (e.g. Refs. [14] and [15]) that this yields eq. (44.12). In this equation, k represents the units in the output layer. All weight adaptations can be calculated using eq. (44.10) and eq. (44.12). The error is said to be back-propagated through the network.
δ_j = d(sf(NET_j))/d(NET_j) · Σ_k δ_k w_jk = sf(NET_j) [1 − sf(NET_j)] Σ_k δ_k w_jk    (44.12)
5. Repeat this process for all input patterns. One iteration or epoch is defined as one weight correction for all examples of the training set.
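The complete procedure can be sketched in a few lines of Python. The code below is our own illustration (not the authors' implementation); it uses sigmoid output units, updates the weights after each example, and is applied here to the XOR problem of Section 44.4.3.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_mlf(X, Y, h, eta=1.0, epochs=5000, seed=0):
    """Back-propagation training of an MLF network with one hidden layer."""
    rng = np.random.default_rng(seed)
    p, q = X.shape[1], Y.shape[1]
    W_h = rng.uniform(-0.3, 0.3, (h, p + 1))    # step 1: small random weights
    W_o = rng.uniform(-0.3, 0.3, (q, h + 1))
    for _ in range(epochs):
        for x, d in zip(X, Y):                  # one weight correction per example
            x1 = np.append(x, 1.0)
            o_h = sigmoid(W_h @ x1)
            o_h1 = np.append(o_h, 1.0)
            o = sigmoid(W_o @ o_h1)             # step 2: actual output
            delta_o = (d - o) * o * (1 - o)                         # eq. (44.11)
            delta_h = o_h * (1 - o_h) * (W_o[:, :-1].T @ delta_o)   # eq. (44.12)
            W_o += eta * np.outer(delta_o, o_h1)                    # eq. (44.10)
            W_h += eta * np.outer(delta_h, x1)
    return W_h, W_o

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
W_h, W_o = train_mlf(X, Y, h=2)
for x in X:
    o = sigmoid(W_o @ np.append(sigmoid(W_h @ np.append(x, 1.0)), 1.0))
    print(x, np.round(o, 2))   # should approach 0, 1, 1, 0; a local minimum is possible (Section 44.5.7.2)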
44.5.6 Learning rate and momentum term

The learning rate, η, in eq. (44.10) is important for the performance of the training procedure. If it is small, the convergence of the weights to an optimum may be slow and there is a danger of getting stuck at a local optimum. If the learning rate is high, the system may oscillate. An appropriate value of the learning rate depends on the transfer function of the output units. For a sigmoidal transfer function, η values between 0.5 and 1 are used. For a linear transfer function much smaller values (0.001 < η < 0.1) are applied. To limit the danger of oscillation, eq. (44.10) can be adapted as follows:

Δw_hj(n) = η δ_j o_h + α Δw_hj(n − 1)    (44.13)

In eq. (44.13) the term Δw_hj(n) is the current weight change and Δw_hj(n − 1) is the weight change from the previous learning cycle. α is called the momentum. It represents the proportion of the weight change of the previous learning cycle that is taken into account to determine the step size of the weight in the present learning cycle. Its value is usually set between 0.6 and 0.8. The influence of the momentum term is less important than that of the learning rate. The optimal values of α and η depend on the problem under study and can be found by means of an optimization procedure, although this is usually not done and only some values of η and α are tried.
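In code, the momentum term of eq. (44.13) only requires remembering the previous weight change. A minimal sketch (our own, with hypothetical names) is given below; in the training loop of the previous example it would replace the direct weight updates.

import numpy as np

def momentum_update(W, current_step, previous_change, alpha=0.7):
    """Weight update with a momentum term (eq. (44.13)): the applied change is the
    current step plus a fraction alpha of the change from the previous cycle."""
    change = current_step + alpha * previous_change
    return W + change, change

# toy usage: repeated identical gradient steps accumulate speed through the momentum
W = np.zeros((1, 3))
prev = np.zeros_like(W)
for _ in range(3):
    W, prev = momentum_update(W, current_step=0.1 * np.ones_like(W), previous_change=prev)
print(W)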
44.5.7 Training and testing an MLF network

During a supervised training session the weights of the network are optimized using a training set. Since the weight adaptations are done by means of the training set, it should contain a sufficient number of relevant examples for the relationship to be modelled. The more hidden units, the more examples are necessary to train the network. The number of weights to be optimized indeed increases drastically with the number of units in the network: number of weights = ph + qh = h(p + q), where p is the number of input units, h the number of hidden units and q the number of output units. As described in Section 44.5.5, the weights are adapted along the gradient that minimizes the error in the training set, using the back-propagation strategy. One iteration is not sufficient to reach the minimum in the error surface. Care must be taken that the sequence of input patterns is randomized at each iteration, otherwise bias can be introduced. Several (50 to 5000) iterations are typically required to reach the minimum.

44.5.7.1 Network performance

During the training session the performance of the network must be monitored. Different performance criteria are possible, but usually the normalized standard error, NSE, is used:
NSE = (1/(nq)) Σ_i Σ_j (d_ij − a_ij)²    (44.14)
q is the number of output units, n is the number of examples in the training set, d is the desired output value and a is the actual output value. The NSE is calculated after each iteration of the training session. The NSE is not always the best performance criterion. One badly predicted pattern (e.g. an outlier) may strongly influence this performance criterion. It also gives no indication of which or how many patterns are well or badly predicted: errors close to 0 for almost all training examples combined with a few badly predicted examples will yield the same NSE as moderate errors for all examples. For classification purposes the NSE is not the ideal criterion, since it is more important that the output values are above or below a certain threshold (e.g. 0.9 or 0.1) than that they are exactly 0 or 1. The NSE does, however, give a good general idea of the performance of the network and is therefore commonly used during the training session. In Fig. 44.15a a performance curve is shown: the NSE decreases monotonically with the number of iterations.
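As a small illustration (our own code, using the normalization reconstructed in eq. (44.14)), the NSE can be computed as follows:

import numpy as np

def nse(desired, actual):
    """Normalized standard error (eq. (44.14)): mean squared difference between
    desired and actual outputs over all n examples and q output units."""
    desired, actual = np.asarray(desired, float), np.asarray(actual, float)
    n, q = desired.shape
    return np.sum((desired - actual) ** 2) / (n * q)

print(nse([[1, 0], [0, 1]], [[0.9, 0.1], [0.2, 0.8]]))   # 0.025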
Fig. 44.15. (a) Performance curve of a MLF network for a training set. (b) Performance curve of a MLF network with too high a learning rate.
Training with too high a learning rate, η, yields the performance behaviour of Fig. 44.15b. When the network must be used to predict future unknown examples, it is, however, not good practice to judge the performance of the network based on the training set only. Together with the NSE of the training set, the NSE of an independent set, the monitoring set, must always be calculated. The NSE value for the monitoring set is usually somewhat larger than the NSE for the training set. In Fig. 44.16a an ideal performance behaviour of a network is shown. A typical performance behaviour is shown in Fig. 44.16b. The increase of the NSE for the monitoring set is a phenomenon that is called overtraining. This phenomenon can be compared to fitting a curve with a polynomial of too high an order, or with a PCR or PLS model with too many latent variables. It is caused by the fact that after a certain number of iterations the noise present in the training set is modelled by the network. The network then acts as a memory, able to recall the training examples exactly, but losing its ability to predict.
Fig. 44.16. (a) Ideal performance curve of an MLF network. The solid line represents the training set; the dashed line represents the monitoring set. (b) Overtraining of the network begins at point A. (c) Paralysed network. (d) Performance curve of an MLF network for which the training and control samples represent different models.
This is harmful for the predictive performance of the network. When the NSE of the monitoring set is acceptable, it suffices to stop the training at point A. Overtraining occurs sooner when the number of hidden units is too large for the problem complexity or when the number of training examples is too small to train all the weights adequately. The remedy then is to use a network with a smaller number of units. In Fig. 44.16c the performance behaviour of a paralysed network is shown. The normal decrease of the NSE stops too soon and the NSE remains too high to be acceptable. The paralysation of the network occurs when the weighted input value, NET_j, is situated in response region 1 or 5 of the transfer function for too many units (Fig. 44.11).
As long as NET_j is smaller than A or larger than D, the response does not change when NET_j (i.e. the weights) is changed. A remedy is to increase the learning rate, η, or to explicitly decrease the slope, β, of the transfer function, so that the input values fall again inside the dynamic range. Retraining the network with a different weight initialization is another option. This performance behaviour is also observed when too few hidden units are used. The network is then not able to model the relationship properly; increasing the number of hidden units is the remedy in this case. The performance behaviour shown in Fig. 44.16d is caused by the fact that the monitoring set and the training set represent different relationships, or by the presence of outliers in the monitoring set that are not present in the training set. When not enough examples are available to make an independent monitoring set, the cross-validation procedure can be applied (see Chapter 10). The data set is split into C different parts and each part is used once as monitoring set. The network is trained and tested C times. The results of the C test sessions give an indication of the performance of the network. It is strongly advised to validate the network that has been trained by the above procedure with a second, independent test set (see Section 44.5.10).

44.5.7.2 Local minima

The back-propagation strategy is a steepest gradient method, a local optimization technique. Therefore, it also suffers from the major drawback of these methods, namely that it can become locked in a local optimum. Many variants have been developed to overcome this drawback [20-24]. None of these, however, really solves the problem. A way to obtain an idea of the robustness of the obtained solution is to retrain the network with a different weight initialization. The results of the different training sessions can be used to define a range around the performance curve, as shown in Fig. 44.17. This procedure can also be used to compare different networks [20].

44.5.8 Determining the number of hidden units

From the previous sections it is clear that it is important to use a network with a suitable number of hidden units. When too few hidden units are used, the relationship cannot be modelled properly and the network shows poor performance. Too large a number of hidden units causes severe overtraining. The suitable number of hidden units depends on the problem complexity and on the number of training examples that are available. It must be determined empirically. There are basically three approaches for this:
Fig. 44.17. Determining the number of hidden units for an MLF network. The whiskers represent the range of the error with different random weight initializations or by cross-validation.
1. Train and test a network with a certain number of hidden units, based on an educated guess. When the NSE of the network is acceptable and no severe overtraining occurs, the network is suitable.
2. A second approach is to train different networks with different numbers of hidden units. Preferably, each network is trained several times with a different weight initialization. From a plot as in Fig. 44.17 it is then straightforward to select a suitable number of hidden units. This approach is certainly the best, but it involves the training and testing of many networks and is thus a time-consuming procedure.
3. Another approach that has been used is the so-called pruning procedure [26,27]. One starts with a network with a large number of hidden units. During the training phase the weight changes of all units are monitored. Those units or connections whose weights remain low are removed and the training is continued. This procedure has not been very successful for MLF networks and is not often applied.
44.5.9 Data preprocessing

44.5.9.1 Scaling

So far, we have not considered the nature of the data in the X or Y matrix. However, as with all other data handling techniques, preprocessing of the data may be desirable in some cases. The reasons for scaling are essentially the same as described in earlier chapters. Methods that are often used for continuous values are autoscaling and range scaling. The data can be scaled by columns or by rows. Scaling is often applied if the variables have different ranges of values; variables with large values can dominate the model. Another reason for scaling in neural networks is the fact that large input values for a certain unit, especially together with large weights, may cause the NET input of that unit to fall in the tail of the sigmoid transfer function. In this region the derivative is small and, as explained in Section 44.5.5, weight corrections according to the back-propagation rule are proportional to the derivative of the transfer function (see eqs. (44.10) and (44.11)). This may then lead to paralysation of the network. Scaling of the input brings the net input within the appropriate range of the sigmoid transfer function.

44.5.9.2 Variable selection and reduction

The number of variables in the X and Y matrix determines the number of input and output units, respectively. The total number of units determines the total number of weights that have to be optimized during the training session. When there is a large number of input variables, variable selection and/or reduction may be necessary. This is for example the case when the input consists of a spectrum of, e.g., 1000 wavelengths. Preprocessing the data with PCA seems attractive; it is, however, not always the appropriate approach. Neural networks, unlike some other techniques, do not suffer from correlations in the data. In this way the network can extract relevant information from these correlations, which is not possible with other techniques. Other methods have been proposed to perform the variable selection (e.g. genetic algorithms and simulated annealing, see Chapter 27).

44.5.10 Validation of MLF networks

Validation of neural networks is usually based on an independent test set. Note that this test set should be different from the monitoring set as described in Section 44.5.7. If sufficient samples are not available, cross-validation is applied [28]. The presence of local minima causes an additional problem. Retraining the network with different weight initializations is an important diagnostic tool for this purpose. It must be noted, however, that retraining the network with different initial weights yields different final weight settings, thus different networks. The next problem is to select the best network among those. One may decide to use the network that yields the lowest NSE for the test set.
As mentioned before, however, the NSE does not cover all performance aspects. It is good practice to consider also other performance criteria at this stage of the development. The distribution of the errors is certainly important. Another important performance criterion is the robustness to noise on the input values. Derks et al. [29] proposed an empirical procedure to test this robustness. It is based on the gradually increasing addition of noise to the input data. Networks whose performance is less sensitive to this noise injection are to be preferred over sensitive networks. Gemperline studied the robustness of MLF networks in comparison with PLS [30]. The more performance criteria used to validate the model, the higher the probability that a good impression is obtained of the network's predictive value for samples that lie within the range of the training set. Since no theoretical knowledge is available about the model, it is dangerous to extrapolate with MLF networks. Before using the network for prediction it should be checked whether the input falls within the training set range.

44.5.11 Aspects of use

MLF networks are very powerful but should be applied with caution. They are powerful since practical experience has shown that they are able to model difficult non-linear relationships in an easy way, much better and faster than alternative techniques. There are, however, many pitfalls in the use of MLF networks.
- Since they are used for complex relationships, about which only little prior knowledge is available, it is very hard to be sure that the training, monitoring and test examples are representative for the data.
- Usually many local minima are present in the error surface. It is impossible to guarantee that the overall optimum is reached.
- The validation of the obtained result is another problem. It is not yet possible to obtain a confidence interval around the obtained output. Research on this topic is going on [21,28].
- It is not possible to extract theoretical information on the model in a direct way. The network is to be considered as an empirical model builder, such as a spline or polynomial fitting procedure.

44.5.12 Chemical applications

Especially in the last few years, the number of applications of neural networks has grown exponentially. One reason for this is undoubtedly the fact that in many applications neural networks outperform the traditional (linear) techniques. The large number of samples that is needed to train neural networks certainly remains a serious bottleneck. The validation of the results is a further issue for concern.
TABLE 44.3
Examples of chemical applications

- Pattern recognition [31-38]
- Spectrum interpretation [39-45]
- Quality control / process control [46-49]
- Structure elucidation [50,51]
- Non-linear calibration and modelling [52-58]
- Quantitative structure-activity relationships [11,59]
- Signal processing [60,61]
Even more than in traditional techniques, care must be taken not to extrapolate. When the training data are not evenly distributed, even interpolation can cause problems [20]. The applications concern classification as well as quantification problems. In Table 44.3 some examples are given in both problem domains.

44.6 Radial basis function networks

44.6.1 Structure

Radial basis function (RBF) networks are a variant of three-layer feed-forward networks (see Fig. 44.18). They contain a pass-through input layer, a hidden layer and an output layer. A different approach for modelling the data is used. The transfer function in the hidden layer of RBF networks is called the kernel or basis function. For a detailed description the reader is referred to Refs. [62,63]. Each node in the hidden layer thus contains such a kernel function. The main difference between the transfer function in MLF and the kernel function in RBF is that the latter (usually a Gaussian function) defines an ellipsoid in the input space. Whereas the MLF network basically divides the input space into regions via hyperplanes (see e.g. Figs. 44.12c and d), RBF networks divide the input space into hyperspheres by means of the kernel function with specified widths and centres. This can be compared with the density or potential methods in pattern recognition (see Section 33.2.5). The output of a hidden unit, in the case of a Gaussian kernel function, is defined as:

output_j = o_j(x) = exp(−||x − c_j||² / b_j²)    (44.15)

||x − c_j|| is the Euclidean distance (other distance measures are also possible) between the input vector, x, and c_j, the centroid of the Gaussian kernel function.
Fig. 44.18. An example of an RBF network.
The parameter b represents the width of the Gaussian function. The centroids, c_j, and the widths, b_j, of all the hidden units together define the so-called activation space, where the Gaussian function has a value larger than a given threshold value. Nodes containing a kernel with a centroid close (in comparison with the width, b) to the input pattern will be predominantly activated, in contrast to distant objects, which cause low output values of the hidden units. This activation space can thus be regularized by the width factor and the centroid position, providing a means to adapt the local behaviour of the network. The output of these hidden nodes, o_i, is then forwarded to all output nodes through weighted connections. The output y_j of these nodes consists of a linear combination of the kernel functions:

y_j = Σ_i w_ji o_i(x)
where w_ji represents the weight of the connection between hidden unit i and output unit j.
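A forward pass through such an RBF network takes only a few lines of Python (our own sketch, not from the original text; the centres, widths and weights below are arbitrary illustrative values):

import numpy as np

def rbf_output(x, centres, widths, W):
    """RBF forward pass: Gaussian kernel activations of the hidden units,
    followed by a linear combination in the output layer."""
    d2 = np.sum((centres - x) ** 2, axis=1)   # squared Euclidean distances to the centroids c_j
    o = np.exp(-d2 / widths ** 2)             # Gaussian kernel outputs with widths b_j
    return W @ o                              # y_j = sum_i w_ji o_i(x)

centres = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])   # one centroid per hidden unit
widths = np.array([0.5, 0.5, 0.8])
W = np.array([[1.0, -0.5, 0.3]])                           # one output unit
print(rbf_output(np.array([0.2, 0.8]), centres, widths, W))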
44.6.2 Training

The centroid positions, c_j, of the kernel functions and their widths, b_j, determine the output of the hidden units. The centroid positions are fixed at the initialization of the network. The widths and the weights are obtained by supervised training. Initializing and training these positions and widths is a critical issue in constructing RBF networks. The supervised network training is generally performed by error back-propagation by means of the generalized delta learning rule (Section 44.5.5). The Gaussian kernels are continuously differentiable functions and are therefore suited for this algorithm.
Approaches for the initialization and training of the positions of the centres, the widths and the weights are summarized below.
1. For the distribution of the positions of the centres the following methods or combinations of methods are used [29]:
- random distribution within the range of interest;
- random selection of input patterns;
- maximal coverage of the range of interest, e.g. with the Kennard and Stone algorithm (see Chapter 24);
- selection of representative input patterns, defined by means of the output of a Kohonen network applied to the input patterns (see Section 44.7);
- selection of representative input patterns, defined by means of clustering methods (e.g. k-means [65] (see Section 30.3.2), genetic algorithms or simulated annealing [64] (see Chapter 27)).
2. The widths, b, of the kernel functions are obtained by:
- the supervised training procedure (error back-propagation);
- fixing them at the initialization phase, based on expertise and prior knowledge of the data structure.
3. The weights, w_ji, are trained with the usual error back-propagation strategy. It is generally advised to use an abundant number of kernels on random positions, which are then successively pruned during training.

44.6.3 An example

In this section the local modelling capability of RBF networks is demonstrated on the logical operators AND and OR. The examples used are demonstrative for the structure and physical meaning of the weights of the RBF network. Two datasets of four training objects each have been created according to the logical operators AND and OR. In Fig. 44.19 the datasets and the network topology are graphically depicted. The AND-operator yields a positive output on inputs of identical sign, whereas the OR-operator responds positively on any positive input. The centroids represent the positions of the Gauss kernels and are in this example positioned in the same place as the objects in the input space. The width factors do not change during training. The weights of each kernel function are obtained by training. In Fig. 44.20a a grid of data within the range [−1...1] has been propagated through the OR network model to depict the activation space. It can be seen that the OR-function has been trained properly, i.e. the output is high or low at the correct positions. Where no input data were available the network yields zero as output value. In Fig. 44.20b the same grid of data has been propagated through an OR network model obtained with a broader width parameter. It can be seen that the network still yields high and low values at the positions of the input patterns, but it interpolates in a smoother way.
Fig. 44.19. The logical operators AND and OR and their RBF implementation.
It follows from the introduction of this section that the concept of RBF networks is based on local approximation of the data by means of kernel functions, whereas MLF networks try to model the data by constructing non-linear hyperplanes in the input space. In Fig. 44.21 the predictions of an AND-model obtained by MLF and by RBF are shown. It can clearly be seen that the output of both networks is the same in the neighbourhood of the input objects. However, the output differs for the positions in between the input objects: the RBF network yields zero output for the interpolated grid positions, while the output of the MLF network is different from zero. The results can be influenced by the width parameters of the kernels. Which of the two is better should be evaluated by means of independent test sets.

44.6.4 Applications

In chemical practice, problems are far more complex than the example problems described in the previous section. To be able to apply RBF networks properly, it is useful to have a considerable amount of prior knowledge about the data, especially for the distribution of the centroid positions and widths [29].
Fig. 44.20. (a) The trained logical OR. The contour lines in the x1,x2-plane around the centres denote the 20, 40, 60 and 80 percent confidence limits of the Gaussian kernels. (b) The trained logical OR with increased width factors.
Fig. 44.21. Grid (40×40) prediction of the logical AND for (a) the MLF model and (b) the RBF model.
Although this might be considered a disadvantage, at least the weights have a physical or chemical meaning, allowing a better understanding of the model. RBF networks have been applied in process control [66,67]. A combination of RBF and PLS as a flexible non-linear regression technique has also been proposed [68].
44.7 Kohonen networks

44.7.1 Structure

Kohonen networks belong to the class of self-organizing maps. In contrast to MLF and RBF networks, they are designed for unsupervised pattern recognition tasks. The Kohonen network consists of one layer of neurons, ordered in a low-dimensional map such as a one-dimensional array or a two-dimensional matrix (see Fig. 44.22). Each neuron or unit contains a weight vector of the same dimension as the input patterns. To train the network, the unsupervised Kohonen learning rule is applied (see Section 44.7.2). After training, the individual weight vectors are oriented in such a way that the structure of the input space (the topology) is represented as well as possible in the resulting map (see further). Each object or input pattern is assigned to (or mapped on) the neuron with the most similar weight vector. The goal of the Kohonen network is to map similar objects on the same or neighbouring neurons.
Fig. 44.22. Three commonly used Kohonen network structures: (a) one-dimensional array; (b) two-dimensional rectangular network (each unit, apart from the borderline units, has 8 neighbours) and (c) two-dimensional hexagonal network (each unit, apart from the borderline units, has 6 neighbours). (Reprinted with permission from Ref. [70]).
The interpretation of the Kohonen map is described in Section 44.7.3. The Kohonen mapping is primarily used for classification purposes.

44.7.2 Training

The training process of a Kohonen network consists of a competitive learning procedure and can be summarized as follows:
- Initialize the weight vectors of all units with random values.
- Calculate a specified similarity measure, D, between each weight vector and a randomly chosen input pattern, x.
- The unit in the Kohonen map whose weight vector is most similar to the input vector is declared the winning unit and is activated (i.e. its output is set to 1). The output of a Kohonen unit is typically 0 (not activated) or 1 (activated).
- The weights of the winning unit are adapted according to:

w_j(t+1) = w_j(t) + η (x − w_j(t))

where η is the learning rate and w_j(t) is the weight vector.
- The weights of the units in the close vicinity of the winning unit are also adapted, according to:

w_j(t+1) = w_j(t) + η N(t,r) (x − w_j(t))

N(t,r) is a predefined neighbourhood function in which r represents the distance in the map between the considered unit and the winning unit. Different neighbourhood functions can be used. The principle is always that units closer to the winning unit are adapted most. Some common functions are shown in Fig. 44.23. Due to this aspect of the learning procedure, the connectivity (i.e. the number of neighbours of each unit) of the network has an influence on its performance. Networks with a different number of neighbours for the units are shown in Fig. 44.22.

The above steps are repeated for all input patterns to complete one iteration or cycle. A training procedure consists of several such cycles. During the training process the neighbourhood definition is usually narrowed down, allowing convergence of the network (see Fig. 44.24). As a result of the training procedure, the weight vector of the winning unit and its neighbours move gradually towards the applied input vector, as shown in Fig. 44.25. In this way, similar input vectors will map in the same region of the Kohonen map. Due to the unsupervised nature of the Kohonen algorithm, it cannot be checked which units are associated with specific clusters in the input patterns. However, inspection of all weight vectors provides information concerning topological aspects of the input pattern space, i.e. the position of the input patterns relative to each other.
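The competitive learning procedure described above can be sketched as follows (our own illustrative Python code; the map size, the Gaussian neighbourhood function and the schedules for the learning rate and the neighbourhood width are assumptions, not prescriptions from the text):

import numpy as np

def train_kohonen(X, map_shape=(10, 10), epochs=50, eta0=0.5, sigma0=3.0, seed=0):
    """Kohonen training: for each input the winning unit is found and the weights
    of the winner and its neighbours are moved towards the input; the learning
    rate and the neighbourhood width are narrowed down during training."""
    rng = np.random.default_rng(seed)
    rows, cols = map_shape
    W = rng.uniform(X.min(), X.max(), size=(rows, cols, X.shape[1]))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(epochs):
        eta = eta0 * (1.0 - t / epochs)              # decreasing learning rate
        sigma = sigma0 * (1.0 - t / epochs) + 0.5    # shrinking neighbourhood
        for x in rng.permutation(X):
            d = np.sum((W - x) ** 2, axis=-1)        # similarity D: squared Euclidean distance
            winner = np.unravel_index(np.argmin(d), d.shape)
            r2 = np.sum((grid - np.array(winner)) ** 2, axis=-1)   # distance r in the map
            N = np.exp(-r2 / (2.0 * sigma ** 2))                   # Gaussian neighbourhood N(t,r)
            W += eta * N[..., None] * (x - W)        # update winner and neighbours
    return W

# example: 200 two-dimensional objects from two clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (100, 2)), rng.normal(2.0, 0.3, (100, 2))])
print(train_kohonen(X).shape)   # (10, 10, 2): one weight vector per map unit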
Fig. 44.23. Some common neighbourhood functions used in Kohonen networks: (a) a block function, (b) a triangular function, (c) a Gaussian-bell function and (d) a Mexican-hat shaped function. In each of the diagrams the winning unit is situated at the centre of the abscissa. The horizontal axis represents the distance, r, to the winning unit. The vertical axis represents the value of the neighbourhood function. (Reprinted with permission from Ref. [70]).
Fig. 44.24. Example of the gradual narrowing of the neighbourhood function during the training process. Light grey: definition of the neighbourhood at stage 1; middle grey: at stage 2; dark grey: at stage 3. O is a unit and • is the winning unit. (Adapted from Ref. [70]).
Fig. 44.25. Example of the weight update of the winning unit in a Kohonen network. (Reprinted with permission from Ref. [70]).
44.7.3 Interpretation of the Kohonen map

After the training procedure, the weight vectors of the units are fixed and the map is ready to be interpreted. There are different possibilities to interpret the weight combination, depending on the purpose of the network use. In this section some possibilities are described.

The output-activity map. A trained Kohonen network yields for a given input object, x_i, one winning unit, whose weight vector, w_j, is closest (as defined by the criterion used in the learning procedure) to x_i. However, x_i may be close to the weight vectors, w_j, of other units as well. The output y_j of the units of the map can therefore also be defined as:

y_j = D(w_j, x_i)

where D is the similarity measure as used in the training procedure. This results in a map as in Fig. 44.26a. This map allows the inspection of regions (neighbouring neurons) whose weight vectors are similar to a given input x_i. Note that each input x_i yields a different output-activity map.

The counting map (Fig. 44.26b). This map is obtained by counting, for each unit of the Kohonen network, the number of training objects for which the unit is the winning one. This map provides insight into the number of clusters that are present in the dataset. When, for example, all input patterns are assigned to two distinct regions (sets of neighbouring neurons) in the map, it can be concluded that there are two clusters in the dataset.
Fig. 44.26. Some possibilities for analysing a two-dimensional Kohonen map (Reprinted with permission from Ref. [70]). (a) Grey-encoded output-activity map for a given training example. Dark areas in the map indicate a high similarity between the weight vector of the unit and the input object. (b) A counting map: dark areas indicate a large number of training examples that map on the unit. Units on which no training examples map are indicated white. (c) Feature map indicating units on which training examples map with (+) or without (-) a certain feature. (X) indicates that different labels are assigned to the unit. (d) Feature map with class identifications A and B as labels. (X) indicates multiple class labels.
The feature map. This map can be obtained when labels can be assigned to the training objects. Labels may consist of a known classification, or the presence versus absence of certain features. For each training object, x_i, the winning unit is determined and the label of x_i is assigned to this unit. When this is completed for all training objects, each unit in the map carries zero, one or more labels (see Figs. 44.26c and 44.26d); a sketch of this labelling procedure is given at the end of Section 44.7.4. When one unit carries more than one label, an overlap between the classes or features is present. In this respect the approach can be compared with fuzzy clustering techniques (see Chapter 30). A detailed description can be found in references [69,70].

44.7.4 Applications

Due to the Kohonen learning algorithm, the individual weight vectors in the Kohonen map are arranged and oriented in such a way that the structure of the input space, i.e. its topology, is preserved as well as possible in the resulting
low-dimensional map. Therefore, the Kohonen map is said to be a topology-preserving mapping technique. It is primarily used for the examination of data for which little or no prior knowledge is available. The choice of the size and dimensionality of the network depends on the problem. When too few units are used for classification purposes, different classes will fall together on the same unit(s); with too many units the data will not cluster, because all training objects will tend to map onto different units. A rule of thumb is to take twice the number of expected classes. The Kohonen map has been applied, for example, to map IR spectra [71], to map surface molecular properties of organic molecules [72] and for classification [73] (e.g. of SIMS images [74] or plant seeds [75]). Another interesting application is in the design of RBF networks (Section 44.6). The technique can best be compared with non-linear mapping or multi-dimensional scaling (see Chapter 38). Reference [76] reviews the use of Kohonen neural networks in analytical chemistry.
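As announced in Section 44.7.3, the feature map can be sketched as follows. The code reuses the weight matrix W and training set X of the earlier sketches and assumes a list of class labels for the training objects; all of these are illustrative assumptions.

    import numpy as np

    def feature_map(W, X, labels):
        # assign to each unit the set of labels of the training objects it wins;
        # a unit with more than one label indicates overlap between classes
        unit_labels = [set() for _ in range(len(W))]
        for x, lab in zip(X, labels):
            winner = np.argmin(np.linalg.norm(W - x, axis=1))
            unit_labels[winner].add(lab)
        return unit_labels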
44.8 Adaptive resonance theory networks

44.8.1 Introduction

Adaptive Resonance Theory (ART) networks are also designed for unsupervised pattern recognition. There exists a variant (ARTMAP) that is meant for supervised pattern recognition, but we consider here the more common unsupervised variant. For a detailed description see Ref. [14].

Grossberg designed the adaptive resonance theory to overcome a general drawback of classifiers (e.g. MLF networks), namely the fact that once the training procedure (supervised or unsupervised) is completed the weights are fixed. The network or classifier is designed for a certain well-defined classification task and is trained by means of a set of representative training examples. This is an acceptable procedure only when the classification problem is well defined and only as long as it remains stable. Unfortunately this is often not the case in real situations. It may happen that a new class of objects appears that was not represented in the training set; the only remedy in that case is to retrain the network, including training objects of the new class. It is also possible that existing classes change in time, e.g. by drifting away. Here too, the proper action is to retrain the network with new representative training objects. This is what Grossberg called the stability-plasticity dilemma: how can a system be adaptive to relevant new input and at the same time remain stable to irrelevant input changes? In response to this, he and others developed the adaptive resonance theory.

The essence of ART is pattern matching. When a familiar pattern is presented to the network (i.e. one that satisfies the limiting conditions of the previous examples) the network recognizes the input and it also incorporates
the new information of the new input by adapting the weights. When, on the other hand, a novel input is presented, i.e. one not satisfying the limiting conditions of the previous examples, the structure is adapted and the novel input is identified as the first representative of a new class. In this way ART is stable enough to preserve past learning but remains adaptable enough to incorporate new information when it appears.

44.8.2 Structure

ART networks consist of units that contain a weight vector of the same dimension as the input patterns. Each unit is meant to represent one class or cluster in the input patterns. The structure of the ART network is such that the number of units is larger than the expected number of classes. The units in excess are dummy units that can be taken into use when a new input pattern shows up that does not belong to any of the already learned classes. There exist many different types of ART. The variant ART1 is the original Grossberg algorithm; it allows only binary input vectors. ART2 also allows continuous input, and it is this basic variant that we describe here. The structure of ART networks is hard to visualize; it is in fact a theory that is better explained by means of the sequence of training steps that follows.

44.8.3 Training

Training of an ART network can be summarized as follows (a minimal sketch of this training loop is given at the end of Section 44.8.4):
- Initialize the weights of all units (collected in a (p x c) matrix W) with a fixed value. The parameter p is the length of the weight vector and c is the total number of units. Usually the fixed value 1/√p is used for the initial weights, such that the length of each weight vector is scaled to unity.
- The input patterns are scaled to unit length.
- The first input vector is copied into the weight vector of the first unit, which now becomes an active unit.
- For the next input vector, x_i, the similarity, ρ_k, with the weight vector, w_k, of each active unit k is calculated:

  ρ_k = x_i^T w_k

  Because x_i as well as w_k are normalized, ρ_k represents the cosine or correlation coefficient between the two vectors. In a variant of ART, Fuzzy ART, a fuzzy similarity measure is used instead of the cosine similarity measure [14].
- The active unit with the highest similarity is declared the winning unit.
- The similarity between x_i and the winning unit is compared with a threshold value, ρ*, in the range from zero to one. When ρ_k < ρ*, the input pattern x_i is not considered to fall into the existing class: a so-called novelty is detected and the input vector is copied into one of the unused dummy units. Otherwise the input pattern x_i is considered to fall into the existing class (to resonate with it). A large ρ* will result in many novelties, and thus in many small clusters; a small ρ* results in few novelties, and thus in a few large clusters.
- When the resonance step succeeds, the weight vector of the winning unit is changed. It adapts itself a little towards the new input pattern x_i, which belongs to the same class, according to:

  w(t+1) = η x_i + (1 - η) w(t)
  w(t+1) = w(t+1) / ||w(t+1)||

  Usually the learning rate, η, is chosen between 0 and 1. In this step the network incorporates the new information present in the input object by moving the centroid of the class a little towards the new input pattern x_i. This step is intended to keep the network flexible when clusters change in time.

44.8.4 Application

From the previous discussion it is clear that ART networks are more applicable for pattern recognition than for quantitative applications. Fig. 44.27 shows how an ART network functions with different values of ρ*.
Fig. 44.27. The influence of different values of ρ* on the classification performance of the ART network: (a) a large ρ* value; (b) a small ρ* value.
ART networks have been applied to classify process control data [77], for the classification of UV/VIS/NIR spectra and of airborne particles [78], and for QSAR [79]. There is, however, not much expertise available yet on the performance of these networks in real situations. The general experience is that the networks are very sensitive to noise and that it is difficult to select a proper threshold value for the similarity measure. Moreover, experience shows that within one network different threshold values are in fact necessary for different classes. Strategies that also use an adaptive threshold (e.g. based on the different class properties) may be more successful.
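As announced in Section 44.8.3, the following minimal sketch implements the ART-type training loop described there: cosine matching of a unit-length input against the active units, novelty detection against the vigilance threshold ρ*, and the resonance update with renormalization. The values of rho_star and eta and the way the prototypes are stored are illustrative assumptions.

    import numpy as np

    def art_train(X, rho_star=0.9, eta=0.2):
        # Simplified ART-type clustering of the rows of X.
        X = X / np.linalg.norm(X, axis=1, keepdims=True)   # scale input patterns to unit length
        prototypes = [X[0].copy()]                         # first input becomes the first active unit
        for x in X[1:]:
            rho = np.array([x @ w for w in prototypes])    # cosine similarity with each active unit
            k = int(np.argmax(rho))                        # winning unit
            if rho[k] < rho_star:
                prototypes.append(x.copy())                # novelty: take a dummy unit into use
            else:
                w = eta * x + (1.0 - eta) * prototypes[k]  # resonance: move the class centroid towards x
                prototypes[k] = w / np.linalg.norm(w)      # rescale the weight vector to unit length
        return prototypes

A larger rho_star makes the network recruit new units more readily and thus produces many small clusters, in line with the behaviour sketched in Fig. 44.27.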
References

1. W.S. McCulloch and W. Pitts, A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys., 5 (1943) 115-133.
2. D.O. Hebb, The Organization of Behavior. Wiley, New York, 1949.
3. F. Rosenblatt, The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev., 65 (1958) 386-408.
4. B. Widrow and M.E. Hoff, Adaptive switching circuits. In: IRE WESCON Convention Record, New York, 1960, pp. 96-104.
5. N. Nilsson, Learning Machines. McGraw-Hill, New York, 1965.
6. M.L. Minsky and S.A. Papert, Perceptrons. MIT Press, Cambridge, MA, 1969.
7. D.E. Rumelhart, G.E. Hinton and R.J. Williams, Learning internal representations by error propagation. In: Parallel Distributed Processing, Explorations in the Microstructure of Cognition, Vol. 1: Foundations, D.E. Rumelhart and J.L. McClelland (eds.), MIT Press, Cambridge, MA, 1986, pp. 318-362.
8. P. Werbos, Beyond regression: new tools for prediction and analysis in the behavioral sciences. PhD Thesis, Harvard, Cambridge, MA, 1974.
9. B. Wythoff, Backpropagation neural networks. Chemom. Intell. Lab. Syst., 18 (1993) 115-155.
10. J. Zupan and J. Gasteiger, Neural Networks for Chemists. An Introduction. VCH Verlagsgesellschaft, Weinheim, 1993.
11. J. Devillers (ed.), Neural Networks in QSAR and Drug Design. Academic Press, London, 1996.
12. R. Beale and T. Jackson, Neural Computing: An Introduction. Institute of Physics Publishing, Bristol, 1992.
13. J.L. McClelland and D.E. Rumelhart, Parallel Distributed Processing, Vol. 1. MIT Press, London, 1988.
14. J.A. Freeman and D.M. Skapura, Neural Networks, Algorithms, Applications and Programming Techniques. Addison-Wesley, Reading, MA, 1991.
15. R. Hecht-Nielsen, Neurocomputing. Addison-Wesley, Reading, MA, 1991.
16. P.D. Wasserman, Neural Computing: Theory and Practice. Van Nostrand Reinhold, New York, 1989.
17. J. Smits, W.J. Melssen, L.M.C. Buydens and G. Kateman, Using artificial neural networks for solving chemical problems. Chemom. Intell. Lab. Syst., 22 (1994) 165-189.
18. D. Svozil, Introduction to multi-layer feed-forward neural networks. Chemom. Intell. Lab. Syst., 39 (1997) 43-62.
19. W.J. Melssen and L.M.C. Buydens, Aspects of multi-layer feed-forward neural networks influencing the quality of the fit of univariate non-linear relationships. Anal. Proc., 32 (1995) 53-56.
20. E.P.P.A. Derks, Aspects of artificial networks and experimental noise. PhD Thesis, University of Nijmegen, The Netherlands, Chapter 2, 1997.
21. R. Tibshirani, A comparison of some error estimates for neural network models. Neural Computation, 8 (1995) 152-163.
22. B. Walczak, Neural networks with robust backpropagation learning algorithm. Anal. Chim. Acta, 322 (1996) 21-30.
23. P. Deveka and L. Achenie, On the use of quasi-Newton based training of a feedforward neural network for time series forecasting. J. Intell. Fuzzy Syst., 3 (1995) 287-294.
24. M. Norgaard, Neural network based system identification toolbox. Technical report, Institute of Automation, Technical University of Denmark, 1995.
25. E.P.P.A. Derks, Aspects of artificial networks and experimental noise. PhD Thesis, University of Nijmegen, The Netherlands, Chapter 3, 1997.
26. G. Castellano, A.M. Fanelli and M. Pelillo, An iterative pruning algorithm for feedforward neural networks. IEEE Trans. Neural Networks, 8 (1997) 519-531.
27. J. Zhang, J.-H. Jiang, P. Liu, Y.-Z. Liang and R.-Q. Yu, Multivariate nonlinear modelling of fluorescence data by neural network with hidden node pruning algorithm. Anal. Chim. Acta, 344 (1997) 29-40.
28. E.P.P.A. Derks, M.L.M. Beckers, W.J. Melssen and L.M.C. Buydens, A parallel cross-validation procedure for artificial neural networks. Computers Chem., 20 (1995) 439-448.
29. E.P.P.A. Derks, M.S. Sanchez Pastor and L.M.C. Buydens, A robustness analysis for MLF and RBF neural network models. Chemom. Intell. Lab. Syst., 34 (1996) 299-301.
30. P.J. Gemperline, Rugged spectroscopic calibration. Chemom. Intell. Lab. Syst., 39 (1997) 29-42.
31. G.J. Salter, M. Lazzari, L. Giansante, R. Goodacre, A. Jones, G. Surrichio, D.B. Kell and B. Bianchi, Determination of the geographical origin of Italian extra-virgin olive oil using pyrolysis-mass spectrometry and neural networks. J. Anal. Appl. Pyrol., 40 (1997) 159-170.
32. J.R.M. Smits, L.W. Breedveld, M.W.J. Derksen, G. Kateman, H.W. Balfoort, J. Snoek and J.W. Hofstraat, Pattern classification with artificial neural networks: classification of algae, based upon flow cytometry data. Anal. Chim. Acta, 258 (1992) 11-25.
33. H. Chan, A. Butler, D.M. Falk and M.S. Freund, Artificial neural network processing of stripping analysis responses for identifying and quantifying heavy metals in the presence of intermetallic compound formation. Anal. Chem., 69 (1997) 2373-2378.
34. C.W. McCarrick, D.T. Ohmer, L.A. Gilliland, P.A. Edwards and H.T. Mayfield, Fuel identification by neural network analysis of the responses of vapour-sensitive sensor arrays. Anal. Chem., 68 (1996) 4264-4269.
35. M. Glick and G.M. Hieftje, Classification of alloys with an artificial neural network and multivariate calibration of glow-discharge emission spectra. Appl. Spectrosc., 45 (1991) 1706-1716.
36. R. Goodacre, D.B. Kell and G. Bianchi, Rapid identification of species using pyrolysis mass spectrometry and artificial neural networks of Propionibacterium acnes isolated from dogs. J. Appl. Bacteriol., 76 (1994) 124-134.
37. W. Werther, H. Lohninger, F. Stancl and K. Varmuza, Classification of mass spectra. A comparison of yes/no classification methods for the recognition of simple structural properties. Chemom. Intell. Lab. Syst., 22 (1994) 63-67.
38. J.M. Andrews and S.H. Lieberman, Neural network approach to qualitative identification of fuels from laser induced fluorescence spectra. Anal. Chim. Acta, 285 (1994) 237-246.
39. B.J. Hare and J.H. Prestegard, Application of neural networks to automated assignment of NMR spectra of proteins. J. Biomolec. NMR, 4 (1994) 35-46.
40. T. Visser, H.J. Luinge and J.H. van der Maas, Recognition of visual characteristics of infrared spectra by artificial neural networks and partial least squares regression. Anal. Chim. Acta, 296 (1994).
41. H.J. Luinge, E.D. Leussink and T. Visser, Trace-level identity confirmation from infrared spectra by library searching and artificial neural networks. Anal. Chim. Acta, 345 (1997) 173-184.
42. C. Affolter and J.T. Clerc, Prediction of infrared spectra from chemical structures of organic compounds using neural networks. Chemom. Intell. Lab. Syst., 21 (1993) 151-157.
43. J.R.M. Smits, P. Schoenmakers, A. Stehmann, F. Sijstermans and G. Kateman, Interpretation of infrared spectra with modular neural-network systems. Chemom. Intell. Lab. Syst., 18 (1993) 27-39.
44. M.E. Munk, M.S. Madison and E.W. Robb, Neural network models for infrared spectrum interpretation. Microchim. Acta, 2 (1991) 505-524.
45. W. Wu and D.L. Massart, Artificial neural networks in classification of NIR spectral data: selection of the input. Chemom. Intell. Lab. Syst., 35 (1996) 127-135.
46. C. Hoskins and D.M. Himmelblau, Process control via artificial neural networks and reinforcement learning. Computers Chem. Eng., 16 (1992) 241-251.
47. C. Hoskins and D.M. Himmelblau, Fault diagnosis in complex chemical plants using artificial neural networks. AIChE J., 37 (1991) 137-141.
48. M. Bhat and T.J. McAvoy, Use of neural nets for dynamic modelling and control of chemical process systems. Computers Chem. Eng., 15 (1990) 573-578.
49. C. Puebla, Industrial process control of chemical reactions using spectroscopic data and neural networks. Chemom. Intell. Lab. Syst., 26 (1994) 27-35.
50. N. Qian and T.J. Sejnowski, Predicting the secondary structure of globular proteins using neural network models. J. Molec. Biol., 202 (1988) 568-584.
51. M. Vieth, A. Kolinski, J. Skolnick and A. Sikorski, Prediction of protein secondary structure by neural networks, encoding short and long range patterns of amino acid packing. Acta Biochim. Pol., 39 (1992) 369-392.
52. R. Wehrens and W.E. van der Linden, Calibration of an array of voltammetric microelectrodes. Anal. Chim. Acta, 334 (1996) 93-100.
53. J.R.M. Smits, W.J. Melssen, G.J. Daalmans and G. Kateman, Using molecular representations in combination with neural networks. A case study: prediction of the HPLC retention index. Computers Chem., 18 (1994) 157-172.
54. A. Bos, M. Bos and W.E. van der Linden, Artificial neural networks as a multivariate calibration tool: modeling the iron-chromium-nickel system in X-ray fluorescence spectra. Anal. Chim. Acta, 277 (1993) 289-295.
55. M.N. Taib and R. Narayanaswamy, Multichannel calibration technique for optical-fibre chemical sensor using artificial neural network. Sensors Actuators, B39 (1997) 365-370.
56. H.M. Wei, L.S. Wang, B.G. Zhang, C.J. Liu and J.X. Feng, An application of artificial neural networks. Simultaneous determination of the concentration of sulfur dioxide and relative humidity with a single coated piezoelectric crystal. Anal. Chem., 69 (1997) 699-702.
57. H.J. Miao, M.H. Yu and S.X. Hu, Artificial neural networks aided deconvolving overlapped peaks in chromatograms. J. Chromatogr. A, 749 (1996) 5-11.
58. R.M. Lopes Marques, P.J. Schoenmakers, C.B. Lucasius and L.M.C. Buydens, Modelling chromatographic behavior as a function of pH and solvent composition in RPLC. Chromatographia, 36 (1993) 83-95.
59. A.P. de Weyer, L.M.C. Buydens, G. Kateman and H.M. Heuvel, Neural networks used as a soft modelling technique for quantitative description of the inner relation between physical properties and mechanical properties of poly(ethylene terephthalate) yarns. Chemom. Intell. Lab. Syst., 16 (1992) 77-82.
60. E.P.P.A. Derks, B.A. Pauly, J. Jonkers, E.A.H. Timmermans and L.M.C. Buydens, Adaptive noise cancellation on inductively coupled plasma spectroscopy. Chemom. Intell. Lab. Syst., (1998) in press.
61. A. Cichocki and R. Unbehauen, Neural Networks for Optimization and Signal Processing. Wiley, New York, 1993.
62. J. Park and I.W. Sandberg, Universal approximation using radial basis function networks. Neural Computation, 3 (1991) 246-257.
63. J. Moody and C.J. Darken, Fast learning in networks of locally-tuned processing units. Neural Computation, 1 (1989) 281-294.
64. B. Carse and T.C. Fogarty, Fast evolutionary learning of minimal radial basis function neural networks using a genetic algorithm. Lecture Notes in Computer Science, 1143 (1996) 1-22.
65. L. Kiernan, J.D. Mason and K. Warwick, Robust initialisation of Gaussian radial basis function networks using partitioned k-means clustering. Electron. Lett., 32 (1996) 671-672.
66. W. Luo, M.N. Karim, A.J. Morris and E.B. Martin, Control relevant identification of a pH waste water neutralisation process using adaptive radial basis function networks. Computers Chem. Eng., 20 (1996) S1017.
67. B. Walczak and D.L. Massart, Application of radial basis functions — partial least squares to non-linear pattern recognition problems: diagnosis of process faults. Anal. Chim. Acta, 331 (1996) 187-193.
68. B. Walczak and D.L. Massart, The radial basis functions — partial least squares approach as a flexible non-linear regression technique. Anal. Chim. Acta, 331 (1996) 177-185.
69. T. Kohonen, Self-Organization and Associative Memory. Springer-Verlag, Heidelberg, 1989.
70. W.J. Melssen, J.R.M. Smits, L.M.C. Buydens and G. Kateman, Using artificial neural networks for solving chemical problems. II. Kohonen self-organizing feature maps and Hopfield networks. Chemom. Intell. Lab. Syst., 23 (1994) 267-291.
71. W.J. Melssen, J.R.M. Smits, G.H. Rolf and G. Kateman, Two-dimensional mapping of IR spectra using a parallel implemented self-organising feature map. Chemom. Intell. Lab. Syst., 18 (1993) 195-204.
72. S. Anzali, G. Barnickel, M. Krug, J. Sadowski, M. Wagener, J. Gasteiger and J. Polanski, The comparison of geometric and electronic properties of molecular surfaces by neural networks: application to the analysis of corticosteroid-binding globulin activity of steroids. J. Computer-Aided Molec. Design, 10 (1996) 521-534.
73. X.H. Song and P.K. Hopke, Kohonen neural network as a pattern-recognition method based on weight interpretation. Anal. Chim. Acta, 334 (1996) 57-66.
74. M. Wolkenstein, H. Hutter, C. Mittermayr, W. Schiesser and M. Grasserbauer, Classification of SIMS images using a Kohonen network. Anal. Chem., 69 (1997) 777-782.
75. R. Goodacre, J. Pygall and D.B. Kell, Plant seed classification using pyrolysis mass spectrometry with unsupervised learning: the application of auto-associative and Kohonen artificial neural networks. Chemom. Intell. Lab. Syst., 33 (1996) 69-83.
76. J. Zupan, M. Novic and I. Ruisanchez, Kohonen and counterpropagation networks in analytical chemistry. Chemom. Intell. Lab. Syst., 38 (1997) 1-23.
77. D. Wienke and L.M.C. Buydens, Adaptive resonance theory neural networks — the ART of real-time pattern recognition in chemical process monitoring? Trends Anal. Chem., 99 (1995) 1-8.
78. Y. Xie, P.K. Hopke and D. Wienke, Airborne particle classification with a combination of chemical composition and shape index utilizing an adaptive resonance artificial neural network. Environ. Sci. Technol., 28 (1994) 1399-1407.
79. D. Domine, D. Wienke, J. Devillers and L.M.C. Buydens, A new nonlinear neural mapping technique for visual exploration of QSAR data. In: Neural Networks in QSAR and Drug Design, J. Devillers (ed.), Academic Press, London, 1996, pp. 223-253.
80. S.H. Barondes, Geestesziekten en Moleculen. De Wetenschappelijke Bibliotheek van Natuur & Techniek, 1993.