Applied Multivariate Statistical Analysis
FIFTH EDITION

RICHARD A. JOHNSON
University of Wisconsin-Madison

DEAN W. WICHERN
Texas A&M University

PRENTICE HALL, Upper Saddle River, New Jersey 07458
Library of Congress Cataloging-in-Publication Data

Johnson, Richard Arnold.
  Applied multivariate statistical analysis / Richard A. Johnson.--5th ed.
    p. cm.
  Includes bibliographical references and index.
  ISBN 0-13-092553-5
  1. Multivariate analysis. I. Wichern, Dean W. II. Title.
  QA278.J63 2002
  519.5'35--dc21
                                                        2001036199
Acquisitions Editor: Quincy McDonald
Editor-in-Chief: Sally Yagan
Vice President/Director of Production and Manufacturing: David W. Riccardi
Executive Managing Editor: Kathleen Schiaparelli
Senior Managing Editor: Linda Mihatov Behrens
Assistant Managing Editor: Bayani DeLeon
Production Editor: Steven S. Pawlowski
Manufacturing Buyer: Alan Fischer
Manufacturing Manager: Trudy Pisciotti
Marketing Manager: Angela Battle
Editorial Assistant/Supplements Editor: Joanne Wendelken
Managing Editor, Audio/Video Assets: Grace Hazeldine
Art Director: Jayne Conte
Cover Designer: Bruce Kenselaar
Illustrator: Marita Froimson
© 2002, 1998, 1992, 1988, 1982 by Prentice-Hall, Inc. Upper Saddle River, NJ 07458
All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.

Printed in the United States of America

10 9 8 7 6 5 4 3 2

ISBN 0-13-092553-5

Pearson Education LTD., London
Pearson Education Australia PTY, Limited, Sydney
Pearson Education Singapore, Pte. Ltd.
Pearson Education North Asia Ltd., Hong Kong
Pearson Education Canada, Ltd., Toronto
Pearson Education de Mexico, S.A. de C.V.
Pearson Education-Japan, Tokyo
Pearson Education Malaysia, Pte. Ltd.
To the memory of my mother and my father.
R. A. J.

To Dorothy, Michael, and Andrew.
D. W. W.
Contents

PREFACE xv

1 ASPECTS OF MULTIVARIATE ANALYSIS 1
  1.1 Introduction 1
  1.2 Applications of Multivariate Techniques 3
  1.3 The Organization of Data 5
      Arrays, 5; Descriptive Statistics, 6; Graphical Techniques, 11
  1.4 Data Displays and Pictorial Representations 19
      Linking Multiple Two-Dimensional Scatter Plots, 20; Graphs of Growth Curves, 24; Stars, 25; Chernoff Faces, 28
  1.5 Distance 30
  1.6 Final Comments 38
  Exercises 38
  References 48

2 MATRIX ALGEBRA AND RANDOM VECTORS 50
  2.1 Introduction 50
  2.2 Some Basics of Matrix and Vector Algebra 50
      Vectors, 50; Matrices, 55
  2.3 Positive Definite Matrices 61
  2.4 A Square-Root Matrix 66
  2.5 Random Vectors and Matrices 67
  2.6 Mean Vectors and Covariance Matrices 68
      Partitioning the Covariance Matrix, 74; The Mean Vector and Covariance Matrix for Linear Combinations of Random Variables, 76; Partitioning the Sample Mean Vector and Covariance Matrix, 78
  2.7 Matrix Inequalities and Maximization 79
  Supplement 2A: Vectors and Matrices: Basic Concepts 84
      Vectors, 84; Matrices, 89
  Exercises 104
  References 111

3 SAMPLE GEOMETRY AND RANDOM SAMPLING 112
  3.1 Introduction 112
  3.2 The Geometry of the Sample 112
  3.3 Random Samples and the Expected Values of the Sample Mean and Covariance Matrix 120
  3.4 Generalized Variance 124
      Situations in which the Generalized Sample Variance Is Zero, 130; Generalized Variance Determined by |R| and Its Geometrical Interpretation, 136; Another Generalization of Variance, 138
  3.5 Sample Mean, Covariance, and Correlation as Matrix Operations 139
  3.6 Sample Values of Linear Combinations of Variables 141
  Exercises 145
  References 148

4 THE MULTIVARIATE NORMAL DISTRIBUTION 149
  4.1 Introduction 149
  4.2 The Multivariate Normal Density and Its Properties 149
      Additional Properties of the Multivariate Normal Distribution, 156
  4.3 Sampling from a Multivariate Normal Distribution and Maximum Likelihood Estimation 168
      The Multivariate Normal Likelihood, 168; Maximum Likelihood Estimation of μ and Σ, 170; Sufficient Statistics, 173
  4.4 The Sampling Distribution of X̄ and S 173
      Properties of the Wishart Distribution, 174
  4.5 Large-Sample Behavior of X̄ and S 175
  4.6 Assessing the Assumption of Normality 177
      Evaluating the Normality of the Univariate Marginal Distributions, 178; Evaluating Bivariate Normality, 183
  4.7 Detecting Outliers and Cleaning Data 189
      Steps for Detecting Outliers, 190
  4.8 Transformations to Near Normality 194
      Transforming Multivariate Observations, 198
  Exercises 202
  References 209

5 INFERENCES ABOUT A MEAN VECTOR 210
  5.1 Introduction 210
  5.2 The Plausibility of μ0 as a Value for a Normal Population Mean 210
  5.3 Hotelling's T² and Likelihood Ratio Tests 216
      General Likelihood Ratio Method, 219
  5.4 Confidence Regions and Simultaneous Comparisons of Component Means 220
      Simultaneous Confidence Statements, 223; A Comparison of Simultaneous Confidence Intervals with One-at-a-Time Intervals, 229; The Bonferroni Method of Multiple Comparisons, 232
  5.5 Large Sample Inferences about a Population Mean Vector 234
  5.6 Multivariate Quality Control Charts 239
      Charts for Monitoring a Sample of Individual Multivariate Observations for Stability, 241; Control Regions for Future Individual Observations, 247; Control Ellipse for Future Observations, 248; T²-Chart for Future Observations, 248; Control Charts Based on Subsample Means, 249; Control Regions for Future Subsample Observations, 251
  5.7 Inferences about Mean Vectors when Some Observations Are Missing 252
  5.8 Difficulties Due to Time Dependence in Multivariate Observations 256
  Supplement 5A: Simultaneous Confidence Intervals and Ellipses as Shadows of the p-Dimensional Ellipsoids 258
  Exercises 260
  References 270

6 COMPARISONS OF SEVERAL MULTIVARIATE MEANS 272
  6.1 Introduction 272
  6.2 Paired Comparisons and a Repeated Measures Design 272
      Paired Comparisons, 272; A Repeated Measures Design for Comparing Treatments, 278
  6.3 Comparing Mean Vectors from Two Populations 283
      Assumptions Concerning the Structure of the Data, 283; Further Assumptions when n1 and n2 Are Small, 284; Simultaneous Confidence Intervals, 287; The Two-Sample Situation when Σ1 ≠ Σ2, 290
  6.4 Comparing Several Multivariate Population Means (One-Way MANOVA) 293
      Assumptions about the Structure of the Data for One-Way MANOVA, 293; A Summary of Univariate ANOVA, 293; Multivariate Analysis of Variance (MANOVA), 298
  6.5 Simultaneous Confidence Intervals for Treatment Effects 305
  6.6 Two-Way Multivariate Analysis of Variance 307
      Univariate Two-Way Fixed-Effects Model with Interaction, 307; Multivariate Two-Way Fixed-Effects Model with Interaction, 309
  6.7 Profile Analysis 318
  6.8 Repeated Measures Designs and Growth Curves 323
  6.9 Perspectives and a Strategy for Analyzing Multivariate Models 327
  Exercises 332
  References 352

7 MULTIVARIATE LINEAR REGRESSION MODELS 354
  7.1 Introduction 354
  7.2 The Classical Linear Regression Model 354
  7.3 Least Squares Estimation 358
      Sum-of-Squares Decomposition, 360; Geometry of Least Squares, 361; Sampling Properties of Classical Least Squares Estimators, 363
  7.4 Inferences About the Regression Model 365
      Inferences Concerning the Regression Parameters, 365; Likelihood Ratio Tests for the Regression Parameters, 370
  7.5 Inferences from the Estimated Regression Function 374
      Estimating the Regression Function at z0, 374; Forecasting a New Observation at z0, 375
  7.6 Model Checking and Other Aspects of Regression 377
      Does the Model Fit?, 377; Leverage and Influence, 380; Additional Problems in Linear Regression, 380
  7.7 Multivariate Multiple Regression 383
      Likelihood Ratio Tests for Regression Parameters, 392; Other Multivariate Test Statistics, 395; Predictions from Multivariate Multiple Regressions, 395
  7.8 The Concept of Linear Regression 398
      Prediction of Several Variables, 403; Partial Correlation Coefficient, 406
  7.9 Comparing the Two Formulations of the Regression Model 407
      Mean Corrected Form of the Regression Model, 407; Relating the Formulations, 409
  7.10 Multiple Regression Models with Time Dependent Errors 410
  Supplement 7A: The Distribution of the Likelihood Ratio for the Multivariate Multiple Regression Model 415
  Exercises 417
  References 424

8 PRINCIPAL COMPONENTS 426
  8.1 Introduction 426
  8.2 Population Principal Components 426
      Principal Components Obtained from Standardized Variables, 432; Principal Components for Covariance Matrices with Special Structures, 435
  8.3 Summarizing Sample Variation by Principal Components 437
      The Number of Principal Components, 440; Interpretation of the Sample Principal Components, 444; Standardizing the Sample Principal Components, 445
  8.4 Graphing the Principal Components 450
  8.5 Large Sample Inferences 452
      Large Sample Properties of λ̂i and êi, 452; Testing for the Equal Correlation Structure, 453
  8.6 Monitoring Quality with Principal Components 455
      Checking a Given Set of Measurements for Stability, 455; Controlling Future Values, 459
  Supplement 8A: The Geometry of the Sample Principal Component Approximation 462
      The p-Dimensional Geometrical Interpretation, 464; The n-Dimensional Geometrical Interpretation, 465
  Exercises 466
  References 475

9 FACTOR ANALYSIS AND INFERENCE FOR STRUCTURED COVARIANCE MATRICES 477
  9.1 Introduction 477
  9.2 The Orthogonal Factor Model 478
  9.3 Methods of Estimation 484
      The Principal Component (and Principal Factor) Method, 484; A Modified Approach-the Principal Factor Solution, 490; The Maximum Likelihood Method, 492; A Large Sample Test for the Number of Common Factors, 498
  9.4 Factor Rotation 501
      Oblique Rotations, 509
  9.5 Factor Scores 510
      The Weighted Least Squares Method, 511; The Regression Method, 513
  9.6 Perspectives and a Strategy for Factor Analysis 517
  9.7 Structural Equation Models 524
      The LISREL Model, 525; Construction of a Path Diagram, 525; Covariance Structure, 526; Estimation, 527; Model-Fitting Strategy, 529
  Supplement 9A: Some Computational Details for Maximum Likelihood Estimation 530
      Recommended Computational Scheme, 531; Maximum Likelihood Estimators of ρ = LzL'z + Ψz, 532
  Exercises 533
  References 541

10 CANONICAL CORRELATION ANALYSIS 543
  10.1 Introduction 543
  10.2 Canonical Variates and Canonical Correlations 543
  10.3 Interpreting the Population Canonical Variables 551
      Identifying the Canonical Variables, 551; Canonical Correlations as Generalizations of Other Correlation Coefficients, 553; The First r Canonical Variables as a Summary of Variability, 554; A Geometrical Interpretation of the Population Canonical Correlation Analysis, 555
  10.4 The Sample Canonical Variates and Sample Canonical Correlations 556
  10.5 Additional Sample Descriptive Measures 564
      Matrices of Errors of Approximations, 564; Proportions of Explained Sample Variance, 567
  10.6 Large Sample Inferences 569
  Exercises 573
  References 580

11 DISCRIMINATION AND CLASSIFICATION 581
  11.1 Introduction 581
  11.2 Separation and Classification for Two Populations 582
  11.3 Classification with Two Multivariate Normal Populations 590
      Classification of Normal Populations When Σ1 = Σ2 = Σ, 590; Scaling, 595; Classification of Normal Populations When Σ1 ≠ Σ2, 596
  11.4 Evaluating Classification Functions 598
  11.5 Fisher's Discriminant Function-Separation of Populations 609
  11.6 Classification with Several Populations 612
      The Minimum Expected Cost of Misclassification Method, 613; Classification with Normal Populations, 616
  11.7 Fisher's Method for Discriminating among Several Populations 628
      Using Fisher's Discriminants to Classify Objects, 635
  11.8 Final Comments 641
      Including Qualitative Variables, 641; Classification Trees, 641; Neural Networks, 644; Selection of Variables, 645; Testing for Group Differences, 645; Graphics, 646; Practical Considerations Regarding Multivariate Normality, 646
  Exercises 647
  References 666

12 CLUSTERING, DISTANCE METHODS, AND ORDINATION 668
  12.1 Introduction 668
  12.2 Similarity Measures 670
      Distances and Similarity Coefficients for Pairs of Items, 670; Similarities and Association Measures for Pairs of Variables, 676; Concluding Comments on Similarity, 677
  12.3 Hierarchical Clustering Methods 679
      Single Linkage, 681; Complete Linkage, 685; Average Linkage, 689; Ward's Hierarchical Clustering Method, 690; Final Comments-Hierarchical Procedures, 693
  12.4 Nonhierarchical Clustering Methods 694
      K-means Method, 694; Final Comments-Nonhierarchical Procedures, 698
  12.5 Multidimensional Scaling 700
      The Basic Algorithm, 700
  12.6 Correspondence Analysis 709
      Algebraic Development of Correspondence Analysis, 711; Inertia, 718; Interpretation in Two Dimensions, 719; Final Comments, 719
  12.7 Biplots for Viewing Sampling Units and Variables 719
      Constructing Biplots, 720
  12.8 Procrustes Analysis: A Method for Comparing Configurations 723
      Constructing the Procrustes Measure of Agreement, 724
  Supplement 12A: Data Mining 731
      Introduction, 731; The Data Mining Process, 732; Model Assessment, 733
  Exercises 738
  References 745

APPENDIX 748

DATA INDEX 758

SUBJECT INDEX 761
Preface
INTENDED AUDIENCE

This book originally grew out of our lecture notes for an "Applied Multivariate Analysis" course offered jointly by the Statistics Department and the School of Business at the University of Wisconsin-Madison. Applied Multivariate Statistical Analysis, Fifth Edition, is concerned with statistical methods for describing and analyzing multivariate data. Data analysis, while interesting with one variable, becomes truly fascinating and challenging when several variables are involved. Researchers in the biological, physical, and social sciences frequently collect measurements on several variables. Modern computer packages readily provide the numerical results to rather complex statistical analyses. We have tried to provide readers with the supporting knowledge necessary for making proper interpretations, selecting appropriate techniques, and understanding their strengths and weaknesses. We hope our discussions will meet the needs of experimental scientists, in a wide variety of subject matter areas, as a readable introduction to the statistical analysis of multivariate observations.

LEVEL

Our aim is to present the concepts and methods of multivariate analysis at a level that is readily understandable by readers who have taken two or more statistics courses. We emphasize the applications of multivariate methods and, consequently, have attempted to make the mathematics as palatable as possible. We avoid the use of calculus. On the other hand, the concepts of a matrix and of matrix manipulations are important. We do not assume the reader is familiar with matrix algebra. Rather, we introduce matrices as they appear naturally in our discussions, and we then show how they simplify the presentation of multivariate models and techniques. The introductory account of matrix algebra, in Chapter 2, highlights the more important matrix algebra results as they apply to multivariate analysis.
The Chapter 2 supplement provides a summary of matrix algebra results for those with little or no previous exposure to the subject. This supplementary material helps make the book self-contained and is used to complete proofs. The proofs may be ignored on the first reading. In this way we hope to make the book accessible to a wide audience. In our attempt to make the study of multivariate analysis appealing to a large audience of both practitioners and theoreticians, we have had to sacrifice a consistency
of level. Some sections are harder than others. In particular, we have summarized a voluminous amount of material on regression in Chapter 7. The resulting presentation is rather succinct and difficult the first time through. We hope instructors will be able to compensate for the unevenness in level by judiciously choosing those sections, and subsections, appropriate for their students and by toning them down if necessary.

ORGANIZATION AND APPROACH

The methodological "tools" of multivariate analysis are contained in Chapters 5 through 12. These chapters represent the heart of the book, but they cannot be assimilated without much of the material in the introductory Chapters 1 through 4. Even those readers with a good knowledge of matrix algebra or those willing to accept the mathematical results on faith should, at the very least, peruse Chapter 3, "Sample Geometry," and Chapter 4, "Multivariate Normal Distribution."

Our approach in the methodological chapters is to keep the discussion direct and uncluttered. Typically, we start with a formulation of the population models, delineate the corresponding sample results, and liberally illustrate everything with examples. The examples are of two types: those that are simple and whose calculations can be easily done by hand, and those that rely on real-world data and computer software. These will provide an opportunity to (1) duplicate our analyses, (2) carry out the analyses dictated by exercises, or (3) analyze the data using methods other than the ones we have used or suggested.

The division of the methodological chapters (5 through 12) into three units allows instructors some flexibility in tailoring a course to their needs. Possible sequences for a one-semester (two quarter) course are indicated schematically. Each instructor will undoubtedly omit certain sections from some chapters to cover a broader collection of topics than is indicated by these two choices.

    Getting Started (Chapters 1-4)
      -> Inference About Means (Chapters 5-7) -> Analysis of Covariance Structure (Chapters 8-10)
      -> Classification and Grouping (Chapters 11 and 12) -> Analysis of Covariance Structure (Chapters 8-10)
For most students, we would suggest a quick pass through the first four chapters (concentrating primarily on the material in Chapter 1; Sections 2.1, 2.2, 2.3, 2.5, 2.6, and 3.6; and the "assessing normality" material in Chapter 4) followed by a selection of methodological topics. For example, one might discuss the comparison of mean vectors, principal components, factor analysis, discriminant analysis and clustering. The discussions could feature the many "worked out" examples included in
these sections of the text. Instructors may rely on diagrams and verbal descriptions to teach the corresponding theoretical developments. If the students have uniformly strong mathematical backgrounds, much of the book can successfully be covered in one term.

We have found individual data-analysis projects useful for integrating material from several of the methods chapters. Here, our rather complete treatments of multivariate analysis of variance (MANOVA), regression analysis, factor analysis, canonical correlation, discriminant analysis, and so forth are helpful, even though they may not be specifically covered in lectures.

CHANGES TO THE FIFTH EDITION

New material. Users of the previous editions will notice that we have added several exercises and data sets, some new graphics, and have expanded the discussion of the dimensionality of multivariate data, growth curves, and classification and regression trees (CART). In addition, the algebraic development of correspondence analysis has been redone and a new section on data mining has been added to Chapter 12. We put the data mining material in Chapter 12 since much of data mining, as it is now applied in business, has a classification and/or grouping objective. As always, we have tried to improve the exposition in several places.

Data CD. Recognizing the importance of modern statistical packages in the analysis of multivariate data, we have added numerous real-data sets. The full data sets used in the book are saved as ASCII files on the CD-ROM that is packaged with each copy of the book. This format will allow easy interface with existing statistical software packages and provide more convenient hands-on data analysis opportunities.

Instructor's Solutions Manual. An Instructor's Solutions Manual (ISBN 0-13-092555-1) containing complete solutions to most of the exercises in the book is available free upon adoption from Prentice Hall.
For information on additional for-sale supplements that may be used with the book or additional titles of interest, please visit the Prentice Hall Web site at www.prenhall.com.

ACKNOWLEDGMENTS

We thank our many colleagues who helped improve the applied aspect of the book by contributing their own data sets for examples and exercises. A number of individuals helped guide this revision, and we are grateful for their suggestions: Steve Coad, University of Michigan; Richard Kiltie, University of Florida; Sam Kotz, George Mason University; Shyamal Peddada, University of Virginia; K. Sivakumar, University of Illinois at Chicago; Eric Smith, Virginia Tech; and Stanley Wasserman, University of Illinois at Urbana-Champaign. We also acknowledge the feedback of the students we have taught these past 30 years in our applied multivariate analysis courses. Their comments and suggestions are largely responsible for the present iteration
of this work. We would also like to give special thanks to Wai Kwong Cheang for his help with the calculations for many of the examples.

We must thank Dianne Hall for her valuable work on the CD-ROM and Solutions Manual, Steve Verrill for computing assistance throughout, and Alison Pollack for implementing a Chernoff faces program. We are indebted to Cliff Gilman for his assistance with the multidimensional scaling examples discussed in Chapter 12. Jacquelyn Forer did most of the typing of the original draft manuscript, and we appreciate her expertise and willingness to endure the cajoling of authors faced with publication deadlines. Finally, we would like to thank Quincy McDonald, Joanne Wendelken, Steven Scott Pawlowski, Pat Daly, Linda Behrens, Alan Fischer, and the rest of the Prentice Hall staff for their help with this project.
R. A. Johnson
rich@stat.wisc.edu
D. W. Wichern
d-wichern@tamu.edu
CHAPTER 1

Aspects of Multivariate Analysis
1.1 INTRODUCTION

Scientific inquiry is an iterative learning process. Objectives pertaining to the explanation of a social or physical phenomenon must be specified and then tested by gathering and analyzing data. In turn, an analysis of the data gathered by experimentation or observation will usually suggest a modified explanation of the phenomenon. Throughout this iterative learning process, variables are often added to or deleted from the study. Thus, the complexities of most phenomena require an investigator to collect observations on many different variables. This book is concerned with statistical methods designed to elicit information from these kinds of data sets. Because the data include simultaneous measurements on many variables, this body of methodology is called multivariate analysis.

The need to understand the relationships between many variables makes multivariate analysis an inherently difficult subject. Often, the human mind is overwhelmed by the sheer bulk of the data. Additionally, more mathematics is required to derive multivariate statistical techniques for making inferences than in a univariate setting. We have chosen to provide explanations based upon algebraic concepts and to avoid the derivations of statistical results that require the calculus of many variables. Our objective is to introduce several useful multivariate techniques in a clear manner, making heavy use of illustrative examples and a minimum of mathematics. Nonetheless, some mathematical sophistication and a desire to think quantitatively will be required.

Most of our emphasis will be on the analysis of measurements obtained without actively controlling or manipulating any of the variables on which the measurements are made. Only in Chapters 6 and 7 shall we treat a few experimental plans (designs) for generating data that prescribe the active manipulation of important variables. Although the experimental design is ordinarily the most important part of a scientific investigation, it is frequently impossible to control the generation of appropriate data in certain disciplines. (This is true, for example, in business, economics, ecology, geology, and sociology.) You should consult [7] and
[8] for detailed accounts of design principles that, fortunately, also apply to multivariate situations.

It will become increasingly clear that many multivariate methods are based upon an underlying probability model known as the multivariate normal distribution. Other methods are ad hoc in nature and are justified by logical or commonsense arguments. Regardless of their origin, multivariate techniques must, invariably, be implemented on a computer. Recent advances in computer technology have been accompanied by the development of rather sophisticated statistical software packages, making the implementation step easier.

Multivariate analysis is a "mixed bag." It is difficult to establish a classification scheme for multivariate techniques that both is widely accepted and indicates the appropriateness of the techniques. One classification distinguishes techniques designed to study interdependent relationships from those designed to study dependent relationships. Another classifies techniques according to the number of populations and the number of sets of variables being studied. Chapters in this text are divided into sections according to inference about treatment means, inference about covariance structure, and techniques for sorting or grouping. This should not, however, be considered an attempt to place each method into a slot. Rather, the choice of methods and the types of analyses employed are largely determined by the objectives of the investigation. In Section 1.2, we list a smaller number of practical problems designed to illustrate the connection between the choice of a statistical method and the objectives of the study. These problems, plus the examples in the text, should provide you with an appreciation for the applicability of multivariate techniques across different fields.

The objectives of scientific investigations to which multivariate methods most naturally lend themselves include the following:
1. Data reduction or structural simplification. The phenomenon being studied is represented as simply as possible without sacrificing valuable information. It is hoped that this will make interpretation easier.

2. Sorting and grouping. Groups of "similar" objects or variables are created, based upon measured characteristics. Alternatively, rules for classifying objects into well-defined groups may be required.

3. Investigation of the dependence among variables. The nature of the relationships among variables is of interest. Are all the variables mutually independent or are one or more variables dependent on the others? If so, how?

4. Prediction. Relationships between variables must be determined for the purpose of predicting the values of one or more variables on the basis of observations on the other variables.

5. Hypothesis construction and testing. Specific statistical hypotheses, formulated in terms of the parameters of multivariate populations, are tested. This may be done to validate assumptions or to reinforce prior convictions.

We conclude this brief overview of multivariate analysis with a quotation from F. H. C. Marriott [19], page 89. The statement was made in a discussion of cluster analysis, but we feel it is appropriate for a broader range of methods. You should
keep it in mind whenever you attempt or read about a data analysis. It allows one to maintain a proper perspective and not be overwhelmed by the elegance of some of the theory:

If the results disagree with informed opinion, do not admit a simple logical interpretation, and do not show up clearly in a graphical presentation, they are probably wrong. There is no magic about numerical methods, and many ways in which they can break down. They are a valuable aid to the interpretation of data, not sausage machines automatically transforming bodies of numbers into packets of scientific fact.
1.2 APPLICATIONS OF MULTIVARIATE TECHNIQUES

The published applications of multivariate methods have increased tremendously in recent years. It is now difficult to cover the variety of real-world applications of these methods with brief discussions, as we did in earlier editions of this book. However, in order to give some indication of the usefulness of multivariate techniques, we offer the following short descriptions of the results of studies from several disciplines. These descriptions are organized according to the categories of objectives given in the previous section. Of course, many of our examples are multifaceted and could be placed in more than one category.
Data reduction or simplification

• Using data on several variables related to cancer patient responses to radiotherapy, a simple measure of patient response to radiotherapy was constructed. (See Exercise 1.15.)
• Track records from many nations were used to develop an index of performance for both male and female athletes. (See [10] and [22].)
• Multispectral image data collected by a high-altitude scanner were reduced to a form that could be viewed as images (pictures) of a shoreline in two dimensions. (See [23].)
• Data on several variables relating to yield and protein content were used to create an index to select parents of subsequent generations of improved bean plants. (See [14].)
• A matrix of tactic similarities was developed from aggregate data derived from professional mediators. From this matrix the number of dimensions by which professional mediators judge the tactics they use in resolving disputes was determined. (See [21].)
Sorting and grouping

• Data on several variables related to computer use were employed to create clusters of categories of computer jobs that allow a better determination of existing (or planned) computer utilization. (See [2].)
• Measurements of several physiological variables were used to develop a screening procedure that discriminates alcoholics from nonalcoholics. (See [26].)
• Data related to responses to visual stimuli were used to develop a rule for separating people suffering from a multiple-sclerosis-caused visual pathology from those not suffering from the disease. (See Exercise 1.14.)
• The U.S. Internal Revenue Service uses data collected from tax returns to sort taxpayers into two groups: those that will be audited and those that will not. (See [31].)
Investigation of the dependence among variables

• Data on several variables were used to identify factors that were responsible for client success in hiring external consultants. (See [13].)
• Measurements of variables related to innovation, on the one hand, and variables related to the business environment and business organization, on the other hand, were used to discover why some firms are product innovators and some firms are not. (See [5].)
• Data on variables representing the outcomes of the 10 decathlon events in the Olympics were used to determine the physical factors responsible for success in the decathlon. (See [17].)
• The associations between measures of risk-taking propensity and measures of socioeconomic characteristics for top-level business executives were used to assess the relation between risk-taking behavior and performance. (See [18].)
Prediction

• The associations between test scores and several high school performance variables and several college performance variables were used to develop predictors of success in college. (See [11].)
• Data on several variables related to the size distribution of sediments were used to develop rules for predicting different depositional environments. (See [9] and [20].)
• Measurements on several accounting and financial variables were used to develop a method for identifying potentially insolvent property-liability insurers. (See [28].)
• Data on several variables for chickweed plants were used to develop a method for predicting the species of a new plant. (See [4].)
Hypotheses testing •
•
•
Several pollution-related variables were measured to determine whether levels for a large metropolitan area were roughly constant throughout the week, or whether there was a noticeable difference between weekdays and weekends. (See Exercise 1 .6.) Experimental data on several variables were used to see whether the nature of the instructions makes any difference in perceived risks, as quantified by test scores. (See [27] . ) Data on many variables were used to investigate the differences in structure of American occupations to determine the support for one of two competing so ciological theories. (See [16] and [25] .)
Section •
1 .3
The Orga nization of Data
5
Data on several variables were used to determine whether different types of firms in newly industrialized countries exhibited different patterns of innova tion. (See [15].)
The preceding descriptions offer glimpses into the use of multivariate methods in widely diverse fields.

1.3 THE ORGANIZATION OF DATA

Throughout this text, we are going to be concerned with analyzing measurements made on several variables or characteristics. These measurements (commonly called data) must frequently be arranged and displayed in various ways. For example, graphs and tabular arrangements are important aids in data analysis. Summary numbers, which quantitatively portray certain features of the data, are also necessary to any description. We now introduce the preliminary concepts underlying these first steps of data organization.

Arrays

Multivariate data arise whenever an investigator, seeking to understand a social or physical phenomenon, selects a number p ≥ 1 of variables or characters to record. The values of these variables are all recorded for each distinct item, individual, or experimental unit.
We will use the notation x_jk to indicate the particular value of the kth variable that is observed on the jth item, or trial. That is,

x_jk = measurement of the kth variable on the jth item

Consequently, n measurements on p variables can be displayed as follows:

             Variable 1   Variable 2   ...   Variable k   ...   Variable p
Item 1:        x_11         x_12       ...     x_1k       ...     x_1p
Item 2:        x_21         x_22       ...     x_2k       ...     x_2p
  ...
Item j:        x_j1         x_j2       ...     x_jk       ...     x_jp
  ...
Item n:        x_n1         x_n2       ...     x_nk       ...     x_np
Or we can display these data as a rectangular array, called X, of n rows and p columns:

$$X = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1k} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2k} & \cdots & x_{2p} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{j1} & x_{j2} & \cdots & x_{jk} & \cdots & x_{jp} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk} & \cdots & x_{np}
\end{bmatrix}$$
The array X, then, contains the data consisting of all of the observations on all of the variables.

Example 1.1 (A data array)

A selection of four receipts from a university bookstore was obtained in order to investigate the nature of book sales. Each receipt provided, among other things, the number of books sold and the total amount of each sale. Let the first variable be total dollar sales and the second variable be number of books sold. Then we can regard the corresponding numbers on the receipts as four measurements on two variables. Suppose the data, in tabular form, are

Variable 1 (dollar sales):     42   52   48   58
Variable 2 (number of books):   4    5    4    3

Using the notation just introduced, we have

x_11 = 42   x_21 = 52   x_31 = 48   x_41 = 58
x_12 = 4    x_22 = 5    x_32 = 4    x_42 = 3

and the data array X is

$$X = \begin{bmatrix} 42 & 4 \\ 52 & 5 \\ 48 & 4 \\ 58 & 3 \end{bmatrix}$$

with four rows and two columns. •
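Although the text itself uses no software, a data array like the one in Example 1.1 is easy to sketch in NumPy; this is illustrative only and not part of the original presentation. Rows are items and columns are variables, matching the convention for X above.

```python
import numpy as np

# Rows are items (receipts); columns are variables (dollar sales, number of books).
X = np.array([[42, 4],
              [52, 5],
              [48, 4],
              [58, 3]])

print(X.shape)   # (4, 2): n = 4 items measured on p = 2 variables
print(X[3, 0])   # x_41 = 58, the fourth item's value on the first variable
```

Indexing is zero-based in NumPy, so the text's x_jk corresponds to `X[j-1, k-1]`.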
Considering data in the form of arrays facilitates the exposition of the subject matter and allows numerical calculations to be performed in an orderly and efficient manner. The efficiency is twofold, as gains are attained in both (1) describing numerical calculations as operations on arrays and (2) the implementation of the calculations on computers, which now use many languages and statistical packages to perform array operations. We consider the manipulation of arrays of numbers in Chapter 2. At this point, we are concerned only with their value as devices for displaying data.

Descriptive Statistics

A large data set is bulky, and its very mass poses a serious obstacle to any attempt to visually extract pertinent information. Much of the information contained in the data can be assessed by calculating certain summary numbers, known as descriptive statistics. For example, the arithmetic average, or sample mean, is a descriptive statistic that provides a measure of location, that is, a "central value," for a set of numbers. And the average of the squares of the distances of all of the numbers from the mean provides a measure of the spread, or variation, in the numbers.

We shall rely most heavily on descriptive statistics that measure location, variation, and linear association. The formal definitions of these quantities follow.
Let x_11, x_21, ..., x_n1 be n measurements on the first variable. Then the arithmetic average of these measurements is

$$\bar{x}_1 = \frac{1}{n}\sum_{j=1}^{n} x_{j1}$$

If the n measurements represent a subset of the full set of measurements that might have been observed, then x̄_1 is also called the sample mean for the first variable. We adopt this terminology because the bulk of this book is devoted to procedures designed for analyzing samples of measurements from larger collections.

The sample mean can be computed from the n measurements on each of the p variables, so that, in general, there will be p sample means:

$$\bar{x}_k = \frac{1}{n}\sum_{j=1}^{n} x_{jk} \qquad k = 1, 2, \ldots, p \tag{1-1}$$

A measure of spread is provided by the sample variance, defined for n measurements on the first variable as

$$s_1^2 = \frac{1}{n}\sum_{j=1}^{n} (x_{j1} - \bar{x}_1)^2$$

where x̄_1 is the sample mean of the x_j1's. In general, for p variables, we have

$$s_k^2 = \frac{1}{n}\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2 \qquad k = 1, 2, \ldots, p \tag{1-2}$$

Two comments are in order. First, many authors define the sample variance with a divisor of n - 1 rather than n. Later we shall see that there are theoretical reasons for doing this, and it is particularly appropriate if the number of measurements, n, is small. The two versions of the sample variance will always be differentiated by displaying the appropriate expression.

Second, although the s² notation is traditionally used to indicate the sample variance, we shall eventually consider an array of quantities in which the sample variances lie along the main diagonal. In this situation, it is convenient to use double subscripts on the variances in order to indicate their positions in the array. Therefore, we introduce the notation s_kk to denote the same variance computed from measurements on the kth variable, and we have the notational identities

$$s_k^2 = s_{kk} = \frac{1}{n}\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2 \qquad k = 1, 2, \ldots, p \tag{1-3}$$
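As an illustrative sketch, formulas (1-1) and (1-3) can be checked on the bookstore data of Example 1.1; the computation below is not part of the original text, and note that the divisor n (not n - 1) is used, as in the formulas above.

```python
import numpy as np

X = np.array([[42, 4], [52, 5], [48, 4], [58, 3]], dtype=float)
n = X.shape[0]

xbar = X.sum(axis=0) / n                # p sample means, formula (1-1)
s = ((X - xbar) ** 2).sum(axis=0) / n   # sample variances s_kk with divisor n, formula (1-3)

print(xbar)   # sample means: 50 and 4
print(s)      # sample variances: 34 and .5
```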
The square root of the sample variance, √s_kk, is known as the sample standard deviation. This measure of variation is in the same units as the observations.

Consider n pairs of measurements on each of variables 1 and 2:

$$\begin{bmatrix} x_{11} \\ x_{12} \end{bmatrix}, \quad \begin{bmatrix} x_{21} \\ x_{22} \end{bmatrix}, \quad \ldots, \quad \begin{bmatrix} x_{n1} \\ x_{n2} \end{bmatrix}$$
That is, x_j1 and x_j2 are observed on the jth experimental item (j = 1, 2, ..., n). A measure of linear association between the measurements of variables 1 and 2 is provided by the sample covariance

$$s_{12} = \frac{1}{n}\sum_{j=1}^{n} (x_{j1} - \bar{x}_1)(x_{j2} - \bar{x}_2)$$

or the average product of the deviations from their respective means. If large values for one variable are observed in conjunction with large values for the other variable, and the small values also occur together, s_12 will be positive. If large values from one variable occur with small values for the other variable, s_12 will be negative. If there is no particular association between the values for the two variables, s_12 will be approximately zero.

The sample covariance

$$s_{ik} = \frac{1}{n}\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k) \qquad i = 1, 2, \ldots, p, \; k = 1, 2, \ldots, p \tag{1-4}$$

measures the association between the ith and kth variables. We note that the covariance reduces to the sample variance when i = k. Moreover, s_ik = s_ki for all i and k.

The final descriptive statistic considered here is the sample correlation coefficient (or Pearson's product-moment correlation coefficient; see [3]). This measure of the linear association between two variables does not depend on the units of measurement. The sample correlation coefficient for the ith and kth variables is defined as
$$r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\sqrt{s_{kk}}}
       = \frac{\displaystyle\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k)}
              {\sqrt{\displaystyle\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2}\,
               \sqrt{\displaystyle\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2}} \tag{1-5}$$

for i = 1, 2, ..., p and k = 1, 2, ..., p. Note that r_ik = r_ki for all i and k.
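Again as an illustrative sketch outside the original text, formulas (1-4) and (1-5) can be applied to the bookstore data, and the remark below about the divisor can be verified directly: the correlations come out the same whether n or n - 1 is used.

```python
import numpy as np

X = np.array([[42, 4], [52, 5], [48, 4], [58, 3]], dtype=float)
n = X.shape[0]

d = X - X.mean(axis=0)                 # deviations from the sample means
S = d.T @ d / n                        # covariances s_ik with divisor n, formula (1-4)
R = S / np.sqrt(np.outer(np.diag(S), np.diag(S)))   # correlations r_ik, formula (1-5)

S_alt = d.T @ d / (n - 1)              # same calculation with divisor n - 1
R_alt = S_alt / np.sqrt(np.outer(np.diag(S_alt), np.diag(S_alt)))

print(round(S[0, 1], 2))               # s_12 = -1.5
print(round(R[0, 1], 2))               # r_12 = -0.36
print(np.allclose(R, R_alt))           # True: r_ik does not depend on the divisor
```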
The sample correlation coefficient is a standardized version of the sample covariance, where the product of the square roots of the sample variances provides the standardization. Notice that r_ik has the same value whether n or n - 1 is chosen as the common divisor for s_ii, s_kk, and s_ik.

The sample correlation coefficient r_ik can also be viewed as a sample covariance. Suppose the original values x_ji and x_jk are replaced by standardized values (x_ji - x̄_i)/√s_ii and (x_jk - x̄_k)/√s_kk. The standardized values are commensurable because both sets are centered at zero and expressed in standard deviation units. The sample correlation coefficient is just the sample covariance of the standardized observations.

Although the signs of the sample correlation and the sample covariance are the same, the correlation is ordinarily easier to interpret because its magnitude is bounded. To summarize, the sample correlation r has the following properties:
1. The value of r must be between -1 and +1.
2. Here r measures the strength of the linear association. If r = 0, this implies a lack of linear association between the components. Otherwise, the sign of r indicates the direction of the association: r < 0 implies a tendency for one value in the pair to be larger than its average when the other is smaller than its average; and r > 0 implies a tendency for one value of the pair to be large when the other value is large and also for both values to be small together.
3. The value of r_ik remains unchanged if the measurements of the ith variable are changed to y_ji = a x_ji + b, j = 1, 2, ..., n, and the values of the kth variable are changed to y_jk = c x_jk + d, j = 1, 2, ..., n, provided that the constants a and c have the same sign.

The quantities s_ik and r_ik do not, in general, convey all there is to know about the association between two variables. Nonlinear associations can exist that are not revealed by these descriptive statistics. Covariance and correlation provide measures of linear association, or association along a line. Their values are less informative for other kinds of association. On the other hand, these quantities can be very sensitive to "wild" observations ("outliers") and may indicate association when, in fact, little exists.

In spite of these shortcomings, covariance and correlation coefficients are routinely calculated and analyzed. They provide cogent numerical summaries of association when the data do not exhibit obvious nonlinear patterns of association and when wild observations are not present. Suspect observations must be accounted for by correcting obvious recording mistakes and by taking actions consistent with the identified causes. The values of s_ik and r_ik should be quoted both with and without these observations.

The sum of squares of the deviations from the mean and the sum of cross-product deviations are often of interest themselves. These quantities are

$$w_{kk} = \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2 \qquad k = 1, 2, \ldots, p \tag{1-6}$$

and

$$w_{ik} = \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k) \qquad i = 1, 2, \ldots, p, \; k = 1, 2, \ldots, p \tag{1-7}$$
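Property 3 above is easy to check numerically. The sketch below uses hypothetical simulated data (not data from the text) and confirms that rescaling both variables with same-sign slopes leaves the correlation unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)             # hypothetical data, for illustration only
x = rng.normal(size=20)                    # measurements on variable i
y = x + rng.normal(scale=0.5, size=20)     # related measurements on variable k

r = np.corrcoef(x, y)[0, 1]
# Change of units: y_ji = a*x_ji + b and y_jk = c*x_jk + d with a = 3, c = 0.5 (same sign)
r_new = np.corrcoef(3.0 * x + 7.0, 0.5 * y - 2.0)[0, 1]

print(np.isclose(r, r_new))                # True: property 3 holds
```

With a and c of opposite signs, the correlation would instead change sign while keeping the same magnitude.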
The descriptive statistics computed from n measurements on p variables can also be organized into arrays.

The sample mean array is denoted by x̄, the sample variance and covariance array by the capital letter S_n, and the sample correlation array by R. The subscript n on the array S_n is a mnemonic device used to remind you that n is employed as a divisor for the elements s_ik. The size of all of the arrays is determined by the number of variables, p.

The arrays S_n and R consist of p rows and p columns. The array x̄ is a single column with p rows. The first subscript on an entry in arrays S_n and R indicates the row; the second subscript indicates the column. Since s_ik = s_ki and r_ik = r_ki for all i and k, the entries in symmetric positions about the main northwest-southeast diagonals in arrays S_n and R are the same, and the arrays are said to be symmetric.

Example 1.2
(The arrays x̄, S_n, and R for bivariate data)

Consider the data introduced in Example 1.1. Each receipt yields a pair of measurements: total dollar sales and number of books sold. Find the arrays x̄, S_n, and R.

Since there are four receipts, we have a total of four measurements (observations) on each variable. The sample means are

$$\bar{x}_1 = \frac{1}{4}\sum_{j=1}^{4} x_{j1} = \frac{1}{4}(42 + 52 + 48 + 58) = 50$$

$$\bar{x}_2 = \frac{1}{4}\sum_{j=1}^{4} x_{j2} = \frac{1}{4}(4 + 5 + 4 + 3) = 4$$

The sample variances and covariances are

$$s_{11} = \frac{1}{4}\sum_{j=1}^{4} (x_{j1} - \bar{x}_1)^2 = \frac{1}{4}\bigl[(42 - 50)^2 + (52 - 50)^2 + (48 - 50)^2 + (58 - 50)^2\bigr] = 34$$

$$s_{22} = \frac{1}{4}\sum_{j=1}^{4} (x_{j2} - \bar{x}_2)^2 = \frac{1}{4}\bigl[(4 - 4)^2 + (5 - 4)^2 + (4 - 4)^2 + (3 - 4)^2\bigr] = .5$$

$$s_{12} = \frac{1}{4}\sum_{j=1}^{4} (x_{j1} - \bar{x}_1)(x_{j2} - \bar{x}_2) = \frac{1}{4}\bigl[(42 - 50)(4 - 4) + (52 - 50)(5 - 4) + (48 - 50)(4 - 4) + (58 - 50)(3 - 4)\bigr] = -1.5$$

and

$$S_n = \begin{bmatrix} 34 & -1.5 \\ -1.5 & .5 \end{bmatrix}$$
The sample correlation is

$$r_{12} = \frac{s_{12}}{\sqrt{s_{11}}\sqrt{s_{22}}} = \frac{-1.5}{\sqrt{34}\sqrt{.5}} = -.36$$

and r_21 = r_12, so

$$R = \begin{bmatrix} 1 & -.36 \\ -.36 & 1 \end{bmatrix}$$ •

Graphical Techniques
Plots are important, but frequently neglected, aids in data analysis. Although it is impossible to simultaneously plot all the measurements made on several variables and study the configurations, plots of individual variables and plots of pairs of variables can still be very informative. Sophisticated computer programs and display equipment allow one the luxury of visually examining data in one, two, or three dimensions with relative ease. On the other hand, many valuable insights can be obtained from the data by constructing plots with paper and pencil. Simple, yet elegant and effective, methods for displaying data are available in [29].

It is good statistical practice to plot pairs of variables and visually inspect the pattern of association. Consider, then, the following seven pairs of measurements on two variables:

Variable 1 (x1):   3    4    2    6    8    2    5
Variable 2 (x2):   5   5.5   4    7   10    5   7.5

These data are plotted as seven points in two dimensions (each axis representing a variable) in Figure 1.1. The coordinates of the points are determined by the paired measurements: (3, 5), (4, 5.5), ..., (5, 7.5). The resulting two-dimensional plot is known as a scatter diagram or scatter plot.
Figure 1.1 A scatter plot and marginal dot diagrams.
Figure 1.2 Scatter plot and dot diagrams for rearranged data.
Also shown in Figure 1.1 are separate plots of the observed values of variable 1 and the observed values of variable 2, respectively. These plots are called (marginal) dot diagrams. They can be obtained from the original observations or by projecting the points in the scatter diagram onto each coordinate axis.

The information contained in the single-variable dot diagrams can be used to calculate the sample means x̄1 and x̄2 and the sample variances s11 and s22. (See Exercise 1.1.) The scatter diagram indicates the orientation of the points, and their coordinates can be used to calculate the sample covariance s12. In the scatter diagram of Figure 1.1, large values of x1 occur with large values of x2 and small values of x1 with small values of x2. Hence, s12 will be positive.

Dot diagrams and scatter plots contain different kinds of information. The information in the marginal dot diagrams is not sufficient for constructing the scatter plot. As an illustration, suppose the data preceding Figure 1.1 had been paired differently, so that the measurements on the variables x1 and x2 were as follows:

Variable 1 (x1):   5    4    6    2    2    8    3
Variable 2 (x2):   5   5.5   4    7   10    5   7.5
(We have simply rearranged the values of variable 1.) The scatter and dot diagrams for the "new" data are shown in Figure 1.2. Comparing Figures 1.1 and 1.2, we find that the marginal dot diagrams are the same, but that the scatter diagrams are decidedly different. In Figure 1.2, large values of x1 are paired with small values of x2 and small values of x1 with large values of x2. Consequently, the descriptive statistics for the individual variables x̄1, x̄2, s11, and s22 remain unchanged, but the sample covariance s12, which measures the association between pairs of variables, will now be negative.

The different orientations of the data in Figures 1.1 and 1.2 are not discernible from the marginal dot diagrams alone. At the same time, the fact that the marginal dot diagrams are the same in the two cases is not immediately apparent from the scatter plots. The two types of graphical procedures complement one another; they are not competitors.

The next two examples further illustrate the information that can be conveyed by a graphic display.
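The point about rearranged pairings can be checked numerically with the seven pairs given above; this sketch is illustrative and not part of the original text. The marginal statistics agree for both pairings, yet the covariance changes sign.

```python
import numpy as np

x1 = np.array([3, 4, 2, 6, 8, 2, 5], dtype=float)       # original pairing
x1_new = np.array([5, 4, 6, 2, 2, 8, 3], dtype=float)   # rearranged values of variable 1
x2 = np.array([5, 5.5, 4, 7, 10, 5, 7.5])

def s12(a, b):
    # sample covariance with divisor n, as in formula (1-4)
    return ((a - a.mean()) * (b - b.mean())).mean()

print(x1.mean() == x1_new.mean())   # True: the marginal dot diagrams are identical
print(s12(x1, x2) > 0)              # True: positive association for the original pairing
print(s12(x1_new, x2) < 0)          # True: negative association after rearranging
```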
Figure 1.3 Profits per employee and number of employees for 16 publishing firms.

Example 1.3 (The effect of unusual observations on sample correlations)
Some financial data representing jobs and productivity for the 16 largest publishing firms appeared in an article in Forbes magazine on April 30, 1990. The data for the pair of variables x1 = employees (jobs) and x2 = profits per employee (productivity) are graphed in Figure 1.3. We have labeled two "unusual" observations. Dun & Bradstreet is the largest firm in terms of number of employees, but is "typical" in terms of profits per employee. Time Warner has a "typical" number of employees, but comparatively small (negative) profits per employee.

The sample correlation coefficient computed from the values of x1 and x2 is

-.39 for all 16 firms
-.56 for all firms but Dun & Bradstreet
-.39 for all firms but Time Warner
-.50 for all firms but Dun & Bradstreet and Time Warner

It is clear that atypical observations can have a considerable effect on the sample correlation coefficient. •

Example 1.4
(A scatter plot for baseball data)

In a July 17, 1978, article on money in sports, Sports Illustrated magazine provided data on x1 = player payroll for National League East baseball teams. We have added data on x2 = won-lost percentage for 1977. The results are given in Table 1.1.

The scatter plot in Figure 1.4 supports the claim that a championship team can be bought. Of course, this cause-effect relationship cannot be substantiated, because the experiment did not include a random assignment of payrolls. Thus, statistics cannot answer the question: Could the Mets have won with $4 million to spend on player salaries? •
TABLE 1.1 1977 SALARY AND FINAL RECORD FOR THE NATIONAL LEAGUE EAST

Team                     x1 = player payroll    x2 = won-lost percentage
Philadelphia Phillies         3,497,900                  .623
Pittsburgh Pirates            2,485,475                  .593
St. Louis Cardinals           1,782,875                  .512
Chicago Cubs                  1,725,450                  .500
Montreal Expos                1,645,575                  .463
New York Mets                 1,469,800                  .395

Figure 1.4 Salaries and won-lost percentage from Table 1.1.
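The positive association visible in Figure 1.4 can be quantified with the sample correlation of the Table 1.1 data; the sketch below is illustrative only, and the exact value is not reported in the text.

```python
import numpy as np

payroll = np.array([3497900, 2485475, 1782875, 1725450, 1645575, 1469800], dtype=float)
pct = np.array([.623, .593, .512, .500, .463, .395])   # won-lost percentages, Table 1.1

r = np.corrcoef(payroll, pct)[0, 1]
print(r > 0)   # True: higher payrolls go with higher won-lost percentages
```

With only six teams, a single unusual franchise could move this correlation considerably, as Example 1.3 warns.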
To construct the scatter plot in Figure 1.4, we have regarded the six paired observations in Table 1.1 as the coordinates of six points in two-dimensional space. The figure allows us to examine visually the grouping of teams with respect to the variables total payroll and won-lost percentage.

Example 1.5
(Multiple scatter plots for paper strength measurements)

Paper is manufactured in continuous sheets several feet wide. Because of the orientation of fibers within the paper, it has a different strength when measured in the direction produced by the machine than when measured across, or at right angles to, the machine direction. Table 1.2 shows the measured values of

x1 = density (grams/cubic centimeter)
x2 = strength (pounds) in the machine direction
x3 = strength (pounds) in the cross direction

A novel graphic presentation of these data appears in Figure 1.5, page 16. The scatter plots are arranged as the off-diagonal elements of a covariance array and box plots as the diagonal elements. The latter are on a different scale with this software, so we use only the overall shape to provide
TABLE 1.2 PAPER-QUALITY MEASUREMENTS

                        Strength
Specimen   Density   Machine direction   Cross direction
    1       .801         121.41              70.42
    2       .824         127.70              72.47
    3       .841         129.20              78.20
    4       .816         131.80              74.89
    5       .840         135.10              71.21
    6       .842         131.50              78.39
    7       .820         126.70              69.02
    8       .802         115.10              73.10
    9       .828         130.80              79.28
   10       .819         124.60              76.48
   11       .826         118.31              70.25
   12       .802         114.20              72.88
   13       .810         120.30              68.23
   14       .802         115.70              68.12
   15       .832         117.51              71.62
   16       .796         109.81              53.10
   17       .759         109.10              50.85
   18       .770         115.10              51.68
   19       .759         118.31              50.60
   20       .772         112.60              53.51
   21       .806         116.20              56.53
   22       .803         118.00              70.70
   23       .845         131.00              74.35
   24       .822         125.70              68.29
   26       .816         125.80              70.64
   27       .836         125.50              76.33
   28       .815         127.80              76.75
   29       .822         130.50              80.33
   30       .822         127.90              75.68
   31       .843         123.90              78.54
   32       .824         124.10              71.91
   33       .788         120.80              68.22
   34       .782         107.40              54.42
   35       .795         120.70              70.41
   36       .805         121.91              73.68
   37       .836         122.31              74.93
   38       .788         110.60              53.52
   39       .772         103.51              48.93
   40       .776         110.71              53.67
   41       .758         113.80              52.42

Source: Data courtesy of SONOCO Products Company.
Figure 1.5 Scatter plots and boxplots of paper-quality data from Table 1.2.
information on symmetry and possible outliers for each individual characteristic. The scatter plots can be inspected for patterns and unusual observations. In Figure 1.5, there is one unusual observation: the density of specimen 25. Some of the scatter plots have patterns suggesting that there are two separate clumps of observations.

These scatter plot arrays are further pursued in our discussion of new software graphics in the next section. •

In the general multiresponse situation, p variables are simultaneously recorded on n items. Scatter plots should be made for pairs of important variables and, if the task is not too great to warrant the effort, for all pairs.

Limited as we are to a three-dimensional world, we cannot always picture an entire set of data. However, two further geometric representations of the data provide an important conceptual framework for viewing multivariable statistical methods. In cases where it is possible to capture the essence of the data in three dimensions, these representations can actually be graphed.
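The layout of Figure 1.5 (scatter plots off the diagonal, distribution summaries on the diagonal) can be mimicked numerically. The sketch below, which is not part of the text, uses hypothetical simulated data as a stand-in; each off-diagonal cell records a pairwise correlation, and each diagonal cell records a boxplot-style summary of one variable.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(41, 3))          # hypothetical stand-in for the Table 1.2 data
p = X.shape[1]

cells = {}
for i in range(p):
    for k in range(p):
        if i == k:
            col = X[:, i]             # diagonal cell: summary, like a boxplot
            cells[(i, k)] = (col.min(), np.median(col), col.max())
        else:                         # off-diagonal cell: pairwise association
            cells[(i, k)] = np.corrcoef(X[:, i], X[:, k])[0, 1]

print(len(cells))                     # 9 cells in the 3 x 3 array
print(np.isclose(cells[(0, 1)], cells[(1, 0)]))   # symmetric, like Figure 1.5
```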
n Points in p Dimensions (p-Dimensional Scatter Plot). Consider the natural extension of the scatter plot to p dimensions, where the p measurements on the jth item represent the coordinates of a point in p-dimensional space. The coordinate axes are taken to correspond to the variables, so that the jth point is x_j1 units along the first axis, x_j2 units along the second, ..., x_jp units along the pth axis. The resulting plot with n points not only will exhibit the overall pattern of variability, but also will show similarities (and differences) among the n items. Groupings of items will manifest themselves in this representation. The next example illustrates a three-dimensional scatter plot.

Example 1.6
(Looking for lower-dimensional structure)

A zoologist obtained measurements on n = 25 lizards known scientifically as Cophosaurus texanus. The weight, or mass, is given in grams while the snout-vent length (SVL) and hind limb span (HLS) are given in millimeters. The data are displayed in Table 1.3.

TABLE 1.3
LIZARD SIZE DATA

Lizard    Mass     SVL     HLS       Lizard    Mass     SVL     HLS
   1      5.526    59.0    113.5       14     10.067    73.0    136.5
   2     10.401    75.0    142.0       15     10.091    73.0    135.5
   3      9.213    69.0    124.0       16     10.888    77.0    139.0
   4      8.953    67.5    125.0       17      7.610    61.5    118.0
   5      7.063    62.0    129.5       18      7.733    66.5    133.5
   6      6.610    62.0    123.0       19     12.015    79.5    150.0
   7     11.273    74.0    140.0       20     10.049    74.0    137.0
   8      2.447    47.0     97.0       21      5.149    59.5    116.0
   9     15.493    86.5    162.0       22      9.158    68.0    123.0
  10      9.004    69.0    126.5       23     12.132    75.0    141.0
  11      8.199    70.5    136.0       24      6.978    66.5    117.0
  12      6.601    64.5    116.0       25      6.890    63.0    117.0
  13      7.622    67.5    135.0

Source: Data courtesy of Kevin E. Bonine.
Although there are three size measurements, we can ask whether or not most of the variation is primarily restricted to two dimensions or even to one dimension.

To help answer questions regarding reduced dimensionality, we construct the three-dimensional scatter plot in Figure 1.6. Clearly most of the variation is scatter about a one-dimensional straight line. Knowing the position on a line along the major axes of the cloud of points would be almost as good as knowing the three measurements Mass, SVL, and HLS.

However, this kind of analysis can be misleading if one variable has a much larger variance than the others. Consequently, we first calculate the standardized values, z_jk = (x_jk - x̄_k)/√s_kk, so the variables contribute equally to
Figure 1.6 3D scatter plot of lizard data from Table 1.3.

Figure 1.7 3D scatter plot of standardized lizard data.
the variation in the scatter plot. Figure 1.7 gives the three-dimensional scatter plot for the standardized variables. Most of the variation can be explained by a single variable determined by a line through the cloud of points. •

A three-dimensional scatter plot can often reveal group structure.

Example 1.7
(Looking for group structure in three dimensions)

Referring to Example 1.6, it is interesting to see if male and female lizards occupy different parts of the three-dimensional space containing the size data. The genders, by row, for the lizard data in Table 1.3 are

f m f f m f m f m f m f m m m m f m m m f f m f f

Figure 1.8 repeats the scatter plot for the original variables but with males marked by solid circles and females by open circles. Clearly, males are typically larger than females. •
Figure 1.8 3D scatter plot of male and female lizards.
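The standardization used in Example 1.6, z_jk = (x_jk - x̄_k)/√s_kk, can be sketched as follows on the first five rows of Table 1.3; this code is illustrative only and not part of the text.

```python
import numpy as np

# First five lizards from Table 1.3: columns are Mass, SVL, HLS
X = np.array([[ 5.526, 59.0, 113.5],
              [10.401, 75.0, 142.0],
              [ 9.213, 69.0, 124.0],
              [ 8.953, 67.5, 125.0],
              [ 7.063, 62.0, 129.5]])

# z_jk = (x_jk - xbar_k) / sqrt(s_kk); np.std defaults to divisor n, matching s_kk
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(Z.mean(axis=0), 0))   # True: each column is centered at zero
print(np.allclose(Z.std(axis=0), 1))    # True: each column is in standard deviation units
```

After standardization, each variable contributes equally to the spread of the point cloud, which is what makes Figure 1.7 more trustworthy than Figure 1.6 for judging dimensionality.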
Points in n Dimensions. The n observations of the p variables can also be regarded as p points in n-dimensional space. Each column of X determines one of the points. The ith column,

$$\begin{bmatrix} x_{1i} \\ x_{2i} \\ \vdots \\ x_{ni} \end{bmatrix}$$

consisting of all n measurements on the ith variable, determines the ith point.

In Chapter 3, we show how the closeness of points in n dimensions can be related to measures of association between the corresponding variables.
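As a small preview of that Chapter 3 connection, offered here only as a sketch: the cosine of the angle between two mean-centered columns of X equals their sample correlation, a standard identity the text develops later.

```python
import numpy as np

X = np.array([[42, 4], [52, 5], [48, 4], [58, 3]], dtype=float)

d1 = X[:, 0] - X[:, 0].mean()      # first column as a centered vector in n = 4 dimensions
d2 = X[:, 1] - X[:, 1].mean()      # second column, also centered

cos_angle = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
r12 = np.corrcoef(X[:, 0], X[:, 1])[0, 1]

print(np.isclose(cos_angle, r12))  # True: the angle between columns encodes r_12
```

Two nearly parallel centered columns thus correspond to strongly correlated variables, and perpendicular columns to uncorrelated ones.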
1.4 DATA DISPLAYS AND PICTORIAL REPRESENTATIONS

The rapid development of powerful personal computers and workstations has led to a proliferation of sophisticated statistical software for data analysis and graphics. It is often possible, for example, to sit at one's desk and examine the nature of multidimensional data with clever computer-generated pictures. These pictures are valuable aids in understanding data and often prevent many false starts and subsequent inferential problems.

As we shall see in Chapters 8 and 12, there are several techniques that seek to represent p-dimensional observations in few dimensions such that the original distances (or similarities) between pairs of observations are (nearly) preserved. In general, if multidimensional observations can be represented in two dimensions, then outliers, relationships, and distinguishable groupings can often be discerned by eye. We shall discuss and illustrate several methods for displaying multivariate data in two dimensions. One good source for more discussion of graphical methods is [12].
Linking Multiple Two-Dimensional Scatter Plots

One of the more exciting new graphical procedures involves electronically connecting many two-dimensional scatter plots.

Example 1.8
(Linked scatter plots and brushing)

To illustrate linked two-dimensional scatter plots, we refer to the paper-quality data in Table 1.2. These data represent measurements on the variables x1 = density, x2 = strength in the machine direction, and x3 = strength in the cross direction. Figure 1.9 shows two-dimensional scatter plots for pairs of these variables organized as a 3 × 3 array. For example, the picture in the upper left-hand corner of the figure is a scatter plot of the pairs of observations (x1, x3). That is, the x1 values are plotted along the horizontal axis, and the x3 values are plotted along the vertical axis. The lower right-hand corner of the figure contains a scatter plot of the observations (x3, x1). That is, the axes are reversed. Corresponding interpretations hold for the other scatter plots in the figure. Notice that the variables and their three-digit ranges are indicated in the boxes along the SW-NE diagonal.

The operation of marking (selecting) the obvious outlier in the (x1, x3) scatter plot of Figure 1.9 creates Figure 1.10(a), where the outlier is labeled as specimen 25 and the same data point is highlighted in all the scatter plots. Specimen 25 also appears to be an outlier in the (x1, x2) scatter plot but not in the (x2, x3) scatter plot. The operation of deleting this specimen leads to the modified scatter plots of Figure 1.10(b).

From Figure 1.10, we notice that some points in, for example, the (x2, x3) scatter plot seem to be disconnected from the others. Selecting these
Figure 1.9 Scatter plots for the paper-quality data of Table 1.2.
Section 1 .4 . •I
... . ..
.. ..:- · "'
. . ., .., .
:.
.
.
.
.
. .. � -:. .. . .. .
.
,
-
( x3 )
Cross
.
( x2 )
Machine
..
.
.
.,
.
.
.
.
.
.
'·
.. .
,
·: 25 • • ..
'
25
. . . . I' ··�· : . . ·· ' . · . : . . . '
.
.
25
Density
.
I I ... .
. . ..
.
. 97 1
. 758
.
48.9
1 04
21
80.3
135
25
(xi )
,
• J • •. . .
.
.
. . ..
I , _ • : · 25 - \
.
.
.
. r . . .. . .
.
..
25
··'. · '.
.
.
Data D isp lays a n d Pictoria l Representations
.
.
. ..
. . . .:
. ·.. ·. .,
' •.'- ..\ . .
.
"'
.
(a)
"'
. •I
..· : ..
. . . ... . . . ., . .
:-' . .
.
..
.
.
.
.
.
. r . . .. . .
.
I
- \
... .
.. -
,
. . ..
,
80.3 .
.. �
( x3 )
Cross
• J • •. . .
48.9
.
135
.-=· . .. . .
··'. · '.
.
.
,
.
( x2 )
Machine
.
. .
1 04
. . ..
.
.,
.
.· . .:
..
,
• I I • ... .
'
�· .
'
.97 1
(xi )
Density
. 758
.
'
. •• 1 . . .. . . . . .
. . . . ' . •••,.. I .
;··
(b)
. . .. ..
�·
.
. ·.. ·. .,
' •.'- ..\ . . . ..
Figure 1 . 1 0 Mod ified scatter pl ots for the paper-q u a l ity d ata with outl ier (25) (a) sel ected and (b) deleted.
points, using the (dashed) rectangle (see page 23), highlights the selected points in all of the other scatter plots and leads to the display in Figure 1.11(a). Further checking revealed that specimens 16-21, specimen 34, and specimens 38-41 were actually specimens from an older roll of paper that was included in order to have enough plies in the cardboard being manufactured. Deleting the outlier and the cases corresponding to the older paper and adjusting the ranges of the remaining observations leads to the scatter plots in Figure 1.11(b).

The operation of highlighting points corresponding to a selected range of one of the variables is called brushing. Brushing could begin with a rectangle, as in Figure 1.11(a), but then the brush could be moved to provide a sequence of highlighted points. The process can be stopped at any time to provide a snapshot of the current situation. •

Scatter plots like those in Example 1.8 are extremely useful aids in data analysis. Another important new graphical technique uses software that allows the data analyst to view high-dimensional data as slices of various three-dimensional perspectives. This can be done dynamically and continuously until informative views are obtained. A comprehensive discussion of dynamic graphical methods is available in [1]. A strategy for on-line multivariate exploratory graphical analysis, motivated by the need for a routine procedure for searching for structure in multivariate data, is given in [32].

Example 1.9 (Rotated plots in three dimensions)
Four different measurements of lumber stiffness are given in Table 4.3, page 187. In Example 4.14, specimen (board) 16 and possibly specimen (board) 9 are identified as unusual observations. Figures 1.12(a), (b), and (c) contain perspectives of the stiffness data in the x1, x2, x3 space. These views were obtained by continually rotating and turning the three-dimensional coordinate axes. Spinning the coordinate axes allows one to get a better understanding of the three-dimensional aspects of the data. Figure 1.12(d) gives one picture of the stiffness data in x2, x3, x4 space. Notice that Figures 1.12(a) and (d) visually confirm specimens 9 and 16 as outliers. Specimen 9 is very large in all three coordinates. A counterclockwise-like rotation of the axes in Figure 1.12(a) produces Figure 1.12(b), and the two unusual observations are masked in this view. A further spinning of the x2, x3 axes gives Figure 1.12(c); one of the outliers (16) is now hidden.

Additional insights can sometimes be gleaned from visual inspection of the slowly spinning data. It is this dynamic aspect that statisticians are just beginning to understand and exploit. •

Plots like those in Figure 1.12 allow one to identify readily observations that do not conform to the rest of the data and that may heavily influence inferences based on standard data-generating models.
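The selection step behind the brushing operation of Example 1.8 can be sketched in a few lines. The data values and the function name below are hypothetical; a real linked-display system would also redraw every panel with the selected cases highlighted.

```python
def brush(cases, var_x, var_y, xlim, ylim):
    # return the indices of the cases whose (var_x, var_y) values fall
    # inside the brushing rectangle xlim x ylim
    return [i for i, row in enumerate(cases)
            if xlim[0] <= row[var_x] <= xlim[1]
            and ylim[0] <= row[var_y] <= ylim[1]]

# toy (x1, x2, x3) specimens; the last one plays the role of specimen 25
cases = [(0.80, 120, 70.0), (0.81, 118, 72.0),
         (0.79, 122, 69.0), (0.97, 135, 48.9)]
print(brush(cases, 0, 2, (0.90, 1.00), (40.0, 60.0)))  # → [3]
```

Moving the rectangle and recomputing the index set yields the sequence of highlighted points described in the text.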
Figure 1.11 Modified scatter plots with (a) group of points selected and (b) points, including specimen 25, deleted and the scatter plots rescaled.
Figure 1.12 Three-dimensional perspectives for the lumber stiffness data: (a) outliers 9 and 16 clear; (b) outliers masked; (c) outlier 16 hidden; (d) a good view of x2, x3, x4 space, with specimen 9 large.
Graphs of Growth Curves

When the height of a young child is measured at each birthday, the points can be plotted and then connected by lines to produce a graph. This is an example of a growth curve. In general, repeated measurements of the same characteristic on the same unit or subject can give rise to a growth curve if an increasing, or decreasing, or even an increasing followed by a decreasing pattern is expected.

Example 1.10 (Arrays of growth curves)
The Alaska Fish and Game Department monitors grizzly bears with the goal of maintaining a healthy population. Bears are shot with a dart to induce sleep and weighed on a scale hanging from a tripod. Measurements of length are taken with a steel tape. Table 1.4 gives the weights (wt) in kilograms and lengths (lngth) in centimeters of seven female bears at 2, 3, 4, and 5 years of age.

First, for each bear, we plot the weights versus the ages and then connect the weights at successive years by straight lines. This gives an approximation to the growth curve for weight. Figure 1.13 shows the growth curves for all seven bears. The noticeable exception to a common pattern is the curve for bear 5. Is this an outlier or just natural variation in the population? In the field, bears are weighed on a scale that reads pounds. Further inspection revealed that, in this case, an assistant later failed to convert the field readings to kilograms when creating the electronic database. The correct weights are (45, 66, 84, 112) kilograms.

Because it can be difficult to inspect visually the individual growth curves in a combined plot, the individual curves should be replotted in an array where
TABLE 1.4 FEMALE BEAR DATA

Bear  Wt 2  Wt 3  Wt 4  Wt 5  Lngth 2  Lngth 3  Lngth 4  Lngth 5
  1    48    59    95    82     141      157      168      183
  2    59    68   102   102     140      168      174      170
  3    61    77    93   107     145      162      172      177
  4    54    43   104   104     146      159      176      171
  5   100   145   185   247     150      158      168      175
  6    68    82    95   118     142      140      178      189
  7    68    95   109   111     139      171      176      175

Source: Data courtesy of H. Roberts.
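As a quick arithmetic check of the explanation in Example 1.10, converting bear 5's recorded values from pounds to kilograms recovers the corrected weights:

```python
LB_PER_KG = 2.20462  # pounds per kilogram

weights_lb = (100, 145, 185, 247)  # bear 5's row, actually in pounds
weights_kg = [round(w / LB_PER_KG) for w in weights_lb]
print(weights_kg)  # → [45, 66, 84, 112]
```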
Figure 1.13 Combined growth curves for weight for seven female grizzly bears.
similarities and differences are easily observed. Figure 1.14 gives the array of seven curves for weight. Some growth curves look linear and others quadratic. Figure 1.15 gives a growth curve array for length. One bear seemed to get shorter from 2 to 3 years old, but the researcher knows that the steel tape measurement of length can be thrown off by the bear's posture when sedated. •

We now turn to two popular pictorial representations of multivariate data in two dimensions: stars and Chernoff faces.

Stars

Suppose each data unit consists of nonnegative observations on p ≥ 2 variables. In two dimensions, we can construct circles of a fixed (reference) radius with p equally spaced rays emanating from the center of the circle. The lengths of the rays represent the values of the variables. The ends of the rays can be connected with straight lines to form a star. Each star represents a multivariate observation, and the stars can be grouped according to their (subjective) similarities.
Figure 1.14 Individual growth curves for weight for female grizzly bears.
Figure 1.15 Individual growth curves for length for female grizzly bears.
It is often helpful, when constructing the stars, to standardize the observations. In this case some of the observations will be negative. The observations can then be reexpressed so that the center of the circle represents the smallest standardized observation within the entire data set.

Example 1.11 (Utility data as stars)
Stars representing the first 5 of the 22 public utility firms in Table 12.5, page 683, are shown in Figure 1.16. There are eight variables; consequently, the stars are distorted octagons.

Figure 1.16 Stars for the first five public utilities: Arizona Public Service (1), Boston Edison Co. (2), Central Louisiana Electric Co. (3), Commonwealth Edison Co. (4), Consolidated Edison Co. (NY) (5).
The observations on all variables were standardized. Among the first five utilities, the smallest standardized observation for any variable was -1.6. Treating this value as zero, the variables are plotted on identical scales along eight equiangular rays originating from the center of the circle. The variables are ordered in a clockwise direction, beginning in the 12 o'clock position.

At first glance, none of these utilities appears to be similar to any other. However, because of the way the stars are constructed, each variable gets equal weight in the visual impression. If we concentrate on the variables 6 (sales in kilowatt-hour [kWh] use per year) and 8 (total fuel costs in cents per kWh), then Boston Edison and Consolidated Edison are similar (small variable 6, large variable 8), and Arizona Public Service, Central Louisiana Electric, and Commonwealth Edison are similar (moderate variable 6, moderate variable 8). •

Chernoff Faces

People react to faces. Chernoff [6] suggested representing p-dimensional observations as a two-dimensional face whose characteristics (face shape, mouth curvature, nose length, eye size, pupil position, and so forth) are determined by the measurements on the p variables.

As originally designed, Chernoff faces can handle up to 18 variables. The assignment of variables to facial features is done by the experimenter, and different choices produce different results. Some iteration is usually necessary before satisfactory representations are achieved. Chernoff faces appear to be most useful for verifying (1) an initial grouping suggested by subject-matter knowledge and intuition or (2) final groupings produced by clustering algorithms.

Example 1.12 (Utility data as Chernoff faces)
From the data in Table 12.5, the 22 public utility companies were represented as Chernoff faces. We have the following correspondences:

Variable                                   Facial characteristic
X1: Fixed-charge coverage              →   Half-height of face
X2: Rate of return on capital          →   Face width
X3: Cost per kW capacity in place      →   Position of center of mouth
X4: Annual load factor                 →   Slant of eyes
X5: Peak kWh demand growth from 1974   →   Eccentricity (height/width) of eyes
X6: Sales (kWh use per year)           →   Half-length of eye
X7: Percent nuclear                    →   Curvature of mouth
X8: Total fuel costs (cents per kWh)   →   Length of nose

The Chernoff faces are shown in Figure 1.17. We have subjectively grouped "similar" faces into seven clusters. If a smaller number of clusters is desired, we might combine clusters 5, 6, and 7 and, perhaps, clusters 2 and 3 to
Figure 1.17 Chernoff faces for 22 public utilities, grouped into seven clusters.
obtain four or five clusters. For our assignment of variables to facial features, the firms group largely according to geographical location. •

Constructing Chernoff faces is a task that must be done with the aid of a computer. The data are ordinarily standardized within the computer program as part of the process for determining the locations, sizes, and orientations of the facial characteristics. With some training, we can use Chernoff faces to communicate similarities or dissimilarities, as the next example indicates.

Example 1.13 (Using Chernoff faces to show changes over time)

Figure 1.18 illustrates an additional use of Chernoff faces. (See [24].) In the figure, the faces are used to track the financial well-being of a company over time. As indicated, each facial feature represents a single financial indicator, and the longitudinal changes in these indicators are thus evident at a glance. •
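Both the star and Chernoff-face constructions begin by standardizing each variable and, for stars, shifting the values to a nonnegative scale. A minimal sketch of that preprocessing, using made-up toy data:

```python
import statistics

# toy data: rows are units, columns are variables (values are made up)
rows = [(1.0, 10.0), (2.0, 30.0), (3.0, 20.0), (4.0, 40.0)]

# standardize each variable: subtract its mean and divide by its sample
# standard deviation, putting all variables on a comparable scale
zcols = []
for col in zip(*rows):
    m, s = statistics.mean(col), statistics.stdev(col)
    zcols.append([(x - m) / s for x in col])
zrows = list(zip(*zcols))

# for stars, shift so the smallest standardized value in the whole data
# set maps to the center of the circle (ray length zero)
low = min(z for row in zrows for z in row)
ray_lengths = [[z - low for z in row] for row in zrows]
print(min(min(r) for r in ray_lengths))  # → 0.0
```

A plotting routine would then draw each unit's ray lengths on equiangular rays, or map the standardized values to facial parameters.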
Figure 1.18 Chernoff faces over time (financial indicators such as liquidity and profitability, 1975-1979).
Chernoff faces have also been used to display differences in multivariate observations in two dimensions. For example, the two-dimensional coordinate axes might represent latitude and longitude (geographical location), and the faces might represent multivariate measurements on several U.S. cities. Additional examples of this kind are discussed in [30].

There are several ingenious ways to picture multivariate data in two dimensions. We have described some of them. Further advances are possible and will almost certainly take advantage of improved computer graphics.
1.5 DISTANCE

Although they may at first appear formidable, most multivariate techniques are based upon the simple concept of distance. Straight-line, or Euclidean, distance should be familiar. If we consider the point P = (x1, x2) in the plane, the straight-line distance, d(O, P), from P to the origin O = (0, 0) is, according to the Pythagorean theorem,

d(O, P) = √(x1² + x2²)     (1-9)

The situation is illustrated in Figure 1.19. In general, if the point P has p coordinates so that P = (x1, x2, ..., xp), the straight-line distance from P to the origin O = (0, 0, ..., 0) is

d(O, P) = √(x1² + x2² + ··· + xp²)     (1-10)

Figure 1.19 Distance given by the Pythagorean theorem.
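Equations (1-9) and (1-10) translate directly into code; a minimal sketch (the function name is ours):

```python
import math

def euclidean_distance(p, q=None):
    # straight-line distance (1-10); with q omitted, the distance of
    # point p from the origin O = (0, 0, ..., 0)
    if q is None:
        q = (0.0,) * len(p)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

print(euclidean_distance((3.0, 4.0)))       # → 5.0
print(euclidean_distance((1.0, 2.0, 2.0)))  # → 3.0
```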
(See Chapter 2.) All points (x1, x2, ..., xp) that lie a constant squared distance, such as c², from the origin satisfy the equation

x1² + x2² + ··· + xp² = c²     (1-11)

Because this is the equation of a hypersphere (a circle if p = 2), points equidistant from the origin lie on a hypersphere.

The straight-line distance between two arbitrary points P and Q with coordinates P = (x1, x2, ..., xp) and Q = (y1, y2, ..., yp) is given by

d(P, Q) = √((x1 - y1)² + (x2 - y2)² + ··· + (xp - yp)²)     (1-12)

Straight-line, or Euclidean, distance is unsatisfactory for most statistical purposes. This is because each coordinate contributes equally to the calculation of Euclidean distance. When the coordinates represent measurements that are subject to random fluctuations of differing magnitudes, it is often desirable to weight coordinates subject to a great deal of variability less heavily than those that are not highly variable. This suggests a different measure of distance.

Our purpose now is to develop a "statistical" distance that accounts for differences in variation and, in due course, the presence of correlation. Because our choice will depend upon the sample variances and covariances, at this point we use the term statistical distance to distinguish it from ordinary Euclidean distance. It is statistical distance that is fundamental to multivariate analysis.

To begin, we take as fixed the set of observations graphed as the p-dimensional scatter plot. From these, we shall construct a measure of distance from the origin to a point P = (x1, x2, ..., xp). In our arguments, the coordinates (x1, x2, ..., xp) of P can vary to produce different locations for the point. The data that determine distance will, however, remain fixed.

To illustrate, suppose we have n pairs of measurements on two variables each having mean zero. Call the variables x1 and x2, and assume that the x1 measurements vary independently of the x2 measurements.¹ In addition, assume that the variability in the x1 measurements is larger than the variability in the x2 measurements. A scatter plot of the data would look something like the one pictured in Figure 1.20.
Figure 1.20 A scatter plot with greater variability in the x1 direction than in the x2 direction.

¹ At this point, "independently" means that the x2 measurements cannot be predicted with any accuracy from the x1 measurements, and vice versa.
Glancing at Figure 1.20, we see that values which are a given deviation from the origin in the x1 direction are not as "surprising" or "unusual" as are values equidistant from the origin in the x2 direction. This is because the inherent variability in the x1 direction is greater than the variability in the x2 direction. Consequently, large x1 coordinates (in absolute value) are not as unexpected as large x2 coordinates. It seems reasonable, then, to weight an x2 coordinate more heavily than an x1 coordinate of the same value when computing the "distance" to the origin.

One way to proceed is to divide each coordinate by the sample standard deviation. Therefore, upon division by the standard deviations, we have the "standardized" coordinates x1* = x1/√s11 and x2* = x2/√s22. The standardized coordinates are now on an equal footing with one another. After taking the differences in variability into account, we determine distance using the standard Euclidean formula.

Thus, a statistical distance of the point P = (x1, x2) from the origin O = (0, 0) can be computed from its standardized coordinates x1* = x1/√s11 and x2* = x2/√s22 as

d(O, P) = √((x1*)² + (x2*)²) = √(x1²/s11 + x2²/s22)     (1-13)
2 � S1 1
+
2 � = c2 S22
(1-14)
Equation (1 -14) is the equation of an ellipse centered at the origin whose major and minor axes coincide with the coordinate axes. That is, the statistical distance in (1-13) has an ellipse as the locus of all points a constant distance from the origin. This gen eral case is shown in Figure 1 .21 .
Figure 1.21 The ellipse of constant statistical distance d²(O, P) = x1²/s11 + x2²/s22 = c².
Example 1.14 (Calculating a statistical distance)

A set of paired measurements (x1, x2) on two variables yields x̄1 = x̄2 = 0, s11 = 4, and s22 = 1. Suppose the x1 measurements are unrelated to the x2 measurements; that is, measurements within a pair vary independently of one another. Since the sample variances are unequal, we measure the square of the distance of an arbitrary point P = (x1, x2) to the origin O = (0, 0) by

d²(O, P) = x1²/4 + x2²/1

All points (x1, x2) that are a constant distance 1 from the origin satisfy the equation

x1²/4 + x2²/1 = 1

The coordinates of some points a unit distance from the origin are presented in the following table:

Coordinates: (x1, x2)    Distance: x1²/4 + x2²/1 = 1
(0, 1)                   0²/4 + 1²/1 = 1
(0, -1)                  0²/4 + (-1)²/1 = 1
(2, 0)                   2²/4 + 0²/1 = 1
(1, √3/2)                1²/4 + (√3/2)²/1 = 1

A plot of the equation x1²/4 + x2²/1 = 1 is an ellipse centered at (0, 0) whose major axis lies along the x1 coordinate axis and whose minor axis lies along the x2 coordinate axis. The half-lengths of these major and minor axes are √4 = 2 and √1 = 1, respectively. The ellipse of unit distance is plotted in Figure 1.22. All points on the ellipse are regarded as being the same statistical distance from the origin, in this case a distance of 1. •

Figure 1.22 Ellipse of unit distance, x1²/4 + x2²/1 = 1.
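The four points tabled in Example 1.14 can be checked numerically with a small sketch (the function name is ours):

```python
import math

s11, s22 = 4.0, 1.0  # sample variances from Example 1.14

def squared_stat_dist(x1, x2):
    # squared statistical distance (1-13) from the origin
    return x1 ** 2 / s11 + x2 ** 2 / s22

# every tabled point lies at unit statistical distance from the origin
for point in [(0, 1), (0, -1), (2, 0), (1, math.sqrt(3) / 2)]:
    assert abs(squared_stat_dist(*point) - 1.0) < 1e-12
```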
The expression in (1-13) can be generalized to accommodate the calculation of statistical distance from an arbitrary point P = (x1, x2) to any fixed point Q = (y1, y2). If we assume that the coordinate variables vary independently of one another, the distance from P to Q is given by

d(P, Q) = √((x1 - y1)²/s11 + (x2 - y2)²/s22)     (1-15)

The extension of this statistical distance to more than two dimensions is straightforward. Let the points P and Q have p coordinates such that P = (x1, x2, ..., xp) and Q = (y1, y2, ..., yp). Suppose Q is a fixed point [it may be the origin O = (0, 0, ..., 0)] and the coordinate variables vary independently of one another. Let s11, s22, ..., spp be sample variances constructed from n measurements on x1, x2, ..., xp, respectively. Then the statistical distance from P to Q is

d(P, Q) = √((x1 - y1)²/s11 + (x2 - y2)²/s22 + ··· + (xp - yp)²/spp)     (1-16)

All points P that are a constant squared distance from Q lie on a hyperellipsoid centered at Q whose major and minor axes are parallel to the coordinate axes. We note the following:

1. The distance of P to the origin O is obtained by setting y1 = y2 = ··· = yp = 0 in (1-16).
2. If s11 = s22 = ··· = spp, the Euclidean distance formula in (1-12) is appropriate.

The distance in (1-16) still does not include most of the important cases we shall encounter, because of the assumption of independent coordinates. The scatter plot in Figure 1.23 depicts a two-dimensional situation in which the x1 measurements do not vary independently of the x2 measurements. In fact, the coordinates of the pairs (x1, x2) exhibit a tendency to be large or small together, and the sample correlation coefficient is positive. Moreover, the variability in the x2 direction is larger than the variability in the x1 direction.
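Equation (1-16) is simply a weighted Euclidean distance; a minimal sketch (the function name is ours):

```python
import math

def statistical_distance(p, q, variances):
    # statistical distance (1-16): each squared coordinate difference is
    # weighted by the reciprocal of that coordinate's sample variance
    return math.sqrt(sum((x - y) ** 2 / s
                         for x, y, s in zip(p, q, variances)))

# with all variances equal to 1, this is just Euclidean distance (1-12)
print(statistical_distance((3.0, 4.0), (0.0, 0.0), (1.0, 1.0)))  # → 5.0

# with s11 = 4, s22 = 1 the point (2, 0) is at unit distance from the
# origin, exactly as in Example 1.14
print(statistical_distance((2.0, 0.0), (0.0, 0.0), (4.0, 1.0)))  # → 1.0
```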
Figure 1.23 A scatter plot for positively correlated measurements and a rotated coordinate system.
What is a meaningful measure of distance when the variability in the x1 direction is different from the variability in the x2 direction and the variables x1 and x2 are correlated? Actually, we can use what we have already introduced, provided that we look at things in the right way. From Figure 1.23, we see that if we rotate the original coordinate system through the angle θ while keeping the scatter fixed and label the rotated axes x̃1 and x̃2, the scatter in terms of the new axes looks very much like that in Figure 1.20. (You may wish to turn the book to place the x̃1 and x̃2 axes in their customary positions.) This suggests that we calculate the sample variances using the x̃1 and x̃2 coordinates and measure distance as in Equation (1-13). That is, with reference to the x̃1 and x̃2 axes, we define the distance from the point P = (x̃1, x̃2) to the origin O = (0, 0) as

d(O, P) = √(x̃1²/s̃11 + x̃2²/s̃22)     (1-17)

where s̃11 and s̃22 denote the sample variances computed with the x̃1 and x̃2 measurements.

The relation between the original coordinates (x1, x2) and the rotated coordinates (x̃1, x̃2) is provided by

x̃1 = x1 cos(θ) + x2 sin(θ)
x̃2 = -x1 sin(θ) + x2 cos(θ)     (1-18)

Given the relations in (1-18), we can formally substitute for x̃1 and x̃2 in (1-17) and express the distance in terms of the original coordinates.

After some straightforward algebraic manipulations, the distance from P = (x1, x2) to the origin O = (0, 0) can be written in terms of the original coordinates x1 and x2 of P as

d(O, P) = √(a11x1² + 2a12x1x2 + a22x2²)     (1-19)
where the a's are numbers such that the distance is nonnegative for all possible values of x1 and x2. Here a11, a12, and a22 are determined by the angle θ, and s11, s12, and s22 are calculated from the original data.² The particular forms for a11, a12, and a22 are not important at this point. What is important is the appearance of the cross-product term 2a12x1x2 necessitated by the nonzero correlation r12.

Equation (1-19) can be compared with (1-13). The expression in (1-13) can be regarded as a special case of (1-19) with a11 = 1/s11, a22 = 1/s22, and a12 = 0.

² Specifically,

a11 = cos²(θ)/[cos²(θ)s11 + 2 sin(θ)cos(θ)s12 + sin²(θ)s22] + sin²(θ)/[cos²(θ)s22 - 2 sin(θ)cos(θ)s12 + sin²(θ)s11]

a22 = sin²(θ)/[cos²(θ)s11 + 2 sin(θ)cos(θ)s12 + sin²(θ)s22] + cos²(θ)/[cos²(θ)s22 - 2 sin(θ)cos(θ)s12 + sin²(θ)s11]

a12 = cos(θ)sin(θ)/[cos²(θ)s11 + 2 sin(θ)cos(θ)s12 + sin²(θ)s22] - sin(θ)cos(θ)/[cos²(θ)s22 - 2 sin(θ)cos(θ)s12 + sin²(θ)s11]
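The footnote formulas can be verified numerically: with a11, a12, and a22 computed from θ and the original sample quantities, the quadratic form in (1-19) reproduces the rotated-axes distance (1-17). A sketch (all numeric values below are arbitrary):

```python
import math

theta = 0.3                    # rotation angle (arbitrary)
s11, s12, s22 = 2.0, 0.8, 1.0  # original sample variances and covariance
x1, x2 = 1.5, -0.7             # an arbitrary point P

c, s = math.cos(theta), math.sin(theta)

# sample variances along the rotated axes (the denominators in footnote 2)
st11 = c * c * s11 + 2 * s * c * s12 + s * s * s22
st22 = c * c * s22 - 2 * s * c * s12 + s * s * s11

# coefficients from footnote 2
a11 = c * c / st11 + s * s / st22
a22 = s * s / st11 + c * c / st22
a12 = c * s / st11 - s * c / st22

# rotated coordinates, equation (1-18)
xt1 = x1 * c + x2 * s
xt2 = -x1 * s + x2 * c

d_17 = math.sqrt(xt1 ** 2 / st11 + xt2 ** 2 / st22)         # (1-17)
d_19 = math.sqrt(a11 * x1 ** 2 + 2 * a12 * x1 * x2
                 + a22 * x2 ** 2)                           # (1-19)
assert abs(d_17 - d_19) < 1e-12
```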
Figure 1.24 Ellipse of points a constant distance from the point Q.
In general, the statistical distance of the point P = (x1, x2) from the fixed point Q = (y1, y2) for situations in which the variables are correlated has the general form

d(P, Q) = √(a11(x1 - y1)² + 2a12(x1 - y1)(x2 - y2) + a22(x2 - y2)²)     (1-20)

and can always be computed once a11, a12, and a22 are known. In addition, the coordinates of all points P = (x1, x2) that are a constant squared distance c² from Q satisfy

a11(x1 - y1)² + 2a12(x1 - y1)(x2 - y2) + a22(x2 - y2)² = c²     (1-21)

By definition, this is the equation of an ellipse centered at Q. The graph of such an equation is displayed in Figure 1.24. The major (long) and minor (short) axes are indicated. They are parallel to the x̃1 and x̃2 axes. For the choice of a11, a12, and a22 in footnote 2, the x̃1 and x̃2 axes are at an angle θ with respect to the x1 and x2 axes.

The generalization of the distance formulas of (1-19) and (1-20) to p dimensions is straightforward. Let P = (x1, x2, ..., xp) be a point whose coordinates represent variables that are correlated and subject to inherent variability. Let O = (0, 0, ..., 0) denote the origin, and let Q = (y1, y2, ..., yp) be a specified fixed point. Then the distances from P to O and from P to Q have the general forms

d(O, P) = √(a11x1² + a22x2² + ··· + appxp² + 2a12x1x2 + 2a13x1x3 + ··· + 2ap-1,pxp-1xp)     (1-22)

and

d(P, Q) = √[a11(x1 - y1)² + a22(x2 - y2)² + ··· + app(xp - yp)² + 2a12(x1 - y1)(x2 - y2) + 2a13(x1 - y1)(x3 - y3) + ··· + 2ap-1,p(xp-1 - yp-1)(xp - yp)]     (1-23)

where the a's are numbers such that the distances are always nonnegative.³

³ The algebraic expressions for the squares of the distances in (1-22) and (1-23) are known as quadratic forms and, in particular, positive definite quadratic forms. It is possible to display these quadratic forms in a simpler manner using matrix algebra; we shall do so in Section 2.3 of Chapter 2.
We note that the distances in (1-22) and (1-23) are completely determined by the coefficients (weights) aik, i = 1, 2, ..., p, k = 1, 2, ..., p. These coefficients can be set out in the rectangular array

a11  a12  ···  a1p
a12  a22  ···  a2p
 ⋮    ⋮         ⋮
a1p  a2p  ···  app     (1-24)

where the aik's with i ≠ k are displayed twice, since they are multiplied by 2 in the distance formulas. Consequently, the entries in this array specify the distance functions. The aik's cannot be arbitrary numbers; they must be such that the computed distance is nonnegative for every pair of points. (See Exercise 1.10.) Contours of constant distances computed from (1-22) and (1-23) are hyperellipsoids. A hyperellipsoid resembles a football when p = 3; it is impossible to visualize in more than three dimensions.

The need to consider statistical rather than Euclidean distance is illustrated heuristically in Figure 1.25. Figure 1.25 depicts a cluster of points whose center of gravity (sample mean) is indicated by the point Q. Consider the Euclidean distances from the point Q to the point P and the origin O. The Euclidean distance from Q to P is larger than the Euclidean distance from Q to O. However, P appears to be more like the points in the cluster than does the origin. If we take into account the variability of the points in the cluster and measure distance by the statistical distance in (1-20), then Q will be closer to P than to O. This result seems reasonable given the nature of the scatter.
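The general distance (1-23) is the square root of a quadratic form built from the symmetric coefficient array (1-24); a minimal sketch (the function name is ours):

```python
import math

def quad_form_distance(p, q, a):
    # distance (1-23): sqrt of sum over i, k of a[i][k](x_i - y_i)(x_k - y_k),
    # where a is the symmetric coefficient array (1-24); each off-diagonal
    # entry appears twice, which supplies the factor of 2 in (1-23)
    diff = [x - y for x, y in zip(p, q)]
    total = sum(a[i][k] * diff[i] * diff[k]
                for i in range(len(diff)) for k in range(len(diff)))
    return math.sqrt(total)

# with a12 = 0, a11 = 1/s11, a22 = 1/s22 this reduces to (1-16); the
# point (2, 0) with s11 = 4, s22 = 1 is at unit distance (Example 1.14)
a = [[0.25, 0.0], [0.0, 1.0]]
print(quad_form_distance((2.0, 0.0), (0.0, 0.0), a))  # → 1.0
```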
Figure 1.25 A cluster of points relative to a point P and the origin.
Other measures of distance can be advanced. (See Exercise 1.12.) At times, it is useful to consider distances that are not related to circles or ellipses. Any distance measure d(P, Q) between two points P and Q is valid provided that it satisfies the following properties, where R is any other intermediate point:

d(P, Q) = d(Q, P)
d(P, Q) > 0 if P ≠ Q
d(P, Q) = 0 if P = Q
d(P, Q) ≤ d(P, R) + d(R, Q)   (triangle inequality)     (1-25)
Chapter 1
Aspects of M u ltivariate Ana lysis
FI NAL CO M M E NTS We have attempted to motivate the study of multivariate analysis and to provide you with some rudimentary, but important, methods for organizing, summarizing, and dis playing data. In addition, a general concept of distance has been introduced that will be used repeatedly in later chapters.
EXE RCI SES
1.1. Consider the seven pairs of measurements ( x 1 , x2 ) plotted in Figure 1 . 1 :
x2
3
4
2
6
8
2
5
5
5.5
4
7
10
5
7.5
Calculate the sample means x̄1 and x̄2, the sample variances s11 and s22, and the sample covariance s12.

1.2. A morning newspaper lists the following used-car prices for a foreign compact with age x1 measured in years and selling price x2 measured in thousands of dollars:
x1    3     5     5     7     7     7     8     9    10    11
x2   2.30  1.90  1.00   .70   .30  1.00  1.05   .45   .70   .30
(a) Construct a scatter plot of the data and marginal dot diagrams.
(b) Infer the sign of the sample covariance s12 from the scatter plot.
(c) Compute the sample means x̄1 and x̄2 and the sample variances s11 and s22. Compute the sample covariance s12 and the sample correlation coefficient r12. Interpret these quantities.
(d) Display the sample mean array x̄, the sample variance-covariance array Sn, and the sample correlation array R using (1-8).

1.3. The following are five measurements on the variables x1, x2, and x3:
x1    9    2    6    5    8
x2   12    8    6    4   10
x3    3    4    0    2    1
Find the arrays x̄, Sn, and R.

1.4. The 10 largest U.S. industrial corporations yield the following data:
Company             x1 = sales             x2 = profits           x3 = assets
                    (millions of dollars)  (millions of dollars)  (millions of dollars)
General Motors         126,974                4,224                 173,297
Ford                    96,933                3,835                 160,893
Exxon                   86,656                3,510                  83,219
IBM                     63,438                3,758                  77,734
General Electric        55,264                3,939                 128,344
Mobil                   50,976                1,809                  39,080
Philip Morris           39,069                2,946                  38,528
Chrysler                36,156                  359                  51,038
Du Pont                 35,209                2,480                  34,715
Texaco                  32,416                2,413                  25,636

Source: "Fortune 500," Fortune, 121 (April 23, 1990), 346-367. © 1990 Time Inc. All rights reserved.
(a) Plot the scatter diagram and marginal dot diagrams for variables x1 and x2. Comment on the appearance of the diagrams.
(b) Compute x̄1, x̄2, s11, s22, s12, and r12. Interpret r12.

1.5. Use the data in Exercise 1.4.
(a) Plot the scatter diagrams and dot diagrams for (x2, x3) and (x1, x3). Comment on the patterns.
(b) Compute the x̄, Sn, and R arrays for (x1, x2, x3).

1.6. The data in Table 1.5 are 42 measurements on air-pollution variables recorded at 12:00 noon in the Los Angeles area on different days. (See also the air-pollution data on the CD-ROM.)
(a) Plot the marginal dot diagrams for all the variables.
(b) Construct the x̄, Sn, and R arrays, and interpret the entries in R.

1.7. You are given the following n = 3 observations on p = 2 variables:

Variable 1:  x11 = 2   x21 = 3   x31 = 4
Variable 2:  x12 = 1   x22 = 2   x32 = 4
(a) Plot the pairs of observations in the two-dimensional "variable space." That is, construct a two-dimensional scatter plot of the data.
(b) Plot the data as two points in the three-dimensional "item space."

1.8. Evaluate the distance of the point P = (-1, -1) to the point Q = (1, 0) using the Euclidean distance formula in (1-12) with p = 2 and using the statistical distance in (1-20) with a11 = 1/3, a22 = 4/27, and a12 = 1/9. Sketch the locus of points that are a constant squared statistical distance 1 from the point Q.

1.9. Consider the following eight pairs of measurements on two variables x1 and x2:

x1 | -6  -3  -2   1   2   5   6   8
x2 | -2  -3   1  -1   2   1   5   3
(a) Plot the data as a scatter diagram, and compute s11, s22, and s12.
TABLE 1.5 AIR-POLLUTION DATA

Wind (x1)  Solar radiation (x2)  CO (x3)  NO (x4)  NO2 (x5)  O3 (x6)  HC (x7)
    8            98                 7        2        12        8        2
    7           107                 4        3         9        5        3
    7           103                 4        3         5        6        3
   10            88                 5        2         8       15        4
    6            91                 4        2         8       10        3
    8            90                 5        2        12       12        4
    9            84                 7        4        12       15        5
    5            72                 6        4        21       14        4
    7            82                 5        1        11       11        3
    8            64                 5        2        13        9        4
    6            71                 5        4        10        3        3
    6            91                 4        2        12        7        3
    7            72                 7        4        18       10        3
   10            70                 4        2        11        7        3
   10            72                 4        1         8       10        3
    9            77                 4        1         9       10        3
    8            76                 4        1         7        7        3
    8            71                 5        3        16        4        4
    9            67                 4        2        13        2        3
    9            69                 3        3         9        5        3
   10            62                 5        3        14        4        4
    9            88                 4        2         7        6        3
    8            80                 4        2        13       11        4
    5            30                 3        3         5        2        3
    6            83                 5        1        10       23        4
    8            84                 3        2         7        6        3
    6            78                 4        2        11       11        3
    8            79                 2        1         7       10        3
    6            62                 4        3         9        8        3
   10            37                 3        1         7        2        3
    8            71                 4        1        10        7        3
    7            52                 4        1        12        8        4
    5            48                 6        5         8        4        3
    6            75                 4        1        10       24        3
   10            35                 4        1         6        9        2
    8            85                 4        1         9       10        2
    5            86                 3        1         6       12        2
    5            86                 7        2        13       18        2
    7            79                 7        4         9       25        3
    7            79                 5        2         8        6        2
    6            68                 6        2        11       14        3
    8            40                 4        3         6        5        2

Source: Data courtesy of Professor G. C. Tiao.
(b) Using (1-18), calculate the corresponding measurements on variables x̃1 and x̃2, assuming that the original coordinate axes are rotated through an angle of θ = 26° [given cos(26°) = .899 and sin(26°) = .438].
(c) Using the x̃1 and x̃2 measurements from (b), compute the sample variances s̃11 and s̃22.
(d) Consider the new pair of measurements (x1, x2) = (4, -2). Transform these to measurements on x̃1 and x̃2 using (1-18), and calculate the distance d(O, P) of the new point P = (x̃1, x̃2) from the origin O = (0, 0) using (1-17). Note: You will need s̃11 and s̃22 from (c).
(e) Calculate the distance from P = (4, -2) to the origin O = (0, 0) using (1-19) and the expressions for a11, a22, and a12 in footnote 2. Note: You will need s11, s22, and s12 from (a). Compare the distance calculated here with the distance calculated using the x̃1 and x̃2 values in (d). (Within rounding error, the numbers should be the same.)

1.10. Are the following distance functions valid for distance from the origin? Explain.
(a) x1² + 4x2² + x1x2 = (distance)²
(b) x1² - 2x2² = (distance)²

1.11. Verify that distance defined by (1-20) with a11 = 4, a22 = 1, and a12 = -1 satisfies the first three conditions in (1-25). (The triangle inequality is more difficult to verify.)

1.12. Define the distance from the point P = (x1, x2) to the origin O = (0, 0) as

d(O, P) = max(|x1|, |x2|)

(a) Compute the distance from P = (-3, 4) to the origin.
(b) Plot the locus of points whose squared distance from the origin is 1.
(c) Generalize the foregoing distance expression to points in p dimensions.

1.13. A large city has major roads laid out in a grid pattern, as indicated in the following diagram. Streets 1 through 5 run north-south (NS), and streets A through E run east-west (EW). Suppose there are retail stores located at intersections (A, 2), (E, 3), and (C, 5).
Assume the distance along a street between two intersections in either the NS or EW direction is 1 unit. Define the distance between any two intersections (points) on the grid to be the "city block" distance. [For example, the distance between intersections (D, 1) and (C, 2), which we might call d((D, 1), (C, 2)), is given by d((D, 1), (C, 2)) = d((D, 1), (D, 2)) + d((D, 2), (C, 2)) = 1 + 1 = 2. Also, d((D, 1), (C, 2)) = d((D, 1), (C, 1)) + d((C, 1), (C, 2)) = 1 + 1 = 2.]

[Diagram: a 5 × 5 street grid with NS streets 1-5 and EW streets A-E.]
Locate a supply facility (warehouse) at an intersection such that the sum of the distances from the warehouse to the three retail stores is minimized.
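Because the grid is only 5 × 5, Exercise 1.13's minimization can be sketched as a brute-force search. The encoding below (streets A-E as rows 0-4, streets 1-5 as columns 0-4) and the helper names are assumptions of this sketch, not part of the exercise:

```python
def city_block(p, q):
    # City-block (L1) distance between two grid intersections
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

# Retail stores at (A, 2), (E, 3), and (C, 5) under the row/column encoding
stores = [(0, 1), (4, 2), (2, 4)]

def total_dist(w):
    # Sum of city-block distances from candidate warehouse w to the three stores
    return sum(city_block(w, s) for s in stores)

# Exhaustively evaluate all 25 intersections and keep the best one
best = min(((r, c) for r in range(5) for c in range(5)), key=total_dist)
```

For the city-block distance, the optimum coordinate-by-coordinate is a median of the store coordinates, which the exhaustive search confirms.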
The following exercises contain fairly extensive data sets. A computer may be necessary for the required calculations.

1.14. Table 1.6 contains some of the raw data discussed in Section 1.2. (See also the multiple-sclerosis data on the CD-ROM.) Two different visual stimuli (S1 and S2) produced responses in both the left eye (L) and the right eye (R) of subjects in the study groups. The values recorded in the table include x1 (subject's age); x2 (total response of both eyes to stimulus S1, that is, S1L + S1R); x3 (difference between responses of eyes to stimulus S1, |S1L - S1R|); and so forth.

TABLE 1.6 MULTIPLE-SCLEROSIS DATA

Non-Multiple-Sclerosis Group Data
Subject    x1         x2             x3             x4             x5
number    (Age)   (S1L + S1R)   |S1L - S1R|   (S2L + S2R)   |S2L - S2R|
   1       18       152.0           1.6          198.4           .0
   2       19       138.0            .4          180.8          1.6
   3       20       144.0            .0          186.4           .8
   4       20       143.6           3.2          194.8           .0
   5       20       148.8            .0          217.6           .0
   ⋮        ⋮          ⋮              ⋮              ⋮             ⋮
  65       67       154.4           2.4          205.2          6.0
  66       69       171.2           1.6          210.4           .8
  67       73       157.2            .4          204.8           .0
  68       74       175.2           5.6          235.6           .4
  69       79       155.0           1.4          204.4           .0

Multiple-Sclerosis Group Data
Subject    x1         x2             x3             x4             x5
number
   1       23       148.0            .8          205.4           .6
   2       25       195.2           3.2          262.8           .4
   3       25       158.0           8.0          209.8         12.2
   4       28       134.4            .0          198.4          3.2
   5       29       190.2          14.2          243.8         10.6
   ⋮        ⋮          ⋮              ⋮              ⋮             ⋮
  25       57       165.6          16.8          229.2         15.6
  26       58       238.4           8.0          304.4          6.0
  27       58       164.0            .8          216.8           .8
  28       58       169.8            .0          219.2          1.6
  29       59       199.8           4.6          250.2          1.0

Source: Data courtesy of Dr. G. G. Celesia.
Chapter 1
Exercises
43
(a) Plot the two-dimensional scatter diagram for the variables x2 and x4 for the multiple-sclerosis group. Comment on the appearance of the diagram.
(b) Compute the x̄, Sn, and R arrays for the non-multiple-sclerosis and multiple-sclerosis groups separately.

1.15. Some of the 98 measurements described in Section 1.2 are listed in Table 1.7. (See also the radiotherapy data on the CD-ROM.) The data consist of average ratings over the course of treatment for patients undergoing radiotherapy. Variables measured include x1 (number of symptoms, such as sore throat or nausea); x2 (amount of activity, on a 1-5 scale); x3 (amount of sleep, on a 1-5 scale); x4 (amount of food consumed, on a 1-3 scale); x5 (appetite, on a 1-5 scale); and x6 (skin reaction, on a 0-3 scale).
TABLE 1.7 RADIOTHERAPY DATA

   x1         x2        x3      x4       x5          x6
Symptoms   Activity   Sleep    Eat    Appetite   Skin reaction
  .889      1.389     1.555   2.222    1.945       1.000
 2.813      1.437      .999   2.312    2.312       2.000
 1.454      1.091     2.364   2.455    2.909       3.000
  .294       .941     1.059   2.000    1.000       1.000
 2.727      2.545     2.819   2.727    4.091        .000
   ⋮          ⋮         ⋮       ⋮        ⋮           ⋮
 4.100      1.900     2.800   2.000    2.600       2.000
  .125      1.062     1.437   1.875    1.563        .000
 6.231      2.769     1.462   2.385    4.000       2.000
 3.000      1.455     2.090   2.273    3.272       2.000
  .889      1.000     1.000   2.000    1.000       2.000

Source: Data courtesy of Mrs. Annette Tealey, R.N. Values of x2 and x3 less than 1.0 are due to errors in the data-collection process. Rows containing values of x2 and x3 less than 1.0 may be omitted.
(a) Construct the two-dimensional scatter plot for variables x2 and x3 and the marginal dot diagrams (or histograms). Do there appear to be any errors in the x3 data?
(b) Compute the x̄, Sn, and R arrays. Interpret the pairwise correlations.

1.16. At the start of a study to determine whether exercise or dietary supplements would slow bone loss in older women, an investigator measured the mineral content of bones by photon absorptiometry. Measurements were recorded for three bones on the dominant and nondominant sides and are shown in Table 1.8. (See also the mineral-content data on the CD-ROM.) Compute the x̄, Sn, and R arrays. Interpret the pairwise correlations.

1.17. Some of the data described in Section 1.2 are listed in Table 1.9. (See also the national-track-records data on the CD-ROM.) The national track records for women in 55 countries can be examined for the relationships among the running events. Compute the x̄, Sn, and R arrays. Notice the magnitudes of the correlation coefficients as you go from the shorter (100-meter) to the longer (marathon) running distances. Interpret these pairwise correlations.
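Exercises 1.14 through 1.17 all call for the x̄, Sn, and R arrays. A minimal NumPy sketch (the function name `summary_arrays` is mine; it is illustrated here on the five measurements of Exercise 1.3):

```python
import numpy as np

def summary_arrays(X):
    """Return the sample mean vector, Sn (divisor n), and R for an n x p data matrix."""
    n = X.shape[0]
    xbar = X.mean(axis=0)
    dev = X - xbar
    Sn = dev.T @ dev / n              # variance-covariance array with divisor n
    d = np.sqrt(np.diag(Sn))
    R = Sn / np.outer(d, d)           # r_ik = s_ik / sqrt(s_ii * s_kk)
    return xbar, Sn, R

# Exercise 1.3 data: five measurements on x1, x2, x3 (one row per item)
X = np.array([[9, 12, 3], [2, 8, 4], [6, 6, 0], [5, 4, 2], [8, 10, 1]], dtype=float)
xbar, Sn, R = summary_arrays(X)
```

The same function applies unchanged to the larger data sets of Tables 1.5-1.10 once they are loaded as n × p arrays.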
TABLE 1.8 MINERAL CONTENT IN BONES

Subject   Dominant             Dominant              Dominant
number     radius    Radius    humerus    Humerus     ulna      Ulna
   1       1.103     1.052      2.139      2.238      .873      .872
   2        .842      .859      1.873      1.741      .590      .744
   3        .925      .873      1.887      1.809      .767      .713
   4        .857      .744      1.739      1.547      .706      .674
   5        .795      .809      1.734      1.715      .549      .654
   6        .787      .779      1.509      1.474      .782      .571
   7        .933      .880      1.695      1.656      .737      .803
   8        .799      .851      1.740      1.777      .618      .682
   9        .945      .876      1.811      1.759      .853      .777
  10        .921      .906      1.954      2.009      .823      .765
  11        .792      .825      1.624      1.657      .686      .668
  12        .815      .751      2.204      1.846      .678      .546
  13        .755      .724      1.508      1.458      .662      .595
  14        .880      .866      1.786      1.811      .810      .819
  15        .900      .838      1.902      1.606      .723      .677
  16        .764      .757      1.743      1.794      .586      .541
  17        .733      .748      1.863      1.869      .672      .752
  18        .932      .898      2.028      2.032      .836      .805
  19        .856      .786      1.390      1.324      .578      .610
  20        .890      .950      2.187      2.087      .758      .718
  21        .688      .532      1.650      1.378      .533      .482
  22        .940      .850      2.334      2.225      .757      .731
  23        .493      .616      1.037      1.268      .546      .615
  24        .835      .752      1.509      1.422      .618      .664
  25        .915      .936      1.971      1.869      .869      .868

Source: Data courtesy of Everett Smith.
1.18. Convert the national track records for women in Table 1.9 to speeds measured in meters per second. For example, the record speed for the 100-m dash for Argentinian women is 100 m/11.61 sec = 8.613 m/sec. Notice that the records for the 800-m, 1500-m, 3000-m, and marathon runs are measured in minutes. The marathon is 26.2 miles, or 42,195 meters, long. Compute the x̄, Sn, and R arrays. Notice the magnitudes of the correlation coefficients as you go from the shorter (100 m) to the longer (marathon) running distances. Interpret these pairwise correlations. Compare your results with the results you obtained in Exercise 1.17.

1.19. Create the scatter plot and boxplot displays of Figure 1.5 for (a) the mineral-content data in Table 1.8 and (b) the national-track-records data in Table 1.9.

1.20. Refer to the bankruptcy data in Table 11.4, page 655, and on the data CD-ROM. Using appropriate computer software,
(a) View the entire data set in x1, x2, x3 space. Rotate the coordinate axes in various directions. Check for unusual observations.
(b) Highlight the set of points corresponding to the bankrupt firms. Examine various three-dimensional perspectives. Are there some orientations of
TABLE 1.9 NATIONAL TRACK RECORDS FOR WOMEN

Country                                100 m  200 m  400 m  800 m  1500 m  3000 m  Marathon
                                        (s)    (s)    (s)   (min)  (min)   (min)   (min)
Argentina                              11.61  22.94  54.50  2.15   4.43     9.79   178.52
Australia                              11.20  22.35  51.08  1.98   4.13     9.08   152.37
Austria                                11.43  23.09  50.62  1.99   4.22     9.34   159.37
Belgium                                11.41  23.04  52.00  2.00   4.14     8.88   157.85
Bermuda                                11.46  23.05  53.30  2.16   4.58     9.81   169.98
Brazil                                 11.31  23.17  52.80  2.10   4.49     9.77   168.75
Burma                                  12.14  24.47  55.00  2.18   4.45     9.51   191.02
Canada                                 11.00  22.25  50.06  2.00   4.06     8.81   149.45
Chile                                  12.00  24.52  54.90  2.05   4.23     9.37   171.38
China                                  11.95  24.41  54.97  2.08   4.33     9.31   168.48
Colombia                               11.60  24.00  53.26  2.11   4.35     9.46   165.42
Cook Islands                           12.90  27.10  60.40  2.30   4.84    11.10   233.22
Costa Rica                             11.96  24.60  58.25  2.21   4.68    10.43   171.80
Czechoslovakia                         11.09  21.97  47.99  1.89   4.14     8.92   158.85
Denmark                                11.42  23.52  53.60  2.03   4.18     8.71   151.75
Dominican Republic                     11.79  24.05  56.05  2.24   4.74     9.89   203.88
Finland                                11.13  22.39  50.14  2.03   4.10     8.92   154.23
France                                 11.15  22.59  51.73  2.00   4.14     8.98   155.27
German Democratic Republic             10.81  21.71  48.16  1.93   3.96     8.75   157.68
Federal Republic of Germany            11.01  22.39  49.75  1.95   4.03     8.59   148.53
Great Britain and Northern Ireland     11.00  22.13  50.46  1.98   4.03     8.62   149.72
Greece                                 11.79  24.08  54.93  2.07   4.35     9.87   182.20
Guatemala                              11.84  24.54  56.09  2.28   4.86    10.54   215.08
Hungary                                11.45  23.06  51.50  2.01   4.14     8.98   156.37
India                                  11.95  24.28  53.60  2.10   4.32     9.98   188.03
Indonesia                              11.85  24.24  55.34  2.22   4.61    10.02   201.28
Ireland                                11.43  23.51  53.24  2.05   4.11     8.89   149.38
Israel                                 11.45  23.57  54.90  2.10   4.25     9.37   160.48
Italy                                  11.29  23.00  52.01  1.96   3.98     8.63   151.82
Japan                                  11.73  24.00  53.73  2.09   4.35     9.20   150.50
Kenya                                  11.73  23.88  52.70  2.00   4.15     9.20   181.05
Korea                                  11.96  24.49  55.70  2.15   4.42     9.62   164.65
Democratic People's Republic of Korea  12.25  25.78  51.20  1.97   4.25     9.35   179.17
Luxembourg                             12.03  24.96  56.10  2.07   4.38     9.64   174.68
Malaysia                               12.23  24.21  55.09  2.19   4.69    10.46   182.17
Mauritius                              11.76  25.08  58.10  2.27   4.79    10.90   261.13
Mexico                                 11.89  23.62  53.76  2.04   4.25     9.59   158.53
Netherlands                            11.25  22.81  52.38  1.99   4.06     9.01   152.48
New Zealand                            11.55  23.13  51.60  2.02   4.18     8.76   145.48
Norway                                 11.58  23.31  53.12  2.03   4.01     8.53   145.48
Papua New Guinea                       12.25  25.07  56.96  2.24   4.84    10.69   233.00
Philippines                            11.76  23.54  54.60  2.19   4.60    10.16   200.37
Poland                                 11.13  22.21  49.29  1.95   3.99     8.97   160.82
Portugal                               11.81  24.22  54.30  2.09   4.16     8.84   151.20
Rumania                                11.44  23.46  51.20  1.92   3.96     8.53   165.45
Singapore                              12.30  25.00  55.08  2.12   4.52     9.94   182.77
Spain                                  11.80  23.98  53.59  2.05   4.14     9.02   162.60
Sweden                                 11.16  22.82  51.79  2.02   4.12     8.84   154.48
Switzerland                            11.45  23.31  53.11  2.02   4.07     8.77   153.42
Taiwan                                 11.22  22.62  52.50  2.10   4.38     9.63   177.87
Thailand                               11.75  24.46  55.80  2.20   4.72    10.28   168.45
Turkey                                 11.98  24.44  56.45  2.15   4.37     9.38   201.08
U.S.A.                                 10.79  21.83  50.62  1.96   3.95     8.50   142.72
U.S.S.R.                               11.06  22.19  49.19  1.89   3.87     8.45   151.22
Western Samoa                          12.74  25.85  58.73  2.33   5.81    13.04   306.00

Source: IAAF/ATFS Track and Field Statistics Handbook for the 1984 Los Angeles Olympics.
three-dimensional space for which the bankrupt firms can be distinguished from the nonbankrupt firms? Are there observations in each of the two groups that are likely to have a significant impact on any rule developed to classify firms based on the sample means, variances, and covariances calculated from these data? (See Exercise 11.24.)

1.21. Refer to the milk transportation-cost data in Table 6.8, page 334, and on the CD-ROM. Using appropriate computer software,
(a) View the entire data set in three dimensions. Rotate the coordinate axes in various directions. Check for unusual observations.
(b) Highlight the set of points corresponding to gasoline trucks. Do any of the gasoline-truck points appear to be multivariate outliers? (See Exercise 6.17.) Are there some orientations of x1, x2, x3 space for which the set of points representing gasoline trucks can be readily distinguished from the set of points representing diesel trucks?

1.22. Refer to the oxygen-consumption data in Table 6.10, page 336, and on the CD-ROM. Using appropriate computer software,
(a) View the entire data set in three dimensions employing various combinations of three variables to represent the coordinate axes. Begin with the x1, x2, x3 space.
(b) Check this data set for outliers.

1.23. Using the data in Table 11.9, page 664, and on the CD-ROM, represent the cereals in each of the following ways.
(a) Stars.
(b) Chernoff faces. (Experiment with the assignment of variables to facial characteristics.)

1.24. Using the utility data in Table 12.5, page 683, and on the CD-ROM, represent the public utility companies as Chernoff faces with assignments of variables to facial characteristics different from those considered in Example 1.12. Compare your faces with the faces in Figure 1.17. Are different groupings indicated?

1.25. Using the data in Table 12.5 and on the CD-ROM, represent the 22 public utility companies as stars. Visually group the companies into four or five clusters.

1.26. The data in Table 1.10 (see the bull data on the CD-ROM) are the measured characteristics of 76 young (less than two years old) bulls sold at auction. Also included in the table are the selling prices (SalePr) of these bulls. The column headings (variables) are defined as follows:

Breed    = 1 Angus, 5 Hereford, 8 Simental
YrHgt    = Yearling height at shoulder (inches)
FtFrBody = Fat free body (pounds)
PrctFFB  = Percent fat-free body
Frame    = Scale from 1 (small) to 8 (large)
BkFat    = Back fat (inches)
SaleHt   = Sale height at shoulder (inches)
SaleWt   = Sale weight (pounds)
(a) Compute the x̄, Sn, and R arrays. Interpret the pairwise correlations. Do some of these variables appear to distinguish one breed from another?
(b) View the data in three dimensions using the variables Breed, Frame, and BkFat. Rotate the coordinate axes in various directions. Check for outliers. Are the breeds well separated in this coordinate system?
(c) Repeat part b using Breed, FtFrBody, and SaleHt. Which three-dimensional display appears to result in the best separation of the three breeds of bulls?

TABLE 1.10 DATA ON BULLS

Breed  SalePr  YrHgt  FtFrBody  PrctFFB  Frame  BkFat  SaleHt  SaleWt
  1     2200   51.0     1128      70.9     7     .25    54.8    1720
  1     2250   51.9     1108      72.1     7     .25    55.3    1575
  1     1625   49.9     1011      71.6     6     .15    53.1    1410
  1     4600   53.1      993      68.9     8     .35    56.4    1595
  1     2150   51.2      996      68.6     7     .25    55.0    1488
  ⋮       ⋮      ⋮        ⋮         ⋮      ⋮      ⋮       ⋮       ⋮
  8     1450   51.4      997      73.4     7     .10    55.2    1454
  8     1200   49.8      991      70.8     6     .15    54.6    1475
  8     1425   50.0      928      70.8     6     .10    53.9    1375
  8     1250   50.1      990      71.0     6     .10    54.9    1564
  8     1500   51.7      992      70.6     7     .15    55.1    1458

Source: Data courtesy of Mark Ellersieck.
CHAPTER 2

Matrix Algebra and Random Vectors

2.1 INTRODUCTION

We saw in Chapter 1 that multivariate data can be conveniently displayed as an array of numbers. In general, a rectangular array of numbers with, for instance, n rows and p columns is called a matrix of dimension n × p. The study of multivariate methods is greatly facilitated by the use of matrix algebra.
The matrix algebra results presented in this chapter will enable us to concisely state statistical models. Moreover, the formal relations expressed in matrix terms are easily programmed on computers to allow the routine calculation of important statistical quantities.
We begin by introducing some very basic concepts that are essential to both our geometrical interpretations and algebraic explanations of subsequent statistical techniques. If you have not been previously exposed to the rudiments of matrix algebra, you may prefer to follow the brief refresher in the next section by the more detailed review provided in Supplement 2A.
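The n × p array of Section 2.1 maps directly onto a two-dimensional array in software. A small sketch (NumPy assumed; the numbers are hypothetical, one row per item and one column per variable):

```python
import numpy as np

# A hypothetical data matrix: n = 4 items measured on p = 3 variables
X = np.array([[42.0, 4.0, 3.5],
              [52.0, 5.0, 4.0],
              [48.0, 4.0, 4.5],
              [58.0, 3.0, 3.0]])
n, p = X.shape          # the dimension of the matrix is n x p
```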
2.2
SOME BASICS OF MATRIX AND VECTOR ALG E BRA Vectors An array x of n real numbers x 1 ,
x2 , . , xn is called a vector, and it is written as . .
x ==
where the prime denotes the operation of transposing a column to a row. so
Figure 2.1 The vector x' = [1, 3, 2].
A vector x can be represented geometrically as a directed line in n dimensions with component x1 along the first axis, x2 along the second axis, ..., and xn along the nth axis. This is illustrated in Figure 2.1 for n = 3.
A vector can be expanded or contracted by multiplying it by a constant c. In particular, we define the vector cx as

cx = [cx1
      cx2
       ⋮
      cxn]

That is, cx is the vector obtained by multiplying each element of x by c. [See Figure 2.2(a).]
Figure 2.2 Scalar multiplication and vector addition.
Two vectors may be added. Addition of x and y is defined as

x + y = [x1     [y1     [x1 + y1
         x2  +   y2  =   x2 + y2
          ⋮       ⋮          ⋮
         xn]     yn]     xn + yn]

so that x + y is the vector with ith element xi + yi.
The sum of two vectors emanating from the origin is the diagonal of the parallelogram formed with the two original vectors as adjacent sides. This geometrical interpretation is illustrated in Figure 2.2(b).
A vector has both direction and length. In n = 2 dimensions, we consider the vector

x = [x1
     x2]

The length of x, written Lx, is defined to be

Lx = √(x1² + x2²)

Geometrically, the length of a vector in two dimensions can be viewed as the hypotenuse of a right triangle. This is demonstrated schematically in Figure 2.3. The length of a vector x' = [x1, x2, ..., xn], with n components, is defined by

Lx = √(x1² + x2² + ··· + xn²)                                          (2-1)

Multiplication of a vector x by a scalar c changes the length. From Equation (2-1),

Lcx = √(c²x1² + c²x2² + ··· + c²xn²) = |c| √(x1² + x2² + ··· + xn²) = |c| Lx    (2-2)

Multiplication by c does not change the direction of the vector x if c > 0. However, a negative value of c creates a vector with a direction opposite that of x. From (2-2) it is clear that x is expanded if |c| > 1 and contracted if 0 < |c| < 1. [Recall Figure 2.2(a).] Choosing c = Lx⁻¹, we obtain the unit vector Lx⁻¹x, which has length 1 and lies in the direction of x.
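Equations (2-1) and (2-2) are easy to check numerically. A short NumPy sketch using the vector x' = [1, 3, 2] from Figure 2.1:

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0])
Lx = np.sqrt(x @ x)                     # length of x, Equation (2-1)

c = -2.0
Lcx = np.sqrt((c * x) @ (c * x))        # length of cx
assert np.isclose(Lcx, abs(c) * Lx)     # Equation (2-2): L_cx = |c| L_x

u = x / Lx                              # unit vector L_x^{-1} x: length 1, same direction
```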
Figure 2.3 Length of x = √(x1² + x2²).
Figure 2.4 The angle θ between x' = [x1, x2] and y' = [y1, y2].
A second geometrical concept is angle. Consider two vectors in a plane and the angle θ between them, as in Figure 2.4. From the figure, θ can be represented as the difference between the angles θ1 and θ2 formed by the two vectors and the first coordinate axis. Since, by definition,

cos(θ1) = x1/Lx    cos(θ2) = y1/Ly
sin(θ1) = x2/Lx    sin(θ2) = y2/Ly

and

cos(θ) = cos(θ2 − θ1) = cos(θ2) cos(θ1) + sin(θ2) sin(θ1)

the angle θ between the two vectors x' = [x1, x2] and y' = [y1, y2] is specified by

cos(θ) = cos(θ2 − θ1) = (y1/Ly)(x1/Lx) + (y2/Ly)(x2/Lx) = (x1y1 + x2y2)/(Lx Ly)    (2-3)

We find it convenient to introduce the inner product of two vectors. For n = 2 dimensions, the inner product of x and y is

x'y = x1y1 + x2y2

With this definition and Equation (2-3),

Lx = √(x'x)    cos(θ) = x'y/(Lx Ly) = x'y/(√(x'x) √(y'y))

Since cos(90°) = cos(270°) = 0 and cos(θ) = 0 only if x'y = 0, x and y are perpendicular when x'y = 0.
For an arbitrary number of dimensions n, we define the inner product of x and y as

x'y = x1y1 + x2y2 + ··· + xnyn                                        (2-4)

The inner product is denoted by either x'y or y'x.
Using the inner product, we have the natural extension of length and angle to vectors of n components:

Lx = length of x = √(x'x)                                             (2-5)
cos(θ) = x'y/(Lx Ly) = x'y/(√(x'x) √(y'y))                            (2-6)

Since, again, cos(θ) = 0 only if x'y = 0, we say that x and y are perpendicular when x'y = 0.

Example 2.1 (Calculating lengths of vectors and the angle between them)
Given the vectors x' = [1, 3, 2] and y' = [-2, 1, -1], find 3x and x + y. Next, determine the length of x, the length of y, and the angle between x and y. Also, check that the length of 3x is three times the length of x.
First,

3x = [3, 9, 6]'    x + y = [1 + (-2), 3 + 1, 2 + (-1)]' = [-1, 4, 1]'

Next, x'x = 1² + 3² + 2² = 14, y'y = (-2)² + 1² + (-1)² = 6, and x'y = 1(-2) + 3(1) + 2(-1) = -1. Therefore,

Lx = √(x'x) = √14 = 3.742    Ly = √(y'y) = √6 = 2.449

and

cos(θ) = x'y/(Lx Ly) = -1/(3.742 × 2.449) = -.109

so θ = 96.3°. Finally,

L3x = √(3² + 9² + 6²) = √126  and  3Lx = 3√14 = √126

showing L3x = 3Lx. •
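Example 2.1's arithmetic can be reproduced with NumPy, which makes a handy check on hand calculations of this kind:

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0])
y = np.array([-2.0, 1.0, -1.0])

Lx, Ly = np.sqrt(x @ x), np.sqrt(y @ y)       # lengths via (2-5)
cos_theta = (x @ y) / (Lx * Ly)               # Equation (2-6)
theta = np.degrees(np.arccos(cos_theta))      # angle in degrees
L3x = np.sqrt((3 * x) @ (3 * x))              # length of 3x
```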
A pair of vectors x and y of the same dimension is said to be linearly dependent if there exist constants c1 and c2, both not zero, such that

c1x + c2y = 0

A set of vectors x1, x2, ..., xk is said to be linearly dependent if there exist constants c1, c2, ..., ck, not all zero, such that

c1x1 + c2x2 + ··· + ckxk = 0                                          (2-7)

Linear dependence implies that at least one vector in the set can be written as a linear combination of the other vectors. Vectors of the same dimension that are not linearly dependent are said to be linearly independent.
Example 2.2 (Identifying linearly independent vectors)

Consider the set of vectors

x1 = [1    x2 = [ 1    x3 = [ 1
      2           0          -2
      1]         -1]          1]

Setting

c1x1 + c2x2 + c3x3 = 0

implies that

c1 + c2 + c3 = 0
2c1 − 2c3 = 0
c1 − c2 + c3 = 0

with the unique solution c1 = c2 = c3 = 0. As we cannot find three constants c1, c2, and c3, not all zero, such that c1x1 + c2x2 + c3x3 = 0, the vectors x1, x2, and x3 are linearly independent. •
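The same conclusion follows from a rank computation: stacking the vectors as columns, they are linearly independent exactly when the matrix has full column rank. A NumPy sketch for Example 2.2:

```python
import numpy as np

# Columns are the vectors x1, x2, x3 of Example 2.2
X = np.column_stack([[1, 2, 1], [1, 0, -1], [1, -2, 1]])

# Full column rank (rank = 3 here) <=> the only solution of Xc = 0 is c = 0
rank = np.linalg.matrix_rank(X)
independent = (rank == X.shape[1])
```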
The projection (or shadow) of a vector x on a vector y is

Projection of x on y = (x'y/y'y) y = (x'y/Ly) (1/Ly) y                (2-8)

where the vector Ly⁻¹y has unit length. The length of the projection is

Length of projection = |x'y|/Ly = Lx (|x'y|/(Lx Ly)) = Lx |cos(θ)|    (2-9)

where θ is the angle between x and y. (See Figure 2.5.)
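Equations (2-8) and (2-9) translate directly into code. A small NumPy sketch (the function name `project` and the example vectors are my own; with y along the first axis, the projection simply picks off the first coordinate of x):

```python
import numpy as np

def project(x, y):
    # Equation (2-8): projection of x on y is (x'y / y'y) y
    return (x @ y) / (y @ y) * y

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.0])

p = project(x, y)
length = abs(x @ y) / np.sqrt(y @ y)    # Equation (2-9): length of the projection
```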
Matrices

A matrix is any rectangular array of real numbers. We denote an arbitrary array of n rows and p columns by

A     = [a11  a12  ···  a1p
(n×p)    a21  a22  ···  a2p
           ⋮    ⋮          ⋮
         an1  an2  ···  anp]
Many of the vector concepts just introduced have direct generalizations to matrices.
Figure 2.5 The projection of x on y.
The transpose operation A' of a matrix changes the columns into rows, so that the first column of A becomes the first row of A', the second column becomes the second row, and so forth.

Example 2.3 (The transpose of a matrix)

If

A     = [3  -1  2
(2×3)    1   5  4]

then

A'    = [ 3  1
(3×2)    -1  5
          2  4]  •
A matrix may also be multiplied by a constant c. The product cA is the matrix that results from multiplying each element of A by c. Thus

cA    = [ca11  ca12  ···  ca1p
(n×p)    ca21  ca22  ···  ca2p
           ⋮     ⋮           ⋮
         can1  can2  ···  canp]

Two matrices A and B of the same dimensions can be added. The sum A + B has (i, j)th entry aij + bij.

Example 2.4 (The sum of two matrices and multiplication of a matrix by a constant)

If

A     = [0   3  1    and    B     = [1  -2  -3
(2×3)    1  -1  1]          (2×3)    2   5   1]

then

4A    = [0  12  4
(2×3)    4  -4  4]

and

A + B = [0 + 1   3 - 2   1 - 3    = [1  1  -2
(2×3)    1 + 2  -1 + 5   1 + 1]      3  4   2]  •
It is also possible to define the multiplication of two matrices if the dimensions of the matrices conform in the following manner: When A is (n × k) and B is (k × p), so that the number of elements in a row of A is the same as the number of elements in a column of B, we can form the matrix product AB. An element of the new matrix AB is formed by taking the inner product of each row of A with each column of B.
Section 2.2
57
Some Basics of Matrix a n d Vector Algebra
The matrix product AB is

$$
\mathop{\mathbf{A}}_{(n\times k)}\mathop{\mathbf{B}}_{(k\times p)} = \text{the } (n \times p) \text{ matrix whose entry in the } i\text{th row and } j\text{th column is the inner product of the } i\text{th row of } \mathbf{A} \text{ and the } j\text{th column of } \mathbf{B}
$$

or

$$
(i, j) \text{ entry of } \mathbf{AB} = a_{i1}b_{1j} + a_{i2}b_{2j} + \cdots + a_{ik}b_{kj} = \sum_{\ell=1}^{k} a_{i\ell}\, b_{\ell j} \tag{2-10}
$$

When $k = 4$, we have four products to add for each entry in the matrix AB. Thus, the $(i, j)$ entry of AB is

$$
a_{i1}b_{1j} + a_{i2}b_{2j} + a_{i3}b_{3j} + a_{i4}b_{4j}
$$

formed from row $i$ of A, $[a_{i1}\ a_{i2}\ a_{i3}\ a_{i4}]$, and column $j$ of B, $[b_{1j}\ b_{2j}\ b_{3j}\ b_{4j}]'$.

Example 2.5 (Matrix multiplication)
If

$$
\mathop{\mathbf{A}}_{(2\times3)} = \begin{bmatrix} 3 & -1 & 2 \\ 1 & 5 & 4 \end{bmatrix} \quad \text{and} \quad \mathop{\mathbf{b}}_{(3\times1)} = \begin{bmatrix} -2 \\ 7 \\ 9 \end{bmatrix}
$$

then

$$
\mathop{\mathbf{A}\mathbf{b}}_{(2\times1)} = \begin{bmatrix} 3(-2) + (-1)(7) + 2(9) \\ 1(-2) + 5(7) + 4(9) \end{bmatrix} = \begin{bmatrix} 5 \\ 69 \end{bmatrix}
$$

and

$$
\begin{bmatrix} 2 & 0 \\ 1 & -1 \end{bmatrix}\begin{bmatrix} 3 & -1 & 2 \\ 1 & 5 & 4 \end{bmatrix}
= \begin{bmatrix} 2(3)+0(1) & 2(-1)+0(5) & 2(2)+0(4) \\ 1(3)-1(1) & 1(-1)-1(5) & 1(2)-1(4) \end{bmatrix}
= \mathop{\begin{bmatrix} 6 & -2 & 4 \\ 2 & -6 & -2 \end{bmatrix}}_{(2\times3)}
$$

•
When a matrix B consists of a single column, it is customary to use the lowercase vector notation b.

Example 2.6 (Some typical products and their dimensions)
Let

$$
\mathbf{A} = \begin{bmatrix} 1 & -2 & 3 \\ 2 & 4 & -1 \end{bmatrix} \quad
\mathbf{b} = \begin{bmatrix} 7 \\ -3 \\ 6 \end{bmatrix} \quad
\mathbf{c} = \begin{bmatrix} 5 \\ 8 \\ -4 \end{bmatrix} \quad
\mathbf{d} = \begin{bmatrix} 2 \\ 9 \end{bmatrix}
$$

Then Ab, bc', b'c, and d'Ab are typical products.

$$
\mathbf{A}\mathbf{b} = \begin{bmatrix} 1(7) + (-2)(-3) + 3(6) \\ 2(7) + 4(-3) + (-1)(6) \end{bmatrix} = \begin{bmatrix} 31 \\ -4 \end{bmatrix}
$$

The product Ab is a vector with dimension equal to the number of rows of A.

$$
\mathbf{b}'\mathbf{c} = \begin{bmatrix} 7 & -3 & 6 \end{bmatrix}\begin{bmatrix} 5 \\ 8 \\ -4 \end{bmatrix} = [-13]
$$

The product b'c is a 1 × 1 vector or a single number, here -13.

$$
\mathbf{b}\mathbf{c}' = \begin{bmatrix} 7 \\ -3 \\ 6 \end{bmatrix}\begin{bmatrix} 5 & 8 & -4 \end{bmatrix} = \begin{bmatrix} 35 & 56 & -28 \\ -15 & -24 & 12 \\ 30 & 48 & -24 \end{bmatrix}
$$

The product bc' is a matrix whose row dimension equals the dimension of b and whose column dimension equals that of c. This product is unlike b'c, which is a single number.

The product d'Ab is a 1 × 1 vector or a single number, here 26.

•
Square matrices will be of special importance in our development of statistical methods. A square matrix is said to be symmetric if A = A' or, equivalently, $a_{ij} = a_{ji}$ for all $i$ and $j$.

Example 2.7 (A symmetric matrix)

The matrix
is symmetric; the matrix
is not symmetric.
•
When two square matrices A and B are of the same dimension, both products AB and BA are defined, although they need not be equal. (See Supplement 2A.) If we let I denote the square matrix with ones on the diagonal and zeros elsewhere, it follows from the definition of matrix multiplication that the $(i, j)$th entry of AI is

$$
a_{i1} \times 0 + \cdots + a_{i,j-1} \times 0 + a_{ij} \times 1 + a_{i,j+1} \times 0 + \cdots + a_{ik} \times 0 = a_{ij}
$$

so AI = A. Similarly, IA = A, so

$$
\mathop{\mathbf{I}}_{(k\times k)}\mathop{\mathbf{A}}_{(k\times k)} = \mathop{\mathbf{A}}_{(k\times k)}\mathop{\mathbf{I}}_{(k\times k)} = \mathop{\mathbf{A}}_{(k\times k)} \quad \text{for any } \mathop{\mathbf{A}}_{(k\times k)} \tag{2-11}
$$

The matrix I acts like 1 in ordinary multiplication $(1 \cdot a = a \cdot 1 = a)$, so it is called the identity matrix. The fundamental scalar relation about the existence of an inverse number $a^{-1}$ such that $a^{-1}a = aa^{-1} = 1$ if $a \neq 0$ has the following matrix algebra extension: If there exists a matrix B such that

$$
\mathop{\mathbf{B}}_{(k\times k)}\mathop{\mathbf{A}}_{(k\times k)} = \mathop{\mathbf{A}}_{(k\times k)}\mathop{\mathbf{B}}_{(k\times k)} = \mathop{\mathbf{I}}_{(k\times k)}
$$

then B is called the inverse of A and is denoted by $\mathbf{A}^{-1}$.

The technical condition that an inverse exists is that the $k$ columns $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_k$ of A are linearly independent. That is, the existence of $\mathbf{A}^{-1}$ is equivalent to

$$
c_1\mathbf{a}_1 + c_2\mathbf{a}_2 + \cdots + c_k\mathbf{a}_k = \mathbf{0} \quad \text{only if } c_1 = \cdots = c_k = 0 \tag{2-12}
$$

(See Result 2A.9 in Supplement 2A.)

Example 2.8 (The existence of a matrix inverse)
For

$$
\mathbf{A} = \begin{bmatrix} 3 & 2 \\ 4 & 1 \end{bmatrix}
$$

you may verify that

$$
\begin{bmatrix} -.2 & .4 \\ .8 & -.6 \end{bmatrix}\begin{bmatrix} 3 & 2 \\ 4 & 1 \end{bmatrix}
= \begin{bmatrix} (-.2)3 + (.4)4 & (-.2)2 + (.4)1 \\ (.8)3 + (-.6)4 & (.8)2 + (-.6)1 \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
$$

so

$$
\begin{bmatrix} -.2 & .4 \\ .8 & -.6 \end{bmatrix}
$$
is $\mathbf{A}^{-1}$. We note that

$$
c_1\begin{bmatrix} 3 \\ 4 \end{bmatrix} + c_2\begin{bmatrix} 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
$$

implies that $c_1 = c_2 = 0$, so the columns of A are linearly independent. This confirms the condition stated in (2-12). •

A method for computing an inverse, when one exists, is given in Supplement 2A. The routine, but lengthy, calculations are usually relegated to a computer, especially when the dimension is greater than three. Even so, you must be forewarned that if the column sum in (2-12) is nearly 0 for some constants $c_1, \ldots, c_k$, then the computer may produce incorrect inverses due to extreme errors in rounding. It is always good to check the products $\mathbf{A}\mathbf{A}^{-1}$ and $\mathbf{A}^{-1}\mathbf{A}$ for equality with I when $\mathbf{A}^{-1}$ is produced by a computer package. (See Exercise 2.10.)

Diagonal matrices have inverses that are easy to compute. For example,

$$
\begin{bmatrix} a_{11} & 0 & 0 & 0 & 0 \\ 0 & a_{22} & 0 & 0 & 0 \\ 0 & 0 & a_{33} & 0 & 0 \\ 0 & 0 & 0 & a_{44} & 0 \\ 0 & 0 & 0 & 0 & a_{55} \end{bmatrix}
\quad \text{has inverse} \quad
\begin{bmatrix} \frac{1}{a_{11}} & 0 & 0 & 0 & 0 \\ 0 & \frac{1}{a_{22}} & 0 & 0 & 0 \\ 0 & 0 & \frac{1}{a_{33}} & 0 & 0 \\ 0 & 0 & 0 & \frac{1}{a_{44}} & 0 \\ 0 & 0 & 0 & 0 & \frac{1}{a_{55}} \end{bmatrix}
$$

if all the $a_{ii} \neq 0$.

Another special class of square matrices with which we shall become familiar are the orthogonal matrices, characterized by
$$
\mathbf{Q}\mathbf{Q}' = \mathbf{Q}'\mathbf{Q} = \mathbf{I} \quad \text{or} \quad \mathbf{Q}' = \mathbf{Q}^{-1} \tag{2-13}
$$
The name derives from the property that if Q has $i$th row $\mathbf{q}_i'$, then $\mathbf{Q}\mathbf{Q}' = \mathbf{I}$ implies that $\mathbf{q}_i'\mathbf{q}_i = 1$ and $\mathbf{q}_i'\mathbf{q}_j = 0$ for $i \neq j$, so the rows have unit length and are mutually perpendicular (orthogonal). According to the condition $\mathbf{Q}'\mathbf{Q} = \mathbf{I}$, the columns have the same property.

We conclude our brief introduction to the elements of matrix algebra by introducing a concept fundamental to multivariate statistical analysis. A square matrix A is said to have an eigenvalue $\lambda$, with corresponding eigenvector $\mathbf{x} \neq \mathbf{0}$, if

$$
\mathbf{A}\mathbf{x} = \lambda\mathbf{x} \tag{2-14}
$$

Ordinarily, we normalize x so that it has length unity; that is, $1 = \mathbf{x}'\mathbf{x}$. It is convenient to denote normalized eigenvectors by e, and we do so in what follows. Sparing you the details of the derivation (see [1]), we state the following basic result:
Example 2.9
(Verifying eigenvalues and eigenvectors)

Let

$$
\mathbf{A} = \begin{bmatrix} 1 & -5 \\ -5 & 1 \end{bmatrix}
$$

Then, since

$$
\begin{bmatrix} 1 & -5 \\ -5 & 1 \end{bmatrix}\begin{bmatrix} \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} \end{bmatrix}
= \begin{bmatrix} \frac{6}{\sqrt{2}} \\ -\frac{6}{\sqrt{2}} \end{bmatrix}
= 6\begin{bmatrix} \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} \end{bmatrix}
$$

$\lambda_1 = 6$ is an eigenvalue, and
$$
\mathbf{e}_1 = \begin{bmatrix} \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} \end{bmatrix}
$$

is its corresponding normalized eigenvector. You may wish to show that a second eigenvalue-eigenvector pair is $\lambda_2 = -4$, $\mathbf{e}_2' = [1/\sqrt{2},\ 1/\sqrt{2}]$. •

A method for calculating the $\lambda$'s and e's is described in Supplement 2A. It is instructive to do a few sample calculations to understand the technique. We usually rely on a computer when the dimension of the square matrix is greater than two or three.

2.3 POSITIVE DEFINITE MATRICES
The study of the variation and interrelationships in multivariate data is often based upon distances and the assumption that the data are multivariate normally distributed. Squared distances (see Chapter 1) and the multivariate normal density can be ex pressed in terms of matrix products called quadratic forms (see Chapter 4). Conse quently, it should not be surprising that quadratic forms play a central role in multivariate analysis. In this section, we consider quadratic forms that are always nonnegative and the associated positive definite matrices. Results involving quadratic forms and symmetric matrices are, in many cases, a direct consequence of an expansion for symmetric matrices known as the spectral
decomposition.¹ The spectral decomposition of a $k \times k$ symmetric matrix A is given by

$$
\mathop{\mathbf{A}}_{(k\times k)} = \lambda_1\mathbf{e}_1\mathbf{e}_1' + \lambda_2\mathbf{e}_2\mathbf{e}_2' + \cdots + \lambda_k\mathbf{e}_k\mathbf{e}_k' \tag{2-16}
$$

where $\lambda_1, \lambda_2, \ldots, \lambda_k$ are the eigenvalues of A and $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_k$ are the associated normalized eigenvectors. (See also Result 2A.14 in Supplement 2A.) Thus, $\mathbf{e}_i'\mathbf{e}_i = 1$ for $i = 1, 2, \ldots, k$, and $\mathbf{e}_i'\mathbf{e}_j = 0$ for $i \neq j$.
Example 2.10 (The spectral decomposition of a matrix)

Consider the symmetric matrix

$$
\mathbf{A} = \begin{bmatrix} 13 & -4 & 2 \\ -4 & 13 & -2 \\ 2 & -2 & 10 \end{bmatrix}
$$

The eigenvalues obtained from the characteristic equation $|\mathbf{A} - \lambda\mathbf{I}| = 0$ are $\lambda_1 = 9$, $\lambda_2 = 9$, and $\lambda_3 = 18$ (Definition 2A.30). The corresponding eigenvectors $\mathbf{e}_1$, $\mathbf{e}_2$, and $\mathbf{e}_3$ are the (normalized) solutions of the equations $\mathbf{A}\mathbf{e}_i = \lambda_i\mathbf{e}_i$ for $i = 1, 2, 3$. Thus, $\mathbf{A}\mathbf{e}_1 = \lambda\mathbf{e}_1$ gives

$$
\begin{bmatrix} 13 & -4 & 2 \\ -4 & 13 & -2 \\ 2 & -2 & 10 \end{bmatrix}\begin{bmatrix} e_{11} \\ e_{21} \\ e_{31} \end{bmatrix} = 9\begin{bmatrix} e_{11} \\ e_{21} \\ e_{31} \end{bmatrix}
$$

or

$$
\begin{aligned}
13e_{11} - 4e_{21} + 2e_{31} &= 9e_{11} \\
-4e_{11} + 13e_{21} - 2e_{31} &= 9e_{21} \\
2e_{11} - 2e_{21} + 10e_{31} &= 9e_{31}
\end{aligned}
$$

Moving the terms on the right of the equals sign to the left yields three homogeneous equations in three unknowns, but two of the equations are redundant. Selecting one of the equations and arbitrarily setting $e_{11} = 1$ and $e_{21} = 1$, we find that $e_{31} = 0$. Consequently, the normalized eigenvector is $\mathbf{e}_1' = [1/\sqrt{1^2 + 1^2 + 0^2},\ 1/\sqrt{1^2 + 1^2 + 0^2},\ 0/\sqrt{1^2 + 1^2 + 0^2}] = [1/\sqrt{2},\ 1/\sqrt{2},\ 0]$, since the sum of the squares of its elements is unity. You may verify that $\mathbf{e}_2' = [1/\sqrt{18},\ -1/\sqrt{18},\ -4/\sqrt{18}]$ is also an eigenvector for $9 = \lambda_2$, and $\mathbf{e}_3' = [2/3,\ -2/3,\ 1/3]$ is the normalized eigenvector corresponding to the eigenvalue $\lambda_3 = 18$. Moreover, $\mathbf{e}_i'\mathbf{e}_j = 0$ for $i \neq j$.

---
¹ A proof of Equation (2-16) is beyond the scope of this book. The interested reader will find a proof in [6], Chapter 8.
The spectral decomposition of A is then

$$
\mathbf{A} = \lambda_1\mathbf{e}_1\mathbf{e}_1' + \lambda_2\mathbf{e}_2\mathbf{e}_2' + \lambda_3\mathbf{e}_3\mathbf{e}_3'
$$

or

$$
\begin{bmatrix} 13 & -4 & 2 \\ -4 & 13 & -2 \\ 2 & -2 & 10 \end{bmatrix}
= 9\begin{bmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \\ 0 \end{bmatrix}\begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \end{bmatrix}
+ 9\begin{bmatrix} \frac{1}{\sqrt{18}} \\ \frac{-1}{\sqrt{18}} \\ \frac{-4}{\sqrt{18}} \end{bmatrix}\begin{bmatrix} \frac{1}{\sqrt{18}} & \frac{-1}{\sqrt{18}} & \frac{-4}{\sqrt{18}} \end{bmatrix}
+ 18\begin{bmatrix} \frac{2}{3} \\ \frac{-2}{3} \\ \frac{1}{3} \end{bmatrix}\begin{bmatrix} \frac{2}{3} & \frac{-2}{3} & \frac{1}{3} \end{bmatrix}
$$

$$
= 9\begin{bmatrix} \frac{1}{2} & \frac{1}{2} & 0 \\ \frac{1}{2} & \frac{1}{2} & 0 \\ 0 & 0 & 0 \end{bmatrix}
+ 9\begin{bmatrix} \frac{1}{18} & \frac{-1}{18} & \frac{-4}{18} \\ \frac{-1}{18} & \frac{1}{18} & \frac{4}{18} \\ \frac{-4}{18} & \frac{4}{18} & \frac{16}{18} \end{bmatrix}
+ 18\begin{bmatrix} \frac{4}{9} & \frac{-4}{9} & \frac{2}{9} \\ \frac{-4}{9} & \frac{4}{9} & \frac{-2}{9} \\ \frac{2}{9} & \frac{-2}{9} & \frac{1}{9} \end{bmatrix}
$$

as you may readily verify. •
The spectral decomposition is an important analytical tool. With it, we are very easily able to demonstrate certain statistical results. The first of these is a matrix explanation of distance, which we now develop.

Because x'Ax has only squared terms $x_i^2$ and product terms $x_i x_k$, it is called a quadratic form. When a $k \times k$ symmetric matrix A is such that

$$
0 \le \mathbf{x}'\mathbf{A}\mathbf{x} \tag{2-17}
$$

for all $\mathbf{x}' = [x_1, x_2, \ldots, x_k]$, both the matrix A and the quadratic form are said to be nonnegative definite. If equality holds in (2-17) only for the vector $\mathbf{x}' = [0, 0, \ldots, 0]$, then A or the quadratic form is said to be positive definite. In other words, A is positive definite if

$$
0 < \mathbf{x}'\mathbf{A}\mathbf{x} \tag{2-18}
$$

for all vectors $\mathbf{x} \neq \mathbf{0}$.
Example 2.11 (A positive definite matrix and quadratic form)

Show that the matrix for the following quadratic form is positive definite:

$$
3x_1^2 + 2x_2^2 - 2\sqrt{2}\,x_1 x_2
$$

To illustrate the general approach, we first write the quadratic form in matrix notation as

$$
\begin{bmatrix} x_1 & x_2 \end{bmatrix}\begin{bmatrix} 3 & -\sqrt{2} \\ -\sqrt{2} & 2 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \mathbf{x}'\mathbf{A}\mathbf{x}
$$

By Definition 2A.30, the eigenvalues of A are the solutions of the equation $|\mathbf{A} - \lambda\mathbf{I}| = 0$, or $(3 - \lambda)(2 - \lambda) - 2 = 0$. The solutions are $\lambda_1 = 4$ and $\lambda_2 = 1$. Using the spectral decomposition in (2-16), we can write

$$
\mathop{\mathbf{A}}_{(2\times2)} = \lambda_1\mathbf{e}_1\mathbf{e}_1' + \lambda_2\mathbf{e}_2\mathbf{e}_2' = 4\mathbf{e}_1\mathbf{e}_1' + \mathbf{e}_2\mathbf{e}_2'
$$

where $\mathbf{e}_1$ and $\mathbf{e}_2$ are the normalized and orthogonal eigenvectors associated with the eigenvalues $\lambda_1 = 4$ and $\lambda_2 = 1$, respectively. Because 4 and 1 are scalars, premultiplication and postmultiplication of A by x' and x, respectively, where $\mathbf{x}' = [x_1, x_2]$ is any nonzero vector, give

$$
\mathbf{x}'\mathbf{A}\mathbf{x} = 4\mathbf{x}'\mathbf{e}_1\mathbf{e}_1'\mathbf{x} + \mathbf{x}'\mathbf{e}_2\mathbf{e}_2'\mathbf{x} = 4y_1^2 + y_2^2 \ge 0
$$

with $y_1 = \mathbf{x}'\mathbf{e}_1 = \mathbf{e}_1'\mathbf{x}$ and $y_2 = \mathbf{x}'\mathbf{e}_2 = \mathbf{e}_2'\mathbf{x}$. We now show that $y_1$ and $y_2$ are not both zero and, consequently, that $\mathbf{x}'\mathbf{A}\mathbf{x} = 4y_1^2 + y_2^2 > 0$, or A is positive definite.

From the definitions of $y_1$ and $y_2$, we have

$$
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} \mathbf{e}_1' \\ \mathbf{e}_2' \end{bmatrix}\mathbf{x}
\quad \text{or} \quad
\mathop{\mathbf{y}}_{(2\times1)} = \mathop{\mathbf{E}}_{(2\times2)}\mathop{\mathbf{x}}_{(2\times1)}
$$

Now E is an orthogonal matrix and hence has inverse E'. Thus, $\mathbf{x} = \mathbf{E}'\mathbf{y}$. But x is a nonzero vector, and $\mathbf{0} \neq \mathbf{x} = \mathbf{E}'\mathbf{y}$ implies that $\mathbf{y} \neq \mathbf{0}$. •

Using the spectral decomposition, we can easily show that a $k \times k$ symmetric matrix A is a positive definite matrix if and only if every eigenvalue of A is positive. (See Exercise 2.17.) A is a nonnegative definite matrix if and only if all of its eigenvalues are greater than or equal to zero.

Assume for the moment that the p elements $x_1, x_2, \ldots, x_p$ of a vector x are realizations of p random variables $X_1, X_2, \ldots, X_p$. As we pointed out in Chapter 1, we
can regard these elements as the coordinates of a point in p-dimensional space, and the "distance" of the point $[x_1, x_2, \ldots, x_p]$ to the origin can, and in this case should, be interpreted in terms of standard deviation units. In this way, we can account for the inherent uncertainty (variability) in the observations. Points with the same associated "uncertainty" are regarded as being at the same distance from the origin.

If we use the distance formula introduced in Chapter 1 [see Equation (1-22)], the distance from the origin satisfies the general formula

$$
(\text{distance})^2 = a_{11}x_1^2 + a_{22}x_2^2 + \cdots + a_{pp}x_p^2 + 2(a_{12}x_1x_2 + a_{13}x_1x_3 + \cdots + a_{p-1,p}\,x_{p-1}x_p)
$$

provided that $(\text{distance})^2 > 0$ for all $[x_1, x_2, \ldots, x_p] \neq [0, 0, \ldots, 0]$. Setting $a_{ij} = a_{ji}$, $i \neq j$, $i = 1, 2, \ldots, p$, $j = 1, 2, \ldots, p$, we have

$$
0 < (\text{distance})^2 = \begin{bmatrix} x_1 & x_2 & \cdots & x_p \end{bmatrix}
\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ a_{21} & a_{22} & \cdots & a_{2p} \\ \vdots & \vdots & & \vdots \\ a_{p1} & a_{p2} & \cdots & a_{pp} \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix}
$$

or, for $\mathbf{x} \neq \mathbf{0}$,

$$
0 < (\text{distance})^2 = \mathbf{x}'\mathbf{A}\mathbf{x} \tag{2-19}
$$

From (2-19), we see that the $p \times p$ symmetric matrix A is positive definite. In sum, distance is determined from a positive definite quadratic form x'Ax. Conversely, a positive definite quadratic form can be interpreted as a squared distance.
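The eigenvalue test for positive definiteness can be sketched for the 2 × 2 matrix of Example 2.11 (the function name is ours):

```python
import math

def eigvals_sym2(a, b, c):
    # eigenvalues of the symmetric 2x2 matrix [[a, b], [b, c]]:
    # roots of (a - lam)(c - lam) - b^2 = 0
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2

# matrix of the quadratic form 3 x1^2 + 2 x2^2 - 2 sqrt(2) x1 x2
l1, l2 = eigvals_sym2(3.0, -math.sqrt(2), 2.0)
print(l1, l2)               # 4.0 and 1.0, as in Example 2.11
print(l1 > 0 and l2 > 0)    # both positive => positive definite
```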
Comment. Let the square of the distance from the point $\mathbf{x}' = [x_1, x_2, \ldots, x_p]$ to the origin be given by x'Ax, where A is a $p \times p$ symmetric positive definite matrix. Then the square of the distance from x to an arbitrary fixed point $\boldsymbol{\mu}' = [\mu_1, \mu_2, \ldots, \mu_p]$ is given by the general expression $(\mathbf{x} - \boldsymbol{\mu})'\mathbf{A}(\mathbf{x} - \boldsymbol{\mu})$.

Expressing distance as the square root of a positive definite quadratic form allows us to give a geometrical interpretation based on the eigenvalues and eigenvectors of the matrix A. For example, suppose $p = 2$. Then the points $\mathbf{x}' = [x_1, x_2]$ of constant distance c from the origin satisfy $\mathbf{x}'\mathbf{A}\mathbf{x} = c^2$. By the spectral decomposition, as in Example 2.11,

$$
\mathbf{A} = \lambda_1\mathbf{e}_1\mathbf{e}_1' + \lambda_2\mathbf{e}_2\mathbf{e}_2' \quad \text{so} \quad \mathbf{x}'\mathbf{A}\mathbf{x} = \lambda_1(\mathbf{x}'\mathbf{e}_1)^2 + \lambda_2(\mathbf{x}'\mathbf{e}_2)^2
$$

Now, $c^2 = \lambda_1 y_1^2 + \lambda_2 y_2^2$ is an ellipse in $y_1 = \mathbf{x}'\mathbf{e}_1$ and $y_2 = \mathbf{x}'\mathbf{e}_2$ because $\lambda_1, \lambda_2 > 0$ when A is positive definite. (See Exercise 2.17.) We easily verify that $\mathbf{x} = c\lambda_1^{-1/2}\mathbf{e}_1$ satisfies $\mathbf{x}'\mathbf{A}\mathbf{x} = \lambda_1(c\lambda_1^{-1/2}\mathbf{e}_1'\mathbf{e}_1)^2 = c^2$. Similarly, $\mathbf{x} = c\lambda_2^{-1/2}\mathbf{e}_2$ gives the appropriate distance in the $\mathbf{e}_2$ direction. Thus, the points at distance c lie on an ellipse whose axes are given by the eigenvectors of A, with lengths proportional to the reciprocals of the square roots of the eigenvalues. The constant of proportionality is c. The situation is illustrated in Figure 2.6.
Figure 2.6 Points a constant distance c from the origin (p = 2, 1 ≤ λ₁ < λ₂).

If $p > 2$, the points $\mathbf{x}' = [x_1, x_2, \ldots, x_p]$ a constant distance $c = \sqrt{\mathbf{x}'\mathbf{A}\mathbf{x}}$ from the origin lie on hyperellipsoids $c^2 = \lambda_1(\mathbf{x}'\mathbf{e}_1)^2 + \cdots + \lambda_p(\mathbf{x}'\mathbf{e}_p)^2$, whose axes are given by the eigenvectors of A. The half-length in the direction $\mathbf{e}_i$ is equal to $c/\sqrt{\lambda_i}$, $i = 1, 2, \ldots, p$, where $\lambda_1, \lambda_2, \ldots, \lambda_p$ are the eigenvalues of A.
2.4 A SQUARE-ROOT MATRIX

The spectral decomposition allows us to express the inverse of a square matrix in terms of its eigenvalues and eigenvectors, and this leads to a useful square-root matrix.

Let A be a $k \times k$ positive definite matrix with the spectral decomposition $\mathbf{A} = \sum_{i=1}^{k} \lambda_i\mathbf{e}_i\mathbf{e}_i'$. Let the normalized eigenvectors be the columns of another matrix $\mathbf{P} = [\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_k]$. Then

$$
\mathop{\mathbf{A}}_{(k\times k)} = \sum_{i=1}^{k} \lambda_i\mathbf{e}_i\mathbf{e}_i' = \mathbf{P}\boldsymbol{\Lambda}\mathbf{P}' \tag{2-20}
$$

where $\mathbf{P}\mathbf{P}' = \mathbf{P}'\mathbf{P} = \mathbf{I}$ and $\boldsymbol{\Lambda}$ is the diagonal matrix

$$
\mathop{\boldsymbol{\Lambda}}_{(k\times k)} = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_k \end{bmatrix} \quad \text{with } \lambda_i > 0
$$
Thus,

$$
\mathbf{A}^{-1} = \mathbf{P}\boldsymbol{\Lambda}^{-1}\mathbf{P}' = \sum_{i=1}^{k} \frac{1}{\lambda_i}\,\mathbf{e}_i\mathbf{e}_i' \tag{2-21}
$$

since $(\mathbf{P}\boldsymbol{\Lambda}^{-1}\mathbf{P}')\mathbf{P}\boldsymbol{\Lambda}\mathbf{P}' = \mathbf{P}\boldsymbol{\Lambda}\mathbf{P}'(\mathbf{P}\boldsymbol{\Lambda}^{-1}\mathbf{P}') = \mathbf{P}\mathbf{P}' = \mathbf{I}$.

Next, let $\boldsymbol{\Lambda}^{1/2}$ denote the diagonal matrix with $\sqrt{\lambda_i}$ as the $i$th diagonal element. The matrix

$$
\mathbf{A}^{1/2} = \sum_{i=1}^{k} \sqrt{\lambda_i}\,\mathbf{e}_i\mathbf{e}_i' = \mathbf{P}\boldsymbol{\Lambda}^{1/2}\mathbf{P}' \tag{2-22}
$$

is called the square root of A and is denoted by $\mathbf{A}^{1/2}$.
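A sketch of the square-root construction (2-22) in pure Python, using the 2 × 2 matrix and eigenpairs from Example 2.11 (the helper names are ours):

```python
import math

def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def matmul(A, B):
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# A = [[3, -sqrt(2)], [-sqrt(2), 2]] from Example 2.11:
# eigenvalues 4 and 1 with the normalized eigenvectors below
s2, s3 = math.sqrt(2), math.sqrt(3)
lams = [4.0, 1.0]
es = [[s2 / s3, -1 / s3], [1 / s3, s2 / s3]]

# A^{1/2} = sum_i sqrt(lambda_i) e_i e_i', equation (2-22)
half = [[0.0, 0.0], [0.0, 0.0]]
for lam, e in zip(lams, es):
    M = outer(e, e)
    half = [[half[i][j] + math.sqrt(lam) * M[i][j] for j in range(2)]
            for i in range(2)]

A_back = matmul(half, half)   # should recover A = [[3, -s2], [-s2, 2]]
print([[round(a, 10) for a in row] for row in A_back])
```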
2.5 RANDOM VECTORS AND MATRICES

A random vector is a vector whose elements are random variables. Similarly, a random matrix is a matrix whose elements are random variables. The expected value of a random matrix (or vector) is the matrix (vector) consisting of the expected values of each of its elements. Specifically, let $\mathbf{X} = \{X_{ij}\}$ be an $n \times p$ random matrix. Then the expected value of X, denoted by E(X), is the $n \times p$ matrix of numbers (if they exist)

$$
E(\mathbf{X}) = \begin{bmatrix} E(X_{11}) & E(X_{12}) & \cdots & E(X_{1p}) \\ E(X_{21}) & E(X_{22}) & \cdots & E(X_{2p}) \\ \vdots & \vdots & & \vdots \\ E(X_{n1}) & E(X_{n2}) & \cdots & E(X_{np}) \end{bmatrix} \tag{2-23}
$$

where, for each element of the matrix,²

---
² If you are unfamiliar with calculus, you should concentrate on the interpretation of the expected value and, eventually, variance. Our development is based primarily on the properties of expectation rather than its particular evaluation for continuous or discrete random variables.
$$
E(X_{ij}) = \begin{cases}
\displaystyle\int_{-\infty}^{\infty} x_{ij}\,f_{ij}(x_{ij})\,dx_{ij} & \text{if } X_{ij} \text{ is a continuous random variable with probability density function } f_{ij}(x_{ij}) \\[6pt]
\displaystyle\sum_{\text{all } x_{ij}} x_{ij}\,p_{ij}(x_{ij}) & \text{if } X_{ij} \text{ is a discrete random variable with probability function } p_{ij}(x_{ij})
\end{cases}
$$
Example 2.12 (Computing expected values for discrete random variables)

Suppose $p = 2$ and $n = 1$, and consider the random vector $\mathbf{X}' = [X_1, X_2]$. Let the discrete random variable $X_1$ have the following probability function:

    x1          -1     0     1
    p1(x1)      .3    .3    .4

Then $E(X_1) = \sum_{\text{all } x_1} x_1 p_1(x_1) = (-1)(.3) + (0)(.3) + (1)(.4) = .1$.

Similarly, let the discrete random variable $X_2$ have the probability function

    x2           0     1
    p2(x2)      .8    .2

Then $E(X_2) = \sum_{\text{all } x_2} x_2 p_2(x_2) = (0)(.8) + (1)(.2) = .2$.

Thus,

$$
E(\mathbf{X}) = \begin{bmatrix} E(X_1) \\ E(X_2) \end{bmatrix} = \begin{bmatrix} .1 \\ .2 \end{bmatrix}
$$

•
Two results involving the expectation of sums and products of matrices follow directly from the definition of the expected value of a random matrix and the univariate properties of expectation, $E(X_1 + Y_1) = E(X_1) + E(Y_1)$ and $E(cX_1) = cE(X_1)$. Let X and Y be random matrices of the same dimension, and let A and B be conformable matrices of constants. Then (see Exercise 2.40)

$$
E(\mathbf{X} + \mathbf{Y}) = E(\mathbf{X}) + E(\mathbf{Y}), \qquad E(\mathbf{A}\mathbf{X}\mathbf{B}) = \mathbf{A}\,E(\mathbf{X})\,\mathbf{B} \tag{2-24}
$$

2.6 MEAN VECTORS AND COVARIANCE MATRICES
Suppose $\mathbf{X}' = [X_1, X_2, \ldots, X_p]$ is a $p \times 1$ random vector. Then each element of X is a random variable with its own marginal probability distribution. (See Example 2.12.) The
marginal means $\mu_i$ and variances $\sigma_i^2$ are defined as $\mu_i = E(X_i)$ and $\sigma_i^2 = E(X_i - \mu_i)^2$, $i = 1, 2, \ldots, p$, respectively. Specifically,

$$
\mu_i = \begin{cases}
\displaystyle\int_{-\infty}^{\infty} x_i\,f_i(x_i)\,dx_i & \text{if } X_i \text{ is a continuous random variable with probability density function } f_i(x_i) \\[6pt]
\displaystyle\sum_{\text{all } x_i} x_i\,p_i(x_i) & \text{if } X_i \text{ is a discrete random variable with probability function } p_i(x_i)
\end{cases}
$$

$$
\sigma_i^2 = \begin{cases}
\displaystyle\int_{-\infty}^{\infty} (x_i - \mu_i)^2 f_i(x_i)\,dx_i & \text{if } X_i \text{ is continuous} \\[6pt]
\displaystyle\sum_{\text{all } x_i} (x_i - \mu_i)^2 p_i(x_i) & \text{if } X_i \text{ is discrete}
\end{cases} \tag{2-25}
$$

It will be convenient in later sections to denote the marginal variances by $\sigma_{ii}$ rather than the more traditional $\sigma_i^2$, and consequently, we shall adopt this notation.

The behavior of any pair of random variables, such as $X_i$ and $X_k$, is described by their joint probability function, and a measure of the linear association between them is provided by the covariance

$$
\sigma_{ik} = E(X_i - \mu_i)(X_k - \mu_k) = \begin{cases}
\displaystyle\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} (x_i - \mu_i)(x_k - \mu_k)\,f_{ik}(x_i, x_k)\,dx_i\,dx_k & \text{if } X_i, X_k \text{ are continuous random variables with joint density } f_{ik}(x_i, x_k) \\[6pt]
\displaystyle\sum_{\text{all } x_i}\sum_{\text{all } x_k} (x_i - \mu_i)(x_k - \mu_k)\,p_{ik}(x_i, x_k) & \text{if } X_i, X_k \text{ are discrete random variables with joint probability function } p_{ik}(x_i, x_k)
\end{cases} \tag{2-26}
$$

and $\mu_i$ and $\mu_k$, $i, k = 1, 2, \ldots, p$, are the marginal means. When $i = k$, the covariance becomes the marginal variance.

More generally, the collective behavior of the p random variables $X_1, X_2, \ldots, X_p$ or, equivalently, the random vector $\mathbf{X}' = [X_1, X_2, \ldots, X_p]$, is described by a joint probability density function $f(x_1, x_2, \ldots, x_p) = f(\mathbf{x})$. As we have already noted in this book, $f(\mathbf{x})$ will often be the multivariate normal density function. (See Chapter 4.)

If the joint probability $P[X_i \le x_i \text{ and } X_k \le x_k]$ can be written as the product of the corresponding marginal probabilities, so that

$$
P[X_i \le x_i \text{ and } X_k \le x_k] = P[X_i \le x_i]\,P[X_k \le x_k] \tag{2-27}
$$
for all pairs of values $x_i, x_k$, then $X_i$ and $X_k$ are said to be statistically independent. When $X_i$ and $X_k$ are continuous random variables with joint density $f_{ik}(x_i, x_k)$ and marginal densities $f_i(x_i)$ and $f_k(x_k)$, the independence condition becomes $f_{ik}(x_i, x_k) = f_i(x_i)f_k(x_k)$ for all pairs $(x_i, x_k)$.

The p continuous random variables $X_1, X_2, \ldots, X_p$ are mutually statistically independent if their joint density can be factored as

$$
f_{12\cdots p}(x_1, x_2, \ldots, x_p) = f_1(x_1)f_2(x_2)\cdots f_p(x_p) \tag{2-28}
$$

for all p-tuples $(x_1, x_2, \ldots, x_p)$.

Statistical independence has an important implication for covariance. The factorization in (2-28) implies that $\text{Cov}(X_i, X_k) = 0$. Thus,

$$
\text{Cov}(X_i, X_k) = 0 \quad \text{if } X_i \text{ and } X_k \text{ are independent} \tag{2-29}
$$

The converse of (2-29) is not true in general; there are situations where $\text{Cov}(X_i, X_k) = 0$, but $X_i$ and $X_k$ are not independent. (See [2].)

The means and covariances of the $p \times 1$ random vector X can be set out as matrices. The expected value of each element is contained in the vector of means $\boldsymbol{\mu} = E(\mathbf{X})$, and the p variances $\sigma_{ii}$ and the $p(p-1)/2$ distinct covariances $\sigma_{ik}$ $(i < k)$ are contained in the symmetric variance-covariance matrix $\boldsymbol{\Sigma} = E(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})'$. Specifically,
$$
E(\mathbf{X}) = \begin{bmatrix} E(X_1) \\ E(X_2) \\ \vdots \\ E(X_p) \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{bmatrix} = \boldsymbol{\mu} \tag{2-30}
$$

and

$$
\boldsymbol{\Sigma} = E(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})'
= E\begin{bmatrix}
(X_1 - \mu_1)^2 & (X_1 - \mu_1)(X_2 - \mu_2) & \cdots & (X_1 - \mu_1)(X_p - \mu_p) \\
(X_2 - \mu_2)(X_1 - \mu_1) & (X_2 - \mu_2)^2 & \cdots & (X_2 - \mu_2)(X_p - \mu_p) \\
\vdots & \vdots & & \vdots \\
(X_p - \mu_p)(X_1 - \mu_1) & (X_p - \mu_p)(X_2 - \mu_2) & \cdots & (X_p - \mu_p)^2
\end{bmatrix}
$$

or

$$
\boldsymbol{\Sigma} = \text{Cov}(\mathbf{X}) = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{bmatrix} \tag{2-31}
$$

Example 2.13 (Computing the covariance matrix)
Find the covariance matrix for the two random variables X1 and X2 introduced in Example 2.12 when their joint probability function p1 2 ( x 1 , x2 ) is represented by the entries in the body of the following table:
                  x2
    x1            0       1      p1(x1)
    -1           .24     .06      .3
     0           .16     .14      .3
     1           .40     .00      .4
    p2(x2)       .8      .2       1
We have already shown that $\mu_1 = E(X_1) = .1$ and $\mu_2 = E(X_2) = .2$. (See Example 2.12.) In addition,

$$
\sigma_{11} = E(X_1 - \mu_1)^2 = \sum_{\text{all } x_1} (x_1 - .1)^2 p_1(x_1) = (-1 - .1)^2(.3) + (0 - .1)^2(.3) + (1 - .1)^2(.4) = .69
$$

$$
\sigma_{22} = E(X_2 - \mu_2)^2 = \sum_{\text{all } x_2} (x_2 - .2)^2 p_2(x_2) = (0 - .2)^2(.8) + (1 - .2)^2(.2) = .16
$$

$$
\sigma_{12} = E(X_1 - \mu_1)(X_2 - \mu_2) = \sum_{\text{all pairs } (x_1, x_2)} (x_1 - .1)(x_2 - .2)\,p_{12}(x_1, x_2)
$$
$$
= (-1 - .1)(0 - .2)(.24) + (-1 - .1)(1 - .2)(.06) + \cdots + (1 - .1)(1 - .2)(.00) = -.08
$$

Consequently, with $\mathbf{X}' = [X_1, X_2]$,

$$
\boldsymbol{\mu} = E(\mathbf{X}) = \begin{bmatrix} E(X_1) \\ E(X_2) \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} = \begin{bmatrix} .1 \\ .2 \end{bmatrix}
$$

and

$$
\boldsymbol{\Sigma} = E(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})'
= \begin{bmatrix} E(X_1 - \mu_1)^2 & E(X_1 - \mu_1)(X_2 - \mu_2) \\ E(X_2 - \mu_2)(X_1 - \mu_1) & E(X_2 - \mu_2)^2 \end{bmatrix}
= \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix}
= \begin{bmatrix} .69 & -.08 \\ -.08 & .16 \end{bmatrix}
$$
•
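The tabular computation above can be checked with a short pure-Python sketch (the variable names are ours):

```python
# joint probability table of Example 2.13: p12[(x1, x2)]
p12 = {(-1, 0): .24, (-1, 1): .06,
       ( 0, 0): .16, ( 0, 1): .14,
       ( 1, 0): .40, ( 1, 1): .00}

mu1 = sum(x1 * p for (x1, x2), p in p12.items())
mu2 = sum(x2 * p for (x1, x2), p in p12.items())
s11 = sum((x1 - mu1) ** 2 * p for (x1, x2), p in p12.items())
s22 = sum((x2 - mu2) ** 2 * p for (x1, x2), p in p12.items())
s12 = sum((x1 - mu1) * (x2 - mu2) * p for (x1, x2), p in p12.items())

print(round(mu1, 2), round(mu2, 2))                 # .1 and .2
print(round(s11, 2), round(s22, 2), round(s12, 2))  # .69, .16, -.08
```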
We note that the computation of means, variances, and covariances for discrete random variables involves summation (as in Examples 2.12 and 2.13), while analogous computations for continuous random variables involve integration.

Because $\sigma_{ik} = E(X_i - \mu_i)(X_k - \mu_k) = \sigma_{ki}$, it is convenient to write the matrix appearing in (2-31) as

$$
\boldsymbol{\Sigma} = E(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})' = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{12} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & & \vdots \\ \sigma_{1p} & \sigma_{2p} & \cdots & \sigma_{pp} \end{bmatrix} \tag{2-32}
$$

We shall refer to $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ as the population mean (vector) and population variance-covariance (matrix), respectively. The multivariate normal distribution is completely specified once the mean vector $\boldsymbol{\mu}$ and variance-covariance matrix $\boldsymbol{\Sigma}$ are given (see Chapter 4), so it is not surprising that these quantities play an important role in many multivariate procedures.

It is frequently informative to separate the information contained in variances $\sigma_{ii}$ from that contained in measures of association and, in particular, the measure of association known as the population correlation coefficient $\rho_{ik}$. The correlation coefficient $\rho_{ik}$ is defined in terms of the covariance $\sigma_{ik}$ and variances $\sigma_{ii}$ and $\sigma_{kk}$ as

$$
\rho_{ik} = \frac{\sigma_{ik}}{\sqrt{\sigma_{ii}}\sqrt{\sigma_{kk}}} \tag{2-33}
$$

The correlation coefficient measures the amount of linear association between the random variables $X_i$ and $X_k$. (See, for example, [2].)
Let the population correlation matrix be the $p \times p$ symmetric matrix

$$
\boldsymbol{\rho} = \begin{bmatrix}
\frac{\sigma_{11}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{11}}} & \frac{\sigma_{12}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}} & \cdots & \frac{\sigma_{1p}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{pp}}} \\
\frac{\sigma_{12}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}} & \frac{\sigma_{22}}{\sqrt{\sigma_{22}}\sqrt{\sigma_{22}}} & \cdots & \frac{\sigma_{2p}}{\sqrt{\sigma_{22}}\sqrt{\sigma_{pp}}} \\
\vdots & \vdots & & \vdots \\
\frac{\sigma_{1p}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{pp}}} & \frac{\sigma_{2p}}{\sqrt{\sigma_{22}}\sqrt{\sigma_{pp}}} & \cdots & \frac{\sigma_{pp}}{\sqrt{\sigma_{pp}}\sqrt{\sigma_{pp}}}
\end{bmatrix}
= \begin{bmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{12} & 1 & \cdots & \rho_{2p} \\ \vdots & \vdots & & \vdots \\ \rho_{1p} & \rho_{2p} & \cdots & 1 \end{bmatrix} \tag{2-34}
$$

and let the $p \times p$ standard deviation matrix be

$$
\mathbf{V}^{1/2} = \begin{bmatrix} \sqrt{\sigma_{11}} & 0 & \cdots & 0 \\ 0 & \sqrt{\sigma_{22}} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \sqrt{\sigma_{pp}} \end{bmatrix} \tag{2-35}
$$

Then it is easily verified (see Exercise 2.23) that

$$
\mathbf{V}^{1/2}\boldsymbol{\rho}\,\mathbf{V}^{1/2} = \boldsymbol{\Sigma} \tag{2-36}
$$

and

$$
\boldsymbol{\rho} = (\mathbf{V}^{1/2})^{-1}\boldsymbol{\Sigma}(\mathbf{V}^{1/2})^{-1} \tag{2-37}
$$

That is, $\boldsymbol{\Sigma}$ can be obtained from $\mathbf{V}^{1/2}$ and $\boldsymbol{\rho}$, whereas $\boldsymbol{\rho}$ can be obtained from $\boldsymbol{\Sigma}$. Moreover, the expression of these relationships in terms of matrix operations allows the calculations to be conveniently implemented on a computer.

Example 2.14
(Computing the correlation matrix from the covariance matrix)

Suppose

$$
\boldsymbol{\Sigma} = \begin{bmatrix} 4 & 1 & 2 \\ 1 & 9 & -3 \\ 2 & -3 & 25 \end{bmatrix}
$$

Obtain $\mathbf{V}^{1/2}$ and $\boldsymbol{\rho}$.

Here

$$
\mathbf{V}^{1/2} = \begin{bmatrix} \sqrt{\sigma_{11}} & 0 & 0 \\ 0 & \sqrt{\sigma_{22}} & 0 \\ 0 & 0 & \sqrt{\sigma_{33}} \end{bmatrix} = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 5 \end{bmatrix}
\quad \text{and} \quad
(\mathbf{V}^{1/2})^{-1} = \begin{bmatrix} \frac{1}{2} & 0 & 0 \\ 0 & \frac{1}{3} & 0 \\ 0 & 0 & \frac{1}{5} \end{bmatrix}
$$

Consequently, from (2-37), the correlation matrix $\boldsymbol{\rho}$ is given by

$$
(\mathbf{V}^{1/2})^{-1}\boldsymbol{\Sigma}(\mathbf{V}^{1/2})^{-1}
= \begin{bmatrix} \frac{1}{2} & 0 & 0 \\ 0 & \frac{1}{3} & 0 \\ 0 & 0 & \frac{1}{5} \end{bmatrix}\begin{bmatrix} 4 & 1 & 2 \\ 1 & 9 & -3 \\ 2 & -3 & 25 \end{bmatrix}\begin{bmatrix} \frac{1}{2} & 0 & 0 \\ 0 & \frac{1}{3} & 0 \\ 0 & 0 & \frac{1}{5} \end{bmatrix}
= \begin{bmatrix} 1 & \frac{1}{6} & \frac{1}{5} \\ \frac{1}{6} & 1 & -\frac{1}{5} \\ \frac{1}{5} & -\frac{1}{5} & 1 \end{bmatrix}
$$

•
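Equation (2-33) applied entrywise reproduces Example 2.14 (the function name is ours; the covariance entries are as reconstructed above):

```python
import math

def corr_from_cov(S):
    # rho_ik = sigma_ik / (sqrt(sigma_ii) sqrt(sigma_kk)), equation (2-33)
    d = [math.sqrt(S[i][i]) for i in range(len(S))]
    return [[S[i][k] / (d[i] * d[k]) for k in range(len(S))]
            for i in range(len(S))]

Sigma = [[4, 1, 2],
         [1, 9, -3],
         [2, -3, 25]]
rho = corr_from_cov(Sigma)
print([[round(r, 4) for r in row] for row in rho])
```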
Partitioning the Covariance Matrix

Often, the characteristics measured on individual trials will fall naturally into two or more groups. As examples, consider measurements of variables representing consumption and income or variables representing personality traits and physical characteristics. One approach to handling these situations is to let the characteristics defining the distinct groups be subsets of the total collection of characteristics. If the total collection is represented by a $(p \times 1)$-dimensional random vector X, the subsets can be regarded as components of X and can be sorted by partitioning X.

In general, we can partition the p characteristics contained in the $p \times 1$ random vector X into, for instance, two groups of size $q$ and $p - q$, respectively. For example, we can write

$$
\mathbf{X} = \begin{bmatrix} X_1 \\ \vdots \\ X_q \\ \hline X_{q+1} \\ \vdots \\ X_p \end{bmatrix}
\begin{matrix} \big\}\, q \\ \big\}\, p - q \end{matrix}
= \begin{bmatrix} \mathbf{X}^{(1)} \\ \hline \mathbf{X}^{(2)} \end{bmatrix}
\quad \text{and} \quad
\boldsymbol{\mu} = E(\mathbf{X}) = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_q \\ \hline \mu_{q+1} \\ \vdots \\ \mu_p \end{bmatrix}
= \begin{bmatrix} \boldsymbol{\mu}^{(1)} \\ \hline \boldsymbol{\mu}^{(2)} \end{bmatrix} \tag{2-38}
$$
From the definitions of the transpose and matrix multiplication,

$$
(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})' =
\begin{bmatrix}
(X_1 - \mu_1)(X_{q+1} - \mu_{q+1}) & (X_1 - \mu_1)(X_{q+2} - \mu_{q+2}) & \cdots & (X_1 - \mu_1)(X_p - \mu_p) \\
(X_2 - \mu_2)(X_{q+1} - \mu_{q+1}) & (X_2 - \mu_2)(X_{q+2} - \mu_{q+2}) & \cdots & (X_2 - \mu_2)(X_p - \mu_p) \\
\vdots & \vdots & & \vdots \\
(X_q - \mu_q)(X_{q+1} - \mu_{q+1}) & (X_q - \mu_q)(X_{q+2} - \mu_{q+2}) & \cdots & (X_q - \mu_q)(X_p - \mu_p)
\end{bmatrix}
$$

Upon taking the expectation of the matrix $(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})'$, we get

$$
E(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})'
= \begin{bmatrix} \sigma_{1,q+1} & \sigma_{1,q+2} & \cdots & \sigma_{1p} \\ \vdots & \vdots & & \vdots \\ \sigma_{q,q+1} & \sigma_{q,q+2} & \cdots & \sigma_{qp} \end{bmatrix} = \boldsymbol{\Sigma}_{12} \tag{2-39}
$$

which gives all the covariances $\sigma_{ij}$, $i = 1, 2, \ldots, q$, $j = q+1, q+2, \ldots, p$, between a component of $\mathbf{X}^{(1)}$ and a component of $\mathbf{X}^{(2)}$. Note that the matrix $\boldsymbol{\Sigma}_{12}$ is not necessarily symmetric or even square.

Making use of the partitioning in Equation (2-38), we can easily demonstrate that

$$
(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})' =
\begin{bmatrix}
(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})' & (\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})' \\
(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})' & (\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})'
\end{bmatrix}
$$

and consequently,

$$
\mathop{\boldsymbol{\Sigma}}_{(p\times p)} = E(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})'
= \begin{array}{c} q \\ p-q \end{array}\!\!
\left[\begin{array}{c|c} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \hline \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{array}\right]
= \left[\begin{array}{ccc|ccc}
\sigma_{11} & \cdots & \sigma_{1q} & \sigma_{1,q+1} & \cdots & \sigma_{1p} \\
\vdots & & \vdots & \vdots & & \vdots \\
\sigma_{q1} & \cdots & \sigma_{qq} & \sigma_{q,q+1} & \cdots & \sigma_{qp} \\ \hline
\sigma_{q+1,1} & \cdots & \sigma_{q+1,q} & \sigma_{q+1,q+1} & \cdots & \sigma_{q+1,p} \\
\vdots & & \vdots & \vdots & & \vdots \\
\sigma_{p1} & \cdots & \sigma_{pq} & \sigma_{p,q+1} & \cdots & \sigma_{pp}
\end{array}\right] \tag{2-40}
$$
Note that $\boldsymbol{\Sigma}_{12} = \boldsymbol{\Sigma}_{21}'$. The covariance matrix of $\mathbf{X}^{(1)}$ is $\boldsymbol{\Sigma}_{11}$, that of $\mathbf{X}^{(2)}$ is $\boldsymbol{\Sigma}_{22}$, and that of elements from $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ is $\boldsymbol{\Sigma}_{12}$ (or $\boldsymbol{\Sigma}_{21}$).

The Mean Vector and Covariance Matrix for Linear Combinations of Random Variables
Recall that if a single random variable, such as $X_1$, is multiplied by a constant c, then

$$
E(cX_1) = cE(X_1) = c\mu_1 \quad \text{and} \quad \text{Var}(cX_1) = E(cX_1 - c\mu_1)^2 = c^2\,\text{Var}(X_1) = c^2\sigma_{11}
$$

If $X_2$ is a second random variable and a and b are constants, then, using additional properties of expectation, we get

$$
\text{Cov}(aX_1, bX_2) = E(aX_1 - a\mu_1)(bX_2 - b\mu_2) = ab\,E(X_1 - \mu_1)(X_2 - \mu_2) = ab\,\text{Cov}(X_1, X_2) = ab\,\sigma_{12}
$$

Finally, for the linear combination $aX_1 + bX_2$, we have

$$
E(aX_1 + bX_2) = aE(X_1) + bE(X_2) = a\mu_1 + b\mu_2
$$

$$
\begin{aligned}
\text{Var}(aX_1 + bX_2) &= E[(aX_1 + bX_2) - (a\mu_1 + b\mu_2)]^2 \\
&= E[a(X_1 - \mu_1) + b(X_2 - \mu_2)]^2 \\
&= E[a^2(X_1 - \mu_1)^2 + b^2(X_2 - \mu_2)^2 + 2ab(X_1 - \mu_1)(X_2 - \mu_2)] \\
&= a^2\,\text{Var}(X_1) + b^2\,\text{Var}(X_2) + 2ab\,\text{Cov}(X_1, X_2) \\
&= a^2\sigma_{11} + b^2\sigma_{22} + 2ab\,\sigma_{12}
\end{aligned} \tag{2-41}
$$

With $\mathbf{c}' = [a, b]$, $aX_1 + bX_2$ can be written as

$$
\begin{bmatrix} a & b \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \mathbf{c}'\mathbf{X}
$$

Similarly, $E(aX_1 + bX_2) = a\mu_1 + b\mu_2$ can be expressed as

$$
\begin{bmatrix} a & b \end{bmatrix}\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} = \mathbf{c}'\boldsymbol{\mu}
$$

If we let

$$
\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}
$$

be the variance-covariance matrix of X, Equation (2-41) becomes

$$
\text{Var}(aX_1 + bX_2) = \text{Var}(\mathbf{c}'\mathbf{X}) = \mathbf{c}'\boldsymbol{\Sigma}\mathbf{c} \tag{2-42}
$$

since $\mathbf{c}'\boldsymbol{\Sigma}\mathbf{c} = a^2\sigma_{11} + 2ab\,\sigma_{12} + b^2\sigma_{22}$.
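A numeric sanity check of (2-41) against (2-42), using the covariance matrix of Example 2.13 and arbitrary constants a and b (the function name and numbers are ours):

```python
def quad_form(c, S):
    # c' S c for a vector c and matrix S
    n = len(c)
    return sum(c[i] * S[i][j] * c[j] for i in range(n) for j in range(n))

# Var(a X1 + b X2) = a^2 s11 + b^2 s22 + 2 a b s12, equation (2-41),
# equals c' Sigma c with c' = [a, b], equation (2-42)
a, b = 2.0, -1.0
Sigma = [[.69, -.08],
         [-.08, .16]]   # covariance matrix from Example 2.13

direct = a * a * Sigma[0][0] + b * b * Sigma[1][1] + 2 * a * b * Sigma[0][1]
print(quad_form([a, b], Sigma), direct)   # both equal 3.24 (up to rounding)
```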
The preceding results can be extended to a linear combination of p random variables: the linear combination $\mathbf{c}'\mathbf{X} = c_1X_1 + \cdots + c_pX_p$ has

$$
\text{mean} = E(\mathbf{c}'\mathbf{X}) = \mathbf{c}'\boldsymbol{\mu}, \qquad \text{variance} = \text{Var}(\mathbf{c}'\mathbf{X}) = \mathbf{c}'\boldsymbol{\Sigma}\mathbf{c} \tag{2-43}
$$

where $\boldsymbol{\mu} = E(\mathbf{X})$ and $\boldsymbol{\Sigma} = \text{Cov}(\mathbf{X})$.

In general, consider the q linear combinations of the p random variables $X_1, \ldots, X_p$:

$$
\begin{aligned}
Z_1 &= c_{11}X_1 + c_{12}X_2 + \cdots + c_{1p}X_p \\
Z_2 &= c_{21}X_1 + c_{22}X_2 + \cdots + c_{2p}X_p \\
&\ \ \vdots \\
Z_q &= c_{q1}X_1 + c_{q2}X_2 + \cdots + c_{qp}X_p
\end{aligned}
$$

or

$$
\mathop{\mathbf{Z}}_{(q\times1)} = \begin{bmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_q \end{bmatrix}
= \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1p} \\ c_{21} & c_{22} & \cdots & c_{2p} \\ \vdots & \vdots & & \vdots \\ c_{q1} & c_{q2} & \cdots & c_{qp} \end{bmatrix}
\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix} = \mathbf{C}\mathbf{X} \tag{2-44}
$$

The linear combinations $\mathbf{Z} = \mathbf{C}\mathbf{X}$ have

$$
\boldsymbol{\mu}_{\mathbf{Z}} = E(\mathbf{Z}) = \mathbf{C}\boldsymbol{\mu}_{\mathbf{X}}, \qquad
\boldsymbol{\Sigma}_{\mathbf{Z}} = \text{Cov}(\mathbf{Z}) = \mathbf{C}\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{C}' \tag{2-45}
$$

where $\boldsymbol{\mu}_{\mathbf{X}}$ and $\boldsymbol{\Sigma}_{\mathbf{X}}$ are the mean vector and variance-covariance matrix of X, respectively. (See Exercise 2.28 for the computation of the off-diagonal terms in $\mathbf{C}\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{C}'$.)

We shall rely heavily on the result in (2-45) in our discussions of principal components and factor analysis in Chapters 8 and 9.

Example 2.15 (Means and covariances of linear combinations)
Let $\mathbf{X}' = [X_1, X_2]$ be a random vector with mean vector $\boldsymbol{\mu}_{\mathbf{X}}' = [\mu_1, \mu_2]$ and variance-covariance matrix

$$
\boldsymbol{\Sigma}_{\mathbf{X}} = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}
$$

Find the mean vector and covariance matrix for the linear combinations

$$
\begin{aligned}
Z_1 &= X_1 - X_2 \\
Z_2 &= X_1 + X_2
\end{aligned}
\quad \text{or} \quad
\mathbf{Z} = \begin{bmatrix} Z_1 \\ Z_2 \end{bmatrix} = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \mathbf{C}\mathbf{X}
$$

in terms of $\boldsymbol{\mu}_{\mathbf{X}}$ and $\boldsymbol{\Sigma}_{\mathbf{X}}$. Here

$$
\boldsymbol{\mu}_{\mathbf{Z}} = E(\mathbf{Z}) = \mathbf{C}\boldsymbol{\mu}_{\mathbf{X}} = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} = \begin{bmatrix} \mu_1 - \mu_2 \\ \mu_1 + \mu_2 \end{bmatrix}
$$

and

$$
\boldsymbol{\Sigma}_{\mathbf{Z}} = \text{Cov}(\mathbf{Z}) = \mathbf{C}\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{C}'
= \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}\begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}
= \begin{bmatrix} \sigma_{11} - 2\sigma_{12} + \sigma_{22} & \sigma_{11} - \sigma_{22} \\ \sigma_{11} - \sigma_{22} & \sigma_{11} + 2\sigma_{12} + \sigma_{22} \end{bmatrix}
$$

Note that if $\sigma_{11} = \sigma_{22}$, that is, if $X_1$ and $X_2$ have equal variances, the off-diagonal terms in $\boldsymbol{\Sigma}_{\mathbf{Z}}$ vanish. This demonstrates the well-known result that the sum and difference of two random variables with identical variances are uncorrelated. •

Partitioning the Sample Mean Vector and Covariance Matrix
Many of the matrix results in this section have been expressed in terms of population means and variances (covariances). The results in (2-36), (2-37), (2-38), and (2-40) also hold if the population quantities are replaced by their appropriately defined sample counterparts.

Let $\bar{\mathbf{x}}' = [\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p]$ be the vector of sample averages constructed from n observations on p variables $X_1, X_2, \ldots, X_p$, and let $\mathbf{S}_n$ be the corresponding sample variance-covariance matrix.
The sample mean vector and the covariance matrix can be partitioned in order to distinguish quantities corresponding to groups of variables. Thus,

$$
\mathop{\bar{\mathbf{x}}}_{(p\times1)} = \begin{bmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_q \\ \hline \bar{x}_{q+1} \\ \vdots \\ \bar{x}_p \end{bmatrix} = \begin{bmatrix} \bar{\mathbf{x}}^{(1)} \\ \hline \bar{\mathbf{x}}^{(2)} \end{bmatrix} \tag{2-46}
$$

and

$$
\mathop{\mathbf{S}_n}_{(p\times p)} = \left[\begin{array}{ccc|ccc}
s_{11} & \cdots & s_{1q} & s_{1,q+1} & \cdots & s_{1p} \\
\vdots & & \vdots & \vdots & & \vdots \\
s_{q1} & \cdots & s_{qq} & s_{q,q+1} & \cdots & s_{qp} \\ \hline
s_{q+1,1} & \cdots & s_{q+1,q} & s_{q+1,q+1} & \cdots & s_{q+1,p} \\
\vdots & & \vdots & \vdots & & \vdots \\
s_{p1} & \cdots & s_{pq} & s_{p,q+1} & \cdots & s_{pp}
\end{array}\right]
= \begin{array}{c} q \\ p-q \end{array}\!\!\left[\begin{array}{c|c} \mathbf{S}_{11} & \mathbf{S}_{12} \\ \hline \mathbf{S}_{21} & \mathbf{S}_{22} \end{array}\right] \tag{2-47}
$$

where $\bar{\mathbf{x}}^{(1)}$ and $\bar{\mathbf{x}}^{(2)}$ are the sample mean vectors constructed from observations $\mathbf{x}^{(1)} = [x_1, \ldots, x_q]'$ and $\mathbf{x}^{(2)} = [x_{q+1}, \ldots, x_p]'$, respectively; $\mathbf{S}_{11}$ is the sample covariance matrix computed from observations $\mathbf{x}^{(1)}$; $\mathbf{S}_{22}$ is the sample covariance matrix computed from observations $\mathbf{x}^{(2)}$; and $\mathbf{S}_{12} = \mathbf{S}_{21}'$ is the sample covariance matrix for elements of $\mathbf{x}^{(1)}$ and elements of $\mathbf{x}^{(2)}$.
2.7 MATRIX INEQUALITIES AND MAXIMIZATION

Maximization principles play an important role in several multivariate techniques. Linear discriminant analysis, for example, is concerned with allocating observations to predetermined groups. The allocation rule is often a linear function of measurements that maximizes the separation between groups relative to their within-group variability. As another example, principal components are linear combinations of measurements with maximum variability. The matrix inequalities presented in this section will easily allow us to derive certain maximization results, which will be referenced in later chapters.
Cauchy-Schwarz Inequality. Let b and d be any two p × 1 vectors. Then

    (b'd)² ≤ (b'b)(d'd)                                                      (2-48)

with equality if and only if b = cd (or d = cb) for some constant c.

Proof. The inequality is obvious if either b = 0 or d = 0. Excluding this possibility, consider the vector b − xd, where x is an arbitrary scalar. Since the length of b − xd is positive for b − xd ≠ 0, in this case

    0 < (b − xd)'(b − xd) = b'b − xd'b − b'(xd) + x²d'd
                          = b'b − 2x(b'd) + x²(d'd)

The last expression is quadratic in x. If we complete the square by adding and subtracting the scalar (b'd)²/d'd, we get

    0 < b'b − (b'd)²/d'd + (b'd)²/d'd − 2x(b'd) + x²(d'd)
      = b'b − (b'd)²/d'd + (d'd)[x − b'd/d'd]²

The term in brackets is zero if we choose x = b'd/d'd, so we conclude that

    0 < b'b − (b'd)²/d'd

or (b'd)² < (b'b)(d'd) if b ≠ xd for any x. Note that if b = cd, then 0 = (b − cd)'(b − cd), and the same argument produces (b'd)² = (b'b)(d'd). ■

A simple, but important, extension of the Cauchy-Schwarz inequality follows directly.

Extended Cauchy-Schwarz Inequality. Let b (p×1) and d (p×1) be any two vectors, and let B (p×p) be a positive definite matrix. Then

    (b'd)² ≤ (b'Bb)(d'B⁻¹d)                                                  (2-49)

with equality if and only if b = cB⁻¹d (or d = cBb) for some constant c.

Proof. The inequality is obvious when b = 0 or d = 0. For cases other than these, consider the square-root matrix B^(1/2) defined in terms of its eigenvalues λi and the normalized eigenvectors ei as B^(1/2) = Σ_{i=1}^{p} √λi ei ei' [see also (2-22)]. If we set

    B^(−1/2) = Σ_{i=1}^{p} (1/√λi) ei ei'
it follows that

    b'd = b'B^(1/2)B^(−1/2)d = (B^(1/2)b)'(B^(−1/2)d)

and the proof is completed by applying the Cauchy-Schwarz inequality to the vectors (B^(1/2)b) and (B^(−1/2)d). ■

The extended Cauchy-Schwarz inequality gives rise to the following maximization result.

Maximization Lemma. Let B (p×p) be positive definite and d (p×1) be a given vector. Then, for an arbitrary nonzero vector x (p×1),

    max_{x≠0} (x'd)²/(x'Bx) = d'B⁻¹d                                         (2-50)

with the maximum attained when x = cB⁻¹d for any constant c ≠ 0.

Proof. By the extended Cauchy-Schwarz inequality, (x'd)² ≤ (x'Bx)(d'B⁻¹d). Because x ≠ 0 and B is positive definite, x'Bx > 0. Dividing both sides of the inequality by the positive scalar x'Bx yields the upper bound

    (x'd)²/(x'Bx) ≤ d'B⁻¹d

Taking the maximum over x gives Equation (2-50) because the bound is attained for x = cB⁻¹d. ■

A final maximization result will provide us with an interpretation of eigenvalues.
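The maximization lemma (2-50) is easy to check numerically. The sketch below uses an arbitrary illustrative positive definite B and vector d (not from the text):

```python
import numpy as np

# Illustrative positive definite B and given vector d.
B = np.array([[2.0, 1.0], [1.0, 3.0]])
d = np.array([1.0, 2.0])

bound = d @ np.linalg.inv(B) @ d        # the claimed maximum d'B^{-1}d

# The ratio (x'd)^2 / (x'Bx) never exceeds the bound over random x ...
rng = np.random.default_rng(1)
ratios = []
for _ in range(1000):
    x = rng.normal(size=2)
    ratios.append((x @ d) ** 2 / (x @ B @ x))
assert max(ratios) <= bound + 1e-9

# ... and the bound is attained at x = cB^{-1}d (any c != 0).
x_star = np.linalg.inv(B) @ d
attained = (x_star @ d) ** 2 / (x_star @ B @ x_star)
assert np.isclose(attained, bound)
```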
Maximization of Quadratic Forms for Points on the Unit Sphere. Let B (p×p) be a positive definite matrix with eigenvalues λ1 ≥ λ2 ≥ ··· ≥ λp ≥ 0 and associated normalized eigenvectors e1, e2, ..., ep. Then

    max_{x≠0} x'Bx/x'x = λ1    (attained when x = e1)
                                                                             (2-51)
    min_{x≠0} x'Bx/x'x = λp    (attained when x = ep)

Moreover,

    max_{x ⊥ e1,...,ek} x'Bx/x'x = λk+1    (attained when x = ek+1, k = 1, 2, ..., p − 1)    (2-52)

where the symbol ⊥ is read "is perpendicular to."
Proof. Let P (p×p) be the orthogonal matrix whose columns are the eigenvectors e1, e2, ..., ep and Λ be the diagonal matrix with eigenvalues λ1, λ2, ..., λp along the main diagonal. Let B^(1/2) = PΛ^(1/2)P' [see (2-22)] and y (p×1) = P'x. Consequently, x ≠ 0 implies y ≠ 0. Thus, since x'x = x'PP'x = y'y,

    x'Bx/x'x = x'B^(1/2)B^(1/2)x / x'PP'x = y'Λy/y'y
             = Σ_{i=1}^{p} λi yi² / Σ_{i=1}^{p} yi²
             ≤ λ1 Σ_{i=1}^{p} yi² / Σ_{i=1}^{p} yi² = λ1                     (2-53)

Setting x = e1 gives

    y = P'e1 = [1, 0, ..., 0]'

since ek'e1 = 1 for k = 1 and ek'e1 = 0 for k ≠ 1. For this choice of x, we have y'Λy/y'y = λ1/1 = λ1, or

    e1'Be1/e1'e1 = λ1                                                        (2-54)

A similar argument produces the second part of (2-51).

Now, x = Py = y1e1 + y2e2 + ··· + ypep, so x ⊥ e1, ..., ek implies

    0 = ei'x = y1 ei'e1 + y2 ei'e2 + ··· + yp ei'ep = yi,    i ≤ k

Therefore, for x perpendicular to the first k eigenvectors ei, the left-hand side of the inequality in (2-53) becomes

    x'Bx/x'x = Σ_{i=k+1}^{p} λi yi² / Σ_{i=k+1}^{p} yi²

Taking yk+1 = 1, yk+2 = ··· = yp = 0 gives the asserted maximum. ■
For a fixed x0 ≠ 0, x0'Bx0/x0'x0 has the same value as x'Bx, where x' = x0'/√(x0'x0) is of unit length. Consequently, Equation (2-51) says that the largest eigenvalue, λ1, is the maximum value of the quadratic form x'Bx for all points x whose distance from the origin is unity. Similarly, λp is the smallest value of the quadratic form for all points x one unit from the origin. The largest and smallest eigenvalues thus represent extreme values of x'Bx for points on the unit sphere. The "intermediate" eigenvalues of the p × p positive definite matrix B also have an interpretation as extreme values when x is further restricted to be perpendicular to the earlier choices.
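The unit-sphere interpretation of (2-51) can be illustrated numerically. The positive definite B below is an arbitrary illustrative choice:

```python
import numpy as np

# Illustrative positive definite matrix (symmetric, as (2-51) requires).
B = np.array([[13.0, -4.0, 2.0],
              [-4.0, 13.0, -2.0],
              [2.0, -2.0, 10.0]])

eigvals, eigvecs = np.linalg.eigh(B)   # eigh returns ascending eigenvalues
lam_min, lam_max = eigvals[0], eigvals[-1]

# x'Bx on the unit sphere always lies between the smallest and largest eigenvalue.
rng = np.random.default_rng(2)
for _ in range(500):
    x = rng.normal(size=3)
    x = x / np.linalg.norm(x)          # a point on the unit sphere
    q = x @ B @ x
    assert lam_min - 1e-9 <= q <= lam_max + 1e-9

# The extremes are attained at the corresponding unit eigenvectors.
assert np.isclose(eigvecs[:, -1] @ B @ eigvecs[:, -1], lam_max)
assert np.isclose(eigvecs[:, 0] @ B @ eigvecs[:, 0], lam_min)
```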
SUPPLEMENT 2A

Vectors and Matrices: Basic Concepts

Vectors
Many concepts, such as a person's health, intellectual abilities, or personality, cannot be adequately quantified as a single number. Rather, several different measurements x1, x2, ..., xm are required.

Definition 2A.1. An m-tuple of real numbers (x1, x2, ..., xi, ..., xm) arranged in a column is called a vector and is denoted by a boldfaced, lowercase letter. One example is the 4 × 1 vector

    b = [1, −1, 1, −1]'

Vectors are said to be equal if their corresponding entries are the same.

Definition 2A.2 (Scalar Multiplication). Let c be an arbitrary scalar. Then the product cx is a vector with ith entry cxi.

To illustrate scalar multiplication, take c1 = 5 and c2 = −1.2, and let y = [2, −2]'. Then

    c1y = 5[2; −2] = [10; −10]    and    c2y = (−1.2)[2; −2] = [−2.4; 2.4]
Definition 2A.3 (Vector Addition). The sum of two vectors x and y, each having the same number of entries, is the vector z = x + y with ith entry zi = xi + yi.

Taking the zero vector, 0, to be the m-tuple (0, 0, ..., 0) and the vector −x to be the m-tuple (−x1, −x2, ..., −xm), the two operations of scalar multiplication and vector addition can be combined in a useful manner.
Definition 2A.4. The space of all real m-tuples, with scalar multiplication and vector addition as just defined, is called a vector space.

Definition 2A.5. The vector y = a1x1 + a2x2 + ··· + akxk is a linear combination of the vectors x1, x2, ..., xk. The set of all linear combinations of x1, x2, ..., xk is called their linear span.

Definition 2A.6. A set of vectors x1, x2, ..., xk is said to be linearly dependent if there exist k numbers (a1, a2, ..., ak), not all zero, such that

    a1x1 + a2x2 + ··· + akxk = 0

Otherwise the set of vectors is said to be linearly independent.

If one of the vectors, for example, xi, is 0, the set is linearly dependent. (Let ai be the only nonzero coefficient in Definition 2A.6.)

The familiar vectors with a one as an entry and zeros elsewhere are linearly independent. For m = 4,

    x1 = [1, 0, 0, 0]',  x2 = [0, 1, 0, 0]',  x3 = [0, 0, 1, 0]',  x4 = [0, 0, 0, 1]'

so

    a1x1 + a2x2 + a3x3 + a4x4 = [a1, a2, a3, a4]' = 0

implies that a1 = a2 = a3 = a4 = 0.

As another example, let k = 3 and m = 3, and let

    x1 = [1, 1, 1]',  x2 = [2, 5, −1]',  x3 = [0, 1, −1]'
Then

    2x1 − x2 + 3x3 = 0

Thus, x1, x2, x3 are a linearly dependent set of vectors, since any one can be written as a linear combination of the others (for example, x2 = 2x1 + 3x3).

Definition 2A.7. Any set of m linearly independent vectors is called a basis for the vector space of all m-tuples of real numbers.

Result 2A.1. Every vector can be expressed as a unique linear combination of a fixed basis. ■

With m = 4, the usual choice of a basis is

    [1, 0, 0, 0]',  [0, 1, 0, 0]',  [0, 0, 1, 0]',  [0, 0, 0, 1]'

These four vectors were shown to be linearly independent. Any vector x = [x1, x2, x3, x4]' can be uniquely expressed as

    x1[1, 0, 0, 0]' + x2[0, 1, 0, 0]' + x3[0, 0, 1, 0]' + x4[0, 0, 0, 1]' = x

A vector consisting of m elements may be regarded geometrically as a point in m-dimensional space. For example, with m = 2, the vector x = [x1, x2]' may be regarded as representing the point in the plane with coordinates x1 and x2. Vectors have the geometrical properties of length and direction.
Definition 2A.8. The length of a vector of m elements emanating from the origin is given by the Pythagorean formula:

    length of x = Lx = √(x1² + x2² + ··· + xm²)

Definition 2A.9. The angle θ between two vectors x and y, both having m entries, is defined from

    cos(θ) = (x1y1 + x2y2 + ··· + xmym)/(Lx Ly)

where Lx = length of x and Ly = length of y, x1, x2, ..., xm are the elements of x, and y1, y2, ..., ym are the elements of y.

Let

    x = [−1, 5, 2, −2]'    and    y = [4, −3, 0, 1]'

Then the length of x, the length of y, and the cosine of the angle between the two vectors are

    length of x = √((−1)² + 5² + 2² + (−2)²) = √34 = 5.83
    length of y = √(4² + (−3)² + 0² + 1²) = √26 = 5.10

and

    cos(θ) = (1/(Lx Ly))[x1y1 + x2y2 + x3y3 + x4y4]
           = (1/(√34 √26))[(−1)4 + 5(−3) + 2(0) + (−2)1]
           = (1/(5.83 × 5.10))[−21] = −.706

Consequently, θ = 135°.

Definition 2A.10. The inner (or dot) product of two vectors x and y with the same number of entries is defined as the sum of component products:

    x1y1 + x2y2 + ··· + xmym

We use the notation x'y or y'x to denote this inner product. With the x'y notation, we may express the length of a vector and the cosine of the angle between two vectors as

    Lx = length of x = √(x1² + x2² + ··· + xm²) = √(x'x)

    cos(θ) = x'y/(√(x'x)√(y'y))

Definition 2A.11. When the angle between two vectors x, y is θ = 90° or 270°, we say that x and y are perpendicular. Since cos(θ) = 0 only if θ = 90° or 270°, the condition becomes

    x and y are perpendicular if x'y = 0

We write x ⊥ y.
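The length-and-angle example above is reproduced below as a short NumPy sketch:

```python
import numpy as np

# The vectors from the worked example.
x = np.array([-1.0, 5.0, 2.0, -2.0])
y = np.array([4.0, -3.0, 0.0, 1.0])

L_x = np.sqrt(x @ x)                   # length of x = sqrt(34)
L_y = np.sqrt(y @ y)                   # length of y = sqrt(26)
cos_theta = (x @ y) / (L_x * L_y)
theta = np.degrees(np.arccos(cos_theta))

assert np.isclose(L_x, np.sqrt(34))
assert np.isclose(L_y, np.sqrt(26))
assert np.isclose(x @ y, -21)          # the inner product x'y
assert abs(theta - 135.0) < 0.1        # theta is approximately 135 degrees
```

As the final assertion suggests, 135° is a rounded value; the exact cosine is −21/√884 ≈ −.706.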
The basis vectors

    [1, 0, 0, 0]',  [0, 1, 0, 0]',  [0, 0, 1, 0]',  [0, 0, 0, 1]'

are mutually perpendicular. Also, each has length unity. The same construction holds for any number of entries m.

Result 2A.2.
(a) z is perpendicular to every vector if and only if z = 0.
(b) If z is perpendicular to each vector x1, x2, ..., xk, then z is perpendicular to their linear span.
(c) Mutually perpendicular vectors are linearly independent. ■

Definition 2A.12. The projection (or shadow) of a vector x on a vector y is

    projection of x on y = ((x'y)/Ly²) y

If y has unit length so that Ly = 1,

    projection of x on y = (x'y)y

Result 2A.3 (Gram-Schmidt Process). Given linearly independent vectors x1, x2, ..., xk, there exist mutually perpendicular vectors u1, u2, ..., uk with the same linear span. These may be constructed sequentially by setting

    u1 = x1,    uk = xk − Σ_{j=1}^{k−1} ((xk'uj)/(uj'uj)) uj

We can also convert the u's to unit length by setting zj = uj/√(uj'uj). In this construction, (xk'zj)zj is the projection of xk on zj, and Σ_{j=1}^{k−1} (xk'zj)zj is the projection of xk on the linear span of x1, x2, ..., xk−1. ■

For example, to construct perpendicular vectors from

    x1 = [4, 0, 0, 2]'    and    x2 = [3, 1, 0, −1]'
we take

    u1 = x1 = [4, 0, 0, 2]'

so

    u1'u1 = 4² + 0² + 0² + 2² = 20

and

    x2'u1 = 3(4) + 1(0) + 0(0) − 1(2) = 10

Thus,

    u2 = x2 − (10/20)u1 = [3, 1, 0, −1]' − (1/2)[4, 0, 0, 2]' = [1, 1, 0, −2]'

and

    z1 = (1/√20)[4, 0, 0, 2]',    z2 = (1/√6)[1, 1, 0, −2]'
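The Gram-Schmidt step just worked out can be sketched numerically:

```python
import numpy as np

# The vectors from the example above.
x1 = np.array([4.0, 0.0, 0.0, 2.0])
x2 = np.array([3.0, 1.0, 0.0, -1.0])

u1 = x1
u2 = x2 - ((x2 @ u1) / (u1 @ u1)) * u1   # subtract the projection of x2 on u1

z1 = u1 / np.linalg.norm(u1)             # unit-length versions
z2 = u2 / np.linalg.norm(u2)

assert np.isclose(u1 @ u2, 0.0)                     # u1 and u2 are perpendicular
assert np.allclose(u2, [1.0, 1.0, 0.0, -2.0])       # matches the hand computation
assert np.isclose(np.linalg.norm(z2), 1.0)          # z2 has unit length
```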
Matrices

Definition 2A.13. An m × k matrix, generally denoted by a boldface uppercase letter such as A, R, Σ, and so forth, is a rectangular array of elements having m rows and k columns. One example is the 3 × 3 identity matrix

    I = [1 0 0; 0 1 0; 0 0 1]

In our work, the matrix elements will be real numbers or functions taking on values in the real numbers.

Definition 2A.14. The dimension (abbreviated dim) of an m × k matrix is the ordered pair (m, k); m is the row dimension and k is the column dimension. The dimension of a matrix is frequently indicated in parentheses below the letter representing the matrix. Thus, the m × k matrix A is denoted by A (m×k). In the preceding example, the dimension of the matrix I is 3 × 3, and this information can be conveyed by writing I (3×3).
An m × k matrix, say, A, of arbitrary constants can be written

    A (m×k) = [a11 a12 ··· a1k; a21 a22 ··· a2k; ... ; am1 am2 ··· amk]

or more compactly as A (m×k) = {aij}, where the index i refers to the row and the index j refers to the column.

An m × 1 matrix is referred to as a column vector. A 1 × k matrix is referred to as a row vector. Since matrices can be considered as vectors side by side, it is natural to define multiplication by a scalar and the addition of two matrices with the same dimensions.

Definition 2A.15. Two matrices A (m×k) = {aij} and B (m×k) = {bij} are said to be equal, written A = B, if aij = bij, i = 1, 2, ..., m, j = 1, 2, ..., k. That is, two matrices are equal if
(a) Their dimensionality is the same.
(b) Every corresponding element is the same.

Definition 2A.16 (Matrix Addition). Let the matrices A and B both be of dimension m × k with arbitrary elements aij and bij, i = 1, 2, ..., m, j = 1, 2, ..., k, respectively. The sum of the matrices A and B is an m × k matrix C, written C = A + B, such that the arbitrary element of C is given by

    cij = aij + bij,    i = 1, 2, ..., m, j = 1, 2, ..., k

Note that the addition of matrices is defined only for matrices of the same dimension.

Definition 2A.17 (Scalar Multiplication). Let c be an arbitrary scalar and A (m×k) = {aij}. Then cA = Ac = B (m×k) = {bij}, where bij = caij = aij c, i = 1, 2, ..., m, j = 1, 2, ..., k.

Multiplication of a matrix by a scalar produces a new matrix whose elements are the elements of the original matrix, each multiplied by the scalar. For example, if c = 2,

    cA = 2[3 −4; 2 6; 0 5] = [6 −8; 4 12; 0 10] = Ac = B
Definition 2A.18 (Matrix Subtraction). Let A (m×k) = {aij} and B (m×k) = {bij} be two matrices of equal dimension. Then the difference between A and B, written A − B, is an m × k matrix C = {cij} given by

    C = A − B = A + (−1)B

That is, cij = aij + (−1)bij = aij − bij, i = 1, 2, ..., m, j = 1, 2, ..., k.

Definition 2A.19. Consider the m × k matrix A with arbitrary elements aij, i = 1, 2, ..., m, j = 1, 2, ..., k. The transpose of the matrix A, denoted by A', is the k × m matrix with elements aji, j = 1, 2, ..., k, i = 1, 2, ..., m. That is, the transpose of the matrix A is obtained from A by interchanging the rows and columns. As an example, if

    A (2×3) = [2 1 3; 7 −4 6],    then    A' (3×2) = [2 7; 1 −4; 3 6]
Result 2A.4. For all matrices A, B, and C (of equal dimension) and scalars c and d, the following hold:
(a) (A + B) + C = A + (B + C)
(b) A + B = B + A
(c) c(A + B) = cA + cB
(d) (c + d)A = cA + dA
(e) (A + B)' = A' + B' (That is, the transpose of the sum is equal to the sum of the transposes.)
(f) (cd)A = c(dA)
(g) (cA)' = cA' ■

Definition 2A.20. If an arbitrary matrix A has the same number of rows and columns, then A is called a square matrix. The identity matrix given after Definition 2A.13 is a square matrix.

Definition 2A.21. Let A be a k × k (square) matrix. Then A is said to be symmetric if A = A'. That is, A is symmetric if aij = aji, i = 1, 2, ..., k, j = 1, 2, ..., k. Examples of symmetric matrices are

    I (3×3) = [1 0 0; 0 1 0; 0 0 1],    B (4×4) = [a c e f; c b g d; e g c a; f d a d]
Definition 2A.22. The k × k identity matrix, denoted by I (k×k), is the square matrix with ones on the main (NW-SE) diagonal and zeros elsewhere. The 3 × 3 identity matrix is shown before this definition.

Definition 2A.23 (Matrix Multiplication). The product AB of an m × n matrix A = {aij} and an n × k matrix B = {bij} is the m × k matrix C whose elements are

    cij = Σ_{ℓ=1}^{n} aiℓ bℓj,    i = 1, 2, ..., m, j = 1, 2, ..., k

Note that for the product AB to be defined, the column dimension of A must equal the row dimension of B. If that is so, then the row dimension of AB equals the row dimension of A, and the column dimension of AB equals the column dimension of B.
For example, let

    A (2×3) = [3 −1 2; 4 0 5]    and    B (3×2) = [3 4; 6 −2; 4 3]

Then

    AB = [3 −1 2; 4 0 5][3 4; 6 −2; 4 3] = [c11 c12; c21 c22] = [11 20; 32 31] (2×2)

where

    c11 = (3)(3) + (−1)(6) + (2)(4) = 11
    c12 = (3)(4) + (−1)(−2) + (2)(3) = 20
    c21 = (4)(3) + (0)(6) + (5)(4) = 32
    c22 = (4)(4) + (0)(−2) + (5)(3) = 31
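The entry-by-entry product above can be verified with NumPy:

```python
import numpy as np

# The matrices from the multiplication example.
A = np.array([[3, -1, 2],
              [4,  0, 5]])
B = np.array([[3,  4],
              [6, -2],
              [4,  3]])

C = A @ B
assert C.shape == (2, 2)                          # (2x3)(3x2) -> 2x2
assert np.array_equal(C, [[11, 20], [32, 31]])

# c11 is the inner product of row 1 of A with column 1 of B.
assert A[0] @ B[:, 0] == 11
```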
As an additional example, consider the product of two vectors. Let

    x = [1, 0, −2, 3]'    and    y = [2, −3, −1, −8]'

Then x' = [1 0 −2 3] and

    x'y = [1 0 −2 3][2; −3; −1; −8] = [−20] = y'x

Note that the product xy is undefined, since x is a 4 × 1 matrix and y is a 4 × 1 matrix, so the column dim of x, 1, is unequal to the row dim of y, 4. If x and y are vectors of the same dimension, such as n × 1, both of the products x'y and xy' are defined. In particular, y'x = x'y = x1y1 + x2y2 + ··· + xnyn, and xy' is an n × n matrix with i, jth element xiyj.
Result 2A.5. For all matrices A, B, and C (of dimensions such that the indicated products are defined) and a scalar c,
(a) c(AB) = (cA)B
(b) A(BC) = (AB)C
(c) A(B + C) = AB + AC
(d) (B + C)A = BA + CA
(e) (AB)' = B'A'

More generally, for any xj such that Axj is defined,

(f) Σ_{j=1}^{n} Axj = A Σ_{j=1}^{n} xj ■
There are several important differences between the algebra of matrices and the algebra of real numbers. Two of these differences are as follows:

1. Matrix multiplication is, in general, not commutative. That is, in general, AB ≠ BA. If A is m × n and B is n × k with m ≠ k, the product AB is defined but BA is not. Even when both products are defined, as for square matrices of the same dimension, AB and BA typically have different entries.

2. Let 0 denote the zero matrix, that is, the matrix with zero for every element. In the algebra of real numbers, if the product of two numbers, ab, is zero, then a = 0 or b = 0. In matrix algebra, however, the product of two nonzero matrices may be the zero matrix. Hence,

    AB (m×k) = 0 (m×k)

does not imply that A = 0 or B = 0. It is true, however, that if either A (m×n) = 0 (m×n) or B (n×k) = 0 (n×k), then AB (m×k) = 0 (m×k).
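Both differences are easy to demonstrate with small illustrative matrices (these are not the numerical examples from the text, which were lost in reproduction):

```python
import numpy as np

# 1. Multiplication is not commutative: AB != BA in general.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
assert not np.allclose(A @ B, B @ A)

# 2. Two nonzero matrices can multiply to the zero matrix.
C = np.array([[1.0, -1.0], [1.0, -1.0]])
D = np.array([[1.0, 1.0], [1.0, 1.0]])
assert np.any(C != 0) and np.any(D != 0)
assert np.allclose(C @ D, np.zeros((2, 2)))
```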
Definition 2A.24. The determinant of the square k × k matrix A = {aij}, denoted by |A|, is the scalar

    |A| = a11                                        if k = 1
    |A| = Σ_{j=1}^{k} a1j |A1j| (−1)^(1+j)           if k > 1

where A1j is the (k − 1) × (k − 1) matrix obtained by deleting the first row and jth column of A. Also, |A| = Σ_{j=1}^{k} aij |Aij| (−1)^(i+j), with the ith row in place of the first row.

Examples of determinants (evaluated using Definition 2A.24) are

    |1 3; 6 4| = 1|4|(−1)² + 3|6|(−1)³ = 1(4) + 3(6)(−1) = −14

In general, |a11 a12; a21 a22| = a11a22 − a12a21. Also,

    |3 1 6; 7 4 5; 2 −7 1| = 3|4 5; −7 1|(−1)² + 1|7 5; 2 1|(−1)³ + 6|7 4; 2 −7|(−1)⁴
                           = 3(39) − 1(−3) + 6(−57) = −222

    |1 0 0; 0 1 0; 0 0 1| = 1|1 0; 0 1|(−1)² + 0|0 0; 0 1|(−1)³ + 0|0 1; 0 0|(−1)⁴ = 1(1) = 1

If I is the k × k identity matrix, |I| = 1.
Likewise,

    |a11 a12 a13; a21 a22 a23; a31 a32 a33|
        = a11a22a33 + a12a23a31 + a21a32a13 − a31a22a13 − a21a12a33 − a32a23a11

The determinant of any 3 × 3 matrix can be computed by summing the three products of elements along the main (NW-SE) diagonals and subtracting the three products along the opposite (SW-NE) diagonals. This diagonal procedure is not valid for matrices of higher dimension, but in general, Definition 2A.24 can be employed to evaluate these determinants.
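The 3 × 3 diagonal rule and the cofactor expansion both give −222 for the example above; a quick NumPy check:

```python
import numpy as np

# The 3x3 matrix whose determinant was expanded by cofactors above.
A = np.array([[3.0,  1.0, 6.0],
              [7.0,  4.0, 5.0],
              [2.0, -7.0, 1.0]])

# Diagonal rule: sum of NW-SE diagonal products minus SW-NE diagonal products.
plus = A[0,0]*A[1,1]*A[2,2] + A[0,1]*A[1,2]*A[2,0] + A[1,0]*A[2,1]*A[0,2]
minus = A[2,0]*A[1,1]*A[0,2] + A[1,0]*A[0,1]*A[2,2] + A[2,1]*A[1,2]*A[0,0]

assert np.isclose(plus - minus, -222.0)
assert np.isclose(np.linalg.det(A), -222.0)   # agrees with the library routine
```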
We next want to state a result that describes some properties of the determinant. However, we must first introduce some notions related to matrix inverses.

Definition 2A.25. The row rank of a matrix is the maximum number of linearly independent rows, considered as vectors (that is, row vectors). The column rank of a matrix is the rank of its set of columns, considered as vectors.

For example, let the matrix

    A = [1 1 1; 2 5 −1; 0 1 −1]

The rows of A, written as vectors, were shown to be linearly dependent after Definition 2A.6. Note that the column rank of A is also 2, since

    2[1; 2; 0] − [1; 5; 1] = [1; −1; −1]

but columns 1 and 2 are linearly independent. This is no coincidence, as the following result indicates.
Result 2A.6. The row rank and the column rank of a matrix are equal. ■

Thus, the rank of a matrix is either the row rank or the column rank.

Definition 2A.26. A square matrix A (k×k) is nonsingular if Ax (k×1) = 0 (k×1) implies that x (k×1) = 0 (k×1). If a matrix fails to be nonsingular, it is called singular. Equivalently, a square matrix is nonsingular if its rank is equal to the number of rows (or columns) it has.

Note that Ax = x1a1 + x2a2 + ··· + xkak, where ai is the ith column of A, so the condition of nonsingularity is just the statement that the columns of A are linearly independent.

Result 2A.7. Let A be a nonsingular square matrix of dimension k × k. Then there is a unique k × k matrix B such that

    AB = BA = I

where I is the k × k identity matrix. ■
Definition 2A.27. The B such that AB = BA = I is called the inverse of A and is denoted by A⁻¹. In fact, if BA = I or AB = I, then B = A⁻¹, and both products must equal I.

For example,

    A = [2 3; 1 5]    has    A⁻¹ = [5/7 −3/7; −1/7 2/7]

since

    [2 3; 1 5][5/7 −3/7; −1/7 2/7] = [5/7 −3/7; −1/7 2/7][2 3; 1 5] = [1 0; 0 1]

Result 2A.8.
(a) The inverse of any 2 × 2 matrix

    A = [a11 a12; a21 a22]

is given by

    A⁻¹ = (1/|A|)[a22 −a12; −a21 a11]
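The 2 × 2 formula of Result 2A.8(a), applied to the example matrix above, can be checked numerically:

```python
import numpy as np

# The example matrix A = [2 3; 1 5].
A = np.array([[2.0, 3.0], [1.0, 5.0]])

det_A = A[0, 0]*A[1, 1] - A[0, 1]*A[1, 0]          # |A| = 10 - 3 = 7
A_inv = (1.0 / det_A) * np.array([[ A[1, 1], -A[0, 1]],
                                  [-A[1, 0],  A[0, 0]]])

assert np.isclose(det_A, 7.0)
assert np.allclose(A_inv, np.linalg.inv(A))        # matches the library inverse
assert np.allclose(A @ A_inv, np.eye(2))           # AA^{-1} = I
```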
(b) The inverse of any 3 × 3 matrix

    A = [a11 a12 a13; a21 a22 a23; a31 a32 a33]

is given by

    A⁻¹ = (1/|A|) ×
        [  |a22 a23; a32 a33|   −|a12 a13; a32 a33|    |a12 a13; a22 a23| ;
          −|a21 a23; a31 a33|    |a11 a13; a31 a33|   −|a11 a13; a21 a23| ;
           |a21 a22; a31 a32|   −|a11 a12; a31 a32|    |a11 a12; a21 a22| ]
In both (a) and (b), it is clear that |A| ≠ 0 if the inverse is to exist.
(c) In general, A⁻¹ has j, ith entry [|Aij|/|A|](−1)^(i+j), where Aij is the matrix obtained from A by deleting the ith row and jth column. ■

Result 2A.9. For a square matrix A of dimension k × k, the following are equivalent:
(a) Ax (k×1) = 0 (k×1) implies x (k×1) = 0 (k×1) (A is nonsingular).
(b) |A| ≠ 0.
(c) There exists a matrix A⁻¹ such that AA⁻¹ = A⁻¹A = I (k×k). ■
Result 2A.10. Let A and B be square matrices of the same dimension, and let the indicated inverses exist. Then the following hold:
(a) (A⁻¹)' = (A')⁻¹
(b) (AB)⁻¹ = B⁻¹A⁻¹ ■

The determinant has the following properties.
Result 2A.11. Let A and B be k × k square matrices.
(a) |A| = |A'|
(b) If each element of a row (column) of A is zero, then |A| = 0.
(c) If any two rows (columns) of A are identical, then |A| = 0.
(d) If A is nonsingular, then |A| = 1/|A⁻¹|; that is, |A||A⁻¹| = 1.
(e) |AB| = |A||B|
(f) |cA| = cᵏ|A|, where c is a scalar.
You are referred to [6] for proofs of parts of Results 2A.9 and 2A.11. Some of these proofs are rather complex and beyond the scope of this book. ■
Definition 2A.28. Let A = {aij} be a k × k square matrix. The trace of the matrix A, written tr(A), is the sum of the diagonal elements; that is, tr(A) = Σ_{i=1}^{k} aii.

Result 2A.12. Let A and B be k × k matrices and c be a scalar.
(a) tr(cA) = c tr(A)
(b) tr(A ± B) = tr(A) ± tr(B)
(c) tr(AB) = tr(BA)
(d) tr(B⁻¹AB) = tr(A)
(e) tr(AA') = Σ_{i=1}^{k} Σ_{j=1}^{k} aij² ■

Definition 2A.29. A square matrix A is said to be orthogonal if its rows, considered as vectors, are mutually perpendicular and have unit lengths; that is, AA' = I.

Result 2A.13. A matrix A is orthogonal if and only if A⁻¹ = A'. For an orthogonal matrix, AA' = A'A = I, so the columns are also mutually perpendicular and have unit lengths. ■

An example of an orthogonal matrix is

    A = [ 1/2  1/2  1/2  1/2;
          1/2  1/2 −1/2 −1/2;
          1/2 −1/2  1/2 −1/2;
          1/2 −1/2 −1/2  1/2 ]

Clearly, A = A', so AA' = A'A = AA. Direct multiplication verifies that AA = I, so A' = A⁻¹, and A must be an orthogonal matrix.

Square matrices are best understood in terms of quantities called eigenvalues and eigenvectors.

Definition 2A.30. Let A be a k × k square matrix and I be the k × k identity matrix. Then the scalars λ1, λ2, ..., λk satisfying the polynomial equation |A − λI| = 0 are called the eigenvalues (or characteristic roots) of the matrix A. The equation |A − λI| = 0 (as a function of λ) is called the characteristic equation.

For example, let

    A = [1 0; 1 3]
Then

    |A − λI| = |[1 0; 1 3] − λ[1 0; 0 1]| = |1−λ 0; 1 3−λ| = (1 − λ)(3 − λ) = 0

implies that there are two roots, λ1 = 1 and λ2 = 3. The eigenvalues of A are 3 and 1.

Let

    A = [13 −4 2; −4 13 −2; 2 −2 10]

Then the equation

    |A − λI| = |13−λ −4 2; −4 13−λ −2; 2 −2 10−λ| = −λ³ + 36λ² − 405λ + 1458 = 0

has three roots: λ1 = 9, λ2 = 9, and λ3 = 18; that is, 9, 9, and 18 are the eigenvalues of A.
An equivalent condition for A to be a solution of the eigenvalue-eigenvector equation is I A - AI I = 0. This follows because the statement that Ax = Ax for some A and x # 0 implies that 0 = (A - AI)x = x 1 col 1 (A - AI) + + x k col k (A - AI) That is, the columns of A AI are linearly dependent so, by Result 2A.9(b ), I A - AI I = 0, as asserted. Following Definition 2A.30, we have shown that the eigenvalues of
···
-
A=
c �]
are A 1 = 1 and A2 = 3. The eigenvectors associated with these eigenvalues can be determined by solving the following equations:
From the first expression,

    x1 = x1
    x1 + 3x2 = x2

or x1 = −2x2. There are many solutions for x1 and x2. Setting x2 = 1 (arbitrarily) gives x1 = −2, and hence,

    x = [−2; 1]

is an eigenvector corresponding to the eigenvalue 1. From the second expression,

    x1 = 3x1
    x1 + 3x2 = 3x2

implies that x1 = 0 and x2 = 1 (arbitrarily), and hence,

    x = [0; 1]

is an eigenvector corresponding to the eigenvalue 3. It is usual practice to determine an eigenvector so that it has length unity. That is, if Ax = λx, we take e = x/√(x'x) as the eigenvector corresponding to λ. For example, the eigenvector for λ1 = 1 is e1 = [−2/√5, 1/√5]'.

Definition 2A.32. A quadratic form Q(x) in the k variables x1, x2, ..., xk is Q(x) = x'Ax, where x' = [x1, x2, ..., xk] and A is a k × k symmetric matrix.
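The eigenvalue-eigenvector computation for A = [1 0; 1 3] can be confirmed numerically:

```python
import numpy as np

# The matrix whose eigenvalues (1 and 3) were found by hand above.
A = np.array([[1.0, 0.0], [1.0, 3.0]])

# Unit-length eigenvectors from the text: e1 for lambda = 1, e2 for lambda = 3.
e1 = np.array([-2.0, 1.0]) / np.sqrt(5.0)
e2 = np.array([0.0, 1.0])

assert np.allclose(A @ e1, 1.0 * e1)   # Ax = lambda x with lambda = 1
assert np.allclose(A @ e2, 3.0 * e2)   # Ax = lambda x with lambda = 3

# The characteristic roots also come out of the general eigenvalue routine.
roots = np.sort(np.linalg.eigvals(A).real)
assert np.allclose(roots, [1.0, 3.0])
```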
Note that a quadratic form can be written as Q(x) = Σ_{i=1}^{k} Σ_{j=1}^{k} aij xixj. For k = 2, for instance, a symmetric matrix A = {aij} produces

    Q(x) = [x1 x2] A [x1; x2] = a11x1² + 2a12x1x2 + a22x2²

Any symmetric square matrix can be reconstructed from its eigenvalues and eigenvectors. The particular expression reveals the relative importance of each pair according to the relative size of the eigenvalue and the direction of the eigenvector.

Result 2A.14. The Spectral Decomposition. Let A be a k × k symmetric matrix. Then A can be expressed in terms of its k eigenvalue-eigenvector pairs (λi, ei) as

    A = Σ_{i=1}^{k} λi ei ei' ■
For example, let

    A = [2.2 .4; .4 2.8]

Then

    |A − λI| = λ² − 5λ + 6.16 − .16 = (λ − 3)(λ − 2)

so A has eigenvalues λ1 = 3 and λ2 = 2. The corresponding eigenvectors are e1 = [1/√5, 2/√5]' and e2 = [2/√5, −1/√5]', respectively. Consequently,

    A = [2.2 .4; .4 2.8]
      = 3 [1/√5; 2/√5][1/√5  2/√5] + 2 [2/√5; −1/√5][2/√5  −1/√5]
      = [.6 1.2; 1.2 2.4] + [1.6 −.8; −.8 .4]
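The spectral decomposition just displayed can be reproduced with NumPy:

```python
import numpy as np

# The matrix and eigenvector pairs from the example above.
A = np.array([[2.2, 0.4], [0.4, 2.8]])

e1 = np.array([1.0, 2.0]) / np.sqrt(5.0)
e2 = np.array([2.0, -1.0]) / np.sqrt(5.0)

# A = 3 e1 e1' + 2 e2 e2', as in Result 2A.14.
reconstruction = 3.0 * np.outer(e1, e1) + 2.0 * np.outer(e2, e2)
assert np.allclose(reconstruction, A)

# eigh returns the same eigenvalues (in ascending order) for symmetric A.
vals, vecs = np.linalg.eigh(A)
assert np.allclose(vals, [2.0, 3.0])
```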
The ideas that lead to the spectral decomposition can be extended to provide a decomposition for a rectangular, rather than a square, matrix. If A is a rectangular matrix, then the vectors in the expansion of A are the eigenvectors of the square matrices AA' and A'A.

Result 2A.15. Singular-Value Decomposition. Let A be an m × k matrix of real numbers. Then there exist an m × m orthogonal matrix U and a k × k orthogonal matrix V such that

    A = UΛV'

where the m × k matrix Λ has (i, i) entry λi ≥ 0 for i = 1, 2, ..., min(m, k) and the other entries are zero. The positive constants λi are called the singular values of A. ■
The singular-value decomposition can also be expressed as a matrix expansion that depends on the rank r of A. Specifically, there exist r positive constants λ1, λ2, ..., λr, r orthogonal m × 1 unit vectors u1, u2, ..., ur, and r orthogonal k × 1 unit vectors v1, v2, ..., vr, such that

    A = Σ_{i=1}^{r} λi ui vi' = Ur Λr Vr'

where Ur = [u1, u2, ..., ur], Vr = [v1, v2, ..., vr], and Λr is an r × r diagonal matrix with diagonal entries λi.

Here AA' has eigenvalue-eigenvector pairs (λi², ui), so

    AA' ui = λi² ui

with λ1², λ2², ..., λr² > 0 = λr+1², λr+2², ..., λm² (for m > k). Then vi = λi⁻¹ A'ui. Alternatively, the vi are the eigenvectors of A'A with the same nonzero eigenvalues λi².
The matrix expansion for the singular-value decomposition written in terms of the full-dimensional matrices U, V, Λ is

    A (m×k) = U (m×m) Λ (m×k) V' (k×k)

where U has the m orthogonal eigenvectors of AA' as its columns, V has the k orthogonal eigenvectors of A'A as its columns, and Λ is specified in Result 2A.15.

For example, let

    A = [3 1 1; −1 3 1]

Then

    AA' = [3 1 1; −1 3 1][3 −1; 1 3; 1 1] = [11 1; 1 11]

You may verify that the eigenvalues γ = λ² of AA' satisfy the equation γ² − 22γ + 120 = (γ − 12)(γ − 10), and consequently, the eigenvalues are γ1 = λ1² = 12 and γ2 = λ2² = 10. The corresponding eigenvectors are

    u1 = [1/√2, 1/√2]'    and    u2 = [1/√2, −1/√2]'

respectively.
so j A' A - yl l = -y3 - 22y2 - 120y = -y(y - 12) ( y - 10), and the eigenvalues are y 1 = AI = 12, y2 = A� = 10, and y3 = A� = 0. The nonzero eigenvalues are the same as those of AA' . A computer calculation gives the eigenvectors
[� [�
Eigenvectors v1 and v2 can be verified by checking: 10
A ' Av1 =
10
A ' Av2 =
0 10 4
�] � [�] � [�] � ] [-�] [ � ]
0 1 10 vs 4 2
= 12
o
1 = 10 - vs
= Aiv1
o
= Ah2
Taking λ1 = √12 and λ2 = √10, we find that the singular-value decomposition of A is

    A = [3 1 1; −1 3 1]
      = √12 [1/√2; 1/√2][1/√6  2/√6  1/√6] + √10 [1/√2; −1/√2][2/√5  −1/√5  0]
( [3] ) .
m k :L :L i = l j=l
( a i j - bi j ) 2
=
tr [ (A - B ) (A - B ) ' ]
Result 2A.16. Let A be an m X k matrix of real numbers with m singular value decomposition UA V' . Let s < k = rank (A). Then
B
=
>
k and
s " � Al·U·V� l l i= l
is the rank-s least squares approximation to A. It minimizes tr [ (A - B) (A - B) ' J over all m X k matrices B having rank no greater than s. The minimum value, or
k
error of approximation, is :L Af . +
•
i =s l
To establish this result, we use UU' = I_m and VV' = I_k to write the sum of squares as

    tr[(A - B)(A - B)'] = tr[UU'(A - B)VV'(A - B)']
                        = tr[U'(A - B)VV'(A - B)'U]
                        = tr[(Λ - C)(Λ - C)'] = Σ_{i=1}^{m} Σ_{j=1}^{k} (λ_{ij} - c_{ij})²
                        = Σ_{i=1}^{m} (λᵢ - c_{ii})² + ΣΣ_{i≠j} c_{ij}²

where C = U'BV. Clearly, the minimum occurs when c_{ij} = 0 for i ≠ j and c_{ii} = λᵢ for the s largest singular values. The other c_{ii} = 0. That is, U'BV = Λ_s, or B = Σ_{i=1}^{s} λᵢuᵢvᵢ'. •
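Result 2A.16 can also be checked numerically. The sketch below (again Python/NumPy, an aside to the text) builds the rank-1 least squares approximation to the 2 × 3 matrix A of the preceding example and confirms that the error of approximation equals the discarded squared singular value, λ₂² = 10:

```python
import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

U, lam, Vt = np.linalg.svd(A)  # singular values in lam, largest first

# Rank-1 (s = 1) least squares approximation B = lam_1 u_1 v_1'.
B = lam[0] * np.outer(U[:, 0], Vt[0, :])

# Minimum of tr[(A - B)(A - B)'] over all matrices of rank at most 1.
error = np.trace((A - B) @ (A - B).T)
print(np.round(error, 6))  # 10.0
```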
Chapter 2  Matrix Algebra and Random Vectors
EXERCISES

2.1. Let x' = [5, 1, 3] and y' = [-1, 3, 1].
(a) Graph the two vectors.
(b) Find (i) the length of x, (ii) the angle between x and y, and (iii) the projection of y on x.
(c) Since x̄ = 3 and ȳ = 1, graph [5 - 3, 1 - 3, 3 - 3] = [2, -2, 0] and [-1 - 1, 3 - 1, 1 - 1] = [-2, 2, 0].

2.2. Given the matrices

    A = [ -1  3 ]    B = [  4  -3 ]    C = [  5 ]
        [  4  2 ]        [  1  -2 ]        [ -4 ]
                         [ -2   0 ]        [  2 ]

perform the indicated multiplications.
(a) 5A
(b) BA
(c) A'B'
(d) C'B
(e) Is AB defined?

2.3. Verify the following properties of the transpose when

    A = [ 2  1 ]    B = [ 1  4  2 ]    C = [ 1  4 ]
        [ 1  3 ]        [ 5  0  3 ]        [ 3  2 ]

(a) (A')' = A
(b) (C')⁻¹ = (C⁻¹)'
(c) (AB)' = B'A'
(d) For general A (m × k) and B (k × ℓ), (AB)' = B'A'.

2.4. When A⁻¹ and B⁻¹ exist, prove each of the following.
(a) (A')⁻¹ = (A⁻¹)'
(b) (AB)⁻¹ = B⁻¹A⁻¹
Hint: Part a can be proved by noting that AA⁻¹ = I, I = I', and (AA⁻¹)' = (A⁻¹)'A'. Part b follows from (B⁻¹A⁻¹)AB = B⁻¹(A⁻¹A)B = B⁻¹B = I.

2.5. Check that

    Q = [   5/13  12/13 ]
        [ -12/13   5/13 ]

is an orthogonal matrix.

2.6. Let

    A = [  9  -2 ]
        [ -2   6 ]

(a) Is A symmetric?
(b) Show that A is positive definite.
2.7. Let A be as given in Exercise 2.6.
(a) Determine the eigenvalues and eigenvectors of A.
(b) Write the spectral decomposition of A.
(c) Find A⁻¹.
(d) Find the eigenvalues and eigenvectors of A⁻¹.

2.8. Given the matrix

    A = [ 1   2 ]
        [ 2  -2 ]

find the eigenvalues λ₁ and λ₂ and the associated normalized eigenvectors e₁ and e₂. Determine the spectral decomposition (2-16) of A.

2.9. Let A be as in Exercise 2.8.
(a) Find A⁻¹.
(b) Compute the eigenvalues and eigenvectors of A⁻¹.
(c) Write the spectral decomposition of A⁻¹, and compare it with that of A from Exercise 2.8.

2.10. Consider the matrices

    A = [ 4      4.001 ]    and    B = [ 4      4.001    ]
        [ 4.001  4.002 ]               [ 4.001  4.002001 ]

These matrices are identical except for a small difference in the (2, 2) position. Moreover, the columns of A (and B) are nearly linearly dependent. Show that A⁻¹ ≐ (-3)B⁻¹. Consequently, small changes, perhaps caused by rounding, can give substantially different inverses.

2.11. Show that the determinant of the p × p diagonal matrix A = {a_{ij}} with a_{ij} = 0, i ≠ j, is given by the product of the diagonal elements; thus, |A| = a₁₁a₂₂⋯a_{pp}.
Hint: By Definition 2A.24, |A| = a₁₁A₁₁ + 0 + ⋯ + 0. Repeat for the submatrix A₁₁ obtained by deleting the first row and first column of A.

2.12. Show that the determinant of a square symmetric p × p matrix A can be expressed as the product of its eigenvalues λ₁, λ₂, …, λ_p; that is, |A| = Π_{i=1}^{p} λᵢ.
Hint: From (2-16) and (2-20), A = PΛP' with P'P = I. From Result 2A.11(e), |A| = |PΛP'| = |P||ΛP'| = |P||Λ||P'| = |Λ||I|, since |I| = |P'P| = |P'||P|. Apply Exercise 2.11.

2.13. Show that |Q| = +1 or -1 if Q is a p × p orthogonal matrix.
Hint: |QQ'| = |I|. Also, from Result 2A.11, |Q||Q'| = |Q|². Thus, |Q|² = |I|. Now use Exercise 2.11.

2.14. Show that Q'AQ and A have the same eigenvalues if Q is a p × p orthogonal matrix and A is p × p.
Hint: Let λ be an eigenvalue of A. Then 0 = |A - λI|. By Exercise 2.13 and Result 2A.11(e), we can write 0 = |Q'||A - λI||Q| = |Q'AQ - λI|, since Q'Q = I.

2.15. A quadratic form x'Ax is said to be positive definite if the matrix A is positive definite. Is the quadratic form 3x₁² + 3x₂² - 2x₁x₂ positive definite?
2.16. Consider an arbitrary n × p matrix A. Then A'A is a p × p symmetric matrix. Show that A'A is necessarily nonnegative definite.
Hint: Set y = Ax so that y'y = x'A'Ax.

2.17. Prove that every eigenvalue of a k × k positive definite matrix A is positive.
Hint: Consider the definition of an eigenvalue, where Ae = λe. Multiply on the left by e' so that e'Ae = λe'e.

2.18. Consider the sets of points (x₁, x₂) whose "distances" from the origin are given by

    c² = 4x₁² + 3x₂² - 2√2 x₁x₂

for c² = 1 and for c² = 4. Determine the major and minor axes of the ellipses of constant distances and their associated lengths. Sketch the ellipses of constant distances and comment on their positions. What will happen as c² increases?

2.19. Let A^{1/2} (m × m) = Σ_{i=1}^{m} √λᵢ eᵢeᵢ' = PΛ^{1/2}P', where PP' = P'P = I. (The λᵢ's and the eᵢ's are the eigenvalues and associated normalized eigenvectors of the matrix A.) Show Properties (1)-(4) of the square-root matrix in (2-22).

2.20. Determine the square-root matrix A^{1/2}, using the matrix A in Exercise 2.3. Also, determine A^{-1/2}, and show that A^{1/2}A^{-1/2} = A^{-1/2}A^{1/2} = I.

2.21. (See Result 2A.15.) Using the matrix

    A = [ 1   1 ]
        [ 2  -2 ]
        [ 2   2 ]

(a) Calculate A'A and obtain its eigenvalues and eigenvectors.
(b) Calculate AA' and obtain its eigenvalues and eigenvectors. Check that the nonzero eigenvalues are the same as those in part a.
(c) Obtain the singular-value decomposition of A.
2.22. (See Result 2A.15.) Using the matrix

    A = [ 4  8   8 ]
        [ 3  6  -9 ]

(a) Calculate AA' and obtain its eigenvalues and eigenvectors.
(b) Calculate A'A and obtain its eigenvalues and eigenvectors. Check that the nonzero eigenvalues are the same as those in part a.
(c) Obtain the singular-value decomposition of A.

2.23. Verify the relationships V^{1/2}ρV^{1/2} = Σ and ρ = (V^{1/2})⁻¹Σ(V^{1/2})⁻¹, where Σ is the p × p population covariance matrix [Equation (2-32)], ρ is the p × p population correlation matrix [Equation (2-34)], and V^{1/2} is the population standard deviation matrix [Equation (2-35)].

2.24. Let X have covariance matrix

    Σ = [ 4  0  0 ]
        [ 0  9  0 ]
        [ 0  0  1 ]

Find
(a) Σ⁻¹
(b) The eigenvalues and eigenvectors of Σ.
(c) The eigenvalues and eigenvectors of Σ⁻¹.

2.25. Let X have covariance matrix

    Σ = [ 25  -2  4 ]
        [ -2   4  1 ]
        [  4   1  9 ]

(a) Determine ρ and V^{1/2}.
(b) Multiply your matrices to check the relation V^{1/2}ρV^{1/2} = Σ.

2.26. Use Σ as given in Exercise 2.25.
(a) Find ρ₁₃.
(b) Find the correlation between X₁ and ½X₂ + ½X₃.

2.27. Derive expressions for the mean and variances of the following linear combinations in terms of the means and covariances of the random variables X₁, X₂, and X₃.
(a) X₁ - 2X₂
(b) -X₁ + 3X₂
(c) X₁ + X₂ + X₃
(d) X₁ + 2X₂ - X₃
(e) 3X₁ - 4X₂ if X₁ and X₂ are independent random variables.

2.28. Show that

    Cov(c₁₁X₁ + c₁₂X₂ + ⋯ + c₁ₚXₚ, c₂₁X₁ + c₂₂X₂ + ⋯ + c₂ₚXₚ) = c₁'Σc₂

where c₁' = [c₁₁, c₁₂, …, c₁ₚ] and c₂' = [c₂₁, c₂₂, …, c₂ₚ]. This verifies the off-diagonal elements in CΣC' in (2-45) or diagonal elements if c₁ = c₂.
Hint: By (2-43),

    Z₁ - E(Z₁) = c₁₁(X₁ - μ₁) + ⋯ + c₁ₚ(Xₚ - μₚ)
    Z₂ - E(Z₂) = c₂₁(X₁ - μ₁) + ⋯ + c₂ₚ(Xₚ - μₚ)

so

    Cov(Z₁, Z₂) = E[(Z₁ - E(Z₁))(Z₂ - E(Z₂))]
                = E[(c₁₁(X₁ - μ₁) + ⋯ + c₁ₚ(Xₚ - μₚ))(c₂₁(X₁ - μ₁) + c₂₂(X₂ - μ₂) + ⋯ + c₂ₚ(Xₚ - μₚ))]

The product

    (c₁₁(X₁ - μ₁) + ⋯ + c₁ₚ(Xₚ - μₚ))(c₂₁(X₁ - μ₁) + ⋯ + c₂ₚ(Xₚ - μₚ))
        = ( Σ_{ℓ=1}^{p} c₁ℓ(Xℓ - μℓ) ) ( Σ_{m=1}^{p} c₂m(Xm - μm) )
        = Σ_{ℓ=1}^{p} Σ_{m=1}^{p} c₁ℓ c₂m (Xℓ - μℓ)(Xm - μm)

has expected value

    Σ_{ℓ=1}^{p} Σ_{m=1}^{p} c₁ℓ c₂m σℓm = [c₁₁, …, c₁ₚ] Σ [c₂₁, …, c₂ₚ]'

Verify the last step by the definition of matrix multiplication. The same steps hold for all elements.

2.29. Consider the arbitrary random vector X' = [X₁, X₂, X₃, X₄, X₅] with mean vector μ' = [μ₁, μ₂, μ₃, μ₄, μ₅]. Partition X into

    X = [ X⁽¹⁾ ]
        [ X⁽²⁾ ]

where X⁽¹⁾ = [X₁, X₂]' and X⁽²⁾ = [X₃, X₄, X₅]'. Let Σ be the covariance matrix of X with general element σᵢₖ. Partition Σ into the covariance matrices of X⁽¹⁾ and X⁽²⁾ and the covariance matrix of an element of X⁽¹⁾ and an element of X⁽²⁾.

2.30. You are given the random vector X' = [X₁, X₂, X₃, X₄] with mean vector μ'_X = [4, 3, 2, 1] and variance-covariance matrix

    Σ_X = [ 3  0   2   2 ]
          [ 0  1   1   0 ]
          [ 2  1   9  -2 ]
          [ 2  0  -2   4 ]

Partition X as

    X = [ X₁ ]   [ X⁽¹⁾ ]
        [ X₂ ] = [ X⁽²⁾ ]
        [ X₃ ]
        [ X₄ ]

with X⁽¹⁾ = [X₁, X₂]' and X⁽²⁾ = [X₃, X₄]'. Let

    A = [1  2]    and    B = [ 1  -2 ]
                             [ 2  -1 ]

and consider the linear combinations AX⁽¹⁾ and BX⁽²⁾. Find
(a) E(X⁽¹⁾)
(b) E(AX⁽¹⁾)
(c) Cov(X⁽¹⁾)
(d) Cov(AX⁽¹⁾)
(e) E(X⁽²⁾)
(f) E(BX⁽²⁾)
(g) Cov(X⁽²⁾)
(h) Cov(BX⁽²⁾)
(i) Cov(X⁽¹⁾, X⁽²⁾)
(j) Cov(AX⁽¹⁾, BX⁽²⁾)
2.31. Repeat Exercise 2.30, but with A and B replaced by

    A = [1  -1]    and    B = [  2  0 ]
                              [ -1  1 ]

2.32. You are given the random vector X' = [X₁, X₂, …, X₅] with mean vector μ'_X = [2, 4, -1, 3, 0] and variance-covariance matrix

    Σ_X = [  4    -1    1/2  -1/2   0 ]
          [ -1     3    1    -1     0 ]
          [  1/2   1    6     1    -1 ]
          [ -1/2  -1    1     4     0 ]
          [  0     0   -1     0     2 ]

Partition X as

    X = [ X₁ ]   [ X⁽¹⁾ ]
        [ X₂ ] = [ X⁽²⁾ ]
        [ X₃ ]
        [ X₄ ]
        [ X₅ ]

with X⁽¹⁾ = [X₁, X₂]' and X⁽²⁾ = [X₃, X₄, X₅]'. Let

    A = [1  -1]    and    B = [ 1  1   1 ]
                              [ 1  1  -1 ]

and consider the linear combinations AX⁽¹⁾ and BX⁽²⁾. Find
(a) E(X⁽¹⁾)
(b) E(AX⁽¹⁾)
(c) Cov(X⁽¹⁾)
(d) Cov(AX⁽¹⁾)
(e) E(X⁽²⁾)
(f) E(BX⁽²⁾)
(g) Cov(X⁽²⁾)
(h) Cov(BX⁽²⁾)
(i) Cov(X⁽¹⁾, X⁽²⁾)
(j) Cov(AX⁽¹⁾, BX⁽²⁾)

2.33. Repeat Exercise 2.32, but with X partitioned so that X⁽¹⁾ = [X₁, X₂, X₃]' and X⁽²⁾ = [X₄, X₅]', and with A and B replaced by

    A = [ 2  -1  0 ]    and    B = [ 1   2 ]
        [ 1   1  3 ]              [ 1  -1 ]
2.34. Consider the vectors b' = [2, -1, 4, 0] and d' = [-1, 3, -2, 1]. Verify the Cauchy-Schwarz inequality (b'd)² ≤ (b'b)(d'd).

2.35. Using the vectors b' = [-4, 3] and d' = [1, 1], verify the extended Cauchy-Schwarz inequality (b'd)² ≤ (b'Bb)(d'B⁻¹d) if

    B = [  2  -2 ]
        [ -2   5 ]

2.36. Find the maximum and minimum values of the quadratic form 4x₁² + 4x₂² + 6x₁x₂ for all points x' = [x₁, x₂] such that x'x = 1.

2.37. With A as given in Exercise 2.6, find the maximum value of x'Ax for x'x = 1.

2.38. Find the maximum and minimum values of the ratio x'Ax/x'x for any nonzero vectors x' = [x₁, x₂, x₃] if

    A = [ 13  -4   2 ]
        [ -4  13  -2 ]
        [  2  -2  10 ]

2.39. Show that the product A B C, with dimensions (r × s)(s × t)(t × v), has (i, j)th entry Σ_{ℓ=1}^{s} Σ_{k=1}^{t} a_{iℓ} b_{ℓk} c_{kj}.
Hint: BC has (ℓ, j)th entry Σ_{k=1}^{t} b_{ℓk} c_{kj} = d_{ℓj}. So A(BC) has (i, j)th element Σ_{ℓ=1}^{s} a_{iℓ} d_{ℓj} = Σ_{ℓ=1}^{s} Σ_{k=1}^{t} a_{iℓ} b_{ℓk} c_{kj}.

2.40. Verify (2-24): E(X + Y) = E(X) + E(Y) and E(AXB) = AE(X)B.
Hint: X + Y has X_{ij} + Y_{ij} as its (i, j)th element. Now, E(X_{ij} + Y_{ij}) = E(X_{ij}) + E(Y_{ij}) by a univariate property of expectation, and this last quantity is the (i, j)th element of E(X) + E(Y). Next (see Exercise 2.39), AXB has (i, j)th entry Σ_{ℓ} Σ_{k} a_{iℓ} X_{ℓk} b_{kj}, and by the additive property of expectation,

    E( Σ_{ℓ} Σ_{k} a_{iℓ} X_{ℓk} b_{kj} ) = Σ_{ℓ} Σ_{k} a_{iℓ} E(X_{ℓk}) b_{kj}

which is the (i, j)th element of AE(X)B.
REFERENCES

1. Bellman, R. Introduction to Matrix Analysis (2d ed.). New York: McGraw-Hill, 1970.
2. Bhattacharyya, G. K., and R. A. Johnson. Statistical Concepts and Methods. New York: John Wiley, 1977.
3. Eckart, C., and G. Young. "The Approximation of One Matrix by Another of Lower Rank." Psychometrika, 1 (1936), 211-218.
4. Graybill, F. A. Introduction to Matrices with Applications in Statistics. Belmont, CA: Wadsworth, 1969.
5. Halmos, P. R. Finite Dimensional Vector Spaces (2d ed.). Princeton, NJ: D. Van Nostrand, 1958.
6. Noble, B., and J. W. Daniel. Applied Linear Algebra (3d ed.). Englewood Cliffs, NJ: Prentice Hall, 1988.
CHAPTER 3

Sample Geometry and Random Sampling

3.1 INTRODUCTION

With the vector concepts introduced in the previous chapter, we can now delve deeper into the geometrical interpretations of the descriptive statistics x̄, Sₙ, and R; we do so in Section 3.2. Many of our explanations use the representation of the columns of X as p vectors in n dimensions. In Section 3.3 we introduce the assumption that the observations constitute a random sample. Simply stated, random sampling implies that (1) measurements taken on different items (or trials) are unrelated to one another and (2) the joint distribution of all p variables remains the same for all items. Ultimately, it is this structure of the random sample that justifies a particular choice of distance and dictates the geometry for the n-dimensional representation of the data. Furthermore, when data can be treated as a random sample, statistical inferences are based on a solid foundation.

Returning to geometric interpretations in Section 3.4, we introduce a single number, called generalized variance, to describe variability. This generalization of variance is an integral part of the comparison of multivariate means. In later sections we use matrix algebra to provide concise expressions for the matrix products and sums that allow us to calculate x̄ and Sₙ directly from the data matrix X. The connection between x̄, Sₙ, and the means and covariances for linear combinations of variables is also clearly delineated, using the notion of matrix products.
3.2 THE GEOMETRY OF THE SAMPLE

A single multivariate observation is the collection of measurements on p different variables taken on the same item or trial. As in Chapter 1, if n observations have been obtained, the entire data set can be placed in an n × p array (matrix):

    X       = [ x₁₁  x₁₂  ⋯  x₁ₚ ]
    (n × p)   [ x₂₁  x₂₂  ⋯  x₂ₚ ]
              [  ⋮    ⋮        ⋮  ]
              [ xₙ₁  xₙ₂  ⋯  xₙₚ ]

Each row of X represents a multivariate observation. Since the entire set of measurements is often one particular realization of what might have been observed, we say that the data are a sample of size n from a p-variate "population." The sample then consists of n measurements, each of which has p components.

As we have seen, the data can be plotted in two different ways. For the p-dimensional scatter plot, the rows of X represent n points in p-dimensional space. We can write

    X       = [ x₁₁  x₁₂  ⋯  x₁ₚ ]   [ x₁' ]  ← 1st (multivariate) observation
    (n × p)   [ x₂₁  x₂₂  ⋯  x₂ₚ ] = [ x₂' ]
              [  ⋮    ⋮        ⋮  ]   [  ⋮  ]                                     (3-1)
              [ xₙ₁  xₙ₂  ⋯  xₙₚ ]   [ xₙ' ]  ← nth (multivariate) observation

The row vector xⱼ', representing the jth observation, contains the coordinates of a point.

The scatter plot of n points in p-dimensional space provides information on the locations and variability of the points. If the points are regarded as solid spheres, the sample mean vector x̄, given by (1-8), is the center of balance. Variability occurs in more than one direction, and it is quantified by the sample variance-covariance matrix Sₙ. A single numerical measure of variability is provided by the determinant of the sample variance-covariance matrix. When p is greater than 3, this scatter plot representation cannot actually be graphed. Yet the consideration of the data as n points in p dimensions provides insights that are not readily available from algebraic expressions. Moreover, the concepts illustrated for p = 2 or p = 3 remain valid for the other cases.
Example 3.1 (Computing the mean vector)

Compute the mean vector x̄ from the data matrix

    X = [  4  1 ]
        [ -1  3 ]
        [  3  5 ]

Plot the n = 3 data points in p = 2 space, and locate x̄ on the resulting diagram.
The first point, x₁, has coordinates x₁' = [4, 1]. Similarly, the remaining two points are x₂' = [-1, 3] and x₃' = [3, 5]. Finally,

    x̄ = [ (4 - 1 + 3)/3 ]   [ 2 ]
        [ (1 + 3 + 5)/3 ] = [ 3 ]
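As a computational aside (Python with NumPy; the text itself uses no software), the mean vector of Example 3.1 is the column-wise average of X:

```python
import numpy as np

X = np.array([[4.0, 1.0],
              [-1.0, 3.0],
              [3.0, 5.0]])

# The sample mean vector averages the n = 3 rows of X.
x_bar = X.mean(axis=0)
print(x_bar)  # [2. 3.]
```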
Figure 3.1  A plot of the data matrix X as n = 3 points in p = 2 space.

Figure 3.1 shows that x̄ is the balance point (center of gravity) of the scatter plot. •

The alternative geometrical representation is constructed by considering the data as p vectors in n-dimensional space. Here we take the elements of the columns of the data matrix to be the coordinates of the vectors. Let

    X       = [ x₁₁  x₁₂  ⋯  x₁ₚ ]
    (n × p)   [ x₂₁  x₂₂  ⋯  x₂ₚ ] = [ y₁ | y₂ | ⋯ | yₚ ]                        (3-2)
              [  ⋮    ⋮        ⋮  ]
              [ xₙ₁  xₙ₂  ⋯  xₙₚ ]

Then the coordinates of the first point y₁' = [x₁₁, x₂₁, …, xₙ₁] are the n measurements on the first variable. In general, the ith point yᵢ' = [x₁ᵢ, x₂ᵢ, …, xₙᵢ] is determined by the n-tuple of all measurements on the ith variable. In this geometrical representation, we depict y₁, …, yₚ as vectors rather than points, as in the p-dimensional scatter plot. We shall be manipulating these quantities shortly, using the algebra of vectors discussed in Chapter 2.
Example 3.2 (Data as p vectors in n dimensions)

Plot the following data as p = 2 vectors in n = 3 space:

    X = [  4  1 ]
        [ -1  3 ]
        [  3  5 ]

Here y₁' = [4, -1, 3] and y₂' = [1, 3, 5]. These vectors are shown in Figure 3.2. •

Many of the algebraic expressions we shall encounter in multivariate analysis can be related to the geometrical notions of length, angle, and volume. This is important because geometrical representations ordinarily facilitate understanding and lead to further insights.
Figure 3.2  A plot of the data matrix X as p = 2 vectors in n = 3 space.
Unfortunately, we are limited to visualizing objects in three dimensions, and consequently, the n-dimensional representation of the data matrix X may not seem like a particularly useful device for n > 3. It turns out, however, that geometrical relationships and the associated statistical concepts depicted for any three vectors remain valid regardless of their dimension. This follows because three vectors, even if n-dimensional, can span no more than a three-dimensional space, just as two vectors with any number of components must lie in a plane. By selecting an appropriate three-dimensional perspective, that is, a portion of the n-dimensional space containing the three vectors of interest, a view is obtained that preserves both lengths and angles. Thus, it is possible, with the right choice of axes, to illustrate certain algebraic statistical concepts in terms of only two or three vectors of any dimension n. Since the specific choice of axes is not relevant to the geometry, we shall always label the coordinate axes 1, 2, and 3.

It is possible to give a geometrical interpretation of the process of finding a sample mean. We start by defining the n × 1 vector 1ₙ' = [1, 1, …, 1]. (To simplify the notation, the subscript n will be dropped when the dimension of the vector 1ₙ is clear from the context.) The vector 1 forms equal angles with each of the n coordinate axes, so the vector (1/√n)1 has unit length in the equal-angle direction. Consider the vector yᵢ' = [x₁ᵢ, x₂ᵢ, …, xₙᵢ]. The projection of yᵢ on the unit vector (1/√n)1 is, by (2-8),

    ( yᵢ' (1/√n)1 ) (1/√n)1 = ( (x₁ᵢ + x₂ᵢ + ⋯ + xₙᵢ)/n ) 1 = x̄ᵢ1               (3-3)

That is, the sample mean x̄ᵢ = (x₁ᵢ + x₂ᵢ + ⋯ + xₙᵢ)/n = yᵢ'1/n corresponds to the multiple of 1 required to give the projection of yᵢ onto the line determined by 1. Further, for each yᵢ, we have the decomposition

    yᵢ = x̄ᵢ1 + (yᵢ - x̄ᵢ1)

[diagram: yᵢ is the hypotenuse of a right triangle whose legs are x̄ᵢ1, lying along 1, and yᵢ - x̄ᵢ1, perpendicular to 1]
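Equation (3-3) can be tried out on a column of the data matrix from Example 3.1; in this Python/NumPy aside (not part of the original text), projecting y₁ onto the equal-angle unit vector (1/√n)1 recovers x̄₁1:

```python
import numpy as np

y1 = np.array([4.0, -1.0, 3.0])   # first column of X from Example 3.1
n = len(y1)
u = np.ones(n) / np.sqrt(n)       # unit vector in the equal-angle direction

# Projection of y1 on (1/sqrt(n)) * 1, as in (3-3).
projection = (y1 @ u) * u
print(projection)  # [2. 2. 2.]  =  x_bar_1 * 1
```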
Figure 3.3  The decomposition of yᵢ into a mean component x̄ᵢ1 and a deviation component dᵢ = yᵢ - x̄ᵢ1, i = 1, 2, 3.
where x̄ᵢ1 is perpendicular to yᵢ - x̄ᵢ1. The deviation, or mean corrected, vector is

    dᵢ = yᵢ - x̄ᵢ1 = [ x₁ᵢ - x̄ᵢ ]
                    [ x₂ᵢ - x̄ᵢ ]
                    [    ⋮     ]                                                 (3-4)
                    [ xₙᵢ - x̄ᵢ ]

The elements of dᵢ are the deviations of the measurements on the ith variable from their sample mean. Decomposition of the yᵢ vectors into mean components and deviation from the mean components is shown in Figure 3.3 for p = 3 and n = 3.
Example 3.3 (Decomposing a vector into its mean and deviation components)

Let us carry out the decomposition of yᵢ into x̄ᵢ1 and dᵢ = yᵢ - x̄ᵢ1, i = 1, 2, for the data given in Example 3.2:

    X = [  4  1 ]
        [ -1  3 ]
        [  3  5 ]

Here x̄₁ = (4 - 1 + 3)/3 = 2 and x̄₂ = (1 + 3 + 5)/3 = 3, so

    x̄₁1 = 2 [ 1 ] = [ 2 ]        x̄₂1 = 3 [ 1 ] = [ 3 ]
            [ 1 ]   [ 2 ]                [ 1 ]   [ 3 ]
            [ 1 ]   [ 2 ]                [ 1 ]   [ 3 ]

Consequently,

    d₁ = y₁ - x̄₁1 = [  4 ]   [ 2 ]   [  2 ]
                    [ -1 ] - [ 2 ] = [ -3 ]
                    [  3 ]   [ 2 ]   [  1 ]

and

    d₂ = y₂ - x̄₂1 = [ 1 ]   [ 3 ]   [ -2 ]
                    [ 3 ] - [ 3 ] = [  0 ]
                    [ 5 ]   [ 3 ]   [  2 ]

We note that x̄₁1 and d₁ = y₁ - x̄₁1 are perpendicular, because

    (x̄₁1)'(y₁ - x̄₁1) = [2  2  2] [  2 ] = 4 - 6 + 2 = 0
                                 [ -3 ]
                                 [  1 ]

A similar result holds for x̄₂1 and d₂ = y₂ - x̄₂1. The decomposition is

    y₁ = [  4 ]   [ 2 ]   [  2 ]          y₂ = [ 1 ]   [ 3 ]   [ -2 ]
         [ -1 ] = [ 2 ] + [ -3 ]               [ 3 ] = [ 3 ] + [  0 ]
         [  3 ]   [ 2 ]   [  1 ]               [ 5 ]   [ 3 ]   [  2 ]
                                                                               •
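The decomposition of Example 3.3 is easy to verify numerically; the following Python/NumPy sketch (an aside to the text) reproduces d₁ and d₂ and checks that each mean component is perpendicular to its deviation component:

```python
import numpy as np

y1 = np.array([4.0, -1.0, 3.0])
y2 = np.array([1.0, 3.0, 5.0])
one = np.ones(3)

# Deviation components d_i = y_i - x_bar_i * 1.
d1 = y1 - y1.mean() * one   # [ 2. -3.  1.]
d2 = y2 - y2.mean() * one   # [-2.  0.  2.]

# Each mean component x_bar_i * 1 is perpendicular to d_i.
print(np.dot(y1.mean() * one, d1))  # 0.0
print(np.dot(y2.mean() * one, d2))  # 0.0
```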
For the time being, we are interested in the deviation (or residual) vectors dᵢ = yᵢ - x̄ᵢ1. A plot of the deviation vectors of Figure 3.3 is given in Figure 3.4. We have translated the deviation vectors to the origin without changing their lengths or orientations.

Now consider the squared lengths of the deviation vectors. Using (2-5) and (3-4), we obtain

    L²_{dᵢ} = dᵢ'dᵢ = Σ_{j=1}^{n} (x_{ji} - x̄ᵢ)²                                 (3-5)

    (Length of deviation vector)² = sum of squared deviations

Figure 3.4  The deviation vectors dᵢ from Figure 3.3.
From (1-3), we see that the squared length is proportional to the variance of the measurements on the ith variable. Equivalently, the length is proportional to the standard deviation. Longer vectors represent more variability than shorter vectors.

For any two deviation vectors dᵢ and dₖ,

    dᵢ'dₖ = Σ_{j=1}^{n} (x_{ji} - x̄ᵢ)(x_{jk} - x̄ₖ)                              (3-6)

Let θᵢₖ denote the angle formed by the vectors dᵢ and dₖ. From (2-6), we get

    dᵢ'dₖ = L_{dᵢ} L_{dₖ} cos(θᵢₖ)

or, using (3-5) and (3-6), we obtain

    Σ_{j=1}^{n} (x_{ji} - x̄ᵢ)(x_{jk} - x̄ₖ) = √( Σ_{j=1}^{n} (x_{ji} - x̄ᵢ)² ) √( Σ_{j=1}^{n} (x_{jk} - x̄ₖ)² ) cos(θᵢₖ)

so that [see (1-5)]

    rᵢₖ = sᵢₖ / (√sᵢᵢ √sₖₖ) = cos(θᵢₖ)                                           (3-7)

The cosine of the angle is the sample correlation coefficient. Thus, if the two deviation vectors have nearly the same orientation, the sample correlation will be close to 1. If the two vectors are nearly perpendicular, the sample correlation will be approximately zero. If the two vectors are oriented in nearly opposite directions, the sample correlation will be close to -1.
Example 3.4 (Calculating Sₙ and R from deviation vectors)

Given the deviation vectors in Example 3.3, let us compute the sample variance-covariance matrix Sₙ and sample correlation matrix R using the geometrical concepts just introduced. From Example 3.3,

    d₁ = [  2 ]   and   d₂ = [ -2 ]
         [ -3 ]              [  0 ]
         [  1 ]              [  2 ]

These vectors, translated to the origin, are shown in Figure 3.5 on page 119. Now,

    d₁'d₁ = 4 + 9 + 1 = 14 = 3s₁₁

or s₁₁ = 14/3. Also,

    d₂'d₂ = 4 + 0 + 4 = 8 = 3s₂₂

or s₂₂ = 8/3. Finally,

    d₁'d₂ = [2  -3  1] [ -2 ] = -4 + 0 + 2 = -2 = 3s₁₂
                       [  0 ]
                       [  2 ]

or s₁₂ = -2/3. Consequently,

    r₁₂ = s₁₂ / (√s₁₁ √s₂₂) = (-2/3) / (√(14/3) √(8/3)) = -.189

and

    Sₙ = [ 14/3  -2/3 ]        R = [    1   -.189 ]
         [ -2/3   8/3 ]            [ -.189     1  ]

Figure 3.5  The deviation vectors d₁ and d₂.
                                                                               •
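The entries of Sₙ and R in Example 3.4 come directly from inner products of the deviation vectors, with the correlation appearing as the cosine of the angle between them; a short Python/NumPy aside (not part of the text):

```python
import numpy as np

d1 = np.array([2.0, -3.0, 1.0])
d2 = np.array([-2.0, 0.0, 2.0])
n = 3

# Variances and covariance with divisor n, as in (3-5) and (3-6).
s11 = d1 @ d1 / n   # 14/3
s22 = d2 @ d2 / n   # 8/3
s12 = d1 @ d2 / n   # -2/3

# The sample correlation r12 is the cosine of the angle between d1 and d2.
r12 = (d1 @ d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(np.round(r12, 3))  # -0.189
```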
The concepts of length, angle, and projection have provided us with a geometrical interpretation of the sample. We summarize as follows:

    The projection of a column yᵢ of the data matrix X onto the equal angular vector 1 is the vector x̄ᵢ1. The vector x̄ᵢ1 has length √n |x̄ᵢ|. Therefore, the ith sample mean, x̄ᵢ, is related to the length of the projection of yᵢ on 1.

    The information comprising Sₙ is obtained from the deviation vectors dᵢ = yᵢ - x̄ᵢ1 = [x₁ᵢ - x̄ᵢ, x₂ᵢ - x̄ᵢ, …, xₙᵢ - x̄ᵢ]'. The square of the length of dᵢ is nsᵢᵢ, and the (inner) product between dᵢ and dₖ is nsᵢₖ.¹ The sample correlation rᵢₖ is the cosine of the angle between dᵢ and dₖ.
¹ The square of the length and the inner product are (n - 1)sᵢᵢ and (n - 1)sᵢₖ, respectively, when the divisor n - 1 is used in the definitions of the sample variance and covariance.
3.3 RANDOM SAMPLES AND THE EXPECTED VALUES OF THE SAMPLE MEAN AND COVARIANCE MATRIX
In order to study the sampling variability of statistics like x̄ and Sₙ with the ultimate aim of making inferences, we need to make assumptions about the variables whose observed values constitute the data set X.

Suppose, then, that the data have not yet been observed, but we intend to collect n sets of measurements on p variables. Before the measurements are made, their values cannot, in general, be predicted exactly. Consequently, we treat them as random variables. In this context, let the (j, k)th entry in the data matrix be the random variable X_{jk}. Each set of measurements Xⱼ on p variables is a random vector, and we have the random matrix

    X       = [ X₁₁  X₁₂  ⋯  X₁ₚ ]   [ X₁' ]
    (n × p)   [ X₂₁  X₂₂  ⋯  X₂ₚ ] = [ X₂' ]
              [  ⋮    ⋮        ⋮  ]   [  ⋮  ]                                    (3-8)
              [ Xₙ₁  Xₙ₂  ⋯  Xₙₚ ]   [ Xₙ' ]

A random sample can now be defined. If the row vectors X₁', X₂', …, Xₙ' in (3-8) represent independent observations from a common joint distribution with density function f(x) = f(x₁, x₂, …, xₚ), then X₁, X₂, …, Xₙ are said to form a random sample from f(x). Mathematically, X₁, X₂, …, Xₙ form a random sample if their joint density function is given by the product f(x₁)f(x₂)⋯f(xₙ), where f(xⱼ) = f(x_{j1}, x_{j2}, …, x_{jp}) is the density function for the jth row vector.

Two points connected with the definition of random sample merit special attention:

1. The measurements of the p variables in a single trial, such as Xⱼ' = [X_{j1}, X_{j2}, …, X_{jp}], will usually be correlated. Indeed, we expect this to be the case. The measurements from different trials must, however, be independent.

2. The independence of measurements from trial to trial may not hold when the variables are likely to drift over time, as with sets of p stock prices or p economic indicators. Violations of the tentative assumption of independence can have a serious impact on the quality of statistical inferences.

The following examples illustrate these remarks.
Example 3.5 (Selecting a random sample)

As a preliminary step in designing a permit system for utilizing a wilderness canoe area without overcrowding, a natural-resource manager took a survey of users. The total wilderness area was divided into subregions, and respondents were asked to give information on the regions visited, lengths of stay, and other variables. The method followed was to select persons randomly (perhaps using a random number table) from all those who entered the wilderness area during a particular week. All persons were equally likely to be in the sample, so the more popular entrances were represented by larger proportions of canoeists.
Here one would expect the sample observations to conform closely to the criterion for a random sample from the population of users or potential users. On the other hand, if one of the samplers had waited at a campsite far in the interior of the area and interviewed only canoeists who reached that spot, successive measurements would not be independent. For instance, lengths of stay would tend to be large. •
Example 3.6 (A nonrandom sample)

Because of concerns with future solid-waste disposal, a study was conducted of the gross weight of solid waste generated per year in the United States ("Characteristics of Municipal Solid Wastes in the United States, 1960-2000," Franklin Associates, Ltd.). Estimated amounts attributed to x₁ = paper and paperboard waste and x₂ = plastic waste, in millions of tons, are given for selected years in Table 3.1. Should these measurements on X' = [X₁, X₂] be treated as a random sample of size n = 6? No! In fact, both variables are increasing over time. A drift like this would be very rare if the year-to-year values were independent observations from the same distribution.

TABLE 3.1  SOLID WASTE

    Year            1960   1965   1970   1975   1980   1985
    x₁ (paper)      29.8   37.9   43.9   42.6   53.9   61.7
    x₂ (plastics)     .4    1.4    3.0    4.4    7.6    9.8
                                                                               •
As we have argued heuristically in Chapter 1, the notion of statistical independence has important implications for measuring distance. Euclidean distance appears appropriate if the components of a vector are independent and have the same variances. Suppose we consider the location of the kth column yₖ' = [x₁ₖ, x₂ₖ, …, xₙₖ] of X, regarded as a point in n dimensions. The location of this point is determined by the joint probability distribution f(yₖ) = f(x₁ₖ, x₂ₖ, …, xₙₖ). When the measurements x₁ₖ, x₂ₖ, …, xₙₖ are a random sample, f(yₖ) = f(x₁ₖ, x₂ₖ, …, xₙₖ) = fₖ(x₁ₖ)fₖ(x₂ₖ)⋯fₖ(xₙₖ) and, consequently, each coordinate xⱼₖ contributes equally to the location through the identical marginal distributions fₖ(xⱼₖ).

If the n components are not independent or the marginal distributions are not identical, the influence of individual measurements (coordinates) on location is asymmetrical. We would then be led to consider a distance function in which the coordinates were weighted unequally, as in the "statistical" distances or quadratic forms introduced in Chapters 1 and 2.

Certain conclusions can be reached concerning the sampling distributions of X̄ and Sₙ without making further assumptions regarding the form of the underlying joint distribution of the variables. In particular, we can see how X̄ and Sₙ fare as point estimators of the corresponding population mean vector μ and covariance matrix Σ.
Result 3.1. Let X₁, X₂, …, Xₙ be a random sample from a joint distribution that has mean vector μ and covariance matrix Σ. Then X̄ is an unbiased estimator of μ, and its covariance matrix is (1/n)Σ. That is,

    E(X̄) = μ            (population mean vector)
                                                                                (3-9)
    Cov(X̄) = (1/n)Σ     (population variance-covariance matrix
                         divided by sample size)

For the covariance matrix Sₙ,

    E(Sₙ) = ((n - 1)/n) Σ = Σ - (1/n)Σ

Thus,

    E( (n/(n - 1)) Sₙ ) = Σ                                                     (3-10)

so [n/(n - 1)]Sₙ is an unbiased estimator of Σ, while Sₙ is a biased estimator with (bias) = E(Sₙ) - Σ = -(1/n)Σ.

Proof. Now, X̄ = (X₁ + X₂ + ⋯ + Xₙ)/n. The repeated use of the properties of expectation in (2-24) for two vectors gives

    E(X̄) = E( (1/n)X₁ + (1/n)X₂ + ⋯ + (1/n)Xₙ )
          = E( (1/n)X₁ ) + E( (1/n)X₂ ) + ⋯ + E( (1/n)Xₙ )
          = (1/n)E(X₁) + (1/n)E(X₂) + ⋯ + (1/n)E(Xₙ)
          = (1/n)μ + (1/n)μ + ⋯ + (1/n)μ = μ

Next,

    (X̄ - μ)(X̄ - μ)' = ( (1/n) Σ_{j=1}^{n} (Xⱼ - μ) ) ( (1/n) Σ_{ℓ=1}^{n} (Xℓ - μ) )'
                     = (1/n²) Σ_{j=1}^{n} Σ_{ℓ=1}^{n} (Xⱼ - μ)(Xℓ - μ)'

so

    Cov(X̄) = E(X̄ - μ)(X̄ - μ)' = (1/n²) ( Σ_{j=1}^{n} Σ_{ℓ=1}^{n} E(Xⱼ - μ)(Xℓ - μ)' )

For j ≠ ℓ, each entry in E(Xⱼ - μ)(Xℓ - μ)' is zero because the entry is the covariance between a component of Xⱼ and a component of Xℓ, and these are independent. [See Exercise 3.17 and (2-29).] Therefore,

    Cov(X̄) = (1/n²) ( Σ_{j=1}^{n} E(Xⱼ - μ)(Xⱼ - μ)' )

Since Σ = E(Xⱼ - μ)(Xⱼ - μ)' is the common population covariance matrix for each Xⱼ, we have

    Cov(X̄) = (1/n²) (Σ + Σ + ⋯ + Σ)   [n terms]
            = (1/n²)(nΣ) = (1/n)Σ

To obtain the expected value of Sₙ, we first note that (X_{ji} - X̄ᵢ)(X_{jk} - X̄ₖ) is the (i, k)th element of (Xⱼ - X̄)(Xⱼ - X̄)'. The matrix representing sums of squares and cross products can then be written as

    Σ_{j=1}^{n} (Xⱼ - X̄)(Xⱼ - X̄)' = Σ_{j=1}^{n} (Xⱼ - X̄)Xⱼ' + ( Σ_{j=1}^{n} (Xⱼ - X̄) ) (-X̄)'
                                   = Σ_{j=1}^{n} XⱼXⱼ' - nX̄X̄'

since Σ_{j=1}^{n} (Xⱼ - X̄) = 0 and nX̄' = Σ_{j=1}^{n} Xⱼ'. Therefore, its expected value is

    E( Σ_{j=1}^{n} XⱼXⱼ' - nX̄X̄' ) = Σ_{j=1}^{n} E(XⱼXⱼ') - nE(X̄X̄')

For any random vector V with E(V) = μ_V and Cov(V) = Σ_V, we have E(VV') = Σ_V + μ_V μ_V'. (See Exercise 3.16.) Consequently,

    E(XⱼXⱼ') = Σ + μμ'   and   E(X̄X̄') = (1/n)Σ + μμ'

Using these results, we obtain

    Σ_{j=1}^{n} E(XⱼXⱼ') - nE(X̄X̄') = nΣ + nμμ' - n( (1/n)Σ + μμ' ) = (n - 1)Σ

and thus, since Sₙ = (1/n)( Σ_{j=1}^{n} XⱼXⱼ' - nX̄X̄' ), it follows immediately that

    E(Sₙ) = ((n - 1)/n) Σ                                                       •
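The key algebraic step in the proof, the identity Σⱼ(Xⱼ - X̄)(Xⱼ - X̄)' = ΣⱼXⱼXⱼ' - nX̄X̄', holds for any data set. The following Python/NumPy sketch (an aside, not part of the text, using the small data matrix of Example 3.1) checks it, and also confirms that the divisor-n and divisor-(n - 1) covariance matrices differ only by the factor n/(n - 1):

```python
import numpy as np

X = np.array([[4.0, 1.0],
              [-1.0, 3.0],
              [3.0, 5.0]])
n = X.shape[0]
x_bar = X.mean(axis=0)

# Sum of outer products of the deviations ...
lhs = sum(np.outer(x - x_bar, x - x_bar) for x in X)
# ... equals the sum of X_j X_j' minus n * x_bar x_bar'.
rhs = sum(np.outer(x, x) for x in X) - n * np.outer(x_bar, x_bar)
print(np.allclose(lhs, rhs))  # True

S_n = lhs / n        # divisor n: biased for the population covariance
S = lhs / (n - 1)    # divisor n - 1: unbiased
print(np.allclose(S, n / (n - 1) * S_n))  # True
```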
Result 3.1 shows that the (i, k)th entry, (n - 1)⁻¹ Σ_{j=1}^{n} (X_{ji} - X̄ᵢ)(X_{jk} - X̄ₖ), of [n/(n - 1)]Sₙ is an unbiased estimator of σᵢₖ. However, the individual sample standard deviations √sᵢᵢ, calculated with either n or n - 1 as a divisor, are not unbiased estimators of the corresponding population quantities √σᵢᵢ. Moreover, the correlation coefficients rᵢₖ are not unbiased estimators of the population quantities ρᵢₖ. However, the bias E(√sᵢᵢ) - √σᵢᵢ, or E(rᵢₖ) - ρᵢₖ, can usually be ignored if the sample size n is moderately large.

Consideration of bias motivates a slightly modified definition of the sample variance-covariance matrix. Result 3.1 provides us with an unbiased estimator S of Σ:

    S = ( n/(n - 1) ) Sₙ = (1/(n - 1)) Σ_{j=1}^{n} (Xⱼ - X̄)(Xⱼ - X̄)'            (3-11)

Here S, without a subscript, has (i, k)th entry (n - 1)⁻¹ Σ_{j=1}^{n} (X_{ji} - X̄ᵢ)(X_{jk} - X̄ₖ). This definition of sample covariance is commonly used in many multivariate test statistics. Therefore, it will replace Sₙ as the sample covariance matrix in most of the material throughout the rest of this book.
3.4 GENERALIZED VARIANCE

With a single variable, the sample variance is often used to describe the amount of variation in the measurements on that variable. When p variables are observed on each unit, the variation is described by the sample variance-covariance matrix

    S = [ s₁₁  s₁₂  ⋯  s₁ₚ ]
        [ s₁₂  s₂₂  ⋯  s₂ₚ ]      { sᵢₖ = (1/(n - 1)) Σ_{j=1}^{n} (x_{ji} - x̄ᵢ)(x_{jk} - x̄ₖ) }
        [  ⋮    ⋮        ⋮  ]
        [ s₁ₚ  s₂ₚ  ⋯  sₚₚ ]

The sample covariance matrix contains p variances and ½p(p - 1) potentially different covariances. Sometimes it is desirable to assign a single numerical value for the variation expressed by S. One choice for a value is the determinant of S, which reduces to the usual sample variance of a single characteristic when p = 1. This determinant² is called the generalized sample variance:

    Generalized sample variance = |S|                                           (3-12)

² Definition 2A.24 defines "determinant" and indicates one method for calculating the value of a determinant.
Example 3.7 (Calculating a generalized variance)
Employees (x_1) and profits per employee (x_2) for the 16 largest publishing firms in the United States are shown in Figure 1.3. The sample covariance matrix, obtained from the data in the April 30, 1990, Forbes magazine article, is

S = [ 252.04  −68.43
      −68.43  123.67 ]

Evaluate the generalized variance. In this case, we compute

|S| = (252.04)(123.67) − (−68.43)(−68.43) = 26,487
•
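This determinant calculation is easily checked numerically. The sketch below uses Python with NumPy, which is an assumption of this illustration, not a tool used in the text.

```python
import numpy as np

# Sample covariance matrix from Example 3.7 (publishing-firm data)
S = np.array([[252.04, -68.43],
              [-68.43, 123.67]])

# Generalized sample variance = determinant of S
gen_var = np.linalg.det(S)
print(round(gen_var))  # approximately 26,487, as in the text
```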
The generalized sample variance provides one way of writing the information on all variances and covariances as a single number. Of course, when p > 1, some information about the sample is lost in the process. A geometrical interpretation of |S| will help us appreciate its strengths and weaknesses as a descriptive summary.

Consider the area generated within the plane by two deviation vectors d_1 = y_1 − x̄_1 1 and d_2 = y_2 − x̄_2 1. Let L_{d_1} be the length of d_1 and L_{d_2} the length of d_2. By elementary geometry, the height of the trapezoid formed by d_1 and d_2 is

Height = L_{d_1} sin(θ)

where θ is the angle between d_1 and d_2,
and the area of the trapezoid is |L_{d_1} sin(θ)| L_{d_2}. Since cos²(θ) + sin²(θ) = 1, we can express this area as

Area = L_{d_1} L_{d_2} √(1 − cos²(θ))

From (3-5) and (3-7),

L_{d_1} = √( Σ_{j=1}^{n} (x_{j1} − x̄_1)² ) = √((n − 1) s_11)

L_{d_2} = √( Σ_{j=1}^{n} (x_{j2} − x̄_2)² ) = √((n − 1) s_22)

and

cos(θ) = r_12

Therefore,

Area = (n − 1) √(s_11) √(s_22) √(1 − r_12²) = (n − 1) √( s_11 s_22 (1 − r_12²) )    (3-13)
Figure 3.6 (a) "Large" generalized sample variance for p = 3. (b) "Small" generalized sample variance for p = 3.
Also,

|S| = s_11 s_22 − s_12² = s_11 s_22 (1 − r_12²)    (3-14)

since s_12 = r_12 √(s_11) √(s_22). If we compare (3-14) with (3-13), we see that

|S| = (n − 1)⁻² (area)²

Assuming now that |S| = (n − 1)^{−(p−1)} (volume)² holds for the volume generated in n space by the p − 1 deviation vectors d_1, d_2, ..., d_{p−1}, we can establish the following general result for p deviation vectors by induction (see [1], p. 260):

Generalized sample variance = |S| = (n − 1)^{−p} (volume)²    (3-15)

Equation (3-15) says that the generalized sample variance, for a fixed set of data, is proportional to the square of the volume generated by the p deviation vectors³ d_1 = y_1 − x̄_1 1, d_2 = y_2 − x̄_2 1, ..., d_p = y_p − x̄_p 1. Figures 3.6(a) and (b) show trapezoidal regions, generated by p = 3 residual vectors, corresponding to "large" and "small" generalized variances.

³ If generalized variance is defined in terms of the sample covariance matrix S_n = [(n − 1)/n]S, then, using Result 2A.11, |S_n| = |[(n − 1)/n] S| = |[(n − 1)/n] I_p| |S| = [(n − 1)/n]^p |S|. Consequently, using (3-15), we can also write the following: Generalized sample variance = |S_n| = n^{−p} (volume)².
For a fixed sample size, it is clear from the geometry that volume, or |S|, will increase when the length of any d_i = y_i − x̄_i 1 (or √s_ii) is increased. In addition, volume will increase if the residual vectors of fixed length are moved until they are at right angles to one another, as in Figure 3.6(a). On the other hand, the volume, or |S|, will be small if just one of the s_ii is small or one of the deviation vectors lies nearly in the (hyper)plane formed by the others, or both. In the second case, the trapezoid has very little height above the plane. This is the situation in Figure 3.6(b), where d_3 lies nearly in the plane formed by d_1 and d_2.

Generalized variance also has interpretations in the p-space scatter plot representation of the data. The most intuitive interpretation concerns the spread of the scatter about the sample mean point x̄' = [x̄_1, x̄_2, ..., x̄_p]. Consider the measure of distance given in the comment below (2-19), with x̄ playing the role of the fixed point μ and S⁻¹ playing the role of A. With these choices, the coordinates x' = [x_1, x_2, ..., x_p] of the points a constant distance c from x̄ satisfy

(x − x̄)' S⁻¹ (x − x̄) = c²    (3-16)

[When p = 1, (x − x̄)' S⁻¹ (x − x̄) = (x_1 − x̄_1)²/s_11 is the squared distance from x_1 to x̄_1 in standard deviation units.]

Equation (3-16) defines a hyperellipsoid (an ellipse if p = 2) centered at x̄. It can be shown using integral calculus that the volume of this hyperellipsoid is related to |S|. In particular,
Volume of { x : (x − x̄)' S⁻¹ (x − x̄) ≤ c² } = k_p |S|^{1/2} c^p    (3-17)

or

(Volume of ellipsoid)² = (constant)(generalized sample variance)
where the constant k_p is rather formidable.⁴ A large volume corresponds to a large generalized variance.

Although the generalized variance has some intuitively pleasing geometrical interpretations, it suffers from a basic weakness as a descriptive summary of the sample covariance matrix S, as the following example shows.

Example 3.8 (Interpreting the generalized variance)
Figure 3.7 gives three scatter plots with very different patterns of correlation. All three data sets have x̄' = [2, 1], and the covariance matrices are

S = [ 5 4 ; 4 5 ], r = .8    S = [ 3 0 ; 0 3 ], r = 0    S = [ 5 −4 ; −4 5 ], r = −.8

⁴ For those who are curious, k_p = 2π^{p/2}/(p Γ(p/2)), where Γ(z) denotes the gamma function evaluated at z.
Figure 3.7 Scatter plots with three different orientations.
Each covariance matrix S contains the information on the variability of the component variables and also the information required to calculate the correlation coefficient. In this sense, S captures the orientation and size of the pattern of scatter. The eigenvalues and eigenvectors extracted from S further describe the pattern in the scatter plot. For

S = [ 5 4 ; 4 5 ]

the eigenvalues satisfy

0 = (λ − 5)² − 4² = (λ − 9)(λ − 1)

and we determine the eigenvalue–eigenvector pairs λ_1 = 9, e_1' = [1/√2, 1/√2] and λ_2 = 1, e_2' = [1/√2, −1/√2].
The mean-centered ellipse, with center x̄' = [2, 1] for all three cases, is

(x − x̄)' S⁻¹ (x − x̄) ≤ c²

To describe this ellipse, as in Section 2.3, with A = S⁻¹, we notice that if (λ, e) is an eigenvalue–eigenvector pair for S, then (λ⁻¹, e) is an eigenvalue–eigenvector pair for S⁻¹. That is, if Se = λe, then multiplying on the left by S⁻¹ gives S⁻¹Se = λS⁻¹e, or S⁻¹e = λ⁻¹e. Therefore, using the eigenvalues from S, we know that the ellipse extends c√λ_i in the direction of e_i from x̄.

In p = 2 dimensions, the choice c² = 5.99 will produce an ellipse that contains approximately 95 percent of the observations. The vectors 3√5.99 e_1 and √5.99 e_2 are drawn in Figure 3.8(a). Notice how the directions are the natural axes for the ellipse, and observe that the lengths of these scaled eigenvectors are comparable to the size of the pattern in each direction. Next, for
S = [ 3 0 ; 0 3 ]

the eigenvalues satisfy

0 = (λ − 3)²

and we arbitrarily choose the eigenvectors so that λ_1 = 3, e_1' = [1, 0] and λ_2 = 3, e_2' = [0, 1]. The vectors √3 √5.99 e_1 and √3 √5.99 e_2 are drawn in Figure 3.8(b). Finally, for

S = [ 5 −4 ; −4 5 ]

the eigenvalues satisfy

0 = (λ − 5)² − (−4)² = (λ − 9)(λ − 1)
and we determine the eigenvalue–eigenvector pairs λ_1 = 9, e_1' = [1/√2, −1/√2] and λ_2 = 1, e_2' = [1/√2, 1/√2]. The scaled eigenvectors 3√5.99 e_1 and √5.99 e_2 are drawn in Figure 3.8(c).

In two dimensions, we can often sketch the axes of the mean-centered ellipse by eye. However, the eigenvector approach also works for high dimensions where the data cannot be examined visually.

Note: Here the generalized variance |S| gives the same value, |S| = 9, for all three patterns. But generalized variance does not contain any information on the orientation of the patterns. Generalized variance is easier to interpret when the two or more samples (patterns) being compared have nearly the same orientations. Notice that our three patterns of scatter appear to cover approximately the same area. The ellipses that summarize the variability do have exactly the same area [see (3-17)], since all have |S| = 9.
•
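The three covariance matrices in this example can be checked numerically: each has the same determinant even though their eigenvector orientations differ. A minimal sketch in Python with NumPy (an assumption of this illustration, not a tool used in the text):

```python
import numpy as np

# The three covariance matrices of Example 3.8
mats = [np.array([[5., 4.], [4., 5.]]),
        np.array([[3., 0.], [0., 3.]]),
        np.array([[5., -4.], [-4., 5.]])]

# All three generalized variances equal 9 ...
dets = [np.linalg.det(S) for S in mats]

# ... but the eigenvalues/eigenvectors reveal the differing orientations
eigs = [np.linalg.eigvalsh(S) for S in mats]  # ascending order per matrix
```

Here `eigs` comes out as (1, 9), (3, 3), and (1, 9): the middle pattern is circular while the other two are elongated along different diagonals, information that |S| alone cannot convey.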
As Example 3.8 demonstrates, different correlation structures are not detected by |S|. The situation for p > 2 can be even more obscure. Consequently, it is often desirable to provide more than the single number |S| as a summary of S. From Exercise 2.12, |S| can be expressed as the product λ_1 λ_2 ··· λ_p
Figure 3.8 Axes of the mean-centered 95 percent ellipses for the scatter plots in Figure 3.7.
of the eigenvalues of S. Moreover, the mean-centered ellipsoid based on S⁻¹ [see (3-16)] has axes whose lengths are proportional to the square roots of the λ_i's (see Section 2.3). These eigenvalues then provide information on the variability in all directions in the p-space representation of the data. It is useful, therefore, to report their individual values, as well as their product. We shall pursue this topic later when we discuss principal components.

Situations in which the Generalized Sample Variance Is Zero
The generalized sample variance will be zero in certain situations. A generalized variance of zero is indicative of extreme degeneracy, in the sense that at least one column of the matrix of deviations,
X − 1 x̄' = [ x_1' − x̄'     [ x_11 − x̄_1  x_12 − x̄_2  ...  x_1p − x̄_p
             x_2' − x̄'  =    x_21 − x̄_1  x_22 − x̄_2  ...  x_2p − x̄_p
             ...              ...
             x_n' − x̄' ]      x_n1 − x̄_1  x_n2 − x̄_2  ...  x_np − x̄_p ]    (3-18)
can be expressed as a linear combination of the other columns. As we have shown geometrically, this is a case where one of the deviation vectors, for instance d_i' = [x_{1i} − x̄_i, ..., x_{ni} − x̄_i], lies in the (hyper)plane generated by d_1, ..., d_{i−1}, d_{i+1}, ..., d_p.
Result 3.2. The generalized variance is zero when, and only when, at least one deviation vector lies in the (hyper)plane formed by all linear combinations of the others, that is, when the columns of the matrix of deviations in (3-18) are linearly dependent.
Proof. If the columns of the deviation matrix (X − 1x̄') are linearly dependent, there is a linear combination of the columns such that

0 = a_1 col_1(X − 1x̄') + ··· + a_p col_p(X − 1x̄') = (X − 1x̄')a   for some a ≠ 0

But then, as you may verify, (n − 1)S = (X − 1x̄')'(X − 1x̄') and

(n − 1)Sa = (X − 1x̄')'(X − 1x̄')a = 0

so the same a corresponds to a linear dependency, a_1 col_1(S) + ··· + a_p col_p(S) = Sa = 0, in the columns of S. So, by Result 2A.9, |S| = 0.

In the other direction, if |S| = 0, then there is some linear combination Sa of the columns of S such that Sa = 0. That is, 0 = (n − 1)Sa = (X − 1x̄')'(X − 1x̄')a. Premultiplying by a' yields

0 = a'(X − 1x̄')'(X − 1x̄')a = L²_{(X − 1x̄')a}

the squared length of (X − 1x̄')a, and, for the length to equal zero, we must have (X − 1x̄')a = 0. Thus, the columns of (X − 1x̄') are linearly dependent. •

Example 3.9 (A case where the generalized variance is zero)
Show that |S| = 0 for

X = [ 1 2 5
(3×3)  4 1 6
       4 0 4 ]

and determine the degeneracy. Here x̄' = [3, 1, 5], so

X − 1x̄' = [ 1 − 3  2 − 1  5 − 5     [ −2   1   0
            4 − 3  1 − 1  6 − 5  =    1   0   1
            4 − 3  0 − 1  4 − 5 ]     1  −1  −1 ]
Figure 3.9 A case where the three-dimensional volume is zero (|S| = 0).
The deviation (column) vectors are d_1' = [−2, 1, 1], d_2' = [1, 0, −1], and d_3' = [0, 1, −1]. Since d_3 = d_1 + 2d_2, there is column degeneracy. (Note that there is row degeneracy also.) This means that one of the deviation vectors, for example d_3, lies in the plane generated by the other two residual vectors. Consequently, the three-dimensional volume is zero. This case is illustrated in Figure 3.9 and may be verified algebraically by showing that |S| = 0. We have

S = [  3    −3/2   0
     −3/2     1   1/2
       0     1/2   1 ]

and from Definition 2A.24,

|S| = 3 | 1 1/2 ; 1/2 1 | (−1)² + (−3/2) | −3/2 1/2 ; 0 1 | (−1)³ + (0) | −3/2 1 ; 0 1/2 | (−1)⁴
    = 3(1 − 1/4) + (3/2)(−3/2 − 0) + 0 = 9/4 − 9/4 = 0 •
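The computation above can be sketched numerically. The fragment below (Python with NumPy, an assumption of this illustration) builds the deviation matrix, confirms the column dependency d_3 = d_1 + 2d_2, and checks that the determinant of S vanishes.

```python
import numpy as np

# Data matrix from Example 3.9
X = np.array([[1., 2., 5.],
              [4., 1., 6.],
              [4., 0., 4.]])

D = X - X.mean(axis=0)          # matrix of deviations (residuals)
S = D.T @ D / (X.shape[0] - 1)  # sample covariance, divisor n - 1

# Column degeneracy: d3 = d1 + 2*d2, so |S| = 0
dep_ok = np.allclose(D[:, 2], D[:, 0] + 2 * D[:, 1])
det_S = np.linalg.det(S)
```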
When large data sets are sent and received electronically, investigators are sometimes unpleasantly surprised to find a case of zero generalized variance, so that S does not have an inverse. We have encountered several such cases, with their associated difficulties, before the situation was unmasked. A singular covariance matrix occurs when, for instance, the data are test scores and the investigator has included variables that are sums of the others. For example, an algebra score and a geometry score could be combined to give a total math score, or class midterm and final exam scores summed to give total points. Once, the total weight of a number of chemicals was included along with that of each component. This common practice of creating new variables that are sums of the original variables and then including them in the data set has caused enough lost time that we emphasize the necessity of being alert to avoid these consequences.
Example 3.10 (Creating new variables that lead to a zero generalized variance)
Consider the data matrix

X = [ 1   9  10
      4  12  16
      2  10  12
      5   8  13
      3  11  14 ]

where the third column is the sum of the first two columns. These data could be the number of successful phone solicitations per day by a part-time and a full-time employee, respectively, so the third column is the total number of successful solicitations per day.

Show that the generalized variance |S| = 0, and determine the nature of the dependency in the data. We find that the mean corrected data matrix, with entries x_{jk} − x̄_k, is

X − 1x̄' = [ −2  −1  −3
             1   2   3
            −1   0  −1
             2  −2   0
             0   1   1 ]

The resulting covariance matrix is

S = [ 2.5  0    2.5
      0    2.5  2.5
      2.5  2.5  5.0 ]

We verify that, in this case, the generalized variance

|S| = 2.5² × 5 + 0 + 0 − 2.5³ − 2.5³ − 0 = 0

In general, if the three columns of the data matrix X satisfy a linear constraint a_1 x_{j1} + a_2 x_{j2} + a_3 x_{j3} = c, a constant for all j, then a_1 x̄_1 + a_2 x̄_2 + a_3 x̄_3 = c, so that

a_1 (x_{j1} − x̄_1) + a_2 (x_{j2} − x̄_2) + a_3 (x_{j3} − x̄_3) = 0

for all j. That is,

(X − 1x̄') a = 0

and the columns of the mean corrected data matrix are linearly dependent. Thus, the inclusion of the third variable, which is linearly related to the first two, has led to the case of a zero generalized variance. Whenever the columns of the mean corrected data matrix are linearly dependent,

(n − 1)Sa = (X − 1x̄')'(X − 1x̄')a = (X − 1x̄')' 0 = 0

and Sa = 0 establishes the linear dependency of the columns of S. Hence, |S| = 0.
Since Sa = 0 = 0a, we see that a is a scaled eigenvector of S associated with an eigenvalue of zero. This gives rise to an important diagnostic: If we are unaware of any extra variables that are linear combinations of the others, we can find them by calculating the eigenvectors of S and identifying the one associated with a zero eigenvalue. That is, if we were unaware of the dependency in this example, a computer calculation would find an eigenvector proportional to a' = [1, 1, −1], since

Sa = [ 2.5  0    2.5     [  1     [ 0
       0    2.5  2.5   ×    1  =    0
       2.5  2.5  5.0 ]     −1 ]     0 ]

The coefficients reveal that
x_{j1} + x_{j2} − x_{j3} = c for all j. That is, the sum of the first two variables minus the third is a constant c for all n units. Here the third variable is actually the sum of the first two variables, so the columns of the original data matrix satisfy a linear constraint with c = 0. Because we have the special case c = 0, the constraint establishes the fact that the columns of the data matrix are linearly dependent. •

Let us summarize the important equivalent conditions for a generalized variance to be zero that we discussed in the preceding example. Whenever a nonzero vector a satisfies one of the following three conditions, it satisfies all of them:

(1) Sa = 0 (a is a scaled eigenvector of S with eigenvalue 0.)
(2) a'(x_j − x̄) = 0 for all j (The linear combination of the mean corrected data, using a, is zero.)
(3) a'x_j = c for all j, with c = a'x̄ (The linear combination of the original data, using a, is a constant.)

We showed that if condition (3) is satisfied, that is, if the values for one variable can be expressed in terms of the others, then the generalized variance is zero because S has a zero eigenvalue. In the other direction, if condition (1) holds, then the eigenvector a gives coefficients for the linear dependency of the mean corrected data.

In any statistical analysis, |S| = 0 means that the measurements on some variables should be removed from the study as far as the mathematical computations are concerned. The corresponding reduced data matrix will then lead to a covariance matrix of full rank and a nonzero generalized variance. The question of which measurements to remove in degenerate cases is not easy to answer. When there is a choice, one should retain measurements on a (presumed) causal variable instead of those on a secondary characteristic. We shall return to this subject in our discussion of principal components. At this point, we settle for delineating some simple conditions for S to be of full rank or of reduced rank.
Result 3.3. If n ≤ p, that is, (sample size) ≤ (number of variables), then |S| = 0 for all samples.
Proof. We must show that the rank of S is less than or equal to p and then apply Result 2A.9. For any fixed sample, the n row vectors in (3-18) sum to the zero vector. The existence of this linear combination means that the rank of X − 1x̄' is less than or equal to n − 1, which, in turn, is less than or equal to p − 1 because n ≤ p. Since

(n − 1) S  =  (X − 1x̄')' (X − 1x̄')
  (p×p)          (p×n)      (n×p)

the kth column of S, col_k(S), can be written as a linear combination of the rows of (X − 1x̄')'. In particular,

(n − 1) col_k(S) = (X − 1x̄')' col_k(X − 1x̄')
                 = (x_{1k} − x̄_k) row_1(X − 1x̄')' + ··· + (x_{nk} − x̄_k) row_n(X − 1x̄')'

Since the row vectors of (X − 1x̄')' sum to the zero vector, we can write, for example, row_1(X − 1x̄')' as the negative of the sum of the remaining row vectors. After substituting for row_1(X − 1x̄')' in the preceding equation, we can express col_k(S) as a linear combination of the at most n − 1 linearly independent row vectors row_2(X − 1x̄')', ..., row_n(X − 1x̄')'. The rank of S is therefore less than or equal to n − 1, which, as noted at the beginning of the proof, is less than or equal to p − 1, and S is singular. This implies, from Result 2A.9, that |S| = 0. •

Result 3.4. Let the p × 1 vectors x_1, x_2, ..., x_n, where x_j' is the jth row of the data matrix X, be realizations of the independent random vectors X_1, X_2, ..., X_n. Then

1. If the linear combination a'X_j has positive variance for each constant vector a ≠ 0, then, provided that p < n, S has full rank with probability 1 and |S| > 0.
2. If, with probability 1, a'X_j is a constant (for example, c) for all j, then |S| = 0.
Proof. (Part 2). If a'X_j = a_1 X_{j1} + a_2 X_{j2} + ··· + a_p X_{jp} = c with probability 1, then a'x_j = c for all j, and the sample mean of this linear combination is

c = Σ_{j=1}^{n} (a_1 x_{j1} + a_2 x_{j2} + ··· + a_p x_{jp})/n = a_1 x̄_1 + a_2 x̄_2 + ··· + a_p x̄_p = a'x̄

Then

(X − 1x̄')a = a_1 [ x_11 − x̄_1, ..., x_n1 − x̄_1 ]' + ··· + a_p [ x_1p − x̄_p, ..., x_np − x̄_p ]'
           = [ a'x_1 − a'x̄, ..., a'x_n − a'x̄ ]' = [ c − c, ..., c − c ]' = 0

indicating linear dependence; the conclusion follows from Result 3.2. The proof of Part (1) is difficult and can be found in [2]. •
Generalized Variance Determined by |R| and Its Geometrical Interpretation
The generalized sample variance is unduly affected by the variability of measurements on a single variable. For example, suppose some s_ii is either large or quite small. Then, geometrically, the corresponding deviation vector d_i = (y_i − x̄_i 1) will be very long or very short and will therefore clearly be an important factor in determining volume. Consequently, it is sometimes useful to scale all the deviation vectors so that they have the same length.

Scaling the residual vectors is equivalent to replacing each original observation x_{jk} by its standardized value (x_{jk} − x̄_k)/√s_kk. The sample covariance matrix of the standardized variables is then R, the sample correlation matrix of the original variables. (See Exercise 3.13.) We define

( Generalized sample variance of the standardized variables ) = |R|    (3-19)
Since the resulting vectors

[ (x_{1k} − x̄_k)/√s_kk , (x_{2k} − x̄_k)/√s_kk , ..., (x_{nk} − x̄_k)/√s_kk ] = (y_k − x̄_k 1)'/√s_kk

all have length √(n − 1), the generalized sample variance of the standardized variables will be large when these vectors are nearly perpendicular and will be small when two or more of these vectors are in almost the same direction. Employing the argument leading to (3-7), we readily find that the cosine of the angle θ_ik between (y_i − x̄_i 1)/√s_ii and (y_k − x̄_k 1)/√s_kk is the sample correlation coefficient r_ik. Therefore, we can make the statement that |R| is large when all the r_ik are nearly zero and it is small when one or more of the r_ik are nearly +1 or −1.

In sum, we have the following result: Let

(y_i − x̄_i 1)/√s_ii ,   i = 1, 2, ..., p

be the deviation vectors of the standardized variables. These deviation vectors lie in the direction of d_i, but have a squared length of n − 1. The volume generated in p-space by the deviation vectors can be related to the generalized sample variance. The same steps that lead to (3-15) produce
( Generalized sample variance of the standardized variables ) = |R| = (n − 1)^{−p} (volume)²    (3-20)
Figure 3.10 The volume generated by equal-length deviation vectors of the standardized variables.
The volume generated by deviation vectors of the standardized variables is illustrated in Figure 3.10 for the two sets of deviation vectors graphed in Figure 3.6. A comparison of Figures 3.10 and 3.6 reveals that the influence of the d_2 vector (large variability in x_2) on the squared volume |S| is much greater than its influence on the squared volume |R|.

The quantities |S| and |R| are connected by the relationship

|S| = (s_11 s_22 ··· s_pp) |R|    (3-21)

so

(n − 1)^p |S| = (s_11 s_22 ··· s_pp) (n − 1)^p |R|    (3-22)

[The proof of (3-21) is left to the reader as Exercise 3.12.]

Interpreting (3-22) in terms of volumes, we see from (3-15) and (3-20) that the squared volume (n − 1)^p |S| is proportional to the squared volume (n − 1)^p |R|. The constant of proportionality is the product of the variances, which, in turn, is proportional to the product of the squares of the lengths (n − 1)s_ii of the d_i. Equation (3-21) shows, algebraically, how a change in the measurement scale of X_1, for example, will alter the relationship between the generalized variances. Since |R| is based on standardized measurements, it is unaffected by the change in scale. However, the relative value of |S| will be changed whenever the multiplicative factor s_11 changes.
(I l l ustrating the relatio n between I S I and I R I)
[4 ]
Let us illustrate the relationship in ( 3-21 ) for the generalized variances I S I and I R I when p = 3. Suppose 3 1 s = 3 9 2 (3 X 3 ) 1 2 1
1 38
Chapter 3
= =
= R = [ i ! !]
Sa m p l e Geometry and Ra ndom Sa m p l i ng
Then s1 1
4, s22
9, and s3 3
= =
1. Moreover,
Using Definition 2A.24, we obtain 3 9 2 ( -1 ) 2 + 3 IsI 4 1 2 1 4 (9 - 4) - 3 ( 3 - 2) 1 1 2 I I = 1 � i ( -1 ? + � I 3 2 = ( 1 - $) - ( � ) ( � - � )
R
= R=
It then follows that 14 = I S I
s1 1 s22 s3 3l I
= =
2 3 9 ( - 1 )3 + 1 ( -1 ) 4 1 2 1 + 1 ( 6 - 9) 14 1 1 2 i ( - 1 ? + � I2 �3 ( - 1 ) 4 7 + (�) ( � - � ) 1 8
7 (4) ( 9 ) ( 1 ) ( 1 8 )
Another Generalization of Variance
We conclude this discussion by mentioning another generalization of variance. Specifically, we define the total sample variance as the sum of the diagonal elements of the sample variance–covariance matrix S. Thus,

Total sample variance = s_11 + s_22 + ··· + s_pp    (3-23)
Example 3.12 (Calculating the total sample variance)

Calculate the total sample variance for the variance–covariance matrices S in Examples 3.7 and 3.9.

From Example 3.7,

S = [ 252.04  −68.43
      −68.43  123.67 ]

and

Total sample variance = s_11 + s_22 = 252.04 + 123.67 = 375.71

From Example 3.9,

S = [  3    −3/2   0
     −3/2     1   1/2
       0     1/2   1 ]

and

Total sample variance = s_11 + s_22 + s_33 = 3 + 1 + 1 = 5 •
Geometrically, the total sample variance is the sum of the squared lengths of the p deviation vectors d_1 = (y_1 − x̄_1 1), ..., d_p = (y_p − x̄_p 1), divided by n − 1. The total sample variance criterion pays no attention to the orientation (correlation structure) of the residual vectors. For instance, it assigns the same values to both sets of residual vectors (a) and (b) in Figure 3.6.

3.5 SAMPLE MEAN, COVARIANCE, AND CORRELATION AS MATRIX OPERATIONS
We have developed geometrical representations of the data matrix X and the derived descriptive statistics x̄ and S. In addition, it is possible to link algebraically the calculation of x̄ and S directly to X using matrix operations. The resulting expressions, which depict the relation between x̄, S, and the full data set X concisely, are easily programmed on electronic computers.

We have it that x̄_i = (x_{1i} · 1 + x_{2i} · 1 + ··· + x_{ni} · 1)/n = y_i' 1/n. Therefore,

x̄ = [ x̄_1     [ y_1' 1/n
      x̄_2  =    y_2' 1/n
      ...       ...
      x̄_p ]     y_p' 1/n ]

or

x̄ = (1/n) X' 1    (3-24)
That is, x̄ is calculated from the transposed data matrix by postmultiplying by the vector 1 and then multiplying the result by the constant 1/n.

Next, we create an n × p matrix of means by transposing both sides of (3-24) and premultiplying by 1; that is,

1 x̄' = (1/n) 1 1' X = [ x̄_1  x̄_2  ...  x̄_p
                        x̄_1  x̄_2  ...  x̄_p
                        ...
                        x̄_1  x̄_2  ...  x̄_p ]    (3-25)

Subtracting this result from X produces the n × p matrix of deviations (residuals)

X − (1/n) 1 1' X = [ x_11 − x̄_1  x_12 − x̄_2  ...  x_1p − x̄_p
                     x_21 − x̄_1  x_22 − x̄_2  ...  x_2p − x̄_p
                     ...
                     x_n1 − x̄_1  x_n2 − x̄_2  ...  x_np − x̄_p ]    (3-26)
Now, the matrix (n − 1)S representing sums of squares and cross products is just the transpose of the matrix (3-26) times the matrix itself, or

(n − 1)S = ( X − (1/n) 1 1' X )' ( X − (1/n) 1 1' X ) = X' ( I − (1/n) 1 1' ) X

since

( I − (1/n) 1 1' )' ( I − (1/n) 1 1' ) = I − (1/n) 1 1' − (1/n) 1 1' + (1/n²) 1 1' 1 1' = I − (1/n) 1 1'

To summarize, the matrix expressions relating x̄ and S to the data set X are

x̄ = (1/n) X' 1
S = (1/(n − 1)) X' ( I − (1/n) 1 1' ) X    (3-27)

The result for S_n is similar, except that 1/n replaces 1/(n − 1) as the first factor.

The relations in (3-27) show clearly how matrix operations on the data matrix X lead to x̄ and S. Once S is computed, it can be related to the sample correlation matrix R. The resulting expression can also be "inverted" to relate R to S. We first define the p × p sample standard deviation matrix D^{1/2} and compute its inverse, (D^{1/2})⁻¹ = D^{−1/2}. Let
D^{1/2} = [ √s_11    0    ...    0
(p×p)        0    √s_22  ...    0
            ...
             0      0    ...  √s_pp ]

Then

D^{−1/2} = [ 1/√s_11     0      ...     0
(p×p)          0      1/√s_22  ...     0
              ...
               0         0     ...  1/√s_pp ]    (3-28)
Section 3.6
Since
[
pp
s� P
s: r S1 2 S= : s l s2
and
S1 2 S1 1 � � � Vs;
R=
we have
]
Sa m p l e Va l ues of Linear Com b i n ations of Va riables
p
S2 Sl p � vs;;; Vs; vs;;;
sl
p
sPP
�� sPP
=
[ ,:p p '1 2
r2
vs;;; vs;;;
R = n - 1 /2 sn - 1/2
1 41
;]
rP
(3-29)
Postmultiplying and premultiplying both sides of (3-29) by D^{1/2} and noting that D^{−1/2} D^{1/2} = D^{1/2} D^{−1/2} = I gives

S = D^{1/2} R D^{1/2}    (3-30)

That is, R can be obtained from the information in S, whereas S can be obtained from D^{1/2} and R. Equations (3-29) and (3-30) are sample analogs of (2-36) and (2-37).

3.6 SAMPLE VALUES OF LINEAR COMBINATIONS OF VARIABLES
We have introduced linear combinations of p variables in Section 2.6. In many multivariate procedures, we are led naturally to consider a linear combination of the form

c'X = c_1 X_1 + c_2 X_2 + ··· + c_p X_p

whose observed value on the jth trial is

c'x_j = c_1 x_{j1} + c_2 x_{j2} + ··· + c_p x_{jp},   j = 1, 2, ..., n    (3-31)

The n derived observations in (3-31) have

Sample mean = (c'x_1 + c'x_2 + ··· + c'x_n)/n = c'(x_1 + x_2 + ··· + x_n)(1/n) = c'x̄    (3-32)

Since (c'x_j − c'x̄)² = (c'(x_j − x̄))² = c'(x_j − x̄)(x_j − x̄)'c, we have

Sample variance = [ (c'x_1 − c'x̄)² + (c'x_2 − c'x̄)² + ··· + (c'x_n − c'x̄)² ] / (n − 1)
 = [ c'(x_1 − x̄)(x_1 − x̄)'c + c'(x_2 − x̄)(x_2 − x̄)'c + ··· + c'(x_n − x̄)(x_n − x̄)'c ] / (n − 1)
 = c' [ ( (x_1 − x̄)(x_1 − x̄)' + (x_2 − x̄)(x_2 − x̄)' + ··· + (x_n − x̄)(x_n − x̄)' ) / (n − 1) ] c
or

Sample variance of c'X = c'Sc    (3-33)

Equations (3-32) and (3-33) are sample analogs of (2-43). They correspond to substituting the sample quantities x̄ and S for the "population" quantities μ and Σ, respectively, in (2-43).

Now consider a second linear combination

b'X = b_1 X_1 + b_2 X_2 + ··· + b_p X_p

whose observed value on the jth trial is

b'x_j = b_1 x_{j1} + b_2 x_{j2} + ··· + b_p x_{jp},   j = 1, 2, ..., n    (3-34)

It follows from (3-32) and (3-33) that the sample mean and variance of these derived observations are

Sample mean of b'X = b'x̄
Sample variance of b'X = b'Sb

Moreover, the sample covariance computed from pairs of observations on b'X and c'X is

Sample covariance
 = [ (b'x_1 − b'x̄)(c'x_1 − c'x̄) + (b'x_2 − b'x̄)(c'x_2 − c'x̄) + ··· + (b'x_n − b'x̄)(c'x_n − c'x̄) ] / (n − 1)
 = [ b'(x_1 − x̄)(x_1 − x̄)'c + b'(x_2 − x̄)(x_2 − x̄)'c + ··· + b'(x_n − x̄)(x_n − x̄)'c ] / (n − 1)
 = b' [ ( (x_1 − x̄)(x_1 − x̄)' + (x_2 − x̄)(x_2 − x̄)' + ··· + (x_n − x̄)(x_n − x̄)' ) / (n − 1) ] c

or

Sample covariance of b'X and c'X = b'Sc    (3-35)

In sum, we have the following result.
Result 3.5. The linear combinations

b'X = b_1 X_1 + b_2 X_2 + ··· + b_p X_p
c'X = c_1 X_1 + c_2 X_2 + ··· + c_p X_p

have sample means, variances, and covariances that are related to x̄ and S by

Sample mean of b'X = b'x̄
Sample mean of c'X = c'x̄
Sample variance of b'X = b'Sb    (3-36)
Sample variance of c'X = c'Sc
Sample covariance of b'X and c'X = b'Sc  •
Example 3.13 (Means and covariances for linear combinations)
We shall consider two linear combinations and their derived values for the n = 3 observations given in Example 3.9 as

X = [ 1 2 5
      4 1 6
      4 0 4 ]

Consider the two linear combinations

b'X = [2, 2, −1][X_1, X_2, X_3]' = 2X_1 + 2X_2 − X_3

and

c'X = [1, −1, 3][X_1, X_2, X_3]' = X_1 − X_2 + 3X_3
The means, variances, and covariance will first be evaluated directly and then be evaluated by (3-36).

Observations on these linear combinations are obtained by replacing X_1, X_2, and X_3 with their observed values. For example, the n = 3 observations on b'X are

b'x_1 = 2x_11 + 2x_12 − x_13 = 2(1) + 2(2) − (5) = 1
b'x_2 = 2x_21 + 2x_22 − x_23 = 2(4) + 2(1) − (6) = 4
b'x_3 = 2x_31 + 2x_32 − x_33 = 2(4) + 2(0) − (4) = 4
The sample mean and variance of these values are, respectively,

Sample mean = (1 + 4 + 4)/3 = 3
Sample variance = [ (1 − 3)² + (4 − 3)² + (4 − 3)² ] / (3 − 1) = 3

In a similar manner, the n = 3 observations on c'X are

c'x_1 = 1x_11 − 1x_12 + 3x_13 = 1(1) − 1(2) + 3(5) = 14
c'x_2 = 1(4) − 1(1) + 3(6) = 21
c'x_3 = 1(4) − 1(0) + 3(4) = 16

and

Sample mean = (14 + 21 + 16)/3 = 17
Sample variance = [ (14 − 17)² + (21 − 17)² + (16 − 17)² ] / (3 − 1) = 13
Moreover, the sample covariance, computed from the pairs of observations (b'x_1, c'x_1), (b'x_2, c'x_2), and (b'x_3, c'x_3), is

Sample covariance = [ (1 − 3)(14 − 17) + (4 − 3)(21 − 17) + (4 − 3)(16 − 17) ] / (3 − 1) = 9/2

Alternatively, we use the sample mean vector x̄ and sample covariance matrix S derived from the original data matrix X to calculate the sample means, variances, and covariances for the linear combinations. Thus, if only the descriptive statistics are of interest, we do not even need to calculate the observations b'x_j and c'x_j. From Example 3.9,

x̄ = [ 3           S = [  3    −3/2   0
      1    and          −3/2     1   1/2
      5 ]                 0     1/2   1 ]
Consequently, using (3-36), we find that the two sample means for the derived observations are

Sample mean of b'X = b'x̄ = [2, 2, −1][3, 1, 5]' = 2(3) + 2(1) − 1(5) = 3  (check)
Sample mean of c'X = c'x̄ = [1, −1, 3][3, 1, 5]' = 1(3) − 1(1) + 3(5) = 17  (check)

Using (3-36), we also have

Sample variance of b'X = b'Sb = [2, 2, −1] S [2, 2, −1]' = [2, 2, −1][3, −3/2, 0]' = 3  (check)

Sample variance of c'X = c'Sc = [1, −1, 3] S [1, −1, 3]' = [1, −1, 3][9/2, −1, 5/2]' = 13  (check)

Sample covariance of b'X and c'X = b'Sc = [2, 2, −1][9/2, −1, 5/2]' = 9/2  (check)
As indicated, these last results check with the corresponding sample quantities computed directly from the observations on the linear combinations. ■

The sample mean and covariance relations in Result 3.5 pertain to any number of linear combinations. Consider the q linear combinations

ai1 x1 + ai2 x2 + ··· + aip xp,   i = 1, 2, ..., q   (3-37)

These can be expressed in matrix notation as

[ a11 x1 + a12 x2 + ··· + a1p xp        [ a11  a12  ···  a1p   [ x1
  a21 x1 + a22 x2 + ··· + a2p xp    =     a21  a22  ···  a2p     x2    =  A x   (3-38)
   ⋮                                       ⋮                     ⋮
  aq1 x1 + aq2 x2 + ··· + aqp xp ]        aq1  aq2  ···  aqp ]   xp ]
Taking the ith row of A, a'i, to be b' and the kth row of A, a'k, to be c', we see that Equations (3-36) imply that the ith row of AX has sample mean a'i x̄ and the ith and kth rows of AX have sample covariance a'i S ak. Note that a'i S ak is the (i, k)th element of ASA'.

Result 3.6. The q linear combinations AX in (3-38) have sample mean vector Ax̄ and sample covariance matrix ASA'. ■
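Result 3.6 is easy to check numerically. The sketch below, assuming NumPy is available, reproduces the Example 3.10 computations: the derived observations b'xj and c'xj are formed directly, and their descriptive statistics are then compared with the shortcut values b'x̄, b'Sb, c'Sc, and b'Sc obtained from x̄ and S alone.

```python
import numpy as np

# Data matrix from Example 3.9: rows are the n = 3 observations on p = 3 variables.
X = np.array([[1., 2., 5.],
              [4., 1., 6.],
              [4., 0., 4.]])
b = np.array([2., 2., -1.])
c = np.array([1., -1., 3.])

# Direct route: observations on the linear combinations b'X and c'X.
bx = X @ b      # array([1., 4., 4.])
cx = X @ c      # array([14., 21., 16.])

# Shortcut route (3-36): descriptive statistics from xbar and S only.
xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)          # divisor n - 1, matching the sample S

print(bx.mean(), b @ xbar)               # 3.0  3.0
print(bx.var(ddof=1), b @ S @ b)         # 3.0  3.0
print(cx.var(ddof=1), c @ S @ c)         # 13.0  13.0
print(np.cov(bx, cx)[0, 1], b @ S @ c)   # 4.5  4.5
```

Both routes agree, as Result 3.6 guarantees; the second route never touches the individual derived observations.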
EXERCISES
3.1. Given the data matrix
(a) Graph the scatter plot in p = 2 dimensions. Locate the sample mean on your diagram.
(b) Sketch the n = 3-dimensional representation of the data, and plot the deviation vectors y1 - x̄1 1 and y2 - x̄2 1.
(c) Sketch the deviation vectors in (b) emanating from the origin. Calculate the lengths of these vectors and the cosine of the angle between them. Relate these quantities to Sn and R.
3.2. Given the data matrix
(a) Graph the scatter plot in p = 2 dimensions, and locate the sample mean on your diagram.
(b) Sketch the n = 3-space representation of the data, and plot the deviation vectors y1 - x̄1 1 and y2 - x̄2 1.
(c) Sketch the deviation vectors in (b) emanating from the origin. Calculate their lengths and the cosine of the angle between them. Relate these quantities to Sn and R.

3.3. Perform the decomposition of y1 into x̄1 1 and y1 - x̄1 1 using the first column of the data matrix in Example 3.9.

3.4. Use the six observations on the variable X1, in units of millions, from Table 1.1.
(a) Find the projection on 1' = [1, 1, 1, 1, 1, 1].
(b) Calculate the deviation vector y1 - x̄1 1. Relate its length to the sample standard deviation.
(c) Graph (to scale) the triangle formed by y1, x̄1 1, and y1 - x̄1 1. Identify the length of each component in your graph.
(d) Repeat Parts a-c for the variable X2 in Table 1.1.
(e) Graph (to scale) the two deviation vectors y1 - x̄1 1 and y2 - x̄2 1. Calculate the value of the angle between them.

3.5. Calculate the generalized sample variance |S| for (a) the data matrix X in Exercise 3.1 and (b) the data matrix X in Exercise 3.2.

3.6. Consider the data matrix
(a) Calculate the matrix of deviations (residuals), X - 1x̄'. Is this matrix of full rank? Explain.
(b) Determine S and calculate the generalized sample variance |S|. Interpret the latter geometrically.
(c) Using the results in (b), calculate the total sample variance. [See (3-23).]

3.7. Sketch the solid ellipsoids (x - x̄)' S⁻¹ (x - x̄) ≤ 1 [see (3-16)] for the three matrices

S = [ 5  4       S = [ 5  -4       S = [ 3  0
      4  5 ],         -4   5 ],          0  3 ]
(Note that these matrices have the same generalized variance |S|.)

3.8. Given

S = [ 1  0  0
      0  1  0
      0  0  1 ]

and S =
(a) Calculate the total sample variance for each S. Compare the results.
(b) Calculate the generalized sample variance for each S, and compare the results. Comment on the discrepancies, if any, found between Parts a and b.

3.9. The following data matrix contains data on test scores, with x1 = score on first test, x2 = score on second test, and x3 = total score on the two tests:

X = [ 12  17  29
      18  20  38
      14  16  30
      20  18  38
      16  19  35 ]

(a) Obtain the mean corrected data matrix, and verify that the columns are linearly dependent. Specify an a' = [a1, a2, a3] vector that establishes the linear dependence.
(b) Obtain the sample covariance matrix S, and verify that the generalized variance is zero. Also, show that Sa = 0, so a can be rescaled to be an eigenvector corresponding to eigenvalue zero.
(c) Verify that the third column of the data matrix is the sum of the first two columns. That is, show that there is linear dependence, with a1 = 1, a2 = 1, and a3 = -1.

3.10. When the generalized variance is zero, it is the columns of the mean corrected data matrix Xc = X - 1x̄' that are linearly dependent, not necessarily those of the data matrix itself. Given the data

X = [ 3  1  0
      6  4  6
      4  2  2
      7  0  3
      5  3  4 ]

(a) Obtain the mean corrected data matrix, and verify that the columns are linearly dependent. Specify an a' = [a1, a2, a3] vector that establishes the dependence.
(b) Obtain the sample covariance matrix S, and verify that the generalized variance is zero.
(c) Show that the columns of the data matrix are linearly independent in this case.

3.11. Use the sample covariance obtained in Example 3.7 to verify (3-29) and (3-30), which state that R = D⁻¹ᐟ² S D⁻¹ᐟ² and D¹ᐟ² R D¹ᐟ² = S.

3.12. Show that |S| = (s11 s22 ··· spp)|R|.
Hint: From Equation (3-30), S = D¹ᐟ² R D¹ᐟ². Taking determinants gives |S| = |D¹ᐟ²||R||D¹ᐟ²|. (See Result 2A.11.) Now examine |D¹ᐟ²|.

3.13. Given a data matrix X and the resulting sample correlation matrix R, consider the standardized observations (xjk - x̄k)/√skk, k = 1, 2, ..., p, j = 1, 2, ..., n. Show that these standardized quantities have sample covariance matrix R.
3.14. Consider the data matrix X in Exercise 3.1. We have n = 3 observations on p = 2 variables X1 and X2. Form the linear combinations

c'X = [-1  2] [X1, X2]' = -X1 + 2X2

b'X = [2  3] [X1, X2]' = 2X1 + 3X2

(a) Evaluate the sample means, variances, and covariance of b'X and c'X from first principles. That is, calculate the observed values of b'X and c'X, and then use the sample mean, variance, and covariance formulas.
(b) Calculate the sample means, variances, and covariance of b'X and c'X using (3-36). Compare the results in (a) and (b).

3.15. Repeat Exercise 3.14 using the data matrix
X = [ ·  ·  ·
      ·  ·  ·
      8  3  3 ]

and the linear combinations

b'X = [1  1  1] [X1, X2, X3]'   and   c'X = [1  2  -3] [X1, X2, X3]'
3.16. Let V be a vector random variable with mean vector E(V) = μV and covariance matrix E(V - μV)(V - μV)' = ΣV. Show that E(VV') = ΣV + μV μV'.

3.17. Show that, if X (p×1) and Z (q×1) are independent, then each component of X is independent of each component of Z.
Hint: P[X1 ≤ x1, X2 ≤ x2, ..., Xp ≤ xp and Z1 ≤ z1, ..., Zq ≤ zq] = P[X1 ≤ x1, X2 ≤ x2, ..., Xp ≤ xp] · P[Z1 ≤ z1, ..., Zq ≤ zq] by independence. Let x2, ..., xp and z2, ..., zq tend to infinity, to obtain

P[X1 ≤ x1 and Z1 ≤ z1] = P[X1 ≤ x1] · P[Z1 ≤ z1]

for all x1, z1. So X1 and Z1 are independent. Repeat for other pairs.

REFERENCES

1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (2d ed.). New York: John Wiley, 1984.
2. Eaton, M., and M. Perlman. "The Non-Singularity of Generalized Sample Covariance Matrices." Annals of Statistics, 1 (1973), 710-717.
CHAPTER 4

The Multivariate Normal Distribution

4.1 INTRODUCTION
A generalization of the familiar bell-shaped normal density to several dimensions plays a fundamental role in multivariate analysis. In fact, most of the techniques encountered in this book are based on the assumption that the data were generated from a multivariate normal distribution. While real data are never exactly multivariate normal, the normal density is often a useful approximation to the "true" population distribution.

One advantage of the multivariate normal distribution stems from the fact that it is mathematically tractable and "nice" results can be obtained. This is frequently not the case for other data-generating distributions. Of course, mathematical attractiveness per se is of little use to the practitioner. It turns out, however, that normal distributions are useful in practice for two reasons: First, the normal distribution serves as a bona fide population model in some instances; second, the sampling distributions of many multivariate statistics are approximately normal, regardless of the form of the parent population, because of a central limit effect.

To summarize, many real-world problems fall naturally within the framework of normal theory. The importance of the normal distribution rests on its dual role as both population model for certain natural phenomena and approximate sampling distribution for many statistics.

4.2 THE MULTIVARIATE NORMAL DENSITY AND ITS PROPERTIES
The multivariate normal density is a generalization of the univariate normal density to p ≥ 2 dimensions. Recall that the univariate normal distribution, with mean μ and variance σ², has the probability density function

f(x) = (1/√(2πσ²)) e^{-[(x - μ)/σ]²/2},   -∞ < x < ∞   (4-1)
Figure 4.1 A normal density with mean μ and variance σ², and selected areas under the curve.
A plot of this function yields the familiar bell-shaped curve shown in Figure 4.1. Also shown in the figure are approximate areas under the curve within ±1 standard deviation and ±2 standard deviations of the mean. These areas represent probabilities, and thus, for the normal random variable X,

P(μ - σ ≤ X ≤ μ + σ) ≈ .68
P(μ - 2σ ≤ X ≤ μ + 2σ) ≈ .95

It is convenient to denote the normal density function with mean μ and variance σ² by N(μ, σ²). Therefore, N(10, 4) refers to the function in (4-1) with μ = 10 and σ = 2. This notation will be extended to the multivariate case later.

The term

((x - μ)/σ)² = (x - μ)(σ²)⁻¹(x - μ)   (4-2)

in the exponent of the univariate normal density function measures the square of the distance from x to μ in standard deviation units. This can be generalized for a p × 1 vector x of observations on several variables as

(x - μ)' Σ⁻¹ (x - μ)   (4-3)

The p × 1 vector μ represents the expected value of the random vector X, and the p × p matrix Σ is the variance-covariance matrix of X. [See (2-30) and (2-31).] We shall assume that the symmetric matrix Σ is positive definite, so the expression in (4-3) is the square of the generalized distance from x to μ.

The multivariate normal density is obtained by replacing the univariate distance in (4-2) by the multivariate generalized distance of (4-3) in the density function of (4-1). When this replacement is made, the univariate normalizing constant (2π)⁻¹ᐟ²(σ²)⁻¹ᐟ² must be changed to a more general constant that makes the volume under the surface of the multivariate density function unity for any p. This is necessary because, in the multivariate case, probabilities are represented by volumes under the surface over regions defined by intervals of the xi values. It can be shown (see [1]) that this constant is (2π)⁻ᵖᐟ²|Σ|⁻¹ᐟ², and consequently, a p-dimensional normal density for the random vector X' = [X1, X2, ..., Xp] has the form
f(x) = 1/((2π)^{p/2} |Σ|^{1/2}) e^{-(x - μ)' Σ⁻¹ (x - μ)/2}   (4-4)
where -∞ < xi < ∞, i = 1, 2, ..., p. We shall denote this p-dimensional normal density by Np(μ, Σ), which is analogous to the normal density in the univariate case.
Example 4.1 (Bivariate normal density)
Let us evaluate the p = 2-variate normal density in terms of the individual parameters μ1 = E(X1), μ2 = E(X2), σ11 = Var(X1), σ22 = Var(X2), and ρ12 = σ12/(√σ11 √σ22) = Corr(X1, X2).

Using Result 2A.8, we find that the inverse of the covariance matrix

Σ = [ σ11  σ12
      σ12  σ22 ]

is

Σ⁻¹ = 1/(σ11σ22 - σ12²) [ σ22  -σ12
                          -σ12   σ11 ]

Introducing the correlation coefficient ρ12 by writing σ12 = ρ12 √σ11 √σ22, we obtain σ11σ22 - σ12² = σ11σ22(1 - ρ12²), and the squared distance becomes

(x - μ)' Σ⁻¹ (x - μ)
  = [x1 - μ1, x2 - μ2] (1/(σ11σ22(1 - ρ12²))) [ σ22            -ρ12√σ11√σ22   [ x1 - μ1
                                               -ρ12√σ11√σ22    σ11         ]   x2 - μ2 ]
  = (σ22(x1 - μ1)² + σ11(x2 - μ2)² - 2ρ12√σ11√σ22 (x1 - μ1)(x2 - μ2)) / (σ11σ22(1 - ρ12²))
  = 1/(1 - ρ12²) [ ((x1 - μ1)/√σ11)² + ((x2 - μ2)/√σ22)² - 2ρ12 ((x1 - μ1)/√σ11)((x2 - μ2)/√σ22) ]   (4-5)

The last expression is written in terms of the standardized values (x1 - μ1)/√σ11 and (x2 - μ2)/√σ22.

Next, since |Σ| = σ11σ22 - σ12² = σ11σ22(1 - ρ12²), we can substitute for Σ⁻¹ and |Σ| in (4-4) to get the expression for the bivariate (p = 2) normal density involving the individual parameters μ1, μ2, σ11, σ22, and ρ12:

f(x1, x2) = 1/(2π√(σ11σ22(1 - ρ12²)))
  × exp{ -1/(2(1 - ρ12²)) [ ((x1 - μ1)/√σ11)² + ((x2 - μ2)/√σ22)² - 2ρ12 ((x1 - μ1)/√σ11)((x2 - μ2)/√σ22) ] }   (4-6)

The expression in (4-6) is somewhat unwieldy, and the compact general form in (4-4) is more informative in many ways. On the other hand, the expression in (4-6) is useful for discussing certain properties of the normal distribution. For example, if the random variables X1 and X2 are uncorrelated, so that ρ12 = 0, the joint density can be written as the product of two univariate normal densities
each of the form of (4-1). That is, f(x1, x2) = f(x1)f(x2) and X1 and X2 are independent. [See (2-28).] This result is true in general. (See Result 4.5.)

Two bivariate distributions with σ11 = σ22 are shown in Figure 4.2. In Figure 4.2(a), X1 and X2 are independent (ρ12 = 0). In Figure 4.2(b), ρ12 = .75. Notice how the presence of correlation causes the probability to concentrate along a line. ■
Figure 4.2 Two bivariate normal distributions. (a) σ11 = σ22 and ρ12 = 0. (b) σ11 = σ22 and ρ12 = .75.
From the expression in (4-4) for the density of a p-dimensional normal variable, it should be clear that the paths of x values yielding a constant height for the density are ellipsoids. That is, the multivariate normal density is constant on surfaces where the square of the distance (x - μ)' Σ⁻¹ (x - μ) is constant. These paths are called contours:

Constant probability density contour = {all x such that (x - μ)' Σ⁻¹ (x - μ) = c²}
                                     = surface of an ellipsoid centered at μ

The axes of each ellipsoid of constant density are in the direction of the eigenvectors of Σ⁻¹, and their lengths are proportional to the reciprocals of the square roots of the eigenvalues of Σ⁻¹. Fortunately, we can avoid the calculation of Σ⁻¹ when determining the axes, since these ellipsoids are also determined by the eigenvalues and eigenvectors of Σ. We state the correspondence formally for later reference.
Result 4.1. If Σ is positive definite, so that Σ⁻¹ exists, then

Σe = λe   implies   Σ⁻¹e = (1/λ)e

so (λ, e) is an eigenvalue-eigenvector pair for Σ corresponding to the pair (1/λ, e) for Σ⁻¹. Also, Σ⁻¹ is positive definite.

Proof. For Σ positive definite and e ≠ 0 an eigenvector, we have 0 < e'Σe = e'(Σe) = e'(λe) = λe'e = λ. Moreover, e = Σ⁻¹(Σe) = Σ⁻¹(λe), or e = λΣ⁻¹e, and division by λ > 0 gives Σ⁻¹e = (1/λ)e. Thus, (1/λ, e) is an eigenvalue-eigenvector pair for Σ⁻¹. Also, for any p × 1 x, by (2-21)

x' Σ⁻¹ x = x' ( Σ_{i=1}^{p} (1/λi) ei ei' ) x = Σ_{i=1}^{p} (1/λi)(x'ei)² ≥ 0

since each term λi⁻¹(x'ei)² is nonnegative. In addition, x'ei = 0 for all i only if x = 0. So x ≠ 0 implies that Σ_{i=1}^{p} (1/λi)(x'ei)² > 0, and it follows that Σ⁻¹ is positive definite. ■
The following summarizes these concepts:

Contours of constant density for the p-dimensional normal distribution are ellipsoids defined by the x such that

(x - μ)' Σ⁻¹ (x - μ) = c²   (4-7)

These ellipsoids are centered at μ and have axes ±c√λi ei, where Σei = λi ei, i = 1, 2, ..., p.
A contour of constant density for a bivariate normal distribution with σ11 = σ22 is obtained in the following example.

Example 4.2 (Contours of the bivariate normal density)
We shall obtain the axes of constant probability density contours for a bivariate normal distribution when σ11 = σ22. From (4-7), these axes are given by the eigenvalues and eigenvectors of Σ. Here |Σ - λI| = 0 becomes

0 = | σ11 - λ    σ12
      σ12     σ11 - λ |  = (λ - σ11 - σ12)(λ - σ11 + σ12)

Consequently, the eigenvalues are λ1 = σ11 + σ12 and λ2 = σ11 - σ12. The eigenvector e1 is determined from

σ11 e1 + σ12 e2 = (σ11 + σ12) e1
σ12 e1 + σ11 e2 = (σ11 + σ12) e2

These equations imply that e1 = e2, and after normalization, the first eigenvalue-eigenvector pair is

λ1 = σ11 + σ12,   e1' = [1/√2, 1/√2]
Similarly, λ2 = σ11 - σ12 yields the eigenvector e2' = [1/√2, -1/√2]. When the covariance σ12 (or correlation ρ12) is positive, λ1 = σ11 + σ12 is the largest eigenvalue, and its associated eigenvector e1' = [1/√2, 1/√2] lies along the 45° line through the point μ' = [μ1, μ2]. This is true for any positive value of the covariance (correlation). Since the axes of the constant-density ellipses are given by ±c√λ1 e1 and ±c√λ2 e2 [see (4-7)], and the eigenvectors each have length unity, the major axis will be associated with the largest eigenvalue. For positively correlated normal random variables, then, the major axis of the constant-density ellipses will be along the 45° line through μ. (See Figure 4.3.)
Figure 4.3 A constant-density contour for a bivariate normal distribution with σ11 = σ22 and σ12 > 0 (or ρ12 > 0).
When the covariance (correlation) is negative, λ2 = σ11 - σ12 will be the largest eigenvalue, and the major axes of the constant-density ellipses will lie along a line at right angles to the 45° line through μ. (These results are true only for σ11 = σ22.)

To summarize, the axes of the ellipses of constant density for a bivariate normal distribution with σ11 = σ22 are determined by

±c√(σ11 + σ12) [ 1/√2       and     ±c√(σ11 - σ12) [  1/√2
                 1/√2 ]                               -1/√2 ]   ■
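The eigenvalue-eigenvector pairs derived in Example 4.2 can be confirmed with a numerical eigendecomposition. The values of σ11 and σ12 below are illustrative only; any choice with σ11 = σ22 and σ12 > 0 gives the same structure.

```python
import numpy as np

s11, s12 = 3.0, 1.0                       # illustrative, with s11 = s22 and s12 > 0
Sigma = np.array([[s11, s12], [s12, s11]])

lam, vecs = np.linalg.eigh(Sigma)         # eigh returns eigenvalues in ascending order

# lambda_1 = s11 + s12 (major axis), lambda_2 = s11 - s12 (minor axis)
print(lam[1], s11 + s12)                  # 4.0 4.0
print(lam[0], s11 - s12)                  # 2.0 2.0

# The eigenvector for the largest eigenvalue lies along the 45-degree line:
# both entries have magnitude 1/sqrt(2).
e1 = vecs[:, 1]
print(np.allclose(np.abs(e1), 1 / np.sqrt(2)))   # True
```

Note that `eigh` may return an eigenvector multiplied by -1; only the axis direction is determined, which is why the check compares absolute values.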
We show in Result 4.7 that the choice c² = χ²p(α), where χ²p(α) is the upper (100α)th percentile of a chi-square distribution with p degrees of freedom, leads to contours that contain (1 - α) × 100% of the probability. Specifically, the following is true for a p-dimensional normal distribution:

The solid ellipsoid of x values satisfying

(x - μ)' Σ⁻¹ (x - μ) ≤ χ²p(α)

has probability 1 - α.   (4-8)
The constant-density contours containing 50% and 90% of the probability under the bivariate normal surfaces in Figure 4.2 are pictured in Figure 4.4.

Figure 4.4 The 50% and 90% contours for the bivariate normal distributions in Figure 4.2.
The p-variate normal density in (4-4) has a maximum value when the squared distance in (4-3) is zero, that is, when x = μ. Thus, μ is the point of maximum density, or mode, as well as the expected value of X, or mean. The fact that μ is the mean of the multivariate normal distribution follows from the symmetry exhibited by the constant-density contours: These contours are centered, or balanced, at μ.
Additional Properties of the Multivariate Normal Distribution

Certain properties of the normal distribution will be needed repeatedly in our explanations of statistical models and methods. These properties make it possible to manipulate normal distributions easily and, as we suggested in Section 4.1, are partly responsible for the popularity of the normal distribution. The key properties, which we shall soon discuss in some mathematical detail, can be stated rather simply. The following are true for a random vector X having a multivariate normal distribution:

1. Linear combinations of the components of X are normally distributed.
2. All subsets of the components of X have a (multivariate) normal distribution.
3. Zero covariance implies that the corresponding components are independently distributed.
4. The conditional distributions of the components are (multivariate) normal.

These statements are reproduced mathematically in the results that follow. Many of these results are illustrated with examples. The proofs that are included should help improve your understanding of matrix manipulations and also lead you to an appreciation for the manner in which the results successively build on themselves.

Result 4.2 can be taken as a working definition of the normal distribution. With this in hand, the subsequent properties are almost immediate. Our partial proof of Result 4.2 indicates how the linear combination definition of a normal density relates to the multivariate density in (4-4).

Result 4.2. If X is distributed as Np(μ, Σ), then any linear combination of variables a'X = a1X1 + a2X2 + ··· + apXp is distributed as N(a'μ, a'Σa). Also, if a'X is distributed as N(a'μ, a'Σa) for every a, then X must be Np(μ, Σ).

Proof. The expected value and variance of a'X follow from (2-43). Proving that a'X is normally distributed if X is multivariate normal is more difficult. You can find a proof in [1]. The second part of Result 4.2 is also demonstrated in [1]. ■
Example 4.3 (The distribution of a linear combination of the components of a normal random vector)

Consider the linear combination a'X of a multivariate normal random vector determined by the choice a' = [1, 0, ..., 0]. Since

a'X = [1, 0, ..., 0] [X1, X2, ..., Xp]' = X1

and

a'μ = [1, 0, ..., 0] [μ1, μ2, ..., μp]' = μ1

we have

a'Σa = [1, 0, ..., 0] [ σ11  σ12  ···  σ1p   [ 1
                        σ12  σ22  ···  σ2p     0
                         ⋮                     ⋮
                        σ1p  σ2p  ···  σpp ]   0 ]  = σ11

and it follows from Result 4.2 that X1 is distributed as N(μ1, σ11). More generally, the marginal distribution of any component Xi of X is N(μi, σii). ■
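Result 4.2 can also be illustrated by simulation. The sketch below draws a large sample from Np(μ, Σ) and checks that a'X has sample mean near a'μ and sample variance near a'Σa; the dimension and all parameter values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])                       # illustrative parameters
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
a = np.array([1.0, 2.0, -1.0])

X = rng.multivariate_normal(mu, Sigma, size=200_000)  # rows are draws of X
aX = X @ a                                            # corresponding draws of a'X

# Sample moments of a'X should match the theoretical values a'mu and a'Sigma a.
print(aX.mean(), a @ mu)                 # both near -3.5
print(aX.var(ddof=1), a @ Sigma @ a)     # both near 8.1
```

A histogram of `aX` would also look bell-shaped, in line with the first part of Result 4.2, though the moment check alone does not prove normality.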
The next result considers several linear combinations of a multivariate normal vector X.

Result 4.3. If X is distributed as Np(μ, Σ), the q linear combinations

A X = [ a11X1 + ··· + a1pXp
(q×p)(p×1)   a21X1 + ··· + a2pXp
              ⋮
        aq1X1 + ··· + aqpXp ]

are distributed as Nq(Aμ, AΣA'). Also, X (p×1) + d (p×1), where d is a vector of constants, is distributed as Np(μ + d, Σ).

Proof. The expected value E(AX) and the covariance matrix of AX follow from (2-45). Any linear combination b'(AX) is a linear combination of X, of the form a'X with a = A'b. Thus, the conclusion concerning AX follows directly from Result 4.2.

The second part of the result can be obtained by considering a'(X + d) = a'X + (a'd), where a'X is distributed as N(a'μ, a'Σa). It is known from the univariate case that adding a constant a'd to the random variable a'X leaves the variance unchanged and translates the mean to a'μ + a'd = a'(μ + d). Since a was arbitrary, X + d is distributed as Np(μ + d, Σ). ■

Example 4.4 (The distribution of two linear combinations of the components of a normal random vector)

For X distributed as N3(μ, Σ), find the distribution of

AX = [ 1  -1   0   [ X1
       0   1  -1 ]   X2    =  [ X1 - X2
                     X3 ]       X2 - X3 ]
By Result 4.3, the distribution of AX is multivariate normal with mean

Aμ = [ μ1 - μ2
       μ2 - μ3 ]

and covariance matrix

AΣA' = [ σ11 - 2σ12 + σ22         σ12 + σ23 - σ22 - σ13
         σ12 + σ23 - σ22 - σ13    σ22 - 2σ23 + σ33      ]

Alternatively, the mean vector Aμ and covariance matrix AΣA' may be verified by direct calculation of the means and covariances of the two random variables Y1 = X1 - X2 and Y2 = X2 - X3. ■

We have mentioned that all subsets of a multivariate normal random vector X are themselves normally distributed. We state this property formally as Result 4.4.

Result 4.4. All subsets of X are normally distributed. If we respectively partition X, its mean vector μ, and its covariance matrix Σ as

X = [ X1 (q×1)          μ = [ μ1 (q×1)
      X2 ((p-q)×1) ],         μ2 ((p-q)×1) ]

and

Σ = [ Σ11 (q×q)        Σ12 (q×(p-q))
      Σ21 ((p-q)×q)    Σ22 ((p-q)×(p-q)) ]

then X1 is distributed as Nq(μ1, Σ11).

Proof. Set A (q×p) = [ I (q×q) | 0 (q×(p-q)) ] in Result 4.3, and the conclusion follows. ■

To apply Result 4.4 to an arbitrary subset of the components of X, we simply relabel the subset of interest as X1 and select the corresponding component means and covariances as μ1 and Σ11, respectively.
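The entry-by-entry algebra in Example 4.4 can be checked against a direct matrix computation for any particular Σ; the covariance entries below are illustrative.

```python
import numpy as np

A = np.array([[1., -1., 0.],
              [0., 1., -1.]])
Sigma = np.array([[4.0, 1.0, 0.5],     # illustrative covariance matrix
                  [1.0, 3.0, 1.2],
                  [0.5, 1.2, 2.0]])

ASA = A @ Sigma @ A.T                  # covariance matrix of AX

s11, s12, s13 = Sigma[0, 0], Sigma[0, 1], Sigma[0, 2]
s22, s23, s33 = Sigma[1, 1], Sigma[1, 2], Sigma[2, 2]

# Entries predicted by Example 4.4
print(np.isclose(ASA[0, 0], s11 - 2*s12 + s22))        # True: Var(X1 - X2)
print(np.isclose(ASA[0, 1], s12 + s23 - s22 - s13))    # True: Cov(X1 - X2, X2 - X3)
print(np.isclose(ASA[1, 1], s22 - 2*s23 + s33))        # True: Var(X2 - X3)
```

The same pattern extends to any A: the mean of AX is `A @ mu` and the covariance matrix is `A @ Sigma @ A.T`, exactly as Result 4.3 states.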
Example 4.5 (The distribution of a subset of a normal random vector)
If X is distributed as N5(μ, Σ), find the distribution of

[ X2
  X4 ]

We set

X1 = [ X2        X2 = [ X1
       X4 ],            X3
                        X5 ]

and note that with this assignment, X, μ, and Σ can respectively be rearranged and partitioned as

μ = [ μ2         Σ = [ σ22  σ24 | σ12  σ23  σ25
      μ4               σ24  σ44 | σ14  σ34  σ45
      --               ---------+---------------
      μ1               σ12  σ14 | σ11  σ13  σ15
      μ3               σ23  σ34 | σ13  σ33  σ35
      μ5 ]             σ25  σ45 | σ15  σ35  σ55 ]

Thus, from Result 4.4, for [X2, X4]' we have the distribution

N2(μ1, Σ11) = N2( [ μ2        [ σ22  σ24
                    μ4 ],       σ24  σ44 ] )

It is clear from this example that the normal distribution for any subset can be expressed by simply selecting the appropriate means and covariances from the original μ and Σ. The formal process of relabeling and partitioning is unnecessary. ■
Result 4.5.
(a) If X1 (q1×1) and X2 (q2×1) are independent, then Cov(X1, X2) = 0, a q1 × q2 matrix of zeros.
(b) If [X1; X2] is Nq1+q2( [μ1; μ2], [Σ11 Σ12; Σ21 Σ22] ), then X1 and X2 are independent if and only if Σ12 = 0.
(c) If X1 and X2 are independent and are distributed as Nq1(μ1, Σ11) and Nq2(μ2, Σ22), respectively, then [X1; X2] has the multivariate normal distribution

Nq1+q2( [ μ1        [ Σ11   0
          μ2 ],       0'   Σ22 ] )

Proof. (See Exercise 4.14 for partial proofs based upon factoring the density function when Σ12 = 0.) ■

Example 4.6 (The equivalence of zero covariance and independence for normal variables)

Let X (3×1) be N3(μ, Σ) with

Σ = [ 4  1  0
      1  3  0
      0  0  2 ]

Are X1 and X2 independent? What about (X1, X2) and X3?

Since X1 and X2 have covariance σ12 = 1, they are not independent. However, partitioning X and Σ as

X = [ X1          Σ = [ 4  1 | 0
      X2                1  3 | 0     = [ Σ11 (2×2)   Σ12 (2×1)
      --                -----+--         Σ21 (1×2)   Σ22 (1×1) ]
      X3 ],             0  0 | 2 ]

we see that X1 = [X1; X2] and X3 have covariance matrix Σ12 = [0; 0]. Therefore, by Result 4.5, X3 is independent of X1 and X2. This implies X3 is independent of X1 and also of X2. ■

We pointed out in our discussion of the bivariate normal distribution that ρ12 = 0 (zero correlation) implied independence because the joint density function [see (4-6)] could then be written as the product of the marginal (normal) densities of X1 and X2. This fact, which we encouraged you to verify directly, is simply a special case of Result 4.5 with q1 = q2 = 1.
Result 4.6. Let X = [X1; X2] be distributed as Np(μ, Σ) with μ = [μ1; μ2], Σ = [Σ11 Σ12; Σ21 Σ22], and |Σ22| > 0. Then the conditional distribution of X1, given that X2 = x2, is normal and has

Mean = μ1 + Σ12 Σ22⁻¹ (x2 - μ2)

and

Covariance = Σ11 - Σ12 Σ22⁻¹ Σ21

Note that the covariance does not depend on the value x2 of the conditioning variable.

Proof. We shall give an indirect proof. (See Exercise 4.13, which uses the densities directly.) Take

A (p×p) = [ I (q×q)         -Σ12 Σ22⁻¹
            0 ((p-q)×q)      I ((p-q)×(p-q)) ]

so

A(X - μ) = [ X1 - μ1 - Σ12 Σ22⁻¹ (X2 - μ2)
             X2 - μ2                        ]

is jointly normal with covariance matrix AΣA' given by

[ Σ11 - Σ12 Σ22⁻¹ Σ21    0
  0'                     Σ22 ]

Since X1 - μ1 - Σ12 Σ22⁻¹ (X2 - μ2) and X2 - μ2 have zero covariance, they are independent. Moreover, the quantity X1 - μ1 - Σ12 Σ22⁻¹ (X2 - μ2) has distribution Nq(0, Σ11 - Σ12 Σ22⁻¹ Σ21). Given that X2 = x2, μ1 + Σ12 Σ22⁻¹ (x2 - μ2) is a constant. Because X1 - μ1 - Σ12 Σ22⁻¹ (X2 - μ2) and X2 are independent, the conditional distribution of X1 - μ1 - Σ12 Σ22⁻¹ (x2 - μ2) is the same as the unconditional distribution of X1 - μ1 - Σ12 Σ22⁻¹ (X2 - μ2). Since X1 - μ1 - Σ12 Σ22⁻¹ (X2 - μ2) is Nq(0, Σ11 - Σ12 Σ22⁻¹ Σ21), so is the random vector X1 - μ1 - Σ12 Σ22⁻¹ (x2 - μ2) when X2 has the particular value x2. Equivalently, given that X2 = x2, X1 is distributed as Nq(μ1 + Σ12 Σ22⁻¹ (x2 - μ2), Σ11 - Σ12 Σ22⁻¹ Σ21). ■
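Result 4.6 translates directly into a few lines of linear algebra. The sketch below, with illustrative parameter values, partitions a trivariate normal and computes the conditional mean and covariance of (X1, X2) given the value of X3; `np.linalg.solve` is used in place of an explicit inverse of Σ22.

```python
import numpy as np

mu = np.array([1.0, 2.0, 0.0])                 # illustrative parameters
Sigma = np.array([[2.0, 0.6, 0.4],
                  [0.6, 1.5, 0.5],
                  [0.4, 0.5, 1.0]])

q = 2                                          # condition on the last p - q = 1 component
mu1, mu2 = mu[:q], mu[q:]
S11 = Sigma[:q, :q]
S12 = Sigma[:q, q:]
S22 = Sigma[q:, q:]

x2 = np.array([1.5])                           # observed value of the conditioning variable

cond_mean = S12 @ np.linalg.solve(S22, x2 - mu2) + mu1   # mu1 + Sigma12 Sigma22^{-1}(x2 - mu2)
cond_cov = S11 - S12 @ np.linalg.solve(S22, S12.T)       # Sigma11 - Sigma12 Sigma22^{-1} Sigma21

print(cond_mean)   # [1.6  2.75]
print(cond_cov)    # [[1.84 0.4 ] [0.4  1.25]] -- does not depend on x2
```

Changing `x2` shifts `cond_mean` but leaves `cond_cov` unchanged, exactly as the result states.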
+
+
Example 4.7
•
(The co nditional density of a bivariate normal distri butio n) =
wher e f ( . I f f ( , i s t h e bi v ar i a t e nor ) i s t h e mar g i n al di s t r i b ut i o n of X ) x x x 1 2 2 2 mal density, show that f( x1 x2) is I
1 62
Chapter 4
The M u ltivariate Norm a l Distri bution
erms iionnvolving xbecome, e o-1 -ofa-thie2/o-bi2variao-te1 nor( 1 -malPIdens2) . Theity [tsweoe tEquat Herexponent 1 - JL1 iaparn thet from the multiplicative constant -1/2(1 - PI2), =
(4-6)]
vo=;;: ,
Becaus e or p op1 2 / ya:;; 1 1 2 2 exponent is =
va:;;; vo=;;: =
o-1 2/o-2 , the complete
The constant term 2TTv'o-1 o-2 ( 1 - PI2) also factors as Dividing the joint density of X1 and X2 by the marginal density and canceling terms yields the conditional density Thusx2 , wix2tihs N(ourJLcusl to(marlT1 2/ylTnot2 ) a(xti2o-n, JLth2e) ,condilTl ( 1ti-onalPI2di) )s. trNow,ibutioInl of-IX1l 2giIv2eni I2th1 at o-whi1 c-h wea-i2/obto-2ainedo-1by( lan-inPIdi2r)eandct metI1 2hIod.2i o-12/o-2 , agreeing with Result =
==
+
=
==
4.6�
•
Section 4.2
The M u ltivariate Norma l Dens ity and Its Properties
1 63
For thAle mull conditivartioianalte nordistmrialbutsiitounsatiaron,e i(tmisulwortivarthiaemphas i z i n g t h e f o l o wi n g: t e ) nor m al . The conditional mean is of the form 1. 2.
(4-9) where the {3's are defined by
f3f32l,,qq++l f3f32l,,qq++22 f3f32l,,pp /3q,q+l /3q,q+2 /3q,p Thevalue(condis) ofttihoenalcondicovartioniiannce,g variable(s).2 2 , does not depend upon the We concl u de t h i s s e ct i o n by pr e s e nt i n g t w o f i n al pr o per t i e s of mul t i v ar i a t e nor malconsrtaandom vect o r s . One has t o do wi t h t h e pr o babi l i t y cont e nt of t h e el l i p s o i d s of nt dens i t y . The ot h er di s c us s e s t h e di s t r i b ut i o n of anot h er f o r m of l i n ear com bi2natiTheons. chi-square distribution determines the variability of the sample variance sin the mulfotrivsarampliateescasfreo.m a univariate normal population. It also plays a basic role Let be diisstdiribstutriebdutasedNasp( x , wherwiteh x denot0.eThen s t h e chi s q uar e diThestriNbutp(ion withdistrdegributeioesnofasfsriegedom. ns)},prwherobabie xlit(ya)1 denot solid (el1l00a)ipsotidh - a teos tthhee upper x a ( percentile of the x distribution. We knowwhertehatZ1 ,x 2, is ,definarede inasdependent the distributio1)n raofndomthe varsumi ables. Next, by the spectral decompositipon1[see Equations (2-16) and (s2o-21) with wher e and see Result 4.1], p Cons e quent l y , p 2 p for instance. Now, 2 p we can write ) , where 1
I 1 1 - I 1 I 2i i 1
3.
= s1 1
Result 4.7. X (a) (X - JL ) ' I- 1 (X - JL )
JL, I) �
III �
>
p
(b)
JL, I ) {x: (x - JL ) ' I- 1 (x - JL ) < � �
Proof. Zi + Z� +
A = I,
( 1/ A.J e i .
· · ·
� Z
+ Z� ,
. . •
�
N(O,
ZP
I-1 ei = Ie i = Ai e i , I- 1 = L - ei ei , i =l Ai (X - JL ) ' I - 1 (X - JL ) = L ( 1/ A.J (X - JL ) ' e i ei (X - JL ) = i =l
L ( 1/ Ai ) ( e i (X - IL ) ) = iL [ ( 1/ \/A; ) e i (X - IL ) ] = iL Z[ ,
i =l
=l Z = A(X - IL
=l
1 64
Chapter 4 The M u ltivariate Norm a l D istribution
(pXl) (pAXp) anddistrXibut-edILasis Ndips(tOr,ibAutIeA'd as) , wherNp( eI) . Therefore, by Result z
==
==
0,
4.3, Z ==
A(X - IL ) is
BycludeResthuatlt (X Z-1 1L), Z2' ,I. -.1 ,(XZP-ar1L)e has a x�-disstt1raindarbutiodn.normal2 variables, and we con For Par t b, we not e t h at P[ ( X c ] i s t h e pr o babi l i t y as ( X JL) JL)' I 1 2 sfrigonedmPartot a,thPe [el(Xlip-soiJL)d (' XI-1-(X�-t-) ' IJL)- (X -x�(JL)a)] c by a,andPar the densittybholNp( JLds,. I ). But Res u l t pr o vi d es an i n terpretation of a squared statistical distance. When X is distributed as Np( JL , I ) , icomponent s the squarehasd staatmuch istical ldiarsgterancevarfiraoncem Xthtano thanote populher,aittiowinlmeancontrvectibutoerleILs· Itfo onethe sleqsuartheand ditwstoavarnce.iablMores tehoverat ar,etwnearo hilgyhluncory corrreellaatteed.d rEsandom var i a bl e s wi l cont r i b ut e s e nt i a l y , t h e us e of t h e i n ver s e efofftehctescovarof coriarnceelatimaton. rFrix,om tshtea1ndarproofdizofesResall uofltthe variables and eliminates the (X - JL) ' I- (X - JL) Zt Z� Z� 4.5,
independent
< == 1 -
<
<
Remark: (Interpretation of statistical distance)
(1)
+
4.7
(2)
4.7,
==
•
+ ··· +
Section 4.2
1 I -2
1 65
The M u ltiva riate Normal Dens ity and Its Properties
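Result 4.7 can be illustrated numerically. The sketch below is not from the text; it assumes NumPy is available, picks an arbitrary bivariate μ and Σ, and uses the tabled value χ²_2(.05) = 5.991 to check that the corresponding ellipsoid captures about 95% of simulated draws.

```python
import numpy as np

# Monte Carlo check of Result 4.7 (illustrative sketch; mu and Sigma are
# arbitrary choices, and 5.991 is the tabled upper 5th percentile of chi^2_2).
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[3.0, -1.0],
                  [-1.0, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

X = rng.multivariate_normal(mu, Sigma, size=200_000)
dev = X - mu
# squared statistical distances (x - mu)' Sigma^{-1} (x - mu)
d2 = np.einsum("ij,jk,ik->i", dev, Sigma_inv, dev)

chi2_2_upper05 = 5.991  # chi-square table value, 2 d.f.
coverage = (d2 <= chi2_2_upper05).mean()
print(round(coverage, 3))
```

The observed proportion should sit very close to .95, in line with Part (b) of the result.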
In terms of Σ^{-1/2} (see (2-22)), Z = Σ^{-1/2}(X - μ) has an N_p(0, I_p) distribution, and

    (X - μ)'Σ^{-1}(X - μ) = (X - μ)'Σ^{-1/2}Σ^{-1/2}(X - μ) = Z'Z = Z_1² + Z_2² + ··· + Z_p²

The squared statistical distance is calculated as if, first, the random vector X were transformed to p independent standard normal random variables and then the usual squared distance, the sum of the squares of the variables, were applied.

Next, consider the linear combination of vector random variables

    c_1 X_1 + c_2 X_2 + ··· + c_n X_n = [X_1 | X_2 | ··· | X_n] c    (4-10)
                                          (p × n)          (n × 1)

This linear combination differs from the linear combinations considered earlier in that it defines a p × 1 vector random variable that is a linear combination of vectors. Previously, we discussed a single random variable that could be written as a linear combination of other univariate random variables.

Result 4.8. Let X_1, X_2, ..., X_n be mutually independent with X_j distributed as N_p(μ_j, Σ). (Note that each X_j has the same covariance matrix Σ.) Then

    V_1 = c_1 X_1 + c_2 X_2 + ··· + c_n X_n

is distributed as N_p( \sum_{j=1}^n c_j μ_j, ( \sum_{j=1}^n c_j² ) Σ ). Moreover, V_1 and V_2 = b_1 X_1 + b_2 X_2 + ··· + b_n X_n are jointly multivariate normal with covariance matrix

    [ ( \sum_{j=1}^n c_j² ) Σ      (b'c) Σ                 ]
    [ (b'c) Σ                      ( \sum_{j=1}^n b_j² ) Σ ]

Consequently, V_1 and V_2 are independent if b'c = \sum_{j=1}^n c_j b_j = 0.

Proof. By Result 4.5(c), the np component vector

    X = [X_1', X_2', ..., X_n']'    (np × 1)

is multivariate normal. In particular, X is distributed as N_np(μ, Σ_x), where μ = [μ_1', μ_2', ..., μ_n']' and Σ_x (np × np) is the block-diagonal matrix with Σ in each of the n diagonal blocks and 0 elsewhere.

The choice

    A (2p × np) = [ c_1 I  c_2 I  ···  c_n I ]
                  [ b_1 I  b_2 I  ···  b_n I ]

where I is the p × p identity matrix, gives

    AX = [ \sum_{j=1}^n c_j X_j ] = [ V_1 ]
         [ \sum_{j=1}^n b_j X_j ]   [ V_2 ]

and AX is distributed as N_2p(Aμ, AΣ_x A') by Result 4.3. Straightforward block multiplication shows that AΣ_x A' has first block diagonal term

    [c_1 Σ, c_2 Σ, ..., c_n Σ][c_1 I, c_2 I, ..., c_n I]' = ( \sum_{j=1}^n c_j² ) Σ

The off-diagonal term is

    [c_1 Σ, c_2 Σ, ..., c_n Σ][b_1 I, b_2 I, ..., b_n I]' = ( \sum_{j=1}^n c_j b_j ) Σ

This term is the covariance matrix for V_1 and V_2. Consequently, when \sum_{j=1}^n c_j b_j = b'c = 0, so that ( \sum_{j=1}^n c_j b_j ) Σ = 0 (p × p), V_1 and V_2 are independent by Result 4.5(b). ■

For sums of the type in (4-10), the property of zero correlation is equivalent to requiring the coefficient vectors b and c to be perpendicular.
Example 4.8 (Linear combinations of random vectors). Let X_1, X_2, X_3, and X_4 be independent and identically distributed 3 × 1 random vectors with

    μ = [3, -1, 1]'   and   Σ = [ 3  -1   1 ]
                                [-1   1   0 ]
                                [ 1   0   2 ]

We first consider a linear combination a'X_1 of the three components of X_1. This has mean a'μ = 3a_1 - a_2 + a_3 and variance a'Σa = 3a_1² + a_2² + 2a_3² - 2a_1 a_2 + 2a_1 a_3. That is, a linear combination a'X_1 of the components of a random vector is a single random variable consisting of a sum of terms that are each a constant times a variable. This is very different from a linear combination of random vectors, say, c_1 X_1 + c_2 X_2 + c_3 X_3 + c_4 X_4, which is itself a random vector. Here each term in the sum is a constant times a random vector.

Now consider two linear combinations of random vectors

    (1/2)X_1 + (1/2)X_2 + (1/2)X_3 + (1/2)X_4

and

    X_1 + X_2 + X_3 - 3X_4

Find the mean vector and covariance matrix for each linear combination of vectors and the covariance between them.

By Result 4.8 with c_1 = c_2 = c_3 = c_4 = 1/2, the first linear combination has mean vector

    (c_1 + c_2 + c_3 + c_4) μ = 2μ = [6, -2, 2]'

and covariance matrix

    (c_1² + c_2² + c_3² + c_4²) Σ = 1 × Σ = [ 3  -1   1 ]
                                            [-1   1   0 ]
                                            [ 1   0   2 ]

For the second linear combination of random vectors, we apply Result 4.8 with b_1 = b_2 = b_3 = 1 and b_4 = -3 to get mean vector

    (b_1 + b_2 + b_3 + b_4) μ = 0 μ = 0

and covariance matrix

    (b_1² + b_2² + b_3² + b_4²) Σ = 12 Σ = [ 36  -12   12 ]
                                           [-12   12    0 ]
                                           [ 12    0   24 ]

Finally, the covariance matrix for the two linear combinations of random vectors is

    (c_1 b_1 + c_2 b_2 + c_3 b_3 + c_4 b_4) Σ = 0 Σ = 0

Every component of the first linear combination of random vectors has zero covariance with every component of the second linear combination of random vectors. If, in addition, each X_j has a trivariate normal distribution, then the two linear combinations have a joint six-variate normal distribution, and the two linear combinations of vectors are independent. ■
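The arithmetic of Example 4.8 can be reproduced directly from the formulas in Result 4.8. A minimal sketch in plain Python (the helper names scale_vec and scale_mat are ours, not from the text):

```python
# Example 4.8 arithmetic via Result 4.8: for i.i.d. X_j with common mean mu
# and covariance Sigma, sum_j c_j X_j has mean (sum c_j) mu and covariance
# (sum c_j^2) Sigma; the cross-covariance of two such sums is (b'c) Sigma.
mu = [3.0, -1.0, 1.0]
Sigma = [[3.0, -1.0, 1.0],
         [-1.0, 1.0, 0.0],
         [1.0, 0.0, 2.0]]

c = [0.5, 0.5, 0.5, 0.5]   # first linear combination
b = [1.0, 1.0, 1.0, -3.0]  # second linear combination

def scale_vec(s, v):
    return [s * vi for vi in v]

def scale_mat(s, M):
    return [[s * mij for mij in row] for row in M]

mean_V1 = scale_vec(sum(c), mu)                      # 2 mu = [6, -2, 2]
cov_V1 = scale_mat(sum(ci * ci for ci in c), Sigma)  # 1 * Sigma
mean_V2 = scale_vec(sum(b), mu)                      # 0 mu = 0
cov_V2 = scale_mat(sum(bi * bi for bi in b), Sigma)  # 12 * Sigma
cross = scale_mat(sum(bi * ci for bi, ci in zip(b, c)), Sigma)  # 0 matrix
print(mean_V1, mean_V2)
```

The zero cross-covariance matrix confirms that the two combinations are uncorrelated (and, under normality, independent).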
4.3 SAMPLING FROM A MULTIVARIATE NORMAL DISTRIBUTION AND MAXIMUM LIKELIHOOD ESTIMATION
We discussed sampling and selecting random samples briefly in Chapter 3. In this section, we shall be concerned with samples from a multivariate normal population, in particular, with the sampling distribution of X̄ and S.

The Multivariate Normal Likelihood

Let us assume that the p × 1 vectors X_1, X_2, ..., X_n represent a random sample from a multivariate normal population with mean vector μ and covariance matrix Σ. Since X_1, X_2, ..., X_n are mutually independent and each has distribution N_p(μ, Σ), the joint density function of all the observations is the product of the marginal normal densities:

    {joint density of X_1, X_2, ..., X_n} = \prod_{j=1}^n { (2π)^{-p/2} |Σ|^{-1/2} e^{-(x_j - μ)'Σ^{-1}(x_j - μ)/2} }
        = (2π)^{-np/2} |Σ|^{-n/2} e^{- \sum_{j=1}^n (x_j - μ)'Σ^{-1}(x_j - μ)/2}    (4-11)

When the numerical values of the observations become available, they may be substituted for the x_j in Equation (4-11). The resulting expression, now considered as a function of μ and Σ for the fixed set of observations x_1, x_2, ..., x_n, is called the likelihood.

Many good statistical procedures employ values for the population parameters that "best" explain the observed data. One meaning of best is to select the parameter values that maximize the joint density evaluated at the observations. This technique is called maximum likelihood estimation, and the maximizing parameter values are called maximum likelihood estimates.

At this point, we shall consider maximum likelihood estimation of the parameters μ and Σ for a multivariate normal population. To do so, we take the observations x_1, x_2, ..., x_n as fixed and consider the joint density of Equation (4-11) evaluated at these values. The result is the likelihood function. In order to simplify matters, we rewrite the likelihood function in another form. We shall need some additional properties for the trace of a square matrix. (The trace of a matrix is the sum of its diagonal elements, and the properties of the trace are discussed in Definition 2A.28 and Result 2A.12.)

Result 4.9. Let A be a k × k symmetric matrix and x be a k × 1 vector. Then

(a) x'Ax = tr(x'Ax) = tr(Axx')

(b) tr(A) = \sum_{i=1}^k λ_i, where the λ_i are the eigenvalues of A.

Proof. For Part a, we note that x'Ax is a scalar, so x'Ax = tr(x'Ax). We pointed out in Result 2A.12 that tr(BC) = tr(CB) for any two matrices B and C of dimensions m × k and k × m, respectively. This follows because BC has \sum_{j=1}^k b_ij c_ji as its ith diagonal element, so tr(BC) = \sum_{i=1}^m ( \sum_{j=1}^k b_ij c_ji ). Similarly, the jth diagonal element of CB is \sum_{i=1}^m c_ji b_ij, so tr(CB) = \sum_{j=1}^k ( \sum_{i=1}^m c_ji b_ij ) = \sum_{i=1}^m ( \sum_{j=1}^k b_ij c_ji ) = tr(BC). Let x' play the role of the matrix B with m = 1, and let Ax play the role of the matrix C. Then tr(x'(Ax)) = tr((Ax)x'), and the result follows.

Part b is proved by using the spectral decomposition of (2-20) to write A = PΛP', where PP' = I and Λ is a diagonal matrix with entries λ_1, λ_2, ..., λ_k. Therefore, tr(A) = tr(PΛP') = tr(ΛP'P) = tr(Λ) = λ_1 + λ_2 + ··· + λ_k. ■

Now the exponent in the joint density in (4-11) can be simplified. By Result 4.9(a),

    (x_j - μ)'Σ^{-1}(x_j - μ) = tr[(x_j - μ)'Σ^{-1}(x_j - μ)] = tr[Σ^{-1}(x_j - μ)(x_j - μ)']    (4-12)

Next,

    \sum_{j=1}^n (x_j - μ)'Σ^{-1}(x_j - μ) = \sum_{j=1}^n tr[(x_j - μ)'Σ^{-1}(x_j - μ)]
        = \sum_{j=1}^n tr[Σ^{-1}(x_j - μ)(x_j - μ)']
        = tr[ Σ^{-1} ( \sum_{j=1}^n (x_j - μ)(x_j - μ)' ) ]    (4-13)

since the trace of a sum of matrices is equal to the sum of the traces of the matrices, according to Result 2A.12(b). We can add and subtract x̄ = (1/n) \sum_{j=1}^n x_j in each term (x_j - μ) in \sum_{j=1}^n (x_j - μ)(x_j - μ)' to give

    \sum_{j=1}^n (x_j - x̄ + x̄ - μ)(x_j - x̄ + x̄ - μ)'
        = \sum_{j=1}^n (x_j - x̄)(x_j - x̄)' + \sum_{j=1}^n (x̄ - μ)(x̄ - μ)'
        = \sum_{j=1}^n (x_j - x̄)(x_j - x̄)' + n(x̄ - μ)(x̄ - μ)'    (4-14)

because the cross-product terms, \sum_{j=1}^n (x_j - x̄)(x̄ - μ)' and \sum_{j=1}^n (x̄ - μ)(x_j - x̄)', are both matrices of zeros. (See Exercise 4.15.) Consequently, using Equations (4-13) and (4-14), we can write the joint density of a random sample from a multivariate normal population as

    {joint density of x_1, x_2, ..., x_n}
        = (2π)^{-np/2} |Σ|^{-n/2} exp{ -tr[ Σ^{-1} ( \sum_{j=1}^n (x_j - x̄)(x_j - x̄)' + n(x̄ - μ)(x̄ - μ)' ) ]/2 }    (4-15)
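The identity obtained in (4-13) and (4-14) is easy to confirm numerically. A sketch assuming NumPy, with arbitrary illustrative data, μ, and Σ:

```python
import numpy as np

# Numerical check of (4-13)-(4-14): the exponent sum equals
# tr[Sigma^{-1}(sum_j (x_j - xbar)(x_j - xbar)' + n (xbar - mu)(xbar - mu)')].
rng = np.random.default_rng(1)
n, p = 20, 3
X = rng.normal(size=(n, p))          # any data set works for the identity
mu = np.array([0.5, -0.5, 1.0])      # arbitrary mu
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])  # arbitrary positive definite Sigma
Sigma_inv = np.linalg.inv(Sigma)

lhs = sum((x - mu) @ Sigma_inv @ (x - mu) for x in X)

xbar = X.mean(axis=0)
centered = X - xbar
W = centered.T @ centered            # sum_j (x_j - xbar)(x_j - xbar)'
rhs = np.trace(Sigma_inv @ (W + n * np.outer(xbar - mu, xbar - mu)))
print(round(lhs - rhs, 10))
```

The two sides agree to floating-point precision for any choice of data, μ, and positive definite Σ.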
Substituting the observed values x_1, x_2, ..., x_n into the joint density yields the likelihood function. We shall denote this function by L(μ, Σ), to stress the fact that it is a function of the (unknown) population parameters μ and Σ. Thus, when the vectors x_j contain the specific numbers actually observed, we have

    L(μ, Σ) = (2π)^{-np/2} |Σ|^{-n/2} e^{-tr[ Σ^{-1} ( \sum_{j=1}^n (x_j - x̄)(x_j - x̄)' + n(x̄ - μ)(x̄ - μ)' ) ]/2}    (4-16)

It will be convenient in later sections of this book to express the exponent in the likelihood function (4-16) in different ways. In particular, we shall make use of the identity

    tr[ Σ^{-1} ( \sum_{j=1}^n (x_j - x̄)(x_j - x̄)' + n(x̄ - μ)(x̄ - μ)' ) ]
        = tr[ Σ^{-1} ( \sum_{j=1}^n (x_j - x̄)(x_j - x̄)' ) ] + n tr[ Σ^{-1} (x̄ - μ)(x̄ - μ)' ]
        = tr[ Σ^{-1} ( \sum_{j=1}^n (x_j - x̄)(x_j - x̄)' ) ] + n (x̄ - μ)'Σ^{-1}(x̄ - μ)    (4-17)

Maximum Likelihood Estimation of μ and Σ

The next result will eventually allow us to obtain the maximum likelihood estimators of μ and Σ.

Result 4.10. Given a p × p symmetric positive definite matrix B and a scalar b > 0, it follows that

    (1/|Σ|^b) e^{-tr(Σ^{-1}B)/2} ≤ (1/|B|^b) (2b)^{pb} e^{-bp}

for all positive definite Σ (p × p), with equality holding only for Σ = (1/2b)B.

Proof. Let B^{1/2} [see Equation (2-22)] be the symmetric square root of B, so B^{1/2}B^{1/2} = B, B^{1/2}B^{-1/2} = I, and B^{-1/2}B^{-1/2} = B^{-1}. Then tr(Σ^{-1}B) = tr[(Σ^{-1}B^{1/2})B^{1/2}] = tr[B^{1/2}(Σ^{-1}B^{1/2})]. Let η be an eigenvalue of B^{1/2}Σ^{-1}B^{1/2}. This matrix is positive definite because y'B^{1/2}Σ^{-1}B^{1/2}y = (B^{1/2}y)'Σ^{-1}(B^{1/2}y) > 0 if B^{1/2}y ≠ 0 or, equivalently, y ≠ 0. Thus, the eigenvalues η_i of B^{1/2}Σ^{-1}B^{1/2} are positive by Exercise 2.17. Result 4.9(b) then gives

    tr(Σ^{-1}B) = tr(B^{1/2}Σ^{-1}B^{1/2}) = \sum_{i=1}^p η_i

and |B^{1/2}Σ^{-1}B^{1/2}| = \prod_{i=1}^p η_i by Exercise 2.12. From the properties of determinants in Result 2A.11, we can write

    |B^{1/2}Σ^{-1}B^{1/2}| = |B^{1/2}| |Σ^{-1}| |B^{1/2}| = |Σ^{-1}| |B^{1/2}| |B^{1/2}| = |Σ^{-1}| |B| = (1/|Σ|) |B|

or

    1/|Σ| = |B^{1/2}Σ^{-1}B^{1/2}| / |B| = ( \prod_{i=1}^p η_i ) / |B|

Combining the results for the trace and the determinant yields

    (1/|Σ|^b) e^{-tr(Σ^{-1}B)/2} = ( ( \prod_{i=1}^p η_i )^b / |B|^b ) e^{- \sum_{i=1}^p η_i / 2} = (1/|B|^b) \prod_{i=1}^p η_i^b e^{-η_i/2}

But the function η^b e^{-η/2} has a maximum, with respect to η, of (2b)^b e^{-b}, occurring at η = 2b. The choice η_i = 2b, for each i, therefore gives

    (1/|Σ|^b) e^{-tr(Σ^{-1}B)/2} ≤ (1/|B|^b) (2b)^{pb} e^{-bp}

The upper bound is uniquely attained when Σ = (1/2b)B, since, for this choice,

    B^{1/2}Σ^{-1}B^{1/2} = B^{1/2}(2b)B^{-1}B^{1/2} = (2b) I    (p × p)

and so η_i = 2b for each i. Moreover, tr(Σ^{-1}B) = tr[(2b)I] = 2bp and |Σ| = |(1/2b)B| = (1/2b)^p |B|. Straightforward substitution for tr[Σ^{-1}B] and 1/|Σ|^b yields the bound asserted. ■

The maximum likelihood estimates of μ and Σ are those values, denoted by μ̂ and Σ̂, that maximize the function L(μ, Σ) in (4-16). The estimators μ̂ and Σ̂ will depend on the observed values x_1, x_2, ..., x_n through the summary statistics x̄ and S.

Result 4.11. Let X_1, X_2, ..., X_n be a random sample from a normal population with mean μ and covariance Σ. Then

    μ̂ = X̄   and   Σ̂ = (1/n) \sum_{j=1}^n (X_j - X̄)(X_j - X̄)' = ((n - 1)/n) S

are the maximum likelihood estimators of μ and Σ, respectively. Their observed values, x̄ and (1/n) \sum_{j=1}^n (x_j - x̄)(x_j - x̄)', are called the maximum likelihood estimates of μ and Σ.

Proof. The exponent in the likelihood function [see Equation (4-16)], apart from the multiplicative factor -1/2, is [see (4-17)]

    tr[ Σ^{-1} ( \sum_{j=1}^n (x_j - x̄)(x_j - x̄)' ) ] + n (x̄ - μ)'Σ^{-1}(x̄ - μ)
By Result 4.1, Σ^{-1} is positive definite, so the distance (x̄ - μ)'Σ^{-1}(x̄ - μ) > 0 unless μ = x̄. Thus, the likelihood is maximized with respect to μ at μ̂ = x̄. It remains to maximize over Σ. By Result 4.10 with b = n/2 and B = \sum_{j=1}^n (x_j - x̄)(x_j - x̄)', the maximum occurs at Σ̂ = (1/n) \sum_{j=1}^n (x_j - x̄)(x_j - x̄)', as stated. ■

The maximum likelihood estimators are random quantities. They are obtained by replacing the observations x_1, x_2, ..., x_n in the expressions for μ̂ and Σ̂ with the corresponding random vectors, X_1, X_2, ..., X_n.

We note that the maximum likelihood estimator X̄ is a random vector and the maximum likelihood estimator Σ̂ is a random matrix. The maximum likelihood estimates are their particular values for the given data set. In addition, the maximum of the likelihood is

    L(μ̂, Σ̂) = (2π)^{-np/2} e^{-np/2} |Σ̂|^{-n/2}    (4-18)

or, since |Σ̂| = [(n - 1)/n]^p |S|,

    L(μ̂, Σ̂) = constant × (generalized variance)^{-n/2}    (4-19)

The generalized variance determines the "peakedness" of the likelihood function and, consequently, is a natural measure of variability when the parent population is multivariate normal.

Maximum likelihood estimators possess an invariance property. Let θ̂ be the maximum likelihood estimator of θ, and consider estimating the parameter h(θ), which is a function of θ. Then the maximum likelihood estimate of

    h(θ)  (a function of θ)   is given by   h(θ̂)  (same function of θ̂)    (4-20)

(See [1] and [14].) For example,

1. The maximum likelihood estimator of μ'Σ^{-1}μ is μ̂'Σ̂^{-1}μ̂, where μ̂ = X̄ and Σ̂ = ((n - 1)/n)S are the maximum likelihood estimators of μ and Σ, respectively.

2. The maximum likelihood estimator of √σ_ii is √σ̂_ii, where

    σ̂_ii = (1/n) \sum_{j=1}^n (X_ji - X̄_i)²

is the maximum likelihood estimator of σ_ii = Var(X_i).
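Result 4.11 and the relation Σ̂ = ((n - 1)/n)S can be checked directly; a sketch assuming NumPy, with an illustrative bivariate sample:

```python
import numpy as np

# Result 4.11: mu_hat = xbar and Sigma_hat = ((n-1)/n) S maximize the normal
# likelihood; compare Sigma_hat with the sample covariance S (divisor n - 1).
rng = np.random.default_rng(2)
n = 50
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=n)

mu_hat = X.mean(axis=0)
S = np.cov(X, rowvar=False)            # divisor n - 1
centered = X - mu_hat
Sigma_hat = centered.T @ centered / n  # divisor n (maximum likelihood)
print(np.allclose(Sigma_hat, (n - 1) / n * S))

# Invariance (4-20): the MLE of sqrt(sigma_ii) is sqrt of the MLE of sigma_ii.
sd_hat = np.sqrt(np.diag(Sigma_hat))
```

For large n the two divisors differ negligibly, but the distinction matters in small samples.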
Sufficient Statistics
From expression (4-15), the joint density depends on the whole set of observations x_1, x_2, ..., x_n only through the sample mean x̄ and the sum-of-squares-and-cross-products matrix

    \sum_{j=1}^n (x_j - x̄)(x_j - x̄)' = (n - 1)S

We express this fact by saying that x̄ and (n - 1)S (or S) are sufficient statistics.

The importance of sufficient statistics for normal populations is that all of the information about μ and Σ in the data matrix X is contained in x̄ and S, regardless of the sample size n. This generally is not true for nonnormal populations. Since many multivariate techniques begin with sample means and covariances, it is prudent to check on the adequacy of the multivariate normal assumption. (See Section 4.6.) If the data cannot be regarded as multivariate normal, techniques that depend solely on x̄ and S may be ignoring other useful sample information.

4.4 THE SAMPLING DISTRIBUTION OF X̄ AND S
The tentative assumption that X_1, X_2, ..., X_n constitute a random sample from a normal population with mean μ and covariance Σ completely determines the sampling distributions of X̄ and S. Here we present the results on the sampling distributions of X̄ and S by drawing a parallel with the familiar univariate conclusions.

In the univariate case (p = 1), we know that X̄ is normal with mean μ = (population mean) and variance

    (1/n) σ² = (population variance)/(sample size)

The result for the multivariate case (p ≥ 2) is analogous in that X̄ has a normal distribution with mean μ and covariance matrix (1/n)Σ.

For the sample variance, recall that (n - 1)s² = \sum_{j=1}^n (X_j - X̄)² is distributed as σ² times a chi-square variable having n - 1 degrees of freedom (d.f.). In turn, this chi-square is the distribution of a sum of squares of independent standard normal random variables. That is, (n - 1)s² is distributed as σ²(Z_1² + ··· + Z²_{n-1}) = (σZ_1)² + ··· + (σZ_{n-1})². The individual terms σZ_i are independently distributed as N(0, σ²). It is this latter form that is suitably generalized to the basic sampling distribution for the sample covariance matrix.
The sampling distribution of the sample covariance matrix is called the Wishart distribution, after its discoverer; it is defined as the sum of independent products of multivariate normal random vectors. Specifically,

    W_m(· | Σ) = Wishart distribution with m d.f.
               = distribution of \sum_{j=1}^m Z_j Z_j'    (4-22)

where the Z_j are each independently distributed as N_p(0, Σ).

We summarize the sampling distribution results as follows:

    Let X_1, X_2, ..., X_n be a random sample of size n from a p-variate normal distribution with mean μ and covariance matrix Σ. Then
    1. X̄ is distributed as N_p(μ, (1/n)Σ).
    2. (n - 1)S is distributed as a Wishart random matrix with n - 1 d.f.
    3. X̄ and S are independent.    (4-23)

Because Σ is unknown, the distribution of X̄ cannot be used directly to make inferences about μ. However, S provides independent information about Σ, and the distribution of S does not depend on μ. This allows us to construct a statistic for making inferences about μ, as we shall see in Chapter 5.

For the present, we record some further results from multivariable distribution theory. The following properties of the Wishart distribution are derived directly from its definition as a sum of the independent products Z_j Z_j'. Proofs can be found in [1].

Properties of the Wishart Distribution

1. If A_1 is distributed as W_{m_1}(A_1 | Σ) independently of A_2, which is distributed as W_{m_2}(A_2 | Σ), then A_1 + A_2 is distributed as W_{m_1 + m_2}(A_1 + A_2 | Σ). That is, the degrees of freedom add.    (4-24)

2. If A is distributed as W_m(A | Σ), then CAC' is distributed as W_m(CAC' | CΣC').

Although we do not have any particular need for the probability density function of the Wishart distribution, it may be of some interest to see its rather complicated form. The density does not exist unless the sample size n is greater than the number of variables p. When it does exist, its value at the positive definite matrix A is

    w_{n-1}(A | Σ) = |A|^{(n-p-2)/2} e^{-tr[AΣ^{-1}]/2} / ( 2^{p(n-1)/2} π^{p(p-1)/4} |Σ|^{(n-1)/2} \prod_{i=1}^p Γ((n - i)/2) ),   A positive definite    (4-25)

where Γ(·) is the gamma function. (See [1].)
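Definition (4-22) can be explored by simulation. The sketch below (assuming NumPy; Σ and the degrees of freedom m are arbitrary choices) checks the first-moment consequence E(W) = mΣ of the Wishart construction:

```python
import numpy as np

# Sketch of the Wishart definition (4-22): W = sum_{j=1}^m Z_j Z_j' with
# Z_j ~ N_p(0, Sigma) i.i.d. Since E(Z_j Z_j') = Sigma, E(W) = m * Sigma,
# which the Monte Carlo average below should reproduce.
rng = np.random.default_rng(3)
p, m, reps = 2, 9, 4000
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

W_bar = np.zeros((p, p))
for _ in range(reps):
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=m)
    W_bar += Z.T @ Z / reps            # Z.T @ Z = sum_j Z_j Z_j'
print(np.round(W_bar / m, 2))          # should be close to Sigma
```

The same construction with m = n - 1 mimics the sampling distribution of (n - 1)S from a normal sample.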
4.5 LARGE-SAMPLE BEHAVIOR OF X̄ AND S
Suppose the quantity X is determined by a large number of independent causes V_1, V_2, ..., V_n, where the random variables V_i representing the causes have approximately the same variability. If X is the sum

    X = V_1 + V_2 + ··· + V_n

then the central limit theorem applies, and we conclude that X has a distribution that is nearly normal. This is true for virtually any parent distribution of the V_j's, provided that n is large enough.

The univariate central limit theorem also tells us that the sampling distribution of the sample mean, X̄, for a large sample size is nearly normal, whatever the form of the underlying population distribution. A similar result holds for many other important univariate statistics.

It turns out that certain multivariate statistics, like X̄ and S, have large-sample properties analogous to their univariate counterparts. As the sample size is increased without bound, certain regularities govern the sampling variation in X̄ and S, irrespective of the form of the parent population. Therefore, the conclusions presented in this section do not require multivariate normal populations. The only requirements are that the parent population, whatever its form, have a mean μ and a finite covariance Σ.

Result 4.12 (Law of large numbers). Let Y_1, Y_2, ..., Y_n be independent observations from a population with mean E(Y_i) = μ. Then

    Ȳ = (Y_1 + Y_2 + ··· + Y_n)/n

converges in probability to μ as n increases without bound. That is, for any prescribed accuracy ε > 0, P[-ε < Ȳ - μ < ε] approaches unity as n → ∞.

Proof. See [9]. ■

As a direct consequence of the law of large numbers, which says that each X̄_i converges in probability to μ_i, i = 1, 2, ..., p,

    X̄ converges in probability to μ    (4-26)

Also, each sample covariance s_ik converges in probability to σ_ik, i, k = 1, 2, ..., p, and

    S (or Σ̂ = S_n) converges in probability to Σ    (4-27)
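Statements (4-26) and (4-27) can be watched in action. A sketch assuming NumPy, with a bivariate uniform parent population (no normality required):

```python
import numpy as np

# Convergence in probability of xbar to mu and S to Sigma, illustrated with
# a bivariate uniform parent: the maximum estimation error shrinks as n grows.
rng = np.random.default_rng(5)
mu = np.array([0.5, 0.5])                  # mean of Uniform(0, 1) coordinates
Sigma = np.diag([1.0 / 12.0, 1.0 / 12.0])  # covariance of independent uniforms

for n in (100, 10_000):
    X = rng.uniform(size=(n, 2))
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    print(n, np.abs(xbar - mu).max().round(4),
          np.abs(S - Sigma).max().round(4))
```

The errors reported for n = 10,000 are an order of magnitude smaller than those for n = 100, as the law of large numbers predicts.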
Statement (4-27) follows from writing

    (n - 1) s_ik = \sum_{j=1}^n (X_ji - X̄_i)(X_jk - X̄_k)
                 = \sum_{j=1}^n (X_ji - μ_i + μ_i - X̄_i)(X_jk - μ_k + μ_k - X̄_k)
                 = \sum_{j=1}^n (X_ji - μ_i)(X_jk - μ_k) - n(X̄_i - μ_i)(X̄_k - μ_k)

Letting Y_j = (X_ji - μ_i)(X_jk - μ_k), with E(Y_j) = σ_ik, we see that the first term in s_ik converges to σ_ik and the second term converges to zero, by applying the law of large numbers.

The practical interpretation of statements (4-26) and (4-27) is that, with high probability, X̄ will be close to μ and S will be close to Σ whenever the sample size is large. The statement concerning X̄ is made even more precise by a multivariate version of the central limit theorem.

Result 4.13 (The central limit theorem). Let X_1, X_2, ..., X_n be independent observations from any population with mean μ and finite covariance Σ. Then

    √n (X̄ - μ) has an approximate N_p(0, Σ) distribution

for large sample sizes. Here n should also be large relative to p.

Proof. See [1]. ■

The approximation provided by the central limit theorem applies to discrete, as well as continuous, multivariate populations. Mathematically, the limit is exact, and the approach to normality is often fairly rapid. Moreover, from the results in Section 4.4, we know that X̄ is exactly normally distributed when the underlying population is normal. Thus, we would expect the central limit theorem approximation to be quite good for moderate n when the parent population is nearly normal.

As we have seen, when n is large, S is close to Σ with high probability. Consequently, replacing Σ by S in the approximating normal distribution for X̄ will have a negligible effect on subsequent probability calculations.

Result 4.7 can be used to show that

    n (X̄ - μ)'Σ^{-1}(X̄ - μ) has a χ²_p distribution

when X̄ is distributed as N_p(μ, (1/n)Σ) or, equivalently, when √n (X̄ - μ) has an N_p(0, Σ) distribution. The χ²_p distribution is approximately the sampling distribution of n (X̄ - μ)'Σ^{-1}(X̄ - μ) when X̄ is approximately normally distributed. Replacing Σ^{-1} by S^{-1} does not seriously affect this approximation for n large and much greater than p.
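The chi-square approximation just described can be checked by simulation even for a markedly nonnormal parent. A sketch assuming NumPy, using independent exponential coordinates (mean 1, variance 1) as the parent population; 5.991 is the tabled upper 5th percentile of χ²_2:

```python
import numpy as np

# Central limit check: draws from a skewed bivariate population (independent
# exponentials), yet n (xbar - mu)' Sigma^{-1} (xbar - mu) behaves like a
# chi-square variable with 2 d.f. once n is moderately large.
rng = np.random.default_rng(4)
n, reps = 100, 20_000
mu = np.array([1.0, 1.0])        # exponential(1) mean
Sigma = np.diag([1.0, 1.0])      # exponential(1) variance
Sigma_inv = np.linalg.inv(Sigma)

xbars = rng.exponential(1.0, size=(reps, n, 2)).mean(axis=1)
dev = xbars - mu
stat = n * np.einsum("ij,jk,ik->i", dev, Sigma_inv, dev)

chi2_2_upper05 = 5.991           # chi-square table value, 2 d.f.
coverage = (stat <= chi2_2_upper05).mean()
print(round(coverage, 3))
```

Despite the skewed parent, the observed coverage lands near the nominal .95.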
We summarize the major conclusions of this section as follows:

    Let X_1, X_2, ..., X_n be independent observations from a population with mean μ and finite (nonsingular) covariance Σ. Then
    √n (X̄ - μ) is approximately N_p(0, Σ)
    and
    n (X̄ - μ)'S^{-1}(X̄ - μ) is approximately χ²_p    (4-28)
    for n - p large.

In the next three sections, we consider ways of verifying the assumption of normality and methods for transforming nonnormal observations into observations that are approximately normal.

4.6 ASSESSING THE ASSUMPTION OF NORMALITY
As we have pointed out, most of the statistical techniques discussed in subsequent chapters assume that each vector observation X_j comes from a multivariate normal distribution. On the other hand, in situations where the sample size is large and the techniques depend solely on the behavior of X̄, or distances involving X̄ of the form n(X̄ - μ)'S^{-1}(X̄ - μ), the assumption of normality for the individual observations is less crucial. But to some degree, the quality of inferences made by these methods depends on how closely the true parent population resembles the multivariate normal form. It is imperative, then, that procedures exist for detecting cases where the data exhibit moderate to extreme departures from what is expected under multivariate normality.

We want to answer this question: Do the observations X_j appear to violate the assumption that they came from a normal population? Based on the properties of normal distributions, we know that all linear combinations of normal variables are normal and the contours of the multivariate normal density are ellipsoids. Therefore, we address these questions:

1. Do the marginal distributions of the elements of X appear to be normal? What about a few linear combinations of the components X_i?
2. Do the scatter plots of pairs of observations on different characteristics give the elliptical appearance expected from normal populations?
3. Are there any "wild" observations that should be checked for accuracy?

It will become clear that our investigations of normality will concentrate on the behavior of the observations in one or two dimensions (for example, marginal distributions and scatter plots). As might be expected, it has proved difficult to construct a "good" overall test of joint normality in more than two dimensions because of the large number of things that can go wrong. To some extent, we must pay a price for concentrating on univariate and bivariate examinations of normality: We can
never be sure that we have not missed some feature that is revealed only in higher dimensions. (It is possible, for example, to construct a nonnormal bivariate distribution with normal marginals. [See Exercise 4.8.]) Yet many types of nonnormality are often reflected in the marginal distributions and scatter plots. Moreover, for most practical work, one-dimensional and two-dimensional investigations are ordinarily sufficient. Fortunately, pathological data sets that are normal in lower dimensional representations, but nonnormal in higher dimensions, are not frequently encountered in practice.

Evaluating the Normality of the Univariate Marginal Distributions
Dot diagrams for smaller n and histograms for n > 25 or so help reveal situations where one tail of a univariate distribution is much longer than the other. If the histogram for a variable X_i appears reasonably symmetric, we can check further by counting the number of observations in certain intervals. A univariate normal distribution assigns probability .683 to the interval (μ_i - √σ_ii, μ_i + √σ_ii) and probability .954 to the interval (μ_i - 2√σ_ii, μ_i + 2√σ_ii). Consequently, with a large sample size n, we expect the observed proportion p̂_i1 of the observations lying in the interval (x̄_i - √s_ii, x̄_i + √s_ii) to be about .683. Similarly, the observed proportion p̂_i2 of the observations in (x̄_i - 2√s_ii, x̄_i + 2√s_ii) should be about .954. Using the normal approximation to the sampling distribution of p̂_i (see [9]), we observe that either
    |p̂_i1 - .683| > 3 √( (.683)(.317)/n ) = 1.396/√n

or

    |p̂_i2 - .954| > 3 √( (.954)(.046)/n ) = .628/√n    (4-29)

would indicate departures from an assumed normal distribution for the ith characteristic. When the observed proportions are too small, parent distributions with thicker tails than the normal are suggested.

Plots are always useful devices in any data analysis. Special plots called Q-Q plots can be used to assess the assumption of normality. These plots can be made for the marginal distributions of the sample observations on each variable. They are, in effect, plots of the sample quantile versus the quantile one would expect to observe if the observations actually were normally distributed. When the points lie very nearly along a straight line, the normality assumption remains tenable. Normality is suspect if the points deviate from a straight line. Moreover, the pattern of the deviations can provide clues about the nature of the nonnormality. Once the reasons for the nonnormality are identified, corrective action is often possible. (See Section 4.8.)

To simplify notation, let x_1, x_2, ..., x_n represent n observations on any single characteristic X_i. Let x_(1) ≤ x_(2) ≤ ··· ≤ x_(n) represent these observations after they are ordered according to magnitude. For example, x_(2) is the second smallest observation and x_(n) is the largest observation. The x_(j)'s are the sample quantiles. When
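The checks in (4-29) are easy to carry out. A minimal sketch in plain Python on simulated normal data (the data are illustrative, not from the text):

```python
import math
import random

# Empirical check of the proportion criteria (4-29) on simulated normal data:
# p1 counts observations within one sample standard deviation of the mean,
# p2 within two; for normal data neither flag should fire.
random.seed(0)
n = 500
x = [random.gauss(10.0, 2.0) for _ in range(n)]

xbar = sum(x) / n
s = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))

p1 = sum(abs(xi - xbar) <= s for xi in x) / n
p2 = sum(abs(xi - xbar) <= 2 * s for xi in x) / n

flag1 = abs(p1 - 0.683) > 1.396 / math.sqrt(n)
flag2 = abs(p2 - 0.954) > 0.628 / math.sqrt(n)
print(round(p1, 3), round(p2, 3), flag1, flag2)
```

Running the same check on heavy-tailed data tends to trip one or both flags.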
the x_(j) are distinct, exactly j observations are less than or equal to x_(j). (This is theoretically always true when the observations are of the continuous type, which we usually assume.) The proportion j/n of the sample at or to the left of x_(j) is often approximated by (j - 1/2)/n for analytical convenience.¹

For a standard normal distribution, the quantiles q_(j) are defined by the relation

    P[Z ≤ q_(j)] = \int_{-∞}^{q_(j)} (1/√(2π)) e^{-z²/2} dz = p_(j) = (j - 1/2)/n    (4-30)

(See Table 1 in the appendix.) Here p_(j) is the probability of getting a value less than or equal to q_(j) in a single drawing from a standard normal population.

The idea is to look at the pairs of quantiles (q_(j), x_(j)) with the same associated cumulative probability (j - 1/2)/n. If the data arise from a normal population, the pairs (q_(j), x_(j)) will be approximately linearly related, since σ q_(j) + μ is nearly the expected sample quantile.²

Example 4.9 (Constructing a Q-Q plot)
A sample of n = 10 observations gives the values in the following table:

    Ordered observations x_(j):       -1.00  -.10   .16   .41   .62   .80  1.26  1.54  1.71  2.30
    Probability levels (j - 1/2)/n:     .05   .15   .25   .35   .45   .55   .65   .75   .85   .95
    Standard normal quantiles q_(j): -1.645 -1.036 -.674 -.385 -.125  .125  .385  .674 1.036 1.645

Here, for example,

    P[Z ≤ .385] = \int_{-∞}^{.385} (1/√(2π)) e^{-z²/2} dz = .65    [See (4-30).]
Let us now construct the Q-Q plot and comment on its appearance. The Q-Q plot for the foregoing data, which is a plot of the ordered data x_(j) against

¹ The 1/2 in the numerator of (j - 1/2)/n is a "continuity" correction. Some authors (see [5] and [10]) have suggested replacing (j - 1/2)/n by (j - 3/8)/(n + 1/4).
² A better procedure is to plot (m_(j), x_(j)), where m_(j) = E(z_(j)) is the expected value of the jth order statistic in a sample of size n from a standard normal distribution. (See [12] for further discussion.)

[Figure 4.5: A Q-Q plot for the data in Example 4.9.]
the normal quantiles q_(j), is shown in Figure 4.5. The pairs of points (q_(j), x_(j)) lie very nearly along a straight line, and we would not reject the notion that these data are normally distributed, particularly with a sample size as small as n = 10. ■

The calculations required for Q-Q plots are easily programmed for electronic computers. Many statistical programs available commercially are capable of producing such plots. The steps leading to a Q-Q plot are as follows:

1. Order the original observations to get x_(1), x_(2), ..., x_(n) and their corresponding probability values (1 - 1/2)/n, (2 - 1/2)/n, ..., (n - 1/2)/n;
2. Calculate the standard normal quantiles q_(1), q_(2), ..., q_(n); and
3. Plot the pairs of observations (q_(1), x_(1)), (q_(2), x_(2)), ..., (q_(n), x_(n)), and examine the "straightness" of the outcome.
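The three steps above can be sketched with the Python standard library; statistics.NormalDist supplies the standard normal quantiles, and the data are those of Example 4.9:

```python
from statistics import NormalDist

# Q-Q plot coordinates by the three steps above, using the n = 10
# observations of Example 4.9.
x = [-1.00, -0.10, 0.16, 0.41, 0.62, 0.80, 1.26, 1.54, 1.71, 2.30]

x_sorted = sorted(x)                              # step 1: order
n = len(x_sorted)
probs = [(j - 0.5) / n for j in range(1, n + 1)]  # (j - 1/2)/n levels
q = [NormalDist().inv_cdf(p) for p in probs]      # step 2: normal quantiles
pairs = list(zip(q, x_sorted))                    # step 3: points to plot

for qj, xj in pairs:
    print(f"{qj:7.3f}  {xj:6.2f}")
```

The computed quantiles match the tabled values of Example 4.9 to three decimals.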
Q-Q plots are not particularly informative unless the sample size is moderate to large, for instance, n > 20. There can be quite a bit of variability in the straightness of the Q-Q plot for small samples, even when the observations are known to come from a normal population.

Example 4.10 (A Q-Q plot for radiation data). The quality-control department of a manufacturer of microwave ovens is required by the federal government to monitor the amount of radiation emitted when the doors of the ovens are closed. Observations of the radiation emitted through closed doors of n = 42 randomly selected ovens were made. The data are listed in Table 4.1.

In order to determine the probability of exceeding a prespecified tolerance level, a probability distribution for the radiation emitted was needed. Can we regard the observations here as being normally distributed?

A computer was used to assemble the pairs (q_(j), x_(j)) and construct the Q-Q plot, pictured in Figure 4.6. It appears from the plot that the data as a whole are not normally distributed. The points indicated by the circled locations in the figure are outliers, values that are too large relative to the rest of the observations.
Section 4.6 Assessing the Assumption of Normality
TABLE 4.1 RADIATION DATA (DOOR CLOSED)

Oven no.  Radiation    Oven no.  Radiation    Oven no.  Radiation
   1        .15          16        .10          31        .10
   2        .09          17        .02          32        .20
   3        .18          18        .10          33        .11
   4        .10          19        .01          34        .30
   5        .05          20        .40          35        .02
   6        .12          21        .10          36        .20
   7        .08          22        .05          37        .20
   8        .05          23        .03          38        .30
   9        .08          24        .05          39        .30
  10        .10          25        .15          40        .40
  11        .07          26        .10          41        .30
  12        .02          27        .15          42        .05
  13        .01          28        .09
  14        .10          29        .08
  15        .10          30        .18

Source: Data courtesy of J. D. Cryer.
Figure 4.6 A Q-Q plot of the radiation data (door closed) from Example 4.10. (The integers in the plot indicate the number of points occupying the same location.)
For the radiation data, several observations are equal. When this occurs, those observations with like values are associated with the same normal quantile. This quantile is calculated using the average of the quantiles the tied observations would have if they all differed slightly.
Chapter 4 The Multivariate Normal Distribution
The straightness of the Q-Q plot can be measured by calculating the correlation coefficient of the points in the plot. The correlation coefficient for the Q-Q plot is defined by

$$ r_Q = \frac{\displaystyle\sum_{j=1}^{n} (x_{(j)} - \bar{x})(q_{(j)} - \bar{q})}{\sqrt{\displaystyle\sum_{j=1}^{n} (x_{(j)} - \bar{x})^2} \; \sqrt{\displaystyle\sum_{j=1}^{n} (q_{(j)} - \bar{q})^2}} \qquad (4\text{-}31) $$
and a powerful test of normality can be based on it. (See [5], [10], and [11].) Formally, we reject the hypothesis of normality at level of significance α if r_Q falls below the appropriate value in Table 4.2.

TABLE 4.2 CRITICAL POINTS FOR THE Q-Q PLOT CORRELATION COEFFICIENT TEST FOR NORMALITY

                      Significance levels α
Sample size n     .01       .05       .10
      5          .8299     .8788     .9032
     10          .8801     .9198     .9351
     15          .9126     .9389     .9503
     20          .9269     .9508     .9604
     25          .9410     .9591     .9665
     30          .9479     .9652     .9715
     35          .9538     .9682     .9740
     40          .9599     .9726     .9771
     45          .9632     .9749     .9792
     50          .9671     .9768     .9809
     55          .9695     .9787     .9822
     60          .9720     .9801     .9836
     75          .9771     .9838     .9866
    100          .9822     .9873     .9895
    150          .9879     .9913     .9928
    200          .9905     .9931     .9942
    300          .9935     .9953     .9960

Example 4.11
(A correlation coefficient test for normality)
Let us calculate the correlation coefficient r_Q from the Q-Q plot of Example 4.9 (see Figure 4.5) and test for normality. Using the information from Example 4.9, we have x̄ = .770 and

$$ \sum_{j=1}^{10} (x_{(j)} - \bar{x})\, q_{(j)} = 8.584, \qquad \sum_{j=1}^{10} (x_{(j)} - \bar{x})^2 = 8.472, \qquad \sum_{j=1}^{10} q_{(j)}^2 = 8.795 $$

Since always q̄ = 0,

$$ r_Q = \frac{8.584}{\sqrt{8.472}\,\sqrt{8.795}} = .994 $$
A test of normality at the 10% level of significance is provided by referring r_Q = .994 to the entry in Table 4.2 corresponding to n = 10 and α = .10. This entry is .9351. Since r_Q > .9351, we do not reject the hypothesis of normality.

Instead of r_Q, some software packages evaluate the original statistic proposed by Shapiro and Wilk [11]. Its correlation form corresponds to replacing q(j) by a function of the expected value of standard normal order statistics and their covariances. We prefer r_Q because it corresponds directly to the points in the normal-scores plot. For large sample sizes, the two statistics are nearly the same (see [12]), so either can be used to judge lack of fit.

Linear combinations of more than one characteristic can be investigated. Many statisticians suggest plotting

$$ \hat{e}_1' x_j \quad \text{where} \quad S\,\hat{e}_1 = \hat{\lambda}_1 \hat{e}_1 $$

in which λ̂₁ is the largest eigenvalue of S. Here x_j' = [x_{j1}, x_{j2}, ..., x_{jp}] is the jth observation on the p variables X₁, X₂, ..., X_p. The linear combination ê_p' x_j corresponding to the smallest eigenvalue is also frequently singled out for inspection. (See Chapter 8 and [6] for further details.)
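The correlation coefficient r_Q of (4-31) is straightforward to compute from scratch. The sketch below is our own illustration (not code from the text), using the standard-normal inverse CDF from Python's statistics module.

```python
from math import sqrt
from statistics import NormalDist

def r_Q(sample):
    """Correlation coefficient of the normal Q-Q plot, Eq. (4-31).

    By symmetry the mean of the standard normal quantiles q(j)
    is essentially 0, so the q-deviations are the quantiles
    themselves; we subtract the mean anyway for clarity.
    """
    n = len(sample)
    x = sorted(sample)
    q = [NormalDist().inv_cdf((j - 0.5) / n) for j in range(1, n + 1)]
    xbar = sum(x) / n
    qbar = sum(q) / n
    sxq = sum((xi - xbar) * (qi - qbar) for xi, qi in zip(x, q))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sqq = sum((qi - qbar) ** 2 for qi in q)
    return sxq / (sqrt(sxx) * sqrt(sqq))

# Roughly evenly spread, symmetric data should give r_Q close to 1.
print(round(r_Q([2.1, 3.0, 3.4, 3.9, 4.1, 4.6, 5.0, 5.9]), 3))
```

The result would then be compared with the appropriate critical point in Table 4.2 for the chosen significance level.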
Evaluating Bivariate Normality

We would like to check on the assumption of normality for all distributions of 2, 3, ..., p dimensions. However, as we have pointed out, for practical work it is usually sufficient to investigate the univariate and bivariate distributions. We considered univariate marginal distributions earlier. It is now of interest to examine the bivariate case.

In Chapter 1, we described scatter plots for pairs of characteristics. If the observations were generated from a multivariate normal distribution, each bivariate distribution would be normal, and the contours of constant density would be ellipses. The scatter plot should conform to this structure by exhibiting an overall pattern that is nearly elliptical.

Moreover, by Result 4.7, the set of bivariate outcomes x such that

$$ (x - \mu)' \Sigma^{-1} (x - \mu) \le \chi_2^2(.5) $$

has probability .5. Thus, we should expect roughly the same percentage, 50%, of sample observations to lie in the ellipse given by

$$ \{\text{all } x \text{ such that } (x - \bar{x})' S^{-1} (x - \bar{x}) \le \chi_2^2(.5)\} $$

where we have replaced μ by its estimate x̄ and Σ⁻¹ by its estimate S⁻¹. If not, the normality assumption is suspect.
Example 4.12 (Checking bivariate normality)
Although not a random sample, data consisting of the pairs of observations (x₁ = sales, x₂ = profits) for the 10 largest U.S. industrial corporations are listed in Exercise 1.4. These data give

$$ \bar{x} = \begin{bmatrix} 62{,}309 \\ 2{,}927 \end{bmatrix}, \qquad S = \begin{bmatrix} 10{,}005.20 & 255.76 \\ 255.76 & 14.30 \end{bmatrix} \times 10^{5} $$

so

$$ S^{-1} = \frac{1}{77{,}661.18} \begin{bmatrix} 14.30 & -255.76 \\ -255.76 & 10{,}005.20 \end{bmatrix} \times 10^{-5} = \begin{bmatrix} .000184 & -.003293 \\ -.003293 & .128831 \end{bmatrix} \times 10^{-5} $$

From Table 3 in the appendix, χ²₂(.5) = 1.39. Thus, any observation x' = [x₁, x₂] satisfying

$$ \begin{bmatrix} x_1 - 62{,}309 & x_2 - 2{,}927 \end{bmatrix} \begin{bmatrix} .000184 & -.003293 \\ -.003293 & .128831 \end{bmatrix} \times 10^{-5} \begin{bmatrix} x_1 - 62{,}309 \\ x_2 - 2{,}927 \end{bmatrix} \le 1.39 $$

is on or inside the estimated 50% contour. Otherwise the observation is outside this contour. The first pair of observations in Exercise 1.4 is [x₁, x₂]' = [126,974, 4224]. In this case

$$ \begin{bmatrix} 126{,}974 - 62{,}309 & 4224 - 2{,}927 \end{bmatrix} \begin{bmatrix} .000184 & -.003293 \\ -.003293 & .128831 \end{bmatrix} \times 10^{-5} \begin{bmatrix} 126{,}974 - 62{,}309 \\ 4224 - 2{,}927 \end{bmatrix} = 4.34 > 1.39 $$

and this point falls outside the 50% contour. The remaining nine points have generalized distances from x̄ of 1.20, .59, .83, 1.88, 1.01, 1.02, 5.33, .81, and .97, respectively. Since seven of these distances are less than 1.39, a proportion, .70, of the data falls within the 50% contour. If the observations were normally distributed, we would expect about half, or 5, of them to be within this contour. This large a difference in proportions would ordinarily provide evidence for rejecting the notion of bivariate normality; however, our sample size of 10 is too small to reach this conclusion. (See also Example 4.13.)

Computing the fraction of the points within a contour and subjectively comparing it with the theoretical probability is a useful, but rather rough, procedure. A somewhat more formal method for judging the joint normality of a data set is based on the squared generalized distances

$$ d_j^2 = (x_j - \bar{x})' S^{-1} (x_j - \bar{x}), \qquad j = 1, 2, \ldots, n \qquad (4\text{-}32) $$
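For the bivariate case, the squared generalized distance in (4-32) can be computed without any matrix library, since a 2 x 2 inverse has a closed form. The sketch below is our own illustration with made-up numbers (not the sales-profits data).

```python
def mahalanobis_sq_2d(x, xbar, S):
    """Squared generalized distance (4-32) for the bivariate case.

    S is a 2x2 covariance matrix [[s11, s12], [s12, s22]]; its
    inverse adj(S)/det(S) is written out explicitly to keep the
    sketch dependency-free.
    """
    (s11, s12), (_, s22) = S
    det = s11 * s22 - s12 * s12
    d1, d2 = x[0] - xbar[0], x[1] - xbar[1]
    # (x - xbar)' S^{-1} (x - xbar)
    return (s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det

# Hypothetical covariance matrix and point.
S = [[4.0, 1.0], [1.0, 2.0]]
d_sq = mahalanobis_sq_2d([3.0, 2.0], [1.0, 1.0], S)
print(round(d_sq, 4))          # 8/7, about 1.1429
print(d_sq <= 1.39)            # inside the estimated 50% contour
```

Comparing each d_j^2 with χ²₂(.5) = 1.39, as in Example 4.12, then gives the fraction of points inside the estimated 50% contour.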
where x₁, x₂, ..., x_n are the sample observations. The procedure we are about to describe is not limited to the bivariate case; it can be used for all p ≥ 2.

When the parent population is multivariate normal and both n and n - p are greater than 25 or 30, each of the squared distances d₁², d₂², ..., d_n² should behave like a chi-square random variable. [See Result 4.7 and Equations (4-26) and (4-27).] Although these distances are not independent or exactly chi-square distributed, it is helpful to plot them as if they were. The resulting plot is called a chi-square plot or gamma plot, because the chi-square distribution is a special case of the more general gamma distribution. (See [6].)

To construct the chi-square plot,

1. Order the squared distances in (4-32) from smallest to largest as d²(1) ≤ d²(2) ≤ ... ≤ d²(n).
2. Graph the pairs (q_{c,p}((j - 1/2)/n), d²(j)), where q_{c,p}((j - 1/2)/n) is the 100(j - 1/2)/n quantile of the chi-square distribution with p degrees of freedom.

Quantiles are specified in terms of proportions, whereas percentiles are specified in terms of percentages. The quantiles q_{c,p}((j - 1/2)/n) are related to the upper percentiles of a chi-squared distribution. In particular, q_{c,p}((j - 1/2)/n) = χ²_p((n - j + 1/2)/n).

The plot should resemble a straight line through the origin having slope 1. A systematic curved pattern suggests lack of normality. One or two points far above the line indicate large distances, or outlying observations, that merit further attention.

Example 4.13 (Constructing a chi-square plot)
Let us construct a chi-square plot of the generalized distances given in Example 4.12. The ordered distances and the corresponding chi-square percentiles for p = 2 and n = 10 are listed in the following table:

 j     d²(j)    q_{c,2}((j - 1/2)/10)
 1      .59            .10
 2      .81            .33
 3      .83            .58
 4      .97            .86
 5     1.01           1.20
 6     1.02           1.60
 7     1.20           2.10
 8     1.88           2.77
 9     4.34           3.79
10     5.33           5.99
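The quantile column above has a simple closed form: with 2 degrees of freedom the chi-square CDF is 1 - e^{-q/2}, so q_{c,2}(prob) = -2 ln(1 - prob). The following sketch (our own illustration) reproduces that column.

```python
from math import log

def chi2_quantile_p2(prob):
    """Chi-square quantile with 2 degrees of freedom.

    For p = 2 the chi-square CDF is 1 - exp(-q/2), so the
    quantile has the closed form q = -2 ln(1 - prob).
    """
    return -2.0 * log(1.0 - prob)

n = 10
levels = [(j - 0.5) / n for j in range(1, n + 1)]
print([round(chi2_quantile_p2(p), 2) for p in levels])
# matches the table: .10, .33, .58, .86, 1.20, 1.60, 2.10, 2.77, 3.79, 5.99
```

For general p there is no such closed form, and the quantiles must be read from a table or computed numerically.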
Figure 4.7 A chi-square plot of the ordered distances in Example 4.13.
A graph of the pairs (q_{c,2}((j - 1/2)/10), d²(j)) is shown in Figure 4.7. The points in Figure 4.7 do not lie along the line with slope 1. The smallest distances appear to be too large and the middle distances appear to be too small, relative to the distances expected from bivariate normal populations for samples of size 10. These data do not appear to be bivariate normal; however, the sample size is small, and it is difficult to reach a definitive conclusion. If further analysis of the data were required, it might be reasonable to transform them to observations more nearly bivariate normal. Appropriate transformations are discussed in Section 4.8.

In addition to inspecting univariate plots and scatter plots, we should check multivariate normality by constructing a chi-square, or d², plot. Figure 4.8 contains d² plots based on two computer-generated samples of 30 four-variate normal random vectors. As expected, the plots have a straight-line pattern, but the top two or three ordered squared distances are quite variable. The next example contains a real data set comparable to the simulated data set that produced the plots in Figure 4.8.
Figure 4.8 Chi-square plots for two simulated four-variate normal data sets with n = 30.

Example 4.14 (Evaluating multivariate normality for a four-variable data set)
The data in Table 4.3 were obtained by taking four different measures of stiffness, x₁, x₂, x₃, and x₄, of each of n = 30 boards. The first measurement involves sending a shock wave down the board, the second measurement is determined while vibrating the board, and the last two measurements are obtained from static tests. The squared distances d²_j = (x_j - x̄)' S⁻¹ (x_j - x̄) are also presented in the table.

TABLE 4.3 FOUR MEASUREMENTS OF STIFFNESS

Obs. no.    x1     x2     x3     x4      d²
    1      1889   1651   1561   1778     .60
    2      2403   2048   2087   2197    5.48
    3      2119   1700   1815   2222    7.62
    4      1645   1627   1110   1533    5.21
    5      1976   1916   1614   1883    1.40
    6      1712   1712   1439   1546    2.22
    7      1943   1685   1271   1671    4.99
    8      2104   1820   1717   1874    1.49
    9      2983   2794   2412   2581   12.26
   10      1745   1600   1384   1508     .77
   11      1710   1591   1518   1667    1.93
   12      2046   1907   1627   1898     .46
   13      1840   1841   1595   1741    2.70
   14      1867   1685   1493   1678     .13
   15      1859   1649   1389   1714    1.08
   16      1954   2149   1180   1281   16.85
   17      1325   1170   1002   1176    3.50
   18      1419   1371   1252   1308    3.99
   19      1828   1634   1602   1755    1.36
   20      1725   1594   1313   1646    1.46
   21      2276   2189   1547   2111    9.90
   22      1899   1614   1422   1477    5.06
   23      1633   1513   1290   1516     .80
   24      2061   1867   1646   2037    2.54
   25      1856   1493   1356   1533    4.58
   26      1727   1412   1238   1469    3.40
   27      2168   1896   1701   1834    2.38
   28      1655   1675   1414   1597    3.00
   29      2326   2301   2065   2234    6.28
   30      1490   1382   1214   1284    2.58

Source: Data courtesy of William Galligan.
Figure 4.9 A chi-square plot for the data in Example 4.14.
The marginal distributions appear quite normal (see Exercise 4.33), with the possible exception of specimen (board) 9.

To further evaluate multivariate normality, we constructed the chi-square plot shown in Figure 4.9. The two specimens with the largest squared distances are clearly removed from the straight-line pattern. Together with the next largest point or two, they make the plot appear curved at the upper end. We will return to a discussion of this plot in Example 4.15.

We have discussed some rather simple techniques for checking the multivariate normality assumption. Specifically, we advocate calculating the d²_j, j = 1, 2, ..., n [see Equation (4-32)] and comparing the results with chi-square quantiles. For example, p-variate normality is indicated if

1. Roughly half of the d²_j are less than or equal to q_{c,p}(.50).
2. A plot of the ordered squared distances d²(1) ≤ d²(2) ≤ ... ≤ d²(n) versus the quantiles q_{c,p}((1 - 1/2)/n), q_{c,p}((2 - 1/2)/n), ..., q_{c,p}((n - 1/2)/n), respectively, is nearly a straight line having slope 1 and that passes through the origin.

(See [6] for a more complete exposition of methods for assessing normality.)
We close this section by noting that all measures of goodness of fit suffer the same serious drawback. When the sample size is small, only the most aberrant behavior will be identified as lack of fit. On the other hand, very large samples invariably produce statistically significant lack of fit. Yet the departure from the specified distribution may be very small and technically unimportant to the inferential conclusions.
4.7 DETECTING OUTLIERS AND CLEANING DATA
Most data sets contain one or a few unusual observations that do not seem to belong to the pattern of variability produced by the other observations. With data on a single characteristic, unusual observations are those that are either very large or very small relative to the others. The situation can be more complicated with multivariate data. Before we address the issue of identifying these outliers, we must emphasize that not all outliers are wrong numbers. They may, justifiably, be part of the group and may lead to a better understanding of the phenomena being studied.

Outliers are best detected visually whenever this is possible. When the number of observations n is large, dot plots are not feasible. When the number of characteristics p is large, the large number of scatter plots, p(p - 1)/2, may prevent viewing them all. Even so, we suggest first visually inspecting the data whenever possible.

What should we look for? For a single random variable, the problem is one dimensional, and we look for observations that are far from the others. For instance, a dot diagram with one point well separated to the right of the main cluster reveals a single large observation.

In the bivariate case, the situation is more complicated. Figure 4.10 shows a situation with two unusual observations. The data point circled in the upper right corner of the figure is removed from the pattern, and its second coordinate is large relative to the rest of the x₂ measurements, as shown by the vertical dot diagram. The second outlier, also circled, is far from the elliptical pattern of the rest of the points, but, separately, each of its components has a typical value. This outlier cannot be detected by inspecting the marginal dot diagrams.

In higher dimensions, there can be outliers that cannot be detected from the univariate plots or even the bivariate scatter plots. Here a large value of (x_j - x̄)' S⁻¹ (x_j - x̄) will suggest an unusual observation, even though it cannot be seen visually.
Figure 4.10 Two outliers; one univariate and one bivariate.
Steps for Detecting Outliers

1. Make a dot plot for each variable.
2. Make a scatter plot for each pair of variables.
3. Calculate the standardized values z_jk = (x_jk - x̄_k)/√s_kk for j = 1, 2, ..., n and each column k = 1, 2, ..., p. Examine these standardized values for large or small values.
4. Calculate the generalized squared distances (x_j - x̄)' S⁻¹ (x_j - x̄). Examine these distances for unusually large values. In a chi-square plot, these would be the points farthest from the origin.

In step 3, "large" must be interpreted relative to the sample size and number of variables. There are n × p standardized values. When n = 100 and p = 5, there are
500 values. You expect 1 or 2 of these to exceed 3 or be less than -3, even if the data came from a multivariate distribution that is exactly normal. As a guideline, 3.5 might be considered large for moderate sample sizes.

In step 4, "large" is measured by an appropriate percentile of the chi-square distribution with p degrees of freedom. If the sample size is n = 100, we would expect 5 observations to have values of d²_j that exceed the upper fifth percentile of the chi-square distribution. A more extreme percentile must serve to determine observations that do not fit the pattern of the remaining data.

The data we presented in Table 4.3 concerning lumber have already been cleaned up somewhat. Similar data sets from the same study also contained data on x₅ = tensile strength. Nine observation vectors, out of the total of 112, are given as rows in the following table, along with their standardized values.
  x1     x2     x3     x4     x5       z1      z2      z3      z4       z5
 1631   1528   1452   1559   1602     .06    -.15     .05     .28     -.12
 1770   1677   1707   1738   1785     .64     .43    1.07     .94      .60
 1376   1190    723   1285   2791   -1.01   -1.47   -2.87    -.73   (circled)
 1705   1577   1332   1703   1664     .37     .04     .13     .81     -.43
 1643   1535   1510   1494   1582     .11    -.12     .04    -.20      .28
 1567   1510   1301   1405   1553    -.21    -.22    -.56    -.28     -.31
 1528   1591   1714   1685   1698    -.38     .10    1.10     .26      .75
 1803   1826   1748   2746   1764     .78    1.01    1.23  (circled)   .52
 1587   1554   1352   1554   1551    -.13    -.05    -.35     .26     -.32
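Step 3 of the outlier-detection procedure can be sketched as follows. This is our own illustration with made-up toy data, not the lumber measurements; the cutoff 1.4 is chosen only so the planted outlier in this tiny sample is flagged (with four observations a standardized value cannot exceed 1.5, so the 3.5 guideline from the text would never trigger here).

```python
from math import sqrt

def standardized_values(data):
    """Step 3: z_jk = (x_jk - xbar_k) / sqrt(s_kk), column by column.

    `data` is a list of rows (observations); sample variances use
    the n - 1 divisor.
    """
    n = len(data)
    p = len(data[0])
    cols = []
    for k in range(p):
        col = [row[k] for row in data]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1)
        cols.append([(v - mean) / sqrt(var) for v in col])
    # transpose back to one row of z-scores per observation
    return [list(row) for row in zip(*cols)]

# Toy data: the fourth observation is far from the rest in column 0.
rows = [[1.0, 10.0], [1.2, 11.0], [0.9, 10.5], [9.0, 10.2]]
z = standardized_values(rows)
flags = [(j, k) for j, row in enumerate(z)
         for k, zjk in enumerate(row) if abs(zjk) > 1.4]
print(flags)  # only observation 3, variable 0 is flagged
```

Step 4 would then add the generalized squared distances, which can catch multivariate outliers that no single column reveals.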
The standardized values are based on the sample mean and variance, calculated from all 112 observations. There are two extreme standardized values. Both are too large, with standardized values over 4.5. During their investigation, the researchers recorded measurements by hand in a logbook and then performed calculations that produced the values given in the table. When they checked their records regarding the values pinpointed by this analysis, errors were discovered. The value x₅ = 2791 was corrected to 1241, and x₄ = 2746 was corrected to 1670. Incorrect readings on an individual variable are quickly detected by locating a large leading digit for the standardized value.

The next example returns to the data on lumber discussed in Example 4.14.

Example 4.15
(Detecting outliers in the data on lumber)

Table 4.4 contains the data in Table 4.3, along with the standardized observations. These data consist of four different measures of stiffness, x₁, x₂, x₃, and x₄, on each of n = 30 boards. Recall that the first measurement involves sending a shock wave down the board, the second measurement is determined while vibrating the board, and the last two measurements are obtained from static tests. The standardized measurements are
$$ z_{jk} = \frac{x_{jk} - \bar{x}_k}{\sqrt{s_{kk}}}, \qquad k = 1, 2, 3, 4; \quad j = 1, 2, \ldots, 30 $$

TABLE 4.4 FOUR MEASUREMENTS OF STIFFNESS WITH STANDARDIZED VALUES

Obs. no.    x1     x2     x3     x4      z1     z2     z3     z4      d²
    1      1889   1651   1561   1778    -.1    -.3     .2     .2      .60
    2      2403   2048   2087   2197    1.5     .9    1.9    1.5     5.48
    3      2119   1700   1815   2222     .7    -.2    1.0    1.5     7.62
    4      1645   1627   1110   1533    -.8    -.4   -1.3    -.6     5.21
    5      1976   1916   1614   1883     .2     .5     .3     .5     1.40
    6      1712   1712   1439   1546    -.6    -.1    -.2    -.6     2.22
    7      1943   1685   1271   1671     .1    -.2    -.8    -.2     4.99
    8      2104   1820   1717   1874     .6     .2     .7     .5     1.49
    9      2983   2794   2412   2581    3.3    3.3    3.0    2.7    12.26
   10      1745   1600   1384   1508    -.5    -.5    -.4    -.7      .77
   11      1710   1591   1518   1667    -.6    -.5     .0    -.2     1.93
   12      2046   1907   1627   1898     .4     .5     .4     .5      .46
   13      1840   1841   1595   1741    -.2     .3     .3     .0     2.70
   14      1867   1685   1493   1678    -.1    -.2    -.1    -.1      .13
   15      1859   1649   1389   1714    -.1    -.3    -.4    -.0     1.08
   16      1954   2149   1180   1281     .1    1.3   -1.1   -1.4    16.85
   17      1325   1170   1002   1176   -1.8   -1.8   -1.7   -1.7     3.50
   18      1419   1371   1252   1308   -1.5   -1.2    -.8   -1.3     3.99
   19      1828   1634   1602   1755    -.2    -.4     .3     .1     1.36
   20      1725   1594   1313   1646    -.6    -.5    -.6    -.2     1.46
   21      2276   2189   1547   2111    1.1    1.4     .1    1.2     9.90
   22      1899   1614   1422   1477    -.0    -.4    -.3    -.8     5.06
   23      1633   1513   1290   1516    -.8    -.7    -.7    -.6      .80
   24      2061   1867   1646   2037     .5     .4     .5    1.0     2.54
   25      1856   1493   1356   1533    -.2    -.8    -.5    -.6     4.58
   26      1727   1412   1238   1469    -.6   -1.1    -.9    -.8     3.40
   27      2168   1896   1701   1834     .8     .5     .6     .3     2.38
   28      1655   1675   1414   1597    -.8    -.2    -.3    -.4     3.00
   29      2326   2301   2065   2234    1.3    1.7    1.8    1.6     6.28
   30      1490   1382   1214   1284   -1.3   -1.2   -1.0   -1.4     2.58
and the squares of the distances are d²_j = (x_j - x̄)' S⁻¹ (x_j - x̄). The last column in Table 4.4 reveals that specimen 16 is a multivariate outlier, since χ²₄(.005) = 14.86; yet all of the individual measurements are well within their respective univariate scatters. Specimen 9 also has a large d² value.

The two specimens (9 and 16) with large squared distances stand out as clearly different from the rest of the pattern in Figure 4.9. Once these two points are removed, the remaining pattern conforms to the expected straight-line relation. Scatter plots for the lumber stiffness measurements are given in Figure 4.11. The solid dots in these figures correspond to specimens
Figure 4.11 Scatter plots for the lumber stiffness data with specimens 9 and 16 plotted as solid dots.

9 and 16. Although the dot for specimen 16 stands out in all the plots, the dot for specimen 9 is "hidden" in the scatter plot of x₃ versus x₄ and nearly hidden in that of x₁ versus x₃. However, specimen 9 is clearly identified as a multivariate outlier when all four variables are considered.

Scientists specializing in the properties of wood conjectured that specimen 9 was unusually clear and therefore very stiff and strong. It would also appear that specimen 16 is a bit unusual, since both of its dynamic measurements are above average and the two static measurements are low. Unfortunately, it was not possible to investigate this specimen further because the material was no longer available.
If outliers are identified, they should be examined for content, as was done in the case of the data on lumber stiffness in Example 4.15. Depending upon the nature of the outliers and the objectives of the investigation, outliers may be deleted or appropriately "weighted" in a subsequent analysis. Even though many statistical techniques assume normal populations, those based on the sample mean vectors usually will not be disturbed by a few moderate outliers. Hawkins [7] gives an extensive treatment of the subject of outliers.

4.8 TRANSFORMATIONS TO NEAR NORMALITY
If normality is not a viable assumption, what is the next step? One alternative is to ignore the findings of a normality check and proceed as if the data were normally distributed. This practice is not recommended, since, in many instances, it could lead to incorrect conclusions. A second alternative is to make nonnormal data more "normal looking" by considering transformations of the data. Normal-theory analyses can then be carried out with the suitably transformed data.

Transformations are nothing more than a reexpression of the data in different units. For example, when a histogram of positive observations exhibits a long right-hand tail, transforming the observations by taking their logarithms or square roots will often markedly improve the symmetry about the mean and the approximation to a normal distribution. It frequently happens that the new units provide more natural expressions of the characteristics being studied.

Appropriate transformations are suggested by (1) theoretical considerations or (2) the data themselves (or both). It has been shown theoretically that data that are counts can often be made more normal by taking their square roots. Similarly, the logit transformation applied to proportions and Fisher's z-transformation applied to correlation coefficients yield quantities that are approximately normally distributed.

In many instances, the choice of a transformation to improve the approximation to normality is not obvious. For such cases, it is convenient to let the data suggest a transformation. A useful family of transformations for this purpose is the family of power transformations.

Power transformations are defined only for positive variables. However, this is not as restrictive as it seems, because a single constant can be added to each observation in the data set if some of the values are negative.
Let x represent an arbitrary observation. The power family of transformations is indexed by a parameter λ. A given value for λ implies a particular transformation. For example, consider x^λ with λ = -1. Since x⁻¹ = 1/x, this choice of λ corresponds to the reciprocal transformation. We can trace the family of transformations as λ ranges from negative to positive powers of x. For λ = 0, we define x⁰ = ln x. A sequence of possible transformations is

$$ \ldots, \quad x^{-1} = \frac{1}{x}, \quad x^{0} = \ln x, \quad x^{1/4} = \sqrt[4]{x}, \quad x^{1/2} = \sqrt{x}, \quad \ldots $$

where transformations toward the left of the sequence shrink large values of x and those toward the right increase large values of x.
To select a power transformation, an investigator looks at the marginal dot diagram or histogram and decides whether large values have to be "pulled in" or "pushed out" to improve the symmetry about the mean. Trial-and-error calculations with a few of the foregoing transformations should produce an improvement. The final choice should always be examined by a Q-Q plot or other checks to see whether the tentative normal assumption is satisfactory.

The transformations we have been discussing are data based in the sense that it is only the appearance of the data themselves that influences the choice of an appropriate transformation. There are no external considerations involved, although the transformation actually used is often determined by some mix of information supplied by the data and extra-data factors, such as simplicity or ease of interpretation.

A convenient analytical method is available for choosing a power transformation. We begin by focusing our attention on the univariate case. Box and Cox [3] consider the slightly modified family of power transformations

$$ x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[1ex] \ln x, & \lambda = 0 \end{cases} \qquad (4\text{-}34) $$

which is continuous in λ for x > 0. (See [8].) Given the observations x₁, x₂, ..., x_n, the Box-Cox solution for the choice of an appropriate power λ is the solution that maximizes the expression

$$ \ell(\lambda) = -\frac{n}{2} \ln \left[ \frac{1}{n} \sum_{j=1}^{n} \left( x_j^{(\lambda)} - \overline{x^{(\lambda)}} \right)^2 \right] + (\lambda - 1) \sum_{j=1}^{n} \ln x_j \qquad (4\text{-}35) $$

We note that x_j^{(λ)} is defined in (4-34) and

$$ \overline{x^{(\lambda)}} = \frac{1}{n} \sum_{j=1}^{n} x_j^{(\lambda)} = \frac{1}{n} \sum_{j=1}^{n} \left( \frac{x_j^{\lambda} - 1}{\lambda} \right) \qquad (4\text{-}36) $$

is the arithmetic average of the transformed observations. The first term in (4-35) is, apart from a constant, the logarithm of a normal likelihood function, after maximizing it with respect to the population mean and variance parameters.
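A grid search for the maximizing λ in (4-35) can be sketched as follows. This is our own illustration; here it is applied to the first ten door-closed radiation readings from Table 4.1, not to the full sample analyzed in the text.

```python
from math import log

def boxcox(x, lam):
    """Modified power transformation (4-34)."""
    return log(x) if lam == 0 else (x ** lam - 1.0) / lam

def ell(lam, xs):
    """Profile log-likelihood (4-35) for a given lambda."""
    n = len(xs)
    y = [boxcox(x, lam) for x in xs]
    ybar = sum(y) / n
    var = sum((v - ybar) ** 2 for v in y) / n   # 1/n divisor, as in (4-35)
    return -(n / 2.0) * log(var) + (lam - 1.0) * sum(log(x) for x in xs)

# First ten door-closed radiation readings from Table 4.1.
xs = [0.15, 0.09, 0.18, 0.10, 0.05, 0.12, 0.08, 0.05, 0.08, 0.10]
grid = [k / 100.0 for k in range(-100, 151)]   # lambda from -1.00 to 1.50
lam_hat = max(grid, key=lambda lam: ell(lam, xs))
print(lam_hat)
```

A finer grid (or a one-dimensional optimizer) near the maximizing value refines λ̂, and in practice a nearby simple power such as 0, 1/4, or 1/2 is often preferred.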
The calculation of ℓ(λ) for many values of λ is an easy task for a computer. It is helpful to have a graph of ℓ(λ) versus λ, as well as a tabular display of the pairs (λ, ℓ(λ)), in order to study the behavior near the maximizing value λ̂. For instance, if either λ = 0 (logarithm) or λ = 1/2 (square root) is near λ̂, one of these may be preferred because of its simplicity.

Rather than program the calculation of (4-35), some statisticians recommend the equivalent procedure of fixing λ, creating the new variable

$$ y_j^{(\lambda)} = \frac{x_j^{\lambda} - 1}{\lambda \left[ \left( \prod_{i=1}^{n} x_i \right)^{1/n} \right]^{\lambda - 1}}, \qquad j = 1, \ldots, n \qquad (4\text{-}37) $$

and then calculating the sample variance. The minimum of the variance occurs at the same λ that maximizes (4-35).

Comment. It is now understood that the transformation obtained by maximizing ℓ(λ) usually improves the approximation to normality. However, there is no guarantee that even the best choice of λ will produce a transformed set of values that adequately conform to a normal distribution. The outcomes produced by a transformation selected according to (4-35) should always be carefully examined for possible violations of the tentative assumption of normality. This warning applies with equal force to transformations selected by any other technique.

Example 4.16 (Determining a power transformation for univariate data)
We gave readings of the microwave radiation emitted through the closed doors of n = 42 ovens in Example 4.10. The Q-Q plot of these data in Figure 4.6 indicates that the observations deviate from what would be expected if they were normally distributed. Since all the observations are positive, let us perform a power transformation of the data which, we hope, will produce results that are more nearly normal. Restricting our attention to the family of transformations in (4-34), we must find that value of λ maximizing the function ℓ(λ) in (4-35). The pairs (λ, ℓ(λ)) are listed in the following table for several values of λ:

   λ       ℓ(λ)        λ       ℓ(λ)
 -1.00     70.52      .40     106.20
  -.90     75.65      .50     105.50
  -.80     80.46      .60     104.43
  -.70     84.94      .70     103.03
  -.60     89.06      .80     101.33
  -.50     92.79      .90      99.34
  -.40     96.10     1.00      97.10
  -.30     98.97     1.10      94.64
  -.20    101.39     1.20      91.96
  -.10    103.35     1.30      89.10
   .00    104.83     1.40      86.07
   .10    105.84     1.50      82.88
   .20    106.39
   .30    106.51
Figure 4.12 Plot of ℓ(λ) versus λ for the radiation data (door closed). The maximum occurs at λ̂ = .28.
The curve of ℓ(λ) versus λ that allows the more exact determination λ̂ = .28 is shown in Figure 4.12.

It is evident from both the table and the plot that a value of λ̂ around .30 maximizes ℓ(λ). For convenience, we choose λ̂ = .25. The data x_j were reexpressed as

$$ x_j^{(1/4)} = \frac{x_j^{1/4} - 1}{\tfrac{1}{4}}, \qquad j = 1, 2, \ldots, 42 $$
1 98
Chapter 4
The M u ltiva r i ate Norm a l D istribution X
( 1 /4 ) (j )
- .50
- 1 .00
- 1 . 50
- 2.00
- 2.50
- 3.00 ---'----'---'---�--
- 2.0
- 1 .0
.0
1 .0
2.0
3.0
q (j )
Figure 4. 1 3
A 0-0 plot of the tra nsformed radiation data {doo r closed). (The i ntegers i n the plot i n d icate the n u m ber of points occ u pying the same locati on.)
Transforming M u ltivariate Observati ons
With multivariate observations, a power transformation must be selected for each of the variables. Let A1 , A2 , . . . , AP be the power transformations for the p measured characteristics. Each Ak can be selected by maximizing ( 4-38 ) where x1 k , x2 k , . . . , xn k are the n observations on the kth variable, k = 1, 2, . . Here
.
, p.
(4-3 9) is the arithmetic average of the transformed observations. The jth transformed mul tivariate observation is
\[
\mathbf{x}_j^{(\hat{\lambda})} = \left[\frac{x_{j1}^{\hat{\lambda}_1} - 1}{\hat{\lambda}_1},\; \frac{x_{j2}^{\hat{\lambda}_2} - 1}{\hat{\lambda}_2},\; \ldots,\; \frac{x_{jp}^{\hat{\lambda}_p} - 1}{\hat{\lambda}_p}\right]'
\]

where λ̂1, λ̂2, ..., λ̂p are the values that individually maximize (4-38).

The procedure just described is equivalent to making each marginal distribution approximately normal. Although normal marginals are not sufficient to ensure that the joint distribution is normal, in practical applications this may be good enough. If not, we could start with the values λ̂1, λ̂2, ..., λ̂p obtained from the preceding transformations and iterate toward the set of values λ' = [λ1, λ2, ..., λp], which collectively maximizes

\[
\ell(\lambda_1, \lambda_2, \ldots, \lambda_p) = -\frac{n}{2}\ln|\mathbf{S}(\lambda)| + (\lambda_1 - 1)\sum_{j=1}^{n}\ln x_{j1} + (\lambda_2 - 1)\sum_{j=1}^{n}\ln x_{j2} + \cdots + (\lambda_p - 1)\sum_{j=1}^{n}\ln x_{jp} \tag{4-40}
\]

where S(λ) is the sample covariance matrix computed from the transformed observations

\[
\mathbf{x}_j^{(\lambda)} = \left[\frac{x_{j1}^{\lambda_1} - 1}{\lambda_1},\; \frac{x_{j2}^{\lambda_2} - 1}{\lambda_2},\; \ldots,\; \frac{x_{jp}^{\lambda_p} - 1}{\lambda_p}\right]', \qquad j = 1, 2, \ldots, n
\]

Maximizing (4-40) not only is substantially more difficult than maximizing the individual expressions in (4-38), but also is unlikely to yield remarkably better results. The selection method based on Equation (4-40) is equivalent to maximizing a multivariate likelihood over μ, Σ, and λ, whereas the method based on (4-38) corresponds to maximizing the kth univariate likelihood over μ_k, σ_kk, and λ_k. The latter likelihood is generated by pretending there is some λ_k for which the observations (x_{jk}^{λ_k} − 1)/λ_k, j = 1, 2, ..., n, have a normal distribution. See [3] and [2] for detailed discussions of the univariate and multivariate cases, respectively. (Also, see [8].)
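A sketch of how (4-40) can be maximized by brute force over a grid, in the spirit of the contour plot in Example 4.17. Nothing here is the authors' code: the data are simulated (lognormal, hence positive), and the grid bounds are arbitrary.

```python
import numpy as np
from itertools import product

def power_transform(x, lam):
    """Box-Cox transform (x^lam - 1)/lam, with the natural-log limit at lam = 0."""
    return np.log(x) if abs(lam) < 1e-12 else (x ** lam - 1.0) / lam

def joint_loglik(lams, X):
    """The joint criterion l(lam_1, ..., lam_p) of (4-40) for an n x p positive data matrix."""
    n, p = X.shape
    Z = np.column_stack([power_transform(X[:, k], lams[k]) for k in range(p)])
    S = np.atleast_2d(np.cov(Z, rowvar=False, bias=True))   # S(lambda), divisor n
    sign, logdet = np.linalg.slogdet(S)
    return -0.5 * n * logdet + sum(
        (lams[k] - 1.0) * np.log(X[:, k]).sum() for k in range(p))

# coarse grid search over (lam1, lam2), as behind a contour plot like Figure 4.15
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=0.5, size=(42, 2))        # illustrative positive data
grid = np.round(np.arange(0.0, 0.51, 0.05), 2)
best = max(product(grid, repeat=2), key=lambda l: joint_loglik(l, X))
```

For p = 1 the criterion reduces to (4-38), which is a useful check on the implementation.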
Example 4.17 (Determining power transformations for bivariate data)

Radiation measurements were also recorded through the open doors of the n = 42 microwave ovens introduced in Example 4.10. The amount of radiation emitted through the open doors of these ovens is listed in Table 4.5.

In accordance with the procedure outlined in Example 4.16, a power transformation for these data was selected by maximizing ℓ(λ) in (4-35). The approximate maximizing value was λ̂ = .30. Figure 4.14 on page 201 shows Q-Q plots of the untransformed and transformed door-open radiation data. (These data were actually transformed by taking the fourth root, as in Example 4.16.) It is clear from the figure that the transformed data are more nearly normal, although the normal approximation is not as good as it was for the door-closed data.

Let us denote the door-closed data by x_{11}, x_{21}, ..., x_{42,1} and the door-open data by x_{12}, x_{22}, ..., x_{42,2}. Choosing a power transformation for each set by maximizing the expression in (4-35) is equivalent to maximizing ℓ_k(λ) in (4-38) with k = 1, 2. Thus, using the outcomes from Example 4.16 and the foregoing results, we have λ̂1 = .30 and λ̂2 = .30. These powers were determined for the marginal distributions of x1 and x2.

We can consider the joint distribution of x1 and x2 and simultaneously determine the pair of powers (λ1, λ2) that makes this joint distribution approximately bivariate normal. To do this, we must maximize ℓ(λ1, λ2) in (4-40) with respect to both λ1 and λ2.
TABLE 4.5 RADIATION DATA (DOOR OPEN)

Oven no.  Radiation    Oven no.  Radiation    Oven no.  Radiation
    1       .30           16       .20           31       .10
    2       .09           17       .04           32       .10
    3       .30           18       .10           33       .10
    4       .10           19       .01           34       .30
    5       .10           20       .60           35       .12
    6       .12           21       .12           36       .25
    7       .09           22       .10           37       .20
    8       .10           23       .05           38       .40
    9       .09           24       .05           39       .33
   10       .10           25       .15           40       .32
   11       .07           26       .30           41       .12
   12       .05           27       .15           42       .12
   13       .01           28       .09
   14       .45           29       .09
   15       .12           30       .28

Source: Data courtesy of J. D. Cryer.
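With the readings of Table 4.5 loaded in oven order, the λ̂ ≈ .30 reported in Example 4.17 can be checked with a small grid search over the criterion (4-35). This re-implementation is an illustrative sketch, not code from the book.

```python
import numpy as np

# the n = 42 door-open radiation readings from Table 4.5, in oven order
door_open = np.array([
    .30, .09, .30, .10, .10, .12, .09, .10, .09, .10, .07, .05, .01, .45, .12,
    .20, .04, .10, .01, .60, .12, .10, .05, .05, .15, .30, .15, .09, .09, .28,
    .10, .10, .10, .30, .12, .25, .20, .40, .33, .32, .12, .12])

def loglik(lam, x):
    """Profile log-likelihood l(lambda) from (4-35)."""
    z = np.log(x) if abs(lam) < 1e-12 else (x ** lam - 1.0) / lam
    return -0.5 * len(x) * np.log(z.var()) + (lam - 1.0) * np.log(x).sum()

grid = np.round(np.arange(-1.0, 1.51, 0.01), 2)
lam_open = grid[int(np.argmax([loglik(l, door_open) for l in grid]))]
```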
Figure 4.14 Q-Q plots of (a) the original and (b) the transformed radiation data (with door open). (The integers in the plot indicate the number of points occupying the same location.)
We computed ℓ(λ1, λ2) for a grid of (λ1, λ2) values covering 0 ≤ λ1 ≤ .50 and 0 ≤ λ2 ≤ .50, and we constructed the contour plot shown in Figure 4.15 on page 202. We see that the maximum occurs at about (λ̂1, λ̂2) = (.16, .16). The "best" power transformations for this bivariate case do not differ substantially from those obtained by considering each marginal distribution.
Figure 4.15 Contour plot of ℓ(λ1, λ2) for the radiation data.
As we saw in Example 4.17, making each marginal distribution approximately normal is roughly equivalent to addressing the bivariate distribution directly and making it approximately normal. It is generally easier to select appropriate transformations for the marginal distributions than for the joint distribution.

If the data include some large negative values, a more general transformation (see Yeo and Johnson [13]) should be applied:

\[
x^{(\lambda)} =
\begin{cases}
\{(x + 1)^{\lambda} - 1\}/\lambda & x \ge 0,\ \lambda \ne 0 \\
\log(x + 1) & x \ge 0,\ \lambda = 0 \\
-\{(-x + 1)^{2-\lambda} - 1\}/(2 - \lambda) & x < 0,\ \lambda \ne 2 \\
-\log(-x + 1) & x < 0,\ \lambda = 2
\end{cases}
\]
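The four cases translate directly into a function; this sketch assumes only the definition displayed above.

```python
import numpy as np

def yeo_johnson(x, lam):
    """Yeo-Johnson transform, defined for all real x (Yeo and Johnson [13])."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.empty_like(x)
    pos = x >= 0
    if abs(lam) > 1e-12:                       # x >= 0, lambda != 0
        out[pos] = ((x[pos] + 1.0) ** lam - 1.0) / lam
    else:                                      # x >= 0, lambda == 0
        out[pos] = np.log(x[pos] + 1.0)
    if abs(lam - 2.0) > 1e-12:                 # x < 0, lambda != 2
        out[~pos] = -(((-x[~pos] + 1.0) ** (2.0 - lam) - 1.0) / (2.0 - lam))
    else:                                      # x < 0, lambda == 2
        out[~pos] = -np.log(-x[~pos] + 1.0)
    return out
```

Note that at λ = 1 the transform is the identity on both branches, and for nonnegative x it reduces to the shifted Box-Cox family.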
EXERCISES
4.1. Consider a bivariate normal distribution with μ1 = 1, μ2 = 3, σ11 = 2, σ22 = 1, and ρ12 = .8.
(a) Write out the bivariate normal density.
(b) Write out the squared statistical distance expression (x − μ)'Σ⁻¹(x − μ) as a quadratic function of x1 and x2.
4.2. Consider a bivariate normal population with μ1 = 0, μ2 = 2, σ11 = 2, σ22 = 1, and ρ12 = .5.
(a) Write out the bivariate normal density.
(b) Write out the squared generalized distance expression (x − μ)'Σ⁻¹(x − μ) as a function of x1 and x2.
(c) Determine (and sketch) the constant-density contour that contains 50% of the probability.
4.3. Let X be N3(μ, Σ) with μ' = [−3, 1, 4] and

\[
\boldsymbol{\Sigma} = \begin{bmatrix} 1 & -2 & 0 \\ -2 & 5 & 0 \\ 0 & 0 & 2 \end{bmatrix}
\]

Which of the following random variables are independent? Explain.
(a) X1 and X2
(b) X2 and X3
(c) (X1, X2) and X3
(d) (X1 + X2)/2 and X3
(e) X2 and X2 − (5/2)X1 − X3

4.4. Let X be N3(μ, Σ) with μ' = [2, −3, 1] and

\[
\boldsymbol{\Sigma} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 3 & 2 \\ 1 & 2 & 2 \end{bmatrix}
\]

(a) Find the distribution of 3X1 − 2X2 + X3.
(b) Relabel the variables if necessary, and find a 2 × 1 vector a such that X2 and X2 − a'[X1, X3]' are independent.
4.5. Specify each of the following.
(a) The conditional distribution of X1, given that X2 = x2, for the joint distribution in Exercise 4.2.
(b) The conditional distribution of X2, given that X1 = x1 and X3 = x3, for the joint distribution in Exercise 4.3.
(c) The conditional distribution of X3, given that X1 = x1 and X2 = x2, for the joint distribution in Exercise 4.4.

4.6. Let X be distributed as N3(μ, Σ), where μ' = [1, −1, 2] and

\[
\boldsymbol{\Sigma} = \begin{bmatrix} 4 & 0 & -1 \\ 0 & 5 & 0 \\ -1 & 0 & 2 \end{bmatrix}
\]

Which of the following random variables are independent? Explain.
(a) X1 and X2
(b) X1 and X3
(c) X2 and X3
(d) (X1, X3) and X2
(e) X1 and X1 + 3X2 − 2X3
4.7. Refer to Exercise 4.6 and specify each of the following.
(a) The conditional distribution of X1, given that X3 = x3.
(b) The conditional distribution of X1, given that X2 = x2 and X3 = x3.

4.8. (Example of a nonnormal bivariate distribution with normal marginals.) Let X1 be N(0, 1), and let

\[
X_2 = \begin{cases} -X_1 & \text{if } -1 \le X_1 \le 1 \\ X_1 & \text{otherwise} \end{cases}
\]

Show each of the following.
(a) X2 also has an N(0, 1) distribution.
(b) X1 and X2 do not have a bivariate normal distribution.
Hint:
(a) Since X1 is N(0, 1), P[−1 < X1 ≤ x] = P[−x ≤ X1 < 1] for any x. When −1 < x2 < 1, P[X2 ≤ x2] = P[X2 ≤ −1] + P[−1 < X2 ≤ x2] = P[X1 ≤ −1] + P[−1 < −X1 ≤ x2] = P[X1 ≤ −1] + P[−x2 ≤ X1 < 1]. But P[−x2 ≤ X1 < 1] = P[−1 < X1 ≤ x2] from the symmetry argument in the first line of this hint. Thus, P[X2 ≤ x2] = P[X1 ≤ −1] + P[−1 < X1 ≤ x2] = P[X1 ≤ x2], which is a standard normal probability.
(b) Consider the linear combination X1 − X2, which equals zero with probability P[|X1| > 1] = .3174.

4.9. Refer to Exercise 4.8, but modify the construction by replacing the break point 1 by c so that

\[
X_2 = \begin{cases} -X_1 & \text{if } -c \le X_1 \le c \\ X_1 & \text{elsewhere} \end{cases}
\]

Show that c can be chosen so that Cov(X1, X2) = 0, but that the two random variables are not independent.
Hint: For c = 0, evaluate Cov(X1, X2) = E[X1(X1)]. For c very large, evaluate Cov(X1, X2) ≈ E[X1(−X1)].

4.10. Show each of the following.

(a) \(\begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix} = |\mathbf{A}|\,|\mathbf{B}|\)

(b) \(\begin{vmatrix} \mathbf{A} & \mathbf{C} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix} = |\mathbf{A}|\,|\mathbf{B}|\) for \(|\mathbf{A}| \ne 0\)

Hint:
(a) \(\begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix} = \begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix} \begin{vmatrix} \mathbf{I} & \mathbf{0} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix}\). Expanding the determinant \(\begin{vmatrix} \mathbf{I} & \mathbf{0} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix}\) by the first row (see Definition 2A.24) gives 1 times a determinant of the same form, with the order of I reduced by one. This procedure is repeated until 1 × |B| is obtained. Similarly, expanding the determinant \(\begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix}\) by the last row gives |A|.
(b) \(\begin{vmatrix} \mathbf{A} & \mathbf{C} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix} = \begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix} \begin{vmatrix} \mathbf{I} & \mathbf{A}^{-1}\mathbf{C} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix}\). But expanding the determinant \(\begin{vmatrix} \mathbf{I} & \mathbf{A}^{-1}\mathbf{C} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix}\) by the last row gives 1. Now use the result in Part a.
4.11. Show that, if A is square,

\[
|\mathbf{A}| = |\mathbf{A}_{22}|\,|\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21}| \quad \text{for } |\mathbf{A}_{22}| \ne 0
= |\mathbf{A}_{11}|\,|\mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12}| \quad \text{for } |\mathbf{A}_{11}| \ne 0
\]

Hint: Partition A and verify that

\[
\begin{bmatrix} \mathbf{I} & -\mathbf{A}_{12}\mathbf{A}_{22}^{-1} \\ \mathbf{0}' & \mathbf{I} \end{bmatrix}
\begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}
\begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{A}_{22}^{-1}\mathbf{A}_{21} & \mathbf{I} \end{bmatrix}
=
\begin{bmatrix} \mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21} & \mathbf{0} \\ \mathbf{0}' & \mathbf{A}_{22} \end{bmatrix}
\]

Take determinants on both sides of this equality. Use Exercise 4.10 for the first and third determinants on the left and for the determinant on the right. The second equality for |A| follows by considering the analogous factorization

\[
\begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{A}_{21}\mathbf{A}_{11}^{-1} & \mathbf{I} \end{bmatrix}
\begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}
\begin{bmatrix} \mathbf{I} & -\mathbf{A}_{11}^{-1}\mathbf{A}_{12} \\ \mathbf{0}' & \mathbf{I} \end{bmatrix}
\]

4.12. Show that, for A symmetric,

\[
\mathbf{A}^{-1} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}^{-1}
=
\begin{bmatrix}
(\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21})^{-1} & -(\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21})^{-1}\mathbf{A}_{12}\mathbf{A}_{22}^{-1} \\
-\mathbf{A}_{22}^{-1}\mathbf{A}_{21}(\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21})^{-1} & \mathbf{A}_{22}^{-1} + \mathbf{A}_{22}^{-1}\mathbf{A}_{21}(\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21})^{-1}\mathbf{A}_{12}\mathbf{A}_{22}^{-1}
\end{bmatrix}
\]

Thus, (A11 − A12A22⁻¹A21)⁻¹ is the upper left-hand block of A⁻¹.
Hint: Premultiply the expression in the hint to Exercise 4.11 by \(\begin{bmatrix} \mathbf{I} & -\mathbf{A}_{12}\mathbf{A}_{22}^{-1} \\ \mathbf{0}' & \mathbf{I} \end{bmatrix}^{-1}\) and postmultiply by \(\begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{A}_{22}^{-1}\mathbf{A}_{21} & \mathbf{I} \end{bmatrix}^{-1}\). Take inverses of the resulting expression.

4.13. Show the following if |Σ| ≠ 0.
(a) Check that |Σ| = |Σ22||Σ11 − Σ12Σ22⁻¹Σ21|. (Note that |Σ| can be factored into the product of contributions from the marginal and conditional distributions.)
(b) Check that

\[
(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})
= [\mathbf{x}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)]'(\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21})^{-1}[\mathbf{x}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)]
+ (\mathbf{x}_2 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)
\]

(Thus, the joint density exponent can be written as the sum of two terms corresponding to contributions from the conditional and marginal distributions.)
(c) Given the results in Parts a and b, identify the marginal distribution of X2 and the conditional distribution of X1 | X2 = x2.
Hint:
(a) Apply Exercise 4.11.
(b) Note from Exercise 4.12 that we can write (x − μ)'Σ⁻¹(x − μ) with Σ⁻¹ expressed in the partitioned form of that exercise. If we group the product so that

\[
\begin{bmatrix} \mathbf{I} & -\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1} \\ \mathbf{0}' & \mathbf{I} \end{bmatrix}
\begin{bmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{bmatrix}
=
\begin{bmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2) \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{bmatrix}
\]

the result follows.

4.14. If X is distributed as Np(μ, Σ) with |Σ| ≠ 0, show that the joint density can be written as the product of marginal densities for

X1 (q × 1)  and  X2 ((p − q) × 1)  if  Σ12 (q × (p − q)) = 0

Hint: Show by block multiplication that

\[
\begin{bmatrix} \boldsymbol{\Sigma}_{11}^{-1} & \mathbf{0} \\ \mathbf{0}' & \boldsymbol{\Sigma}_{22}^{-1} \end{bmatrix}
\text{ is the inverse of }
\boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \mathbf{0} \\ \mathbf{0}' & \boldsymbol{\Sigma}_{22} \end{bmatrix}
\]

Then write

\[
(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})
= [(\mathbf{x}_1 - \boldsymbol{\mu}_1)',\ (\mathbf{x}_2 - \boldsymbol{\mu}_2)']
\begin{bmatrix} \boldsymbol{\Sigma}_{11}^{-1}(\mathbf{x}_1 - \boldsymbol{\mu}_1) \\ \boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2) \end{bmatrix}
= (\mathbf{x}_1 - \boldsymbol{\mu}_1)'\boldsymbol{\Sigma}_{11}^{-1}(\mathbf{x}_1 - \boldsymbol{\mu}_1) + (\mathbf{x}_2 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)
\]

Note that |Σ| = |Σ11||Σ22| from Exercise 4.10(a). Now factor the joint density.
4.15. Show that

\[
\sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}} - \boldsymbol{\mu})' \quad \text{and} \quad \sum_{j=1}^{n} (\bar{\mathbf{x}} - \boldsymbol{\mu})(\mathbf{x}_j - \bar{\mathbf{x}})'
\]

are both p × p matrices of zeros. Here x'_j = [x_{j1}, x_{j2}, ..., x_{jp}], j = 1, 2, ..., n, and x̄ = (1/n)Σ_{j=1}^n x_j.

4.16. Let X1, X2, X3, and X4 be independent Np(μ, Σ) random vectors.
(a) Find the marginal distributions for each of the random vectors

V1 = (1/4)X1 − (1/4)X2 + (1/4)X3 − (1/4)X4

and

V2 = (1/4)X1 + (1/4)X2 − (1/4)X3 − (1/4)X4

(b) Find the joint density of the random vectors V1 and V2 defined in (a).

4.17. Let X1, X2, X3, X4, and X5 be independent and identically distributed random vectors with mean vector μ and covariance matrix Σ. Find the mean vector and covariance matrices for each of the two linear combinations of random vectors

(1/5)X1 + (1/5)X2 + (1/5)X3 + (1/5)X4 + (1/5)X5
and

X1 − X2 + X3 − X4 + X5

in terms of μ and Σ. Also, obtain the covariance between the two linear combinations of random vectors.

4.18. Find the maximum likelihood estimates of the 2 × 1 mean vector μ and the 2 × 2 covariance matrix Σ based on the random sample

\[
\mathbf{X} = \begin{bmatrix} 3 & 6 \\ 4 & 4 \\ 5 & 7 \\ 4 & 7 \end{bmatrix}
\]

from a bivariate normal population.

4.19. Let X1, X2, ..., X20 be a random sample of size n = 20 from an N6(μ, Σ) population. Specify each of the following completely.
(a) The distribution of (X1 − μ)'Σ⁻¹(X1 − μ)
(b) The distributions of X̄ and √n(X̄ − μ)
(c) The distribution of (n − 1)S

4.20. For the random variables X1, X2, ..., X20 in Exercise 4.19, specify the distribution of B(19S)B' in each case.
(a) \(\mathbf{B} = \begin{bmatrix} 1 & -\tfrac{1}{2} & -\tfrac{1}{2} & 0 & 0 & 0 \\ 0 & 0 & 0 & \tfrac{1}{2} & \tfrac{1}{2} & -1 \end{bmatrix}\)
(b) \(\mathbf{B} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \end{bmatrix}\)

4.21. Let X1, ..., X60 be a random sample of size 60 from a four-variate normal distribution having mean μ and covariance Σ. Specify each of the following completely.
(a) The distribution of X̄
(b) The distribution of (X1 − μ)'Σ⁻¹(X1 − μ)
(c) The distribution of n(X̄ − μ)'Σ⁻¹(X̄ − μ)
(d) The approximate distribution of n(X̄ − μ)'S⁻¹(X̄ − μ)

4.22. Let X1, X2, ..., X75 be a random sample from a population distribution with mean μ and covariance matrix Σ. What is the approximate distribution of each of the following?
(a) X̄
(b) n(X̄ − μ)'S⁻¹(X̄ − μ)

4.23. Consider the annual rates of return (including dividends) on the Dow-Jones industrial average for the years 1963-1972. These data, multiplied by 100, are 20.6, 18.7, 14.2, −15.7, 19.0, 7.7, −11.6, 8.8, 9.8, and 18.2. Use these 10 observations to complete the following.
(a) Construct a Q-Q plot. Do the data seem to be normally distributed? Explain.
(b) Carry out a test of normality based on the correlation coefficient rQ. [See (4-31).] Let the significance level be α = .10.

4.24. Exercise 1.4 contains data on three variables for the 10 largest industrial corporations as of April 1990. For the sales (x1) and profits (x2) data:
(a) Construct Q-Q plots. Do these data appear to be normally distributed? Explain.
(b) Carry out a test of normality based on the correlation coefficient rQ. [See (4-31).] Set the significance level at α = .10. Do the results of these tests corroborate the results in Part a?

4.25. Refer to the data for the 10 largest industrial corporations in Exercise 1.4. Construct a chi-square plot using all three variables. The chi-square quantiles are

0.3518  0.7978  1.2125  1.6416  2.1095  2.6430  3.2831  4.1083  5.3170  7.8147

4.26. Exercise 1.2 gives the age x1, measured in years, as well as the selling price x2, measured in thousands of dollars, for n = 10 used cars. These data are reproduced as follows:

x1 |   9    10    11     3     8     7     5     5     7     7
x2 | 2.30  1.90  1.00   .70   .30  1.00  1.05   .45   .70   .30

(a) Use the results of Exercise 1.2 to calculate the squared statistical distances (xj − x̄)'S⁻¹(xj − x̄), j = 1, 2, ..., 10, where x'_j = [x_{j1}, x_{j2}].
(b) Using the distances in Part a, determine the proportion of the observations falling within the estimated 50% probability contour of a bivariate normal distribution.
(c) Order the distances in Part a and construct a chi-square plot.
(d) Given the results in Parts b and c, are these data approximately bivariate normal? Explain.

4.27. Consider the radiation data (with door closed) in Example 4.10. Construct a Q-Q plot for the natural logarithms of these data. [Note that the natural logarithm transformation corresponds to the value λ = 0 in (4-34).] Do the natural logarithms appear to be normally distributed? Compare your results with Figure 4.13. Does the choice λ = 1/4 or λ = 0 make much difference in this case?

The following exercises may require a computer.

4.28. Consider the air-pollution data given in Table 1.5. Construct a Q-Q plot for the solar radiation measurements and carry out a test for normality based on the correlation coefficient rQ [see (4-31)]. Let α = .05 and use the entry corresponding to n = 40 in Table 4.2.

4.29. Given the air-pollution data in Table 1.5, examine the pairs X5 = NO2 and X6 = O3 for bivariate normality.
(a) Calculate statistical distances (xj − x̄)'S⁻¹(xj − x̄), j = 1, 2, ..., 42, where x'_j = [x_{j5}, x_{j6}].
(b) Determine the proportion of observations x'_j = [x_{j5}, x_{j6}], j = 1, 2, ..., 42, falling within the approximate 50% probability contour of a bivariate normal distribution.
(c) Construct a chi-square plot of the ordered distances in Part a.

4.30. Consider the used-car data in Exercise 4.26.
(a) Determine the power transformation λ̂1 that makes the x1 values approximately normal. Construct a Q-Q plot for the transformed data.
(b) Determine the power transformation λ̂2 that makes the x2 values approximately normal. Construct a Q-Q plot for the transformed data.
(c) Determine the power transformations λ̂' = [λ̂1, λ̂2] that make the [x1, x2] values jointly normal using (4-40). Compare the results with those obtained in Parts a and b.
4.31. Examine the marginal normality of the observations on variables X1, X2, ..., X5 for the multiple-sclerosis data in Table 1.6. Treat the non-multiple-sclerosis and multiple-sclerosis groups separately. Use whatever methodology, including transformations, you feel is appropriate.

4.32. Examine the marginal normality of the observations on variables X1, X2, ..., X6 for the radiotherapy data in Table 1.7. Use whatever methodology, including transformations, you feel is appropriate.

4.33. Examine the marginal and bivariate normality of the observations on variables X1, X2, X3, and X4 for the data in Table 4.3.

4.34. Examine the data on bone mineral content in Table 1.8 for marginal and bivariate normality.

4.35. Examine the data on paper-quality measurements in Table 1.2 for marginal and multivariate normality.

4.36. Examine the data on women's national track records in Table 1.9 for marginal and multivariate normality.

4.37. Refer to Exercise 1.18. Convert the women's track records in Table 1.9 to speeds measured in meters per second. Examine the data on speeds for marginal and multivariate normality.

4.38. Examine the data on bulls in Table 1.10 for marginal and multivariate normality. Consider only the variables YrHgt, FtFrBody, PrctFFB, BkFat, SaleHt, and SaleWt.
REFERENCES

1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (2d ed.). New York: John Wiley, 1984.
2. Andrews, D. F., R. Gnanadesikan, and J. L. Warner. "Transformations of Multivariate Data." Biometrics, 27, no. 4 (1971), 825-840.
3. Box, G. E. P., and D. R. Cox. "An Analysis of Transformations" (with discussion). Journal of the Royal Statistical Society (B), 26, no. 2 (1964), 211-252.
4. Daniel, C., and F. S. Wood. Fitting Equations to Data (2d ed.). New York: John Wiley, 1980.
5. Filliben, J. J. "The Probability Plot Correlation Coefficient Test for Normality." Technometrics, 17, no. 1 (1975), 111-117.
6. Gnanadesikan, R. Methods for Statistical Data Analysis of Multivariate Observations. New York: John Wiley, 1977.
7. Hawkins, D. M. Identification of Outliers. London, UK: Chapman and Hall, 1980.
8. Hernandez, F., and R. A. Johnson. "The Large-Sample Behavior of Transformations to Normality." Journal of the American Statistical Association, 75, no. 372 (1980), 855-861.
9. Hogg, R. V., and A. T. Craig. Introduction to Mathematical Statistics (4th ed.). New York: Macmillan, 1978.
10. Looney, S. W., and T. R. Gulledge, Jr. "Use of the Correlation Coefficient with Normal Probability Plots." The American Statistician, 39, no. 1 (1985), 75-79.
11. Shapiro, S. S., and M. B. Wilk. "An Analysis of Variance Test for Normality (Complete Samples)." Biometrika, 52, no. 4 (1965), 591-611.
12. Verrill, S., and R. A. Johnson. "Tables and Large-Sample Distribution Theory for Censored-Data Correlation Statistics for Testing Normality." Journal of the American Statistical Association, 83, no. 404 (1988), 1192-1197.
13. Yeo, I., and R. A. Johnson. "A New Family of Power Transformations to Improve Normality or Symmetry." Biometrika, 87, no. 4 (2000), 954-959.
14. Zehna, P. "Invariance of Maximum Likelihood Estimators." Annals of Mathematical Statistics, 37, no. 3 (1966), 744.
CHAPTER 5

Inferences about a Mean Vector
5.1 INTRODUCTION

This chapter is the first of the methodological sections of the book. We shall now use the concepts and results set forth in Chapters 1 through 4 to develop techniques for analyzing data. A large part of any analysis is concerned with inference, that is, reaching valid conclusions concerning a population on the basis of information from a sample.

At this point, we shall concentrate on inferences about a population mean vector and its component parts. Although we introduce statistical inference through initial discussions of tests of hypotheses, our ultimate aim is to present a full statistical analysis of the component means based on simultaneous confidence statements.

One of the central messages of multivariate analysis is that p correlated variables must be analyzed jointly. This principle is exemplified by the methods presented in this chapter.
5.2 THE PLAUSIBILITY OF μ0 AS A VALUE FOR A NORMAL POPULATION MEAN
Let us start by recalling the univariate theory for determining whether a specific value μ0 is a plausible value for the population mean μ. From the point of view of hypothesis testing, this problem can be formulated as a test of the competing hypotheses

H0: μ = μ0   and   H1: μ ≠ μ0

Here H0 is the null hypothesis and H1 is the (two-sided) alternative hypothesis. If X1, X2, ..., Xn denote a random sample from a normal population, the appropriate test statistic is

\[
t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}, \qquad \text{where } \bar{X} = \frac{1}{n}\sum_{j=1}^{n} X_j \ \text{and} \ s^2 = \frac{1}{n-1}\sum_{j=1}^{n}(X_j - \bar{X})^2
\]
This test statistic has a Student's t-distribution with n − 1 degrees of freedom (d.f.). We reject H0, that μ0 is a plausible value of μ, if the observed |t| exceeds a specified percentage point of a t-distribution with n − 1 d.f. Rejecting H0 when |t| is large is equivalent to rejecting H0 if its square,

\[
t^2 = \frac{(\bar{X} - \mu_0)^2}{s^2/n} = n(\bar{X} - \mu_0)(s^2)^{-1}(\bar{X} - \mu_0) \tag{5-1}
\]

is large. The variable t² in (5-1) is the square of the distance from the sample mean X̄ to the test value μ0. The units of distance are expressed in terms of s/√n, or estimated standard deviations of X̄. Once X̄ and s² are observed, the test becomes: Reject H0 in favor of H1, at significance level α, if

\[
\left|\frac{\bar{x} - \mu_0}{s/\sqrt{n}}\right| > t_{n-1}(\alpha/2) \tag{5-2}
\]

where t_{n−1}(α/2) denotes the upper 100(α/2)th percentile of the t-distribution with n − 1 d.f.

If H0 is not rejected, we conclude that μ0 is a plausible value for the normal population mean. Are there other values of μ which are also consistent with the data? The answer is yes! In fact, there is always a set of plausible values for a normal population mean. From the well-known correspondence between acceptance regions for tests of H0: μ = μ0 versus H1: μ ≠ μ0 and confidence intervals for μ, we have

\[
\{\text{Do not reject } H_0\!: \mu = \mu_0 \text{ at level } \alpha\} \quad \text{or} \quad \left|\frac{\bar{x} - \mu_0}{s/\sqrt{n}}\right| \le t_{n-1}(\alpha/2)
\]

is equivalent to

\[
\left\{\mu_0 \text{ lies in the } 100(1-\alpha)\% \text{ confidence interval } \bar{x} \pm t_{n-1}(\alpha/2)\frac{s}{\sqrt{n}}\right\} \tag{5-3}
\]
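The duality between the test (5-2) and the interval (5-3) can be sketched in code. The numbers below are illustrative, and the critical value 2.064 is approximately t_24(.025), an assumption of this example rather than a value from the text.

```python
import math

def t_interval(xbar, s, n, tcrit):
    """100(1 - alpha)% confidence interval (5-3): xbar +/- t_{n-1}(alpha/2) * s/sqrt(n)."""
    half = tcrit * s / math.sqrt(n)
    return xbar - half, xbar + half

def reject_h0(mu0, xbar, s, n, tcrit):
    """Level-alpha test (5-2): reject H0: mu = mu0 when |t| exceeds t_{n-1}(alpha/2)."""
    t = (xbar - mu0) / (s / math.sqrt(n))
    return abs(t) > tcrit

# illustrative summary statistics from a hypothetical sample of n = 25
xbar, s, n, tcrit = 10.0, 2.0, 25, 2.064
lo, hi = t_interval(xbar, s, n, tcrit)
```

A value μ0 is rejected by the test exactly when it falls outside the interval (lo, hi), which is the content of (5-3).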
The confidence interval consists of all those values μ0 that would not be rejected by the level α test of H0: μ = μ0.

Before the sample is selected, the 100(1 − α)% confidence interval in (5-3) is a random interval because the endpoints depend upon the random variables X̄ and s. The probability that the interval contains μ is 1 − α; among large numbers of such independent intervals, approximately 100(1 − α)% of them will contain μ.

Consider now the problem of determining whether a given p × 1 vector μ0 is a plausible value for the mean of a multivariate normal distribution. We shall proceed by analogy to the univariate development just presented. A natural generalization of the squared distance in (5-1) is its multivariate analog

\[
T^2 = (\bar{\mathbf{X}} - \boldsymbol{\mu}_0)'\left(\frac{1}{n}\mathbf{S}\right)^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu}_0) = n(\bar{\mathbf{X}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu}_0) \tag{5-4}
\]
where

\[
\bar{\mathbf{X}}_{(p\times 1)} = \frac{1}{n}\sum_{j=1}^{n}\mathbf{X}_j, \qquad
\mathbf{S}_{(p\times p)} = \frac{1}{n-1}\sum_{j=1}^{n}(\mathbf{X}_j - \bar{\mathbf{X}})(\mathbf{X}_j - \bar{\mathbf{X}})', \qquad
\boldsymbol{\mu}_{0\,(p\times 1)} = \begin{bmatrix}\mu_{10}\\ \mu_{20}\\ \vdots\\ \mu_{p0}\end{bmatrix}
\]

The statistic T² is called Hotelling's T² in honor of Harold Hotelling, a pioneer in multivariate analysis, who first obtained its sampling distribution. Here (1/n)S is the estimated covariance matrix of X̄. (See Result 3.1.)

If the observed statistical distance T² is too large, that is, if x̄ is "too far" from μ0, the hypothesis H0: μ = μ0 is rejected. It turns out that special tables of T² percentage points are not required for formal tests of hypotheses. This is true because

\[
T^2 \ \text{is distributed as} \ \frac{(n-1)p}{(n-p)}F_{p,\,n-p} \tag{5-5}
\]

where F_{p,n−p} denotes a random variable with an F-distribution with p and n − p d.f. To summarize, we have the following:

Let X1, X2, ..., Xn be a random sample from an Np(μ, Σ) population. Then with X̄ = (1/n)Σ_{j=1}^n Xj and S = (1/(n−1))Σ_{j=1}^n (Xj − X̄)(Xj − X̄)',

\[
\alpha = P\!\left[T^2 > \frac{(n-1)p}{(n-p)}F_{p,\,n-p}(\alpha)\right] \tag{5-6}
\]

whatever the true μ and Σ. Here F_{p,n−p}(α) is the upper (100α)th percentile of the F_{p,n−p} distribution.
Statement (5-6) leads immediately to a test of the hypothesis H0: μ = μ0 versus H1: μ ≠ μ0. At the α level of significance, we reject H0 in favor of H1 if the observed

\[
T^2 = n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0) > \frac{(n-1)p}{(n-p)}F_{p,\,n-p}(\alpha) \tag{5-7}
\]

It is informative to discuss the nature of the T²-distribution briefly and its correspondence with the univariate test statistic. In Section 4.4, we described the manner in which the Wishart distribution generalizes the chi-square distribution. We can write

\[
T^2 = \sqrt{n}\,(\bar{\mathbf{X}} - \boldsymbol{\mu}_0)'\left(\frac{\sum_{j=1}^{n}(\mathbf{X}_j - \bar{\mathbf{X}})(\mathbf{X}_j - \bar{\mathbf{X}})'}{n-1}\right)^{-1}\sqrt{n}\,(\bar{\mathbf{X}} - \boldsymbol{\mu}_0)
\]

which combines a normal, Np(0, Σ), random vector and a Wishart, W_{p,n−1}(Σ), random matrix in the form

\[
T^2_{p,\,n-1} = \begin{pmatrix}\text{multivariate normal}\\ \text{random vector}\end{pmatrix}'\left(\frac{\text{Wishart random matrix}}{\text{d.f.}}\right)^{-1}\begin{pmatrix}\text{multivariate normal}\\ \text{random vector}\end{pmatrix}
= N_p(\mathbf{0}, \boldsymbol{\Sigma})'\left[\frac{1}{n-1}W_{p,\,n-1}(\boldsymbol{\Sigma})\right]^{-1}N_p(\mathbf{0}, \boldsymbol{\Sigma}) \tag{5-8}
\]

This is analogous to

\[
t^2_{n-1} = (\text{normal random variable})\left(\frac{(\text{scaled})\ \text{chi-square random variable}}{\text{d.f.}}\right)^{-1}(\text{normal random variable})
\]

for the univariate case. Since the multivariate normal and Wishart random variables are independently distributed [see (4-23)], their joint density function is the product of the marginal normal and Wishart distributions. Using calculus, the distribution (5-5) of T² as given previously can be derived from this joint distribution and the representation (5-8).

It is rare, in multivariate situations, to be content with a test of H0: μ = μ0, where all of the mean vector components are specified under the null hypothesis. Ordinarily, it is preferable to find regions of μ values that are plausible in light of the observed data. We shall return to this issue in Section 5.4.
Example 5.1 (Evaluating T²)

Let the data matrix for a random sample of size n = 3 from a bivariate normal population be

\[
\mathbf{X} = \begin{bmatrix} 6 & 9 \\ 10 & 6 \\ 8 & 3 \end{bmatrix}
\]

Evaluate the observed T² for μ'0 = [9, 5]. What is the sampling distribution of T² in this case?

We find

\[
\bar{\mathbf{x}} = \begin{bmatrix} \dfrac{6 + 10 + 8}{3} \\[6pt] \dfrac{9 + 6 + 3}{3} \end{bmatrix} = \begin{bmatrix} 8 \\ 6 \end{bmatrix}
\]

and

\[
s_{11} = \frac{(6-8)^2 + (10-8)^2 + (8-8)^2}{2} = 4
\]
\[
s_{12} = \frac{(6-8)(9-6) + (10-8)(6-6) + (8-8)(3-6)}{2} = -3
\]
\[
s_{22} = \frac{(9-6)^2 + (6-6)^2 + (3-6)^2}{2} = 9
\]

so

\[
\mathbf{S} = \begin{bmatrix} 4 & -3 \\ -3 & 9 \end{bmatrix}
\]

Thus,

\[
\mathbf{S}^{-1} = \frac{1}{(4)(9) - (-3)(-3)}\begin{bmatrix} 9 & 3 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} \tfrac{9}{27} & \tfrac{3}{27} \\[2pt] \tfrac{3}{27} & \tfrac{4}{27} \end{bmatrix}
\]

and, from (5-4),

\[
T^2 = 3[8-9,\ 6-5]\begin{bmatrix} \tfrac{9}{27} & \tfrac{3}{27} \\[2pt] \tfrac{3}{27} & \tfrac{4}{27} \end{bmatrix}\begin{bmatrix} 8-9 \\ 6-5 \end{bmatrix}
= 3[-1,\ 1]\begin{bmatrix} \tfrac{9}{27} & \tfrac{3}{27} \\[2pt] \tfrac{3}{27} & \tfrac{4}{27} \end{bmatrix}\begin{bmatrix} -1 \\ 1 \end{bmatrix} = \frac{7}{9}
\]

Before the sample is selected, T² has the distribution of a

\[
\frac{(3-1)2}{(3-2)}F_{2,\,3-2} = 4F_{2,1}
\]

random variable.
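The hand calculation in Example 5.1 can be checked with a few lines of NumPy (an illustrative sketch, not code from the book):

```python
import numpy as np

# data matrix, hypothesized mean, and sample size from Example 5.1
X = np.array([[6.0, 9.0],
              [10.0, 6.0],
              [8.0, 3.0]])
mu0 = np.array([9.0, 5.0])
n = X.shape[0]

xbar = X.mean(axis=0)                 # sample mean vector
S = np.cov(X, rowvar=False)           # sample covariance with divisor n - 1, as in (5-4)
d = xbar - mu0
T2 = n * d @ np.linalg.inv(S) @ d     # observed Hotelling's T^2
```

T2 evaluates to 7/9, matching the hand computation, and `np.cov` uses the divisor n − 1 required by (5-4).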
The next example illustrates a test of the hypothesis H0: μ = μ0 using data collected as part of a search for new diagnostic techniques at the University of Wisconsin Medical School.

Example 5.2 (Testing a multivariate mean vector with T²)
Perspiration from 20 healthy females was analyzed. Three components, X1 = sweat rate, X2 = sodium content, and X3 = potassium content, were measured, and the results, which we call the sweat data, are presented in Table 5.1.

Test the hypothesis H0: μ' = [4, 50, 10] against H1: μ' ≠ [4, 50, 10] at level of significance α = .10.

Computer calculations provide

\[
\bar{\mathbf{x}} = \begin{bmatrix} 4.640 \\ 45.400 \\ 9.965 \end{bmatrix}, \qquad
\mathbf{S} = \begin{bmatrix} 2.879 & 10.010 & -1.810 \\ 10.010 & 199.788 & -5.640 \\ -1.810 & -5.640 & 3.628 \end{bmatrix}
\]

and

\[
\mathbf{S}^{-1} = \begin{bmatrix} .586 & -.022 & .258 \\ -.022 & .006 & -.002 \\ .258 & -.002 & .402 \end{bmatrix}
\]

We evaluate

\[
T^2 = 20[4.640 - 4,\ 45.400 - 50,\ 9.965 - 10]
\begin{bmatrix} .586 & -.022 & .258 \\ -.022 & .006 & -.002 \\ .258 & -.002 & .402 \end{bmatrix}
\begin{bmatrix} 4.640 - 4 \\ 45.400 - 50 \\ 9.965 - 10 \end{bmatrix}
= 20[.640,\ -4.600,\ -.035]\begin{bmatrix} .467 \\ -.042 \\ .160 \end{bmatrix} = 9.74
\]
TABLE 5.1 SWEAT DATA

Individual   x1 (Sweat rate)   x2 (Sodium)   x3 (Potassium)
     1             3.7             48.5            9.3
     2             5.7             65.1            8.0
     3             3.8             47.2           10.9
     4             3.2             53.2           12.0
     5             3.1             55.5            9.7
     6             4.6             36.1            7.9
     7             2.4             24.8           14.0
     8             7.2             33.1            7.6
     9             6.7             47.4            8.5
    10             5.4             54.1           11.3
    11             3.9             36.9           12.7
    12             4.5             58.8           12.3
    13             3.5             27.8            9.8
    14             4.5             40.2            8.4
    15             1.5             13.5           10.1
    16             8.5             56.4            7.1
    17             4.5             71.6            8.2
    18             6.5             52.8           10.9
    19             4.1             44.1           11.2
    20             5.5             40.9            9.4

Source: Courtesy of Dr. Gerald Bargman.
Comparing the observed T² = 9.74 with the critical value

\[
\frac{(n-1)p}{(n-p)}F_{p,\,n-p}(.10) = \frac{19(3)}{17}F_{3,17}(.10) = 3.353(2.44) = 8.18
\]

we see that T² = 9.74 > 8.18, and consequently, we reject H0 at the 10% level of significance.

We note that H0 will be rejected if one or more of the component means, or some combination of means, differs too much from the hypothesized values [4, 50, 10]. At this point, we have no idea which of these hypothesized values may not be supported by the data.

We have assumed that the sweat data are multivariate normal. The Q-Q plots constructed from the marginal distributions of x1, x2, and x3 all approximate straight lines. Moreover, scatter plots for pairs of observations have approximate elliptical shapes, and we conclude that the normality assumption was reasonable in this case. (See Exercise 5.4.)
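The test statistic and decision can be reproduced from the summary statistics displayed above. This is a sketch; the critical value 8.18 is copied from the text rather than recomputed, since recomputing it would require an F table or a distribution library.

```python
import numpy as np

# summary statistics reported in Example 5.2 for the sweat data (n = 20, p = 3)
n = 20
xbar = np.array([4.640, 45.400, 9.965])
S = np.array([[2.879, 10.010, -1.810],
              [10.010, 199.788, -5.640],
              [-1.810, -5.640, 3.628]])
mu0 = np.array([4.0, 50.0, 10.0])

d = xbar - mu0
T2 = n * d @ np.linalg.inv(S) @ d      # observed Hotelling's T^2
critical = 8.18                        # 19(3)/17 * F_{3,17}(.10), as computed in the text
reject = T2 > critical
```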
One feature of the T²-statistic is that it is invariant (unchanged) under changes in the units of measurement for X of the form

\[
\mathbf{Y}_{(p\times 1)} = \mathbf{C}_{(p\times p)}\mathbf{X}_{(p\times 1)} + \mathbf{d}_{(p\times 1)}, \qquad \mathbf{C} \ \text{nonsingular} \tag{5-9}
\]
A transformation of the observations of this kind arises when a constant b_i is subtracted from the ith variable to form X_i − b_i and the result is multiplied by a constant a_i > 0 to get a_i(X_i − b_i). Premultiplication of the centered and scaled quantities a_i(X_i − b_i) by any nonsingular matrix will yield Equation (5-9). As an example, the operations involved in changing X_i to a_i(X_i − b_i) correspond exactly to the process of converting temperature from a Fahrenheit to a Celsius reading.

Given observations x1, x2, ..., xn and the transformation in (5-9), it immediately follows from Result 3.6 that

\[
\bar{\mathbf{y}} = \mathbf{C}\bar{\mathbf{x}} + \mathbf{d} \quad \text{and} \quad \mathbf{S}_y = \frac{1}{n-1}\sum_{j=1}^{n}(\mathbf{y}_j - \bar{\mathbf{y}})(\mathbf{y}_j - \bar{\mathbf{y}})' = \mathbf{C}\mathbf{S}\mathbf{C}'
\]

Moreover, by (2-24) and (2-45),

\[
\boldsymbol{\mu}_Y = E(\mathbf{Y}) = E(\mathbf{C}\mathbf{X} + \mathbf{d}) = E(\mathbf{C}\mathbf{X}) + E(\mathbf{d}) = \mathbf{C}\boldsymbol{\mu} + \mathbf{d}
\]

Therefore, T² computed with the y's and a hypothesized value μ_{Y,0} = Cμ0 + d is

\[
T^2 = n(\bar{\mathbf{y}} - \boldsymbol{\mu}_{Y,0})'\mathbf{S}_y^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu}_{Y,0})
= n(\mathbf{C}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0))'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}(\mathbf{C}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0))
= n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\mathbf{C}'(\mathbf{C}')^{-1}\mathbf{S}^{-1}\mathbf{C}^{-1}\mathbf{C}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)
= n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)
\]

The last expression is recognized as the value of T² computed with the x's.
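The invariance just derived can be demonstrated numerically. Everything below (the simulated sample, the matrix C, and the shift d) is made up for the demonstration; only the algebra of (5-9) is assumed.

```python
import numpy as np

def hotelling_T2(X, mu0):
    """T^2 of (5-4) for an n x p data matrix X and hypothesized mean mu0."""
    n = X.shape[0]
    d = X.mean(axis=0) - mu0
    return n * d @ np.linalg.inv(np.cov(X, rowvar=False)) @ d

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 3)) + np.array([4.0, 50.0, 10.0])   # made-up sample
mu0 = np.array([4.0, 50.0, 10.0])

# an arbitrary nonsingular C and shift d, as in (5-9)
C = np.array([[5.0 / 9.0, 0.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])
dshift = np.array([-160.0 / 9.0, 1.0, -2.0])

Y = X @ C.T + dshift                    # y_j = C x_j + d, row by row
T2_x = hotelling_T2(X, mu0)
T2_y = hotelling_T2(Y, C @ mu0 + dshift)
```

The two values agree to machine precision, as the derivation guarantees.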
5.3 HOTELLING'S T² AND LIKELIHOOD RATIO TESTS
We introduced the T²-statistic by analogy with the univariate squared distance t². There is a general principle for constructing test procedures called the likelihood ratio method, and the T²-statistic can be derived as the likelihood ratio test of H0: μ = μ0. The general theory of likelihood ratio tests is beyond the scope of this book. (See [3] for a treatment of the topic.) Likelihood ratio tests have several optimal properties for reasonably large samples, and they are particularly convenient for hypotheses formulated in terms of multivariate normal parameters.

We know from (4-18) that the maximum of the multivariate normal likelihood as μ and Σ are varied over their possible values is given by

\[
\max_{\boldsymbol{\mu},\,\boldsymbol{\Sigma}} L(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{np/2}|\hat{\boldsymbol{\Sigma}}|^{n/2}}e^{-np/2} \tag{5-10}
\]

where

\[
\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' \quad \text{and} \quad \hat{\boldsymbol{\mu}} = \bar{\mathbf{x}} = \frac{1}{n}\sum_{j=1}^{n}\mathbf{x}_j
\]

are the maximum likelihood estimates. Recall that μ̂ and Σ̂ are those choices for μ and Σ that best explain the observed values of the random sample.
Under the hypothesis H_0: \mu = \mu_0, the normal likelihood specializes to

    L(\mu_0, \Sigma) = \frac{1}{(2\pi)^{np/2} |\Sigma|^{n/2}} \exp\left( -\frac{1}{2} \sum_{j=1}^n (x_j - \mu_0)' \Sigma^{-1} (x_j - \mu_0) \right)

The mean \mu_0 is now fixed, but \Sigma can be varied to find the value that is "most likely" to have led, with \mu_0 fixed, to the observed sample. This value is obtained by maximizing L(\mu_0, \Sigma) with respect to \Sigma.

Following the steps in (4-13), the exponent in L(\mu_0, \Sigma) may be written as

    -\frac{1}{2} \sum_{j=1}^n (x_j - \mu_0)' \Sigma^{-1} (x_j - \mu_0) = -\frac{1}{2} \sum_{j=1}^n \mathrm{tr}\left[ \Sigma^{-1} (x_j - \mu_0)(x_j - \mu_0)' \right]
      = -\frac{1}{2} \mathrm{tr}\left[ \Sigma^{-1} \left( \sum_{j=1}^n (x_j - \mu_0)(x_j - \mu_0)' \right) \right]

Applying Result 4.10 with B = \sum_{j=1}^n (x_j - \mu_0)(x_j - \mu_0)' and b = n/2, we have

    \max_{\Sigma} L(\mu_0, \Sigma) = \frac{1}{(2\pi)^{np/2} |\hat{\Sigma}_0|^{n/2}} e^{-np/2}        (5-11)

with

    \hat{\Sigma}_0 = \frac{1}{n} \sum_{j=1}^n (x_j - \mu_0)(x_j - \mu_0)'
To determine whether \mu_0 is a plausible value of \mu, the maximum of L(\mu_0, \Sigma) is compared with the unrestricted maximum of L(\mu, \Sigma). The resulting ratio is called the likelihood ratio statistic. Using Equations (5-10) and (5-11), we get

    Likelihood ratio = \Lambda = \frac{\max_{\Sigma} L(\mu_0, \Sigma)}{\max_{\mu,\Sigma} L(\mu, \Sigma)} = \left( \frac{|\hat{\Sigma}|}{|\hat{\Sigma}_0|} \right)^{n/2}        (5-12)

The equivalent statistic \Lambda^{2/n} = |\hat{\Sigma}| / |\hat{\Sigma}_0| is called Wilks' lambda. If the observed value of this likelihood ratio is too small, the hypothesis H_0: \mu = \mu_0 is unlikely to be true and is, therefore, rejected. Specifically, the likelihood ratio test of H_0: \mu = \mu_0 against H_1: \mu \ne \mu_0 rejects H_0 if

    \Lambda = \left( \frac{ \left| \sum_{j=1}^n (x_j - \bar{x})(x_j - \bar{x})' \right| }{ \left| \sum_{j=1}^n (x_j - \mu_0)(x_j - \mu_0)' \right| } \right)^{n/2} < c_{\alpha}        (5-13)

where c_{\alpha} is the lower (100\alpha)th percentile of the distribution of \Lambda. (Note that the likelihood ratio test statistic is a power of the ratio of generalized variances.) Fortunately, because of the following relation between T^2 and \Lambda, we do not need the distribution of the latter to carry out the test.
Chapter 5  Inferences about a Mean Vector
Result 5.1. Let X_1, X_2, ..., X_n be a random sample from an N_p(\mu, \Sigma) population. Then the test in (5-7) based on T^2 is equivalent to the likelihood ratio test of H_0: \mu = \mu_0 versus H_1: \mu \ne \mu_0 because

    \Lambda^{2/n} = \left( 1 + \frac{T^2}{n-1} \right)^{-1}

Proof. Let the (p + 1) × (p + 1) matrix

    A = \begin{bmatrix} \sum_{j=1}^n (x_j - \bar{x})(x_j - \bar{x})' & \sqrt{n}(\bar{x} - \mu_0) \\ \sqrt{n}(\bar{x} - \mu_0)' & -1 \end{bmatrix}

Expanding the determinant of the partitioned matrix A in the two available ways gives

    |A| = (-1) \left| \sum_{j=1}^n (x_j - \bar{x})(x_j - \bar{x})' + n(\bar{x} - \mu_0)(\bar{x} - \mu_0)' \right|
        = \left| \sum_{j=1}^n (x_j - \bar{x})(x_j - \bar{x})' \right| \left( -1 - n(\bar{x} - \mu_0)' \left( \sum_{j=1}^n (x_j - \bar{x})(x_j - \bar{x})' \right)^{-1} (\bar{x} - \mu_0) \right)

Since, by (4-14),

    \sum_{j=1}^n (x_j - \mu_0)(x_j - \mu_0)' = \sum_{j=1}^n (x_j - \bar{x} + \bar{x} - \mu_0)(x_j - \bar{x} + \bar{x} - \mu_0)'
        = \sum_{j=1}^n (x_j - \bar{x})(x_j - \bar{x})' + n(\bar{x} - \mu_0)(\bar{x} - \mu_0)'

the foregoing equality involving determinants can be written

    (-1) \left| \sum_{j=1}^n (x_j - \mu_0)(x_j - \mu_0)' \right| = \left| \sum_{j=1}^n (x_j - \bar{x})(x_j - \bar{x})' \right| (-1) \left( 1 + \frac{T^2}{n-1} \right)

where we have used \sum_{j=1}^n (x_j - \bar{x})(x_j - \bar{x})' = (n-1)S and T^2/(n-1) = n(\bar{x} - \mu_0)'((n-1)S)^{-1}(\bar{x} - \mu_0). That is,

    |n\hat{\Sigma}_0| = |n\hat{\Sigma}| \left( 1 + \frac{T^2}{n-1} \right)        (5-14)

Thus,

    \Lambda^{2/n} = \frac{|\hat{\Sigma}|}{|\hat{\Sigma}_0|} = \left( 1 + \frac{T^2}{n-1} \right)^{-1}

Here H_0 is rejected for small values of \Lambda^{2/n} or, equivalently, large values of T^2. The critical values of T^2 are determined by (5-6). ■

Incidentally, relation (5-14) shows that T^2 may be calculated from two determinants, thus avoiding the computation of S^{-1}. Solving (5-14) for T^2, we have

    T^2 = \frac{(n-1) \left| \sum_{j=1}^n (x_j - \mu_0)(x_j - \mu_0)' \right|}{ \left| \sum_{j=1}^n (x_j - \bar{x})(x_j - \bar{x})' \right| } - (n-1)        (5-15)

Likelihood ratio tests are common in multivariate analysis. Their optimal large-sample properties hold in very general contexts, as we shall indicate shortly. They are well suited for the testing situations considered in this book. Likelihood ratio methods yield test statistics that reduce to the familiar F- and t-statistics in univariate situations.
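Relation (5-15) can be verified numerically. A minimal sketch, assuming NumPy and using arbitrary illustrative data (not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))
mu0 = np.array([0.5, -0.5])
n = X.shape[0]
xbar = X.mean(axis=0)

# T^2 computed directly from S^{-1}
S = np.cov(X, rowvar=False)
d = xbar - mu0
t2_direct = n * d @ np.linalg.solve(S, d)

# T^2 from two determinants, as in (5-15): no matrix inversion needed
A = X - xbar                              # rows are x_j - xbar
B = X - mu0                               # rows are x_j - mu0
t2_det = (n - 1) * np.linalg.det(B.T @ B) / np.linalg.det(A.T @ A) - (n - 1)

print(np.isclose(t2_direct, t2_det))      # True
```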
General Likelihood Ratio Method

We shall now consider the general likelihood ratio method. Let \theta be a vector consisting of all the unknown population parameters, and let L(\theta) be the likelihood function obtained by evaluating the joint density of X_1, X_2, ..., X_n at their observed values x_1, x_2, ..., x_n. The parameter vector \theta takes its value in the parameter set \Theta. For example, in the p-dimensional multivariate normal case, \theta' = [\mu_1, ..., \mu_p, \sigma_{11}, ..., \sigma_{1p}, \sigma_{22}, ..., \sigma_{2p}, ..., \sigma_{p-1,p}, \sigma_{pp}] and \Theta consists of the p-dimensional space, where -\infty < \mu_1 < \infty, ..., -\infty < \mu_p < \infty, combined with the [p(p+1)/2]-dimensional space of variances and covariances such that \Sigma is positive definite. Therefore, \Theta has dimension \nu = p + p(p+1)/2. Under the null hypothesis H_0: \theta = \theta_0, \theta is restricted to lie in a subset \Theta_0 of \Theta. For the multivariate normal situation with \mu = \mu_0, \Theta_0 = {\mu_1 = \mu_{10}, \mu_2 = \mu_{20}, ..., \mu_p = \mu_{p0}; \sigma_{11}, ..., \sigma_{1p}, \sigma_{22}, ..., \sigma_{2p}, ..., \sigma_{p-1,p}, \sigma_{pp} unspecified, with \Sigma positive definite}, so \Theta_0 has dimension \nu_0 = 0 + p(p+1)/2 = p(p+1)/2.

A likelihood ratio test of H_0: \theta \in \Theta_0 rejects H_0 in favor of H_1: \theta \notin \Theta_0 if

    \Lambda = \frac{\max_{\theta \in \Theta_0} L(\theta)}{\max_{\theta \in \Theta} L(\theta)} < c        (5-16)

where c is a suitably chosen constant. Intuitively, we reject H_0 if the maximum of the likelihood obtained by allowing \theta to vary over the set \Theta_0 is much smaller than the maximum of the likelihood obtained by varying \theta over all values in \Theta. When the maximum in the numerator of expression (5-16) is much smaller than the maximum in the denominator, \Theta_0 does not contain plausible values for \theta.

In each application of the likelihood ratio method, we must obtain the sampling distribution of the likelihood-ratio test statistic \Lambda. Then c can be selected to produce a test with a specified significance level \alpha. However, when the sample size is large and certain regularity conditions are satisfied, the sampling distribution of -2 \ln \Lambda is well approximated by a chi-square distribution. This attractive feature accounts, in part, for the popularity of likelihood ratio procedures.
Result 5.2. When the sample size n is large, under the null hypothesis H_0,

    -2 \ln \Lambda = -2 \ln \left( \frac{\max_{\theta \in \Theta_0} L(\theta)}{\max_{\theta \in \Theta} L(\theta)} \right)

is, approximately, a \chi^2_{\nu - \nu_0} random variable. Here the degrees of freedom are \nu - \nu_0 = (dimension of \Theta) - (dimension of \Theta_0). ■

Statistical tests are compared on the basis of their power, which is defined as the curve or surface whose height is P[test rejects H_0 | \theta], evaluated at each parameter vector \theta. Power measures the ability of a test to reject H_0 when it is not true. In the rare situation where \theta = \theta_0 is completely specified under H_0 and the alternative H_1 consists of the single specified value \theta = \theta_1, the likelihood ratio test has the highest power among all tests with the same significance level \alpha = P[test rejects H_0 | \theta = \theta_0]. In many single-parameter cases (\theta has one component), the likelihood ratio test is uniformly most powerful against all alternatives to one side of H_0: \theta = \theta_0. In other cases, this property holds approximately for large samples.

We shall not give the technical details required for discussing the optimal properties of likelihood ratio tests in the multivariate situation. The general import of these properties, for our purposes, is that they have the highest possible (average) power when the sample size is large.
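For H_0: \mu = \mu_0, relation (5-14) gives -2 \ln \Lambda = n \ln(1 + T^2/(n-1)), so the chi-square approximation of Result 5.2 (here \nu - \nu_0 = p) can be checked by simulation. A sketch assuming NumPy and SciPy are available; the sample size and replication count are arbitrary illustrative choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
p, n, reps = 3, 200, 2000
mu0 = np.zeros(p)

stat = np.empty(reps)
for r in range(reps):
    X = rng.normal(size=(n, p))          # sample drawn under H0: mu = mu0 = 0
    d = X.mean(axis=0) - mu0
    S = np.cov(X, rowvar=False)
    t2 = n * d @ np.linalg.solve(S, d)
    stat[r] = n * np.log1p(t2 / (n - 1)) # -2 ln(Lambda), via (5-14)

# empirical rejection rate at the chi-square 95th percentile should be near .05
rate = np.mean(stat > stats.chi2.ppf(0.95, df=p))
print(round(rate, 3))
```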
5.4 CONFIDENCE REGIONS AND SIMULTANEOUS COMPARISONS OF COMPONENT MEANS
To obtain our primary method for making inferences from a sample, we need to extend the concept of a univariate confidence interval to a multivariate confidence region. Let \theta be a vector of unknown population parameters and \Theta be the set of all possible values of \theta. A confidence region is a region of likely \theta values. This region is determined by the data, and for the moment, we shall denote it by R(X), where X = [X_1, X_2, ..., X_n]' is the data matrix.

The region R(X) is said to be a 100(1 - \alpha)% confidence region if, before the sample is selected,

    P[R(X) will cover the true \theta] = 1 - \alpha        (5-17)

This probability is calculated under the true, but unknown, value of \theta.

The confidence region for the mean \mu of a p-dimensional normal population is available from (5-6). Before the sample is selected,

    P\left[ n(\bar{X} - \mu)' S^{-1} (\bar{X} - \mu) \le \frac{p(n-1)}{n-p} F_{p,n-p}(\alpha) \right] = 1 - \alpha

whatever the values of the unknown \mu and \Sigma. In words, \bar{X} will be within [(n-1)p F_{p,n-p}(\alpha)/(n-p)]^{1/2} of \mu, with probability 1 - \alpha, provided that distance is defined in terms of n S^{-1}. For a particular sample, \bar{x} and S can be computed, and the inequality

    n(\bar{x} - \mu)' S^{-1} (\bar{x} - \mu) \le \frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)        (5-18)

will define a region R(X) within the space of all possible parameter values. In this case, the region will be an ellipsoid centered at \bar{x}. This ellipsoid is the 100(1 - \alpha)% confidence region for \mu.
To determine whether any \mu_0 falls within the confidence region (is a plausible value for \mu), we need to compute the generalized squared distance n(\bar{x} - \mu_0)' S^{-1} (\bar{x} - \mu_0) and compare it with [p(n-1)/(n-p)] F_{p,n-p}(\alpha). If the squared distance is larger than [p(n-1)/(n-p)] F_{p,n-p}(\alpha), \mu_0 is not in the confidence region. Since this is analogous to testing H_0: \mu = \mu_0 versus H_1: \mu \ne \mu_0 [see (5-7)], we see that the confidence region of (5-18) consists of all \mu_0 vectors for which the T^2-test would not reject H_0 in favor of H_1 at significance level \alpha.

For p \ge 4, we cannot graph the joint confidence region for \mu. However, we can calculate the axes of the confidence ellipsoid and their relative lengths. These are determined from the eigenvalues \lambda_i and eigenvectors e_i of S. As in (4-7), the directions and lengths of the axes of

    n(\bar{x} - \mu)' S^{-1} (\bar{x} - \mu) \le c^2 = \frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)

are determined by going

    \sqrt{\lambda_i} \frac{c}{\sqrt{n}} = \sqrt{\lambda_i} \sqrt{\frac{p(n-1)}{n(n-p)} F_{p,n-p}(\alpha)}

units along the eigenvectors e_i. Beginning at the center \bar{x}, the axes of the confidence ellipsoid are

    \pm \sqrt{\lambda_i} \sqrt{\frac{p(n-1)}{n(n-p)} F_{p,n-p}(\alpha)} \, e_i   where S e_i = \lambda_i e_i,  i = 1, 2, ..., p        (5-19)

The ratios of the \lambda_i's will help identify relative amounts of elongation along pairs of axes.

Example 5.3 (Constructing a confidence ellipse for \mu)
Data for radiation from microwave ovens were introduced in Examples 4.10 and 4.17. Let

    x_1 = fourth root of measured radiation with door closed

and

    x_2 = fourth root of measured radiation with door open

For the n = 42 pairs of transformed observations, we find that

    \bar{x} = [.564, .603]',   S = [.0144  .0117; .0117  .0146],   S^{-1} = [203.018  -163.391; -163.391  200.228]

The eigenvalue and eigenvector pairs for S are

    \lambda_1 = .026,  e_1' = [.704, .710]
    \lambda_2 = .002,  e_2' = [-.710, .704]

The 95% confidence ellipse for \mu consists of all values (\mu_1, \mu_2) satisfying

    42 [.564 - \mu_1, .603 - \mu_2] [203.018  -163.391; -163.391  200.228] [.564 - \mu_1; .603 - \mu_2] \le \frac{2(41)}{40} F_{2,40}(.05)

or, since F_{2,40}(.05) = 3.23,

    42(203.018)(.564 - \mu_1)^2 + 42(200.228)(.603 - \mu_2)^2 - 84(163.391)(.564 - \mu_1)(.603 - \mu_2) \le 6.62

To see whether \mu' = [.562, .589] is in the confidence region, we compute

    42(203.018)(.564 - .562)^2 + 42(200.228)(.603 - .589)^2 - 84(163.391)(.564 - .562)(.603 - .589) = 1.30 \le 6.62

We conclude that \mu' = [.562, .589] is in the region. Equivalently, a test of H_0: \mu = [.562, .589]' would not be rejected in favor of H_1: \mu \ne [.562, .589]' at the \alpha = .05 level of significance.

The joint confidence ellipsoid is plotted in Figure 5.1. The center is at \bar{x}' = [.564, .603], and the half-lengths of the major and minor axes are given by

    \sqrt{\lambda_1} \sqrt{\frac{p(n-1)}{n(n-p)} F_{p,n-p}(\alpha)} = \sqrt{.026} \sqrt{\frac{2(41)}{42(40)} (3.23)} = .064

and

    \sqrt{\lambda_2} \sqrt{\frac{p(n-1)}{n(n-p)} F_{p,n-p}(\alpha)} = \sqrt{.002} \sqrt{\frac{2(41)}{42(40)} (3.23)} = .018

respectively. The axes lie along e_1' = [.704, .710] and e_2' = [-.710, .704] when these vectors are plotted with \bar{x} as the origin. An indication of the elongation of the confidence ellipse is provided by the ratio of the lengths of the major and minor axes. This ratio is

    \frac{2\sqrt{\lambda_1}\sqrt{\frac{p(n-1)}{n(n-p)} F_{p,n-p}(\alpha)}}{2\sqrt{\lambda_2}\sqrt{\frac{p(n-1)}{n(n-p)} F_{p,n-p}(\alpha)}} = \frac{\sqrt{\lambda_1}}{\sqrt{\lambda_2}} = \frac{.161}{.045} = 3.6

Figure 5.1  A 95% confidence ellipse for \mu based on the microwave-radiation data.

The length of the major axis is 3.6 times the length of the minor axis. ■
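The computations in Example 5.3 can be reproduced from the reported summary statistics. A sketch assuming NumPy; the numerical values are transcribed from the example, and small discrepancies from the text arise because the printed S is rounded:

```python
import numpy as np

n, p = 42, 2
xbar = np.array([0.564, 0.603])
S = np.array([[0.0144, 0.0117],
              [0.0117, 0.0146]])
F = 3.23                                  # F_{2,40}(.05), from the example
c2 = p * (n - 1) / (n - p) * F            # = 2(41)/40 * 3.23, approx 6.62

mu0 = np.array([0.562, 0.589])
d = xbar - mu0
dist2 = n * d @ np.linalg.solve(S, d)     # generalized squared distance
print(dist2 <= c2)                        # True: mu0 lies inside the 95% ellipse

# half-lengths of the ellipse axes, as in (5-19)
lam, vecs = np.linalg.eigh(S)             # eigenvalues in ascending order
half = np.sqrt(lam) * np.sqrt(c2 / n)
print(np.round(half[::-1], 3))            # [0.064 0.021]; the text's .018 reflects unrounded data
```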
Simultaneous Confidence Statements

While the confidence region n(\bar{x} - \mu)' S^{-1} (\bar{x} - \mu) \le c^2, for c a constant, correctly assesses the joint knowledge concerning plausible values of \mu, any summary of conclusions ordinarily includes confidence statements about the individual component means. In so doing, we adopt the attitude that all of the separate confidence statements should hold simultaneously with a specified high probability. It is the guarantee of a specified probability against any statement being incorrect that motivates the term simultaneous confidence intervals. We begin by considering simultaneous confidence statements which are intimately related to the joint confidence region based on the T^2-statistic.

Let X have an N_p(\mu, \Sigma) distribution and form the linear combination

    Z = a_1 X_1 + a_2 X_2 + ... + a_p X_p = a'X

From (2-43),

    \mu_Z = E(Z) = a'\mu   and   \sigma_Z^2 = Var(Z) = a'\Sigma a

Moreover, by Result 4.2, Z has an N(a'\mu, a'\Sigma a) distribution. If a random sample X_1, X_2, ..., X_n from the N_p(\mu, \Sigma) population is available, a corresponding sample of Z's can be created by taking linear combinations. Thus,

    Z_j = a_1 X_{j1} + a_2 X_{j2} + ... + a_p X_{jp} = a'X_j,   j = 1, 2, ..., n

The sample mean and variance of the observed values z_1, z_2, ..., z_n are, by (3-36),

    \bar{z} = a'\bar{x}

and

    s_z^2 = a'Sa

where \bar{x} and S are the sample mean vector and covariance matrix of the x_j's, respectively.

Simultaneous confidence intervals can be developed from a consideration of confidence intervals for a'\mu for various choices of a. The argument proceeds as follows.

For a fixed and \sigma_Z^2 unknown, a 100(1 - \alpha)% confidence interval for \mu_Z = a'\mu is based on Student's t-ratio

    t = \frac{\bar{z} - \mu_Z}{s_z / \sqrt{n}} = \frac{\sqrt{n}(a'\bar{x} - a'\mu)}{\sqrt{a'Sa}}        (5-20)

and leads to the statement

    \bar{z} - t_{n-1}(\alpha/2) \frac{s_z}{\sqrt{n}} \le \mu_Z \le \bar{z} + t_{n-1}(\alpha/2) \frac{s_z}{\sqrt{n}}

or

    a'\bar{x} - t_{n-1}(\alpha/2) \frac{\sqrt{a'Sa}}{\sqrt{n}} \le a'\mu \le a'\bar{x} + t_{n-1}(\alpha/2) \frac{\sqrt{a'Sa}}{\sqrt{n}}        (5-21)

where t_{n-1}(\alpha/2) is the upper 100(\alpha/2)th percentile of a t-distribution with n - 1 d.f.

Inequality (5-21) can be interpreted as a statement about the components of the mean vector \mu. For example, with a' = [1, 0, ..., 0], a'\mu = \mu_1, and (5-21) becomes the usual confidence interval for a normal population mean. (Note, in this case, that a'Sa = s_{11}.) Clearly, we could make several confidence statements about the components of \mu, each with associated confidence coefficient 1 - \alpha, by choosing different coefficient vectors a. However, the confidence associated with all of the statements taken together is not 1 - \alpha.

Intuitively, it would be desirable to associate a "collective" confidence coefficient of 1 - \alpha with the confidence intervals that can be generated by all choices of a. However, a price must be paid for the convenience of a large simultaneous confidence coefficient: intervals that are wider (less precise) than the interval of (5-21) for a specific choice of a.

Given a data set x_1, x_2, ..., x_n and a particular a, the confidence interval in (5-21) is that set of a'\mu values for which

    |t| = \left| \frac{\sqrt{n}(a'\bar{x} - a'\mu)}{\sqrt{a'Sa}} \right| \le t_{n-1}(\alpha/2)

or, equivalently,

    t^2 = \frac{n(a'\bar{x} - a'\mu)^2}{a'Sa} = \frac{n(a'(\bar{x} - \mu))^2}{a'Sa} \le t_{n-1}^2(\alpha/2)        (5-22)

A simultaneous confidence region is given by the set of a'\mu values such that t^2 is relatively small for all choices of a. It seems reasonable to expect that the constant t_{n-1}^2(\alpha/2) in (5-22) will be replaced by a larger value, c^2, when statements are developed for many choices of a.

Considering the values of a for which t^2 \le c^2, we are naturally led to the determination of
    \max_a t^2 = \max_a \frac{n(a'(\bar{x} - \mu))^2}{a'Sa}

Using the maximization lemma (2-50) with x = a, d = (\bar{x} - \mu), and B = S, we get

    \max_a t^2 = n \max_a \frac{(a'(\bar{x} - \mu))^2}{a'Sa} = n(\bar{x} - \mu)' S^{-1} (\bar{x} - \mu) = T^2        (5-23)

with the maximum occurring for a proportional to S^{-1}(\bar{x} - \mu).

Result 5.3. Let X_1, X_2, ..., X_n be a random sample from an N_p(\mu, \Sigma) population with \Sigma positive definite. Then, simultaneously for all a, the interval

    \left( a'\bar{X} - \sqrt{\frac{p(n-1)}{n(n-p)} F_{p,n-p}(\alpha) \, a'Sa},   a'\bar{X} + \sqrt{\frac{p(n-1)}{n(n-p)} F_{p,n-p}(\alpha) \, a'Sa} \right)

will contain a'\mu with probability 1 - \alpha.

Proof. From (5-23),

    T^2 = n(\bar{x} - \mu)' S^{-1} (\bar{x} - \mu) \le c^2   implies   \frac{n(a'\bar{x} - a'\mu)^2}{a'Sa} \le c^2   for every a

or

    a'\bar{x} - c\sqrt{\frac{a'Sa}{n}} \le a'\mu \le a'\bar{x} + c\sqrt{\frac{a'Sa}{n}}   for every a

Choosing c^2 = p(n-1)F_{p,n-p}(\alpha)/(n-p) [see (5-6)] gives intervals that will contain a'\mu for all a, with probability 1 - \alpha = P[T^2 \le c^2]. ■

It is convenient to refer to the simultaneous intervals of Result 5.3 as T^2-intervals, since the coverage probability is determined by the distribution of T^2. The successive choices a' = [1, 0, ..., 0], a' = [0, 1, 0, ..., 0], and so on through a' = [0, 0, ..., 1] for these T^2-intervals allow us to conclude that

    \bar{x}_1 - \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)} \sqrt{\frac{s_{11}}{n}} \le \mu_1 \le \bar{x}_1 + \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)} \sqrt{\frac{s_{11}}{n}}
    \bar{x}_2 - \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)} \sqrt{\frac{s_{22}}{n}} \le \mu_2 \le \bar{x}_2 + \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)} \sqrt{\frac{s_{22}}{n}}
        \vdots
    \bar{x}_p - \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)} \sqrt{\frac{s_{pp}}{n}} \le \mu_p \le \bar{x}_p + \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)} \sqrt{\frac{s_{pp}}{n}}        (5-24)

all hold simultaneously with confidence coefficient 1 - \alpha. Note that, without modifying the coefficient 1 - \alpha, we can make statements about the differences \mu_i - \mu_k corresponding to a' = [0, ..., 0, a_i, 0, ..., 0, a_k, 0, ..., 0], where a_i = 1 and a_k = -1. In this case a'Sa = s_{ii} - 2s_{ik} + s_{kk}, and we have the statement

    \bar{x}_i - \bar{x}_k - \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)} \sqrt{\frac{s_{ii} - 2s_{ik} + s_{kk}}{n}} \le \mu_i - \mu_k \le \bar{x}_i - \bar{x}_k + \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)} \sqrt{\frac{s_{ii} - 2s_{ik} + s_{kk}}{n}}        (5-25)

The simultaneous T^2 confidence intervals are ideal for "data snooping." The confidence coefficient 1 - \alpha remains unchanged for any choice of a, so linear combinations of the components \mu_i that merit inspection based upon an examination of the data can be estimated.

In addition, according to the results in Supplement 5A, we can include the statements about (\mu_i, \mu_k) belonging to the sample mean-centered ellipses

    n[\bar{x}_i - \mu_i, \bar{x}_k - \mu_k] \begin{bmatrix} s_{ii} & s_{ik} \\ s_{ik} & s_{kk} \end{bmatrix}^{-1} \begin{bmatrix} \bar{x}_i - \mu_i \\ \bar{x}_k - \mu_k \end{bmatrix} \le \frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)        (5-26)

and still maintain the confidence coefficient (1 - \alpha) for the whole set of statements.

The simultaneous T^2 confidence intervals for the individual component means are just the shadows, or projections, of the confidence ellipsoid on the component axes. This connection between the shadows of the ellipsoid and the simultaneous confidence intervals given by (5-24) is illustrated in the next example.

Example 5.4 (Simultaneous confidence intervals as shadows of the confidence ellipsoid)

In Example 5.3, we obtained the 95% confidence ellipse for the means of the fourth roots of the door-closed and door-open microwave radiation measurements. The 95% simultaneous T^2 intervals for the two component means are, from (5-24),

    \left( \bar{x}_1 - \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(.05)} \sqrt{\frac{s_{11}}{n}},  \bar{x}_1 + \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(.05)} \sqrt{\frac{s_{11}}{n}} \right)
      = \left( .564 - \sqrt{\frac{2(41)}{40}(3.23)} \sqrt{\frac{.0144}{42}},  .564 + \sqrt{\frac{2(41)}{40}(3.23)} \sqrt{\frac{.0144}{42}} \right)   or   (.516, .612)

    \left( \bar{x}_2 - \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(.05)} \sqrt{\frac{s_{22}}{n}},  \bar{x}_2 + \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(.05)} \sqrt{\frac{s_{22}}{n}} \right)
      = \left( .603 - \sqrt{\frac{2(41)}{40}(3.23)} \sqrt{\frac{.0146}{42}},  .603 + \sqrt{\frac{2(41)}{40}(3.23)} \sqrt{\frac{.0146}{42}} \right)   or   (.555, .651)

In Figure 5.2, we have redrawn the 95% confidence ellipse from Example 5.3. The 95% simultaneous intervals are shown as shadows, or projections, of this ellipse on the axes of the component means. ■
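The endpoints in Example 5.4 follow directly from (5-24). A sketch assuming NumPy, with the summary statistics of Example 5.3:

```python
import numpy as np

n, p = 42, 2
xbar = np.array([0.564, 0.603])
s_diag = np.array([0.0144, 0.0146])       # s11, s22 from Example 5.3
F = 3.23                                  # F_{2,40}(.05)

c = np.sqrt(p * (n - 1) / (n - p) * F)    # simultaneous T^2 multiplier
half = c * np.sqrt(s_diag / n)
lower, upper = xbar - half, xbar + half
print(np.round(lower, 3), np.round(upper, 3))   # [0.516 0.555] [0.612 0.651]
```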
Example 5.5 (Constructing simultaneous confidence intervals and ellipses)

The scores obtained by n = 87 college students on the College Level Examination Program (CLEP) subtest X_1 and the College Qualification Test (CQT) subtests X_2 and X_3 are given in Table 5.2 for X_1 = social science and history, X_2 = verbal, and X_3 = science.
Figure 5.2  Simultaneous T^2-intervals for the component means as shadows of the confidence ellipse on the axes: microwave-radiation data.

For these data we have

    \bar{x} = [527.74, 54.69, 25.13]'

and

    S = [5691.34   600.51   217.25
          600.51   126.05    23.37
          217.25    23.37    23.11]

Let us compute the 95% simultaneous confidence intervals for \mu_1, \mu_2, \mu_3. We have

    \frac{p(n-1)}{n-p} F_{p,n-p}(\alpha) = \frac{3(87-1)}{87-3} F_{3,84}(.05) = \frac{3(86)}{84} (2.7) = 8.29

and we obtain the simultaneous confidence statements [see (5-24)]:

    527.74 - \sqrt{8.29} \sqrt{\frac{5691.34}{87}} \le \mu_1 \le 527.74 + \sqrt{8.29} \sqrt{\frac{5691.34}{87}},   or   504.45 \le \mu_1 \le 551.03

    54.69 - \sqrt{8.29} \sqrt{\frac{126.05}{87}} \le \mu_2 \le 54.69 + \sqrt{8.29} \sqrt{\frac{126.05}{87}},   or   51.22 \le \mu_2 \le 58.16

    25.13 - \sqrt{8.29} \sqrt{\frac{23.11}{87}} \le \mu_3 \le 25.13 + \sqrt{8.29} \sqrt{\frac{23.11}{87}}
TABLE 5.2  COLLEGE TEST DATA
Scores of the n = 87 individuals on x_1 (social science and history), x_2 (verbal), and x_3 (science).
Source: Data courtesy of Richard W. Johnson.
or

    23.65 \le \mu_3 \le 26.61

With the possible exception of the verbal scores, the marginal Q-Q plots and two-dimensional scatter plots do not reveal any serious departures from normality for the college qualification test data. (See Exercise 5.18.) Moreover, the sample size is large enough to justify the methodology, even though the data are not quite normally distributed. (See Section 5.5.)

The simultaneous T^2-intervals above are wider than univariate intervals because all three must hold with 95% confidence. They may also be wider than necessary, because, with the same confidence, we can make statements about differences. For instance, with a' = [0, 1, -1], the interval for \mu_2 - \mu_3 has endpoints

    (\bar{x}_2 - \bar{x}_3) \pm \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(.05)} \sqrt{\frac{s_{22} + s_{33} - 2s_{23}}{n}} = (54.69 - 25.13) \pm \sqrt{8.29} \sqrt{\frac{126.05 + 23.11 - 2(23.37)}{87}} = 29.56 \pm 3.12

so (26.44, 32.68) is a 95% confidence interval for \mu_2 - \mu_3. Simultaneous intervals can also be constructed for the other differences.

Finally, we can construct confidence ellipses for pairs of means, and the same 95% confidence holds. For example, for the pair (\mu_2, \mu_3), we have

    87 [54.69 - \mu_2, 25.13 - \mu_3] \begin{bmatrix} 126.05 & 23.37 \\ 23.37 & 23.11 \end{bmatrix}^{-1} \begin{bmatrix} 54.69 - \mu_2 \\ 25.13 - \mu_3 \end{bmatrix}
      = 0.849(54.69 - \mu_2)^2 + 4.633(25.13 - \mu_3)^2 - 2(0.859)(54.69 - \mu_2)(25.13 - \mu_3) \le 8.29

This ellipse is shown in Figure 5.3, along with the 95% confidence ellipses for the other two pairs of means. The projections, or shadows, of these ellipses on the axes are also indicated, and these projections are the T^2-intervals. ■

A Comparison of Simultaneous Confidence Intervals with One-at-a-Time Intervals

An alternative approach to the construction of confidence intervals is to consider the components \mu_i one at a time, as suggested by (5-21) with a' = [0, ..., 0, a_i, 0, ..., 0], where a_i = 1. This approach ignores the covariance structure of the p variables and leads to the intervals

    \bar{x}_1 - t_{n-1}(\alpha/2) \sqrt{\frac{s_{11}}{n}} \le \mu_1 \le \bar{x}_1 + t_{n-1}(\alpha/2) \sqrt{\frac{s_{11}}{n}}
    \bar{x}_2 - t_{n-1}(\alpha/2) \sqrt{\frac{s_{22}}{n}} \le \mu_2 \le \bar{x}_2 + t_{n-1}(\alpha/2) \sqrt{\frac{s_{22}}{n}}
        \vdots
    \bar{x}_p - t_{n-1}(\alpha/2) \sqrt{\frac{s_{pp}}{n}} \le \mu_p \le \bar{x}_p + t_{n-1}(\alpha/2) \sqrt{\frac{s_{pp}}{n}}        (5-27)

Figure 5.3  95% confidence ellipses for pairs of means and the simultaneous T^2-intervals: college test data.
Although prior to sampling, the ith interval has probability 1 - \alpha of covering \mu_i, we do not know what to assert, in general, about the probability of all intervals containing their respective \mu_i's. As we have pointed out, this probability is not 1 - \alpha.

To shed some light on the problem, consider the special case where the observations have a joint normal distribution and

    \Sigma = \begin{bmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{pp} \end{bmatrix}

Since the observations on the first variable are independent of those on the second variable, and so on, the product rule for independent events can be applied, and before the sample is selected,

    P[all t-intervals in (5-27) contain the \mu_i's] = (1 - \alpha)(1 - \alpha) \cdots (1 - \alpha) = (1 - \alpha)^p

If 1 - \alpha = .95 and p = 6, this probability is (.95)^6 = .74.
Iconsf JL1ildieers itnheitconfs T2-iidnencetervalelandlipseJLand2 liesthine sitims Tul2t-ianneous i n t e r v al s s h own i n idence n the elrelctipasngle ande fomorrmede. Theby thconfese itdwenceo intelerlivpalses.isThismsalrectler abutnglterhasvealcont, prthoenababiin(sJLtl1hi,teyJLconf2.9) 5lieofis cover i n g t h e mean vect o r wi t h i t s component means and Cons e quent l y , t h e • JL JL 1 2 prtheobabirectalinglty ofe fcover inbyg ththeetTw2o-iinntdiervvialduals. Thimeanss resuJLlt1 landeadsJLus2 witol consbe lairdgererathseancond.95 apfor o r m ed proach to making multiple comparisons known as the Bonfer oni method.
Section 5.4
231
Confidence Reg ions and S i m u lta neous Com pa risons of Com ponent Means
a
a ==
==
==
Fp, n - p ( . O S)
_
==
tn - l
a.
�
=
11
==
==
TABLE 5.3 t-
CRITICAL DISTANCE M U LTI PLI ERS FOR O N E-AT-A-TI M E I NTERVALS A N D T2- I NTE RVALS FOR S E LECTED n AN D p ( 1 - a == .95 ) _
tn - 1
00
IL
Fp, n - p ( . OS )
==
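The entries of Table 5.3 can be regenerated from t and F quantiles. A sketch assuming SciPy is available:

```python
from scipy import stats

def t2_multiplier(n, p, alpha=0.05):
    """sqrt(p(n-1) F_{p,n-p}(alpha) / (n-p)), the simultaneous T^2 multiplier."""
    F = stats.f.ppf(1 - alpha, p, n - p)
    return (p * (n - 1) * F / (n - p)) ** 0.5

# reproduce the finite-n rows of Table 5.3 (values match the table to rounding)
for n in (15, 25, 50, 100):
    t_mult = stats.t.ppf(0.975, n - 1)    # one-at-a-time multiplier
    print(n, round(t_mult, 3), round(t2_multiplier(n, 4), 2),
          round(t2_multiplier(n, 10), 2))
```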
The Bonferroni Method of Multiple Comparisons

Often, attention is restricted to a small number of individual confidence statements. In these situations it is possible to do better than the simultaneous intervals of Result 5.3. If the number m of specified component means \mu_i or linear combinations a'\mu = a_1\mu_1 + a_2\mu_2 + ... + a_p\mu_p is small, simultaneous confidence intervals can be developed that are shorter (more precise) than the simultaneous T^2-intervals. The alternative method for multiple comparisons is called the Bonferroni method, because it is developed from a probability inequality carrying that name.

Suppose that, prior to the collection of data, confidence statements about m linear combinations a_1'\mu, a_2'\mu, ..., a_m'\mu are required. Let C_i denote a confidence statement about the value of a_i'\mu with P[C_i true] = 1 - \alpha_i, i = 1, 2, ..., m. Now (see Exercise 5.6),

    P[all C_i true] = 1 - P[at least one C_i false]
        \ge 1 - \sum_{i=1}^m P(C_i false) = 1 - \sum_{i=1}^m (1 - P(C_i true))
        = 1 - (\alpha_1 + \alpha_2 + ... + \alpha_m)        (5-28)

Inequality (5-28), a special case of the Bonferroni inequality, allows an investigator to control the overall error rate \alpha_1 + \alpha_2 + ... + \alpha_m, regardless of the correlation structure behind the confidence statements. There is also the flexibility of controlling the error rate for a group of important statements and balancing it by another choice for the less important statements.

Let us develop simultaneous interval estimates for the restricted set consisting of the components \mu_i of \mu. Lacking information on the relative importance of these components, we consider the individual t-intervals

    \bar{x}_i \pm t_{n-1}\left(\frac{\alpha_i}{2}\right) \sqrt{\frac{s_{ii}}{n}},   i = 1, 2, ..., m

with \alpha_i = \alpha/m. Since P[\bar{X}_i \pm t_{n-1}(\alpha/2m)\sqrt{s_{ii}/n} contains \mu_i] = 1 - \alpha/m, i = 1, 2, ..., m, we have, from (5-28),

    P\left[ \bar{X}_i \pm t_{n-1}\left(\frac{\alpha}{2m}\right) \sqrt{\frac{s_{ii}}{n}} contains \mu_i, all i \right] \ge 1 - \left( \frac{\alpha}{m} + \frac{\alpha}{m} + ... + \frac{\alpha}{m} \right)  (m terms)  = 1 - \alpha

Therefore, with an overall confidence level greater than or equal to 1 - \alpha, we can make the following m = p statements:

    \bar{x}_1 - t_{n-1}\left(\frac{\alpha}{2p}\right) \sqrt{\frac{s_{11}}{n}} \le \mu_1 \le \bar{x}_1 + t_{n-1}\left(\frac{\alpha}{2p}\right) \sqrt{\frac{s_{11}}{n}}
        \vdots
    \bar{x}_p - t_{n-1}\left(\frac{\alpha}{2p}\right) \sqrt{\frac{s_{pp}}{n}} \le \mu_p \le \bar{x}_p + t_{n-1}\left(\frac{\alpha}{2p}\right) \sqrt{\frac{s_{pp}}{n}}        (5-29)
The statements in (5-29) can be compared with those in (5-24). The percentage point t_{n−1}(α/2p) replaces √((n − 1)pF_{p,n−p}(α)/(n − p)), but otherwise the intervals are of the same structure.

Example 5.6 (Constructing Bonferroni simultaneous confidence intervals and comparing them with T²-intervals). Let us return to the microwave oven radiation data in Examples 5.3 and 5.4. We shall obtain the simultaneous 95% Bonferroni confidence intervals for the means, μ₁ and μ₂, of the fourth roots of the door-closed and door-open measurements with αᵢ = .05/2, i = 1, 2. We make use of the results in Example 5.3, noting that n = 42 and t₄₁(.05/2(2)) = t₄₁(.0125) = 2.327, to get

    x̄₁ ± t₄₁(.0125)√(s₁₁/n) = .564 ± 2.327 √(.0144/42)   or   .521 < μ₁ < .607
    x̄₂ ± t₄₁(.0125)√(s₂₂/n) = .603 ± 2.327 √(.0146/42)   or   .560 < μ₂ < .646

Figure 5.4 shows the 95% T² simultaneous confidence intervals for μ₁, μ₂ from Figure 5.2, along with the corresponding 95% Bonferroni intervals. For each component mean, the Bonferroni interval falls within the T²-interval. Consequently, the rectangular (joint) region formed by the two Bonferroni intervals is contained in the rectangular region formed by the two T²-intervals. If we are interested only in the component means, the Bonferroni intervals provide more precise estimates than the T²-intervals. On the other hand, the 95% confidence region for μ gives the plausible values for the pairs (μ₁, μ₂) when the correlation between the measured variables is taken into account. ■

[Figure 5.4 The 95% T² and 95% Bonferroni simultaneous confidence intervals for the component means (microwave radiation data). The Bonferroni intervals, .521 < μ₁ < .607 and .560 < μ₂ < .646, lie inside the T²-intervals, .516 < μ₁ < .612 and .555 < μ₂ < .651.]
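The interval computations in Example 5.6 can be reproduced directly. A minimal sketch: the summary statistics and the critical value t₄₁(.0125) = 2.327 are taken from the text, and F₂,₄₀(.05) = 3.23 (used for the comparison T²-intervals) is a looked-up table value, not computed here.

```python
from math import sqrt

# Fourth roots of the microwave radiation measurements (Example 5.3): n = 42.
n = 42
xbar = [0.564, 0.603]          # sample means
s_diag = [0.0144, 0.0146]      # sample variances s11, s22

# Critical values quoted in the text (not recomputed here):
t_bonf = 2.327                               # t_41(.05/(2*2)), Bonferroni with m = p = 2
c_t2 = sqrt((n - 1) * 2 / (n - 2) * 3.23)    # sqrt((n-1)p F_{2,40}(.05)/(n-p))

def interval(mean, var, crit):
    """Return (lower, upper) for mean +/- crit * sqrt(var/n)."""
    half = crit * sqrt(var / n)
    return mean - half, mean + half

bonf = [interval(m, v, t_bonf) for m, v in zip(xbar, s_diag)]
t2 = [interval(m, v, c_t2) for m, v in zip(xbar, s_diag)]

print(bonf)  # Bonferroni intervals: roughly (.521, .607) and (.560, .646)
print(t2)    # wider simultaneous T^2 intervals: roughly (.516, .612) and (.555, .651)
```

Each Bonferroni interval falls strictly inside the corresponding T²-interval, which is the containment displayed in Figure 5.4.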
Chapter 5  Inferences about a Mean Vector
The Bonferroni intervals for linear combinations a′μ and the analogous T²-intervals (recall Result 5.3) have the same general form:

    a′x̄ ± (critical value) √(a′Sa/n)

Consequently, in every instance where αᵢ = α/m,

    Length of Bonferroni interval       t_{n−1}(α/2m)
    ----------------------------- = ------------------------------------           (5-30)
    Length of T²-interval            √( (n − 1)p F_{p,n−p}(α) / (n − p) )

which does not depend on the random quantities x̄ and S. As we have pointed out, for a small number m of specified parametric functions a′μ, the Bonferroni intervals will always be shorter. How much shorter is indicated in Table 5.4 for selected n and p.

TABLE 5.4  (LENGTH OF BONFERRONI INTERVAL)/(LENGTH OF T²-INTERVAL) FOR 1 − α = .95 AND αᵢ = .05/m

                 m = p
      n        2      4      10
     15      .88    .69    .29
     25      .90    .75    .48
     50      .91    .78    .58
    100      .91    .80    .62
     ∞       .91    .81    .66
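The entries in Table 5.4 follow from (5-30). As a sketch for the n = 15, m = p = 2 cell, both quantiles below are looked-up approximations (t₁₄(.0125) ≈ 2.510 and F₂,₁₃(.05) ≈ 3.81 are not values quoted in this section):

```python
from math import sqrt

# Ratio (5-30) for n = 15, p = m = 2, 1 - alpha = .95.
n, p, m = 15, 2, 2
t_crit = 2.510     # looked-up t_14(.05/(2*2)), approximate
f_crit = 3.81      # looked-up F_{2,13}(.05), approximate

ratio = t_crit / sqrt((n - 1) * p / (n - p) * f_crit)
print(round(ratio, 2))  # close to the .88 entry in Table 5.4
```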
We see from Table 5.4 that the Bonferroni method provides shorter intervals when m = p. Because they are easy to apply and provide the relatively short confidence intervals needed for inference, we will often apply simultaneous t-intervals based on the Bonferroni method.

5.5 LARGE SAMPLE INFERENCES ABOUT A POPULATION MEAN VECTOR

When the sample size is large, tests of hypotheses and confidence regions for μ can be constructed without the assumption of a normal population. As illustrated in Exercises 5.15, 5.16, and 5.17, for large n, we are able to make inferences about the population mean even though the parent distribution is discrete. In fact, serious departures from a normal population can be overcome by large sample sizes. Both tests of hypotheses and simultaneous confidence statements will then possess (approximately) their nominal levels.
The advantages associated with large samples may be partially offset by a loss in sample information caused by using only the summary statistics x̄ and S. On the other hand, since (x̄, S) is a sufficient summary for normal populations [see (4-21)], the closer the underlying population is to multivariate normal, the more efficiently the sample information will be utilized in making inferences.

All large-sample inferences about μ are based on a χ²-distribution. From (4-28), we know that (X̄ − μ)′(n⁻¹S)⁻¹(X̄ − μ) = n(X̄ − μ)′S⁻¹(X̄ − μ) is approximately χ² with p d.f., and thus,

    P[ n(X̄ − μ)′S⁻¹(X̄ − μ) ≤ χ²ₚ(α) ] ≈ 1 − α                                      (5-31)

where χ²ₚ(α) is the upper (100α)th percentile of the χ²ₚ-distribution.

Equation (5-31) immediately leads to large sample tests of hypotheses and simultaneous confidence regions. These procedures are summarized in Results 5.4 and 5.5.

Result 5.4. Let X₁, X₂, ..., Xₙ be a random sample from a population with mean μ and positive definite covariance matrix Σ. When n − p is large, the hypothesis H₀: μ = μ₀ is rejected in favor of H₁: μ ≠ μ₀, at a level of significance approximately α, if the observed

    n(x̄ − μ₀)′S⁻¹(x̄ − μ₀) > χ²ₚ(α)

Here χ²ₚ(α) is the upper (100α)th percentile of a chi-square distribution with p d.f. ■

Comparing the test in Result 5.4 with the corresponding normal theory test in (5-7), we see that the test statistics have the same structure, but the critical values are different. A closer examination, however, reveals that both tests yield essentially the same result in situations where the χ²-test of Result 5.4 is appropriate. This follows directly from the fact that (n − 1)pF_{p,n−p}(α)/(n − p) and χ²ₚ(α) are approximately equal for n large relative to p. (See Tables 3 and 4 in the appendix.)
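The test in Result 5.4 is a short computation once S⁻¹ is available. A minimal sketch for p = 2, using the microwave-radiation summary statistics from Example 5.6; the off-diagonal covariance s₁₂ = .0117, the hypothesized μ₀ = (.55, .60), and the critical value χ²₂(.05) = 5.99 are illustrative assumptions here, not quantities quoted in this section.

```python
# Large-sample test of H0: mu = mu0 (Result 5.4) for p = 2.
# xbar and the variances come from the microwave-radiation example;
# s12 = .0117 and mu0 = (.55, .60) are illustrative assumptions.
n = 42
xbar = [0.564, 0.603]
S = [[0.0144, 0.0117],
     [0.0117, 0.0146]]
mu0 = [0.55, 0.60]
chi2_crit = 5.99            # upper 5% point of chi-square with p = 2 d.f.

d = [xbar[0] - mu0[0], xbar[1] - mu0[1]]
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
Sinv = [[ S[1][1] / det, -S[0][1] / det],
        [-S[1][0] / det,  S[0][0] / det]]

# statistic n (xbar - mu0)' S^{-1} (xbar - mu0)
quad = sum(d[i] * Sinv[i][j] * d[j] for i in range(2) for j in range(2))
stat = n * quad
reject = stat > chi2_crit
print(round(stat, 2), reject)
```

With these numbers the statistic is well below the critical value, so H₀ would not be rejected.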
Result 5.5. Let X₁, X₂, ..., Xₙ be a random sample from a population with mean μ and positive definite covariance Σ. If n − p is large,

    a′X̄ ± √(χ²ₚ(α)) √(a′Sa/n)

will contain a′μ, for every a, with probability approximately 1 − α. Consequently, we can make the 100(1 − α)% simultaneous confidence statements

    x̄₁ ± √(χ²ₚ(α)) √(s₁₁/n)   contains μ₁
    x̄₂ ± √(χ²ₚ(α)) √(s₂₂/n)   contains μ₂
         ⋮
    x̄ₚ ± √(χ²ₚ(α)) √(sₚₚ/n)   contains μₚ

and, in addition, for all pairs (μᵢ, μₖ), i, k = 1, 2, ..., p, the sample mean-centered ellipses

    n [x̄ᵢ − μᵢ, x̄ₖ − μₖ] [ sᵢᵢ  sᵢₖ ]⁻¹ [ x̄ᵢ − μᵢ ] ≤ χ²ₚ(α)   contain (μᵢ, μₖ)
                         [ sᵢₖ  sₖₖ ]   [ x̄ₖ − μₖ ]

Proof. The first part follows from Result 5A.1, with c² = χ²ₚ(α). The probability level is a consequence of (5-31). The statements for the μᵢ are obtained by the special choices a′ = [0, ..., 0, aᵢ, 0, ..., 0], where aᵢ = 1, i = 1, 2, ..., p. The ellipsoids for pairs of means follow from Result 5A.2 with c² = χ²ₚ(α). The overall confidence level of approximately 1 − α for all statements is, once again, a result of the large sample distribution theory summarized in (5-31). ■

It is good statistical practice to subject these large sample inference procedures to the same checks required of the normal-theory methods. Although small to moderate departures from normality do not cause any difficulties for n large, extreme deviations could cause problems. Specifically, the true error rate of the confidence statements may be far removed from the nominal level α. If, on the basis of Q-Q plots and other investigative devices, outliers and other forms of extreme departures are indicated (see, for example, [2]), appropriate corrective actions, including transformations, are desirable. Methods for testing mean vectors of symmetric multivariate distributions that are relatively insensitive to departures from normality are discussed in [10]. In some instances, Results 5.4 and 5.5 are useful only for very large samples.

The next example allows us to illustrate the construction of large sample simultaneous statements for single mean components.

Example 5.7 (Constructing large sample simultaneous confidence intervals). A music educator tested thousands of Finnish students on their native musical ability in order to set national norms in Finland. Summary statistics for part of the data set are given in Table 5.5. These statistics are based on a sample of n = 96 Finnish 12th graders.

TABLE 5.5  MUSICAL APTITUDE PROFILE MEANS AND STANDARD DEVIATIONS FOR 96 12TH-GRADE FINNISH STUDENTS PARTICIPATING IN A STANDARDIZATION PROGRAM

    Variable         Raw score mean   Standard deviation
    x₁ = melody           28.1              5.76
    x₂ = harmony          26.6              5.85
    x₃ = tempo            35.4              3.82
    x₄ = meter            34.2              5.12
    x₅ = phrasing         23.6              3.76
    x₆ = balance          22.0              3.93
    x₇ = style            22.7              4.03

    Source: Data courtesy of V. Sell.
Let us construct 90% simultaneous confidence intervals for the individual mean components μᵢ, i = 1, 2, ..., 7.

From Result 5.5, simultaneous 90% confidence limits are given by x̄ᵢ ± √(χ²₇(.10)) √(sᵢᵢ/n), i = 1, 2, ..., 7, where χ²₇(.10) = 12.02. Thus, with approximately 90% confidence,

    28.1 ± √12.02 (5.76/√96)  contains μ₁   or   26.06 < μ₁ < 30.14
    26.6 ± √12.02 (5.85/√96)  contains μ₂   or   24.53 < μ₂ < 28.67
    35.4 ± √12.02 (3.82/√96)  contains μ₃   or   34.05 < μ₃ < 36.75
    34.2 ± √12.02 (5.12/√96)  contains μ₄   or   32.39 < μ₄ < 36.01
    23.6 ± √12.02 (3.76/√96)  contains μ₅   or   22.27 < μ₅ < 24.93
    22.0 ± √12.02 (3.93/√96)  contains μ₆   or   20.61 < μ₆ < 23.39
    22.7 ± √12.02 (4.03/√96)  contains μ₇   or   21.27 < μ₇ < 24.13

Based, perhaps, upon thousands of American students, the investigator could hypothesize the musical aptitude profile to be

    μ₀′ = [31, 27, 34, 31, 23, 22, 22]

We see from the simultaneous statements above that the melody, tempo, and meter components of μ₀ do not appear to be plausible values for the corresponding means of Finnish scores. ■

When the sample size is large, the one-at-a-time confidence intervals for individual means are

    x̄ᵢ − z(α/2) √(sᵢᵢ/n) < μᵢ < x̄ᵢ + z(α/2) √(sᵢᵢ/n),    i = 1, 2, ..., p

where z(α/2) is the upper 100(α/2)th percentile of the standard normal distribution. The Bonferroni simultaneous confidence intervals for the m = p statements about the individual means take the same form, but use the modified percentile z(α/2p) to give

    x̄ᵢ − z(α/2p) √(sᵢᵢ/n) < μᵢ < x̄ᵢ + z(α/2p) √(sᵢᵢ/n),    i = 1, 2, ..., p

Table 5.6 gives the individual, Bonferroni, and chi-square-based (or shadow of the confidence ellipsoid) intervals for the musical aptitude data in Example 5.7.
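The 90% simultaneous intervals of Example 5.7, and the plausibility check of μ₀, can be reproduced as follows. This is a sketch using only quantities quoted in the example (χ²₇(.10) = 12.02, the Table 5.5 summary statistics, and the hypothesized profile μ₀).

```python
from math import sqrt

# 90% simultaneous chi-square intervals from Result 5.5 for the
# musical aptitude data (Table 5.5): n = 96, chi2_7(.10) = 12.02.
n = 96
crit = sqrt(12.02)
names = ["melody", "harmony", "tempo", "meter", "phrasing", "balance", "style"]
xbar = [28.1, 26.6, 35.4, 34.2, 23.6, 22.0, 22.7]
sd = [5.76, 5.85, 3.82, 5.12, 3.76, 3.93, 4.03]
mu0 = [31, 27, 34, 31, 23, 22, 22]   # hypothesized profile from the example

intervals = [(m - crit * s / sqrt(n), m + crit * s / sqrt(n))
             for m, s in zip(xbar, sd)]
implausible = [nm for nm, (lo, hi), m0 in zip(names, intervals, mu0)
               if not lo < m0 < hi]
print(implausible)  # components of mu0 outside their intervals
```

The components flagged as implausible agree with the example's conclusion: melody, tempo, and meter.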
TABLE 5.6  THE LARGE-SAMPLE 95% INDIVIDUAL, BONFERRONI, AND T²-INTERVALS FOR THE MUSICAL APTITUDE DATA

The one-at-a-time confidence intervals use z(.025) = 1.96. The simultaneous Bonferroni intervals use z(.025/7) = 2.69. The simultaneous T², or shadows of the ellipsoid, use χ²₇(.05) = 14.07.

                    One-at-a-time       Bonferroni          Shadow of ellipsoid
    Variable        Lower    Upper      Lower    Upper      Lower    Upper
    x₁ = melody     26.95    29.25      26.52    29.68      25.90    30.30
    x₂ = harmony    25.43    27.77      24.99    28.21      24.36    28.84
    x₃ = tempo      34.64    36.16      34.35    36.45      33.94    36.86
    x₄ = meter      33.18    35.22      32.79    35.61      32.24    36.16
    x₅ = phrasing   22.85    24.35      22.57    24.63      22.16    25.04
    x₆ = balance    21.21    22.79      20.92    23.08      20.50    23.50
    x₇ = style      21.89    23.51      21.59    23.81      21.16    24.24

Although the sample size may be large, some statisticians prefer to retain the F- and t-based percentiles rather than use the chi-square or standard normal-based percentiles. The latter constants are the infinite-sample-size limits of the former. The F and t percentiles produce larger intervals and, hence, are more conservative. Table 5.7 gives the individual, Bonferroni, and F-based (or shadow of the confidence ellipsoid) intervals for the musical aptitude data. Comparing Table 5.7 with Table 5.6, we see that all of the intervals in Table 5.7 are larger. However, with n = 96, the differences are typically in the third, or tenths, digit.

TABLE 5.7  THE 95% INDIVIDUAL, BONFERRONI, AND T²-INTERVALS FOR THE MUSICAL APTITUDE DATA

The one-at-a-time confidence intervals use t₉₅(.025) = 1.99. The simultaneous Bonferroni intervals use t₉₅(.025/7) = 2.75. The simultaneous T², or shadows of the ellipsoid, use F₇,₈₉(.05) = 2.11.

                    One-at-a-time       Bonferroni          Shadow of ellipsoid
    Variable        Lower    Upper      Lower    Upper      Lower    Upper
    x₁ = melody     26.93    29.27      26.48    29.72      25.76    30.44
    x₂ = harmony    25.41    27.79      24.96    28.24      24.23    28.97
    x₃ = tempo      34.63    36.17      34.33    36.47      33.85    36.95
    x₄ = meter      33.16    35.24      32.76    35.64      32.12    36.28
    x₅ = phrasing   22.84    24.36      22.54    24.66      22.07    25.13
    x₆ = balance    21.20    22.80      20.90    23.10      20.41    23.59
    x₇ = style      21.88    23.52      21.57    23.83      21.07    24.33
5.6 MULTIVARIATE QUALITY CONTROL CHARTS
To improve the quality of goods and services, data need to be examined for causes of variation. When a manufacturing process is continuously producing items, or when we are monitoring activities of a service, data should be collected to evaluate the capabilities and stability of the process. When a process is stable, the variation is produced by common causes that are always present, and no one cause is a major source of variation.

The purpose of any control chart is to identify occurrences of special causes of variation that come from outside of the usual process. These causes of variation often indicate a need for a timely repair, but they can also suggest improvements to the process. Control charts make the variation visible and allow one to distinguish common from special causes of variation.

A control chart typically consists of data plotted in time order and horizontal lines, called control limits, that indicate the amount of variation due to common causes. One useful control chart is the X̄-chart (read X-bar chart). To create an X̄-chart,

1. Plot the individual observations or sample means in time order.
2. Create and plot the centerline x̄, the sample mean of all of the observations.
3. Calculate and plot the control limits given by

    Upper control limit (UCL) = x̄ + 3(standard deviation)
    Lower control limit (LCL) = x̄ − 3(standard deviation)

The standard deviation in the control limits is the estimated standard deviation of the observations being plotted. For single observations, it is often the sample standard deviation. If the means of subsamples of size m are plotted, then the standard deviation is the sample standard deviation divided by √m. The control limits of plus and minus three standard deviations are chosen so that there is a very small chance, assuming normally distributed data, of falsely signaling an out-of-control observation, that is, an observation suggesting a special cause of variation.

Example 5.8 (Creating a univariate control chart). The Madison, Wisconsin, police department regularly monitors many of its activities as part of an ongoing quality improvement program. Table 5.8 gives the data on five different kinds of overtime hours. Each observation represents a total for 12 pay periods, or about half a year.

We examine the stability of the legal appearances overtime hours. A computer calculation gives x̄₁ = 3558. Since individual values will be plotted, x̄₁ is plotted as the centerline. Also, the sample standard deviation is √s₁₁ = 607, and the control limits are

    UCL = x̄₁ + 3√s₁₁ = 3558 + 3(607) = 5379
    LCL = x̄₁ − 3√s₁₁ = 3558 − 3(607) = 1737

The data, along with the centerline and control limits, are plotted as an X̄-chart in Figure 5.5.
TABLE 5.8  FIVE TYPES OF OVERTIME HOURS FOR THE MADISON, WISCONSIN, POLICE DEPARTMENT

    x₁ Legal          x₂ Extraordinary   x₃ Holdover   x₄ COA¹    x₅ Meeting
    Appearances Hrs   Event Hours        Hours         Hours      Hours
    3387              2200               1181          14,861       236
    3109               875               3532          11,367       310
    2670               957               2502          13,329      1182
    3125              1758               4510          12,328      1208
    3469               868               3032          12,847      1385
    3120               398               2130          13,979      1053
    3671              1603               1982          13,528      1046
    4531               523               4675          12,699      1100
    3678              2034               2354          13,534      1349
    3238              1136               4606          11,609      1150
    3135              5326               3044          14,189      1216
    5217              1658               3340          15,052       660
    3728              1945               2111          12,236       299
    3506               344               1291          15,482       206
    3824               807               1365          14,900       239
    3516              1223               1175          15,078       161

    ¹Compensatory overtime allowed.
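Using the x₁ column of Table 5.8, the control limits of Example 5.8 can be reproduced. A minimal sketch; the limits differ slightly from the text's 5379 and 1737 because the text rounds x̄₁ and √s₁₁ before combining them.

```python
from statistics import mean, stdev

# X-bar chart limits for x1 = legal appearances overtime hours
# (first column of Table 5.8).
x1 = [3387, 3109, 2670, 3125, 3469, 3120, 3671, 4531,
      3678, 3238, 3135, 5217, 3728, 3506, 3824, 3516]

center = mean(x1)
s = stdev(x1)                 # sample standard deviation (divisor n - 1)
ucl = center + 3 * s
lcl = center - 3 * s
out = [v for v in x1 if not lcl <= v <= ucl]
print(round(center), round(s), round(ucl), round(lcl), out)
```

No observation falls outside the three-sigma limits, matching the conclusion that the legal appearances hours are stable.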
[Figure 5.5 The X̄-chart for x₁ = legal appearances overtime hours, plotted against observation number, with centerline x̄₁ = 3558, UCL = 5379, and LCL = 1737.]
The legal appearances overtime hours are stable over the period in which the data were collected. The variation in overtime hours appears to be due to common causes, so no special-cause variation is indicated. ■

With more than one important characteristic, a multivariate approach should be used to monitor process stability. Such an approach can account for correlations between characteristics and will control the overall probability of falsely signaling a special cause of variation when one is not present. High correlations among the variables can make it impossible to assess the overall error rate that is implied by a large number of univariate charts.

The two most common multivariate charts are (i) the ellipse format chart and (ii) the T²-chart.

Two cases that arise in practice need to be treated differently:

1. Monitoring the stability of a given sample of multivariate observations
2. Setting a control region for future observations

Initially, we consider the use of multivariate control procedures for a sample of multivariate observations x₁, x₂, ..., xₙ. Later, we discuss these procedures when the observations are subgroup means.

Charts for Monitoring a Sample of Individual Multivariate Observations for Stability

We assume that X₁, X₂, ..., Xₙ are independently distributed as Nₚ(μ, Σ). By Result 4.8,

    Xⱼ − X̄ = (1 − 1/n)Xⱼ − (1/n)X₁ − ··· − (1/n)Xⱼ₋₁ − (1/n)Xⱼ₊₁ − ··· − (1/n)Xₙ

has mean 0 and

    Cov(Xⱼ − X̄) = (1 − 1/n)² Σ + (n − 1)n⁻² Σ = ((n − 1)/n) Σ

Each Xⱼ − X̄ has a normal distribution, but Xⱼ − X̄ is not independent of the sample covariance matrix S. However, to set control limits, we approximate that (Xⱼ − X̄)′S⁻¹(Xⱼ − X̄) has a chi-square distribution.

Ellipse Format Chart. The ellipse format chart for a bivariate control region is the more intuitive of the charts, but its approach is limited to two variables. The two characteristics on the jth unit are plotted as a pair (xⱼ₁, xⱼ₂). The 95% quality ellipse consists of all x that satisfy

    (x − x̄)′S⁻¹(x − x̄) ≤ χ²₂(.05)                                                   (5-32)

Example 5.9 (An ellipse format chart for overtime hours). Let us refer to Example 5.8 and create a quality ellipse for the pair of overtime characteristics (legal appearances, extraordinary event) hours. A computer calculation gives
    x̄ = [ 3558 ]     and     S = [  367,884.7    −72,093.8 ]
        [ 1478 ]                 [  −72,093.8  1,399,053.1 ]

We illustrate the quality ellipse format chart using the 99% ellipse, which consists of all x that satisfy (x − x̄)′S⁻¹(x − x̄) ≤ χ²₂(.01). Here p = 2, so χ²₂(.01) = 9.21, and the ellipse becomes

    (s₁₁s₂₂ / (s₁₁s₂₂ − s₁₂²)) [ (x₁ − x̄₁)²/s₁₁ − 2s₁₂(x₁ − x̄₁)(x₂ − x̄₂)/(s₁₁s₂₂) + (x₂ − x̄₂)²/s₂₂ ]

    = (367884.7 × 1399053.1 / (367884.7 × 1399053.1 − (−72093.8)²))
      × [ (x₁ − 3558)²/367884.7 − 2(−72093.8)(x₁ − 3558)(x₂ − 1478)/(367884.7 × 1399053.1)
          + (x₂ − 1478)²/1399053.1 ]  ≤  9.21

This ellipse format chart is graphed, along with the pairs of data, in Figure 5.6.

[Figure 5.6 The quality control 99% ellipse for legal appearances and extraordinary event overtime; an arrow marks one point well outside the ellipse.]
Notice that one point, indicated with an arrow, is definitely outside of the ellipse. When a point is out of the control region, individual X̄-charts are constructed. The X̄-chart for x₁ was given in Figure 5.5; that for x₂ is given in Figure 5.7.

When the lower control limit is less than zero for data that must be nonnegative, it is generally set to zero. The LCL = 0 limit is shown by the dashed line in Figure 5.7.
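The 99% ellipse membership check of (5-32) can be carried out numerically; the squared generalized distance computed below is the same quantity plotted on the T²-chart later in this section. A sketch using the x₁ and x₂ columns of Table 5.8 and the quoted cutoff χ²₂(.01) = 9.21:

```python
from statistics import mean

# Check each (x1, x2) pair of Table 5.8 against the 99% quality
# ellipse (5-32): (x - xbar)' S^{-1} (x - xbar) <= chi2_2(.01) = 9.21.
x1 = [3387, 3109, 2670, 3125, 3469, 3120, 3671, 4531,
      3678, 3238, 3135, 5217, 3728, 3506, 3824, 3516]
x2 = [2200, 875, 957, 1758, 868, 398, 1603, 523,
      2034, 1136, 5326, 1658, 1945, 344, 807, 1223]
n = len(x1)
m1, m2 = mean(x1), mean(x2)

# sample covariance matrix entries (divisor n - 1)
s11 = sum((a - m1) ** 2 for a in x1) / (n - 1)
s22 = sum((b - m2) ** 2 for b in x2) / (n - 1)
s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (n - 1)
det = s11 * s22 - s12 ** 2

def dist2(a, b):
    """Squared generalized distance (x - xbar)' S^{-1} (x - xbar)."""
    d1, d2 = a - m1, b - m2
    return (s22 * d1 ** 2 - 2 * s12 * d1 * d2 + s11 * d2 ** 2) / det

dists = [dist2(a, b) for a, b in zip(x1, x2)]
outside = [i + 1 for i, d in enumerate(dists) if d > 9.21]
print(outside)  # period(s) falling outside the 99% ellipse
```

Only period 11, the observation with the very large extraordinary event overtime, falls outside the ellipse.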
[Figure 5.7 The X̄-chart for x₂ = extraordinary event hours, plotted against observation number, with UCL = 5027; the negative LCL = −2071 is replaced by the dashed line LCL = 0.]
Was there a special cause of the single point for extraordinary event overtime that is outside the upper control limit in Figure 5.7? During this period, the United States bombed a foreign capital, and students at Madison were protesting. A majority of the extraordinary overtime was used in that four-week period. Although, by its very definition, extraordinary overtime occurs only when special events occur and is therefore unpredictable, it still has a certain stability. ■

T²-Chart. A T²-chart can be applied to a large number of characteristics. Unlike the ellipse format, it is not limited to two variables. Moreover, the points are displayed in time order rather than as a scatter plot, and this makes patterns and trends visible.

For the jth point, we calculate the T²-statistic

    T²ⱼ = (xⱼ − x̄)′S⁻¹(xⱼ − x̄)                                                      (5-33)

We then plot the T²-values on a time axis. The lower control limit is zero, and we use the upper control limit

    UCL = χ²ₚ(.05)    or, sometimes,    UCL = χ²ₚ(.01)

There is no centerline in the T²-chart. Notice that the T²-statistic is the same as the quantity d²ⱼ used to test normality in Section 4.6.

Example 5.10 (A T²-chart for overtime hours). Using the police department data in Example 5.8, we construct a T²-plot based on the two variables x₁ = legal appearances hours and x₂ = extraordinary event hours. T²-charts with more than two variables are considered in Exercise 5.26. We take α = .01 to be consistent with the ellipse format chart in Example 5.9.

The T²-chart in Figure 5.8 reveals that the pair (legal appearances, extraordinary event) hours for period 11 is out of control. Further investigation, as in Example 5.9, confirms that this is due to the large value of extraordinary event overtime during that period. ■

[Figure 5.8 The T²-chart for legal appearances hours and extraordinary event hours, α = .01; the point for period 11 plots above the control limit.]
When the multivariate T²-chart signals that the jth unit is out of control, it should be determined which variables are responsible. A modified region based on Bonferroni intervals is frequently chosen for this purpose. The kth variable is out of control if xⱼₖ does not lie in the interval

    ( x̄ₖ − t_{n−1}(.005/p) √sₖₖ ,   x̄ₖ + t_{n−1}(.005/p) √sₖₖ )

where p is the total number of measured variables.

Example 5.11 (Control of robotic welders-more than T² needed). The assembly of a driveshaft for an automobile requires the circle welding of tube yokes to a tube. The inputs to the automated welding machines must be controlled to be within certain operating limits where a machine produces welds of good quality. In order to control the process, one process engineer measured four critical variables:

    X₁ = Voltage (volts)
    X₂ = Current (amps)
    X₃ = Feed speed (in/min)
    X₄ = (inert) Gas flow (cfm)

Table 5.9 gives the values of these variables at 5-second intervals.
TABLE 5.9  WELDER DATA
    Case   Voltage (X₁)   Current (X₂)   Feed speed (X₃)   Gas flow (X₄)
      1        23.0           276            289.6             51.0
      2        22.0           281            289.0             51.7
      3        22.8           270            288.2             51.3
      4        22.1           278            288.0             52.3
      5        22.5           275            288.0             53.0
      6        22.2           273            288.0             51.0
      7        22.0           275            290.0             53.0
      8        22.1           268            289.0             54.0
      9        22.5           277            289.0             52.0
     10        22.5           278            289.0             52.0
     11        22.3           269            287.0             54.0
     12        21.8           274            287.6             52.0
     13        22.3           270            288.4             51.0
     14        22.2           273            290.2             51.3
     15        22.1           274            286.0             51.0
     16        22.1           277            287.0             52.0
     17        21.8           277            287.0             51.0
     18        22.6           276            290.0             51.0
     19        22.3           278            287.0             51.7
     20        23.0           266            289.1             51.0
     21        22.9           271            288.3             51.0
     22        21.3           274            289.0             52.0
     23        21.8           280            290.0             52.0
     24        22.0           268            288.3             51.0
     25        22.8           269            288.7             52.0
     26        22.0           264            290.0             51.0
     27        22.5           273            288.6             52.0
     28        22.2           269            288.2             52.0
     29        22.6           273            286.0             52.0
     30        21.7           283            290.0             52.7
     31        21.9           273            288.7             55.3
     32        22.3           264            287.0             52.0
     33        22.2           263            288.0             52.0
     34        22.3           266            288.6             51.7
     35        22.0           263            288.0             51.7
     36        22.8           272            289.0             52.3
     37        22.0           277            287.7             53.3
     38        22.7           272            289.0             52.0
     39        22.6           274            287.2             52.7
     40        22.7           270            290.0             51.0

    Source: Data courtesy of Mark Abbotoy.
The normal assumption is reasonable for most variables, but we take the natural logarithm of gas flow. In addition, there is no appreciable serial correlation for successive observations on each variable.

A T²-chart for the four welding variables is given in Figure 5.9. The dotted line is the 95% limit and the solid line is the 99% limit. Using the 99% limit, no points are out of control, but case 31 is outside the 95% limit.

What do the quality control ellipses (ellipse format charts) show for pairs of variables? Most of the pairs of variables are in control. However, the 99% quality ellipse for ln(gas flow) and voltage, shown in Figure 5.10, reveals that case 31 is out of control, and that this is due to an unusually large volume of gas flow. The univariate X̄-chart for ln(gas flow), in Figure 5.11, also shows that this point is outside the three sigma limits. It appears that gas flow was reset at the target value for case 32. All the other univariate X̄-charts have all points within their three sigma control limits.

[Figure 5.9 The T²-chart for the welding data with 95% and 99% limits.]
[Figure 5.10 The 99% quality control ellipse for ln(gas flow) and voltage; case 31 plots outside the ellipse.]
[Figure 5.11 The univariate X̄-chart for ln(gas flow), plotted against case number, with centerline 3.951, UCL = 4.005, and LCL = 3.896.]
In this example, a shift in a single variable was masked with 99% limits, or almost masked (with 95% limits), by being combined into a single T²-value. ■

Control Regions for Future Individual Observations

The goal now is to use data x₁, x₂, ..., xₙ, collected when a process is stable, to set a control region for a future observation x or future observations. The region in which a future observation is expected to lie is called a forecast, or prediction, region. If the process is stable, we take the observations to be independently distributed as Nₚ(μ, Σ). Because these regions are of more general importance than just for monitoring quality, we give the basic distribution theory as Result 5.6.

Result 5.6. Let X₁, X₂, ..., Xₙ be independently distributed as Nₚ(μ, Σ), and let X be a future observation from the same distribution. Then

    T² = (n/(n + 1)) (X − X̄)′S⁻¹(X − X̄)   is distributed as   ((n − 1)p/(n − p)) F_{p,n−p}

and a 100(1 − α)% p-dimensional prediction ellipsoid is given by all x satisfying

    (x − x̄)′S⁻¹(x − x̄) ≤ ((n² − 1)p / (n(n − p))) F_{p,n−p}(α)

Proof. We first note that X − X̄ has mean 0. Since X is a future observation, X and X̄ are independent, so

    Cov(X − X̄) = Cov(X) + Cov(X̄) = Σ + (1/n)Σ = ((n + 1)/n) Σ

and, by Result 4.8, √(n/(n + 1)) (X − X̄) is distributed as Nₚ(0, Σ).
Now, (n/(n + 1)) (X − X̄)′S⁻¹(X − X̄) combines a multivariate normal, Nₚ(0, Σ), random vector and an independent Wishart, W_{p,n−1}(Σ), random matrix in the form

    ( multivariate normal )′ ( Wishart random matrix )⁻¹ ( multivariate normal )
    (   random vector     )  (        d.f.           )   (   random vector    )

and so has the scaled F-distribution claimed, according to (5-8) and the discussion on pages 212-213. The constant for the ellipsoid follows from (5-6). ■

Note that the prediction region in Result 5.6 for a future observed value x is an ellipsoid. It is centered at the initial sample mean x̄, and its axes are determined by the eigenvectors of S. Since

    P[ (X − X̄)′S⁻¹(X − X̄) ≤ ((n² − 1)p / (n(n − p))) F_{p,n−p}(α) ] = 1 − α

before any new observations are taken, the probability that X will fall in the prediction ellipse is 1 − α.

Keep in mind that the current observations must be stable before they can be used to determine control regions for future observations. Based on Result 5.6, we obtain the two charts for future observations.

Control Ellipse for Future Observations. With p = 2, the 95% prediction ellipse in Result 5.6 specializes to

    (x − x̄)′S⁻¹(x − x̄) ≤ ((n² − 1)2 / (n(n − 2))) F_{2,n−2}(.05)                     (5-34)

Any future observation x is declared to be out of control if it falls outside of the control ellipse.

Example 5.12 (A control ellipse for future overtime hours). In Example 5.9, we checked the stability of legal appearances and extraordinary event overtime hours. Let us use these data to determine a control region for future pairs of values.

From Example 5.9 and Figure 5.6, we find that the pair of values for period 11 was out of control. We removed this point and determined the new 99% ellipse. All of the remaining points are in control, so they can serve to determine the 95% prediction region just defined for p = 2. This control ellipse is shown in Figure 5.12, along with the initial 15 stable observations.

Any future observation falling in the ellipse is regarded as stable or in control. An observation outside of the ellipse represents a potential out-of-control observation or special-cause variation. ■
[Figure 5.12 The 95% control ellipse for future legal appearances and extraordinary event overtime.]
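The cutoff defining the prediction ellipse in (5-34) is easy to evaluate. A minimal sketch for the situation of Example 5.12, with n = 15 retained stable observations and p = 2; the quantile F₂,₁₃(.05) ≈ 3.81 is a looked-up value, not a constant quoted in this section.

```python
# Prediction-ellipse cutoff (5-34) for a future bivariate observation,
# using n = 15 stable observations as in Example 5.12.
n, p = 15, 2
f_crit = 3.81                      # looked-up F_{2,13}(.05), approximate

cutoff = (n ** 2 - 1) * p / (n * (n - p)) * f_crit
print(round(cutoff, 2))            # compare (x - xbar)'S^{-1}(x - xbar) to this
```

A future pair x is flagged when its squared generalized distance exceeds this cutoff. Note that the cutoff is larger than the χ²₂(.05) = 5.99 limit used for the historical data, reflecting the extra variability of a new observation and the small sample size.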
T²-Chart for Future Observations. For each new observation x, plot

    T² = (n/(n + 1)) (x − x̄)′S⁻¹(x − x̄)

in time order. Set LCL = 0, and take

    UCL = ((n − 1)p / (n − p)) F_{p,n−p}(.05)

Points above the upper control limit suggest that the process in question should be examined to determine whether immediate corrective action is warranted.

Control Charts Based on Subsample Means

It is assumed that each random vector of observations from the process is independently distributed as Nₚ(μ, Σ). We proceed differently when the sampling procedure specifies that m > 1 units be selected, at the same time, from the process. From the first sample, we determine its sample mean X̄₁ and covariance matrix S₁. When the population is normal, these two random quantities are independent.

For a general subsample mean X̄ⱼ, the difference X̄ⱼ − X̿ has a normal distribution with mean 0, where X̿ = (1/n) Σⱼ₌₁ⁿ X̄ⱼ is the mean of the n subsample means, and

    Cov(X̄ⱼ − X̿) = (1 − 1/n)² Cov(X̄ⱼ) + ((n − 1)/n²) Cov(X̄₁) = ((n − 1)/(nm)) Σ
where \bar{\bar{X}} = \frac{1}{n}\sum_{j=1}^{n} \bar{X}_j. As will be described in Section 6.4, the sample covariances from the n subsamples can be combined to give a single estimate (called S_pooled in Chapter 6) of the common covariance \Sigma. This pooled estimate is

S = \frac{1}{n}\left(S_1 + S_2 + \cdots + S_n\right)

Here (nm - n)S is distributed as a Wishart random matrix with nm - n degrees of freedom. Notice that we are estimating \Sigma internally from the data collected in any given period. These estimators are combined to give a single estimator with a large number of degrees of freedom. Consequently,

\frac{nm - n - p + 1}{(nm - n)p} \cdot \frac{nm}{n-1}\,(\bar{X}_j - \bar{\bar{X}})' S^{-1} (\bar{X}_j - \bar{\bar{X}}) \quad \text{is distributed as} \quad F_{p,\,nm-n-p+1}   (5-35)

Ellipse Format Chart. In an analogous fashion to our discussion on individual multivariate observations, the ellipse format chart for pairs of subsample means is

(\bar{x} - \bar{\bar{x}})' S^{-1} (\bar{x} - \bar{\bar{x}}) \le \frac{(n-1)(m-1)2}{m(nm-n-1)}\, F_{2,\,nm-n-1}(.05)   (5-36)

although the right-hand side is usually approximated as \chi_2^2(.05)/m.

Subsample means that correspond to points outside of the control ellipse should be carefully checked for changes in the behavior of the quality characteristics being measured. The interested reader is referred to [9] for additional discussion.

T^2-Chart. To construct a T^2-chart with subsample data and p characteristics, we plot the quantity

T_j^2 = m(\bar{X}_j - \bar{\bar{X}})' S^{-1} (\bar{X}_j - \bar{\bar{X}})

for j = 1, 2, ..., n, where the

UCL = \frac{(n-1)(m-1)p}{nm - n - p + 1}\, F_{p,\,nm-n-p+1}(.05)

The UCL is often approximated as \chi_p^2(.05) when n is large.

Values of T_j^2 that exceed the UCL correspond to potentially out-of-control or special cause variation, which should be checked. (See [9].)
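The T^2-chart computations above can be sketched as follows. The process mean, covariance, and the constants n, m, p below are illustrative assumptions (a simulated stable process), not values taken from the text:

```python
import numpy as np
from scipy.stats import f

# Hypothetical stable process: n subsamples of size m on p characteristics.
rng = np.random.default_rng(0)
n, m, p = 20, 5, 2
data = rng.multivariate_normal([10.0, 5.0], [[4.0, 1.0], [1.0, 2.0]], size=(n, m))

xbar = data.mean(axis=1)        # subsample means, one per period
grand = xbar.mean(axis=0)       # grand mean over the n periods
# pooled covariance: average of the n subsample covariance matrices
S = np.mean([np.cov(data[j], rowvar=False) for j in range(n)], axis=0)

d = xbar - grand
T2 = m * np.einsum('ji,ik,jk->j', d, np.linalg.inv(S), d)   # T2_j for each subsample

# UCL = (n-1)(m-1)p / (nm - n - p + 1) * F_{p, nm-n-p+1}(.05)
UCL = (n - 1) * (m - 1) * p / (n * m - n - p + 1) * f.ppf(0.95, p, n * m - n - p + 1)
flagged = np.where(T2 > UCL)[0]     # periods warranting investigation
```

Plotting T2 against time with the horizontal line at UCL gives the chart itself; under stable operation roughly 5% of points are expected to exceed the limit by chance.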
Section 5.6
M u ltivariate Q u a l ity Control Charts
251
Control Regions for Future Subsample Observations

Once data are collected from the stable operation of a process, they can be used to set control limits for future observed subsample means.

If \bar{X} is a future subsample mean, then \bar{X} - \bar{\bar{X}} has a multivariate normal distribution with mean 0 and

\mathrm{Cov}(\bar{X} - \bar{\bar{X}}) = \mathrm{Cov}(\bar{X}) + \frac{1}{n}\,\mathrm{Cov}(\bar{X}_1) = \frac{(n+1)}{nm}\,\Sigma

Consequently,

\frac{nm - n - p + 1}{(nm - n)p} \cdot \frac{nm}{n+1}\,(\bar{X} - \bar{\bar{X}})' S^{-1} (\bar{X} - \bar{\bar{X}}) \quad \text{is distributed as} \quad F_{p,\,nm-n-p+1}

Control Ellipse for Future Subsample Means. The prediction ellipse for a future subsample mean for p = 2 characteristics is defined by the set of all \bar{x} such that

(\bar{x} - \bar{\bar{x}})' S^{-1} (\bar{x} - \bar{\bar{x}}) \le \frac{(n+1)(m-1)2}{m(nm-n-1)}\, F_{2,\,nm-n-1}(.05)   (5-37)

where, again, the right-hand side is usually approximated as \chi_2^2(.05)/m.

T^2-Chart for Future Subsample Means. As before, we bring n/(n+1) into the control limit and plot the quantity

T^2 = m(\bar{X} - \bar{\bar{X}})' S^{-1} (\bar{X} - \bar{\bar{X}})

for future sample means in chronological order. The upper control limit is then

UCL = \frac{(n+1)(m-1)p}{nm - n - p + 1}\, F_{p,\,nm-n-p+1}(.05)

The UCL is often approximated as \chi_p^2(.05) when n is large.

Points outside of the prediction ellipse or above the UCL suggest that the current values of the quality characteristics are different in some way from those of the previous stable process. This may be good or bad, but almost certainly warrants a careful search for the reasons for the change.
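To get a feel for the large-n approximation mentioned above, one can compare the exact F-based control limit for a future subsample mean with \chi_p^2(.05). The constants n, m, and p below are arbitrary illustrative choices:

```python
from scipy.stats import chi2, f

# Hypothetical constants: 25 historical subsamples of size 4, p = 3 characteristics.
n, m, p = 25, 4, 3

# Exact F-based limit for a future subsample mean ...
ucl_exact = (n + 1) * (m - 1) * p / (n * m - n - p + 1) * f.ppf(0.95, p, n * m - n - p + 1)
# ... versus the chi-square approximation quoted in the text
ucl_approx = chi2.ppf(0.95, p)
```

For moderate n the exact limit is somewhat wider than the chi-square value, so using the approximation with small n gives a chart that flags future means slightly too often.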
5.7 INFERENCES ABOUT MEAN VECTORS WHEN SOME OBSERVATIONS ARE MISSING
Often, some components of a vector observation are unavailable. This may occur because of a breakdown in the recording equipment or because of the unwillingness of a respondent to answer a particular item on a survey questionnaire. The best way to handle incomplete observations, or missing values, depends, to a large extent, on the experimental context. If the pattern of missing values is closely tied to the value of the response, such as people with extremely high incomes who refuse to respond in a survey on salaries, subsequent inferences may be seriously biased. To date, no statistical techniques have been developed for these cases. However, we are able to treat situations where data are missing at random, that is, cases in which the chance mechanism responsible for the missing values is not influenced by the value of the variables.

A general approach for computing maximum likelihood estimates from incomplete data is given by Dempster, Laird, and Rubin [5]. Their technique, called the EM algorithm, consists of an iterative calculation involving two steps. We call them the prediction and estimation steps:

1. Prediction step. Given some estimate \tilde{\theta} of the unknown parameters, predict the contribution of any missing observation to the (complete-data) sufficient statistics.
2. Estimation step. Use the predicted sufficient statistics to compute a revised estimate of the parameters.

The calculation cycles from one step to the other, until the revised estimates do not differ appreciably from the estimate obtained in the previous iteration.

When the observations X_1, X_2, ..., X_n are a random sample from a p-variate normal population, the prediction-estimation algorithm is based on the complete-data sufficient statistics [see (4-21)]

T_1 = \sum_{j=1}^{n} X_j = n\bar{X}

and

T_2 = \sum_{j=1}^{n} X_j X_j' = (n-1)S + n\bar{X}\bar{X}'

In this case, the algorithm proceeds as follows: We assume that the population mean and variance, \mu and \Sigma, respectively, are unknown and must be estimated.

Prediction step. For each vector x_j with missing values, let x_j^{(1)} denote the missing components and x_j^{(2)} denote those components which are available. Thus, x_j' = [x_j^{(1)'}, x_j^{(2)'}].

Given estimates \tilde{\mu} and \tilde{\Sigma} from the estimation step, use the mean of the conditional normal distribution of x^{(1)}, given x^{(2)}, to estimate the missing values. That is,¹

¹ If all the components of x_j are missing, set \tilde{x}_j = \tilde{\mu} and \widetilde{x_j x_j'} = \tilde{\Sigma} + \tilde{\mu}\tilde{\mu}'.
\tilde{x}_j^{(1)} = E(X_j^{(1)} \mid x_j^{(2)}; \tilde{\mu}, \tilde{\Sigma}) = \tilde{\mu}^{(1)} + \tilde{\Sigma}_{12}\tilde{\Sigma}_{22}^{-1}(x_j^{(2)} - \tilde{\mu}^{(2)})   (5-38)

estimates the contribution of x_j^{(1)} to T_1.

Next, the predicted contribution of x_j^{(1)} to T_2 is

\widetilde{x_j^{(1)} x_j^{(1)'}} = E(X_j^{(1)} X_j^{(1)'} \mid x_j^{(2)}; \tilde{\mu}, \tilde{\Sigma}) = \tilde{\Sigma}_{11} - \tilde{\Sigma}_{12}\tilde{\Sigma}_{22}^{-1}\tilde{\Sigma}_{21} + \tilde{x}_j^{(1)}\tilde{x}_j^{(1)'}   (5-39)

and

\widetilde{x_j^{(1)} x_j^{(2)'}} = \tilde{x}_j^{(1)} x_j^{(2)'}

The contributions in (5-38) and (5-39) are summed over all x_j with missing components. The results are combined with the sample data to yield \tilde{T}_1 and \tilde{T}_2.

Estimation step. Compute the revised maximum likelihood estimates (see Result 4.11):

\tilde{\mu} = \frac{\tilde{T}_1}{n}, \qquad \tilde{\Sigma} = \frac{1}{n}\tilde{T}_2 - \tilde{\mu}\tilde{\mu}'   (5-40)

We illustrate the computational aspects of the prediction-estimation algorithm in Example 5.13.

Example 5.13 (Illustrating the EM algorithm)

Estimate the normal population mean \mu and covariance \Sigma using the incomplete data set

X = [ —  0  3
      7  2  6
      5  1  2
      —  —  5 ]

Here n = 4, p = 3, and parts of observation vectors x_1 and x_4 are missing.

We obtain the initial sample averages

\tilde{\mu}_1 = \frac{7 + 5}{2} = 6, \qquad \tilde{\mu}_2 = \frac{0 + 2 + 1}{3} = 1, \qquad \tilde{\mu}_3 = \frac{3 + 6 + 2 + 5}{4} = 4

from the available observations. Substituting these averages for any missing values, so that \tilde{x}_{11} = 6, for example, we can obtain initial covariance estimates. We shall construct these estimates using the divisor n because the algorithm eventually produces the maximum likelihood estimate \hat{\Sigma}. Thus,

\tilde{\sigma}_{11} = \frac{(6-6)^2 + (7-6)^2 + (5-6)^2 + (6-6)^2}{4} = \frac{1}{2}

\tilde{\sigma}_{22} = \frac{1}{2}, \qquad \tilde{\sigma}_{33} = \frac{5}{2}
\tilde{\sigma}_{12} = \frac{(6-6)(0-1) + (7-6)(2-1) + (5-6)(1-1) + (6-6)(1-1)}{4} = \frac{1}{4}

and, similarly, \tilde{\sigma}_{23} = \frac{3}{4} and \tilde{\sigma}_{13} = 1.

The prediction step consists of using the initial estimates \tilde{\mu} and \tilde{\Sigma} to predict the contributions of the missing values to the sufficient statistics T_1 and T_2. [See (5-38) and (5-39).]

The first component of x_1 is missing, so we partition \tilde{\mu} and \tilde{\Sigma} as

\tilde{\mu} = \begin{bmatrix} \tilde{\mu}_1 \\ \tilde{\mu}^{(2)} \end{bmatrix}, \qquad \tilde{\Sigma} = \begin{bmatrix} \tilde{\sigma}_{11} & \tilde{\Sigma}_{12} \\ \tilde{\Sigma}_{21} & \tilde{\Sigma}_{22} \end{bmatrix}

and predict

\tilde{x}_{11} = \tilde{\mu}_1 + \tilde{\Sigma}_{12}\tilde{\Sigma}_{22}^{-1} \begin{bmatrix} x_{12} - \tilde{\mu}_2 \\ x_{13} - \tilde{\mu}_3 \end{bmatrix} = 6 + [1/4, \; 1] \begin{bmatrix} 1/2 & 3/4 \\ 3/4 & 5/2 \end{bmatrix}^{-1} \begin{bmatrix} 0 - 1 \\ 3 - 4 \end{bmatrix} = 5.73

\widetilde{x_{11}^2} = \tilde{\sigma}_{11} - \tilde{\Sigma}_{12}\tilde{\Sigma}_{22}^{-1}\tilde{\Sigma}_{21} + \tilde{x}_{11}^2 = \frac{1}{2} - [1/4, \; 1] \begin{bmatrix} 1/2 & 3/4 \\ 3/4 & 5/2 \end{bmatrix}^{-1} \begin{bmatrix} 1/4 \\ 1 \end{bmatrix} + (5.73)^2 = 32.99

\widetilde{x_{11}[x_{12}, x_{13}]} = \tilde{x}_{11}[x_{12}, x_{13}] = 5.73\,[0, \; 3] = [0, \; 17.18]

For the two missing components of x_4, we partition \tilde{\mu} and \tilde{\Sigma} as

\tilde{\mu} = \begin{bmatrix} \tilde{\mu}^{(1)} \\ \tilde{\mu}_3 \end{bmatrix}, \qquad \tilde{\Sigma} = \begin{bmatrix} \tilde{\Sigma}_{11} & \tilde{\Sigma}_{12} \\ \tilde{\Sigma}_{21} & \tilde{\sigma}_{33} \end{bmatrix}

and predict

\begin{bmatrix} \tilde{x}_{41} \\ \tilde{x}_{42} \end{bmatrix} = E\left( \begin{bmatrix} X_{41} \\ X_{42} \end{bmatrix} \,\Big|\, x_{43} = 5; \tilde{\mu}, \tilde{\Sigma} \right) = \begin{bmatrix} 6 \\ 1 \end{bmatrix} + \begin{bmatrix} 1 \\ 3/4 \end{bmatrix} \left(\frac{5}{2}\right)^{-1} (5 - 4) = \begin{bmatrix} 6.4 \\ 1.3 \end{bmatrix}

for the contribution to T_1. Also, from (5-39),
\begin{bmatrix} \widetilde{x_{41}^2} & \widetilde{x_{41}x_{42}} \\ \widetilde{x_{42}x_{41}} & \widetilde{x_{42}^2} \end{bmatrix} = \tilde{\Sigma}_{11} - \tilde{\Sigma}_{12}\tilde{\Sigma}_{22}^{-1}\tilde{\Sigma}_{21} + \begin{bmatrix} \tilde{x}_{41} \\ \tilde{x}_{42} \end{bmatrix} [\tilde{x}_{41}, \; \tilde{x}_{42}] = \begin{bmatrix} 41.06 & 8.27 \\ 8.27 & 1.97 \end{bmatrix}

and

\begin{bmatrix} \widetilde{x_{41}x_{43}} \\ \widetilde{x_{42}x_{43}} \end{bmatrix} = \begin{bmatrix} \tilde{x}_{41} \\ \tilde{x}_{42} \end{bmatrix} x_{43} = \begin{bmatrix} 6.4 \\ 1.3 \end{bmatrix} 5 = \begin{bmatrix} 32 \\ 6.5 \end{bmatrix}

are the contributions to T_2. Thus, the predicted complete-data sufficient statistics are

\tilde{T}_1 = \begin{bmatrix} \tilde{x}_{11} + x_{21} + x_{31} + \tilde{x}_{41} \\ x_{12} + x_{22} + x_{32} + \tilde{x}_{42} \\ x_{13} + x_{23} + x_{33} + x_{43} \end{bmatrix} = \begin{bmatrix} 5.73 + 7 + 5 + 6.4 \\ 0 + 2 + 1 + 1.3 \\ 3 + 6 + 2 + 5 \end{bmatrix} = \begin{bmatrix} 24.13 \\ 4.30 \\ 16.00 \end{bmatrix}

\tilde{T}_2 = \begin{bmatrix} 32.99 + 7^2 + 5^2 + 41.06 & 0 + 7(2) + 5(1) + 8.27 & 17.18 + 7(6) + 5(2) + 32 \\ \cdot & 0^2 + 2^2 + 1^2 + 1.97 & 0(3) + 2(6) + 1(2) + 6.5 \\ \cdot & \cdot & 3^2 + 6^2 + 2^2 + 5^2 \end{bmatrix} = \begin{bmatrix} 148.05 & 27.27 & 101.18 \\ 27.27 & 6.97 & 20.50 \\ 101.18 & 20.50 & 74.00 \end{bmatrix}

This completes one prediction step. The next estimation step, using (5-40), provides the revised estimates²

\tilde{\mu} = \frac{1}{n}\tilde{T}_1 = \frac{1}{4} \begin{bmatrix} 24.13 \\ 4.30 \\ 16.00 \end{bmatrix} = \begin{bmatrix} 6.03 \\ 1.08 \\ 4.00 \end{bmatrix}

\tilde{\Sigma} = \frac{1}{n}\tilde{T}_2 - \tilde{\mu}\tilde{\mu}' = \frac{1}{4} \begin{bmatrix} 148.05 & 27.27 & 101.18 \\ 27.27 & 6.97 & 20.50 \\ 101.18 & 20.50 & 74.00 \end{bmatrix} - \begin{bmatrix} 6.03 \\ 1.08 \\ 4.00 \end{bmatrix} [6.03, \; 1.08, \; 4.00] = \begin{bmatrix} .61 & .33 & 1.17 \\ .33 & .59 & .83 \\ 1.17 & .83 & 2.50 \end{bmatrix}

Note that \tilde{\sigma}_{11} and \tilde{\sigma}_{22} are larger than the corresponding initial estimates obtained by replacing the missing observations on the first and second variables by the sample means of the remaining values.

² The final entries in \tilde{\Sigma} are exact to two decimal places.
The third variance estimate \tilde{\sigma}_{33} remains unchanged, because it is not affected by the missing components.

The iteration between the prediction and estimation steps continues until the elements of \tilde{\mu} and \tilde{\Sigma} remain essentially unchanged. Calculations of this sort are easily handled with a computer.

Once final estimates \hat{\mu} and \hat{\Sigma} are obtained and relatively few missing components occur in X, it seems reasonable to treat

all \mu such that n(\hat{\mu} - \mu)' \hat{\Sigma}^{-1} (\hat{\mu} - \mu) \le \chi_p^2(\alpha)   (5-41)

as an approximate 100(1 - \alpha)% confidence ellipsoid. The simultaneous confidence statements would then follow as in Section 5.5, but with \bar{x} replaced by \hat{\mu} and S replaced by \hat{\Sigma}.

Caution. The prediction-estimation algorithm we discussed is developed on the basis that component observations are missing at random. If missing values are related to the response levels, then handling the missing values as suggested may introduce serious biases into the estimation procedures. Typically, missing values are related to the responses being measured. Consequently, we must be dubious of any computational scheme that fills in values as if they were lost at random. When more than a few values are missing, it is imperative that the investigator search for the systematic causes that created them.
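The prediction-estimation cycle of (5-38) through (5-40) can be sketched in a few lines; running one cycle on the incomplete data of Example 5.13 reproduces the revised estimates reported there (the book rounds \tilde{\mu}_2 = 1.075 up to 1.08):

```python
import numpy as np

# Incomplete data of Example 5.13; np.nan marks a missing component.
X = np.array([[np.nan, 0.0, 3.0],
              [7.0, 2.0, 6.0],
              [5.0, 1.0, 2.0],
              [np.nan, np.nan, 5.0]])
n, p = X.shape

# Initial estimates: means of the available values, and a covariance
# matrix (divisor n) computed after filling missing cells with those means.
mu = np.nanmean(X, axis=0)
filled = np.where(np.isnan(X), mu, X)
sigma = (filled - mu).T @ (filled - mu) / n

for _ in range(1):                      # one prediction-estimation cycle
    T1 = np.zeros(p)
    T2 = np.zeros((p, p))
    for x in X:                         # prediction step, (5-38) and (5-39)
        miss = np.isnan(x)
        obs = ~miss
        xt = x.copy()
        C = np.zeros((p, p))            # conditional covariance of the missing part
        if miss.any():
            S12 = sigma[np.ix_(miss, obs)]
            S22inv = np.linalg.inv(sigma[np.ix_(obs, obs)])
            xt[miss] = mu[miss] + S12 @ S22inv @ (x[obs] - mu[obs])
            C[np.ix_(miss, miss)] = sigma[np.ix_(miss, miss)] - S12 @ S22inv @ S12.T
        T1 += xt
        T2 += np.outer(xt, xt) + C
    mu = T1 / n                         # estimation step, (5-40)
    sigma = T2 / n - np.outer(mu, mu)
```

Increasing the loop count iterates the cycle until \tilde{\mu} and \tilde{\Sigma} stabilize, as described in the text.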
5.8 DIFFICULTIES DUE TO TIME DEPENDENCE IN MULTIVARIATE OBSERVATIONS
For the methods described in this chapter, we have assumed that the multivariate observations X_1, X_2, ..., X_n constitute a random sample; that is, they are independent of one another. If the observations are collected over time, this assumption may not be valid. The presence of even a moderate amount of time dependence among the observations can cause serious difficulties for tests, confidence regions, and simultaneous confidence intervals, which are all constructed assuming that independence holds.

We will illustrate the nature of the difficulty when the time dependence can be represented as a multivariate first order autoregressive [AR(1)] model. Let the p \times 1 random vector X_t follow the multivariate AR(1) model

X_t - \mu = \Phi(X_{t-1} - \mu) + \varepsilon_t   (5-42)

where the \varepsilon_t are independent and identically distributed with E[\varepsilon_t] = 0 and \mathrm{Cov}(\varepsilon_t) = \Sigma_\varepsilon, and all of the eigenvalues of the coefficient matrix \Phi are between -1 and 1. Under this model

\mathrm{Cov}(X_t, X_{t-j}) = \Phi^j \Sigma_X, \qquad \text{where} \qquad \Sigma_X = \sum_{j=0}^{\infty} \Phi^j \Sigma_\varepsilon \Phi'^j

The AR(1) model (5-42) relates the observation at time t to the observation at time t - 1 through the coefficient matrix \Phi. Further, the autoregressive model says the observations are independent, under multivariate normality, if all the entries in the coefficient matrix \Phi are 0. The name autoregressive model comes from the fact that (5-42) looks like a multivariate version of a regression with X_t as the dependent variable and the previous value X_{t-1} as the independent variable.

As shown in Johnson and Langeland [8],

\bar{X} \to \mu, \qquad S \to \Sigma_X

where the arrow indicates convergence in probability, and

n \, \mathrm{Cov}(\bar{X}) \to (I - \Phi)^{-1}\Sigma_X + \Sigma_X(I - \Phi')^{-1} - \Sigma_X   (5-43)

Moreover, for large n, \sqrt{n}(\bar{X} - \mu) is approximately normal with mean 0 and covariance matrix given by (5-43).

To make the calculations easy, suppose the underlying process has \Phi = \phi I, where |\phi| < 1. Now consider the large sample nominal 95% confidence ellipsoid for \mu:

{all \mu such that n(\bar{X} - \mu)' S^{-1} (\bar{X} - \mu) \le \chi_p^2(.05)}

This ellipsoid has large sample coverage probability .95 if the observations are independent. If the observations are related by our autoregressive model, however, this ellipsoid has large sample coverage probability

P[\chi_p^2 \le (1 - \phi)(1 + \phi)^{-1} \chi_p^2(.05)]

Table 5.10 shows how the coverage probability is related to the coefficient \phi and the number of variables p.

According to Table 5.10, the coverage probability can drop very low, to .632, even for the bivariate case.

The independence assumption is crucial, and the results based on this assumption can be very misleading if the observations are, in fact, dependent.

TABLE 5.10  COVERAGE PROBABILITY OF THE NOMINAL 95% CONFIDENCE ELLIPSOID

         \phi
  p     -.25      0     .25     .5
  1     .989   .950    .871   .742
  2     .993   .950    .834   .632
  5     .998   .950    .751   .405
 10     .999   .950    .641   .193
 15    1.000   .950    .548   .090
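The entries of Table 5.10 can be recomputed directly from the coverage probability expression P[\chi_p^2 \le (1 - \phi)(1 + \phi)^{-1} \chi_p^2(.05)]:

```python
from scipy.stats import chi2

# Coverage probability of the nominal 95% ellipsoid when the data follow
# the AR(1) model with Phi = phi * I (independence corresponds to phi = 0).
def coverage(p, phi):
    return chi2.cdf((1 - phi) / (1 + phi) * chi2.ppf(0.95, p), p)

# Rebuild the body of Table 5.10, rows indexed by p, columns by phi.
table = {p: [round(coverage(p, phi), 3) for phi in (-0.25, 0.0, 0.25, 0.5)]
         for p in (1, 2, 5, 10, 15)}
```

Positive dependence (\phi > 0) shrinks the scaling factor (1 - \phi)/(1 + \phi) below 1, which is why coverage collapses as \phi and p grow.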
SUPPLEMENT 5A

Simultaneous Confidence Intervals and Ellipses as Shadows of the p-Dimensional Ellipsoids

We begin this supplementary section by establishing the general result concerning the projection (shadow) of an ellipsoid onto a line.

Result 5A.1. Let the constant c > 0 and positive definite p \times p matrix A determine the ellipsoid {z: z' A^{-1} z \le c^2}. For a given vector u \ne 0, and z belonging to the ellipsoid, the

(Projection (shadow) of {z' A^{-1} z \le c^2} on u) = \frac{c\sqrt{u'Au}}{u'u} u

which extends from 0 along u with length c\sqrt{u'Au}/\sqrt{u'u}. When u is a unit vector, the shadow extends c\sqrt{u'Au} units along u. The shadow also extends c\sqrt{u'Au} units in the -u direction.

Proof. By Definition 2A.12, the projection of any z on u is given by (z'u)u/u'u. Its squared length is (z'u)^2/u'u. We want to maximize this shadow over all z with z' A^{-1} z \le c^2. The extended Cauchy-Schwarz inequality in (2-49) states that (b'd)^2 \le (b'Bb)(d'B^{-1}d), with equality when b = kB^{-1}d. Setting b = z, d = u, B = A^{-1}, we obtain

(u'u)(\text{length of projection})^2 = (z'u)^2 \le (z' A^{-1} z)(u'Au) \le c^2 u'Au \quad \text{for all } z \text{ with } z' A^{-1} z \le c^2

The choice z = cAu/\sqrt{u'Au} yields equalities and thus gives the maximum shadow, besides belonging to the boundary of the ellipsoid. That is, z' A^{-1} z = c^2 u'Au/u'Au = c^2 for this z that provides the longest shadow. Consequently, the projection of the ellipsoid on u is (c\sqrt{u'Au}/u'u) u, and its length is c\sqrt{u'Au}/\sqrt{u'u}. The projection of the ellipsoid also extends the same length in the direction -u. ∎
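Result 5A.1 can be checked numerically: the analytic half-length c\sqrt{u'Au} of the shadow agrees with the largest projection found by sampling the boundary of the ellipsoid. The matrix A, constant c, and direction u below are arbitrary illustrative choices:

```python
import numpy as np

# Arbitrary choices for illustration: positive definite A, constant c, unit u.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
c = 2.0
u = np.array([1.0, 1.0]) / np.sqrt(2.0)

half_len = c * np.sqrt(u @ A @ u)       # shadow half-length from Result 5A.1

# Sample the boundary z'A^{-1}z = c^2 via z = c L w with ||w|| = 1 and A = L L'.
L = np.linalg.cholesky(A)
theta = np.linspace(0.0, 2.0 * np.pi, 100001)
Z = c * (L @ np.vstack([np.cos(theta), np.sin(theta)]))
sampled = np.abs(u @ Z).max()           # largest |z'u| over the sampled boundary
```

The sampled maximum never exceeds the analytic value and approaches it as the boundary grid is refined, exactly as the extended Cauchy-Schwarz argument predicts.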
Result 5A.2. Suppose that the ellipsoid {z: z' A^{-1} z \le c^2} is given and that U = [u_1 | u_2] is arbitrary but of rank two. Then

z in the ellipsoid based on A^{-1} and c^2 implies that U'z is in the ellipsoid based on (U'AU)^{-1} and c^2, for all U

or

z' A^{-1} z \le c^2 implies that (U'z)'(U'AU)^{-1}(U'z) \le c^2, for all U
Proof. We first establish a basic inequality. Set P = A^{1/2} U(U'AU)^{-1} U' A^{1/2}, where A = A^{1/2} A^{1/2}. Note that P = P' and P^2 = P, so (I - P)P' = P - P^2 = 0.

Next, using A^{-1} = A^{-1/2} A^{-1/2}, we write z' A^{-1} z = (A^{-1/2} z)'(A^{-1/2} z) and A^{-1/2} z = P A^{-1/2} z + (I - P) A^{-1/2} z. Then

z' A^{-1} z = (A^{-1/2} z)'(A^{-1/2} z)
= (P A^{-1/2} z + (I - P) A^{-1/2} z)'(P A^{-1/2} z + (I - P) A^{-1/2} z)
= (P A^{-1/2} z)'(P A^{-1/2} z) + ((I - P) A^{-1/2} z)'((I - P) A^{-1/2} z)
\ge z' A^{-1/2} P'P A^{-1/2} z = z' A^{-1/2} P A^{-1/2} z = z' U(U'AU)^{-1} U' z   (5A-1)

Since z' A^{-1} z \le c^2 and U was arbitrary, the result follows. ∎
Our next result establishes the two-dimensional confidence ellipse as a projection of the p-dimensional ellipsoid. (See Figure 5.13.) Projection on a plane is simplest when the two vectors u_1 and u_2 determining the plane are first converted to perpendicular vectors of unit length. (See Result 2A.3.)

Result 5A.3. Given the ellipsoid {z: z' A^{-1} z \le c^2} and two perpendicular unit vectors u_1 and u_2, the projection (or shadow) of {z' A^{-1} z \le c^2} on the u_1, u_2 plane results in the two-dimensional ellipse {(U'z)'(U'AU)^{-1}(U'z) \le c^2}, where U = [u_1 | u_2].
Proof. By Result 2A.3, the projection of a vector z on the u_1, u_2 plane is

(u_1'z) u_1 + (u_2'z) u_2 = [u_1 | u_2] \begin{bmatrix} u_1'z \\ u_2'z \end{bmatrix} = UU'z

Figure 5.13: The shadow of the ellipsoid z' A^{-1} z \le c^2 on the u_1, u_2 plane is an ellipse.
The projection of the ellipsoid {z: z' A^{-1} z \le c^2} consists of all UU'z with z' A^{-1} z \le c^2. Consider the two coordinates U'z of the projection U(U'z). Let z belong to the set {z: z' A^{-1} z \le c^2} so that UU'z belongs to the shadow of the ellipsoid. By Result 5A.2,

(U'z)'(U'AU)^{-1}(U'z) \le c^2

so the ellipse {(U'z)'(U'AU)^{-1}(U'z) \le c^2} contains the coefficient vectors for the shadow of the ellipsoid.

Let Ua be a vector in the u_1, u_2 plane whose coefficients a belong to the ellipse {a'(U'AU)^{-1}a \le c^2}. If we set z = AU(U'AU)^{-1}a, it follows that

U'z = U'AU(U'AU)^{-1}a = a

and

z' A^{-1} z = a'(U'AU)^{-1}U'A A^{-1} AU(U'AU)^{-1}a = a'(U'AU)^{-1}a \le c^2

Thus, U'z belongs to the coefficient vector ellipse, and z belongs to the ellipsoid z' A^{-1} z \le c^2. Consequently, the ellipse contains only coefficient vectors from the projection of {z: z' A^{-1} z \le c^2} onto the u_1, u_2 plane. ∎
Remark. Projecting the ellipsoid z' A^{-1} z \le c^2 first to the u_1, u_2 plane and then to the line u_1 is the same as projecting it directly to the line determined by u_1. In the context of confidence ellipsoids, the shadows of the two-dimensional ellipses give the single component intervals.

Remark. Results 5A.2 and 5A.3 remain valid if U = [u_1, ..., u_q] consists of 2 < q \le p linearly independent columns.
EXERCISES
5.1. (a) Evaluate T^2, for testing H_0: \mu' = [7, 11], using the data

X = [ 2  12
      8   9
      6   9
      8  10 ]

(b) Specify the distribution of T^2 for the situation in (a).
(c) Using (a) and (b), test H_0 at the \alpha = .05 level. What conclusion do you reach?

5.2. Using the data in Example 5.1, verify that T^2 remains unchanged if each observation x_j, j = 1, 2, 3, is replaced by C x_j, where

C = [ 1  -1
      1   1 ]
Note that the observations

C x_j = \begin{bmatrix} x_{j1} - x_{j2} \\ x_{j1} + x_{j2} \end{bmatrix}

yield the data matrix

[ (6 - 9)  (10 - 6)  (8 - 3)
  (6 + 9)  (10 + 6)  (8 + 3) ]'

5.3.
(a) Use expression (5-15) to evaluate T 2 for the data in Exercise 5.1. (b) Use the data in Exercise 5.1 to evaluate A in (5-13). Also, evaluate Wilks' lambda.
5.4. Use the sweat data in Table 5.1. (See Example 5.2.)
(a) Determine the axes of the 90% confidence ellipsoid for \mu. Determine the lengths of these axes.
(b) Construct Q-Q plots for the observations on sweat rate, sodium content, and potassium content, respectively. Construct the three possible scatter plots for pairs of observations. Does the multivariate normal assumption seem justified in this case? Comment.

5.5. The quantities \bar{x}, S, and S^{-1} are given in Example 5.3 for the transformed microwave-radiation data. Conduct a test of the null hypothesis H_0: \mu' = [.55, .60] at the \alpha = .05 level of significance. Is your result consistent with the 95% confidence ellipse for \mu pictured in Figure 5.1? Explain.
5.6. Verify the Bonferroni inequality in (5-28) for m = 3. Hint: A Venn diagram for the three events C1 , C2 , and C3 may help.
5.7. Use the sweat data in Table 5.1. (See Example 5.2.) Find simultaneous 95% T^2 confidence intervals for \mu_1, \mu_2, and \mu_3 using Result 5.3. Construct the 95% Bonferroni intervals using (5-29). Compare the two sets of intervals.

5.8. From (5-23), we know that T^2 is equal to the largest squared univariate t-value constructed from the linear combination a'x_j with a = S^{-1}(\bar{x} - \mu_0). Using the results in Example 5.3 and the H_0 in Exercise 5.5, evaluate a for the transformed microwave-radiation data. Verify that the t^2-value computed with this a is equal to T^2 in Exercise 5.5.
5.9. Harry Roberts, a naturalist for the Alaska Fish and Game department, studies grizzly bears with the goal of maintaining a healthy population. Measurements on n = 61 bears provided the following summary statistics (see also Exercise 8.23):

Variable:        Weight (kg)   Body length (cm)   Neck (cm)   Girth (cm)   Head length (cm)   Head width (cm)
Sample mean x̄:   95.52         164.38             55.69       93.39        17.98              31.13
Covariance matrix

S = [ 3266.46  1343.97   731.54  1175.50  162.68  238.37
      1343.97   721.91   324.25   537.35   80.17  117.73
       731.54   324.25   179.28   281.17   39.15   56.80
      1175.50   537.35   281.17   474.98   63.73   94.85
       162.68    80.17    39.15    63.73   13.88    9.95
       238.37   117.73    56.80    94.85    9.95   21.26 ]
(a) Obtain the large sample 95% simultaneous confidence intervals for the six population mean body measurements.
(b) Obtain the large sample 95% simultaneous confidence ellipse for mean weight and mean girth.
(c) Obtain the 95% Bonferroni confidence intervals for the six means in Part a.
(d) Refer to Part b. Construct the 95% Bonferroni confidence rectangle for the mean weight and mean girth using m = 6. Compare this rectangle with the confidence ellipse in Part b.
(e) Obtain the 95% Bonferroni confidence interval for

mean head width - mean head length
using m = 6 + 1 = 7 to allow for this statement as well as statements about each individual mean.
5.10. Refer to the bear growth data in Example 1.10 (see Table 1.4). Restrict your attention to the measurements of length.
(a) Obtain the 95% T^2 simultaneous confidence intervals for the four population means for length.
(b) Refer to Part a. Obtain the 95% T^2 simultaneous confidence intervals for the three successive yearly increases in mean length.
(c) Obtain the 95% T^2 confidence ellipse for the mean increase in length from 2 to 3 years and the mean increase in length from 4 to 5 years.
(d) Refer to Parts a and b. Construct the 95% Bonferroni confidence intervals for the set consisting of four mean lengths and three successive yearly increases in mean length.
(e) Refer to Parts c and d. Compare the 95% Bonferroni confidence rectangle for the mean increase in length from 2 to 3 years and the mean increase in length from 4 to 5 years with the confidence ellipse produced by the T^2-procedure.
5.11. A physical anthropologist performed a mineral analysis of nine ancient Peruvian hairs. The results for the chromium (x_1) and strontium (x_2) levels, in parts per million (ppm), were as follows:

x_1 (Cr):   .48   40.53   2.19    .55    .74   .66   .93   .37   .22
x_2 (St):  12.57  73.68  11.13  20.03  20.29   .78  4.64   .43  1.08
Source: Benfer and others, "Mineral Analysis of Ancient Peruvian Hair," American Journal of Physical Anthropology, 48, no. 3 (1978), 277-282.
It is known that low levels (less than or equal to .100 ppm) of chromium suggest the presence of diabetes, while strontium is an indication of animal protein intake.
(a) Construct and plot a 90% joint confidence ellipse for the population mean vector \mu' = [\mu_1, \mu_2], assuming that these nine Peruvian hairs represent a random sample from individuals belonging to a particular ancient Peruvian culture.
(b) Obtain the individual simultaneous 90% confidence intervals for \mu_1 and \mu_2 by "projecting" the ellipse constructed in Part a on each coordinate axis. (Alternatively, we could use Result 5.3.) Does it appear as if this Peruvian culture has a mean strontium level of 10? That is, are any of the points (\mu_1 arbitrary, 10) in the confidence regions? Is [.30, 10]' a plausible value for \mu? Discuss.
(c) Do these data appear to be bivariate normal? Discuss their status with reference to Q-Q plots and a scatter diagram. If the data are not bivariate normal, what implications does this have for the results in Parts a and b?
(d) Repeat the analysis with the obvious "outlying" observation removed. Do the inferences change? Comment.
5.12. Given the data

X = [ 3  6  0
      4  4  3
      —  8  3
      5  —  — ]

with missing components, use the prediction-estimation algorithm of Section 5.7 to estimate \mu and \Sigma. Determine the initial estimates, and iterate to find the first revised estimates.
5.13. Determine the approximate distribution of -n \ln(|\hat{\Sigma}|/|\hat{\Sigma}_0|) for the sweat data in Table 5.1. (See Result 5.2.)

5.14. Create a table similar to Table 5.4 using the entries (length of one-at-a-time t-interval)/(length of Bonferroni t-interval).
Exercises 5.15, 5.16, and 5.17 refer to the following information:

Frequently, some or all of the population characteristics of interest are in the form of attributes. Each individual in the population may then be described in terms of the attributes it possesses. For convenience, attributes are usually numerically coded with respect to their presence or absence. If we let the variable X pertain to a specific attribute, then we can distinguish between the presence or absence of this attribute by defining

X = { 1  if attribute present
      0  if attribute absent

In this way, we can assign numerical values to qualitative characteristics.

When attributes are numerically coded as 0-1 variables, a random sample from the population of interest results in statistics that consist of the counts
of the number of sample items that have each distinct set of characteristics. If the sample counts are large, methods for producing simultaneous confidence statements can be easily adapted to situations involving proportions.

We consider the situation where an individual with a particular combination of attributes can be classified into one of q + 1 mutually exclusive and exhaustive categories. The corresponding probabilities are denoted by p_1, p_2, ..., p_q, p_{q+1}. Since the categories include all possibilities, we take p_{q+1} = 1 - (p_1 + p_2 + \cdots + p_q). An individual from category k will be assigned the ((q + 1) \times 1) vector value [0, ..., 0, 1, 0, ..., 0]' with 1 in the kth position.

The probability distribution for an observation from the population of individuals in q + 1 mutually exclusive and exhaustive categories is known as the multinomial distribution. It has the following structure:

Category:                   1              2              ...   k                    ...   q + 1
Outcome (value):            [1,0,...,0]'   [0,1,...,0]'   ...   (1 in kth position)  ...   [0,...,0,1]'
Probability (proportion):   p_1            p_2            ...   p_k                  ...   p_{q+1} = 1 - \sum_{i=1}^{q} p_i
Let X_j, j = 1, 2, ..., n, be a random sample of size n from the multinomial distribution. The kth component, X_{jk}, of X_j is 1 if the observation (individual) is from category k and is 0 otherwise. The random sample X_1, X_2, ..., X_n can be converted to a sample proportion vector, which, given the nature of the preceding observations, is a sample mean vector. Thus,

\hat{p} = [\hat{p}_1, \hat{p}_2, ..., \hat{p}_{q+1}]' = \frac{1}{n}\sum_{j=1}^{n} X_j

with

E(\hat{p}) = p = [p_1, p_2, ..., p_{q+1}]'

and

\mathrm{Cov}(\hat{p}) = \frac{1}{n}\mathrm{Cov}(X_j) = \frac{1}{n}\Sigma, \qquad \Sigma = \{\sigma_{ik}\}, \; i, k = 1, 2, ..., q + 1

For large n, the approximate sampling distribution of \hat{p} is provided by the central limit theorem. We have
\sqrt{n}(\hat{p} - p) \quad \text{is approximately} \quad N(0, \Sigma)
where the elements of \Sigma are \sigma_{kk} = p_k(1 - p_k) and \sigma_{ik} = -p_i p_k. The normal approximation remains valid when \sigma_{kk} is estimated by \hat{\sigma}_{kk} = \hat{p}_k(1 - \hat{p}_k) and \sigma_{ik} is estimated by \hat{\sigma}_{ik} = -\hat{p}_i\hat{p}_k, i \ne k.

Since each individual must belong to exactly one category, X_{q+1,j} = 1 - (X_{1j} + X_{2j} + \cdots + X_{qj}), so \hat{p}_{q+1} = 1 - (\hat{p}_1 + \hat{p}_2 + \cdots + \hat{p}_q), and as a result, \hat{\Sigma} has rank q. The usual inverse of \hat{\Sigma} does not exist, but it is still possible to develop simultaneous 100(1 - \alpha)% confidence intervals for all linear combinations a'p.

Result. Let X_1, X_2, ..., X_n be a random sample from a q + 1 category multinomial distribution with P[X_{jk} = 1] = p_k, k = 1, 2, ..., q + 1, j = 1, 2, ..., n. Approximate simultaneous 100(1 - \alpha)% confidence regions for all linear combinations a'p = a_1 p_1 + a_2 p_2 + \cdots + a_{q+1} p_{q+1} are given by the observed values of

a'\hat{p} \pm \sqrt{\chi_q^2(\alpha)}\,\sqrt{\frac{a'\hat{\Sigma}a}{n}}

provided that n - q is large. Here \hat{p} = (1/n)\sum_{j=1}^{n} X_j, and \hat{\Sigma} = \{\hat{\sigma}_{ik}\} is a (q + 1) \times (q + 1) matrix with \hat{\sigma}_{kk} = \hat{p}_k(1 - \hat{p}_k) and \hat{\sigma}_{ik} = -\hat{p}_i\hat{p}_k, i \ne k. Also, \chi_q^2(\alpha) is the upper (100\alpha)th percentile of the chi-square distribution with q d.f.

In this result, the requirement that n - q is large is interpreted to mean np_k is about 20 or more for each category. We have only touched on the possibilities for the analysis of categorical data. Complete discussions of categorical data analysis are available in [1] and [4].

5.15. Let X_{ji} and X_{jk} be the ith and kth components, respectively, of X_j.
(a) Show that \mu_i = E(X_{ji}) = p_i and \sigma_{ii} = Var(X_{ji}) = p_i(1 - p_i), i = 1, 2, ..., p.
(b) Show that \sigma_{ik} = Cov(X_{ji}, X_{jk}) = -p_i p_k, i \ne k. Why must this covariance necessarily be negative?

5.16. As part of a larger marketing research project, a consultant for the Bank of Shorewood wants to know the proportion of savers that uses the bank's facilities as their primary vehicle for saving. The consultant would also like to know the proportions of savers who use the three major competitors: Bank B, Bank C, and Bank D. Each individual contacted in a survey responded to the following question:

Which bank is your primary savings bank?

Response: Bank of Shorewood / Bank B / Bank C / Bank D / Another Bank / No Savings
A sample of n = 355 people with savings accounts produced the following counts when asked to indicate their primary savings banks (the people with no savings will be ignored in the comparison of savers, so there are five categories):

Bank (category):             Bank of Shorewood        Bank B           Bank C           Bank D           Another bank     Total
Observed number:             105                      119              56               25               50               n = 355
Population proportion:       p_1                      p_2              p_3              p_4              p_5 = 1 - (p_1 + p_2 + p_3 + p_4)
Observed sample proportion:  \hat{p}_1 = 105/355 = .30   \hat{p}_2 = .33   \hat{p}_3 = .16   \hat{p}_4 = .07   \hat{p}_5 = .14

Let the population proportions be

p_1 = proportion of savers at Bank of Shorewood
p_2 = proportion of savers at Bank B
p_3 = proportion of savers at Bank C
p_4 = proportion of savers at Bank D
1 - (p_1 + p_2 + p_3 + p_4) = proportion of savers at other banks
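The one-at-a-time intervals from the Result above, with a = e_k, can be sketched for these survey counts; for a comparison such as p_1 - p_2 one would instead take a = (1, -1, 0, 0, 0)' and use the full a'\hat{\Sigma}a:

```python
import numpy as np
from scipy.stats import chi2

counts = np.array([105, 119, 56, 25, 50])     # savers in the five categories
n = counts.sum()
q = len(counts) - 1
phat = counts / n

crit = np.sqrt(chi2.ppf(0.95, q))             # sqrt of chi2_q(.05), here q = 4
intervals = []
for pk in phat:                               # a = e_k gives a'Sigma_hat a = pk(1 - pk)
    half = crit * np.sqrt(pk * (1.0 - pk) / n)
    intervals.append((pk - half, pk + half))
```

Because the critical value \sqrt{\chi_4^2(.05)} covers all linear combinations simultaneously, these intervals are noticeably wider than five separate one-at-a-time normal-theory intervals would be.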
(a) Construct simultaneous 95% confidence intervals for p_1, p_2, ..., p_5.
(b) Construct a simultaneous 95% confidence interval that allows a comparison of the Bank of Shorewood with its major competitor, Bank B. Interpret this interval.
5.17. In order to assess the prevalence of a drug problem among high school students in a particular city, a random sample of 200 students from the city's five high schools were surveyed. One of the survey questions and the corresponding responses are as follows:

What is your typical weekly marijuana usage?

Category:              None   Moderate (1-3 joints)   Heavy (4 or more joints)
Number of responses:   117    62                      21
Construct 95% simultaneous confidence intervals for the three proportions p_1, p_2, and p_3 = 1 - (p_1 + p_2).

The following exercises may require a computer.

5.18. Use the college test data in Table 5.2. (See Example 5.5.)
(a) Test the null hypothesis H_0: \mu' = [500, 50, 30] versus H_1: \mu' \ne [500, 50, 30] at the \alpha = .05 level of significance. Suppose [500, 50, 30]' represent average scores for thousands of college students over the last 10 years. Is there reason to believe that the group of students represented by the scores in Table 5.2 is scoring differently? Explain.
(b) Determine the lengths and directions for the axes of the 95% confidence ellipsoid for \mu.
(c) Construct Q-Q plots from the marginal distributions of social science and history, verbal, and science scores. Also, construct the three possible scatter diagrams from the pairs of observations on different variables. Do these data appear to be normally distributed? Discuss.

5.19. Measurements of x_1 = stiffness and x_2 = bending strength for a sample of n = 30 pieces of a particular grade of lumber are given in Table 5.11. The units are pounds/(inches)^2. Using the data in the table,

TABLE 5.11
LUMBER DATA

x_1 (Stiffness:            x_2 (Bending   x_1 (Stiffness:            x_2 (Bending
modulus of elasticity)     strength)      modulus of elasticity)     strength)
1232                       4175           1712                       7749
1115                       6652           1932                       6818
2205                       7612           1820                       9307
1897                       10,914         1900                       6457
1932                       10,850         2426                       10,102
1612                       7627           1558                       7414
1598                       6954           1470                       7556
1804                       8365           1858                       7833
1752                       9469           1587                       8309
2067                       6410           2208                       9559
2365                       10,327         1487                       6255
1646                       7320           2206                       10,723
1579                       8196           2332                       5430
1880                       9709           2540                       12,090
1773                       10,370         2322                       10,072
Source: Data courtesy of U.S. Forest Products Laboratory.
(a) Construct and sketch a 95% confidence ellipse for the pair [mu1, mu2]', where mu1 = E(X1) and mu2 = E(X2).
(b) Suppose mu10 = 2000 and mu20 = 10,000 represent "typical" values for stiffness and bending strength, respectively. Given the result in (a), are the data in Table 5.11 consistent with these values? Explain.
(c) Is the bivariate normal distribution a viable population model? Explain with reference to Q-Q plots and a scatter diagram.
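Part (b) of Exercise 5.19 amounts to checking whether the point (mu10, mu20) = (2000, 10,000) falls inside the confidence ellipsoid of part (a). A minimal computational sketch of that check, assuming NumPy and SciPy are available (the function name and the toy data below are illustrative, not from the text):

```python
import numpy as np
from scipy import stats

def mu0_in_ellipsoid(X, mu0, alpha=0.05):
    """Return True when mu0 lies inside the 100(1 - alpha)% T^2
    confidence ellipsoid for the mean of the rows of X (n x p)."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)            # unbiased sample covariance
    d = xbar - np.asarray(mu0, dtype=float)
    T2 = n * d @ np.linalg.solve(S, d)     # n (xbar - mu0)' S^{-1} (xbar - mu0)
    crit = (n - 1) * p / (n - p) * stats.f.ppf(1 - alpha, p, n - p)
    return bool(T2 <= crit)

# Toy illustration (not the lumber data): mu0 equal to the sample mean
# always lies inside the ellipsoid, since T^2 = 0 there.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [2.0, 1.0], [4.0, 3.0]])
print(mu0_in_ellipsoid(X, X.mean(axis=0)))
```

For the exercise itself, one would load the 30 lumber observations and call the same function with mu0 = [2000, 10000].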
5.20. A wildlife ecologist measured x1 = tail length (in millimeters) and x2 = wing length (in millimeters) for a sample of n = 45 female hook-billed kites. These data are displayed in Table 5.12. Using the data in the table,
TABLE 5.12 BIRD DATA

  x1 (Tail    x2 (Wing    x1 (Tail    x2 (Wing    x1 (Tail    x2 (Wing
  length)     length)     length)     length)     length)     length)
  191         284         186         266         173         271
  197         285         197         285         194         280
  208         288         201         295         198         300
  180         273         190         282         180         272
  180         275         209         305         190         292
  188         280         187         285         191         286
  210         283         207         297         196         285
  196         288         178         268         207         286
  191         271         202         271         209         303
  179         257         205         285         179         261
  208         289         190         280         186         262
  202         285         189         277         174         245
  200         272         211         310         181         250
  192         282         216         305         189         262
  199         280         189         274         188         258
Source: Data courtesy of S. Temple.
(a) Find and sketch the 95% confidence ellipse for the population means mu1 and mu2. Suppose it is known that mu1 = 190 mm and mu2 = 275 mm for male hook-billed kites. Are these plausible values for the mean tail length and mean wing length for the female birds? Explain.
(b) Construct the simultaneous 95% T²-intervals for mu1 and mu2 and the 95% Bonferroni intervals for mu1 and mu2. Compare the two sets of intervals. What advantage, if any, do the T²-intervals have over the Bonferroni intervals?
(c) Is the bivariate normal distribution a viable population model? Explain with reference to Q-Q plots and a scatter diagram.
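For part (b)-style comparisons, the two sets of intervals can be computed side by side. A sketch under the usual assumptions, with NumPy and SciPy (the helper name and the randomly generated toy data are ours, not the text's):

```python
import numpy as np
from scipy import stats

def simultaneous_intervals(X, alpha=0.05):
    """100(1 - alpha)% simultaneous T^2 intervals and Bonferroni
    intervals for the component means of the rows of X (n x p)."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    se = np.sqrt(X.var(axis=0, ddof=1) / n)     # sqrt(s_ii / n)
    t2_mult = np.sqrt((n - 1) * p / (n - p) * stats.f.ppf(1 - alpha, p, n - p))
    bon_mult = stats.t.ppf(1 - alpha / (2 * p), n - 1)
    t2 = np.c_[xbar - t2_mult * se, xbar + t2_mult * se]
    bon = np.c_[xbar - bon_mult * se, xbar + bon_mult * se]
    return t2, bon

# For p >= 2 the Bonferroni intervals are narrower than the T^2 intervals,
# which is the comparison the exercise asks for.
rng = np.random.default_rng(0)
t2, bon = simultaneous_intervals(rng.normal(size=(45, 2)))
```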
5.21. Using the data on bone mineral content in Table 1.8, construct the 95% Bonferroni intervals for the individual means. Also, find the 95% simultaneous T²-intervals. Compare the two sets of intervals.

5.22. A portion of the data contained in Table 6.10 in Chapter 6 is reproduced in the following table.
MILK TRANSPORTATION-COST DATA

  Fuel (x1)    Repair (x2)    Capital (x3)
  16.44        12.43          11.23
   7.19         2.70           3.92
   9.92         1.35           9.75
   4.24         5.78           7.78
  11.20         5.05          10.67
  14.25         5.78           9.88
  13.50        10.98          10.60
  13.32        14.27           9.45
  29.11        15.09           3.28
  12.68         7.61          10.23
   7.51         5.80           8.13
   9.90         3.63           9.13
  10.25         5.07          10.17
  11.11         6.15           7.61
  12.17        14.26          14.39
  10.24         2.59           6.09
  10.18         6.05          12.14
   8.88         2.70          12.23
  12.34         7.73          11.68
   8.51        14.02          12.01
  26.16        17.44          16.89
  12.95         8.24           7.18
  16.93        13.37          17.59
  14.70        10.78          14.58
  10.32         5.16          17.00
These data represent various costs associated with transporting milk from farms to dairy plants for gasoline trucks. Only the first 25 multivariate observations for gasoline trucks are given. Observations 9 and 21 have been identified as outliers from the full data set of 36 observations. (See [2].)
(a) Construct Q-Q plots of the marginal distributions of fuel, repair, and capital costs. Also, construct the three possible scatter diagrams from the pairs of observations on different variables. Are the outliers evident? Repeat the Q-Q plots and the scatter diagrams with the apparent outliers removed. Do the data now appear to be normally distributed? Discuss.
(b) Construct 95% Bonferroni intervals for the individual cost means. Also, find the 95% T²-intervals. Compare the two sets of intervals.
5.23. Consider the thirty observations on male Egyptian skulls for the first time period given in Table 6.13 on page 344.
(a) Construct Q-Q plots of the marginal distributions of the maxbreadth, basheight, baslength, and nasheight variables. Also, construct a chi-square plot of the multivariate observations. Do these data appear to be normally distributed? Explain.
(b) Construct 95% Bonferroni intervals for the individual skull dimension variables. Also, find the 95% T²-intervals. Compare the two sets of intervals.
5.24. Using the Madison, Wisconsin, Police Department data in Table 5.8, construct individual X-bar charts for x3 = holdover hours and x4 = COA hours. Do these individual process characteristics seem to be in control? (That is, are they stable?) Comment.

5.25. Refer to Exercise 5.24. Using the data on the holdover and COA overtime hours, construct a quality ellipse and a T²-chart. Does the process represented by the bivariate observations appear to be in control? (That is, is it stable?) Comment. Do you learn something from the multivariate control charts that was not apparent in the individual X-bar charts?

5.26. Construct a T²-chart using the data on x1 = legal appearances overtime hours, x2 = extraordinary event overtime hours, and x3 = holdover overtime hours from Table 5.8. Compare this chart with the chart in Figure 5.8 of Example 5.10. Does plotting T² with an additional characteristic change your conclusion about process stability? Explain.

5.27. Using the data on x3 = holdover hours and x4 = COA hours from Table 5.8, construct a prediction ellipse for a future observation x' = (x3, x4). Remember, a prediction ellipse should be calculated from a stable process. Interpret the result.

REFERENCES

1. Agresti, A. Categorical Data Analysis, New York: John Wiley, 1988.
2. Bacon-Shone, J., and W. K. Fung. "A New Graphical Method for Detecting Single and Multiple Outliers in Univariate and Multivariate Data." Applied Statistics, 36, no. 2 (1987), 153-162.
3. Bickel, P. J., and K. A. Doksum. Mathematical Statistics: Basic Ideas and Selected Topics, San Francisco: Holden-Day, 1977.
4. Bishop, Y. M. M., S. E. Fienberg, and P. W. Holland. Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: The MIT Press, 1975.
5. Dempster, A. P., N. M. Laird, and D. B. Rubin. "Maximum Likelihood from Incomplete Data via the EM Algorithm (with Discussion)." Journal of the Royal Statistical Society (B), 39, no. 1 (1977), 1-38.
6. Hartley, H. O. "Maximum Likelihood Estimation from Incomplete Data." Biometrics, 14 (1958), 174-194.
7. Hartley, H. O., and R. R. Hocking. "The Analysis of Incomplete Data." Biometrics, 27 (1971), 783-808.
8. Johnson, R. A., and T. Langeland. "A Linear Combinations Test for Detecting Serial Correlation in Multivariate Samples." Topics in Statistical Dependence (1991), Institute of Mathematical Statistics Monograph, Eds. Block, H. et al., 299-313.
9. Ryan, T. P. Statistical Methods for Quality Improvement. New York: John Wiley, 1989.
10. Tiku, M. L., and M. Singh. "Robust Statistics for Testing Mean Vectors of Multivariate Distributions." Communications in Statistics-Theory and Methods, 11, no. 9 (1982), 985-1001.
CHAPTER 6

Comparisons of Several Multivariate Means
6.1 INTRODUCTION

The ideas developed in Chapter 5 can be extended to handle problems involving the comparison of several mean vectors. The theory is a little more complicated and rests on an assumption of multivariate normal distributions or large sample sizes. Similarly, the notation becomes a bit cumbersome. To circumvent these problems, we shall often review univariate procedures for comparing several means and then generalize to the corresponding multivariate cases by analogy. The numerical examples we present will help cement the concepts.
Because comparisons of means frequently (and should) emanate from designed experiments, we take the opportunity to discuss some of the tenets of good experimental practice. A repeated measures design, useful in behavioral studies, is explicitly considered, along with modifications required to analyze growth curves.
We begin by considering pairs of mean vectors. In later sections, we discuss several comparisons among mean vectors arranged according to treatment levels. The corresponding test statistics depend upon a partitioning of the total variation into pieces of variation attributable to the treatment sources and error. This partitioning is known as the multivariate analysis of variance (MANOVA).
6.2 PAIRED COMPARISONS AND A REPEATED MEASURES DESIGN

Paired Comparisons
Measurements are often recorded under different sets of experimental conditions to see whether the responses differ significantly over these sets. For example, the efficacy of a new drug or of a saturation advertising campaign may be determined by comparing measurements before the "treatment" (drug or advertising) with those after
the treatment. In other situations, two or more treatments can be administered to the same or similar experimental units, and responses can be compared to assess the effects of the treatments.
One rational approach to comparing two treatments, or the presence and absence of a single treatment, is to assign both treatments to the same or identical units (individuals, stores, plots of land, and so forth). The paired responses may then be analyzed by computing their differences, thereby eliminating much of the influence of extraneous unit-to-unit variation.
In the single response (univariate) case, let X_j1 denote the response to treatment 1 (or the response before treatment), and let X_j2 denote the response to treatment 2 (or the response after treatment) for the jth trial. That is, (X_j1, X_j2) are measurements recorded on the jth unit or jth pair of like units. By design, the n differences

    D_j = X_j1 - X_j2,    j = 1, 2, ..., n        (6-1)

should reflect only the differential effects of the treatments.
Given that the differences D_j in (6-1) represent independent observations from an N(delta, sigma_d²) distribution, the variable

    t = (D-bar - delta) / (s_d / sqrt(n))        (6-2)

where

    D-bar = (1/n) sum_{j=1}^{n} D_j   and   s_d² = (1/(n - 1)) sum_{j=1}^{n} (D_j - D-bar)²        (6-3)

has a t-distribution with n - 1 d.f. Consequently, an alpha-level test of
    H0: delta = 0  (zero mean difference for treatments)
versus
    H1: delta ≠ 0
may be conducted by comparing |t| with t_{n-1}(alpha/2), the upper 100(alpha/2)th percentile of a t-distribution with n - 1 d.f. A 100(1 - alpha)% confidence interval for the mean difference delta = E(X_j1 - X_j2) is provided by the statement

    d-bar - t_{n-1}(alpha/2) s_d/sqrt(n)  <  delta  <  d-bar + t_{n-1}(alpha/2) s_d/sqrt(n)        (6-4)
(For example, see [7].)
Additional notation is required for the multivariate extension of the paired comparison procedure. It is necessary to distinguish between p responses, two treatments, and n experimental units. We label the p responses within the jth unit as

    X_{1j1} = variable 1 under treatment 1
    X_{1j2} = variable 2 under treatment 1
        ...
    X_{1jp} = variable p under treatment 1
    --------------------------------------
    X_{2j1} = variable 1 under treatment 2
    X_{2j2} = variable 2 under treatment 2
        ...
    X_{2jp} = variable p under treatment 2
and the p paired-difference random variables become

    D_{j1} = X_{1j1} - X_{2j1}
    D_{j2} = X_{1j2} - X_{2j2}
        ...
    D_{jp} = X_{1jp} - X_{2jp}        (6-5)

Let D_j' = [D_{j1}, D_{j2}, ..., D_{jp}], and assume, for j = 1, 2, ..., n, that

    E(D_j) = delta = [delta_1, delta_2, ..., delta_p]'   and   Cov(D_j) = Sigma_d        (6-6)

If, in addition, D_1, D_2, ..., D_n are independent N_p(delta, Sigma_d) random vectors, inferences about the vector of mean differences delta can be based upon a T²-statistic. Specifically,

    T² = n(D-bar - delta)' S_d^{-1} (D-bar - delta)        (6-7)

where

    D-bar = (1/n) sum_{j=1}^{n} D_j   and   S_d = (1/(n - 1)) sum_{j=1}^{n} (D_j - D-bar)(D_j - D-bar)'        (6-8)

Result 6.1. Let the differences D_1, D_2, ..., D_n be a random sample from an N_p(delta, Sigma_d) population. Then

    T² = n(D-bar - delta)' S_d^{-1} (D-bar - delta)

is distributed as an [(n - 1)p/(n - p)] F_{p,n-p} random variable, whatever the true delta and Sigma_d. If n and n - p are both large, T² is approximately distributed as a chi-square_p random variable, regardless of the form of the underlying population of differences.

Proof. The exact distribution of T² is a restatement of the summary in (5-6), with vectors of differences for the observation vectors. The approximate distribution of T², for n and n - p large, follows from (4-28). •

The condition delta = 0 is equivalent to "no average difference between the two treatments." For the ith variable, delta_i > 0 implies that treatment 1 is larger, on average, than treatment 2. In general, inferences about delta can be made using Result 6.1.
Example 6.1 (Checking for a mean difference with paired observations)
Municipal wastewater treatment plants are required by law to monitor their discharges into rivers and streams on a regular basis. Concern about the reliability of data from one of these self-monitoring programs led to a study in which samples of effluent were divided and sent to two laboratories for testing. One-half of each sample was sent to the Wisconsin State Laboratory of Hygiene, and one-half was sent to a private commercial laboratory routinely used in the monitoring program. Measurements of biochemical oxygen demand (BOD) and suspended solids (SS) were obtained, for n = 11 sample splits, from the two laboratories. The data are displayed in Table 6.1.

TABLE 6.1 EFFLUENT DATA

              Commercial lab            State lab of hygiene
  Sample j    x1j1 (BOD)   x1j2 (SS)    x2j1 (BOD)   x2j2 (SS)
   1            6            27           25           15
   2            6            23           28           13
   3           18            64           36           22
   4            8            44           35           29
   5           11            30           15           31
   6           34            75           44           64
   7           28            26           42           30
   8           71           124           54           64
   9           43            54           34           56
  10           33            30           29           20
  11           20            14           39           21

Source: Data courtesy of S. Weber.
Do the two laboratories' chemical analyses agree? If differences exist, what is their nature?
The T²-statistic for testing H0: delta' = [delta1, delta2] = [0, 0] is constructed from the differences of paired observations:

    d_{j1} = x_{1j1} - x_{2j1}:   -19  -22  -18  -27   -4  -10  -14   17    9    4  -19
    d_{j2} = x_{1j2} - x_{2j2}:    12   10   42   15   -1   11   -4   60   -2   10   -7

Here

    d-bar = [d-bar1, d-bar2]' = [-9.36, 13.27]',    S_d = [ 199.26   88.38 ]
                                                          [  88.38  418.61 ]

and

    T² = 11 [-9.36  13.27] [ .0055  -.0012 ] [ -9.36 ]  = 13.6
                           [ -.0012   .0026 ] [ 13.27 ]

Taking alpha = .05, we find that [p(n - 1)/(n - p)] F_{p,n-p}(.05) = [2(10)/9] F_{2,9}(.05) = 9.47. Since T² = 13.6 > 9.47, we reject H0 and conclude that there is a nonzero mean difference between the measurements of the two laboratories. It appears, from inspection of the data, that the commercial lab tends to produce lower BOD measurements and higher SS measurements than the State Lab of Hygiene.
The 95% simultaneous confidence intervals for the mean differences delta1 and delta2 can be computed using (6-10). These intervals are
    delta1:  d-bar1 ± sqrt[ ((n - 1)p/(n - p)) F_{p,n-p}(alpha) ] sqrt(s_{d1}²/n)
           = -9.36 ± sqrt(9.47) sqrt(199.26/11),  or  (-22.46, 3.74)

    delta2:  13.27 ± sqrt(9.47) sqrt(418.61/11),  or  (-5.71, 32.25)

The 95% simultaneous confidence intervals include zero, yet the hypothesis H0: delta = 0 was rejected at the 5% level. What are we to conclude?
The evidence points toward real differences. The point delta = 0 falls outside the 95% confidence region for delta (see Exercise 6.1), and this result is consistent with the T²-test. The 95% simultaneous confidence coefficient applies to the entire set of intervals that could be constructed for all possible linear combinations of the form a1 delta1 + a2 delta2. The particular intervals corresponding to the choices (a1 = 1, a2 = 0) and (a1 = 0, a2 = 1) contain zero. Other choices of a1 and a2 will produce simultaneous intervals that do not contain zero. (If the hypothesis H0: delta = 0 were not rejected, then all simultaneous intervals would include zero.) The Bonferroni simultaneous intervals also cover zero. (See Exercise 6.2.)
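Example 6.1's computations can be reproduced numerically from the Table 6.1 data. A sketch assuming NumPy and SciPy are available (variable names are ours):

```python
import numpy as np
from scipy import stats

# Table 6.1 effluent data: columns are (BOD, SS), n = 11 sample splits.
commercial = np.array([[6, 27], [6, 23], [18, 64], [8, 44], [11, 30],
                       [34, 75], [28, 26], [71, 124], [43, 54],
                       [33, 30], [20, 14]], dtype=float)
state = np.array([[25, 15], [28, 13], [36, 22], [35, 29], [15, 31],
                  [44, 64], [42, 30], [54, 64], [34, 56],
                  [29, 20], [39, 21]], dtype=float)

d = commercial - state                 # paired differences D_j
n, p = d.shape
dbar = d.mean(axis=0)                  # approximately [-9.36, 13.27]
Sd = np.cov(d, rowvar=False)
T2 = n * dbar @ np.linalg.solve(Sd, dbar)        # (6-7) with delta = 0
crit = (n - 1) * p / (n - p) * stats.f.ppf(0.95, p, n - p)
# T2 is about 13.6 and exceeds the critical value (about 9.5), so H0 is rejected.
```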
Our analysis assumed a normal distribution for the D_j. In fact, the situation is further complicated by the presence of one or, possibly, two outliers. (See Exercise 6.3.) These data can be transformed to data more nearly normal, but with such a small sample, it is difficult to remove the effects of the outlier(s). (See Exercise 6.4.) The numerical results of this example illustrate an unusual circumstance that can occur when making inferences. •
The experimenter in Example 6.1 actually divided a sample by first shaking it and then pouring it rapidly back and forth into two bottles for chemical analysis. This was prudent because a simple division of the sample into two pieces obtained by pouring the top half into one bottle and the remainder into another bottle might result in more suspended solids in the lower half due to settling. The two laboratories would then not be working with the same, or even like, experimental units, and the conclusions would not pertain to laboratory competence, measuring techniques, and so forth.
Whenever an investigator can control the assignment of treatments to experimental units, an appropriate pairing of units and a randomized assignment of treatments can enhance the statistical analysis. Differences, if any, between supposedly identical units must be identified and most-alike units paired. Further, a random assignment of treatment 1 to one unit and treatment 2 to the other unit will help eliminate the systematic effects of uncontrolled sources of variation. Randomization can be implemented by flipping a coin to determine whether the first unit in a pair receives treatment 1 (heads) or treatment 2 (tails). The remaining treatment is then assigned to the other unit. A separate independent randomization is conducted for each pair. One can conceive of the process as follows:

Experimental Design for Paired Comparisons

[Diagram: like pairs of experimental units 1, 2, 3, ..., n; within each pair, treatments 1 and 2 are assigned at random.]
We conclude our discussion of paired comparisons by noting that d-bar and S_d, and hence T², may be calculated from the full-sample quantities x-bar and S. Here x-bar is the 2p × 1 vector of sample averages for the p variables on the two treatments given by

    x-bar' = [x-bar_{11}, x-bar_{12}, ..., x-bar_{1p}, x-bar_{21}, x-bar_{22}, ..., x-bar_{2p}]        (6-11)

and S is the 2p × 2p matrix of sample variances and covariances arranged as

    S = [ S_{11}  S_{12} ]        (6-12)
        [ S_{21}  S_{22} ]

where each block is p × p.
The matrix S_{11} contains the sample variances and covariances for the p variables on treatment 1. Similarly, S_{22} contains the sample variances and covariances computed for the p variables on treatment 2. Finally, S_{12} = S_{21}' are the matrices of sample covariances computed from observations on pairs of treatment 1 and treatment 2 variables.
Defining the p × 2p matrix

    C = [ 1  0  ...  0  -1   0  ...   0 ]
        [ 0  1  ...  0   0  -1  ...   0 ]        (6-13)
        [ .               .             ]
        [ 0  0  ...  1   0   0  ...  -1 ]
                         ^
                   (p + 1)st column

we can verify (see Exercise 6.9) that

    d_j = C x_j,  j = 1, 2, ..., n,   d-bar = C x-bar,   and   S_d = C S C'        (6-14)

Thus,

    T² = n(C x-bar - delta)' (C S C')^{-1} (C x-bar - delta)        (6-15)

and it is not necessary first to calculate the differences d_1, d_2, ..., d_n. On the other hand, it is wise to calculate these differences in order to check normality and the assumption of a random sample.
Each row c_i' of the matrix C in (6-13) is a contrast vector, because its elements sum to zero. Attention is usually centered on contrasts when comparing treatments. Each contrast is perpendicular to the vector 1' = [1, 1, ..., 1] since c_i' 1 = 0. The component 1' x_j, representing the overall treatment sum, is ignored by the test statistic T² presented in this section.

A Repeated Measures Design for Comparing Treatments

Another generalization of the univariate paired t-statistic arises in situations where q treatments are compared with respect to a single response variable. Each subject or experimental unit receives each treatment once over successive periods of time. The jth observation is

    X_j = [X_{j1}, X_{j2}, ..., X_{jq}]',    j = 1, 2, ..., n

where X_{ji} is the response to the ith treatment on the jth unit. The name repeated measures stems from the fact that all treatments are administered to each unit.
For comparative purposes, we consider contrasts of the components of mu = E(X_j). These could be
    [ mu_1 - mu_2 ]   [ 1  -1   0  ...   0 ] [ mu_1 ]
    [ mu_1 - mu_3 ] = [ 1   0  -1  ...   0 ] [ mu_2 ]  = C_1 mu
    [     ...     ]   [ .                  ] [  ...  ]
    [ mu_1 - mu_q ]   [ 1   0   0  ...  -1 ] [ mu_q ]

or

    [ mu_2 - mu_1     ]   [ -1   1   0  ...   0   0 ] [ mu_1 ]
    [ mu_3 - mu_2     ] = [  0  -1   1  ...   0   0 ] [ mu_2 ]  = C_2 mu
    [      ...        ]   [  .                      ] [  ...  ]
    [ mu_q - mu_{q-1} ]   [  0   0   0  ...  -1   1 ] [ mu_q ]

Both C_1 and C_2 are called contrast matrices, because their q - 1 rows are linearly independent and each is a contrast vector. The nature of the design eliminates much of the influence of unit-to-unit variation on treatment comparisons. Of course, the experimenter should randomize the order in which the treatments are presented to each subject.
When the treatment means are equal, C_1 mu = C_2 mu = 0. In general, the hypothesis that there are no differences in treatments (equal treatment means) becomes C mu = 0 for any choice of the contrast matrix C.
Consequently, based on the contrasts C x_j in the observations, we have means C x-bar and covariance matrix C S C', and we test C mu = 0 using the T²-statistic

    T² = n(C x-bar)' (C S C')^{-1} C x-bar        (6-16)

It can be shown that T² does not depend on the particular choice of C.¹

¹ Any pair of contrast matrices C_1 and C_2 must be related by C_1 = B C_2, with B nonsingular. This follows because each C has the largest possible number, q - 1, of linearly independent rows, all perpendicular to the vector 1. Then (B C_2)' (B C_2 S C_2' B')^{-1} (B C_2) = C_2' B' (B')^{-1} (C_2 S C_2')^{-1} B^{-1} B C_2 = C_2' (C_2 S C_2')^{-1} C_2, so T² computed with C_2 or C_1 = B C_2 gives the same result.
,
A confidence region for contrasts CJL , with IL the mean of a normal populatio n, is determined by the set of all CJL such that (n - 1 ) ( q - 1 ) , 1 (6 -17) n(C x - C p ) ( CSC ) - ( C x - Cp ) 1 (a) l (n q + ) Fq- l, n-q+ _
_
_
<
where x and S are as defined in (6-16). Consequently, simultaneous 100 ( 1 - a ) % confidence intervals for single contrasts c' IL for any contrast vectors of interest are given by ( see Result 5A. 1)
C' JL: c' x ± Example 6.2
(n - 1)(q - 1) Fq - 1 ' n-q+l ( a) (n - q + 1 )
�' Sc
( 6- 1 8 )
n
(Testi ng fo r eq ual treatments i n a repeated measures design)
Improved anesthetics are often developed by first studying their effects on animals. In one study, 19 dogs were initially given the drug pentobarbital. Each dog was then administered carbon dioxide (CO2) at each of two pressure levels. Next, halothane (H) was added, and the administration of CO2 was repeated. The response, milliseconds between heartbeats, was measured for the four treatment combinations:

    Treatment 1 = high CO2 pressure without H
    Treatment 2 = low CO2 pressure without H
    Treatment 3 = high CO2 pressure with H
    Treatment 4 = low CO2 pressure with H

Table 6.2 contains the four measurements for each of the 19 dogs.
We shall analyze the anesthetizing effects of CO2 pressure and halothane from this repeated-measures design.
There are three treatment contrasts that might be of interest in the experiment. Let mu_1, mu_2, mu_3, and mu_4 correspond to the mean responses for treatments 1, 2, 3, and 4, respectively. Then

    (mu_3 + mu_4) - (mu_1 + mu_2) = (halothane contrast representing the difference between the presence and absence of halothane)

    (mu_1 + mu_3) - (mu_2 + mu_4) = (CO2 contrast representing the difference between high and low CO2 pressure)
TABLE 6.2 SLEEPING-DOG DATA

              Treatment
  Dog      1      2      3      4
   1     426    609    556    600
   2     253    236    392    395
   3     359    433    349    357
   4     432    431    522    600
   5     405    426    513    513
   6     324    438    507    539
   7     310    312    410    456
   8     326    326    350    504
   9     375    447    547    548
  10     286    286    403    422
  11     349    382    473    497
  12     429    410    488    547
  13     348    377    447    514
  14     412    473    472    446
  15     347    326    455    468
  16     434    458    637    524
  17     364    367    432    469
  18     420    395    508    531
  19     397    556    645    625

Source: Data courtesy of Dr. J. Atlee.
    (mu_1 + mu_4) - (mu_2 + mu_3) = (contrast representing the influence of halothane on CO2 pressure differences, the H-CO2 pressure "interaction")

With mu' = [mu_1, mu_2, mu_3, mu_4], the contrast matrix C is

    C = [ -1  -1   1   1 ]
        [  1  -1   1  -1 ]
        [  1  -1  -1   1 ]

The data (see Table 6.2) give

    x-bar = [368.21, 404.63, 479.26, 502.89]'

and

    S = [ 2819.29  3568.42  2943.49  2295.35 ]
        [ 3568.42  7963.14  5303.98  4065.44 ]
        [ 2943.49  5303.98  6851.32  4499.63 ]
        [ 2295.35  4065.44  4499.63  4878.99 ]

It can be verified that

    C x-bar = [209.31, -60.05, -12.79]'

and

    C S C' = [ 9432.32  1098.92   927.62 ]
             [ 1098.92  5195.84   914.54 ]
             [  927.62   914.54  7557.44 ]
282
Chapter 6
Com parisons of Severa l M u ltivariate Means
and

    T² = n(C x-bar)' (C S C')^{-1} (C x-bar) = 19(6.11) = 116

With alpha = .05,

    ((n - 1)(q - 1)/(n - q + 1)) F_{q-1,n-q+1}(alpha) = [18(3)/16] F_{3,16}(.05) = [18(3)/16] (3.24) = 10.94

From (6-16), T² = 116 > 10.94, and we reject H0: C mu = 0 (no treatment effects). To see which of the contrasts are responsible for the rejection of H0, we construct 95% simultaneous confidence intervals for these contrasts. From (6-18), the contrast

    c_1' mu = (mu_3 + mu_4) - (mu_1 + mu_2) = halothane influence

is estimated by the interval

    (x-bar_3 + x-bar_4) - (x-bar_1 + x-bar_2) ± sqrt[ (18(3)/16) F_{3,16}(.05) ] sqrt(c_1' S c_1 / 19)
        = 209.31 ± sqrt(10.94) sqrt(9432.32/19) = 209.31 ± 73.70

where c_1' is the first row of C. Similarly, the remaining contrasts are estimated by

    CO2 pressure influence, (mu_1 + mu_3) - (mu_2 + mu_4):
        -60.05 ± sqrt(10.94) sqrt(5195.84/19) = -60.05 ± 54.70

    H-CO2 pressure "interaction", (mu_1 + mu_4) - (mu_2 + mu_3):
        -12.79 ± sqrt(10.94) sqrt(7557.44/19) = -12.79 ± 65.97

The first confidence interval implies that there is a halothane effect. The presence of halothane produces longer times between heartbeats. This occurs at both levels of CO2 pressure, since the H-CO2 pressure interaction contrast, (mu_1 + mu_4) - (mu_2 + mu_3), is not significantly different from zero. (See the third confidence interval.) The second confidence interval indicates that there is an effect due to CO2 pressure: The lower CO2 pressure produces longer times between heartbeats. Some caution must be exercised in our interpretation of the results because the trials with halothane must follow those without. The apparent H-effect may be due to a time trend. (Ideally, the time order of all treatments should be determined at random.) •
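The arithmetic in Example 6.2 can be checked directly from Table 6.2. A sketch assuming NumPy and SciPy are available (variable names are ours):

```python
import numpy as np
from scipy import stats

# Table 6.2 sleeping-dog data: 19 dogs, treatments 1-4 in the columns.
X = np.array([
    [426, 609, 556, 600], [253, 236, 392, 395], [359, 433, 349, 357],
    [432, 431, 522, 600], [405, 426, 513, 513], [324, 438, 507, 539],
    [310, 312, 410, 456], [326, 326, 350, 504], [375, 447, 547, 548],
    [286, 286, 403, 422], [349, 382, 473, 497], [429, 410, 488, 547],
    [348, 377, 447, 514], [412, 473, 472, 446], [347, 326, 455, 468],
    [434, 458, 637, 524], [364, 367, 432, 469], [420, 395, 508, 531],
    [397, 556, 645, 625]], dtype=float)

n, q = X.shape
C = np.array([[-1.0, -1.0, 1.0, 1.0],    # halothane contrast
              [1.0, -1.0, 1.0, -1.0],    # CO2-pressure contrast
              [1.0, -1.0, -1.0, 1.0]])   # interaction contrast
cx = C @ X.mean(axis=0)                   # C x-bar, about [209.31, -60.05, -12.79]
CSC = C @ np.cov(X, rowvar=False) @ C.T   # C S C'
T2 = n * cx @ np.linalg.solve(CSC, cx)    # (6-16); about 116
crit = (n - 1) * (q - 1) / (n - q + 1) * stats.f.ppf(0.95, q - 1, n - q + 1)
half = np.sqrt(crit * np.diag(CSC) / n)   # (6-18) half-widths, about [73.7, 54.7, 66.0]
```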
The test in (6-16) is appropriate when the covariance matrix, Cov(X) = Sigma, cannot be assumed to have any special structure. If it is reasonable to assume that Sigma has a particular structure, tests designed with this structure in mind have higher power than the one in (6-16). (For Sigma with the equal correlation structure (8-14), see a discussion of the "randomized block" design in [10] or [16].)
6.3 COMPARING MEAN VECTORS FROM TWO POPULATIONS

A T²-statistic for testing the equality of vector means from two multivariate populations can be developed by analogy with the univariate procedure. (See [7] for a discussion of the univariate case.) This T²-statistic is appropriate for comparing responses from one set of experimental settings (population 1) with independent responses from another set of experimental settings (population 2). The comparison can be made without explicitly controlling for unit-to-unit variability, as in the paired-comparison case.
If possible, the experimental units should be randomly assigned to the sets of experimental conditions. Randomization will, to some extent, mitigate the effect of unit-to-unit variability in a subsequent comparison of treatments. Although some precision is lost relative to paired comparisons, the inferences in the two-population case are, ordinarily, applicable to a more general collection of experimental units simply because unit homogeneity is not required.
Consider a random sample of size n1 from population 1 and a sample of size n2 from population 2. The observations on p variables can be arranged as follows:

    Sample                                    Summary statistics
    (Population 1)  X_11, X_12, ..., X_1n1    x-bar_1, S_1
    (Population 2)  X_21, X_22, ..., X_2n2    x-bar_2, S_2

In this notation, the first subscript (1 or 2) denotes the population. We want to make inferences about (mean vector of population 1) - (mean vector of population 2) = mu_1 - mu_2. For instance, we shall want to answer the question, Is mu_1 = mu_2 (or, equivalently, is mu_1 - mu_2 = 0)? Also, if mu_1 - mu_2 ≠ 0, which component means are different? With a few tentative assumptions, we are able to provide answers to these questions.

Assumptions Concerning the Structure of the Data
1. The sample X_11, X_12, ..., X_1n1 is a random sample of size n1 from a p-variate population with mean vector mu_1 and covariance matrix Sigma_1.
2. The sample X_21, X_22, ..., X_2n2 is a random sample of size n2 from a p-variate population with mean vector mu_2 and covariance matrix Sigma_2.        (6-19)

We shall see later that, for large samples, this structure is sufficient for making inferences about the p × 1 vector mu_1 - mu_2. However, when the sample sizes n1 and n2 are small, more assumptions are needed.
Further Assumptions When n1 and n2 Are Small

1. Both populations are multivariate normal.
2. Also, Sigma_1 = Sigma_2 (same covariance matrix).        (6-20)

The second assumption, that Sigma_1 = Sigma_2, is much stronger than its univariate counterpart. Here we are assuming that several pairs of variances and covariances are nearly equal.
When Sigma_1 = Sigma_2 = Sigma, sum_{j=1}^{n1} (x_1j - x-bar_1)(x_1j - x-bar_1)' is an estimate of (n1 - 1)Sigma and sum_{j=1}^{n2} (x_2j - x-bar_2)(x_2j - x-bar_2)' is an estimate of (n2 - 1)Sigma. Consequently, we can pool the information in both samples in order to estimate the common covariance Sigma. We set

    S_pooled = [ sum_{j=1}^{n1} (x_1j - x-bar_1)(x_1j - x-bar_1)' + sum_{j=1}^{n2} (x_2j - x-bar_2)(x_2j - x-bar_2)' ] / (n1 + n2 - 2)
             = [(n1 - 1)/(n1 + n2 - 2)] S_1 + [(n2 - 1)/(n1 + n2 - 2)] S_2        (6-21)

Since sum_{j=1}^{n1} (x_1j - x-bar_1)(x_1j - x-bar_1)' has n1 - 1 d.f. and sum_{j=1}^{n2} (x_2j - x-bar_2)(x_2j - x-bar_2)' has n2 - 1 d.f., the divisor (n1 - 1) + (n2 - 1) in (6-21) is obtained by combining the two component degrees of freedom. [See (4-24).] Additional support for the pooling procedure comes from consideration of the multivariate normal likelihood. (See Exercise 6.11.)
To test the hypothesis that mu_1 - mu_2 = delta_0, a specified vector, we consider the squared statistical distance from x-bar_1 - x-bar_2 to delta_0. Now,

    E(X-bar_1 - X-bar_2) = E(X-bar_1) - E(X-bar_2) = mu_1 - mu_2

Since the independence assumption in (6-19) implies that X-bar_1 and X-bar_2 are independent and thus Cov(X-bar_1, X-bar_2) = 0 (see Result 4.5), by (3-9), it follows that

    Cov(X-bar_1 - X-bar_2) = Cov(X-bar_1) + Cov(X-bar_2) = (1/n1)Sigma + (1/n2)Sigma = (1/n1 + 1/n2)Sigma        (6-22)

Because S_pooled estimates Sigma, we see that

    (1/n1 + 1/n2) S_pooled

is an estimator of Cov(X-bar_1 - X-bar_2).
The likelihood ratio test of

    H0: mu_1 - mu_2 = delta_0

is based on the square of the statistical distance, T², and is given by (see [1]): Reject H0 if

    T² = (x-bar_1 - x-bar_2 - delta_0)' [ (1/n1 + 1/n2) S_pooled ]^{-1} (x-bar_1 - x-bar_2 - delta_0) > c²        (6-23)
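The pooled estimate (6-21) and the test statistic (6-23) translate directly into code. A sketch assuming NumPy and SciPy are available (the function name is ours, not the text's):

```python
import numpy as np
from scipy import stats

def two_sample_T2(X1, X2, delta0=None, alpha=0.05):
    """Two-sample Hotelling T^2 test of H0: mu1 - mu2 = delta0 under
    the equal-covariance assumption, following (6-21)-(6-24).
    Returns (T2, c2); reject H0 when T2 > c2."""
    n1, p = X1.shape
    n2 = X2.shape[0]
    if delta0 is None:
        delta0 = np.zeros(p)
    # Pooled covariance (6-21)
    S_pooled = ((n1 - 1) * np.cov(X1, rowvar=False)
                + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    d = X1.mean(axis=0) - X2.mean(axis=0) - delta0
    T2 = d @ np.linalg.solve((1 / n1 + 1 / n2) * S_pooled, d)   # (6-23)
    c2 = ((n1 + n2 - 2) * p / (n1 + n2 - p - 1)
          * stats.f.ppf(1 - alpha, p, n1 + n2 - p - 1))          # (6-24)
    return T2, c2
```

When the two samples are identical, the mean difference is zero and T² = 0, so H0 is never rejected.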
where the critical distance c² is determined from the distribution of the two-sample T²-statistic.

Result 6.2. If X₁₁, X₁₂, ..., X₁ₙ₁ is a random sample of size n₁ from Nₚ(μ₁, Σ) and X₂₁, X₂₂, ..., X₂ₙ₂ is an independent random sample of size n₂ from Nₚ(μ₂, Σ), then

T² = [X̄₁ − X̄₂ − (μ₁ − μ₂)]' [(1/n₁ + 1/n₂) S_pooled]⁻¹ [X̄₁ − X̄₂ − (μ₁ − μ₂)]

is distributed as [(n₁ + n₂ − 2)p/(n₁ + n₂ − p − 1)] F_{p, n₁+n₂−p−1}. Consequently,

P[ (X̄₁ − X̄₂ − (μ₁ − μ₂))' [(1/n₁ + 1/n₂) S_pooled]⁻¹ (X̄₁ − X̄₂ − (μ₁ − μ₂)) ≤ c² ] = 1 − α    (6-24)

where

c² = [(n₁ + n₂ − 2)p/(n₁ + n₂ − p − 1)] F_{p, n₁+n₂−p−1}(α)

Proof. We first note that

X̄₁ − X̄₂ = (1/n₁)X₁₁ + (1/n₁)X₁₂ + ··· + (1/n₁)X₁ₙ₁ − (1/n₂)X₂₁ − (1/n₂)X₂₂ − ··· − (1/n₂)X₂ₙ₂

is distributed as Nₚ(μ₁ − μ₂, (1/n₁ + 1/n₂)Σ) by Result 4.8, with c₁ = c₂ = ··· = cₙ₁ = 1/n₁ and cₙ₁₊₁ = cₙ₁₊₂ = ··· = cₙ₁₊ₙ₂ = −1/n₂. According to (4-23), (n₁ − 1)S₁ is distributed as W_{n₁−1}(Σ) and (n₂ − 1)S₂ as W_{n₂−1}(Σ). By assumption, the X₁ⱼ's and the X₂ⱼ's are independent, so (n₁ − 1)S₁ and (n₂ − 1)S₂ are also independent. From (4-24), (n₁ − 1)S₁ + (n₂ − 1)S₂ = (n₁ + n₂ − 2)S_pooled is then distributed as W_{n₁+n₂−2}(Σ). Therefore,

T² = [(1/n₁ + 1/n₂)^(−1/2)(X̄₁ − X̄₂ − (μ₁ − μ₂))]' S_pooled⁻¹ [(1/n₁ + 1/n₂)^(−1/2)(X̄₁ − X̄₂ − (μ₁ − μ₂))]
   = (multivariate normal random vector)' (Wishart random matrix / d.f.)⁻¹ (multivariate normal random vector)
   = Nₚ(0, Σ)' [W_{n₁+n₂−2}(Σ)/(n₁ + n₂ − 2)]⁻¹ Nₚ(0, Σ)

which is the T²-distribution specified in (5-5), with n replaced by n₁ + n₂ − 1. [See also (5-8).] •
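The two-sample statistic in (6-23)/(6-24) is straightforward to compute numerically. The sketch below is illustrative only, not from the text: the helper name `two_sample_T2` is hypothetical, and it returns both T² and the multiplier that converts an F_{p, n₁+n₂−p−1} percentile into the critical value c².

```python
import numpy as np

def two_sample_T2(X1, X2):
    """Two-sample Hotelling T^2 with a pooled covariance matrix.

    Returns (T2, scale), where the critical value is
    c^2 = scale * F_{p, n1+n2-p-1}(alpha).
    """
    n1, p = X1.shape
    n2, _ = X2.shape
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False)   # divisor n1 - 1
    S2 = np.cov(X2, rowvar=False)   # divisor n2 - 1
    S_pooled = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)
    d = xbar1 - xbar2
    T2 = d @ np.linalg.inv((1 / n1 + 1 / n2) * S_pooled) @ d
    scale = (n1 + n2 - 2) * p / (n1 + n2 - p - 1)
    return T2, scale
```

In use, one would compare T² with scale × F_{p, n₁+n₂−p−1}(α), taking the F percentile from tables or statistical software.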
286    Chapter 6  Comparisons of Several Multivariate Means
We are primarily interested in confidence regions for μ₁ − μ₂. From (6-24) we conclude that all μ₁ − μ₂ within squared statistical distance c² of x̄₁ − x̄₂ constitute the confidence region. This region is an ellipsoid centered at the observed difference x̄₁ − x̄₂, whose axes are determined by the eigenvalues and eigenvectors of S_pooled (or S_pooled⁻¹).
Example 6.3 (Constructing a confidence region for the difference of two mean vectors)

Fifty bars of soap are manufactured in each of two ways. Two characteristics, X₁ = lather and X₂ = mildness, are measured. The summary statistics for bars produced by methods 1 and 2 are

x̄₁ = [8.3, 4.1]',   S₁ = [2  1; 1  6]
x̄₂ = [10.2, 3.9]',  S₂ = [2  1; 1  4]

Obtain a 95% confidence region for μ₁ − μ₂.

We first note that S₁ and S₂ are approximately equal, so that it is reasonable to pool them. Hence, from (6-21),

S_pooled = (49/98)S₁ + (49/98)S₂ = [2  1; 1  5]

Also,

x̄₁ − x̄₂ = [−1.9, .2]'

so the confidence ellipse is centered at [−1.9, .2]'. The eigenvalues and eigenvectors of S_pooled are obtained from the equation

0 = |S_pooled − λI| = |2 − λ   1; 1   5 − λ| = λ² − 7λ + 9

so λ = (7 ± √(49 − 36))/2. Consequently, λ₁ = 5.303 and λ₂ = 1.697, and the corresponding eigenvectors, e₁ and e₂, determined from

S_pooled eᵢ = λᵢ eᵢ,   i = 1, 2

are

e₁ = [.290, .957]'   and   e₂ = [.957, −.290]'

By Result 6.2,

(1/n₁ + 1/n₂)c² = (1/50 + 1/50) [(98)(2)/97] F₂,₉₇(.05) = .25

since F₂,₉₇(.05) = 3.1. The confidence ellipse extends

√λᵢ √((1/n₁ + 1/n₂)c²)
Figure 6.1  95% confidence ellipse for μ₁ − μ₂.
units along the eigenvector eᵢ, or 1.15 units in the e₁ direction and .65 units in the e₂ direction. The 95% confidence ellipse is shown in Figure 6.1. Clearly, μ₁ − μ₂ = 0 is not in the ellipse, and we conclude that the two methods of manufacturing soap produce different results. It appears as if the two processes produce bars of soap with about the same mildness (X₂), but those from the second process have more lather (X₁). •
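The eigenvalue computation and axis half-lengths in Example 6.3 can be checked with a few lines of code. This is a minimal sketch; the tabled value F₂,₉₇(.05) = 3.1 is taken from the text rather than computed.

```python
import numpy as np

# Pooled covariance and sample sizes from Example 6.3
S_pooled = np.array([[2.0, 1.0], [1.0, 5.0]])
n1 = n2 = 50

lam, E = np.linalg.eigh(S_pooled)   # eigenvalues in ascending order
lam, E = lam[::-1], E[:, ::-1]      # reorder: largest eigenvalue first

# (1/n1 + 1/n2) c^2 with the quoted table value F_{2,97}(.05) = 3.1
c2_scaled = (1 / n1 + 1 / n2) * (98 * 2 / 97) * 3.1
half_lengths = np.sqrt(lam) * np.sqrt(c2_scaled)
print(np.round(lam, 3))           # [5.303 1.697]
print(np.round(half_lengths, 2))  # [1.15 0.65]
```

The printed half-lengths reproduce the 1.15 and .65 units quoted in the example.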
Simultaneous Confidence Intervals
It is possible to derive simultaneous confidence intervals for the components of the vector μ₁ − μ₂. These confidence intervals are developed from a consideration of all possible linear combinations of the differences in the mean vectors. It is assumed that the parent multivariate populations are normal with a common covariance matrix Σ.
Result 6.3. Let c² = [(n₁ + n₂ − 2)p/(n₁ + n₂ − p − 1)] F_{p, n₁+n₂−p−1}(α). With probability 1 − α,

a'(X̄₁ − X̄₂) ± c √(a' (1/n₁ + 1/n₂) S_pooled a)

will cover a'(μ₁ − μ₂) for all a. In particular, μ₁ᵢ − μ₂ᵢ will be covered by

(X̄₁ᵢ − X̄₂ᵢ) ± c √((1/n₁ + 1/n₂) sᵢᵢ,pooled)   for i = 1, 2, ..., p

Proof. Consider univariate linear combinations of the observations X₁₁, X₁₂, ..., X₁ₙ₁ and X₂₁, X₂₂, ..., X₂ₙ₂ given by a'X₁ⱼ = a₁X₁ⱼ₁ + a₂X₁ⱼ₂ + ··· + aₚX₁ⱼₚ and a'X₂ⱼ = a₁X₂ⱼ₁ + a₂X₂ⱼ₂ + ··· + aₚX₂ⱼₚ. These linear combinations have sample means and variances a'X̄₁, a'S₁a and a'X̄₂, a'S₂a, respectively, where X̄₁, S₁ and X̄₂, S₂ are the mean and covariance statistics for the two original samples. (See Result 3.5.) When both parent populations have the same covariance matrix, s²₁,ₐ = a'S₁a and s²₂,ₐ = a'S₂a are both estimators of a'Σa, the common population variance of the linear combinations a'X₁ and a'X₂. Pooling these estimators, we obtain

s²ₐ,pooled = [(n₁ − 1)s²₁,ₐ + (n₂ − 1)s²₂,ₐ]/(n₁ + n₂ − 2)
           = a' [ (n₁ − 1)/(n₁ + n₂ − 2) S₁ + (n₂ − 1)/(n₁ + n₂ − 2) S₂ ] a
           = a' S_pooled a    (6-25)

To test H₀: a'(μ₁ − μ₂) = a'δ₀, on the basis of the a'X₁ⱼ and a'X₂ⱼ, we can form the square of the univariate two-sample t-statistic

t²ₐ = [a'(X̄₁ − X̄₂) − a'(μ₁ − μ₂)]² / [(1/n₁ + 1/n₂) s²ₐ,pooled]
    = [a'(X̄₁ − X̄₂ − (μ₁ − μ₂))]² / [a' (1/n₁ + 1/n₂) S_pooled a]    (6-26)

According to the maximization lemma (2-50), with d = X̄₁ − X̄₂ − (μ₁ − μ₂) and B = (1/n₁ + 1/n₂)S_pooled,

t²ₐ ≤ (X̄₁ − X̄₂ − (μ₁ − μ₂))' [(1/n₁ + 1/n₂) S_pooled]⁻¹ (X̄₁ − X̄₂ − (μ₁ − μ₂)) = T²

for all a ≠ 0. Thus,

(1 − α) = P[T² ≤ c²] = P[t²ₐ ≤ c², for all a]
        = P[ |a'(X̄₁ − X̄₂) − a'(μ₁ − μ₂)| ≤ c √(a' (1/n₁ + 1/n₂) S_pooled a), for all a ]

where c² is selected according to Result 6.2. •
Remark. For testing H₀: μ₁ − μ₂ = 0, the linear combination â'(x̄₁ − x̄₂), with coefficient vector â ∝ S_pooled⁻¹(x̄₁ − x̄₂), quantifies the largest population difference. That is, if T² rejects H₀, then â'(x̄₁ − x̄₂) will have a nonzero mean. Frequently, we try to interpret the components of this linear combination for both subject-matter and statistical importance.

Example 6.4 (Calculating simultaneous confidence intervals for the differences in mean components)
Samples of sizes n₁ = 45 and n₂ = 55 were taken of Wisconsin homeowners with and without air conditioning, respectively. (Data courtesy of Statistical Laboratory, University of Wisconsin.) Two measurements of electrical usage (in kilowatt hours) were considered. The first is a measure of total on-peak consumption (X₁) during July, and the second is a measure of total off-peak consumption (X₂) during July. The resulting summary statistics are

x̄₁ = [204.4, 556.6]',  S₁ = [13825.3  23823.4; 23823.4  73107.4],  n₁ = 45
x̄₂ = [130.0, 355.0]',  S₂ = [ 8632.0  19616.7; 19616.7  55964.5],  n₂ = 55
(The off-peak consumption is higher than the on-peak consumption because there are more off-peak hours in a month.)

Let us find 95% simultaneous confidence intervals for the differences in the mean components. Although there appears to be somewhat of a discrepancy in the sample variances, for illustrative purposes we proceed to a calculation of the pooled sample covariance matrix. Here

S_pooled = [(n₁ − 1)/(n₁ + n₂ − 2)] S₁ + [(n₂ − 1)/(n₁ + n₂ − 2)] S₂ = [10963.7  21505.5; 21505.5  63661.3]

and

c² = [(n₁ + n₂ − 2)p/(n₁ + n₂ − p − 1)] F_{p, n₁+n₂−p−1}(.05) = [98(2)/97] F₂,₉₇(.05) = (2.02)(3.1) = 6.26

With μ₁ − μ₂ = [μ₁₁ − μ₂₁, μ₁₂ − μ₂₂]', the 95% simultaneous confidence intervals for the population differences are

μ₁₁ − μ₂₁: (204.4 − 130.0) ± √6.26 √((1/45 + 1/55)10963.7)   or   21.7 < μ₁₁ − μ₂₁ < 127.1

μ₁₂ − μ₂₂: (556.6 − 355.0) ± √6.26 √((1/45 + 1/55)63661.3)   or   74.7 < μ₁₂ − μ₂₂ < 328.5

We conclude that there is a difference in electrical consumption between those with air conditioning and those without. This difference is evident in both on-peak and off-peak consumption.

The 95% confidence ellipse for μ₁ − μ₂ is determined from the eigenvalue-eigenvector pairs λ₁ = 71323.5, e₁ = [.336, .942]' and λ₂ = 3301.5, e₂ = [.942, −.336]'. Since

√λ₁ √((1/n₁ + 1/n₂)c²) = √71323.5 √((1/45 + 1/55)6.26) = 134.3

and

√λ₂ √((1/n₁ + 1/n₂)c²) = √3301.5 √((1/45 + 1/55)6.26) = 28.9

we obtain the 95% confidence ellipse for μ₁ − μ₂ sketched in Figure 6.2 on page 290. Because the confidence ellipse for the difference in means does not cover 0' = [0, 0], the T²-statistic will reject H₀: μ₁ − μ₂ = 0 at the 5% level.

The coefficient vector for the linear combination most responsible for rejection is proportional to S_pooled⁻¹(x̄₁ − x̄₂). (See Exercise 6.7.) •
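The simultaneous intervals of Example 6.4 can be reproduced from the summary statistics alone. A minimal sketch follows; the only outside input is the tabled percentile F₂,₉₇(.05) = 3.1 quoted in the text.

```python
import numpy as np

n1, n2, p = 45, 55, 2
xbar1 = np.array([204.4, 556.6])
xbar2 = np.array([130.0, 355.0])
S1 = np.array([[13825.3, 23823.4], [23823.4, 73107.4]])
S2 = np.array([[8632.0, 19616.7], [19616.7, 55964.5]])

S_pooled = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)
# c^2 from Result 6.3, with the table value F_{2,97}(.05) = 3.1
c2 = (n1 + n2 - 2) * p / (n1 + n2 - p - 1) * 3.1
d = xbar1 - xbar2
half = np.sqrt(c2) * np.sqrt((1 / n1 + 1 / n2) * np.diag(S_pooled))
intervals = [(d[i] - half[i], d[i] + half[i]) for i in range(p)]
for lo, hi in intervals:
    print(round(lo, 1), round(hi, 1))   # 21.7 127.1  /  74.7 328.5
```

The two printed intervals match (21.7, 127.1) and (74.7, 328.5) from the example.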
Figure 6.2  95% confidence ellipse for μ₁ − μ₂ = (μ₁₁ − μ₂₁, μ₁₂ − μ₂₂).
The Bonferroni 100(1 − α)% simultaneous confidence intervals for the p population mean differences are

μ₁ᵢ − μ₂ᵢ: (x̄₁ᵢ − x̄₂ᵢ) ± t_{n₁+n₂−2}(α/2p) √((1/n₁ + 1/n₂) sᵢᵢ,pooled)

where t_{n₁+n₂−2}(α/2p) is the upper 100(α/2p)th percentile of a t-distribution with n₁ + n₂ − 2 d.f.

The Two-Sample Situation When Σ₁ ≠ Σ₂
When Σ₁ ≠ Σ₂, we are unable to find a "distance" measure like T² whose distribution does not depend on the unknowns Σ₁ and Σ₂. Bartlett's test [3] is used to test the equality of Σ₁ and Σ₂ in terms of generalized variances. Unfortunately, the conclusions can be seriously misleading when the populations are nonnormal. Nonnormality and unequal covariances cannot be separated with Bartlett's test. A method of testing the equality of two covariance matrices that is less sensitive to the assumption of multivariate normality has been proposed by Tiku and Balakrishnan [17]. However, more practical experience is needed with this test before we can recommend it unconditionally.

We suggest, without much factual support, that any discrepancy of the order σ₁,ᵢᵢ = 4σ₂,ᵢᵢ, or vice versa, is probably serious. This is true in the univariate case. The size of the discrepancies that are critical in the multivariate situation probably depends, to a large extent, on the number of variables p.

A transformation may improve things when the marginal variances are quite different. However, for n₁ and n₂ large, we can avoid the complexities due to unequal covariance matrices.
Result 6.4. Let the sample sizes be such that n₁ − p and n₂ − p are large. Then, an approximate 100(1 − α)% confidence ellipsoid for μ₁ − μ₂ is given by all μ₁ − μ₂ satisfying

[x̄₁ − x̄₂ − (μ₁ − μ₂)]' [(1/n₁)S₁ + (1/n₂)S₂]⁻¹ [x̄₁ − x̄₂ − (μ₁ − μ₂)] ≤ χ²ₚ(α)

where χ²ₚ(α) is the upper (100α)th percentile of a chi-square distribution with p d.f. Also, 100(1 − α)% simultaneous confidence intervals for all linear combinations a'(μ₁ − μ₂) are provided by

a'(μ₁ − μ₂) belongs to a'(x̄₁ − x̄₂) ± √χ²ₚ(α) √(a' ((1/n₁)S₁ + (1/n₂)S₂) a)

Proof. From (6-22) and (3-9),

E(X̄₁ − X̄₂) = μ₁ − μ₂   and   Cov(X̄₁ − X̄₂) = (1/n₁)Σ₁ + (1/n₂)Σ₂

By the central limit theorem, X̄₁ − X̄₂ is nearly Nₚ[μ₁ − μ₂, n₁⁻¹Σ₁ + n₂⁻¹Σ₂]. If Σ₁ and Σ₂ were known, the square of the statistical distance from X̄₁ − X̄₂ to μ₁ − μ₂ would be

[X̄₁ − X̄₂ − (μ₁ − μ₂)]' [(1/n₁)Σ₁ + (1/n₂)Σ₂]⁻¹ [X̄₁ − X̄₂ − (μ₁ − μ₂)]

This squared distance has an approximate χ²ₚ-distribution, by Result 4.7. When n₁ and n₂ are large, with high probability, S₁ will be close to Σ₁ and S₂ will be close to Σ₂. Consequently, the approximation holds with S₁ and S₂ in place of Σ₁ and Σ₂, respectively. The results concerning the simultaneous confidence intervals follow from Result 5A.1. •
Remark. If n₁ = n₂ = n, then (n − 1)/(n + n − 2) = 1/2, so

(1/n₁)S₁ + (1/n₂)S₂ = (1/n)(S₁ + S₂) = [(n − 1)S₁ + (n − 1)S₂]/(n + n − 2) (1/n + 1/n) = (1/n + 1/n) S_pooled

With equal sample sizes, the large-sample procedure is essentially the same as the procedure based on the pooled covariance matrix. (See Result 6.2.) In one dimension, it is well known that the effect of unequal variances is least when n₁ = n₂ and greatest when n₁ is much less than n₂ or vice versa.
Example 6.5 (Large-sample procedures for inferences about the difference in means)

We shall analyze the electrical-consumption data discussed in Example 6.4 using the large-sample approach. We first calculate

(1/n₁)S₁ + (1/n₂)S₂ = (1/45)[13825.3  23823.4; 23823.4  73107.4] + (1/55)[8632.0  19616.7; 19616.7  55964.5]
                    = [464.17  886.08; 886.08  2642.15]

The 95% simultaneous confidence intervals for the linear combinations

a'(μ₁ − μ₂) = [1, 0][μ₁₁ − μ₂₁, μ₁₂ − μ₂₂]' = μ₁₁ − μ₂₁
a'(μ₁ − μ₂) = [0, 1][μ₁₁ − μ₂₁, μ₁₂ − μ₂₂]' = μ₁₂ − μ₂₂

are (see Result 6.4)

μ₁₁ − μ₂₁: 74.4 ± √5.99 √464.17   or   (21.7, 127.1)
μ₁₂ − μ₂₂: 201.6 ± √5.99 √2642.15   or   (75.8, 327.4)

Notice that these intervals differ negligibly from the intervals in Example 6.4, where the pooling procedure was employed.

The T²-statistic for testing H₀: μ₁ − μ₂ = 0 is

T² = [x̄₁ − x̄₂]' [(1/n₁)S₁ + (1/n₂)S₂]⁻¹ [x̄₁ − x̄₂]
   = [74.4  201.6] (10⁻⁴) [59.874  −20.080; −20.080  10.519] [74.4  201.6]' = 15.66

For α = .05, the critical value is χ²₂(.05) = 5.99 and, since T² = 15.66 > χ²₂(.05) = 5.99, we reject H₀.

The most critical linear combination leading to the rejection of H₀ has coefficient vector

â ∝ [(1/n₁)S₁ + (1/n₂)S₂]⁻¹(x̄₁ − x̄₂) = (10⁻⁴)[59.874  −20.080; −20.080  10.519][74.4  201.6]' = [.041  .063]'

The difference in off-peak electrical consumption between those with air conditioning and those without contributes more than the corresponding difference in on-peak consumption to the rejection of H₀: μ₁ − μ₂ = 0. •
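The large-sample chi-square procedure of Result 6.4, applied in Example 6.5, can be sketched directly from the summary statistics; the critical value χ²₂(.05) = 5.99 is the table value quoted in the text.

```python
import numpy as np

n1, n2 = 45, 55
S1 = np.array([[13825.3, 23823.4], [23823.4, 73107.4]])
S2 = np.array([[8632.0, 19616.7], [19616.7, 55964.5]])
d = np.array([204.4 - 130.0, 556.6 - 355.0])   # xbar1 - xbar2

V = S1 / n1 + S2 / n2          # estimated Cov(X̄1 - X̄2)
T2 = d @ np.linalg.inv(V) @ d
print(round(T2, 2))            # 15.66; compare with chi-square_2(.05) = 5.99
a_hat = np.linalg.inv(V) @ d   # combination most responsible for rejection
```

Here `a_hat` reproduces the direction [.041, .063]' (up to scale), confirming that the second (off-peak) component dominates.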
A statistic similar to T² that is less sensitive to outlying observations for small and moderately sized samples has been developed by Tiku and Singh [18]. However, if the sample size is moderate to large, Hotelling's T² is remarkably unaffected by slight departures from normality and/or the presence of a few outliers.

6.4 COMPARING SEVERAL MULTIVARIATE POPULATION MEANS (ONE-WAY MANOVA)
Often, more than two populations need to be compared. Random samples, collected from each of g populations, are arranged as

Population 1: X₁₁, X₁₂, ..., X₁ₙ₁
⋮
Population g: X_g1, X_g2, ..., X_gn_g    (6-27)

MANOVA is used first to investigate whether the population mean vectors are the same and, if not, which mean components differ significantly.

Assumptions about the Structure of the Data for One-Way MANOVA

1. X_ℓ1, X_ℓ2, ..., X_ℓn_ℓ is a random sample of size n_ℓ from a population with mean μ_ℓ, ℓ = 1, 2, ..., g. The random samples from different populations are independent.
2. All populations have a common covariance matrix Σ.
3. Each population is multivariate normal.

Condition 3 can be relaxed by appealing to the central limit theorem (Result 4.13) when the sample sizes n_ℓ are large.

A review of the univariate analysis of variance (ANOVA) will facilitate our discussion of the multivariate assumptions and solution methods.

A Summary of Univariate ANOVA

In the univariate situation, the assumptions are that X_ℓ1, X_ℓ2, ..., X_ℓn_ℓ is a random sample from an N(μ_ℓ, σ²) population, ℓ = 1, 2, ..., g, and that the random samples are independent. Although the null hypothesis of equality of means could be formulated as μ₁ = μ₂ = ··· = μ_g, it is customary to regard μ_ℓ as the sum of an overall mean component, such as μ, and a component due to the specific population. For instance, we can write μ_ℓ = μ + (μ_ℓ − μ) or μ_ℓ = μ + τ_ℓ, where τ_ℓ = μ_ℓ − μ. Populations usually correspond to different sets of experimental conditions, and therefore, it is convenient to investigate the deviations τ_ℓ associated with the ℓth population (treatment).
The reparameterization

μ_ℓ = μ + τ_ℓ    (6-28)
(ℓth population mean) = (overall mean) + (ℓth population (treatment) effect)

leads to a restatement of the hypothesis of equality of means. The null hypothesis becomes

H₀: τ₁ = τ₂ = ··· = τ_g = 0

The response X_ℓj, distributed as N(μ + τ_ℓ, σ²), can be expressed in the suggestive form

X_ℓj = μ + τ_ℓ + e_ℓj    (6-29)
(overall mean) + (treatment effect) + (random error)

where the e_ℓj are independent N(0, σ²) random variables. To define uniquely the model parameters and their least squares estimates, it is customary to impose the constraint Σ_{ℓ=1}^g n_ℓ τ_ℓ = 0.

Motivated by the decomposition in (6-29), the analysis of variance is based upon an analogous decomposition of the observations,

x_ℓj = x̄ + (x̄_ℓ − x̄) + (x_ℓj − x̄_ℓ)    (6-30)
(observation) = (overall sample mean) + (estimated treatment effect) + (residual)

where x̄ is an estimate of μ, τ̂_ℓ = (x̄_ℓ − x̄) is an estimate of τ_ℓ, and (x_ℓj − x̄_ℓ) is an estimate of the error e_ℓj.

Example 6.6
(The sum of squares decomposition for univariate ANOVA)

Consider the following independent samples.

Population 1: 9, 6, 9
Population 2: 0, 2
Population 3: 3, 1, 2

Since, for example, x̄₃ = (3 + 1 + 2)/3 = 2 and x̄ = (9 + 6 + 9 + 0 + 2 + 3 + 1 + 2)/8 = 4, we find that

3 = x₃₁ = x̄ + (x̄₃ − x̄) + (x₃₁ − x̄₃) = 4 + (2 − 4) + (3 − 2) = 4 + (−2) + 1

Repeating this operation for each observation, we obtain the arrays
(9  6  9)     (4  4  4)     ( 4   4   4)     ( 1  −2   1)
(0  2   )  =  (4  4   )  +  (−3  −3    )  +  (−1   1    )
(3  1  2)     (4  4  4)     (−2  −2  −2)     ( 1  −1   0)

(observation x_ℓj) = (mean x̄) + (treatment effect x̄_ℓ − x̄) + (residual x_ℓj − x̄_ℓ)
==
Similarly,
ss mean sstr
==
==
==
4 2 + 4 2 + 4 2 + 4 2 + 4 2 + 4 2 + 4 2 + 4 2 == 8(4 2 ) 128 42 + 42 + 42 + ( - 3 ) 2 + ( - 3 ) 2 + ( - 2 ) 2 + ( - 2 ) 2 + ( - 2 ) 2 3 ( 4 2 ) + 2 ( - 3 ) 2 + 3 ( - 2 ) 2 78 ==
==
and the residual sum of squares is ssres == 12 + ( - 2 ) 2 + 12 + ( - 1 ) 2 + 12 + 12 + ( - 1 ) 2 + 02
==
10
The sums of squares satisfy the same decomposition, (6-30), as the observations. Consequently, or 216 == 128 + 78 + 10. The breakup into sums of squares apportions vari ability in the combined samples into mean, treatment, and residual (error) com ponents. An analysis of variance proceeds by comparing the relative sizes of SS1r and SS res If H0 is true, variances computed from SS1r and SS res should be ap • proximately equal. .
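The decomposition in Example 6.6 is easy to verify numerically. The sketch below uses only the three samples from the example; variable names are illustrative.

```python
# Sum-of-squares decomposition for the samples of Example 6.6
samples = [[9, 6, 9], [0, 2], [3, 1, 2]]
all_obs = [x for s in samples for x in s]
n = len(all_obs)
grand = sum(all_obs) / n                       # overall mean, 4.0

ss_obs  = sum(x**2 for x in all_obs)
ss_mean = n * grand**2
ss_tr   = sum(len(s) * (sum(s) / len(s) - grand)**2 for s in samples)
ss_res  = sum((x - sum(s) / len(s))**2 for s in samples for x in s)
print(ss_obs, ss_mean, ss_tr, ss_res)          # 216 128.0 78.0 10.0
```

The printed values reproduce 216 = 128 + 78 + 10 from the example.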
The sum of squares decomposition illustrated numerically in Example 6.6 is so basic that the algebraic equivalent will now be developed. Subtracting x̄ from both sides of (6-30) and squaring gives

(x_ℓj − x̄)² = (x̄_ℓ − x̄)² + (x_ℓj − x̄_ℓ)² + 2(x̄_ℓ − x̄)(x_ℓj − x̄_ℓ)

We can sum both sides over j, note that Σ_{j=1}^{n_ℓ} (x_ℓj − x̄_ℓ) = 0, and obtain

Σ_{j=1}^{n_ℓ} (x_ℓj − x̄)² = n_ℓ(x̄_ℓ − x̄)² + Σ_{j=1}^{n_ℓ} (x_ℓj − x̄_ℓ)²
Next, summing both sides over ℓ, we get

Σ_{ℓ=1}^g Σ_{j=1}^{n_ℓ} (x_ℓj − x̄)² = Σ_{ℓ=1}^g n_ℓ(x̄_ℓ − x̄)² + Σ_{ℓ=1}^g Σ_{j=1}^{n_ℓ} (x_ℓj − x̄_ℓ)²    (6-31)
(SS_cor: total (corrected) SS) = (SS_tr: between (samples) SS) + (SS_res: within (samples) SS)

or

Σ_{ℓ=1}^g Σ_{j=1}^{n_ℓ} x²_ℓj = (n₁ + n₂ + ··· + n_g)x̄² + Σ_{ℓ=1}^g n_ℓ(x̄_ℓ − x̄)² + Σ_{ℓ=1}^g Σ_{j=1}^{n_ℓ} (x_ℓj − x̄_ℓ)²    (6-32)
(SS_obs) = (SS_mean) + (SS_tr) + (SS_res)

In the course of establishing (6-32), we have verified that the arrays representing the mean, treatment effects, and residuals are orthogonal. That is, these arrays, considered as vectors, are perpendicular whatever the observation vector y' = [x₁₁, ..., x₁ₙ₁, x₂₁, ..., x₂ₙ₂, ..., x_gn_g]. Consequently, we could obtain SS_res by subtraction, without having to calculate the individual residuals, because SS_res = SS_obs − SS_mean − SS_tr. However, this is false economy, because plots of the residuals provide checks on the assumptions of the model.

The vector representations of the arrays involved in the decomposition (6-30) also have geometric interpretations that provide the degrees of freedom. For an arbitrary set of observations, let [x₁₁, ..., x₁ₙ₁, x₂₁, ..., x₂ₙ₂, ..., x_gn_g] = y'. The observation vector y can lie anywhere in n = n₁ + n₂ + ··· + n_g dimensions; the mean vector x̄1 = [x̄, ..., x̄]' must lie along the equiangular line of 1, and the treatment effect vector

(x̄₁ − x̄)u₁ + (x̄₂ − x̄)u₂ + ··· + (x̄_g − x̄)u_g

where u_ℓ has ones in the n_ℓ positions corresponding to the ℓth sample and zeros elsewhere, lies in the hyperplane of linear combinations of the g vectors u₁, u₂, ..., u_g. Since 1 = u₁ + u₂ + ··· + u_g, the mean vector also lies in this hyperplane, and it is always perpendicular to the treatment vector. (See Exercise 6.10.) Thus, the mean vector has the freedom to lie anywhere along the one-dimensional equiangular line, and the treatment vector has the freedom to lie anywhere in the other g − 1 dimensions.
The residual vector, ê = y − (x̄1) − [(x̄₁ − x̄)u₁ + ··· + (x̄_g − x̄)u_g], is perpendicular to both the mean vector and the treatment effect vector and has the freedom to lie anywhere in the subspace of dimension n − (g − 1) − 1 = n − g that is perpendicular to their hyperplane.

To summarize, we attribute 1 d.f. to SS_mean, g − 1 d.f. to SS_tr, and n − g = (n₁ + n₂ + ··· + n_g) − g d.f. to SS_res. The total number of degrees of freedom is
n = n₁ + n₂ + ··· + n_g. Alternatively, by appealing to the univariate distribution theory, we find that these are the degrees of freedom for the chi-square distributions associated with the corresponding sums of squares.

The calculations of the sums of squares and the associated degrees of freedom are conveniently summarized by an ANOVA table.

ANOVA TABLE FOR COMPARING UNIVARIATE POPULATION MEANS

Source of variation             Sum of squares (SS)                                  Degrees of freedom (d.f.)
Treatments                      SS_tr = Σ_{ℓ=1}^g n_ℓ(x̄_ℓ − x̄)²                      g − 1
Residual (Error)                SS_res = Σ_{ℓ=1}^g Σ_{j=1}^{n_ℓ} (x_ℓj − x̄_ℓ)²       Σ_{ℓ=1}^g n_ℓ − g
Total (corrected for the mean)  SS_cor = Σ_{ℓ=1}^g Σ_{j=1}^{n_ℓ} (x_ℓj − x̄)²         Σ_{ℓ=1}^g n_ℓ − 1

The usual F-test rejects H₀: τ₁ = τ₂ = ··· = τ_g = 0 at level α if

F = [SS_tr/(g − 1)] / [SS_res/(Σ_{ℓ=1}^g n_ℓ − g)] > F_{g−1, Σn_ℓ−g}(α)

where F_{g−1, Σn_ℓ−g}(α) is the upper (100α)th percentile of the F-distribution with g − 1 and Σn_ℓ − g degrees of freedom. This is equivalent to rejecting H₀ for large values of SS_tr/SS_res or for large values of 1 + SS_tr/SS_res. The statistic appropriate for a multivariate generalization rejects H₀ for small values of the reciprocal

1/(1 + SS_tr/SS_res) = SS_res/(SS_res + SS_tr)    (6-33)

Example 6.7 (A univariate ANOVA table and F-test for treatment effects)

Using the information in Example 6.6, we have the following ANOVA table:

Source of variation    Sum of squares    Degrees of freedom
Treatments             SS_tr = 78        g − 1 = 3 − 1 = 2
Residual               SS_res = 10       Σ_{ℓ=1}^g n_ℓ − g = (3 + 2 + 3) − 3 = 5
Total (corrected)      SS_cor = 88       Σ_{ℓ=1}^g n_ℓ − 1 = 7

Consequently,

F = [SS_tr/(g − 1)] / [SS_res/(Σn_ℓ − g)] = (78/2)/(10/5) = 19.5

Since F = 19.5 > F₂,₅(.01) = 13.27, we reject H₀: τ₁ = τ₂ = τ₃ = 0 (no treatment effect) at the 1% level of significance. •
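The F-ratio of Example 6.7 follows directly from the sums of squares already computed; a minimal check, with the critical value F₂,₅(.01) = 13.27 quoted from the text:

```python
# Univariate ANOVA F-ratio for Example 6.7
g, n = 3, 8
ss_tr, ss_res = 78, 10
F = (ss_tr / (g - 1)) / (ss_res / (n - g))
print(F)   # 19.5; exceeds F_{2,5}(.01) = 13.27, so H0 is rejected
```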
Multivariate Analysis of Variance (MANOVA)
Paralleling the univariate reparameterization, we specify the MANOVA model:

MANOVA Model for Comparing g Population Mean Vectors

X_ℓj = μ + τ_ℓ + e_ℓj,   j = 1, 2, ..., n_ℓ and ℓ = 1, 2, ..., g    (6-34)

where the e_ℓj are independent Nₚ(0, Σ) variables. Here the parameter vector μ is an overall mean (level), and τ_ℓ represents the ℓth treatment effect with Σ_{ℓ=1}^g n_ℓ τ_ℓ = 0.

According to the model in (6-34), each component of the observation vector X_ℓj satisfies the univariate model (6-29). The errors for the components of X_ℓj are correlated, but the covariance matrix Σ is the same for all populations.

A vector of observations may be decomposed as suggested by the model. Thus,

x_ℓj = x̄ + (x̄_ℓ − x̄) + (x_ℓj − x̄_ℓ)    (6-35)
(observation) = (overall sample mean) + (estimated treatment effect) + (residual)

The decomposition in (6-35) leads to the multivariate analog of the univariate sum of squares breakup in (6-31). First we note that the product (x_ℓj − x̄)(x_ℓj − x̄)' can be written as

(x_ℓj − x̄)(x_ℓj − x̄)' = [(x_ℓj − x̄_ℓ) + (x̄_ℓ − x̄)][(x_ℓj − x̄_ℓ) + (x̄_ℓ − x̄)]'
= (x_ℓj − x̄_ℓ)(x_ℓj − x̄_ℓ)' + (x_ℓj − x̄_ℓ)(x̄_ℓ − x̄)' + (x̄_ℓ − x̄)(x_ℓj − x̄_ℓ)' + (x̄_ℓ − x̄)(x̄_ℓ − x̄)'

The sum over j of the middle two expressions is the zero matrix, because Σ_{j=1}^{n_ℓ} (x_ℓj − x̄_ℓ) = 0. Hence, summing the cross product over ℓ and j yields

Σ_{ℓ=1}^g Σ_{j=1}^{n_ℓ} (x_ℓj − x̄)(x_ℓj − x̄)' = Σ_{ℓ=1}^g n_ℓ(x̄_ℓ − x̄)(x̄_ℓ − x̄)' + Σ_{ℓ=1}^g Σ_{j=1}^{n_ℓ} (x_ℓj − x̄_ℓ)(x_ℓj − x̄_ℓ)'    (6-36)
(total (corrected) sum of squares and cross products) = (treatment (between) sum of squares and cross products) + (residual (within) sum of squares and cross products)
The within sum of squares and cross products matrix can be expressed as

W = Σ_{ℓ=1}^g Σ_{j=1}^{n_ℓ} (x_ℓj − x̄_ℓ)(x_ℓj − x̄_ℓ)' = (n₁ − 1)S₁ + (n₂ − 1)S₂ + ··· + (n_g − 1)S_g    (6-37)

where S_ℓ is the sample covariance matrix for the ℓth sample. This matrix is a generalization of the (n₁ + n₂ − 2)S_pooled matrix encountered in the two-sample case. It plays a dominant role in testing for the presence of treatment effects.

Analogous to the univariate result, the hypothesis of no treatment effects,

H₀: τ₁ = τ₂ = ··· = τ_g = 0

is tested by considering the relative sizes of the treatment and residual sums of squares and cross products. Equivalently, we may consider the relative sizes of the residual and total (corrected) sums of squares and cross products. Formally, we summarize the calculations leading to the test statistic in a MANOVA table.

MANOVA TABLE FOR COMPARING POPULATION MEAN VECTORS

Source of variation             Matrix of sum of squares and cross products (SSP)                   Degrees of freedom (d.f.)
Treatment                       B = Σ_{ℓ=1}^g n_ℓ(x̄_ℓ − x̄)(x̄_ℓ − x̄)'                              g − 1
Residual (Error)                W = Σ_{ℓ=1}^g Σ_{j=1}^{n_ℓ} (x_ℓj − x̄_ℓ)(x_ℓj − x̄_ℓ)'              Σ_{ℓ=1}^g n_ℓ − g
Total (corrected for the mean)  B + W = Σ_{ℓ=1}^g Σ_{j=1}^{n_ℓ} (x_ℓj − x̄)(x_ℓj − x̄)'              Σ_{ℓ=1}^g n_ℓ − 1

This table is exactly the same form, component by component, as the ANOVA table, except that squares of scalars are replaced by their vector counterparts. For example, (x̄_ℓ − x̄)² becomes (x̄_ℓ − x̄)(x̄_ℓ − x̄)'. The degrees of freedom correspond to the univariate geometry and also to some multivariate distribution theory involving Wishart densities. (See [1].)

One test of H₀: τ₁ = τ₂ = ··· = τ_g = 0 involves generalized variances. We reject H₀ if the ratio of generalized variances

Λ* = |W| / |B + W| = |Σ_{ℓ=1}^g Σ_{j=1}^{n_ℓ} (x_ℓj − x̄_ℓ)(x_ℓj − x̄_ℓ)'| / |Σ_{ℓ=1}^g Σ_{j=1}^{n_ℓ} (x_ℓj − x̄)(x_ℓj − x̄)'|    (6-38)

is too small. The quantity Λ* = |W|/|B + W|, proposed originally by Wilks (see [20]), corresponds to the equivalent form (6-33) of the F-test of H₀: no treatment effects in the univariate case. Wilks' lambda has the virtue of being convenient and related to the likelihood ratio criterion.²
TABLE 6.3  DISTRIBUTION OF WILKS' LAMBDA, Λ* = |W|/|B + W|

No. of variables   No. of groups   Sampling distribution for multivariate normal data
p = 1              g ≥ 2           [(Σn_ℓ − g)/(g − 1)] [(1 − Λ*)/Λ*] ~ F_{g−1, Σn_ℓ−g}
p = 2              g ≥ 2           [(Σn_ℓ − g − 1)/(g − 1)] [(1 − √Λ*)/√Λ*] ~ F_{2(g−1), 2(Σn_ℓ−g−1)}
p ≥ 1              g = 2           [(Σn_ℓ − p − 1)/p] [(1 − Λ*)/Λ*] ~ F_{p, Σn_ℓ−p−1}
p ≥ 1              g = 3           [(Σn_ℓ − p − 2)/p] [(1 − √Λ*)/√Λ*] ~ F_{2p, 2(Σn_ℓ−p−2)}

The exact distribution of Λ* can be derived for the special cases listed in Table 6.3. For other cases and large sample sizes, a modification of Λ* due to Bartlett (see [4]) can be used to test H₀. Bartlett has shown that if H₀ is true and Σn_ℓ = n is large,

−(n − 1 − (p + g)/2) ln Λ* = −(n − 1 − (p + g)/2) ln(|W|/|B + W|)    (6-39)

has approximately a chi-square distribution with p(g − 1) d.f. Consequently, for Σn_ℓ = n large, we reject H₀ at significance level α if

−(n − 1 − (p + g)/2) ln(|W|/|B + W|) > χ²_{p(g−1)}(α)    (6-40)

where χ²_{p(g−1)}(α) is the upper (100α)th percentile of a chi-square distribution with p(g − 1) d.f.

Example 6.8 (A MANOVA table and Wilks' lambda for testing the equality of three mean vectors)

Suppose an additional variable is observed along with the variable introduced in Example 6.6. The sample sizes are n₁ = 3, n₂ = 2, and n₃ = 3. Arranging the observation pairs x_ℓj in rows, we obtain
² Wilks' lambda can also be expressed as a function of the eigenvalues λ̂₁, λ̂₂, ..., λ̂_s of W⁻¹B as

Λ* = Π_{i=1}^s [1/(1 + λ̂ᵢ)]

where s = min(p, g − 1), the rank of B. Other statistics for checking the equality of several multivariate means, such as Pillai's statistic, the Lawley-Hotelling statistic, and Roy's largest root statistic, can also be written as particular functions of the eigenvalues of W⁻¹B. For large samples, all of these statistics are essentially equivalent. (See the additional discussion on page 331.)
ℓ = 1: (9, 3), (6, 2), (9, 7)    ℓ = 2: (0, 4), (2, 0)    ℓ = 3: (3, 8), (1, 9), (2, 7)

We have already expressed the observations on the first variable as the sum of an overall mean, treatment effect, and residual in our discussion of univariate ANOVA. We found that

(9  6  9)     (4  4  4)     ( 4   4   4)     ( 1  −2   1)
(0  2   )  =  (4  4   )  +  (−3  −3    )  +  (−1   1    )
(3  1  2)     (4  4  4)     (−2  −2  −2)     ( 1  −1   0)
(observation) = (mean) + (treatment effect) + (residual)

and

SS_obs = SS_mean + SS_tr + SS_res
216 = 128 + 78 + 10

Total SS (corrected) = SS_obs − SS_mean = 216 − 128 = 88

Repeating this operation for the observations on the second variable, we have

(3  2  7)     (5  5  5)     (−1  −1  −1)     (−1  −2   3)
(4  0   )  =  (5  5   )  +  (−3  −3    )  +  ( 2  −2    )
(8  9  7)     (5  5  5)     ( 3   3   3)     ( 0   1  −1)
(observation) = (mean) + (treatment effect) + (residual)

and

SS_obs = SS_mean + SS_tr + SS_res
272 = 200 + 48 + 24

Total SS (corrected) = SS_obs − SS_mean = 272 − 200 = 72

These two single-component analyses must be augmented with the sum of entry-by-entry cross products in order to complete the MANOVA table. Proceeding row by row in the arrays for the two variables, we obtain the cross product contributions:

Mean: 4(5) + 4(5) + ··· + 4(5) = 8(4)(5) = 160
Treatment: 3(4)(−1) + 2(−3)(−3) + 3(−2)(3) = −12
Residual: 1(−1) + (−2)(−2) + 1(3) + (−1)(2) + ··· + 0(−1) = 1
Total: 9(3) + 6(2) + 9(7) + 0(4) + ··· + 2(7) = 149
Total (corrected) cross product = total cross product − mean cross product = 149 − 160 = −11

Thus, the MANOVA table takes the following form:

Source of variation    Matrix of sum of squares and cross products    Degrees of freedom
Treatment              [78  −12; −12  48]                             3 − 1 = 2
Residual               [10    1;   1  24]                             3 + 2 + 3 − 3 = 5
Total (corrected)      [88  −11; −11  72]                             7

Equation (6-36) is verified by noting that

[88  −11; −11  72] = [78  −12; −12  48] + [10  1; 1  24]

Using (6-38), we get

Λ* = |W|/|B + W| = [10(24) − 1(1)] / [88(72) − (−11)(−11)] = 239/6215 = .0385

Since p = 2 and g = 3, Table 6.3 indicates that an exact test (assuming normality and equal group covariance matrices) of H₀: τ₁ = τ₂ = τ₃ = 0 (no treatment effects) versus H₁: at least one τ_ℓ ≠ 0 is available. To carry out the test, we compare the test statistic

[(1 − √Λ*)/√Λ*] [(Σn_ℓ − g − 1)/(g − 1)] = [(1 − √.0385)/√.0385] [(8 − 3 − 1)/(3 − 1)] = 8.19

with a percentage point of an F-distribution having ν₁ = 2(g − 1) = 4 and ν₂ = 2(Σn_ℓ − g − 1) = 8 d.f. Since 8.19 > F₄,₈(.01) = 7.01, we reject H₀ at the α = .01 level and conclude that treatment differences exist. •

When the number of variables, p, is large, the MANOVA table is usually not constructed. Still, it is good practice to have the computer print the matrices B and W so that especially large entries can be located. Also, the residual vectors

ê_ℓj = x_ℓj − x̄_ℓ

should be examined for normality and the presence of outliers using the techniques discussed in Sections 4.6 and 4.7 of Chapter 4.
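The full MANOVA calculation of Example 6.8 can be sketched from the raw bivariate observations. This is an illustrative reconstruction; the variable names are not from the text.

```python
import numpy as np

# Bivariate samples of Example 6.8 (first component from Example 6.6)
groups = [np.array([[9, 3], [6, 2], [9, 7]]),
          np.array([[0, 4], [2, 0]]),
          np.array([[3, 8], [1, 9], [2, 7]])]
X = np.vstack(groups)
g = len(groups)
grand = X.mean(axis=0)

# Between (B) and within (W) sum of squares and cross products, as in (6-36)
B = sum(len(G) * np.outer(G.mean(0) - grand, G.mean(0) - grand) for G in groups)
W = sum(sum(np.outer(x - G.mean(0), x - G.mean(0)) for x in G) for G in groups)

lam = np.linalg.det(W) / np.linalg.det(B + W)            # Wilks' lambda (6-38)
F_exact = ((X.shape[0] - g - 1) / (g - 1)) * (1 - np.sqrt(lam)) / np.sqrt(lam)
print(round(lam, 4))    # 0.0385
# F_exact is about 8.2 (the text reports 8.19 using the rounded lambda)
```

The computed B, W, and Λ* match the MANOVA table and 239/6215 = .0385 above.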
Example 6.9 (A multivariate analysis of Wisconsin nursing home data)

The Wisconsin Department of Health and Social Services reimburses nursing homes in the state for the services provided. The department develops a set of formulas for rates for each facility, based on factors such as level of care, mean wage rate, and average wage rate in the state.

Nursing homes can be classified on the basis of ownership (private party, nonprofit organization, and government) and certification (skilled nursing facility, intermediate care facility, or a combination of the two).

One purpose of a recent study was to investigate the effects of ownership or certification (or both) on costs. Four costs, computed on a per-patient-day basis and measured in hours per patient day, were selected for analysis: X₁ = cost of nursing labor, X₂ = cost of dietary labor, X₃ = cost of plant operation and maintenance labor, and X₄ = cost of housekeeping and laundry labor. A total of n = 516 observations on each of the p = 4 cost variables were initially separated according to ownership. Summary statistics for each of the g = 3 groups are given in the following table.

Group              Number of observations    Sample mean vector
ℓ = 1 (private)    n₁ = 271                  x̄₁ = [2.066, .480, .082, .360]'
ℓ = 2 (nonprofit)  n₂ = 138                  x̄₂ = [2.167, .596, .124, .418]'
ℓ = 3 (government)                           x̄₃ = [2.273, .521, .125, .383]'
                   Σ_{ℓ=1}^3 n_ℓ = 516

Sample covariance matrices (lower triangles of the symmetric matrices):

S₁ = [ .291
      −.001  .011
       .002  .000  .001
       .010  .003  .000  .010]

S₂ = [ .561
       .011  .025
       .001  .004  .005
       .037  .007  .002  .019]

S₃ = [ .261
       .030  .017
       .003 −.000  .004
       .018  .006  .001  .013]

Source: Data courtesy of State of Wisconsin Department of Health and Social Services.
304  Chapter 6  Comparisons of Several Multivariate Means
Since the Sℓ's seem to be reasonably compatible,³ they were pooled [see (6-37)] to obtain

W = (n1 - 1)S1 + (n2 - 1)S2 + (n3 - 1)S3

  = [ 182.962
        4.408    8.200
        1.695     .633    1.484
        9.581    2.428     .394    6.538 ]

Also,

    x̄ = (n1 x̄1 + n2 x̄2 + n3 x̄3)/(n1 + n2 + n3) = [2.136, .519, .102, .380]'

and

B = Σℓ nℓ (x̄ℓ - x̄)(x̄ℓ - x̄)'

  = [ 3.475
      1.111    1.225
       .821     .453    .235
       .584     .610    .230    .304 ]

To test H0: τ1 = τ2 = τ3 = 0 (no ownership effects or, equivalently, no difference in average costs among the three types of ownership: private, nonprofit, and government), we can use the results in Table 6.3 for g = 3.

Computer-based calculations give

    Λ* = |W| / |B + W| = .7714

and

    ((Σnℓ - p - 2)/p) ((1 - √Λ*)/√Λ*) = ((516 - 4 - 2)/4) ((1 - √.7714)/√.7714) = 17.67

Let α = .01, so that F2p,2(Σnℓ-p-2)(.01) = F8,1020(.01) ≅ χ²8(.01)/8 = 2.51. Since 17.67 > F8,1020(.01) ≅ 2.51, we reject H0 at the 1% level and conclude that average costs differ, depending on type of ownership.

It is informative to compare the results based on this "exact" test with those obtained using the large-sample procedure summarized in (6-39) and (6-40). For the present example, Σnℓ = n = 516 is large, and H0 can be tested at the α = .01 level by comparing

    -(n - 1 - (p + g)/2) ln Λ* = -511.5 ln(.7714) = 132.76

with χ²p(g-1)(.01) = χ²8(.01) = 20.09. Since 132.76 > χ²8(.01) = 20.09, we reject H0 at the 1% level. This result is consistent with the result based on the foregoing F-statistic. ∎

³ However, a normal-theory test of H0: Σ1 = Σ2 = Σ3 would reject H0 at any reasonable significance level because of the large sample sizes.
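The Wilks' lambda calculations in Example 6.9 can be reproduced from the printed W and B matrices alone. A minimal numpy sketch (because the matrix entries are rounded as printed, the statistics agree with the text only to about two decimal places):

```python
import numpy as np

# Summary matrices as printed in Example 6.9 (Wisconsin nursing home data).
W = np.array([[182.962, 4.408, 1.695, 9.581],
              [  4.408, 8.200,  .633, 2.428],
              [  1.695,  .633, 1.484,  .394],
              [  9.581, 2.428,  .394, 6.538]])
B = np.array([[3.475, 1.111, .821, .584],
              [1.111, 1.225, .453, .610],
              [ .821,  .453, .235, .230],
              [ .584,  .610, .230, .304]])
n, p, g = 516, 4, 3

# Wilks' lambda: ratio of generalized variances.
lam = np.linalg.det(W) / np.linalg.det(B + W)

# Exact F-type statistic for g = 3 (Table 6.3), 2p and 2(n - p - 2) d.f.
F_stat = ((n - p - 2) / p) * (1 - np.sqrt(lam)) / np.sqrt(lam)

# Large-sample Bartlett chi-square statistic from (6-39)-(6-40), p(g-1) d.f.
chi2_stat = -(n - 1 - (p + g) / 2) * np.log(lam)
```

Comparing `F_stat` with 2.51 and `chi2_stat` with 20.09 reproduces both rejections of H0 reported above.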
6.5 SIMULTANEOUS CONFIDENCE INTERVALS FOR TREATMENT EFFECTS
When the hypothesis of equal treatment effects is rejected, those effects that led to the rejection of the hypothesis are of interest. For pairwise comparisons, the Bonferroni approach (see Section 5.4) can be used to construct simultaneous confidence intervals for the components of the differences τk - τℓ (or μk - μℓ). These intervals are shorter than those obtained for all contrasts, and they require critical values only for the univariate t-statistic.

Let τki be the ith component of τk. Since τk is estimated by

    τ̂k = x̄k - x̄    (6-41)

τ̂ki - τ̂ℓi = x̄ki - x̄ℓi is the difference between two independent sample means. The two-sample t-based confidence interval is valid with an appropriately modified α. Notice that

    Var(τ̂ki - τ̂ℓi) = Var(X̄ki - X̄ℓi) = (1/nk + 1/nℓ) σii

where σii is the ith diagonal element of Σ. As suggested by (6-37), Var(X̄ki - X̄ℓi) is estimated by dividing the corresponding element of W by its degrees of freedom. That is,

    V̂ar(X̄ki - X̄ℓi) = (1/nk + 1/nℓ) (wii/(n - g))

where wii is the ith diagonal element of W and n = n1 + ··· + ng.

It remains to apportion the error rate over the numerous confidence statements. Relation (5-28) still applies. There are p variables and g(g - 1)/2 pairwise differences, so each two-sample t-interval will employ the critical value t_{n-g}(α/2m), where

    m = pg(g - 1)/2    (6-42)

is the number of simultaneous confidence statements.

Result 6.5. Let n = Σ_{k=1}^{g} nk. For the model in (6-34), with confidence at least (1 - α), τki - τℓi belongs to

    x̄ki - x̄ℓi ± t_{n-g}(α/(pg(g - 1))) √((wii/(n - g))(1/nk + 1/nℓ))

for all components i = 1, ..., p and all differences ℓ < k = 1, ..., g. Here wii is the ith diagonal element of W.

We shall illustrate the construction of simultaneous interval estimates for the pairwise differences in treatment means using the nursing-home data introduced in Example 6.9.

Example 6.10 (Simultaneous intervals for treatment differences: nursing homes)

We saw in Example 6.9 that average costs for nursing homes differ, depending on the type of ownership. We can use Result 6.5 to estimate the magnitudes of the differences. A comparison of the variable X3, costs of plant operation and maintenance labor, between privately owned nursing homes and government-owned nursing homes can be made by estimating τ13 - τ33. Using (6-35) and the information in Example 6.9, we have

    τ̂1 = x̄1 - x̄ = [-.070, -.039, -.020, -.020]'
    τ̂3 = x̄3 - x̄ = [.137, .002, .023, .003]'

and

W = [ 182.962
        4.408    8.200
        1.695     .633    1.484
        9.581    2.428     .394    6.538 ]

Consequently,

    τ̂13 - τ̂33 = -.020 - .023 = -.043

and n = 271 + 138 + 107 = 516, so that

    √((1/n1 + 1/n3)(w33/(n - g))) = √((1/271 + 1/107)(1.484/513)) = .00614

Since p = 4 and g = 3, for 95% simultaneous confidence statements we require that t513(.05/4(3)2) ≅ 2.87. (See Appendix, Table 1.) The 95% simultaneous confidence statement is

    τ13 - τ33 belongs to τ̂13 - τ̂33 ± t513(.00208) √((1/n1 + 1/n3)(w33/(n - g)))
        = -.043 ± 2.87(.00614) = -.043 ± .018, or (-.061, -.025)

We conclude that the average maintenance and labor cost for government-owned nursing homes is higher by .025 to .061 hour per patient day than for privately owned nursing homes. With the same 95% confidence, we can say that

    τ13 - τ23 belongs to the interval (-.058, -.026)

and

    τ23 - τ33 belongs to the interval (-.021, .019)

Thus, a difference in this cost exists between private and nonprofit nursing homes, but no difference is observed between nonprofit and government nursing homes. ∎
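The interval for τ13 - τ33 in Example 6.10 can be checked numerically. In the sketch below, the Bonferroni t critical value is taken from the text (2.87) rather than computed, so only the standard error and interval endpoints are calculated:

```python
import numpy as np

# Bonferroni simultaneous interval (Result 6.5) for tau_13 - tau_33,
# using the summary numbers from Examples 6.9 and 6.10.
n1, n2, n3 = 271, 138, 107
n, p, g = n1 + n2 + n3, 4, 3
w33 = 1.484                 # 3rd diagonal element of W
diff = -.020 - .023         # tauhat_13 - tauhat_33 = -.043
t_crit = 2.87               # t_513(.05 / (p*g*(g-1))), from tables

se = np.sqrt((1 / n1 + 1 / n3) * w33 / (n - g))
lower, upper = diff - t_crit * se, diff + t_crit * se
```

The endpoints agree with the printed interval (-.061, -.025); the other pairwise intervals follow by substituting the appropriate sample sizes and mean differences.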
6.6 TWO-WAY MULTIVARIATE ANALYSIS OF VARIANCE
Following our approach to the one-way MANOVA, we shall briefly review the analysis for the univariate two-way fixed-effects model and then simply generalize to the multivariate case by analogy.

Univariate Two-Way Fixed-Effects Model with Interaction

We assume that measurements are recorded at various levels of two factors. In some cases, these experimental conditions represent levels of a single treatment arranged within several blocks. The particular experimental design employed will not concern us in this book. (See [9] and [10] for discussions of experimental design.) We shall, however, assume that observations at different combinations of experimental conditions are independent of one another.

Let the two sets of experimental conditions be the levels of factor 1 and factor 2, respectively.⁴ Suppose there are g levels of factor 1 and b levels of factor 2, and that n independent observations can be observed at each of the gb combinations of levels. Denoting the rth observation at level ℓ of factor 1 and level k of factor 2 by Xℓkr, we specify the univariate two-way model as

    Xℓkr = μ + τℓ + βk + γℓk + eℓkr
        ℓ = 1, 2, ..., g;  k = 1, 2, ..., b;  r = 1, 2, ..., n    (6-43)

where Σ_{ℓ=1}^{g} τℓ = Σ_{k=1}^{b} βk = Σ_{ℓ=1}^{g} γℓk = Σ_{k=1}^{b} γℓk = 0 and the eℓkr are independent N(0, σ²) random variables. Here μ represents an overall level, τℓ represents the fixed effect of factor 1, βk represents the fixed effect of factor 2, and γℓk is the interaction between factor 1 and factor 2. The expected response at the ℓth level of factor 1 and the kth level of factor 2 is thus

    E(Xℓkr) = μ + τℓ + βk + γℓk
    (mean response) = (overall level) + (effect of factor 1) + (effect of factor 2) + (interaction between factor 1 and factor 2)
        ℓ = 1, 2, ..., g;  k = 1, 2, ..., b    (6-44)

⁴ The use of the term "factor" to indicate an experimental condition is convenient. The factors discussed here should not be confused with the unobservable factors considered in Chapter 9 in the context of factor analysis.
[Figure 6.3 Curves for expected responses (a) with interaction and (b) without interaction. Each panel plots the expected response against the level of factor 2, with one curve for each of the three levels of factor 1.]
The presence of interaction, γℓk, implies that the factor effects are not additive and complicates the interpretation of the results. Figures 6.3(a) and (b) show expected responses as a function of the factor levels with and without interaction, respectively. The absence of interaction means γℓk = 0 for all ℓ and k.

In a manner analogous to (6-44), each observation can be decomposed as

    xℓkr = x̄ + (x̄ℓ· - x̄) + (x̄·k - x̄) + (x̄ℓk - x̄ℓ· - x̄·k + x̄) + (xℓkr - x̄ℓk)    (6-45)

where x̄ is the overall average, x̄ℓ· is the average for the ℓth level of factor 1, x̄·k is the average for the kth level of factor 2, and x̄ℓk is the average for the ℓth level of factor 1 and the kth level of factor 2. Squaring and summing the deviations (xℓkr - x̄) gives

    Σ_{ℓ=1}^{g} Σ_{k=1}^{b} Σ_{r=1}^{n} (xℓkr - x̄)² = Σ_{ℓ=1}^{g} bn(x̄ℓ· - x̄)² + Σ_{k=1}^{b} gn(x̄·k - x̄)²
        + Σ_{ℓ=1}^{g} Σ_{k=1}^{b} n(x̄ℓk - x̄ℓ· - x̄·k + x̄)² + Σ_{ℓ=1}^{g} Σ_{k=1}^{b} Σ_{r=1}^{n} (xℓkr - x̄ℓk)²    (6-46)

or

    SScor = SSfac1 + SSfac2 + SSint + SSres
The corresponding degrees of freedom associated with the sums of squares in the breakup (6-46) are

    gbn - 1 = (g - 1) + (b - 1) + (g - 1)(b - 1) + gb(n - 1)    (6-47)

The ANOVA table takes the following form:

ANOVA TABLE FOR COMPARING EFFECTS OF TWO FACTORS AND THEIR INTERACTION

Source of variation   Sum of squares (SS)                                   Degrees of freedom (d.f.)
Factor 1              SSfac1 = Σ_{ℓ} bn(x̄ℓ· - x̄)²                           g - 1
Factor 2              SSfac2 = Σ_{k} gn(x̄·k - x̄)²                           b - 1
Interaction           SSint = Σ_{ℓ} Σ_{k} n(x̄ℓk - x̄ℓ· - x̄·k + x̄)²          (g - 1)(b - 1)
Residual (Error)      SSres = Σ_{ℓ} Σ_{k} Σ_{r} (xℓkr - x̄ℓk)²               gb(n - 1)
Total (corrected)     SScor = Σ_{ℓ} Σ_{k} Σ_{r} (xℓkr - x̄)²                 gbn - 1

The F-ratios of the mean squares SSfac1/(g - 1), SSfac2/(b - 1), and SSint/((g - 1)(b - 1)) to the mean square SSres/(gb(n - 1)) can be used to test for the effects of factor 1, factor 2, and the factor 1 and factor 2 interaction, respectively. (See [7] for a discussion of univariate two-way analysis of variance.)
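The decomposition (6-46) can be verified directly on the tear-resistance variable (X1) of the plastic film data in Table 6.4, which has g = b = 2 factor levels and n = 5 replicates per cell. A numpy sketch (the array layout `x[l, k, r]` is simply a convenient indexing convention, not from the text):

```python
import numpy as np

# X1 (tear resistance) from Table 6.4, indexed x[l, k, r]:
# l = level of factor 1 (rate), k = level of factor 2 (additive), r = replicate.
x = np.array([[[6.5, 6.2, 5.8, 6.5, 6.5],    # low rate,  low additive
               [6.9, 7.2, 6.9, 6.1, 6.3]],   # low rate,  high additive
              [[6.7, 6.6, 7.2, 7.1, 6.8],    # high rate, low additive
               [7.1, 7.0, 7.2, 7.5, 7.6]]])  # high rate, high additive
g, b, n = x.shape

grand = x.mean()                 # overall average
row = x.mean(axis=(1, 2))        # factor 1 level means
col = x.mean(axis=(0, 2))        # factor 2 level means
cell = x.mean(axis=2)            # cell means

ss_fac1 = b * n * ((row - grand) ** 2).sum()
ss_fac2 = g * n * ((col - grand) ** 2).sum()
ss_int = n * ((cell - row[:, None] - col[None, :] + grand) ** 2).sum()
ss_res = ((x - cell[:, :, None]) ** 2).sum()
ss_cor = ((x - grand) ** 2).sum()
```

The four pieces reproduce the X1 sums of squares shown later in the SAS output of Panel 6.1 (1.7405, .7605, .0005, 1.764), and they add up to the corrected total, as (6-46) requires.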
Multivariate Two-Way Fixed-Effects Model with Interaction
Proceeding by analogy, we specify the two-way fixed-effects model for a vector response consisting of p components [see (6-43)]:

    Xℓkr = μ + τℓ + βk + γℓk + eℓkr
        ℓ = 1, 2, ..., g;  k = 1, 2, ..., b;  r = 1, 2, ..., n    (6-48)

where Σ_{ℓ=1}^{g} τℓ = Σ_{k=1}^{b} βk = Σ_{ℓ=1}^{g} γℓk = Σ_{k=1}^{b} γℓk = 0. The vectors are all of order p × 1, and the eℓkr are independent Np(0, Σ) random vectors. Thus, the responses consist of p measurements replicated n times at each of the possible combinations of levels of factors 1 and 2.

Following (6-45), we can decompose the observation vectors as

    xℓkr = x̄ + (x̄ℓ· - x̄) + (x̄·k - x̄) + (x̄ℓk - x̄ℓ· - x̄·k + x̄) + (xℓkr - x̄ℓk)    (6-49)

where x̄ is the overall average of the observation vectors, x̄ℓ· is the average of the observation vectors at the ℓth level of factor 1, x̄·k is the average of the observation vectors at the kth level of factor 2, and x̄ℓk is the average of the observation vectors at the ℓth level of factor 1 and the kth level of factor 2.

Straightforward generalizations of (6-46) and (6-47) give the breakups of the sum of squares and cross products and degrees of freedom:

    Σ_{ℓ=1}^{g} Σ_{k=1}^{b} Σ_{r=1}^{n} (xℓkr - x̄)(xℓkr - x̄)' = Σ_{ℓ=1}^{g} bn(x̄ℓ· - x̄)(x̄ℓ· - x̄)' + Σ_{k=1}^{b} gn(x̄·k - x̄)(x̄·k - x̄)'
        + Σ_{ℓ=1}^{g} Σ_{k=1}^{b} n(x̄ℓk - x̄ℓ· - x̄·k + x̄)(x̄ℓk - x̄ℓ· - x̄·k + x̄)'
        + Σ_{ℓ=1}^{g} Σ_{k=1}^{b} Σ_{r=1}^{n} (xℓkr - x̄ℓk)(xℓkr - x̄ℓk)'    (6-50)

    gbn - 1 = (g - 1) + (b - 1) + (g - 1)(b - 1) + gb(n - 1)    (6-51)

Again, the generalization from the univariate to the multivariate analysis consists simply of replacing a scalar such as (x̄ℓ· - x̄)² with the corresponding matrix (x̄ℓ· - x̄)(x̄ℓ· - x̄)'.

The MANOVA table is the following:

MANOVA TABLE FOR COMPARING FACTORS AND THEIR INTERACTION

Source of variation   Matrix of sum of squares and cross products (SSP)                              Degrees of freedom (d.f.)
Factor 1              SSPfac1 = Σ_{ℓ} bn(x̄ℓ· - x̄)(x̄ℓ· - x̄)'                                        g - 1
Factor 2              SSPfac2 = Σ_{k} gn(x̄·k - x̄)(x̄·k - x̄)'                                        b - 1
Interaction           SSPint = Σ_{ℓ} Σ_{k} n(x̄ℓk - x̄ℓ· - x̄·k + x̄)(x̄ℓk - x̄ℓ· - x̄·k + x̄)'          (g - 1)(b - 1)
Residual (Error)      SSPres = Σ_{ℓ} Σ_{k} Σ_{r} (xℓkr - x̄ℓk)(xℓkr - x̄ℓk)'                          gb(n - 1)
Total (corrected)     SSPcor = Σ_{ℓ} Σ_{k} Σ_{r} (xℓkr - x̄)(xℓkr - x̄)'                              gbn - 1
A test (the likelihood ratio test)⁵ of

    H0: γ11 = γ12 = ··· = γgb = 0 (no interaction effects)    (6-52)

versus H1: at least one γℓk ≠ 0 is conducted by rejecting H0 for small values of the ratio

    Λ* = |SSPres| / |SSPint + SSPres|    (6-53)

For large samples, Wilks' lambda, Λ*, can be referred to a chi-square percentile. Using Bartlett's multiplier (see [6]) to improve the chi-square approximation, we reject H0: γ11 = γ12 = ··· = γgb = 0 at the α level if

    -[gb(n - 1) - (p + 1 - (g - 1)(b - 1))/2] ln Λ* > χ²_{(g-1)(b-1)p}(α)    (6-54)

where Λ* is given by (6-53) and χ²_{(g-1)(b-1)p}(α) is the upper (100α)th percentile of a chi-square distribution with (g - 1)(b - 1)p d.f.

Ordinarily, the test for interaction is carried out before the tests for main factor effects. If interaction effects exist, the factor effects do not have a clear interpretation. From a practical standpoint, it is not advisable to proceed with the additional multivariate tests. Instead, p univariate two-way analyses of variance (one for each variable) are often conducted to see whether the interaction appears in some responses but not others. Those responses without interaction may be interpreted in terms of additive factor 1 and 2 effects, provided that the latter effects exist. In any event, interaction plots similar to Figure 6.3, but with treatment sample means replacing expected values, best clarify the relative magnitudes of the main and interaction effects.

In the multivariate model, we test for factor 1 and factor 2 main effects as follows. First, consider the hypotheses H0: τ1 = τ2 = ··· = τg = 0 and H1: at least one τℓ ≠ 0. These hypotheses specify no factor 1 effects and some factor 1 effects, respectively. Let

    Λ* = |SSPres| / |SSPfac1 + SSPres|    (6-55)

so that small values of Λ* are consistent with H1. Using Bartlett's correction, the likelihood ratio test is as follows:

Reject H0: τ1 = τ2 = ··· = τg = 0 (no factor 1 effects) at level α if

    -[gb(n - 1) - (p + 1 - (g - 1))/2] ln Λ* > χ²_{(g-1)p}(α)    (6-56)

where Λ* is given by (6-55) and χ²_{(g-1)p}(α) is the upper (100α)th percentile of a chi-square distribution with (g - 1)p d.f.

⁵ The likelihood test procedures require that p ≤ gb(n - 1), so that SSPres will be positive definite (with probability 1).
In a similar manner, factor 2 effects are tested by considering H0: β1 = β2 = ··· = βb = 0 and H1: at least one βk ≠ 0. Small values of

    Λ* = |SSPres| / |SSPfac2 + SSPres|    (6-57)

are consistent with H1. Once again, for large samples and using Bartlett's correction: Reject H0: β1 = β2 = ··· = βb = 0 (no factor 2 effects) at level α if

    -[gb(n - 1) - (p + 1 - (b - 1))/2] ln Λ* > χ²_{(b-1)p}(α)    (6-58)

where Λ* is given by (6-57) and χ²_{(b-1)p}(α) is the upper (100α)th percentile of a chi-square distribution with (b - 1)p degrees of freedom.

Simultaneous confidence intervals for contrasts in the model parameters can provide insights into the nature of the factor effects. Results comparable to Result 6.5 are available for the two-way model. When interaction effects are negligible, we may concentrate on contrasts in the factor 1 and factor 2 main effects. The Bonferroni approach applies to the components of the differences τℓ - τm of the factor 1 effects and the components of βk - βq of the factor 2 effects, respectively.

The 100(1 - α)% simultaneous confidence intervals for τℓi - τmi are

    τℓi - τmi belongs to (x̄ℓ·i - x̄m·i) ± tν(α/(pg(g - 1))) √((Eii/ν)(2/(bn)))    (6-59)

where ν = gb(n - 1), Eii is the ith diagonal element of E = SSPres, and x̄ℓ·i - x̄m·i is the ith component of x̄ℓ· - x̄m·.

Similarly, the 100(1 - α)% simultaneous confidence intervals for βki - βqi are

    βki - βqi belongs to (x̄·ki - x̄·qi) ± tν(α/(pb(b - 1))) √((Eii/ν)(2/(gn)))    (6-60)

where ν and Eii are as just defined and x̄·ki - x̄·qi is the ith component of x̄·k - x̄·q.

Comment. We have considered the multivariate two-way model with replications. That is, the model allows for n replications of the responses at each combination of factor levels. This enables us to examine the "interaction" of the factors. If only one observation vector is available at each combination of factor levels, the two-way model does not allow for the possibility of a general interaction term γℓk. The corresponding MANOVA table includes only factor 1, factor 2, and residual sources of variation as components of the total variation. (See Exercise 6.13.)

Example 6.11 (A two-way multivariate analysis of variance of plastic film data)

The optimum conditions for extruding plastic film have been examined using a technique called Evolutionary Operation. (See [8].) In the course of the study that was done, three responses, X1 = tear resistance, X2 = gloss, and X3 = opacity, were measured at two levels of the factors, rate of extrusion and amount of an additive.
TABLE 6.4 PLASTIC FILM DATA
(x1 = tear resistance, x2 = gloss, x3 = opacity)

                                  Factor 2: Amount of additive
                         Low (1.0%)                  High (1.5%)
Factor 1:                [x1,  x2,  x3]              [x1,  x2,  x3]
Change in rate
of extrusion

Low (-10%)               [6.5, 9.5, 4.4]             [6.9,  9.1, 5.7]
                         [6.2, 9.9, 6.4]             [7.2, 10.0, 2.0]
                         [5.8, 9.6, 3.0]             [6.9,  9.9, 3.9]
                         [6.5, 9.6, 4.1]             [6.1,  9.5, 1.9]
                         [6.5, 9.2, 0.8]             [6.3,  9.4, 5.7]

High (10%)               [6.7, 9.1, 2.8]             [7.1,  9.2, 8.4]
                         [6.6, 9.3, 4.1]             [7.0,  8.8, 5.2]
                         [7.2, 8.3, 3.8]             [7.2,  9.7, 6.9]
                         [7.1, 8.4, 1.6]             [7.5, 10.1, 2.7]
                         [6.8, 8.5, 3.4]             [7.6,  9.2, 1.9]

The measurements were repeated n = 5 times at each combination of the factor levels. The data are displayed in Table 6.4. The matrices of the appropriate sums of squares and cross products were calculated (see the SAS statistical software output in Panel 6.1), leading to the following MANOVA table:

Source of variation                      SSP (lower triangle shown)          d.f.

Factor 1:                                [ 1.7405
change in rate of extrusion               -1.5045   1.3005
                                            .8555   -.7395    .4205 ]        1

Factor 2:                                [  .7605
amount of additive                          .6825    .6125
                                           1.9305   1.7325   4.9005 ]        1

Interaction                              [  .0005
                                            .0165    .5445
                                            .0445   1.4685   3.9605 ]        1

Residual                                 [ 1.7640
                                            .0200   2.6280
                                          -3.0700   -.5520  64.9240 ]        16

Total (corrected)                        [ 4.2655
                                           -.7855   5.0855
                                           -.2395   1.9095  74.2055 ]        19
PANEL 6.1 SAS ANALYSIS FOR EXAMPLE 6.11 USING PROC GLM

PROGRAM COMMANDS

title 'MANOVA';
data film;
  infile 'T6-4.dat';
  input x1 x2 x3 factor1 factor2;
proc glm data = film;
  class factor1 factor2;
  model x1 x2 x3 = factor1 factor2 factor1*factor2 /ss3;
  manova h = factor1 factor2 factor1*factor2 /printe;
  means factor1 factor2;

OUTPUT

General Linear Models Procedure
Class Level Information
  Class      Levels   Values
  FACTOR1       2     0 1
  FACTOR2       2     0 1
Number of observations in data set = 20

Dependent Variable: X1
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3      2.50150000    0.83383333      7.56    0.0023
Error             16      1.76400000    0.11025000
Corrected Total   19      4.26550000

R-Square 0.586449   C.V. 4.893724   Root MSE 0.332039   X1 Mean 6.78500000

Source            DF   Type III SS    Mean Square   F Value   Pr > F
FACTOR1            1    1.74050000    1.74050000     15.79    0.0011
FACTOR2            1    0.76050000    0.76050000      6.90    0.0183
FACTOR1*FACTOR2    1    0.00050000    0.00050000      0.00    0.9471

Dependent Variable: X2
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3      2.45750000    0.81916667      4.99    0.0125
Error             16      2.62800000    0.16425000
Corrected Total   19      5.08550000

R-Square 0.483237   C.V. 4.350807   Root MSE 0.405278   X2 Mean 9.31500000

Source            DF   Type III SS    Mean Square   F Value   Pr > F
FACTOR1            1    1.30050000    1.30050000      7.92    0.0125
FACTOR2            1    0.61250000    0.61250000      3.73    0.0714
FACTOR1*FACTOR2    1    0.54450000    0.54450000      3.32    0.0874

Dependent Variable: X3
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3      9.28150000    3.09383333      0.76    0.5315
Error             16     64.92400000    4.05775000
Corrected Total   19     74.20550000

R-Square 0.125078   C.V. 51.19151   Root MSE 2.014386   X3 Mean 3.93500000

Source            DF   Type III SS    Mean Square   F Value   Pr > F
FACTOR1            1    0.42050000    0.42050000      0.10    0.7517
FACTOR2            1    4.90050000    4.90050000      1.21    0.2881
FACTOR1*FACTOR2    1    3.96050000    3.96050000      0.98    0.3379

E = Error SS&CP Matrix
        X1       X2       X3
X1    1.764     0.02    -3.07
X2     0.02    2.628   -0.552
X3    -3.07   -0.552   64.924

Manova Test Criteria and Exact F Statistics for the H = Type III SS&CP Matrix for FACTOR1
E = Error SS&CP Matrix    S = 1   M = 0.5   N = 6
Statistic                  Value         F    Num DF   Den DF   Pr > F
Wilks' Lambda           0.38185838    7.5543     3       14     0.0030
Pillai's Trace          0.61814162    7.5543     3       14     0.0030
Hotelling-Lawley Trace  1.61877188    7.5543     3       14     0.0030
Roy's Greatest Root     1.61877188    7.5543     3       14     0.0030

Manova Test Criteria and Exact F Statistics for the H = Type III SS&CP Matrix for FACTOR2
E = Error SS&CP Matrix    S = 1   M = 0.5   N = 6
Statistic                  Value         F    Num DF   Den DF   Pr > F
Wilks' Lambda           0.52303490    4.2556     3       14     0.0247
Pillai's Trace          0.47696510    4.2556     3       14     0.0247
Hotelling-Lawley Trace  0.91191832    4.2556     3       14     0.0247
Roy's Greatest Root     0.91191832    4.2556     3       14     0.0247

Manova Test Criteria and Exact F Statistics for the H = Type III SS&CP Matrix for FACTOR1*FACTOR2
E = Error SS&CP Matrix    S = 1   M = 0.5   N = 6
Statistic                  Value         F    Num DF   Den DF   Pr > F
Wilks' Lambda           0.77710576    1.3385     3       14     0.3018
Pillai's Trace          0.22289424    1.3385     3       14     0.3018
Hotelling-Lawley Trace  0.28682614    1.3385     3       14     0.3018
Roy's Greatest Root     0.28682614    1.3385     3       14     0.3018

Level of           -------- X1 --------    -------- X2 --------    -------- X3 --------
FACTOR1    N       Mean          SD        Mean          SD        Mean          SD
0         10    6.49000000  0.42018514   9.57000000  0.29832868   3.79000000  1.85379491
1         10    7.08000000  0.32249031   9.06000000  0.57580861   4.08000000  2.18214981

Level of           -------- X1 --------    -------- X2 --------    -------- X3 --------
FACTOR2    N       Mean          SD        Mean          SD        Mean          SD
0         10    6.59000000  0.40674863   9.14000000  0.56015871   3.44000000  1.55077042
1         10    6.98000000  0.47328638   9.49000000  0.42804465   4.43000000  2.30123155
To test for interaction, we compute

    Λ* = |SSPres| / |SSPint + SSPres| = 275.7098 / 354.7906 = .7771

For (g - 1)(b - 1) = 1,

    F = ((1 - Λ*)/Λ*) · ((gb(n - 1) - p + 1)/2) / ((|(g - 1)(b - 1) - p| + 1)/2)

has an exact F-distribution with ν1 = |(g - 1)(b - 1) - p| + 1 and ν2 = gb(n - 1) - p + 1 d.f. (See [1].) For our example,

    F = ((1 - .7771)/.7771) ((2(2)(4) - 3 + 1)/2) / ((|1(1) - 3| + 1)/2) = 1.34

    ν1 = |1(1) - 3| + 1 = 3,  ν2 = 2(2)(4) - 3 + 1 = 14

and F3,14(.05) = 3.34. Since F = 1.34 < F3,14(.05) = 3.34, we do not reject the hypothesis H0: γ11 = γ12 = γ21 = γ22 = 0 (no interaction effects).

Note that the approximate chi-square statistic for this test is -[2(2)(4) - (3 + 1 - 1(1))/2] ln(.7771) = 3.66, from (6-54). Since χ²3(.05) = 7.81, we would reach the same conclusion as provided by the exact F-test.

To test for factor 1 and factor 2 effects (see pages 311 and 312), we calculate

    Λ*1 = |SSPres| / |SSPfac1 + SSPres| = 275.7098 / 722.0212 = .3819

and

    Λ*2 = |SSPres| / |SSPfac2 + SSPres| = 275.7098 / 527.1347 = .5230

For both g - 1 = 1 and b - 1 = 1,

    F1 = ((1 - Λ*1)/Λ*1) · ((gb(n - 1) - p + 1)/2) / ((|(g - 1) - p| + 1)/2)

and

    F2 = ((1 - Λ*2)/Λ*2) · ((gb(n - 1) - p + 1)/2) / ((|(b - 1) - p| + 1)/2)

have F-distributions with degrees of freedom ν1 = |(g - 1) - p| + 1, ν2 = gb(n - 1) - p + 1 and ν1 = |(b - 1) - p| + 1, ν2 = gb(n - 1) - p + 1, respectively. (See [1].) In our case,

    F1 = ((1 - .3819)/.3819) ((16 - 3 + 1)/2) / ((|1 - 3| + 1)/2) = 7.55

    F2 = ((1 - .5230)/.5230) ((16 - 3 + 1)/2) / ((|1 - 3| + 1)/2) = 4.26

and

    ν1 = |1 - 3| + 1 = 3,  ν2 = 16 - 3 + 1 = 14

From before, F3,14(.05) = 3.34. Since F1 = 7.55 > F3,14(.05) = 3.34, we reject H0: τ1 = τ2 = 0 (no factor 1 effects) at the 5% level. Similarly, F2 = 4.26 > F3,14(.05) = 3.34, and we reject H0: β1 = β2 = 0 (no factor 2 effects) at the 5% level. We conclude that both the change in rate of extrusion and the amount of additive affect the responses, and they do so in an additive manner.

The nature of the effects of factors 1 and 2 on the responses is explored in Exercise 6.15. In that exercise, simultaneous confidence intervals for contrasts in the components of τℓ and βk are considered. ∎
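The three Wilks' lambda ratios and exact F statistics of Example 6.11 can be recomputed from the SSP matrices in the MANOVA table. A numpy sketch, using the single-degree-of-freedom exact F form displayed above (the helper name `wilks_F` is not from the text):

```python
import numpy as np

# SSP matrices from the MANOVA table for the plastic film data.
SSP_res = np.array([[ 1.7640,  0.0200, -3.0700],
                    [ 0.0200,  2.6280, -0.5520],
                    [-3.0700, -0.5520, 64.9240]])
SSP_int = np.array([[0.0005, 0.0165, 0.0445],
                    [0.0165, 0.5445, 1.4685],
                    [0.0445, 1.4685, 3.9605]])
SSP_fac1 = np.array([[ 1.7405, -1.5045,  0.8555],
                     [-1.5045,  1.3005, -0.7395],
                     [ 0.8555, -0.7395,  0.4205]])
SSP_fac2 = np.array([[0.7605, 0.6825, 1.9305],
                     [0.6825, 0.6125, 1.7325],
                     [1.9305, 1.7325, 4.9005]])
g, b, n, p = 2, 2, 5, 3

def wilks_F(SSP_h, df_h):
    """Wilks' lambda and its exact F when the hypothesis d.f. is 1."""
    lam = np.linalg.det(SSP_res) / np.linalg.det(SSP_h + SSP_res)
    nu1 = abs(df_h - p) + 1
    nu2 = g * b * (n - 1) - p + 1
    F = (1 - lam) / lam * (nu2 / 2) / (nu1 / 2)
    return lam, F, nu1, nu2

lam_int, F_int, _, _ = wilks_F(SSP_int, (g - 1) * (b - 1))
lam_1, F_1, _, _ = wilks_F(SSP_fac1, g - 1)
lam_2, F_2, _, _ = wilks_F(SSP_fac2, b - 1)
```

The computed lambdas (.7771, .3819, .5230) and F values (1.34, 7.55, 4.26) match both the text and the SAS output in Panel 6.1.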
6.7 PROFILE ANALYSIS
Profile analysis pertains to situations in which a battery of p treatments (tests, questions, and so forth) are administered to two or more groups of subjects. All responses must be expressed in similar units. Further, it is assumed that the responses for the different groups are independent of one another. Ordinarily, we might pose the question, are the population mean vectors the same? In profile analysis, the question of equality of mean vectors is divided into several specific possibilities.

Consider the population means μ1' = [μ11, μ12, μ13, μ14] representing the average responses to four treatments for the first group. A plot of these means, connected by straight lines, is shown in Figure 6.4. This broken-line graph is the profile for population 1.

[Figure 6.4 The population profile, p = 4. Mean response plotted against the variable index 1, 2, 3, 4.]

Profiles can be constructed for each population (group). We shall concentrate on two groups. Let μ1' = [μ11, μ12, ..., μ1p] and μ2' = [μ21, μ22, ..., μ2p] be the mean responses to p treatments for populations 1 and 2, respectively. The hypothesis H0: μ1 = μ2 implies that the treatments have the same (average) effect on the two populations. In terms of the population profiles, we can formulate the question of equality in a stepwise fashion:

1. Are the profiles parallel? Equivalently: Is H01: μ1i - μ1,i-1 = μ2i - μ2,i-1, i = 2, 3, ..., p, acceptable?
2. Assuming that the profiles are parallel, are the profiles coincident?⁶ Equivalently: Is H02: μ1i = μ2i, i = 1, 2, ..., p, acceptable?
3. Assuming that the profiles are coincident, are the profiles level? That is, are all the means equal to the same constant? Equivalently: Is H03: μ11 = μ12 = ··· = μ1p = μ21 = μ22 = ··· = μ2p acceptable?

The null hypothesis in stage 1 can be written

    H01: Cμ1 = Cμ2

where C is the (p - 1) × p contrast matrix

    C = [ -1   1   0   ···   0   0
           0  -1   1   ···   0   0
                     ···
           0   0   0   ···  -1   1 ]    (6-61)

For independent samples of sizes n1 and n2 from the two populations, the null hypothesis can be tested by constructing the transformed observations

    Cx1j,  j = 1, 2, ..., n1    and    Cx2j,  j = 1, 2, ..., n2

These have sample mean vectors Cx̄1 and Cx̄2, respectively, and pooled covariance matrix CSpooledC'.

Since the two sets of transformed observations have Np-1(Cμ1, CΣC') and Np-1(Cμ2, CΣC') distributions, respectively, an application of Result 6.2 provides a test for parallel profiles: reject H01: Cμ1 = Cμ2 (parallel profiles) at level α if

    T² = (x̄1 - x̄2)'C'[(1/n1 + 1/n2) CSpooledC']⁻¹ C(x̄1 - x̄2) > c²    (6-62)

where

    c² = ((n1 + n2 - 2)(p - 1)/(n1 + n2 - p)) F_{p-1, n1+n2-p}(α)

When the profiles are parallel, the first is either above the second (μ1i > μ2i, for all i), or vice versa. Under this condition, the profiles will be coincident only if the total heights μ11 + μ12 + ··· + μ1p = 1'μ1 and μ21 + μ22 + ··· + μ2p = 1'μ2 are equal. Therefore, the null hypothesis at stage 2 can be written in the equivalent form

    H02: 1'μ1 = 1'μ2

We can then test H02 with the usual two-sample t-statistic based on the univariate observations 1'x1j, j = 1, 2, ..., n1, and 1'x2j, j = 1, 2, ..., n2. For two normal populations, reject H02: 1'μ1 = 1'μ2 (profiles coincident) at level α if

    T² = (1'(x̄1 - x̄2) / √((1/n1 + 1/n2) 1'Spooled1))² > t²_{n1+n2-2}(α/2) = F_{1, n1+n2-2}(α)    (6-63)

⁶ The question, "Assuming that the profiles are parallel, are the profiles linear?" is considered in Exercise 6.12. The null hypothesis of parallel linear profiles can be written H0: (μ1i + μ2i) - (μ1,i-1 + μ2,i-1) = (μ1,i-1 + μ2,i-1) - (μ1,i-2 + μ2,i-2), i = 3, ..., p. Although this hypothesis may be of interest in a particular situation, in practice the question of whether two parallel profiles are the same (coincident), whatever their nature, is usually of greater interest.
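The contrast matrix in (6-61) is easy to build for any p. A small numpy sketch (the function name `contrast_matrix` and the illustrative mean vector are not from the text):

```python
import numpy as np

def contrast_matrix(p):
    """(p-1) x p matrix of successive differences, as in (6-61)."""
    C = np.zeros((p - 1, p))
    for i in range(p - 1):
        C[i, i] = -1.0
        C[i, i + 1] = 1.0
    return C

# Applied to a mean vector, C produces the profile's successive slopes;
# two profiles are parallel exactly when these slope vectors are equal.
mu = np.array([6.0, 7.0, 4.0, 5.0])
slopes = contrast_matrix(4) @ mu
```

Applying the same matrix to each observation vector yields the transformed observations Cx1j and Cx2j used in the parallel-profile test.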
For coincident profiles, x11, x12, ..., x1n1 and x21, x22, ..., x2n2 are all observations from the same normal population. The next step is to see whether all variables have the same mean, so that the common profile is level.

When H01 and H02 are tenable, the common mean vector μ is estimated, using all n1 + n2 observations, by

    x̄ = (1/(n1 + n2)) (Σ_{j=1}^{n1} x1j + Σ_{j=1}^{n2} x2j) = (n1/(n1 + n2)) x̄1 + (n2/(n1 + n2)) x̄2

If the common profile is level, then μ1 = μ2 = ··· = μp, and the null hypothesis at stage 3 can be written as

    H03: Cμ = 0

where C is given by (6-61). Consequently, we have the following test: for two normal populations, reject H03: Cμ = 0 (profiles level) at level α if

    (n1 + n2) x̄'C'[CSC']⁻¹ Cx̄ > c²    (6-64)

where S is the sample covariance matrix based on all n1 + n2 observations and

    c² = ((n1 + n2 - 1)(p - 1)/(n1 + n2 - p + 1)) F_{p-1, n1+n2-p+1}(α)

Example 6.12 (A profile analysis of love and marriage data)
As part of a larger study of love and marriage, E. Hatfield, a sociologist, surveyed adults with respect to their marriage "contributions" and "outcomes" and their levels of "passionate" and "companionate" love. Recently married males and females were asked to respond to the following questions, using the 8-point scale in the figure below.

1. All things considered, how would you describe your contributions to the marriage?
2. All things considered, how would you describe your outcomes from the marriage?

[8-point scale: responses coded 1 through 8]

Subjects were also asked to respond to the following questions, using the 5-point scale shown.

3. What is the level of passionate love that you feel for your partner?
4. What is the level of companionate love that you feel for your partner?

[5-point scale: None at all, Very little, Some, A great deal, Tremendous amount; coded 1, 2, 3, 4, 5]

Let

    x1 = an 8-point scale response to Question 1
    x2 = an 8-point scale response to Question 2
    x3 = a 5-point scale response to Question 3
    x4 = a 5-point scale response to Question 4

and the two populations be defined as

    Population 1 = married men
    Population 2 = married women

The population means are the average responses to the p = 4 questions for the populations of males and females. Assuming a common covariance matrix Σ, it is of interest to see whether the profiles of males and females are the same.

A sample of n1 = 30 males and n2 = 30 females gave the sample mean vectors

    x̄1 = [6.833, 7.033, 3.967, 4.700]'  (males)
    x̄2 = [6.633, 7.000, 4.000, 4.533]'  (females)

and pooled covariance matrix

Spooled = [ .606   .262   .066   .161
            .262   .637   .173   .143
            .066   .173   .810   .029
            .161   .143   .029   .306 ]

The sample mean vectors are plotted as sample profiles in Figure 6.5 on page 322.

Since the sample sizes are reasonably large, we shall use the normal theory methodology, even though the data, which are integers, are clearly nonnormal. To test for parallelism (H01: Cμ1 = Cμ2), we compute

CSpooledC' = [  .719  -.268  -.125
               -.268  1.101  -.751
               -.125  -.751  1.058 ]

and

    C(x̄1 - x̄2) = C[.200, .033, -.033, .167]' = [-.167, -.066, .200]'
[Figure 6.5 Sample profiles for marriage-love responses. Key: x--x males; o--o females. Sample mean response plotted against variable i = 1, 2, 3, 4.]
Thus,

T² = [-.167, -.066, .200] ((1/30) + (1/30))⁻¹ [  .719  -.268  -.125 ]⁻¹ [ -.167 ]
                                              [ -.268  1.101  -.751 ]   [ -.066 ]
                                              [ -.125  -.751  1.058 ]   [  .200 ]

   = 15(.067) = 1.005

Moreover, with α = .05,

c² = [(30 + 30 - 2)(4 - 1)/(30 + 30 - 4)] F3,56(.05) = 3.11(2.8) = 8.7

Since T² = 1.005 < 8.7, we conclude that the hypothesis of parallel profiles for men and women is tenable. Given the plot in Figure 6.5, this finding is not surprising.

Assuming that the profiles are parallel, we can test for coincident profiles. To test H02: 1'μ1 = 1'μ2 (profiles coincident), we need

Sum of elements in (x̄1 - x̄2) = 1'(x̄1 - x̄2) = .367
Sum of elements in S_pooled = 1' S_pooled 1 = 4.027

Using (6-63), we obtain

T² = ( .367 / sqrt[((1/30) + (1/30)) 4.027] )² = .501

With α = .05, F1,58(.05) = 4.0, and T² = .501 < F1,58(.05) = 4.0, we cannot reject the hypothesis that the profiles are coincident. That is, the responses of men and women to the four questions posed appear to be the same.

We could now test for level profiles; however, it does not make sense to carry out this test for our example, since Questions 1 and 2 were measured on a scale of 1-8, while Questions 3 and 4 were measured on a scale of 1-5. The incompatibility of these scales makes the test for level profiles meaningless and illustrates the need for similar measurements in order to carry out a complete profile analysis.
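Both test statistics in this example can be reproduced from the printed summary statistics alone. The following is a minimal Python/NumPy sketch of our own (the variable names are illustrative, not the book's notation):

```python
import numpy as np

# Summary statistics from the example (p = 4 responses, n1 = n2 = 30)
n1 = n2 = 30
xbar1 = np.array([6.833, 7.033, 3.967, 4.700])   # males
xbar2 = np.array([6.633, 7.000, 4.000, 4.533])   # females
S = np.array([[.606, .262, .066, .161],
              [.262, .637, .173, .143],
              [.066, .173, .810, .029],
              [.161, .143, .029, .306]])         # S_pooled

# Contrast matrix of successive differences, as in the parallelism test
C = np.array([[-1.,  1.,  0.,  0.],
              [ 0., -1.,  1.,  0.],
              [ 0.,  0., -1.,  1.]])

d = xbar1 - xbar2

# T^2 for H01: C mu1 = C mu2 (parallel profiles); about 15(.067) = 1.005
M = (1/n1 + 1/n2) * (C @ S @ C.T)
T2_parallel = (C @ d) @ np.linalg.solve(M, C @ d)

# T^2 for H02: 1'mu1 = 1'mu2 (coincident, given parallelism); about .501
one = np.ones(4)
T2_coincident = (one @ d) ** 2 / ((1/n1 + 1/n2) * (one @ S @ one))
```

Here np.linalg.solve is used rather than an explicit matrix inverse, which is the numerically preferable route.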
When the sample sizes are small, a profile analysis will depend on the normality assumption. This assumption can be checked, using methods discussed in Chapter 4, with the original observations xℓj or the contrast observations Cxℓj.

The analysis of profiles for several populations proceeds in much the same fashion as that for two populations. In fact, the general measures of comparison are analogous to those just discussed. (See [13].)

6.8 REPEATED MEASURES DESIGNS AND GROWTH CURVES

As we said earlier, the term "repeated measures" refers to situations where the same characteristic is observed, at different times or locations, on the same subject.

(a) The observations on a subject may correspond to different treatments as in Example 6.2 where the time between heartbeats was measured under the 2 × 2 treatment combinations applied to each dog. The treatments need to be compared when the responses on the same subject are correlated.

(b) A single treatment may be applied to each subject and a single characteristic observed over a period of time. For instance, we could measure the weight of a puppy at birth and then once a month. It is the curve traced by a typical dog that must be modeled. In this context, we refer to the curve as a growth curve.

When some subjects receive one treatment and others another treatment, the growth curves for the treatments need to be compared.

To illustrate the growth curve model introduced by Potthoff and Roy [15], we consider calcium measurements of the dominant ulna bone in older women. Besides an initial reading, Table 6.5 gives readings after one year, two years, and three years for the control group.

TABLE 6.5  CALCIUM MEASUREMENTS ON THE DOMINANT ULNA; CONTROL GROUP

Subject   Initial   1 year   2 year   3 year
   1        87.3     86.9     86.7     75.5
   2        59.0     60.5     60.0     53.5
   3        76.7     76.2     75.7     69.6
   4        70.6     76.1     72.2     65.3
   5        54.9     55.1     57.1     49.0
   6        78.7     75.8     71.8     74.6
   7        73.2     70.3     69.1     67.6
   8        61.3     68.7     68.2     57.0
   9        85.8     84.4     79.2     67.4
  10        82.6     86.4     79.4     77.4
  11        68.3     65.9     72.3     60.8
  12        67.8     69.2     66.3     56.2
  13        66.2     67.0     67.0     57.9
  14        81.3     82.6     86.8     73.1
  15        72.0     74.3     75.3     66.9
Mean        72.38    73.29    72.47    64.79

Source: Data courtesy of Everett Smith.
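As a quick numerical check of the Mean row in Table 6.5, the column averages can be recomputed. A short Python/NumPy sketch of ours (not part of the original text):

```python
import numpy as np

# Control-group calcium readings from Table 6.5
# (columns: initial, 1 year, 2 year, 3 year)
control = np.array([
    [87.3, 86.9, 86.7, 75.5], [59.0, 60.5, 60.0, 53.5],
    [76.7, 76.2, 75.7, 69.6], [70.6, 76.1, 72.2, 65.3],
    [54.9, 55.1, 57.1, 49.0], [78.7, 75.8, 71.8, 74.6],
    [73.2, 70.3, 69.1, 67.6], [61.3, 68.7, 68.2, 57.0],
    [85.8, 84.4, 79.2, 67.4], [82.6, 86.4, 79.4, 77.4],
    [68.3, 65.9, 72.3, 60.8], [67.8, 69.2, 66.3, 56.2],
    [66.2, 67.0, 67.0, 57.9], [81.3, 82.6, 86.8, 73.1],
    [72.0, 74.3, 75.3, 66.9]])

xbar = control.mean(axis=0)   # Mean row: 72.38  73.29  72.47  64.79
```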
Readings obtained by photon absorptiometry from the same subject are correlated but those from different subjects should be independent. The model assumes that the same covariance matrix Σ holds for each subject. Unlike univariate approaches, this model does not require the four measurements to have equal variances. A profile, constructed from the four sample means (x̄1, x̄2, x̄3, x̄4), summarizes the growth, which here is a loss of calcium over time. Can the growth pattern be adequately represented by a polynomial in time?

When the measurements on all subjects are taken at times t1, t2, ..., tp, the Potthoff-Roy model for quadratic growth becomes

E[X] = [ β0 + β1 t1 + β2 t1²,  β0 + β1 t2 + β2 t2²,  ...,  β0 + β1 tp + β2 tp² ]'

where the ith mean μi is the quadratic expression evaluated at ti.

Usually groups need to be compared. Table 6.6 gives the calcium measurements for a second set of women, the treatment group, that received special help with diet and a regular exercise program.

When a study involves several treatment groups, an extra subscript is needed as in the one-way MANOVA model. Let Xℓ1, Xℓ2, ..., Xℓnℓ be the nℓ vectors of measurements on the nℓ subjects in group ℓ, for ℓ = 1, ..., g.

TABLE 6.6  CALCIUM MEASUREMENTS ON THE DOMINANT ULNA; TREATMENT GROUP

Subject   Initial   1 year   2 year   3 year
   1        83.8     85.5     86.2     81.2
   2        65.2     66.5     67.0     60.6
   3        81.3     79.9     84.5     75.2
   4        75.4     76.3     74.3     66.7
   5        55.3     58.7     59.1     54.2
   6        70.5     72.3     70.4     68.6
   7        76.3     79.9     80.6     71.6
   8        66.0     70.0     70.3     64.1
   9        76.7     79.9     76.9     70.3
  10        77.2     74.0     77.9     67.9
  11        67.3     70.7     68.8     65.9
  12        50.7     51.4     53.5     48.0
  13        57.3     57.0     57.6     51.5
  14        74.0     77.7     74.5     68.7
  15        74.3     74.7     72.6     65.0
  16        57.3     56.0     64.7     53.0
Mean        69.29    70.66    71.18    64.53

Source: Data courtesy of Everett Smith.
Assumptions. All of the Xℓj are independent and have the same covariance matrix Σ. Under the quadratic growth model, the mean vectors are

E[Xℓj] = [ βℓ0 + βℓ1 t1 + βℓ2 t1²
           βℓ0 + βℓ1 t2 + βℓ2 t2²
                  ...
           βℓ0 + βℓ1 tp + βℓ2 tp² ] = B βℓ

where

B = [ 1   t1   t1²
      1   t2   t2²
      ...
      1   tp   tp² ]    and    βℓ = [ βℓ0, βℓ1, βℓ2 ]'        (6-65)

If a qth-order polynomial is fit to the growth data, then

B = [ 1   t1   ...   t1^q
      1   t2   ...   t2^q
      ...
      1   tp   ...   tp^q ]    and    βℓ = [ βℓ0, βℓ1, ..., βℓq ]'        (6-66)

Under the assumption of multivariate normality, the maximum likelihood estimators of the βℓ are

β̂ℓ = (B' S_pooled⁻¹ B)⁻¹ B' S_pooled⁻¹ x̄ℓ        for ℓ = 1, 2, ..., g        (6-67)

where

S_pooled = (1/(N - g)) ((n1 - 1)S1 + ... + (ng - 1)Sg) = (1/(N - g)) W

with N = n1 + n2 + ... + ng, is the pooled estimator of the common covariance matrix Σ. The estimated covariances of the maximum likelihood estimators are

Côv(β̂ℓ) = (k/nℓ) (B' S_pooled⁻¹ B)⁻¹        for ℓ = 1, 2, ..., g        (6-68)

where k = (N - g)(N - g - 1)/((N - g - p + q)(N - g - p + q + 1)). Also, β̂ℓ and β̂h are independent, for ℓ ≠ h, so their covariance is 0.

We can formally test that a qth-order polynomial is adequate. The model is fit without restrictions, and the error sum of squares and cross products matrix is just the within groups matrix W, which has N - g degrees of freedom. Under a qth-order polynomial, the error sum of squares and cross products matrix

W_q = Σ(ℓ=1..g) Σ(j=1..nℓ) (Xℓj - B β̂ℓ)(Xℓj - B β̂ℓ)'        (6-69)

has N - g + p - q - 1 degrees of freedom. The likelihood ratio test of the null hypothesis that the qth-order polynomial is adequate can be based on Wilks' lambda

Λ* = |W| / |W_q|        (6-70)

Under the polynomial growth model, there are q + 1 terms instead of the p means for each of the groups. Thus there are (p - q - 1)g fewer parameters. For large sample sizes, the null hypothesis that the polynomial is adequate is rejected if

-(N - (1/2)(p - q + g)) ln Λ* > χ²_{(p-q-1)g}(α)        (6-71)

Example 6.13 (Fitting a quadratic growth curve to calcium loss)

Refer to the data in Tables 6.5 and 6.6. Fit the model for quadratic growth. A computer calculation gives

[β̂1, β̂2] = [ 73.0701   70.1387
               3.6444    4.0900
              -2.0274   -1.8534 ]

so the estimated growth curves are

Control group:     73.07 + 3.64t - 2.03t²
                  (2.58)  (.83)    (.28)

Treatment group:   70.14 + 4.09t - 1.85t²
                  (2.50)  (.80)    (.27)

where

(B' S_pooled⁻¹ B)⁻¹ = [ 93.1744   -5.8368    0.2184
                        -5.8368    9.5699   -3.0240
                         0.2184   -3.0240    1.1051 ]

and, by (6-68), the standard errors given below the parameter estimates were obtained by multiplying the diagonal elements by k/nℓ and taking the square root.

Examination of the estimates and the standard errors reveals that the t² terms are needed. Loss of calcium is predicted after 3 years for both groups. Further, there does not seem to be any substantial difference between the two growth curves.

Wilks' lambda for testing the null hypothesis that the quadratic growth model is adequate becomes Λ* = |W| / |W2|, where

W  = [ 2726.282   2660.749   2369.308   2335.912
       2660.749   2756.009   2343.514   2327.961
       2369.308   2343.514   2301.714   2098.544
       2335.912   2327.961   2098.544   2277.452 ]

W2 = [ 2781.017   2698.589   2363.228   2362.253
       2698.589   2832.430   2331.235   2381.160
       2363.228   2331.235   2303.687   2089.996
       2362.253   2381.160   2089.996   2314.485 ]

so that Λ* = .7627.
Since, with α = .01,

-(N - (1/2)(p - q + g)) ln Λ* = -(31 - (1/2)(4 - 2 + 2)) ln .7627 = 7.86 < χ²_{(4-2-1)2}(.01) = 9.21

we fail to reject the adequacy of the quadratic fit at α = .01. Since the p-value is less than .05 there is, however, some evidence that the quadratic does not fit well. We could, without restricting to quadratic growth, test for parallel and coincident calcium loss using profile analysis.

The Potthoff and Roy growth curve model holds for more general designs than one-way MANOVA. However, the β̂ℓ are no longer given by (6-67) and the expression for its covariance matrix becomes more complicated than (6-68). We refer the reader to [12] for more examples and further tests.

There are many other modifications to the model treated here. They include the following:

(a) Dropping the restriction to polynomial growth. Use nonlinear parametric models or even nonparametric splines.
(b) Restricting the covariance matrix to a special form such as equally correlated responses on the same individual.
(c) Observing more than one response variable, over time, on the same individual. This results in a multivariate version of the growth curve model.
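The estimator (6-67) and the large-sample statistic in (6-71) are straightforward to compute. A Python/NumPy sketch of ours (the function name and the diagonal stand-in covariance matrix are illustrative assumptions, not from the book):

```python
import math
import numpy as np

def growth_curve_coeffs(xbar, times, q, S_pooled):
    """GLS estimator (6-67): (B' S^-1 B)^-1 B' S^-1 xbar, where B is the
    p x (q+1) polynomial design matrix evaluated at the given times."""
    B = np.vander(np.asarray(times, dtype=float), q + 1, increasing=True)
    Si = np.linalg.inv(S_pooled)
    return np.linalg.solve(B.T @ Si @ B, B.T @ Si @ np.asarray(xbar))

# If xbar is exactly quadratic in t, (6-67) recovers the coefficients no
# matter what S_pooled is; the diagonal matrix below is an arbitrary stand-in.
t = np.array([0., 1., 2., 3.])
beta_true = np.array([73.07, 3.64, -2.03])     # control-group curve
xbar = beta_true[0] + beta_true[1] * t + beta_true[2] * t**2
beta_hat = growth_curve_coeffs(xbar, t, q=2, S_pooled=np.diag([2., 1., 1., 2.]))

# Large-sample adequacy statistic from (6-71), using the Example 6.13 values
N, p, q, g, wilks = 31, 4, 2, 2, .7627
stat = -(N - 0.5 * (p - q + g)) * math.log(wilks)   # about 7.86 < 9.21
```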
6.9 PERSPECTIVES AND A STRATEGY FOR ANALYZING MULTIVARIATE MODELS
We emphasize that, with several characteristics, it is important to control the overall probability of making any incorrect decision. This is particularly important when testing for the equality of two or more treatments as the examples in this chapter indicate. A single multivariate test, with its associated single p-value, is preferable to performing a large number of univariate tests. The outcome tells us whether or not it is worthwhile to look closer on a variable-by-variable and group-by-group analysis.

A single multivariate test is recommended over, say, p univariate tests because, as the next example demonstrates, univariate tests ignore important information and can give misleading results.

Example 6.14 (Comparing multivariate and univariate tests for the differences in means)

Suppose we collect measurements on two variables X1 and X2 for ten randomly selected experimental units from each of two groups. The hypothetical data are noted here and displayed as scatter plots and marginal dot diagrams in Figure 6.6 on page 328.
[The 20 hypothetical observations (x1, x2) — ten from group 1 and ten from group 2 — are listed in the original table.]

[Figure 6.6 Scatter plots and marginal dot diagrams for the data from two groups.]
It is clear from the horizontal marginal dot diagram that there is considerable overlap in the x1 values for the two groups. Similarly, the vertical marginal dot diagram shows there is considerable overlap in the x2 values for the two groups. The scatter plots suggest that there is fairly strong positive correlation between the two variables for each group, and that, although there is some overlap, the group 1 measurements are generally to the southeast of the group 2 measurements.

Let μ1' = [μ11, μ12] be the population mean vector for the first group, and let μ2' = [μ21, μ22] be the population mean vector for the second group. Using the x1 observations, a univariate analysis of variance gives F = 2.46 with ν1 = 1 and ν2 = 18 degrees of freedom. Consequently, we cannot reject H0: μ11 = μ21 at any reasonable significance level (F1,18(.10) = 3.01). Using the x2 observations, a univariate analysis of variance gives F = 2.68 with ν1 = 1 and ν2 = 18 degrees of freedom. Again, we cannot reject H0: μ12 = μ22 at any reasonable significance level. The univariate tests suggest there is no difference between the component means for the two groups, and hence we cannot discredit μ1 = μ2.

On the other hand, if we use Hotelling's T² to test for the equality of the mean vectors, we find

T² = 17.29 > c² = [(18)(2)/17] F2,17(.01) = 2.118 × 6.11 = 12.94

and we reject H0: μ1 = μ2 at the 1% level. The multivariate test takes into account the positive correlation between the two measurements for each group — information that is unfortunately ignored by the univariate tests. This T²-test is equivalent to the MANOVA test (6-38).

Example 6.15 (Data on lizards that require a bivariate test to establish a difference in means)

A zoologist collected lizards in the southwestern United States. Among other variables, she measured mass (in grams) and the snout-vent length (in millimeters). Because the tails sometimes break off in the wild, the snout-vent length is a more representative measure of length. The data for the lizards from two genera, Cnemidophorus (C) and Sceloporus (S), collected in 1997 and 1999 are given in Table 6.7. Notice that there are n1 = 20 measurements for C lizards and n2 = 40 measurements for S lizards.

After taking natural logarithms, the summary statistics are

C: n1 = 20    x̄1' = [2.240, 4.394]    S1 = [ 0.35305  0.09417
                                              0.09417  0.02595 ]

S: n2 = 40    x̄2' = [2.368, 4.308]    S2 = [ 0.50684  0.14539
                                              0.14539  0.04255 ]

A plot of mass (Mass) versus snout-vent length (SVL), after taking natural logarithms, is shown in Figure 6.7. The large-sample individual 95% confidence intervals for the difference in ln(Mass) means and the difference in ln(SVL) means both cover 0:

ln(Mass):  μ11 - μ21:  (-0.476, 0.220)
ln(SVL):   μ12 - μ22:  (-0.011, 0.183)
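The bivariate statistic reported later in this example can be reproduced from these summary statistics. A Python/NumPy sketch of ours, using the printed (rounded) values:

```python
import numpy as np

# Log-scale summary statistics for the two genera, as printed above
n1, n2 = 20, 40
xbar1 = np.array([2.240, 4.394])    # C: mean ln(Mass), mean ln(SVL)
xbar2 = np.array([2.368, 4.308])    # S
S1 = np.array([[0.35305, 0.09417], [0.09417, 0.02595]])
S2 = np.array([[0.50684, 0.14539], [0.14539, 0.04255]])

# Large-sample two-sample T^2 with unequal covariance matrices:
# T^2 = d'(S1/n1 + S2/n2)^-1 d, approximately chi-square with 2 d.f.
d = xbar1 - xbar2
T2 = d @ np.linalg.solve(S1 / n1 + S2 / n2, d)   # about 224 from these
                                                 # rounded statistics
```

The rounded summary statistics give roughly 224; the value 225.4 quoted in the example comes from the full data.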
TABLE 6.7  LIZARD DATA FOR TWO GENERA

[Table 6.7 lists Mass (grams) and SVL = snout-vent length (millimeters) for the n1 = 20 Cnemidophorus (C) and n2 = 40 Sceloporus (S) lizards.]

Source: Data courtesy of Kevin E. Bonine.

[Figure 6.7 Scatter plot of ln(Mass) versus ln(SVL) for the lizard data in Table 6.7.]
The corresponding univariate Student's t-test statistics for testing for no difference in the individual means have p-values of .46 and .08, respectively. Clearly, from a univariate perspective, we cannot detect a difference in mass means or a difference in snout-vent length means for the two genera of lizards.

However, consistent with the scatter diagram in Figure 6.7, a bivariate analysis strongly supports a difference in size between the two groups of lizards. Using Result 6.4 (also see Example 6.5), the T²-statistic has an approximate χ²₂ distribution. For this example, T² = 225.4 with a p-value less than .0001. A multivariate method is essential in this case.

Examples 6.14 and 6.15 demonstrate the efficacy of a multivariate test relative to its univariate counterparts. We encountered exactly this situation with the effluent data in Example 6.1.

In the context of random samples from several populations (recall the one-way MANOVA in Section 6.4), multivariate tests are based on the matrices

W = Σ(ℓ=1..g) Σ(j=1..nℓ) (xℓj - x̄ℓ)(xℓj - x̄ℓ)'    and    B = Σ(ℓ=1..g) nℓ (x̄ℓ - x̄)(x̄ℓ - x̄)'

Throughout this chapter, we have used

Wilks' lambda statistic    Λ* = |W| / |B + W|

which is equivalent to the likelihood ratio test. Three other multivariate test statistics are regularly included in the output of statistical packages:

Lawley-Hotelling trace = tr[B W⁻¹]
Pillai's trace = tr[B (B + W)⁻¹]
Roy's largest root = maximum eigenvalue of W (B + W)⁻¹

All four of these tests appear to be nearly equivalent for extremely large samples. For moderate sample sizes, all comparisons are based on what is necessarily a limited number of cases studied by simulation. From the simulations reported to date, the first three tests have similar power, while the last, Roy's test, behaves differently. Its power is best only when there is a single nonzero eigenvalue and, at the same time, the power is large. This may approximate situations where a large difference exists in just one characteristic and it is between one group and all of the others. There is also some suggestion that Pillai's trace is slightly more robust against nonnormality. However, we suggest trying transformations on the original data when the residuals are nonnormal.

All four statistics apply in the two-way setting and in even more complicated MANOVA. More discussion is given in terms of the multivariate regression model in Chapter 7.

When, and only when, the multivariate tests signal a difference, or departure from the null hypothesis, do we probe deeper. We recommend calculating the Bonferroni intervals for all pairs of groups and all characteristics. The simultaneous confidence statements determined from the shadows of the confidence ellipse are,
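All four criteria can be computed together from the eigenvalues of B W⁻¹, a standard computational route. A Python/NumPy sketch (the helper function itself is ours, for illustration only):

```python
import numpy as np

def manova_stats(B, W):
    """The four MANOVA criteria expressed through the eigenvalues lam_i
    of B W^-1 (equivalently, of W^-1 B)."""
    lam = np.linalg.eigvals(np.linalg.solve(W, B)).real
    wilks = np.prod(1.0 / (1.0 + lam))        # |W| / |B + W|
    lawley_hotelling = lam.sum()              # tr[B W^-1]
    pillai = (lam / (1.0 + lam)).sum()        # tr[B (B + W)^-1]
    roy = (1.0 / (1.0 + lam)).max()           # max eigenvalue of W (B + W)^-1
    return wilks, lawley_hotelling, pillai, roy

# Small symmetric positive definite example to check the identities
B = np.array([[36., 18.], [18., 12.]])
W = np.array([[20., 4.], [4., 10.]])
wilks, lh, pillai, roy = manova_stats(B, W)
```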
typically, too large. The one-at-a-time intervals may be suggestive of differences that merit further study but, with the current data, cannot be taken as conclusive evidence for the existence of differences. We summarize the procedure developed in this chapter for comparing treatments. The first step is to check the data for outliers using visual displays and other calculations.

We must issue one caution concerning the proposed strategy. It may be the case that differences would appear in only one of the many characteristics and, further, the differences hold for only a few treatment combinations. Then, these few active differences may become lost among all the inactive variables. That is, the overall test may not show significance whereas a univariate test restricted to the specific active variable would detect the difference. The best preventative is a good experimental design. To design an effective experiment when one specific variable is expected to produce differences, do not include too many other variables that are not expected to show differences among the treatments.
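For the two-sample comparisons stressed in this section, the pooled-covariance T² statistic can be sketched as follows (hotelling_T2 is our own illustrative helper, not the book's notation):

```python
import numpy as np

def hotelling_T2(X1, X2):
    """Two-sample Hotelling T^2 with pooled covariance:
    (n1*n2/(n1+n2)) * d' S_pooled^-1 d, with d the difference in means."""
    X1, X2 = np.asarray(X1, dtype=float), np.asarray(X2, dtype=float)
    n1, n2 = len(X1), len(X2)
    d = X1.mean(axis=0) - X2.mean(axis=0)
    S_pooled = ((n1 - 1) * np.cov(X1, rowvar=False)
                + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    return (n1 * n2 / (n1 + n2)) * d @ np.linalg.solve(S_pooled, d)

# Tiny worked check: two 4-point clouds shifted by 1 in the first variable
X1 = [[0, 0], [2, 0], [0, 2], [2, 2]]
X2 = [[1, 0], [3, 0], [1, 2], [3, 2]]
T2 = hotelling_T2(X1, X2)   # (4*4/8) * (3/4) = 1.5 here
```

Because the statistic uses the full pooled covariance matrix, it can flag a mean difference that every single-variable t-test misses, exactly the point of Examples 6.14 and 6.15.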
EXERCISES
6.1. Construct and sketch a joint 95% confidence region for the mean difference vector δ using the effluent data and results in Example 6.1. Note that the point δ = 0 falls outside the 95% contour. Is this result consistent with the test of H0: δ = 0 considered in Example 6.1? Explain.

6.2. Using the information in Example 6.1, construct the 95% Bonferroni simultaneous intervals for the components of the mean difference vector δ. Compare the lengths of these intervals with those of the simultaneous intervals constructed in the example.

6.3. The data corresponding to sample 8 in Table 6.1 seem unusually large. Remove sample 8. Construct a joint 95% confidence region for the mean difference vector δ and the 95% Bonferroni simultaneous intervals for the components of the mean difference vector. Are the results consistent with a test of H0: δ = 0? Discuss. Does the "outlier" make a difference in the analysis of these data?

6.4. Refer to Example 6.1.
(a) Redo the analysis in Example 6.1 after transforming the pairs of observations to ln(BOD) and ln(SS).
(b) Construct the 95% Bonferroni simultaneous intervals for the components of the mean vector δ of transformed variables.
(c) Discuss any possible violation of the assumption of a bivariate normal distribution for the difference vectors of transformed observations.

6.5. A researcher considered three indices measuring the severity of heart attacks. The values of these indices for the n = 40 heart-attack patients arriving at a hospital emergency room produced the summary statistics

x̄ = [46.1, 57.3, 50.4]'    and    S = [ 101.3   63.0   71.0
                                          63.0   80.2   55.6
                                          71.0   55.6   97.4 ]

(a) All three indices are evaluated for each patient. Test for the equality of mean indices using (6-16) with α = .05.
(b) Judge the differences in pairs of mean indices using 95% simultaneous confidence intervals. [See (6-18).]

6.6. Use the data for treatments 2 and 3 in Exercise 6.8.
(a) Calculate S_pooled.
(b) Test H0: μ2 - μ3 = 0 employing a two-sample approach with α = .01.
(c) Construct 99% simultaneous confidence intervals for the differences μ2i - μ3i, i = 1, 2.

6.7. Using the summary statistics for the electricity-demand data given in Example 6.4, compute T² and test the hypothesis H0: μ1 - μ2 = 0, assuming that Σ1 = Σ2. Set α = .05. Also, determine the linear combination of mean components most responsible for the rejection of H0.

6.8. Observations on two responses are collected for three treatments. The observation vectors [x1, x2]' are

[The five, three, and four bivariate observation vectors for Treatments 1, 2, and 3, respectively, are given in the original table.]

(a) Break up the observations into mean, treatment, and residual components, as in (6-35). Construct the corresponding arrays for each variable. (See Example 6.8.)
(b) Using the information in Part a, construct the one-way MANOVA table.
(c) Evaluate Wilks' lambda, Λ*, and use Table 6.3 to test for treatment effects. Set α = .01.
(d) Repeat the test using the chi-square approximation with Bartlett's correction. [See (6-39).] Compare the conclusions.

6.9. Using the contrast matrix C in (6-13), verify the relationships dj = Cxj, d̄ = Cx̄, and Sd = CSC' in (6-14).

6.10. Consider the univariate one-way decomposition of the observation xℓj given by (6-30). Show that the mean vector x̄1 is always perpendicular to the treatment effect vector (x̄1 - x̄)u1 + (x̄2 - x̄)u2 + ... + (x̄g - x̄)ug, where u1 has its first n1 elements equal to 1 and the remaining elements equal to 0, u2 has its next n2 elements equal to 1 and the remaining elements equal to 0, and so on through ug.

6.11. A likelihood argument provides additional support for pooling the two independent sample covariance matrices to estimate a common covariance matrix in the case of two normal populations. Give the likelihood function, L(μ1, μ2, Σ), for two independent samples of sizes n1 and n2 from Np(μ1, Σ) and Np(μ2, Σ) populations, respectively. Show that this likelihood is maximized by the choices μ̂1 = x̄1, μ̂2 = x̄2 and

Σ̂ = (1/(n1 + n2)) [(n1 - 1)S1 + (n2 - 1)S2]

Hint: Use (4-16) and the maximization Result 4.10.

6.12. (Test for linear profiles, given that the profiles are parallel.) Let μ1' = [μ11, μ12, ..., μ1p] and μ2' = [μ21, μ22, ..., μ2p] be the mean responses to p treatments for populations 1 and 2, respectively. Assume that the profiles given by the two mean vectors are parallel.
(a) Show that the hypothesis that the profiles are linear can be written as H0: (μ1i + μ2i) - (μ1,i-1 + μ2,i-1) = (μ1,i-1 + μ2,i-1) - (μ1,i-2 + μ2,i-2), i = 3, ..., p, or as H0: C(μ1 + μ2) = 0, where C is the (p - 2) × p matrix

C = [ 1  -2   1   0  ...   0   0   0
      0   1  -2   1  ...   0   0   0
      ...
      0   0   0   0  ...   1  -2   1 ]

(b) Following an argument similar to the one leading to (6-62), we reject H0: C(μ1 + μ2) = 0 at level α if

T² = (x̄1 + x̄2)' C' [ ((1/n1) + (1/n2)) C S_pooled C' ]⁻¹ C (x̄1 + x̄2) > c²
where

c² = ((n1 + n2 - 2)(p - 2) / (n1 + n2 - p + 1)) F_{p-2, n1+n2-p+1}(α)

Let n1 = 30, n2 = 30, x̄1' = [6.4, 6.8, 7.3, 7.0], x̄2' = [4.3, 4.9, 5.3, 5.1], and

S_pooled = [ .61  .26  .07  .16
             .26  .64  .17  .14
             .07  .17  .81  .03
             .16  .14  .03  .31 ]

Test for linear profiles, assuming that the profiles are parallel. Use α = .05.
6.13. (Two-way MANOVA without replications.) Consider the observations on two responses, x1 and x2, displayed in the form of the following two-way table (note that there is a single observation vector at each combination of factor levels):

[The 3 × 4 array of bivariate observation vectors — Factor 1 at levels 1-3, Factor 2 at levels 1-4 — is given in the original table.]

With no replications, the two-way MANOVA model is

Xℓk = μ + τℓ + βk + eℓk;    Σ(ℓ=1..g) τℓ = Σ(k=1..b) βk = 0

where the eℓk are independent Np(0, Σ) random vectors.
(a) Decompose the observations for each of the two variables as

xℓk = x̄ + (x̄ℓ. - x̄) + (x̄.k - x̄) + (xℓk - x̄ℓ. - x̄.k + x̄)

similar to the arrays in Example 6.8. For each response, this decomposition will result in several 3 × 4 matrices. Here x̄ is the overall average, x̄ℓ. is the average for the ℓth level of factor 1, and x̄.k is the average for the kth level of factor 2.
(b) Regard the rows of the matrices in Part a as strung out in a single "long" vector, and compute the sums of squares

SStot = SSmean + SSfac1 + SSfac2 + SSres

and sums of cross products

CPtot = CPmean + CPfac1 + CPfac2 + CPres

Consequently, obtain the matrices SSPcor, SSPfac1, SSPfac2, and SSPres with degrees of freedom gb - 1, g - 1, b - 1, and (g - 1)(b - 1), respectively.
(c) Summarize the calculations in Part b in a MANOVA table.
Hint: This MANOVA table is consistent with the two-way MANOVA table for comparing factors and their interactions where n = 1. Note that, with n = 1, SSPres in the general two-way MANOVA table is a zero matrix with zero degrees of freedom. The matrix of interaction sum of squares and cross products now becomes the residual sum of squares and cross products matrix.
(d) Given the summary in Part c, test for factor 1 and factor 2 main effects at the α = .05 level.
Hint: Use the results in (6-56) and (6-58) with gb(n - 1) replaced by (g - 1)(b - 1).
Note: The tests require that p ≤ (g - 1)(b - 1) so that SSPres will be positive definite (with probability 1).

6.14. A replicate of the experiment in Exercise 6.13 yields the following data:

[The replicate 3 × 4 array of bivariate observation vectors is given in the original table.]

(a) Use these data to decompose each of the two measurements in the observation vectors as

xℓk = x̄ + (x̄ℓ. - x̄) + (x̄.k - x̄) + (xℓk - x̄ℓ. - x̄.k + x̄)

where x̄ is the overall average, x̄ℓ. is the average for the ℓth level of factor 1, and x̄.k is the average for the kth level of factor 2. Form the corresponding arrays for each of the two responses.
(b) Combine the preceding data with the data in Exercise 6.13 and carry out the necessary calculations to complete the general two-way MANOVA table.
(c) Given the results in Part b, test for interactions, and if the interactions do not exist, test for factor 1 and factor 2 main effects. Use the likelihood ratio test with α = .05.
(d) If main effects, but no interactions, exist, examine the nature of the main effects by constructing Bonferroni simultaneous 95% confidence intervals for differences of the components of the factor effect parameters.
6.15. Refer to Example 6.11.
(a) Carry out approximate chi-square (likelihood ratio) tests for the factor 1 and factor 2 effects. Set α = .05. Compare these results with the results for the exact F-tests given in the example. Explain any differences.
(b) Using (6-59), construct simultaneous 95% confidence intervals for differences in the factor 1 effect parameters for pairs of the three responses. Interpret these intervals. Repeat these calculations for factor 2 effect parameters.

The following exercises may require the use of a computer.

6.16. Four measures of the response stiffness on each of 30 boards are listed in Table 4.3 (see Example 4.14). The measures, on a given board, are repeated in the sense that they were made one after another. Assuming that the measures of stiffness arise from four treatments, test for the equality of treatments in a repeated measures design context. Set α = .05. Construct a 95% (simultaneous) confidence interval for a contrast in the mean levels representing a comparison of the dynamic measurements with the static measurements.

6.17. The data in Table 6.8 were collected to test two psychological models of numerical cognition. Does the processing of numbers depend on the way the numbers are presented (words, Arabic digits)? Thirty-two subjects were required to make a series of quick numerical judgments about two numbers presented as either two number words ("two," "four") or two single Arabic digits ("2," "4"). The subjects were asked to respond "same" if the two numbers had the same numerical parity (both even or both odd) and "different" if the two numbers had a different parity (one even, one odd). Half of the subjects were assigned a block of Arabic digit trials, followed by a block of number word trials, and half of the subjects received the blocks of trials in the reverse order. Within each block, the order of "same" and "different" parity trials was randomized for each subject. For each of the four combinations of parity and format, the median reaction times for correct responses were recorded for each subject. Here

X1 = median reaction time for word format-different parity combination
X2 = median reaction time for word format-same parity combination
X3 = median reaction time for Arabic format-different parity combination
X4 = median reaction time for Arabic format-same parity combination

(a) Test for treatment effects using a repeated measures design. Set α = .05.
(b) Construct 95% (simultaneous) confidence intervals for the contrasts representing the number format effect, the parity type effect and the interaction effect. Interpret the resulting intervals.
(c) The absence of interaction supports the M model of numerical cognition, while the presence of interaction supports the C and C model of numerical cognition. Which model is supported in this experiment?
(d) For each subject, construct three difference scores corresponding to the number format contrast, the parity type contrast, and the interaction contrast. Is a multivariate normal distribution a reasonable population model for these data? Explain.
TABLE 6.8  NUMBER PARITY DATA (MEDIAN TIMES IN MILLISECONDS)

[Table 6.8 lists, for each of the 32 subjects, the four median reaction times: WordDiff (x1), WordSame (x2), ArabicDiff (x3), and ArabicSame (x4).]

Source: Data courtesy of J. Carr.
6.18. Jolicoeur and Mosimann [11] studied the relationship of size and shape for painted turtles. Table 6.9 contains their measurements on the carapaces of 24 female and 24 male turtles.
(a) Test for equality of the two population mean vectors using α = .05.
(b) If the hypothesis in Part a is rejected, find the linear combination of mean components most responsible for rejecting H0.
(c) Find simultaneous confidence intervals for the component mean differences. Compare with the Bonferroni intervals.
Hint: You may wish to consider logarithmic transformations of the observations.

TABLE 6.9  CARAPACE MEASUREMENTS (IN MILLIMETERS) FOR PAINTED TURTLES

            Female                            Male
Length   Width   Height         Length   Width   Height
 (x1)     (x2)    (x3)           (x1)     (x2)    (x3)
  98       81      38             93       74      37
 103       84      38             94       78      35
 103       86      42             96       80      35
 105       86      42            101       84      38
 109       88      44            102       85      39
 123       92      50            103       81      37
 123       95      46            104       83      39
 133       99      51            106       83      39
 133      102      51            107       82      38
 133      102      51            112       89      40
 134      100      48            113       88      40
 136      102      49            114       86      43
 138       98      51            116       90      40
 138       99      51            117       90      41
 141      105      53            117       91      41
 147      108      57            119       93      41
 149      107      55            120       89      40
 153      107      56            120       93      44
 155      115      63            121       95      42
 155      117      60            125       93      45
 158      115      62            127       96      45
 159      118      63            128       95      45
 162      124      61            131       95      46
 177      132      67            135      106      47
6.19.
X1 =
(a)
(b)
(c)
X2 =
X3 =
n1
=
n2 =
a =
(d) Comment on the validity of the assumptions used in your analysis. Note in particular that observations 9 and 21 for gasoline trucks have been identified as multivariate outliers. (See Exercise 5.22 and [2].) Repeat Part a with these observations deleted. Comment on the results.

TABLE 6.10 MILK TRANSPORTATION-COST DATA

Gasoline trucks: x1, x2, x3;  Diesel trucks: x1, x2, x3
[36 rows of per-mile costs for gasoline trucks and 23 rows for diesel trucks]

Source: Data courtesy of M. Keaton.
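The two-sample comparison in Exercise 6.19(a) rests on the two-sample Hotelling T² statistic with a pooled covariance matrix. A minimal sketch, using simulated stand-ins for the Table 6.10 costs:

```python
import numpy as np

# Two-sample Hotelling T^2 sketch for H0: mu1 = mu2 (equal mean cost
# vectors), assuming a common covariance matrix.  The arrays are
# illustrative stand-ins for the fuel/repair/capital costs.
rng = np.random.default_rng(1)
gas = rng.normal([12.0, 8.0, 9.5], 3.0, size=(36, 3))       # n1 = 36
diesel = rng.normal([10.0, 10.0, 18.0], 3.0, size=(23, 3))  # n2 = 23

n1, n2 = len(gas), len(diesel)
p = gas.shape[1]
d = gas.mean(0) - diesel.mean(0)
S_pooled = ((n1 - 1) * np.cov(gas, rowvar=False)
            + (n2 - 1) * np.cov(diesel, rowvar=False)) / (n1 + n2 - 2)

T2 = d @ np.linalg.solve((1 / n1 + 1 / n2) * S_pooled, d)
# Reject H0 at level alpha when T2 exceeds
# ((n1 + n2 - 2) p / (n1 + n2 - p - 1)) F_{p, n1 + n2 - p - 1}(alpha).
scale = (n1 + n2 - 2) * p / (n1 + n2 - p - 1)
print(round(T2, 2), round(scale, 3))
```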
6.20. The tail lengths in millimeters (x1) and wing lengths in millimeters (x2) for 45 male hook-billed kites are given in Table 6.11 on page 341. Similar measurements for female hook-billed kites were given in Table 5.12.

(a) Plot the male hook-billed kite data as a scatter diagram, and (visually) check for outliers. (Note, in particular, observation 31 with x1 = 284.)
(b) Test for equality of mean vectors for the populations of male and female hook-billed kites. Set α = .05. If H0: μ1 - μ2 = 0 is rejected, find the linear combination most responsible for the rejection of H0. (You may want to eliminate any outliers found in Part a for the male hook-billed kite data before conducting this test. Alternatively, you may want to interpret x1 = 284 for observation 31 as a misprint and conduct the test with x1 = 184 for this observation. Does it make any difference in this case how observation 31 for the male hook-billed kite data is treated?)
(c) Determine the 95% confidence region for μ1 - μ2 and 95% simultaneous confidence intervals for the components of μ1 - μ2.
(d) Are male or female birds generally larger?

TABLE 6.11 MALE HOOK-BILLED KITE DATA

x1 (Tail length)  x2 (Wing length)
[45 pairs of tail- and wing-length measurements]

Source: Data courtesy of S. Temple.

6.21. Using Moody's bond ratings, samples of 20 Aa (middle-high quality) corporate bonds and 20 Baa (top-medium quality) corporate bonds were selected. For each of the corresponding companies, the ratios

X1 = current ratio (a measure of short-term liquidity)
X2 = long-term interest rate (a measure of interest coverage)
X3 = debt-to-equity ratio (a measure of financial risk or leverage)
X4 = rate of return on equity (a measure of profitability)
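When a test such as the one in Exercise 6.20(b) rejects H0, the linear combination of mean components most responsible is proportional to S_pooled⁻¹(x̄1 − x̄2). A sketch with hypothetical summary statistics (the means and covariance below are illustrative, not taken from any of these tables):

```python
import numpy as np

# The "most responsible" linear combination is a-hat proportional to
# S_pooled^{-1} (xbar1 - xbar2).  Hypothetical two-variable summary.
xbar1 = np.array([190.0, 275.0])      # hypothetical group-1 means
xbar2 = np.array([185.0, 280.0])      # hypothetical group-2 means
S_pooled = np.array([[120.0, 100.0],
                     [100.0, 210.0]])

a_hat = np.linalg.solve(S_pooled, xbar1 - xbar2)
a_hat = a_hat / np.abs(a_hat).max()   # rescale for readability
print(np.round(a_hat, 3))
```

The rescaling is harmless because the T² statistic, and hence "responsibility," is unchanged by multiplying a by a constant.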
were recorded. The summary statistics are as follows:

Aa bond companies: n1 = 20, x̄1' = [2.287, 12.600, .347, 14.830], with sample covariance matrix S1;
Baa bond companies: n2 = 20, x̄2' = [2.404, 7.155, .524, 12.840], with sample covariance matrix S2 and pooled covariance matrix Spooled.

(a) Does pooling appear reasonable here? Comment on the pooling procedure in this case.
(b) Are the financial characteristics of firms with Aa bonds different from those with Baa bonds? Using the pooled covariance matrix, test for the equality of mean vectors. Set α = .05.
(c) Calculate the linear combinations of mean components most responsible for rejecting H0: μ1 - μ2 = 0 in Part b.
(d) Bond rating companies are interested in a company's ability to satisfy its outstanding debt obligations as they mature. Does it appear as if one or more of the foregoing financial ratios might be useful in helping to classify a bond as "high" or "medium" quality? Explain.

6.22. Researchers interested in assessing pulmonary function in nonpathological populations asked subjects to run on a treadmill until exhaustion. Samples of air were collected at definite intervals and the gas contents analyzed. The results on 4 measures of oxygen consumption for 25 males and 25 females are given in Table 6.12 on page 343. The variables were

X1 = resting volume O2 (L/min)
X2 = resting volume O2 (mL/kg/min)
X3 = maximum volume O2 (L/min)
X4 = maximum volume O2 (mL/kg/min)

(a) Look for gender differences by testing for equality of group means. Use α = .05. If you reject H0: μ1 - μ2 = 0, find the linear combination most responsible.
(b) Construct the 95% simultaneous confidence intervals for each μ1i - μ2i, i = 1, 2, 3, 4. Compare with the corresponding Bonferroni intervals.
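The comparison of T² simultaneous intervals with Bonferroni intervals asked for in part (b) comes down to two critical constants: both interval types have the form d_i ± c·sqrt((1/n1 + 1/n2) s_ii,pooled). A sketch of the two constants, assuming SciPy is available:

```python
import numpy as np
from scipy import stats

# Critical constants for T^2 simultaneous intervals vs. Bonferroni
# intervals for p mean differences (illustrative sizes, not the
# Table 6.12 data).
n1 = n2 = 25
p = 4
alpha = 0.05

nu = n1 + n2 - 2                      # pooled degrees of freedom
c_T2 = np.sqrt(nu * p / (nu - p + 1)
               * stats.f.ppf(1 - alpha, p, nu - p + 1))
c_bon = stats.t.ppf(1 - alpha / (2 * p), nu)

print(round(c_T2, 3), round(c_bon, 3))
```

For a small number of comparisons the Bonferroni constant is the smaller of the two, so the Bonferroni intervals are narrower here.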
TABLE 6.12 OXYGEN-CONSUMPTION DATA

Males: x1 Resting O2 (L/min), x2 Resting O2 (mL/kg/min), x3 Maximum O2 (L/min), x4 Maximum O2 (mL/kg/min);  Females: the same four variables
[25 rows of measurements for each sex]

Source: Data courtesy of S. Rokicki.
(c) The data in Table 6.12 were collected from graduate-student volunteers, and thus they do not represent a random sample. Comment on the possible implications of this information.

6.23. Construct a one-way MANOVA using the width measurements from the iris data in Table 11.5. Construct 95% simultaneous confidence intervals for differences in mean components for the two responses for each pair of populations. Comment on the validity of the assumption that Σ1 = Σ2 = Σ3.

6.24. Researchers have suggested that a change in skull size over time is evidence of the interbreeding of a resident population with immigrant populations. Four measurements were made of male Egyptian skulls for three different time periods: period 1 is 4000 B.C., period 2 is 3300 B.C., and period 3 is 1850 B.C. The data are shown in Table 6.13 (see the skull data on the CD-ROM).

TABLE 6.13 EGYPTIAN SKULL DATA

MaxBreath (x1)  BasHeight (x2)  BasLength (x3)  NasHeight (x4)  Period
[rows of the four measurements for each of the three time periods]

Source: Data courtesy of J. Jackson.
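The one-way MANOVA asked for in exercises such as 6.23 and 6.24 can be sketched directly from the between-groups and within-groups matrices. The groups below are simulated stand-ins, not the Table 6.13 skull measurements:

```python
import numpy as np

# One-way MANOVA sketch: Wilks' lambda = |W| / |B + W| from the
# between (B) and within (W) sums of squares and cross products.
rng = np.random.default_rng(2)
groups = [rng.normal([132.0, 133.0, 99.0, 50.0], 4.0, size=(30, 4)),
          rng.normal([133.0, 132.0, 97.0, 50.0], 4.0, size=(30, 4)),
          rng.normal([135.0, 131.0, 94.0, 51.0], 4.0, size=(30, 4))]

grand = np.vstack(groups).mean(axis=0)
W = sum((g - g.mean(0)).T @ (g - g.mean(0)) for g in groups)
B = sum(len(g) * np.outer(g.mean(0) - grand, g.mean(0) - grand)
        for g in groups)

wilks = np.linalg.det(W) / np.linalg.det(B + W)
print(round(wilks, 4))
```

Small values of Wilks' lambda favor rejecting equality of the mean vectors; an exact or approximate F conversion then completes the test.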
The measured variables are

X1 = maximum breadth of skull (mm)
X2 = basibregmatic height of skull (mm)
X3 = basialveolar length of skull (mm)
X4 = nasal height of skull (mm)

(a) Construct a one-way MANOVA of the Egyptian skull data. Use α = .05.
(b) Construct 95% simultaneous confidence intervals to determine which mean components differ among the populations represented by the three time periods.
(c) Are the usual MANOVA assumptions realistic for these data? Explain.

6.25. Construct a one-way MANOVA of the crude-oil data listed in Table 11.7. Construct 95% simultaneous confidence intervals to determine which mean components differ among the populations. (You may want to consider transformations of the data to make them more closely conform to the usual MANOVA assumptions.)

6.26. A project was designed to investigate how consumers in Green Bay, Wisconsin, would react to an electrical time-of-use pricing scheme. The cost of electricity during peak hours for some customers was set at eight times the cost of electricity during off-peak hours. Hourly consumption (in kilowatt-hours) was measured on a hot summer day in July and compared, for both the test group and the control group, with baseline consumption measured on a similar day before the experimental rates began. The responses,

log(current consumption) - log(baseline consumption)

for the hours ending 9 A.M., 11 A.M. (a peak hour), 1 P.M., and 3 P.M. (a peak hour) produced the following summary statistics:

Test group: n1 = 28, x̄1' = [.153, -.231, -.322, -.339]
Control group: n2 = 58, x̄2' = [.151, .180, .256, .257]

and

            .804  .355  .228  .232
Spooled =   .355  .722  .233  .199
            .228  .233  .592  .239
            .232  .199  .239  .479

Source: Data courtesy of Statistical Laboratory, University of Wisconsin.

Perform a profile analysis. Does time-of-use pricing seem to make a difference in electrical consumption? What is the nature of this difference, if any? Comment. (Use a significance level of α = .05 for any statistical tests.)

6.27. As part of the study of love and marriage in Example 6.12, a sample of husbands and wives were asked to respond to these questions:

1. What is the level of passionate love you feel for your partner?
2. What is the level of passionate love that your partner feels for you?
3. What is the level of companionate love that you feel for your partner?
4. What is the level of companionate love that your partner feels for you?

The responses were recorded on the following 5-point scale:

1 None at all   2 Very little   3 Some   4 A great deal   5 Tremendous amount
Thirty husbands and 30 wives gave the responses in Table 6.14, where X1 = a 5-point-scale response to Question 1,

TABLE 6.14 SPOUSE DATA

Husband rating wife: x1, x2, x3, x4;  Wife rating husband: x1, x2, x3, x4
[30 rows of paired ratings]

Source: Data courtesy of E. Hatfield.
X2 = a 5-point-scale response to Question 2, X3 = a 5-point-scale response to Question 3, and X4 = a 5-point-scale response to Question 4.

(a) Plot the mean vectors for husbands and wives as sample profiles.
(b) Is the husband rating wife profile parallel to the wife rating husband profile? Test for parallel profiles with α = .05. If the profiles appear to be parallel, test for coincident profiles at the same level of significance. Finally, if the profiles are coincident, test for level profiles at the same level of significance. What conclusion(s) can be drawn from this analysis?

6.28. Two species of biting flies (genus Leptoconops) are so similar morphologically that for many years they were thought to be the same. Biological differences such as sex ratios of emerging flies and biting habits were found to exist. Do the taxonomic data listed in part in Table 6.15 on page 348 and on the CD-ROM indicate any difference in the two species L. carteri and L. torrens? Test for the equality of the two population mean vectors using α = .05. If the hypothesis of equal mean vectors is rejected, determine the mean components (or linear combinations of mean components) most responsible for rejecting H0. Justify your use of normal-theory methods for these data.

6.29. Using the data on bone mineral content in Table 1.8, investigate equality between the dominant and nondominant bones.
(a) Test using α = .05.
(b) Construct 95% simultaneous confidence intervals for the mean differences.
(c) Construct the Bonferroni 95% simultaneous intervals, and compare these with the intervals in Part b.

6.30. Table 6.16 on page 349 contains the bone mineral contents, for the first 24 subjects in Table 1.8, 1 year after their participation in an experimental program. Compare the data from both tables to determine whether there has been bone loss.
(a) Test using α = .05.
(b) Construct 95% simultaneous confidence intervals for the mean differences.
(c) Construct the Bonferroni 95% simultaneous intervals, and compare these with the intervals in Part b.

6.31. Peanuts are an important crop in parts of the southern United States. In an effort to develop improved plants, crop scientists routinely compare varieties with respect to several variables. The data for one two-factor experiment are given in Table 6.17 on page 349. Three varieties (5, 6, and 8) were grown at two geographical locations (1, 2) and, in this case, the three variables representing yield and the two important grade-grain characteristics were measured. There were two replications of the experiment. The three variables are

X1 = Yield (plot weight)
X2 = Sound mature kernels (weight in grams; maximum of 250 grams)
X3 = Seed size (weight, in grams, of 100 seeds)

(a) Perform a two-factor MANOVA using the data in Table 6.17. Test for a location effect, a variety effect, and a location-variety interaction. Use α = .05.
TABLE 6.15 BITING-FLY DATA

x1 (Wing length)  x2 (Wing width)  x3 (Third palp length)  x4 (Third palp width)  x5 (Fourth palp length)  x6 (Length of antennal segment 12)  x7 (Length of antennal segment 13)
[measurements for samples of L. torrens and L. carteri]

Source: Data courtesy of William Atchley.
TABLE 6.16 MINERAL CONTENT IN BONES (AFTER 1 YEAR)

Subject number, Dominant radius, Radius, Dominant humerus, Humerus, Dominant ulna, Ulna
[measurements for the 24 subjects]

Source: Data courtesy of Everett Smith.

TABLE 6.17 PEANUT DATA

Factor 1 (Location), Factor 2 (Variety), x1 (Yield), x2 (SdMatKer), x3 (SeedSize)
[two replications for each of the six location-variety combinations]

Source: Data courtesy of Yolanda Lopez.
(b) Analyze the residuals from Part a. Do the usual MANOVA assumptions appear to be satisfied? Discuss.
(c) Using the results in Part a, can we conclude that the location and/or variety effects are additive? If not, does the interaction effect show up for some variables, but not for others? Check by running three separate univariate two-factor ANOVAs.
(d) Larger numbers correspond to better yield and grade-grain characteristics. Using location 2, can we conclude that one variety is better than the other two for each characteristic? Discuss your answer, using 95% Bonferroni simultaneous intervals for pairs of varieties.

6.32. In one experiment involving remote sensing, the spectral reflectance of three species of 1-year-old seedlings was measured at various wavelengths during the growing season. The seedlings were grown with two different levels of nutrient: the optimal level, coded +, and a suboptimal level, coded -. The species of seedlings used were sitka spruce (SS), Japanese larch (JL), and lodgepole pine (LP). Two of the variables measured were

X1 = percent spectral reflectance at wavelength 560 nm (green)
X2 = percent spectral reflectance at wavelength 720 nm (near infrared)

The cell means (CM) for Julian day 235 for each combination of species and nutrient level are as follows. These averages are based on four replications.

560CM  720CM  Nutrient  Species
[cell means for the six species-nutrient combinations: SS, JL, and LP at each of the + and - nutrient levels]

(a) Treating the cell means as individual observations, perform a two-way MANOVA to test for a species effect and a nutrient effect. Use α = .05.
(b) Construct a two-way ANOVA for the 560CM observations and another two-way ANOVA for the 720CM observations. Are these results consistent with the MANOVA results in Part a? If not, can you explain any differences?

6.33. Refer to Exercise 6.32. The data in Table 6.18 are measurements on the variables

X1 = percent spectral reflectance at wavelength 560 nm (green)
X2 = percent spectral reflectance at wavelength 720 nm (near infrared)

for three species (sitka spruce [SS], Japanese larch [JL], and lodgepole pine [LP]) of 1-year-old seedlings taken at three different times (Julian day 150 [1], Julian day 235 [2], and Julian day 320 [3]) during the growing season. The seedlings were all grown with the optimal level of nutrient.
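A balanced two-way MANOVA, as called for with the peanut data of Table 6.17 and the reflectance data of Table 6.18, decomposes the total sum of squares and cross products into factor, interaction, and residual matrices. A sketch with simulated stand-ins (2 locations × 3 varieties × 2 replications, p = 3 responses):

```python
import numpy as np

# Balanced two-way MANOVA sketch: SSP decomposition and Wilks' lambda
# for the interaction.  Data are simulated stand-ins for Table 6.17.
rng = np.random.default_rng(4)
g, b, n, p = 2, 3, 2, 3
X = rng.normal(size=(g, b, n, p)) + np.array([195.0, 160.0, 55.0])

mean = X.mean((0, 1, 2))          # grand mean, shape (p,)
m_lb = X.mean(2)                  # cell means, shape (g, b, p)
m_l = X.mean((1, 2))              # factor 1 (location) means
m_b = X.mean((0, 2))              # factor 2 (variety) means

def ssp(rows):
    """Sum of outer products of the given p-vectors."""
    return sum(np.outer(r, r) for r in rows)

SSP1 = b * n * ssp(m_l - mean)
SSP2 = g * n * ssp(m_b - mean)
SSPint = n * ssp((m_lb - m_l[:, None] - m_b[None, :] + mean).reshape(-1, p))
SSPres = ssp((X - m_lb[:, :, None]).reshape(-1, p))

wilks_int = np.linalg.det(SSPres) / np.linalg.det(SSPint + SSPres)
print(round(wilks_int, 4))
```

The same ratio with SSP1 or SSP2 in place of SSPint tests the two main effects.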
(a) Perform a two-factor MANOVA using the data in Table 6.18. Test for a species effect, a time effect, and species-time interaction. Use α = .05.
(b) Do you think the usual MANOVA assumptions are satisfied for these data? Discuss with reference to a residual analysis, and the possibility of correlated observations over time.

TABLE 6.18 SPECTRAL REFLECTANCE DATA

Species, Time, Replication, 560 nm, 720 nm
[four replications for each species at each of the three times]

Source: Data courtesy of Mairtin Mac Siurtain.
(c) Foresters are particularly interested in the interaction of species and time. Does interaction show up for one variable but not for the other? Check by running a univariate two-factor ANOVA for each of the two responses.
(d) Can you think of another method of analyzing these data (or a different experimental design) that would allow for a potential time trend in the spectral reflectance numbers?

6.34. Refer to Example 6.13.
(a) Plot the profiles, the components of x̄1 versus time and those of x̄2 versus time, on the same graph. Comment on the comparison.
(b) Test that linear growth is adequate. Take α = .01.

6.35. Refer to Example 6.13 but treat all 31 subjects as a single group. The maximum likelihood estimate of the (q + 1) × 1 vector β is

β̂ = (B'S⁻¹B)⁻¹B'S⁻¹x̄

where S is the sample covariance matrix. The estimated covariances of the maximum likelihood estimators are

Côv(β̂) = ((n - 1)(n - 2) / ((n - 1 - p + q)(n - p + q)n)) (B'S⁻¹B)⁻¹

Fit a quadratic growth curve to this single group and comment on the fit.
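The growth-curve estimate in Exercise 6.35 is a direct matrix computation once the within-subject design matrix B (here polynomial in time) is set up. A sketch with simulated stand-ins for the Example 6.13 measurements; the time points and underlying curve below are illustrative assumptions:

```python
import numpy as np

# Growth-curve sketch for a single group: quadratic fit of the p mean
# responses over time, beta-hat = (B'S^-1 B)^-1 B'S^-1 xbar.
rng = np.random.default_rng(5)
times = np.array([1.0, 2.0, 3.0, 4.0])
true_curve = 80 + 3 * times - 0.4 * times**2          # assumed shape
X = true_curve + rng.normal(scale=2.0, size=(31, 4))  # n = 31 subjects

n, p = X.shape
q = 2                                        # quadratic: q + 1 parameters
B = np.vander(times, q + 1, increasing=True)  # columns 1, t, t^2
xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)

Si_B = np.linalg.solve(S, B)                 # S^{-1} B
beta_hat = np.linalg.solve(B.T @ Si_B, Si_B.T @ xbar)
print(np.round(beta_hat, 2))   # approx [80, 3, -0.4]
```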
REFERENCES

1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (2d ed.). New York: John Wiley, 1984.
2. Bacon-Shone, J., and W. K. Fung. "A New Graphical Method for Detecting Single and Multiple Outliers in Univariate and Multivariate Data." Applied Statistics, 36, no. 2 (1987), 153-162.
3. Bartlett, M. S. "Properties of Sufficiency and Statistical Tests." Proceedings of the Royal Society of London (A), 160 (1937), 268-282.
4. Bartlett, M. S. "Further Aspects of the Theory of Multiple Regression." Proceedings of the Cambridge Philosophical Society, 34 (1938), 33-40.
5. Bartlett, M. S. "Multivariate Analysis." Journal of the Royal Statistical Society Supplement (B), 9 (1947), 176-197.
6. Bartlett, M. S. "A Note on the Multiplying Factors for Various χ² Approximations." Journal of the Royal Statistical Society (B), 16 (1954), 296-298.
7. Bhattacharyya, G. K., and R. A. Johnson. Statistical Concepts and Methods. New York: John Wiley, 1977.
8. Box, G. E. P., and N. R. Draper. Evolutionary Operation: A Statistical Method for Process Improvement. New York: John Wiley, 1969.
9. Box, G. E. P., W. G. Hunter, and J. S. Hunter. Statistics for Experimenters. New York: John Wiley, 1978.
10. John, P. W. M. Statistical Design and Analysis of Experiments. New York: Macmillan, 1971.
11. Jolicoeur, P., and J. E. Mosimann. "Size and Shape Variation in the Painted Turtle: A Principal Component Analysis." Growth, 24 (1960), 339-354.
12. Kshirsagar, A. M., and W. B. Smith. Growth Curves. New York: Marcel Dekker, 1995.
13. Morrison, D. F. Multivariate Statistical Methods (2d ed.). New York: McGraw-Hill, 1976.
14. Pearson, E. S., and H. O. Hartley, eds. Biometrika Tables for Statisticians, vol. II. Cambridge, England: Cambridge University Press, 1972.
15. Potthoff, R. F., and S. N. Roy. "A Generalized Multivariate Analysis of Variance Model Useful Especially for Growth Curve Problems." Biometrika, 51 (1964), 313-326.
16. Scheffé, H. The Analysis of Variance. New York: John Wiley, 1959.
17. Tiku, M. L., and N. Balakrishnan. "Testing the Equality of Variance-Covariance Matrices the Robust Way." Communications in Statistics-Theory and Methods, 14, no. 12 (1985), 3033-3051.
18. Tiku, M. L., and M. Singh. "Robust Statistics for Testing Mean Vectors of Multivariate Distributions." Communications in Statistics-Theory and Methods, 11, no. 9 (1982), 985-1001.
19. Timm, N. H. Multivariate Analysis with Applications in Education and Psychology. Monterey, California: Brooks/Cole, 1975.
20. Wilks, S. S. "Certain Generalizations in the Analysis of Variance." Biometrika, 24 (1932), 471-494.
CHAPTER
7
Multivariate Linear Regression Models
7.1
INTRODUCTION
Regression analysis is the statistical methodology for predicting the values of one or more response (dependent) variables from a collection of predictor (independent) variable values. It can also be used for assessing the effects of the predictor variables on the responses. Unfortunately, the name regression, culled from the title of the first paper on the subject by F. Galton [14], in no way reflects either the importance or breadth of application of this methodology.

In this chapter, we first discuss the multiple regression model for the prediction of a single response. This model is then generalized to handle the prediction of several dependent variables. Our treatment must be somewhat terse, as a vast literature exists on the subject. (If you are interested in pursuing regression analysis, a good place to start is one of the following books, in ascending order of difficulty: Bowerman and O'Connell [5], Neter, Wasserman, Kutner, and Nachtsheim [17], Draper and Smith [12], Cook and Weisberg [9], Seber [20], and Goldberger [15].) Our abbreviated treatment highlights the regression assumptions and their consequences, alternative formulations of the regression model, and the general applicability of regression techniques to seemingly different situations.
7.2
THE CLASSICAL LINEAR REGRESSION MODEL

Let z1, z2, ..., zr be r predictor variables thought to be related to a response variable Y. For example, with r = 4, we might have

Y = current market value of home

and

z1 = square feet of living area
z2 = location (indicator for zone of city)
z3 = appraised value last year
z4 = quality of construction (price per square foot)

The classical linear regression model states that Y is composed of a mean, which depends in a continuous manner on the zi's, and a random error ε, which accounts for measurement error and the effects of other variables not explicitly considered in the model. The values of the predictor variables recorded from the experiment or set by the investigator are treated as fixed. The error (and hence the response) is viewed as a random variable whose behavior is characterized by a set of distributional assumptions.

Specifically, the linear regression model with a single response takes the form

Y = β0 + β1 z1 + ··· + βr zr + ε
[Response] = [mean (depending on z1, z2, ..., zr)] + [error]

The term "linear" refers to the fact that the mean is a linear function of the unknown parameters β0, β1, ..., βr. The predictor variables may or may not enter the model as first-order terms.

With n independent observations on Y and the associated values of the zi, the complete model becomes

Y1 = β0 + β1 z11 + β2 z12 + ··· + βr z1r + ε1
Y2 = β0 + β1 z21 + β2 z22 + ··· + βr z2r + ε2
  ⋮
Yn = β0 + β1 zn1 + β2 zn2 + ··· + βr znr + εn          (7-1)

where the error terms are assumed to have the following properties:

1. E(εj) = 0;
2. Var(εj) = σ² (constant); and
3. Cov(εj, εk) = 0, j ≠ k.          (7-2)

In matrix notation, (7-1) becomes

[Y1]   [1  z11  z12  ···  z1r] [β0]   [ε1]
[Y2] = [1  z21  z22  ···  z2r] [β1] + [ε2]
[ ⋮ ]   [⋮    ⋮    ⋮        ⋮ ] [ ⋮ ]   [ ⋮ ]
[Yn]   [1  zn1  zn2  ···  znr] [βr]   [εn]

or

   Y    =     Z        β     +    ε          (7-3)
(n×1)   (n×(r+1)) ((r+1)×1)   (n×1)

and the specifications in (7-2) become:

1. E(ε) = 0; and
2. Cov(ε) = E(εε') = σ²I.
Note that a one in the first column of the design matrix Z is the multiplier of the constant term {3 0 • It is customary to introduce the artificial variable Zj o = 1, so that
β0 + β1 zj1 + ··· + βr zjr = β0 zj0 + β1 zj1 + ··· + βr zjr

Each column of Z consists of the n values of the corresponding predictor variable, while the jth row of Z contains the values for all predictor variables on the jth trial.
Although the error-term assumptions in (7-2) are very modest, we shall later need to add the assumption of joint normality for making confidence statements and testing hypotheses. We now provide some examples of the linear regression model.

Example 7.1 (Fitting a straight-line regression model)

Determine the linear regression model for fitting a straight line

Mean response = E(Y) = β0 + β1 z1

to the data

z1:  0  1  2  3  4
y:   1  4  3  8  9
Before the responses Y' = [Y1, Y2, ..., Y5] are observed, the errors ε' = [ε1, ε2, ..., ε5] are random, and we can write

Y = Zβ + ε

where Y = [Y1, Y2, ..., Y5]', β = [β0, β1]', ε = [ε1, ε2, ..., ε5]', and Z is the 5 × 2 matrix whose jth row is [1, zj1].
The data for this model are contained in the observed response vector y and the design matrix Z, where
     1           1  0
     4           1  1
y =  3      Z =  1  2
     8           1  3
     9           1  4
Note that we can handle a quadratic expression for the mean response by introducing the term β2 z2, with z2 = z1². The linear regression model for the jth trial in this latter case is

Yj = β0 + β1 zj1 + β2 zj2 + εj

or

Yj = β0 + β1 zj1 + β2 zj1² + εj          •
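Looking ahead to the least squares formula β̂ = (Z'Z)⁻¹Z'y of Section 7.3, the straight line of this example can be fit numerically:

```python
import numpy as np

# Least squares fit of the Example 7.1 straight line via the normal
# equations beta-hat = (Z'Z)^{-1} Z'y.
z1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 4.0, 3.0, 8.0, 9.0])

Z = np.column_stack([np.ones_like(z1), z1])   # design matrix [1, z1]
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
print(beta_hat)   # [1. 2.]  ->  fitted line  y-hat = 1 + 2 z1
```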
Example 7.2
(The design matrix for one-way ANOVA as a regression model)

Determine the design matrix if the linear regression model is applied to the one-way ANOVA situation in Example 6.6. We create so-called dummy variables to handle the three population means: μ1 = μ + τ1, μ2 = μ + τ2, and μ3 = μ + τ3. We set

z1 = { 1 if the observation is from population 1
     { 0 otherwise

z2 = { 1 if the observation is from population 2
     { 0 otherwise

z3 = { 1 if the observation is from population 3
     { 0 otherwise

and β0 = μ, β1 = τ1, β2 = τ2, β3 = τ3. Then

Yj = β0 + β1 zj1 + β2 zj2 + β3 zj3 + εj,    j = 1, 2, ..., 8
where we arrange the observations from the three populations in sequence. Thus, we obtain the observed response vector and design matrix
     9          1  1  0  0
     6          1  1  0  0
     9          1  1  0  0
     0          1  0  1  0
y =  2     Z =  1  0  1  0
     3          1  0  0  1
     1          1  0  0  1
     2          1  0  0  1
  (8×1)           (8×4)
                                                                   •

The construction of dummy variables, as in Example 7.2, allows the whole of analysis of variance to be treated within the multiple linear regression framework.
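The dummy-variable construction of Example 7.2 is easy to mechanize; a small sketch:

```python
import numpy as np

# Build the 8 x 4 one-way ANOVA design matrix of Example 7.2:
# an intercept column plus one 0/1 indicator per population.
labels = np.array([1, 1, 1, 2, 2, 3, 3, 3])   # population of each trial

Z = np.column_stack(
    [np.ones(len(labels))] + [(labels == k).astype(float) for k in (1, 2, 3)])
print(Z.astype(int))
```

Note that Z is not of full rank here (the three indicator columns sum to the intercept column), which is why the footnote to Result 7.1 allows a generalized inverse of Z'Z.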
7.3
LEAST SQUARES ESTI MATI O N One of the obj ectives of regression analysis is t o develop an equation that will allow the investigator to predict the response for given values of the predictor variab le s. Thus, it is necessary to "fit" the model in (7-3) to the observed corresponding to th e known values 1 , That is, we must determine the values for the regression coefficients f3 and the error variance consistent with the available data. j - · · · - b, z 1 Let be trial values for Consider the difference + · + between the observed response and the value + that wo uld be expected if were the "true" parameter vector. Typically, the differe nce s - ..· will not be zero, because the response fluctuates (in a manner characterized by the error term assumptions) about its expected value. The method of least squares selects so as to minimize the sum of the squares of the differences:
yj
Zj 1 , . . . , Zj r·
b
yj - b0 - b1 zj 1
/3.
b
a-2
yj - b0 - b1 z 1 b0 b1 zj 1 . . brZjr
yj
r
- b,Zj r
b
n
S ( b ) == j�=1 ( yj - bo - b1 Zj 1 == (y - Zb)'(y - Zb)
- "·
- brZjr) 2
(7- 4)
The coefficients b chosen by the least squares criterion are called least squares estimates of the regression parameters β. They will henceforth be denoted by β̂ to emphasize their role as estimates of β.

The coefficients β̂ are consistent with the data in the sense that they produce estimated (fitted) mean responses, β̂₀ + β̂₁zⱼ₁ + ... + β̂ᵣzⱼᵣ, the sum of whose squared differences from the observed yⱼ is as small as possible. The deviations

    ε̂ⱼ = yⱼ - β̂₀ - β̂₁zⱼ₁ - ... - β̂ᵣzⱼᵣ,    j = 1, 2, ..., n        (7-5)

are called residuals. The vector of residuals ε̂ = y - Zβ̂ contains the information about the remaining unknown parameter σ². (See Result 7.2.)

Result 7.1. Let Z have full rank r + 1 ≤ n.¹ The least squares estimate of β in (7-3) is given by

    β̂ = (Z'Z)⁻¹Z'y

Let ŷ = Zβ̂ = Hy denote the fitted values of y, where H = Z(Z'Z)⁻¹Z' is called the "hat" matrix. Then the residuals

    ε̂ = y - ŷ = [I - Z(Z'Z)⁻¹Z']y = (I - H)y

satisfy Z'ε̂ = 0 and ŷ'ε̂ = 0. Also, the

    residual sum of squares = Σⱼ₌₁ⁿ (yⱼ - β̂₀ - β̂₁zⱼ₁ - ... - β̂ᵣzⱼᵣ)² = ε̂'ε̂
                            = y'[I - Z(Z'Z)⁻¹Z']y = y'y - y'Zβ̂

¹ If Z is not full rank, (Z'Z)⁻¹ is replaced by (Z'Z)⁻, a generalized inverse of Z'Z. (See Exercise 7.6.)
" 1 " " Let Z' y as as s e rt e d. Then = y y f3 = ( Z ' Z ) 1 1 [I - Z( Z ' Z )- Z' ] y. 'The-1matZ' 'ri=x [[II - Z(Z(ZZ''ZZ))--1ZZ'']] satisf(iseysmmetric); = y- Z/3 = [[II -- ZZ((ZZ'ZZ))-1 Z' JJ [I - Z-(Z' Z)-1Z ' ] == I[I--2Z(Z (ZZ''ZZ))--11ZZ''] (Z(idempot Z' Z)-e1ntZ ')Z; ( Z ' Z )-1Z' Z' [I - Z (Z' Z)-1Z' J = Z' - Z' = 1 Cons e quent l y , Z ' e = Z' ( y y) = Z' [ I Z( Z ' Z ) Z ' ] y = O, s o y' e = p ' Z ' e1 = 1 1 Addi t i o nal l y , Z ' Z r ( Z ' Z r ' £ = y' [ I -Z( z ' ] [ I z ' ] y = y' [ I -z( z ' z r z ' ] y � = y' y - y' Zf3 . To verify the expr" esZion for f3 , we write Zb = y Z y Zb = y Z Z ( f3 -b) f3 f3 f3 so S(b) = (y - Zb)' (y - Zb) = (y 2(-yZ/3)'- Zf3)'("y -ZZ/3)( f3"" - b)(/3" - b)' Z' Z ( /3" - b) = (y - Z/3)'" (y - Z/3)" (/3 - b)' Z' Z ( /3" - b) nbandt h e ssiencce(�ndyis-thZP)'e squarZ e=d e'leZngt=h of ZTh( f3� f-irsb)t te. rBecaus m in S(ebZ)doesnotdepend � has f u l r a nk, Z ( f3 -b) i(fZf3' Z )-1b,Z'yso. Nottheemithnat(imZum'Z)s-u1mexiofstsssqinuarceZ'es Zis hasruniqauenkrand occurs (fIofrZ'bZ=isf3not= oftrafdiulctrsaZnk,haviZ'nZag fu=l rafonkr srome a but then a' Z' Za = or Za = which con Res u l t s h ows how t h e l e as t s q uar e s es t i m at e s p and the residuals e can be obtained from the design matrix Z and responses y by simple matrix operations. Proof.
£
"'
1.
2.
+
(7-6)
0.
3.
o.
�
"'
+
"'
"
"'
+
+
+
+
"'
0'.
#
0
+ 1 0
# 0,
<
n.
f0
0,
+ 1.)
•
7.1
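Those matrix operations translate directly into code. A small sketch assuming NumPy (`least_squares` is an illustrative helper name; forming (Z'Z)⁻¹ mirrors the text, although a QR-based routine such as `numpy.linalg.lstsq` is numerically preferable in practice):

```python
import numpy as np

def least_squares(Z, y):
    """Least squares quantities of Result 7.1 via the normal equations."""
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    beta_hat = ZtZ_inv @ Z.T @ y      # beta-hat = (Z'Z)^{-1} Z'y
    H = Z @ ZtZ_inv @ Z.T             # the "hat" matrix
    y_hat = H @ y                     # fitted values
    resid = y - y_hat                 # residuals: Z'resid = 0, y_hat'resid = 0
    rss = resid @ resid               # residual sum of squares
    return beta_hat, y_hat, resid, rss
```

By Result 7.1 the returned residuals are orthogonal to every column of Z and to the fitted values, whatever the data.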
Example 7.3 (Calculating the least squares estimates, the residuals, and the residual sum of squares)

Calculate the least squares estimates β̂, the residuals ε̂, and the residual sum of squares for a straight-line model

    Yⱼ = β₀ + β₁zⱼ₁ + εⱼ

fit to the data

    z₁ | 0  1  2  3  4
    y  | 1  4  3  8  9
We have

    Z' = [1 1 1 1 1
          0 1 2 3 4],        y = [1, 4, 3, 8, 9]'

    Z'Z = [ 5  10
           10  30],          Z'y = [25, 70]'

    (Z'Z)⁻¹ = [ .6  -.2
               -.2   .1]

Consequently,

    β̂ = [β̂₀, β̂₁]' = (Z'Z)⁻¹Z'y = [ .6  -.2
                                   -.2   .1] [25, 70]' = [1, 2]'

and the fitted equation is

    ŷ = 1 + 2z

The vector of fitted (predicted) values is

    ŷ = Zβ̂ = [1, 3, 5, 7, 9]'

so

    ε̂ = y - ŷ = [1, 4, 3, 8, 9]' - [1, 3, 5, 7, 9]' = [0, 1, -2, 1, 0]'

The residual sum of squares is

    ε̂'ε̂ = 0² + 1² + (-2)² + 1² + 0² = 6    •
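The hand computation above can be verified numerically; a sketch assuming NumPy:

```python
import numpy as np

# Example 7.3: z1 = 0, 1, 2, 3, 4 and y = 1, 4, 3, 8, 9.
Z = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([1.0, 4.0, 3.0, 8.0, 9.0])

beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)   # [1, 2]: fitted line y = 1 + 2z
y_hat = Z @ beta_hat                           # [1, 3, 5, 7, 9]
resid = y - y_hat                              # [0, 1, -2, 1, 0]
rss = resid @ resid                            # 0 + 1 + 4 + 1 + 0 = 6
print(beta_hat, rss)
```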
Sum-of-Squares Decomposition

According to Result 7.1, ŷ'ε̂ = 0, so the total response sum of squares y'y = Σⱼ₌₁ⁿ yⱼ² satisfies

    y'y = (ŷ + y - ŷ)'(ŷ + y - ŷ) = (ŷ + ε̂)'(ŷ + ε̂) = ŷ'ŷ + ε̂'ε̂        (7-7)
Since the first column of Z is 1, the condition Z'ε̂ = 0 includes the requirement

    0 = 1'ε̂ = Σⱼ₌₁ⁿ ε̂ⱼ = Σⱼ₌₁ⁿ yⱼ - Σⱼ₌₁ⁿ ŷⱼ

so the fitted values have the same mean ȳ as the observations. Subtracting nȳ² from both sides of the decomposition in (7-7), we obtain the basic decomposition of the sum of squares about the mean:

    y'y - nȳ² = ŷ'ŷ - nȳ² + ε̂'ε̂

or

    Σⱼ₌₁ⁿ (yⱼ - ȳ)² = Σⱼ₌₁ⁿ (ŷⱼ - ȳ)² + Σⱼ₌₁ⁿ ε̂ⱼ²                        (7-8)

    (total sum of squares about mean) = (regression sum of squares) + (residual (error) sum of squares)
The preceding sum of squares decomposition suggests that the quality of the model's fit can be measured by the coefficient of determination

    R² = 1 - Σⱼ₌₁ⁿ ε̂ⱼ² / Σⱼ₌₁ⁿ (yⱼ - ȳ)²  =  Σⱼ₌₁ⁿ (ŷⱼ - ȳ)² / Σⱼ₌₁ⁿ (yⱼ - ȳ)²        (7-9)

The quantity R² gives the proportion of the total variation in the yⱼ's "explained" by, or attributable to, the predictor variables z₁, z₂, ..., zᵣ. Here R² (or the multiple correlation coefficient R = +√R²) equals 1 if the fitted equation passes through all the data points, so that ε̂ⱼ = 0 for all j. At the other extreme, R² is 0 if β̂₀ = ȳ and β̂₁ = β̂₂ = ... = β̂ᵣ = 0. In this case, the predictor variables z₁, z₂, ..., zᵣ have no influence on the response.
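For the straight-line fit of Example 7.3, the decomposition (7-8) and the R² of (7-9) work out as follows (a sketch assuming NumPy):

```python
import numpy as np

# Sum-of-squares decomposition (7-8) and R^2 of (7-9) for Example 7.3.
y = np.array([1.0, 4.0, 3.0, 8.0, 9.0])
y_hat = np.array([1.0, 3.0, 5.0, 7.0, 9.0])     # fitted values from Example 7.3
resid = y - y_hat

tss = ((y - y.mean()) ** 2).sum()               # total SS about the mean = 46
reg_ss = ((y_hat - y.mean()) ** 2).sum()        # regression sum of squares = 40
rss = resid @ resid                             # residual sum of squares = 6

r_squared = 1.0 - rss / tss                     # = 40/46, about .870
print(r_squared)
```

Note that 46 = 40 + 6, exactly the decomposition in (7-8).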
Geometry of Least Squares

A geometrical interpretation of the least squares technique highlights the nature of the concept. According to the classical linear regression model,

    Mean response vector = E(Y) = Zβ = β₀[1, 1, ..., 1]' + β₁[z₁₁, z₂₁, ..., z_n1]' + ... + βᵣ[z₁ᵣ, z₂ᵣ, ..., z_nᵣ]'

Thus, E(Y) is a linear combination of the columns of Z. As β varies, Zβ spans the model plane of all linear combinations. Usually, the observation vector y will not lie in the model plane, because of the random error ε; that is, y is not (exactly) a linear combination of the columns of Z. Recall that

    y (response vector) = Zβ (vector in model plane) + ε (error vector)
Figure 7.1  Least squares as a projection for n = 3, r = 1. (The residual vector ε̂ = y - ŷ is perpendicular to the model plane.)
Once the observations become available, the least squares solution is derived from the deviation vector

    y - Zb = (observation vector) - (vector in model plane)

The squared length (y - Zb)'(y - Zb) is the sum of squares S(b). As illustrated in Figure 7.1, S(b) is as small as possible when b is selected such that Zb is the point in the model plane closest to y. This point occurs at the tip of the perpendicular projection of y on the plane. That is, for the choice b = β̂, ŷ = Zβ̂ is the projection of y on the plane consisting of all linear combinations of the columns of Z. The residual vector ε̂ = y - ŷ is perpendicular to that plane. This geometry holds even when Z is not of full rank.

When Z has full rank, the projection operation is expressed analytically as multiplication by the matrix Z(Z'Z)⁻¹Z'. To see this, we use the spectral decomposition (2-16) to write

    Z'Z = λ₁e₁e₁' + λ₂e₂e₂' + ... + λ_{r+1}e_{r+1}e'_{r+1}

where λ₁ ≥ λ₂ ≥ ... ≥ λ_{r+1} > 0 are the eigenvalues of Z'Z and e₁, e₂, ..., e_{r+1} are the corresponding eigenvectors. If Z is of full rank,

    (Z'Z)⁻¹ = (1/λ₁)e₁e₁' + (1/λ₂)e₂e₂' + ... + (1/λ_{r+1})e_{r+1}e'_{r+1}

Consider qᵢ = λᵢ^{-1/2} Zeᵢ, which is a linear combination of the columns of Z. Then

    qᵢ'q_k = λᵢ^{-1/2}λ_k^{-1/2} eᵢ'Z'Ze_k = λᵢ^{-1/2}λ_k^{-1/2} eᵢ'λ_k e_k = 0 if i ≠ k, or 1 if i = k

That is, the r + 1 vectors qᵢ are mutually perpendicular and have unit length. Their linear combinations span the space of all linear combinations of the columns of Z. Moreover,

    Z(Z'Z)⁻¹Z' = Σᵢ₌₁^{r+1} λᵢ⁻¹ Zeᵢeᵢ'Z' = Σᵢ₌₁^{r+1} qᵢqᵢ'
According to Result 2A.2 and Definition 2A.12, the projection of y on a linear combination of {q₁, q₂, ..., q_{r+1}} is

    Σᵢ₌₁^{r+1} (qᵢ'y)qᵢ = (Σᵢ₌₁^{r+1} qᵢqᵢ')y = Z(Z'Z)⁻¹Z'y = Zβ̂

Thus, multiplication by Z(Z'Z)⁻¹Z' projects a vector onto the space spanned by the columns of Z. Similarly, [I - Z(Z'Z)⁻¹Z'] is the matrix for the projection of y on the plane perpendicular to the plane spanned by the columns of Z.
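The projection interpretation can be checked directly: Z(Z'Z)⁻¹Z' is symmetric, idempotent, and leaves any vector already in the model plane unchanged. A sketch assuming NumPy (the 6 × 3 design below is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = np.column_stack([np.ones(6), rng.standard_normal((6, 2))])  # arbitrary full-rank design
H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T                            # projection onto the columns of Z

print(np.allclose(H, H.T))         # symmetric
print(np.allclose(H @ H, H))       # idempotent, as in (7-6)
print(np.allclose(H @ Z, Z))       # leaves the model plane fixed
print(np.isclose(np.trace(H), 3))  # trace = rank = r + 1
```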
Sampling Properties of Classical Least Squares Estimators

The least squares estimator β̂ and the residuals ε̂ have the sampling properties detailed in the next result.

Result 7.2. Under the general linear regression model in (7-3), the least squares estimator β̂ = (Z'Z)⁻¹Z'Y has

    E(β̂) = β   and   Cov(β̂) = σ²(Z'Z)⁻¹

The residuals ε̂ have the properties

    E(ε̂) = 0   and   Cov(ε̂) = σ²[I - Z(Z'Z)⁻¹Z'] = σ²[I - H]

Also, E(ε̂'ε̂) = (n - r - 1)σ², so defining

    s² = ε̂'ε̂/(n - r - 1) = Y'[I - Z(Z'Z)⁻¹Z']Y/(n - r - 1) = Y'[I - H]Y/(n - r - 1)

we have

    E(s²) = σ²

Moreover, β̂ and ε̂ are uncorrelated.

Proof. Before the response Y = Zβ + ε is observed, it is a random vector. Now,

    β̂ = (Z'Z)⁻¹Z'Y = (Z'Z)⁻¹Z'(Zβ + ε) = β + (Z'Z)⁻¹Z'ε
    ε̂ = [I - Z(Z'Z)⁻¹Z']Y = [I - Z(Z'Z)⁻¹Z'][Zβ + ε] = [I - Z(Z'Z)⁻¹Z']ε        (7-10)

² If Z is not of full rank, we can use the generalized inverse (Z'Z)⁻ = Σᵢ₌₁^{r₁+1} λᵢ⁻¹eᵢeᵢ', where λ₁ ≥ λ₂ ≥ ... ≥ λ_{r₁+1} > 0 = λ_{r₁+2} = ... = λ_{r+1}, as described in Exercise 7.6. Then Z(Z'Z)⁻Z' = Σᵢ₌₁^{r₁+1} qᵢqᵢ' has rank r₁ + 1 and generates the unique projection of y on the space spanned by the linearly independent columns of Z. This is true for any choice of the generalized inverse. (See [20].)
since [I - Z(Z'Z)⁻¹Z']Z = Z - Z = 0. From (2-24) and (2-45),

    E(β̂) = β + (Z'Z)⁻¹Z'E(ε) = β
    Cov(β̂) = (Z'Z)⁻¹Z'Cov(ε)Z(Z'Z)⁻¹ = σ²(Z'Z)⁻¹Z'Z(Z'Z)⁻¹ = σ²(Z'Z)⁻¹
    E(ε̂) = [I - Z(Z'Z)⁻¹Z']E(ε) = 0
    Cov(ε̂) = [I - Z(Z'Z)⁻¹Z']Cov(ε)[I - Z(Z'Z)⁻¹Z']' = σ²[I - Z(Z'Z)⁻¹Z']

where the last equality follows from (7-6). Also,

    Cov(β̂, ε̂) = E[(β̂ - β)ε̂'] = (Z'Z)⁻¹Z'E(εε')[I - Z(Z'Z)⁻¹Z'] = σ²(Z'Z)⁻¹Z'[I - Z(Z'Z)⁻¹Z'] = 0

because Z'[I - Z(Z'Z)⁻¹Z'] = 0. From (7-10) and (7-6),

    ε̂'ε̂ = ε'[I - Z(Z'Z)⁻¹Z'][I - Z(Z'Z)⁻¹Z']ε = ε'[I - Z(Z'Z)⁻¹Z']ε = tr([I - Z(Z'Z)⁻¹Z']εε')

Now, for an arbitrary n × n random matrix W,

    E(tr(W)) = E(W₁₁ + W₂₂ + ... + W_nn) = E(W₁₁) + E(W₂₂) + ... + E(W_nn) = tr[E(W)]

Thus, using Results 4.9 and 2A.12, we obtain

    E(ε̂'ε̂) = tr([I - Z(Z'Z)⁻¹Z']E(εε')) = σ² tr[I - Z(Z'Z)⁻¹Z']
           = σ² tr(I) - σ² tr[Z(Z'Z)⁻¹Z'] = nσ² - σ² tr[(Z'Z)⁻¹Z'Z]
           = nσ² - σ² tr(I_{(r+1)×(r+1)}) = σ²(n - r - 1)

and the result for s² = ε̂'ε̂/(n - r - 1) follows.    •

The least squares estimator β̂ possesses a minimum variance property that was first established by Gauss. The following result concerns "best" estimators of linear parametric functions of the form c'β = c₀β₀ + c₁β₁ + ... + cᵣβᵣ for any c.

Result 7.3 (Gauss'³ least squares theorem). Let Y = Zβ + ε, where E(ε) = 0, Cov(ε) = σ²I, and Z has full rank r + 1. For any c, the estimator

    c'β̂ = c₀β̂₀ + c₁β̂₁ + ... + cᵣβ̂ᵣ

³ Much later, Markov proved a less general result, which misled many writers into attaching his name to this theorem.
of c'β has the smallest possible variance among all linear estimators of the form

    a'Y = a₁Y₁ + a₂Y₂ + ... + a_nY_n

that are unbiased for c'β.

Proof. For any fixed c, let a'Y be any unbiased estimator of c'β. Then E(a'Y) = c'β, whatever the value of β. Also, by assumption, E(a'Y) = E(a'Zβ + a'ε) = a'Zβ. Equating the two expected values yields a'Zβ = c'β or (c' - a'Z)β = 0 for all β, including the choice β = (c' - a'Z)'. This implies that c' = a'Z for any unbiased estimator.

Now, c'β̂ = c'(Z'Z)⁻¹Z'Y = a*'Y with a* = Z(Z'Z)⁻¹c. Moreover, from Result 7.2, E(β̂) = β, so c'β̂ = a*'Y is an unbiased estimator of c'β. Thus, for any a satisfying the unbiased requirement c' = a'Z,

    Var(a'Y) = Var(a'Zβ + a'ε) = Var(a'ε) = a'(σ²I)a
             = σ²(a - a* + a*)'(a - a* + a*) = σ²[(a - a*)'(a - a*) + a*'a*]

since (a - a*)'a* = (a - a*)'Z(Z'Z)⁻¹c = 0 from the condition (a - a*)'Z = a'Z - a*'Z = c' - c' = 0'. Because a* is fixed and (a - a*)'(a - a*) is positive unless a = a*, Var(a'Y) is minimized by the choice a*'Y = c'(Z'Z)⁻¹Z'Y = c'β̂.    •

This powerful result states that substitution of β̂ for β leads to the best estimator of c'β for any c of interest. In statistical terminology, the estimator c'β̂ is called the best (minimum-variance) linear unbiased estimator (BLUE) of c'β.

7.4 INFERENCES ABOUT THE REGRESSION MODEL

We describe inferential procedures based on the classical linear regression model in (7-3) with the additional (tentative) assumption that the errors ε have a normal distribution. Methods for checking the general adequacy of the model are considered in Section 7.6.

Inferences Concerning the Regression Parameters

Before we can assess the importance of particular variables in the regression function

    E(Y) = β₀ + β₁z₁ + ... + βᵣzᵣ        (7-11)

we must determine the sampling distributions of β̂ and the residual sum of squares, ε̂'ε̂. To do so, we shall assume that the errors ε have a normal distribution.

Result 7.4. Let Y = Zβ + ε, where Z has full rank r + 1 and ε is distributed as N_n(0, σ²I). Then the maximum likelihood estimator of β is the same as the least squares estimator β̂. Moreover,

    β̂ = (Z'Z)⁻¹Z'Y is distributed as N_{r+1}(β, σ²(Z'Z)⁻¹)
and is distributed independently of the residuals ε̂ = Y - Zβ̂. Further,

    nσ̂² = ε̂'ε̂ is distributed as σ²χ²_{n-r-1}

where σ̂² is the maximum likelihood estimator of σ².

Proof. Given the data and the normal assumption for the errors, the likelihood function for β, σ² is

    L(β, σ²) = Πⱼ₌₁ⁿ (1/(√(2π) σ)) e^{-εⱼ²/2σ²} = (2π)^{-n/2} σ^{-n} e^{-(y - Zβ)'(y - Zβ)/2σ²}

For a fixed value σ², the likelihood is maximized by minimizing (y - Zβ)'(y - Zβ). But this minimization yields the least squares estimate β̂ = (Z'Z)⁻¹Z'y, which does not depend upon σ². Therefore, under the normal assumption, the maximum likelihood and least squares approaches provide the same estimator β̂. Next, maximizing L(β̂, σ²) over σ² [see (4-18)] gives

    L(β̂, σ̂²) = (2π)^{-n/2} (σ̂²)^{-n/2} e^{-n/2}   where   σ̂² = (y - Zβ̂)'(y - Zβ̂)/n        (7-12)

From (7-10), we can express β̂ and ε̂ as linear combinations of the normal variables ε. Specifically,

    [β̂ - β]   [(Z'Z)⁻¹Z'       ]
    [  ε̂  ] = [I - Z(Z'Z)⁻¹Z' ] ε = Aε

Because Z is fixed, Result 4.3 implies the joint normality of β̂ and ε̂. The mean vectors and covariance matrices were obtained in Result 7.2. Again, using (7-6), we get

    Cov([β̂ - β; ε̂]) = A Cov(ε) A' = σ² [(Z'Z)⁻¹        0
                                          0'    I - Z(Z'Z)⁻¹Z']

Since Cov(β̂, ε̂) = 0 for the normal random vectors β̂ and ε̂, these vectors are independent. (See Result 4.5.)

Next, let (λ, e) be any eigenvalue-eigenvector pair for [I - Z(Z'Z)⁻¹Z']. Then, by (7-6), [I - Z(Z'Z)⁻¹Z']² = [I - Z(Z'Z)⁻¹Z'], so

    λe = [I - Z(Z'Z)⁻¹Z']e = [I - Z(Z'Z)⁻¹Z']²e = λ[I - Z(Z'Z)⁻¹Z']e = λ²e

That is, λ = 0 or 1. Now, tr[I - Z(Z'Z)⁻¹Z'] = n - r - 1 (see the proof of Result 7.2), and from Result 4.9, tr[I - Z(Z'Z)⁻¹Z'] = λ₁ + λ₂ + ... + λ_n, where λ₁ ≥ λ₂ ≥ ... ≥ λ_n are the eigenvalues of [I - Z(Z'Z)⁻¹Z']. Consequently, exactly n - r - 1 values of λᵢ equal one, and the rest are zero. It then follows from the spectral decomposition that

    I - Z(Z'Z)⁻¹Z' = Σⱼ₌₁^{n-r-1} eⱼeⱼ'        (7-13)
where e₁, e₂, ..., e_{n-r-1} are the normalized eigenvectors associated with the eigenvalues λ₁ = λ₂ = ... = λ_{n-r-1} = 1. Let

    Vᵢ = eᵢ'ε,    i = 1, 2, ..., n - r - 1

Then V is normal with mean vector 0 and

    Cov(Vᵢ, V_k) = E(eᵢ'εε'e_k) = σ²eᵢ'e_k = { σ²,  i = k
                                               0,   otherwise

That is, the Vᵢ are independent N(0, σ²) and, by (7-10),

    nσ̂² = ε̂'ε̂ = ε'[I - Z(Z'Z)⁻¹Z']ε = V₁² + V₂² + ... + V²_{n-r-1}

is distributed as σ²χ²_{n-r-1}.    •

A confidence ellipsoid for β is easily constructed. It is expressed in terms of the estimated covariance matrix s²(Z'Z)⁻¹, where s² = ε̂'ε̂/(n - r - 1).

Result 7.5. Let Y = Zβ + ε, where Z has full rank r + 1 and ε is N_n(0, σ²I). Then a 100(1 - α)% confidence region for β is given by

    (β - β̂)'Z'Z(β - β̂) ≤ (r + 1)s² F_{r+1,n-r-1}(α)

where F_{r+1,n-r-1}(α) is the upper (100α)th percentile of an F-distribution with r + 1 and n - r - 1 d.f.

Also, simultaneous 100(1 - α)% confidence intervals for the βᵢ are given by

    β̂ᵢ ± √(Var̂(β̂ᵢ)) √((r + 1)F_{r+1,n-r-1}(α)),    i = 0, 1, ..., r

where Var̂(β̂ᵢ) is the diagonal element of s²(Z'Z)⁻¹ corresponding to β̂ᵢ.

Proof. Consider the symmetric square-root matrix (Z'Z)^{1/2} [see (2-22)]. Set V = (Z'Z)^{1/2}(β̂ - β), and note that E(V) = 0 and

    Cov(V) = (Z'Z)^{1/2} Cov(β̂) (Z'Z)^{1/2} = σ²(Z'Z)^{1/2}(Z'Z)⁻¹(Z'Z)^{1/2} = σ²I

V is normally distributed, since it consists of linear combinations of the β̂ᵢ's. Therefore, V'V = (β̂ - β)'Z'Z(β̂ - β) is distributed as σ²χ²_{r+1}. By Result 7.4, (n - r - 1)s² = ε̂'ε̂ is distributed as σ²χ²_{n-r-1}, independently of β̂ and, hence, independently of V'V. Consequently,

    [χ²_{r+1}/(r + 1)] / [χ²_{n-r-1}/(n - r - 1)] = [V'V/(r + 1)]/s²

has an F_{r+1,n-r-1} distribution, and the confidence ellipsoid for β follows. Projecting this ellipsoid with u' = [0, ..., 0, 1, 0, ..., 0] yields

    |βᵢ - β̂ᵢ| ≤ √((r + 1)F_{r+1,n-r-1}(α)) √(Var̂(β̂ᵢ))

where Var̂(β̂ᵢ) is the diagonal element of s²(Z'Z)⁻¹ corresponding to β̂ᵢ.    •
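The sampling properties of Results 7.2 and 7.4 — E(β̂) = β and E(s²) = σ² — are easy to check empirically. A minimal Monte Carlo sketch assuming NumPy (the design matrix, β, and σ below are arbitrary illustrative choices):

```python
import numpy as np

# Monte Carlo check of Result 7.2: E(beta_hat) = beta and E(s^2) = sigma^2.
rng = np.random.default_rng(1)
n, r = 40, 2
Z = np.column_stack([np.ones(n), rng.standard_normal((n, r))])
beta = np.array([2.0, -1.0, 0.5])
sigma = 1.5
ZtZ_inv = np.linalg.inv(Z.T @ Z)

beta_hats, s2_vals = [], []
for _ in range(2000):
    y = Z @ beta + sigma * rng.standard_normal(n)
    b = ZtZ_inv @ Z.T @ y
    resid = y - Z @ b
    beta_hats.append(b)
    s2_vals.append(resid @ resid / (n - r - 1))  # s^2 on n - r - 1 d.f.

print(np.mean(beta_hats, axis=0))  # close to beta
print(np.mean(s2_vals))            # close to sigma^2 = 2.25
```

Averaging over many simulated samples, β̂ centers on β and s² centers on σ², as the results assert.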
The confidence ellipsoid is centered at the maximum likelihood estimate β̂, and its orientation and size are determined by the eigenvalues and eigenvectors of Z'Z. If an eigenvalue is nearly zero, the confidence ellipsoid will be very long in the direction of the corresponding eigenvector.

Practitioners often ignore the "simultaneous" confidence property of the interval estimates in Result 7.5. Instead, they replace (r + 1)F_{r+1,n-r-1}(α) with the one-at-a-time t value t_{n-r-1}(α/2) and use the intervals

    β̂ᵢ ± t_{n-r-1}(α/2) √(Var̂(β̂ᵢ))        (7-14)

when searching for important predictor variables.

Example 7.4 (Fitting a regression model to real-estate data)

The assessment data in Table 7.1 were gathered from 20 homes in a Milwaukee, Wisconsin, neighborhood. Fit the regression model

    Yⱼ = β₀ + β₁zⱼ₁ + β₂zⱼ₂ + εⱼ

where z₁ = total dwelling size (in hundreds of square feet), z₂ = assessed value (in thousands of dollars), and Y = selling price (in thousands of dollars), to these data using the method of least squares.

TABLE 7.1  REAL-ESTATE DATA

    z₁ Total dwelling size (100 ft²)   z₂ Assessed value ($1000)   y Selling price ($1000)
    15.31    57.3    74.8
    15.20    63.8    74.0
    16.25    65.4    72.9
    14.33    57.0    70.0
    14.57    63.8    74.9
    17.33    63.2    76.0
    14.48    60.2    72.9
    14.91    57.7    73.5
    15.25    56.4    74.5
    13.89    55.6    73.5
    15.18    62.6    71.5
    14.44    63.4    71.0
    14.87    60.2    78.9
    18.63    67.2    86.5
    15.20    57.1    68.0
    25.76    89.6   102.0
    19.05    68.6    84.0
    15.37    60.1    69.0
    18.06    66.3    88.0
    16.35    65.8    76.0
A computer calculation yields

    (Z'Z)⁻¹ = [ 5.1523
                 .2544   .0512
                -.1463  -.0172   .0067 ]

and

    β̂ = (Z'Z)⁻¹Z'y = [30.967, 2.634, .045]'

Thus, the fitted equation is

    ŷ = 30.967 + 2.634z₁ + .045z₂
        (7.88)   (.785)    (.285)

with s = 3.473. The numbers in parentheses are the estimated standard deviations of the least squares coefficients. Also, R² = .834, indicating that the data exhibit a strong regression relationship. (See Panel 7.1, which contains the regression analysis of these data using the SAS statistical software package.) If the residuals ε̂ pass the diagnostic checks described in Section 7.6, the fitted
PANEL 7.1  SAS ANALYSIS FOR EXAMPLE 7.4 USING PROC REG.

    title 'Regression Analysis';                       PROGRAM COMMANDS
    data estate;
    infile 'T7-1.dat';
    input z1 z2 y;
    proc reg data = estate;
    model y = z1 z2;

    Model: MODEL1                                      OUTPUT
    Dependent Variable: Y

                      Analysis of Variance
    Source     DF    Sum of Squares    Mean Square    F Value    Prob > F
    Model       2      1032.87506       516.43753      42.828      0.0001
    Error      17       204.99494        12.05853
    C Total    19      1237.87000

    Root MSE    3.473      R-square    0.834
    Dep Mean   76.55000    Adj R-sq    0.8149
    C.V.        4.53630

                      Parameter Estimates
    Variable    DF    Parameter Estimate    Standard Error    T for H0: Parameter = 0    Prob > |T|
    INTERCEP     1         30.967               7.88                 3.929                 0.0011
    z1           1          2.634                .785                3.353                 0.0038
    z2           1           .045                .285                0.158                 0.8760
equation could be used to predict the selling price of another house in the neighborhood from its size and assessed value. We note that a 95% confidence interval for β₂ [see (7-14)] is given by

    β̂₂ ± t₁₇(.025) √(Var̂(β̂₂)) = .045 ± 2.110(.285)

or (-.556, .647). Since the confidence interval includes β₂ = 0, the variable z₂ might be dropped from the regression model and the analysis repeated with the single predictor variable z₁. Given dwelling size, assessed value seems to add little to the prediction of selling price.    •
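The one-at-a-time interval for β₂ quoted above follows directly from (7-14); a short sketch (the estimate, standard error, and quantile t₁₇(.025) = 2.110 are taken from the text rather than computed):

```python
# One-at-a-time 95% confidence interval (7-14) for beta_2 in Example 7.4.
beta2_hat = 0.045      # least squares estimate of beta_2
se_beta2 = 0.285       # its estimated standard deviation, from the text
t_17 = 2.110           # t_17(.025), upper 2.5th percentile with 17 d.f.

half_width = t_17 * se_beta2
interval = (beta2_hat - half_width, beta2_hat + half_width)
print(interval)        # about (-0.556, 0.646); the interval covers beta_2 = 0
```

With unrounded inputs the text reports (-.556, .647); either way the interval covers zero, which is what justifies dropping z₂.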
Likelihood Ratio Tests for the Regression Parameters

Part of regression analysis is concerned with assessing the effects of particular predictor variables on the response variable. One null hypothesis of interest states that certain of the zᵢ's do not influence the response Y. These predictors will be labeled z_{q+1}, z_{q+2}, ..., zᵣ. The statement that z_{q+1}, z_{q+2}, ..., zᵣ do not influence Y translates into the statistical hypothesis

    H₀: β_{q+1} = β_{q+2} = ... = βᵣ = 0   or   H₀: β₍₂₎ = 0        (7-15)

where β₍₂₎ = [β_{q+1}, β_{q+2}, ..., βᵣ]'.

Setting

    Z = [Z₁ ⋮ Z₂],   with Z₁ of dimension n × (q + 1) and Z₂ of dimension n × (r - q)

we can express the general linear model as

    Y = Zβ + ε = [Z₁ ⋮ Z₂][β₍₁₎; β₍₂₎] + ε = Z₁β₍₁₎ + Z₂β₍₂₎ + ε

Under the null hypothesis H₀: β₍₂₎ = 0, Y = Z₁β₍₁₎ + ε. The likelihood ratio test of H₀ is based on the

    Extra sum of squares = SS_res(Z₁) - SS_res(Z)
                         = (y - Z₁β̂₍₁₎)'(y - Z₁β̂₍₁₎) - (y - Zβ̂)'(y - Zβ̂)        (7-16)

where β̂₍₁₎ = (Z₁'Z₁)⁻¹Z₁'y.

Result 7.6. Let Z have full rank r + 1 and ε be distributed as N_n(0, σ²I). The likelihood ratio test of H₀: β₍₂₎ = 0 is equivalent to a test of H₀ based on the extra sum of squares in (7-16) and s² = (y - Zβ̂)'(y - Zβ̂)/(n - r - 1). In particular, the likelihood ratio test rejects H₀ if

    [(SS_res(Z₁) - SS_res(Z))/(r - q)] / s²  >  F_{r-q,n-r-1}(α)
where F_{r-q,n-r-1}(α) is the upper (100α)th percentile of an F-distribution with r - q and n - r - 1 d.f.

Proof. Given the data and the normal assumption, the likelihood associated with the parameters β and σ² is

    L(β, σ²) = (2π)^{-n/2} σ^{-n} e^{-(y - Zβ)'(y - Zβ)/2σ²} ≤ (2π)^{-n/2} (σ̂²)^{-n/2} e^{-n/2}

with the maximum occurring at β̂ = (Z'Z)⁻¹Z'y and σ̂² = (y - Zβ̂)'(y - Zβ̂)/n.

Under the restriction of the null hypothesis, Y = Z₁β₍₁₎ + ε and

    max_{β₍₁₎, σ²} L(β₍₁₎, σ²) = (2π)^{-n/2} (σ̂₁²)^{-n/2} e^{-n/2}

where the maximum occurs at β̂₍₁₎ = (Z₁'Z₁)⁻¹Z₁'y. Moreover,

    σ̂₁² = (y - Z₁β̂₍₁₎)'(y - Z₁β̂₍₁₎)/n

Rejecting H₀: β₍₂₎ = 0 for small values of the likelihood ratio

    max_{β₍₁₎, σ²} L(β₍₁₎, σ²) / max_{β, σ²} L(β, σ²) = (σ̂₁²/σ̂²)^{-n/2} = (1 + (σ̂₁² - σ̂²)/σ̂²)^{-n/2}

is equivalent to rejecting H₀ for large values of (σ̂₁² - σ̂²)/σ̂² or its scaled version,

    F = [n(σ̂₁² - σ̂²)/(r - q)] / [nσ̂²/(n - r - 1)] = [(SS_res(Z₁) - SS_res(Z))/(r - q)] / s²

The preceding F-ratio has an F-distribution with r - q and n - r - 1 d.f. (See [22] or Result 7.11 with m = 1.)    •

Comment. The likelihood ratio test is implemented as follows. To test whether all coefficients in a subset are zero, fit the model with and without the terms corresponding to these coefficients. The improvement in the residual sum of squares (the extra sum of squares) is compared to the residual sum of squares for the full model via the F-ratio. The same procedure applies even in analysis of variance situations where Z is not of full rank.⁴

More generally, it is possible to formulate null hypotheses concerning r - q linear combinations of β of the form H₀: Cβ = A₀. Let the (r - q) × (r + 1) matrix C have full rank. (This null hypothesis reduces to the previous choice when C = [0 ⋮ I_{(r-q)×(r-q)}] and A₀ = 0.) Under the full model, Cβ̂ is distributed as N_{r-q}(Cβ, σ²C(Z'Z)⁻¹C'). We reject H₀: Cβ = A₀

⁴ In situations where Z is not of full rank, rank(Z) replaces r + 1 and rank(Z₁) replaces q + 1 in Result 7.6.
at level α if A₀ does not lie in the 100(1 - α)% confidence ellipsoid for Cβ. Equivalently, we reject H₀: Cβ = A₀ if

    (Cβ̂ - A₀)'(C(Z'Z)⁻¹C')⁻¹(Cβ̂ - A₀) / s²  >  (r - q)F_{r-q,n-r-1}(α)        (7-17)

where s² = (y - Zβ̂)'(y - Zβ̂)/(n - r - 1) and F_{r-q,n-r-1}(α) is the upper (100α)th percentile of an F-distribution with r - q and n - r - 1 d.f. The test in (7-17) is the likelihood ratio test, and the numerator in the F-ratio is the extra residual sum of squares incurred by fitting the model, subject to the restriction that Cβ = A₀. (See [22].)

The next example illustrates how unbalanced experimental designs are easily handled by the general theory just described.

Example 7.5 (Testing the importance of additional predictors using the extra sum-of-squares approach)

Males and females in a large restaurant chain rated the service in three establishments (locations). The service ratings were converted into an index. Table 7.2 contains the data for n = 18 customers. Each data point in the table is categorized according to location (1, 2, or 3) and gender (male = 0 and female = 1). This categorization has the format of a two-way table with an unequal number of observations per cell. For instance, the combination of location 1 and male has 5 responses, while the combination of location 2 and female has 2 responses. Introducing three dummy variables to account for location and two dummy variables to account for gender, we can develop a regression model linking the service index Y to location, gender, and their "interaction" using the design matrix

TABLE 7.2  RESTAURANT-SERVICE DATA

    Location   Gender   Service (Y)
    1          0        15.2
    1          0        21.2
    1          0        27.3
    1          0        21.2
    1          0        21.2
    1          1        36.4
    1          1        92.4
    2          0        27.3
    2          0        15.2
    2          0         9.1
    2          0        18.2
    2          0        50.0
    2          1        44.0
    2          1        63.6
    3          0        15.2
    3          0        30.3
    3          0        36.4
    3          1        40.9
Z (18×12), with columns (constant; location 1, 2, 3; gender; and six location-gender interaction columns), has the distinct rows

    1 1 0 0 1 0 1 0 0 0 0 0    (5 responses)
    1 1 0 0 0 1 0 1 0 0 0 0    (2 responses)
    1 0 1 0 1 0 0 0 1 0 0 0    (5 responses)
    1 0 1 0 0 1 0 0 0 1 0 0    (2 responses)
    1 0 0 1 1 0 0 0 0 0 1 0    (3 responses)
    1 0 0 1 0 1 0 0 0 0 0 1    (1 response)

The coefficient vector can be set out as

    β' = [β₀, β₁, β₂, β₃, τ₁, τ₂, γ₁₁, γ₁₂, γ₂₁, γ₂₂, γ₃₁, γ₃₂]

where the βᵢ (i > 0) represent the effects of the locations on the determination of service, the τᵢ's represent the effects of gender on the service index, and the γᵢₖ's represent the location-gender interaction effects.

The design matrix Z is not of full rank. (For instance, column 1 equals the sum of columns 2-4 or columns 5-6.) In fact, rank(Z) = 6. For the complete model, results from a computer program give

    SS_res(Z) = 2977.4

and n - rank(Z) = 18 - 6 = 12.

The model without the interaction terms has the design matrix Z₁ consisting of the first six columns of Z. We find that

    SS_res(Z₁) = 3419.1

with n - rank(Z₁) = 18 - 4 = 14. To test H₀: γ₁₁ = γ₁₂ = γ₂₁ = γ₂₂ = γ₃₁ = γ₃₂ = 0 (no location-gender interaction), we compute

    F = [(SS_res(Z₁) - SS_res(Z))/(6 - 4)] / s²
      = [(SS_res(Z₁) - SS_res(Z))/2] / [SS_res(Z)/12]
      = [(3419.1 - 2977.4)/2] / (2977.4/12) = .89
The F-ratio may be compared with an appropriate percentage point of an F-distribution with 2 and 12 d.f. This F-ratio is not significant for any reasonable significance level α. Consequently, we conclude that the service index does not depend upon any location-gender interaction, and these terms can be dropped from the model.

Using the extra sum-of-squares approach, we may verify that there is no difference between locations (no location effect), but that gender is significant; that is, males and females do not give the same ratings to service.

In analysis-of-variance situations where the cell counts are unequal, the variation in the response attributable to different predictor variables and their interactions cannot usually be separated into independent amounts. To evaluate the relative influences of the predictors on the response in this case, it is necessary to fit the model with and without the terms in question and compute the appropriate F-test statistics.    •
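The F-ratio of Example 7.5 is a one-line computation from the two residual sums of squares; a sketch:

```python
# Extra sum-of-squares F-ratio for the interaction test of Example 7.5.
ss_res_reduced = 3419.1   # SS_res(Z1): no interaction terms, 18 - 4 = 14 d.f.
ss_res_full = 2977.4      # SS_res(Z):  complete model,       18 - 6 = 12 d.f.

extra_ss = ss_res_reduced - ss_res_full
F = (extra_ss / (6 - 4)) / (ss_res_full / 12)
print(round(F, 2))        # 0.89, not significant with 2 and 12 d.f.
```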
7.5 INFERENCES FROM THE ESTIMATED REGRESSION FUNCTION
Once an investigator is satisfied with the fitted regression model, it can be used to solve two prediction problems. Let z₀' = [1, z₀₁, ..., z₀ᵣ] be selected values for the predictor variables. Then z₀ and β̂ can be used (1) to estimate the regression function β₀ + β₁z₀₁ + ... + βᵣz₀ᵣ at z₀ and (2) to estimate the value of the response Y at z₀.

Estimating the Regression Function at z₀

Let Y₀ denote the value of the response when the predictor variables have values z₀' = [1, z₀₁, ..., z₀ᵣ]. According to the model in (7-3), the expected value of Y₀ is

    E(Y₀ | z₀) = β₀ + β₁z₀₁ + ... + βᵣz₀ᵣ = z₀'β

Its least squares estimate is z₀'β̂.

Result 7.7. For the linear regression model in (7-3), z₀'β̂ is the unbiased linear estimator of E(Y₀ | z₀) with minimum variance, Var(z₀'β̂) = z₀'(Z'Z)⁻¹z₀σ². If the errors ε are normally distributed, then a 100(1 - α)% confidence interval for E(Y₀ | z₀) = z₀'β is provided by

    z₀'β̂ ± t_{n-r-1}(α/2) √(z₀'(Z'Z)⁻¹z₀ s²)        (7-18)

where t_{n-r-1}(α/2) is the upper 100(α/2)th percentile of a t-distribution with n - r - 1 d.f.

Proof. For a fixed z₀, z₀'β̂ is just a linear combination of the β̂ᵢ's, so Result 7.3 applies. Also, Var(z₀'β̂) = z₀' Cov(β̂) z₀ = z₀'(Z'Z)⁻¹z₀σ², since Cov(β̂) = σ²(Z'Z)⁻¹ by Result 7.2. Under the further assumption that ε is normally distributed, Result 7.4 asserts that β̂ is N_{r+1}(β, σ²(Z'Z)⁻¹) independently of s²/σ², which is distributed as χ²_{n-r-1}/(n - r - 1).
Consequently, the linear combination z₀'β̂ is N(z₀'β, σ²z₀'(Z'Z)⁻¹z₀) and

    [(z₀'β̂ - z₀'β)/√(σ²z₀'(Z'Z)⁻¹z₀)] / √(s²/σ²) = (z₀'β̂ - z₀'β)/√(s²z₀'(Z'Z)⁻¹z₀)

is distributed as t_{n-r-1}. The confidence interval follows.    •

Forecasting a New Observation at z₀

Prediction of a new observation, such as Y₀, at z₀' = [1, z₀₁, ..., z₀ᵣ] is more uncertain than estimating the expected value of Y₀. According to the regression model of (7-3),

    Y₀ = z₀'β + ε₀

or

    (new response Y₀) = (expected value of Y₀ at z₀) + (new error)

where ε₀ is distributed as N(0, σ²) and is independent of ε and, hence, of β̂ and s². The errors ε influence the estimators β̂ and s² through the responses Y, but ε₀ does not.

Result 7.8. Given the linear regression model of (7-3), a new observation Y₀ has the unbiased predictor

    z₀'β̂ = β̂₀ + β̂₁z₀₁ + ... + β̂ᵣz₀ᵣ

The variance of the forecast error Y₀ - z₀'β̂ is

    Var(Y₀ - z₀'β̂) = σ²(1 + z₀'(Z'Z)⁻¹z₀)

When the errors ε have a normal distribution, a 100(1 - α)% prediction interval for Y₀ is given by

    z₀'β̂ ± t_{n-r-1}(α/2) √(s²(1 + z₀'(Z'Z)⁻¹z₀))

where t_{n-r-1}(α/2) is the upper 100(α/2)th percentile of a t-distribution with n - r - 1 degrees of freedom.

Proof. We forecast Y₀ by z₀'β̂, which estimates E(Y₀ | z₀). By Result 7.7, z₀'β̂ has E(z₀'β̂) = z₀'β and Var(z₀'β̂) = z₀'(Z'Z)⁻¹z₀σ². The forecast error is then

    Y₀ - z₀'β̂ = z₀'β + ε₀ - z₀'β̂ = ε₀ + z₀'(β - β̂)

Thus, E(Y₀ - z₀'β̂) = E(ε₀) + E(z₀'(β - β̂)) = 0, so the predictor is unbiased. Since ε₀ and β̂ are independent,

    Var(Y₀ - z₀'β̂) = Var(ε₀) + Var(z₀'β̂) = σ² + z₀'(Z'Z)⁻¹z₀σ² = σ²(1 + z₀'(Z'Z)⁻¹z₀)

If it is further assumed that ε has a normal distribution, then β̂ is normally distributed, and so is the linear combination Y₀ - z₀'β̂. Consequently, (Y₀ - z₀'β̂)/√(σ²(1 + z₀'(Z'Z)⁻¹z₀)) is distributed as N(0, 1). Dividing this ratio by √(s²/σ²), which is distributed as √(χ²_{n-r-1}/(n - r - 1)), we obtain

    (Y₀ - z₀'β̂) / √(s²(1 + z₀'(Z'Z)⁻¹z₀))

which is distributed as t_{n-r-1}. The prediction interval follows immediately.    •
The prediction interval for Y₀ is wider than the confidence interval for estimating the value of the regression function E(Y₀ | z₀) = z₀'β. The additional uncertainty in forecasting Y₀, which is represented by the extra term s² in the expression s²(1 + z₀'(Z'Z)⁻¹z₀), comes from the presence of the unknown error term ε₀.
(I nterva l esti mates for a mean response and a futu re response)
Companies considering the purchase of a computer must first assess their future needs in order to determine the proper equipment. A computer scientist collected data from seven similar company sites so that a forecast equation of computer-hardware requirements for inventory management could be developed. The data are given in Table 7.3 for

z1 = customer orders (in thousands)
z2 = add-delete item count (in thousands)
Y  = CPU (central processing unit) time (in hours)

TABLE 7.3  COMPUTER DATA

z1 (Orders)   z2 (Add-delete items)   Y (CPU time)
   123.5             2.108               141.5
   146.1             9.213               168.9
   133.9             1.905               154.8
   128.5              .815               146.5
   151.5             1.061               172.8
   136.2             8.603               160.1
    92.0             1.125               108.5

Source: Data taken from H. P. Artis, Forecasting Computer Requirements: A Forecaster's Dilemma (Piscataway, NJ: Bell Laboratories, 1979).
Construct a 95% confidence interval for the mean CPU time, E(Y0 | z0) = β0 + β1 z01 + β2 z02, at z0' = [1, 130, 7.5]. Also, find a 95% prediction interval for a new facility's CPU requirement corresponding to the same z0.

A computer program provides the estimated regression function

ŷ = 8.42 + 1.08 z1 + .42 z2

(Z'Z)⁻¹ =
[  8.17969   −.06411    .08831 ]
[  −.06411    .00052   −.00107 ]
[   .08831   −.00107    .01440 ]

and s = 1.204. Consequently,

z0'β̂ = 8.42 + 1.08(130) + .42(7.5) = 151.97
and s√(z0'(Z'Z)⁻¹z0) = 1.204(.58928) = .71. We have t4(.025) = 2.776, so the 95% confidence interval for the mean CPU time at z0 is

z0'β̂ ± t4(.025) s√(z0'(Z'Z)⁻¹z0) = 151.97 ± 2.776(.71)   or   (150.00, 153.94)

Since s√(1 + z0'(Z'Z)⁻¹z0) = (1.204)(1.16071) = 1.40, a 95% prediction interval for the CPU time at a new facility with conditions z0 is

z0'β̂ ± t4(.025) s√(1 + z0'(Z'Z)⁻¹z0) = 151.97 ± 2.776(1.40)   or   (148.08, 155.86) •
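As a numerical check on Example 7.6, the fit and both intervals can be reproduced from Table 7.3 with a short script. This is a sketch: the quantile t4(.025) = 2.776 is hard-coded from standard t tables, and numpy's least squares routine stands in for the unnamed "computer program."

```python
import numpy as np

# Table 7.3 computer data: orders, add-delete items, CPU time
z1 = np.array([123.5, 146.1, 133.9, 128.5, 151.5, 136.2, 92.0])
z2 = np.array([2.108, 9.213, 1.905, 0.815, 1.061, 8.603, 1.125])
y = np.array([141.5, 168.9, 154.8, 146.5, 172.8, 160.1, 108.5])

Z = np.column_stack([np.ones_like(z1), z1, z2])
n, k = Z.shape                                  # n = 7, k = r + 1 = 3
beta = np.linalg.lstsq(Z, y, rcond=None)[0]
resid = y - Z @ beta
s = np.sqrt(resid @ resid / (n - k))            # residual standard deviation

z0 = np.array([1.0, 130.0, 7.5])
yhat0 = z0 @ beta                               # estimated mean response at z0
q = z0 @ np.linalg.inv(Z.T @ Z) @ z0            # z0'(Z'Z)^{-1} z0
t = 2.776                                       # t_4(.025), upper 2.5% point

ci = (yhat0 - t * s * np.sqrt(q), yhat0 + t * s * np.sqrt(q))          # mean response
pi = (yhat0 - t * s * np.sqrt(1 + q), yhat0 + t * s * np.sqrt(1 + q))  # new observation
```

The prediction interval is wider than the confidence interval by exactly the extra "1 +" under the square root, mirroring Result 7.8.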
7.6 MODEL CHECKING AND OTHER ASPECTS OF REGRESSION

Does the Model Fit?
Assuming that the model is "correct," we have used the estimated regression function to make inferences. Of course, it is imperative to examine the adequacy of the model before the estimated function becomes a permanent part of the decision-making apparatus.

All the sample information on lack of fit is contained in the residuals

ε̂1 = y1 − β̂0 − β̂1 z11 − ... − β̂r z1r
ε̂2 = y2 − β̂0 − β̂1 z21 − ... − β̂r z2r
 ⋮
ε̂n = yn − β̂0 − β̂1 zn1 − ... − β̂r znr

or

ε̂ = [I − Z(Z'Z)⁻¹Z'] y = [I − H] y        (7-19)
If the model is valid, each residual ε̂j is an estimate of the error εj, which is assumed to be a normal random variable with mean zero and variance σ². Although the residuals ε̂ have expected value 0, their covariance matrix σ²[I − Z(Z'Z)⁻¹Z'] = σ²[I − H] is not diagonal. Residuals have unequal variances and nonzero correlations. Fortunately, the correlations are often small and the variances are nearly equal.

Because the residuals ε̂ have covariance matrix σ²[I − H], the variances of the ε̂j can vary greatly if the diagonal elements of H, the leverages hjj, are substantially different. Consequently, many statisticians prefer graphical diagnostics based on studentized residuals. Using the residual mean square s² as an estimate of σ², we have

V̂ar(ε̂j) = s²(1 − hjj),   j = 1, 2, ..., n        (7-20)

and the studentized residuals are

ε̂j* = ε̂j / √( s²(1 − hjj) ),   j = 1, 2, ..., n        (7-21)

We expect the studentized residuals to look, approximately, like independent drawings from an N(0, 1) distribution. Some software packages go one step further and studentize ε̂j using the delete-one estimated variance s²(j), which is the residual mean square when the jth observation is dropped from the analysis.
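The quantities in (7-19)-(7-21) are direct to compute. The following sketch uses the computer data of Example 7.6 (the design with intercept, orders, and add-delete items is carried over from that example):

```python
import numpy as np

# Design and response from Example 7.6 (Table 7.3)
Z = np.array([[1, 123.5, 2.108], [1, 146.1, 9.213], [1, 133.9, 1.905],
              [1, 128.5, 0.815], [1, 151.5, 1.061], [1, 136.2, 8.603],
              [1, 92.0, 1.125]])
y = np.array([141.5, 168.9, 154.8, 146.5, 172.8, 160.1, 108.5])

H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T          # hat matrix Z(Z'Z)^{-1}Z'
h = np.diag(H)                                 # leverages h_jj
resid = y - H @ y                              # residuals, (7-19)
n, k = Z.shape
s2 = resid @ resid / (n - k)                   # residual mean square
studentized = resid / np.sqrt(s2 * (1 - h))    # studentized residuals, (7-21)
```

The trace of H equals r + 1, so the average leverage is (r + 1)/n, a fact used again in the leverage discussion below.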
Residuals should be plotted in various ways to detect possible anomalies. For general diagnostic purposes, the following are useful graphs:

1. Plot the residuals ε̂j against the predicted values ŷj = β̂0 + β̂1 zj1 + ... + β̂r zjr. Departures from the assumptions of the model are typically indicated by two types of phenomena:
   (a) A dependence of the residuals on the predicted value. This is illustrated in Figure 7.2(a). The numerical calculations are incorrect, or a β0 term has been omitted from the model.
   (b) The variance is not constant. The pattern of residuals may be funnel shaped, as in Figure 7.2(b), so that there is large variability for large ŷ and small variability for small ŷ. If this is the case, the variance of the error is not constant, and transformations or a weighted least squares approach (or both) are required. (See Exercise 7.3.) In Figure 7.2(d), the residuals form a horizontal band. This is ideal and indicates equal variances and no dependence on ŷ.
2. Plot the residuals ε̂j against a predictor variable, such as z1, or products of predictor variables, such as z1² or z1 z2. A systematic pattern in these plots suggests the need for more terms in the model. This situation is illustrated in Figure 7.2(c).
3. Q-Q plots and histograms. Do the errors appear to be normally distributed? To answer this question, the residuals ε̂j or ε̂j* can be examined using the techniques discussed in Section 4.6. The Q-Q plots, histograms, and dot diagrams help to detect the presence of unusual observations or severe departures from normality that may require special attention in the analysis. If n is large, minor departures from normality will not greatly affect inferences about β.
[Figure 7.2: Residual plots. Panels (a)-(d) show residuals plotted against the predicted values ŷ.]
4. Plot the residuals versus time. The assumption of independence is crucial, but hard to check. If the data are naturally chronological, a plot of the residuals versus time may reveal a systematic pattern. (A plot of the positions of the residuals in space may also reveal associations among the errors.) For instance, residuals that increase over time indicate a strong positive dependence. A statistical test of independence can be constructed from the first autocorrelation,

r1 = Σ_{j=2}^{n} ε̂j ε̂_{j−1} / Σ_{j=1}^{n} ε̂j²        (7-22)

of residuals from adjacent periods. A popular test based on the statistic

Σ_{j=2}^{n} (ε̂j − ε̂_{j−1})² / Σ_{j=1}^{n} ε̂j² ≈ 2(1 − r1)

is called the Durbin-Watson test. (See [13] for a description of this test and tables of critical values.)

Example 7.7 (Residual plots)
Three residual plots for the computer data discussed in Example 7.6 are shown in Figure 7.3. The sample size n = 7 is really too small to allow definitive judgments; however, it appears as if the regression assumptions are tenable. •
[Figure 7.3: Residual plots for the computer data of Example 7.6. Residuals are plotted against the predictors z1 and z2 and against the predicted values ŷ.]
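The first autocorrelation in (7-22) and the Durbin-Watson statistic can be written as two small functions of a residual series. The function names are illustrative, not from the text:

```python
import numpy as np

def first_autocorrelation(resid):
    # r1 from (7-22): adjacent-period cross products over the sum of squares
    resid = np.asarray(resid, dtype=float)
    return (resid[1:] * resid[:-1]).sum() / (resid ** 2).sum()

def durbin_watson(resid):
    # Durbin-Watson statistic, approximately 2 * (1 - r1)
    resid = np.asarray(resid, dtype=float)
    return (np.diff(resid) ** 2).sum() / (resid ** 2).sum()
```

A strongly alternating residual sequence gives r1 near −1 and a Durbin-Watson value near 4; residuals that drift slowly give r1 near 1 and a value near 0.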
If several observations of the response are available for the same values of the predictor variables, then a formal test for lack of fit can be carried out. (See [12] for a discussion of the pure-error lack-of-fit test.)

Leverage and Influence

Although a residual analysis is useful in assessing the fit of a model, departures from the regression model are often hidden by the fitting process. For example, there may be "outliers" in either the response or explanatory variables that can have a considerable effect on the analysis yet are not easily detected from an examination of residual plots. In fact, these outliers may determine the fit.

The leverage hjj is associated with the jth data point and measures, in the space of the explanatory variables, how far the jth observation is from the other n − 1 observations. For simple linear regression with one explanatory variable z,

hjj = 1/n + (zj − z̄)² / Σ_{k=1}^{n} (zk − z̄)²

The average leverage is (r + 1)/n. (See Exercise 7.8.)

For a data point with high leverage, hjj approaches 1 and the prediction at zj is almost solely determined by yj, the rest of the data having little to say about the matter. This follows because (change in ŷj) = hjj (change in yj), provided that other y values remain fixed.

Observations that significantly affect inferences drawn from the data are said to be influential. Methods for assessing influence are typically based on the change in the vector of parameter estimates, β̂, when observations are deleted. Plots based upon leverage and influence statistics and their use in diagnostic checking of regression models are described in [2], [4], and [9]. These references are recommended for anyone involved in an analysis of regression models.

If, after the diagnostic checks, no serious violations of the assumptions are detected, we can make inferences about β and the future Y values with some assurance that we will not be misled.

Additional Problems in Linear Regression
We shall briefly discuss several important aspects of regression that deserve and receive extensive treatments in texts devoted to regression analysis. (See [9], [10], [12], and [20].)

Selecting predictor variables from a large set. In practice, it is often difficult to formulate an appropriate regression function immediately. Which predictor variables should be included? What form should the regression function take?

When the list of possible predictor variables is very large, not all of the variables can be included in the regression function. Techniques and computer programs designed to select the "best" subset of predictors are now readily available. The good ones try all subsets: z1 alone, z2 alone, ..., z1 and z2, ... . The best choice is decided by
examining some criterion quantity like R². [See (7-9).] However, R² always increases with the inclusion of additional predictor variables. Although this problem can be circumvented by using the adjusted R², R̄² = 1 − (1 − R²)(n − 1)/(n − r − 1), a better statistic for selecting variables seems to be Mallow's Cp statistic,

Cp = (residual sum of squares for subset model with p parameters, including an intercept) / (residual variance for full model) − (n − 2p)

A plot of the pairs (p, Cp), one for each subset of predictors, will indicate models that forecast the observed responses well. Good models typically have (p, Cp) coordinates near the 45° line. In Figure 7.4, we have circled the point corresponding to the "best" subset of predictor variables.

If the list of predictor variables is very long, cost considerations limit the number of models that can be examined. Another approach, called stepwise regression (see [2]), attempts to select important predictors without considering all the possibilities. The procedure can be described by listing the basic steps (algorithm) involved in the computations:

Step 1. All possible simple linear regressions are considered. The predictor variable that explains the largest significant proportion of the variation in Y (the variable that has the largest correlation with the response) is the first variable to enter the regression function.
[Figure 7.4: Cp plot for the computer data from Example 7.6 with three predictor variables (z1 = orders, z2 = add-delete count, z3 = number of items; see the example and original source). Numbers in parentheses correspond to the predictor variables in each subset; the point for the "best" subset is circled.]
Step 2. The next variable to enter is the one (out of those not yet included) that makes the largest significant contribution to the regression sum of squares. The significance of the contribution is determined by an F-test. (See Result 7.6.) The value of the F-statistic that must be exceeded before the contribution of a variable is deemed significant is often called the F to enter.

Step 3. Once an additional variable has been included in the equation, the individual contributions to the regression sum of squares of the other variables already in the equation are checked for significance using F-tests. If the F-statistic is less than the one (called the F to remove) corresponding to a prescribed significance level, the variable is deleted from the regression function.

Step 4. Steps 2 and 3 are repeated until all possible additions are nonsignificant and all possible deletions are significant. At this point the selection stops.

Because of the step-by-step procedure, there is no guarantee that this approach will select, for example, the best three variables for prediction. A second drawback is that the (automatic) selection methods are not capable of indicating when transformations of variables are useful.
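The four steps can be sketched as a small forward/backward selection routine. The F-to-enter and F-to-remove thresholds and the dict-based interface are illustrative choices, not part of the text, and like any stepwise procedure this sketch offers no optimality guarantee:

```python
import numpy as np

def rss(Z, y):
    # Residual sum of squares of the least squares fit of y on the columns of Z
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    r = y - Z @ beta
    return r @ r

def stepwise(z_cols, y, f_enter=4.0, f_remove=4.0):
    # Steps 1-4: forward entry and backward removal by partial F-tests.
    # z_cols maps a variable name to its data column.
    n = len(y)
    selected = []

    def design(names):
        return np.column_stack([np.ones(n)] + [z_cols[k] for k in names])

    while True:
        changed = False
        # Steps 1-2: enter the excluded variable with the largest partial F
        rss_cur = rss(design(selected), y)
        best, best_f = None, f_enter
        for name in z_cols:
            if name in selected:
                continue
            trial_rss = rss(design(selected + [name]), y)
            df = n - len(selected) - 2          # residual d.f. after entry
            f = (rss_cur - trial_rss) / (trial_rss / df)
            if f > best_f:
                best, best_f = name, f
        if best is not None:
            selected.append(best)
            changed = True
        # Step 3: remove any included variable whose partial F is too small
        for name in list(selected):
            rest = [k for k in selected if k != name]
            full_rss = rss(design(selected), y)
            df = n - len(selected) - 1
            f = (rss(design(rest), y) - full_rss) / (full_rss / df)
            if f < f_remove:
                selected.remove(name)
                changed = True
        if not changed:                          # Step 4: stop
            return selected
```

With a response driven almost entirely by one predictor, only that predictor survives the entry and removal tests.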
Colinearity. If Z is not of full rank, some linear combination, such as Za, must equal 0. In this situation, the columns are said to be colinear. This implies that Z'Z does not have an inverse. For most regression analyses, it is unlikely that Za = 0 exactly. Yet, if linear combinations of the columns of Z exist that are nearly 0, the calculation of (Z'Z)⁻¹ is numerically unstable. Typically, the diagonal entries of (Z'Z)⁻¹ will be large. This yields large estimated variances for the β̂i, and it is then difficult to detect the "significant" regression coefficients β̂i. The problems caused by colinearity can be overcome somewhat by (1) deleting one of a pair of predictor variables that are strongly correlated or (2) relating the response Y to the principal components of the predictor variables; that is, the rows zj' of Z are treated as a sample, and the first few principal components are calculated as is subsequently described in Section 8.3. The response Y is then regressed on these new predictor variables.
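A small numerical illustration of this instability: making the third column nearly a copy of the second inflates the diagonal entries of (Z'Z)⁻¹, and with them the estimated coefficient variances. The particular columns below are fabricated for illustration:

```python
import numpy as np

z = np.linspace(0.0, 1.0, 10)
well_conditioned = np.column_stack([np.ones(10), z, np.sin(6 * z)])
near_colinear = np.column_stack([np.ones(10), z, z + 1e-4 * np.sin(6 * z)])

# Diagonal entries of (Z'Z)^{-1}; Var(beta_i-hat) = sigma^2 times these entries
d_ok = np.diag(np.linalg.inv(well_conditioned.T @ well_conditioned))
d_bad = np.diag(np.linalg.inv(near_colinear.T @ near_colinear))
```

The near-colinear design produces diagonal entries orders of magnitude larger, so even a precisely estimated σ² yields huge standard errors for the affected coefficients.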
Bias caused by a misspecified model. Suppose some important predictor variables are omitted from the proposed regression model. That is, suppose the true model has Z = [Z1 | Z2] with rank r + 1 and

Y (n×1) = [ Z1 (n×(q+1)) | Z2 (n×(r−q)) ] [ β(1) ((q+1)×1) ; β(2) ((r−q)×1) ] + ε (n×1)
        = Z1 β(1) + Z2 β(2) + ε        (7-23)

where E(ε) = 0 and Var(ε) = σ²I. However, the investigator unknowingly fits a model using only the first q predictors by minimizing the error sum of squares (Y − Z1 β(1))'(Y − Z1 β(1)). The least squares estimator of β(1) is β̂(1) = (Z1'Z1)⁻¹Z1'Y. Then, unlike the situation when the model is correct,

E(β̂(1)) = (Z1'Z1)⁻¹Z1'E(Y) = (Z1'Z1)⁻¹Z1'(Z1 β(1) + Z2 β(2) + E(ε))
         = β(1) + (Z1'Z1)⁻¹Z1'Z2 β(2)        (7-24)
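Equation (7-24) can be checked numerically: with a noiseless response, the coefficients fitted from Z1 alone equal β(1) plus the bias term exactly. The specific design below is fabricated for illustration:

```python
import numpy as np

n = 12
z = np.arange(n, dtype=float)
Z1 = np.column_stack([np.ones(n), z])        # predictors the investigator keeps
Z2 = (z ** 2).reshape(-1, 1)                 # the omitted predictor
beta1 = np.array([1.0, 2.0])
beta2 = np.array([0.5])
y = Z1 @ beta1 + Z2 @ beta2                  # noiseless response from the true model

b1 = np.linalg.lstsq(Z1, y, rcond=None)[0]   # biased estimator of beta(1)
bias = np.linalg.inv(Z1.T @ Z1) @ Z1.T @ Z2 @ beta2   # bias term in (7-24)
```

Because z² is correlated with [1, z], the bias term is nonzero and the fitted intercept and slope are misleading, exactly as the text warns.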
That is, β̂(1) is a biased estimator of β(1) unless the columns of Z1 are perpendicular to those of Z2 (that is, Z1'Z2 = 0). If important variables are missing from the model, the least squares estimates may be misleading.

7.7 MULTIVARIATE MULTIPLE REGRESSION

In this section, we consider the problem of modeling the relationship between m responses Y1, Y2, ..., Ym and a single set of predictor variables z1, z2, ..., zr. Each response is assumed to follow its own regression model, so that

Y1 = β01 + β11 z1 + ... + βr1 zr + ε1
Y2 = β02 + β12 z1 + ... + βr2 zr + ε2
 ⋮
Ym = β0m + β1m z1 + ... + βrm zr + εm        (7-25)

The error term ε' = [ε1, ε2, ..., εm] has E(ε) = 0 and Var(ε) = Σ. Thus, the error terms associated with different responses may be correlated.

To establish notation conforming to the classical linear regression model, let [zj0, zj1, ..., zjr] denote the values of the predictor variables for the jth trial, let Yj' = [Yj1, Yj2, ..., Yjm] be the responses, and let εj' = [εj1, εj2, ..., εjm] be the errors. In matrix notation, the design matrix

Z (n × (r+1)) =
[ z10  z11  ...  z1r ]
[ z20  z21  ...  z2r ]
[  ⋮                 ]
[ zn0  zn1  ...  znr ]

is the same as that for the single-response regression model. [See (7-3).] The other matrix quantities have multivariate counterparts. Set

Y (n × m) =
[ Y11  Y12  ...  Y1m ]
[ Y21  Y22  ...  Y2m ]
[  ⋮                 ]
[ Yn1  Yn2  ...  Ynm ]  = [ Y(1) | Y(2) | ... | Y(m) ]

β ((r+1) × m) =
[ β01  β02  ...  β0m ]
[ β11  β12  ...  β1m ]
[  ⋮                 ]
[ βr1  βr2  ...  βrm ]  = [ β(1) | β(2) | ... | β(m) ]

ε (n × m) =
[ ε11  ε12  ...  ε1m ]
[ ε21  ε22  ...  ε2m ]
[  ⋮                 ]
[ εn1  εn2  ...  εnm ]  = [ ε(1) | ε(2) | ... | ε(m) ],  with jth row εj'

The multivariate linear regression model is

Y = Zβ + ε,   E(ε(i)) = 0,   Cov(ε(i), ε(k)) = σik I,   i, k = 1, 2, ..., m        (7-26)

The m observations on the jth trial have covariance matrix Σ = {σik}, but observations from different trials are uncorrelated. Here β and σik are unknown parameters; the design matrix Z has jth row [zj0, zj1, ..., zjr].
Simply stated, the ith response Y(i) follows the linear regression model

Y(i) = Zβ(i) + ε(i),   i = 1, 2, ..., m        (7-27)

with Cov(ε(i)) = σii I. However, the errors for different responses on the same trial can be correlated.

Given the outcomes Y and the values of the predictor variables Z with full column rank, we determine the least squares estimates β̂(i) exclusively from the observations Y(i) on the ith response. In conformity with the single-response solution, we take

β̂(i) = (Z'Z)⁻¹Z'Y(i)        (7-28)

Collecting these univariate least squares estimates, we obtain

β̂ = [ β̂(1) | β̂(2) | ... | β̂(m) ] = (Z'Z)⁻¹Z' [ Y(1) | Y(2) | ... | Y(m) ]

or

β̂ = (Z'Z)⁻¹Z'Y        (7-29)
For any choice of parameters B = [ b(1) | b(2) | ... | b(m) ], the matrix of errors is Y − ZB. The error sum of squares and cross products matrix is

(Y − ZB)'(Y − ZB) =
[ (Y(1) − Zb(1))'(Y(1) − Zb(1))   ...   (Y(1) − Zb(1))'(Y(m) − Zb(m)) ]
[              ⋮                                        ⋮             ]
[ (Y(m) − Zb(m))'(Y(1) − Zb(1))  ...   (Y(m) − Zb(m))'(Y(m) − Zb(m)) ]        (7-30)

The selection b(i) = β̂(i) minimizes the ith diagonal sum of squares (Y(i) − Zb(i))'(Y(i) − Zb(i)). Consequently, tr[(Y − ZB)'(Y − ZB)] is minimized by the choice B = β̂. Also, the generalized variance |(Y − ZB)'(Y − ZB)| is minimized by the least squares estimates β̂. (See Exercise 7.11 for an additional generalized sum of squares property.)

Using the least squares estimates β̂, we can form the matrices of

Predicted values:  Ŷ = Zβ̂ = Z(Z'Z)⁻¹Z'Y
Residuals:         ε̂ = Y − Ŷ = [I − Z(Z'Z)⁻¹Z']Y        (7-31)
The orthogonality conditions among the residuals, predicted values, and columns of Z, which hold in classical linear regression, hold in multivariate multiple regression. They follow from Z'[I − Z(Z'Z)⁻¹Z'] = Z' − Z' = 0. Specifically,

Z'ε̂ = Z'[I − Z(Z'Z)⁻¹Z']Y = 0        (7-32)

so the residuals ε̂(i) are perpendicular to the columns of Z. Also,

Ŷ'ε̂ = β̂'Z'[I − Z(Z'Z)⁻¹Z']Y = 0        (7-33)

confirming that the predicted values Ŷ(i) are perpendicular to all residual vectors ε̂(k). Because Y = Ŷ + ε̂,

Y'Y = (Ŷ + ε̂)'(Ŷ + ε̂) = Ŷ'Ŷ + ε̂'ε̂ + 0 + 0'

or

Y'Y = Ŷ'Ŷ + ε̂'ε̂

( total sum of squares and cross products ) = ( predicted sum of squares and cross products ) + ( residual (error) sum of squares and cross products )        (7-34)

The residual sum of squares and cross products can also be written as

ε̂'ε̂ = Y'Y − Ŷ'Ŷ = Y'Y − β̂'Z'Zβ̂        (7-35)

Example 7.8 (Fitting a multivariate straight-line regression model)
To illustrate the calculations of β̂, Ŷ, and ε̂, we fit a straight-line regression model (see Panel 7.2 for SAS output),

Yj1 = β01 + β11 zj1 + εj1
Yj2 = β02 + β12 zj1 + εj2,   j = 1, 2, ..., 5

to two responses Y1 and Y2 using the data in Example 7.3. These data, augmented by observations on an additional response, are as follows:

z1:   0    1    2    3    4
y1:   1    4    3    8    9
y2:  −1   −1    2    3    2

The design matrix Z remains unchanged from the single-response problem. We find that

Z' = [ 1  1  1  1  1 ]
     [ 0  1  2  3  4 ]

(Z'Z)⁻¹ = [  .6  −.2 ]
          [ −.2   .1 ]
PANEL 7.2  SAS ANALYSIS FOR EXAMPLE 7.8 USING PROC GLM
PROGRAM COMMANDS:

title 'Multivariate Regression Analysis';
data mra;
infile 'Example 7-8 data';
input y1 y2 z1;
proc glm data = mra;
model y1 y2 = z1/ss3;
manova h = z1/printe;

OUTPUT:

General Linear Models Procedure

Dependent Variable: Y1
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1      40.00000000   40.00000000     20.00   0.0208
Error              3       6.00000000    2.00000000
Corrected Total    4      46.00000000

R-Square 0.869565   C.V. 28.28427   Root MSE 1.414214   Y1 Mean 5.00000000

Source   DF   Type III SS   Mean Square   F Value   Pr > F
Z1        1   40.00000000   40.00000000     20.00   0.0208

Parameter    T for H0: Parameter = 0   Pr > |T|   Std Error of Estimate
Intercept                       0.91     0.4286              1.09544512
Z1                              4.47     0.0208              0.44721360

Dependent Variable: Y2
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1      10.00000000   10.00000000      7.50   0.0714
Error              3       4.00000000    1.33333333
Corrected Total    4      14.00000000

R-Square 0.714286   C.V. 115.4701   Root MSE 1.154701   Y2 Mean 1.00000000

Source   DF   Type III SS   Mean Square   F Value   Pr > F
Z1        1   10.00000000   10.00000000      7.50   0.0714

Parameter    T for H0: Parameter = 0   Pr > |T|   Std Error of Estimate
Intercept                      -1.12     0.3450              0.89442719
Z1                              2.74     0.0714              0.36514837

Manova Test Criteria and Exact F Statistics for the Hypothesis of no Overall Z1 Effect
H = Type III SS&CP Matrix for Z1   E = Error SS&CP Matrix
S = 1   M = 0   N = 0

Statistic                  Value         F        Num DF   Den DF   Pr > F
Wilks' Lambda              0.06250000    15.0000       2        2   0.0625
Pillai's Trace             0.93750000    15.0000       2        2   0.0625
Hotelling-Lawley Trace    15.00000000    15.0000       2        2   0.0625
Roy's Greatest Root       15.00000000    15.0000       2        2   0.0625
and

β̂(2) = (Z'Z)⁻¹Z'y(2) = [  .6  −.2 ] [  5 ]  =  [ −1 ]
                       [ −.2   .1 ] [ 20 ]     [  1 ]

From Example 7.3,

β̂(1) = (Z'Z)⁻¹Z'y(1) = [ 1 ]
                       [ 2 ]

Hence,

β̂ = [ β̂(1) | β̂(2) ] = [ 1  −1 ]  =  (Z'Z)⁻¹Z' [ y(1) | y(2) ]
                      [ 2   1 ]

The fitted values are generated from ŷ1 = 1 + 2z1 and ŷ2 = −1 + z1. Collectively,

Ŷ = Zβ̂ =
[ 1  −1 ]
[ 3   0 ]
[ 5   1 ]
[ 7   2 ]
[ 9   3 ]
and

ε̂ = Y − Ŷ =
[  0   0 ]
[  1  −1 ]
[ −2   1 ]
[  1   1 ]
[  0  −1 ]

Note that

ε̂'Ŷ = [ 0  0 ]
      [ 0  0 ]

Since

Y'Y = [ 171  43 ]     Ŷ'Ŷ = [ 165  45 ]     ε̂'ε̂ = [  6  −2 ]
      [  43  19 ]            [  45  15 ]            [ −2   4 ]

the sum of squares and cross-products decomposition

Y'Y = Ŷ'Ŷ + ε̂'ε̂

is easily verified. •
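The matrices of Example 7.8 can be verified in a few lines. This sketch computes β̂ exactly as in (7-29), with one least squares fit per response column:

```python
import numpy as np

# Example 7.8 data: one predictor, two responses
z1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([[1.0, -1.0], [4.0, -1.0], [3.0, 2.0], [8.0, 3.0], [9.0, 2.0]])

Z = np.column_stack([np.ones(5), z1])
B = np.linalg.inv(Z.T @ Z) @ Z.T @ Y    # beta-hat, (7-29)
Yhat = Z @ B                             # predicted values, (7-31)
E = Y - Yhat                             # residual matrix
```

The assertions below confirm the orthogonality conditions (7-32)-(7-33) and the decomposition (7-34) for these data.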
Result 7.9. For the least squares estimator β̂ = [ β̂(1) | β̂(2) | ... | β̂(m) ] determined under the multivariate multiple regression model (7-26) with full rank(Z) = r + 1, r + 1 < n,

E(β̂) = β   and   Cov(β̂(i), β̂(k)) = σik (Z'Z)⁻¹,   i, k = 1, 2, ..., m

The residuals ε̂ = [ ε̂(1) | ε̂(2) | ... | ε̂(m) ] = Y − Zβ̂ satisfy E(ε̂(i)) = 0 and E(ε̂(i)'ε̂(k)) = (n − r − 1)σik, so

E(ε̂) = 0   and   E( (n − r − 1)⁻¹ ε̂'ε̂ ) = Σ

Also, ε̂ and β̂ are uncorrelated.
Proof. The ith response follows the multiple regression model

Y(i) = Zβ(i) + ε(i),   E(ε(i)) = 0,   and   E(ε(i) ε(i)') = σii I

Also, as in (7-10),

β̂(i) − β(i) = (Z'Z)⁻¹Z'Y(i) − β(i) = (Z'Z)⁻¹Z'ε(i)

and

ε̂(i) = Y(i) − Ŷ(i) = [I − Z(Z'Z)⁻¹Z']Y(i) = [I − Z(Z'Z)⁻¹Z']ε(i)        (7-36)

so E(β̂(i)) = β(i) and E(ε̂(i)) = 0. Next,

Cov(β̂(i), β̂(k)) = E[ (β̂(i) − β(i))(β̂(k) − β(k))' ] = (Z'Z)⁻¹Z' E(ε(i) ε(k)') Z(Z'Z)⁻¹ = σik (Z'Z)⁻¹

Using Result 4.9, with U any random vector and A a fixed matrix, we have that E[U'AU] = E[tr(AUU')] = tr[A E(UU')]. Consequently,

E(ε̂(i)'ε̂(k)) = E( ε(i)'[I − Z(Z'Z)⁻¹Z']ε(k) ) = tr( [I − Z(Z'Z)⁻¹Z'] σik I ) = σik tr( I − Z(Z'Z)⁻¹Z' ) = σik (n − r − 1)

as in the proof of Result 7.2. Dividing each entry ε̂(i)'ε̂(k) of ε̂'ε̂ by n − r − 1, we obtain the unbiased estimator of Σ. Finally,

Cov(β̂(i), ε̂(k)) = E[ (Z'Z)⁻¹Z' ε(i) ε(k)' (I − Z(Z'Z)⁻¹Z') ]
   = (Z'Z)⁻¹Z' σik I (I − Z(Z'Z)⁻¹Z') = σik ( (Z'Z)⁻¹Z' − (Z'Z)⁻¹Z' ) = 0

so each element of β̂ is uncorrelated with each element of ε̂. •

The mean vectors and covariance matrices determined in Result 7.9 enable us to obtain the sampling properties of the least squares predictors.

We first consider the problem of estimating the mean vector when the predictor variables have the values z0' = [1, z01, ..., z0r]. The mean of the ith response of the regression relationship is z0'β(i), and this is estimated by z0'β̂(i), the ith component of the fitted regression relationship. Collectively,

z0'β̂ = [ z0'β̂(1) | z0'β̂(2) | ... | z0'β̂(m) ]        (7-37)

is an unbiased estimator of z0'β, since E(z0'β̂(i)) = z0'β(i) for each component. From the covariance matrix for β̂(i) and β̂(k), the estimation errors z0'β(i) − z0'β̂(i) have covariances

E[ z0'(β(i) − β̂(i))(β(k) − β̂(k))' z0 ] = z0' ( E(β(i) − β̂(i))(β(k) − β̂(k))' ) z0 = σik z0'(Z'Z)⁻¹ z0        (7-38)
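As a numerical aside, dividing ε̂'ε̂ by n − r − 1, as Result 7.9 prescribes, reproduces for the Example 7.8 data the error mean squares shown in the SAS panel (2.0 for Y1 and 1.33 for Y2):

```python
import numpy as np

# Example 7.8 data: intercept-plus-slope design, two responses
Z = np.column_stack([np.ones(5), np.arange(5.0)])
Y = np.array([[1.0, -1.0], [4.0, -1.0], [3.0, 2.0], [8.0, 3.0], [9.0, 2.0]])

E = Y - Z @ (np.linalg.inv(Z.T @ Z) @ Z.T @ Y)   # residual matrix eps-hat
Sigma_unbiased = E.T @ E / (5 - 1 - 1)            # divisor n - r - 1 = 3
```

The diagonal of Sigma_unbiased matches the per-response residual mean squares, while the off-diagonal entry estimates the error covariance between the two responses.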
The related problem is that of forecasting a new observation vector Y0' = [Y01, Y02, ..., Y0m] at z0. According to the regression model, Y0i = z0'β(i) + ε0i, where the "new" error ε0' = [ε01, ε02, ..., ε0m] is independent of the errors ε and satisfies E(ε0i) = 0 and E(ε0i ε0k) = σik. The forecast error for the ith component of Y0 is

Y0i − z0'β̂(i) = Y0i − z0'β(i) + z0'β(i) − z0'β̂(i) = ε0i − z0'(β̂(i) − β(i))

so E(Y0i − z0'β̂(i)) = E(ε0i) − z0'E(β̂(i) − β(i)) = 0, indicating that z0'β̂(i) is an unbiased predictor of Y0i. The forecast errors have covariances

E[ (Y0i − z0'β̂(i))(Y0k − z0'β̂(k)) ]
   = E[ (ε0i − z0'(β̂(i) − β(i)))(ε0k − z0'(β̂(k) − β(k))) ]
   = E(ε0i ε0k) + z0' E[ (β̂(i) − β(i))(β̂(k) − β(k))' ] z0
        − z0' E[ (β̂(i) − β(i)) ε0k ] − E[ ε0i (β̂(k) − β(k))' ] z0
   = σik (1 + z0'(Z'Z)⁻¹ z0)        (7-39)

Note that E[(β̂(i) − β(i)) ε0k] = 0, since β̂(i) = β(i) + (Z'Z)⁻¹Z'ε(i) is independent of ε0. A similar result holds for E[ε0i (β̂(k) − β(k))'].

Maximum likelihood estimators and their distributions can be obtained when the errors ε have a normal distribution.

Result 7.10. Let the multivariate multiple regression model in (7-26) hold with full rank(Z) = r + 1, (r + 1) + m ≤ n, and let the errors ε have a normal distribution. Then

β̂ = (Z'Z)⁻¹Z'Y

is the maximum likelihood estimator of β, and β̂ has a normal distribution with E(β̂) = β and Cov(β̂(i), β̂(k)) = σik (Z'Z)⁻¹. Also, β̂ is independent of the maximum likelihood estimator of the positive definite Σ given by

Σ̂ = (1/n) ε̂'ε̂ = (1/n) (Y − Zβ̂)'(Y − Zβ̂)

and nΣ̂ is distributed as W_{p,n−r−1}(Σ).

Proof. According to the regression model, the likelihood is determined from the data Y = [Y1, Y2, ..., Yn]', whose rows are independent, with Yj distributed as Nm(β'zj, Σ). We first note that Y − Zβ = [Y1 − β'z1, Y2 − β'z2, ..., Yn − β'zn]', so

(Y − Zβ)'(Y − Zβ) = Σ_{j=1}^{n} (Yj − β'zj)(Yj − β'zj)'
and

Σ_{j=1}^{n} (Yj − β'zj)' Σ⁻¹ (Yj − β'zj) = Σ_{j=1}^{n} tr[ (Yj − β'zj)' Σ⁻¹ (Yj − β'zj) ]
   = Σ_{j=1}^{n} tr[ Σ⁻¹ (Yj − β'zj)(Yj − β'zj)' ]
   = tr[ Σ⁻¹ (Y − Zβ)'(Y − Zβ) ]        (7-40)

Another preliminary calculation will enable us to express the likelihood in a simple form. Since ε̂ = Y − Zβ̂ satisfies Z'ε̂ = 0 [see (7-32)],

(Y − Zβ)'(Y − Zβ) = [Y − Zβ̂ + Z(β̂ − β)]' [Y − Zβ̂ + Z(β̂ − β)]
   = (Y − Zβ̂)'(Y − Zβ̂) + (β̂ − β)'Z'Z(β̂ − β)
   = ε̂'ε̂ + (β̂ − β)'Z'Z(β̂ − β)        (7-41)

Using (7-40) and (7-41), we obtain the likelihood

L(β, Σ) = Π_{j=1}^{n} (2π)^{−m/2} |Σ|^{−1/2} exp[ −½ (Yj − β'zj)' Σ⁻¹ (Yj − β'zj) ]
   = (2π)^{−mn/2} |Σ|^{−n/2} exp{ −½ tr[ Σ⁻¹ ( ε̂'ε̂ + (β̂ − β)'Z'Z(β̂ − β) ) ] }
   = (2π)^{−mn/2} |Σ|^{−n/2} exp{ −½ tr[Σ⁻¹ ε̂'ε̂] − ½ tr[ Z(β̂ − β) Σ⁻¹ (β̂ − β)'Z' ] }

The matrix Z(β̂ − β) Σ⁻¹ (β̂ − β)'Z' is of the form A'A, with A = Σ^{−1/2} (β̂ − β)'Z', and, from Exercise 2.16, it is nonnegative definite. Therefore, its eigenvalues are nonnegative also. Since, by Result 4.9, tr[ Z(β̂ − β) Σ⁻¹ (β̂ − β)'Z' ] is the sum of its eigenvalues, this trace will equal its minimum value, zero, if β = β̂. This choice is unique because Z is of full rank, and β ≠ β̂ implies that Z(β(i) − β̂(i)) ≠ 0 for some i, in which case tr[ Z(β̂ − β) Σ⁻¹ (β̂ − β)'Z' ] ≥ c'Σ⁻¹c > 0, where c' is any nonzero row of Z(β̂ − β).

Applying Result 4.10 with B = ε̂'ε̂, b = n/2, and p = m, we find that β̂ and Σ̂ = n⁻¹ε̂'ε̂ are the maximum likelihood estimators of β and Σ, respectively, and

L(β̂, Σ̂) = (2π)^{−mn/2} |Σ̂|^{−n/2} e^{−nm/2} = (2π)^{−mn/2} n^{mn/2} |ε̂'ε̂|^{−n/2} e^{−nm/2}        (7-42)

It remains to establish the distributional results. From (7-36), we know that β̂(i) and ε̂(i) are linear combinations of the elements of ε. Specifically,

β̂(i) = β(i) + (Z'Z)⁻¹Z'ε(i)
ε̂(i) = [I − Z(Z'Z)⁻¹Z']ε(i),   i = 1, 2, ..., m
Therefore, by Result 4.3, β̂(1), β̂(2), ..., β̂(m), ε̂(1), ε̂(2), ..., ε̂(m) are jointly normal. Their mean vectors and covariance matrices are given in Result 7.9. Since ε̂ and β̂ have a zero covariance matrix, by Result 4.5 they are independent. Further, as in (7-13),

I − Z(Z'Z)⁻¹Z' = Σ_{ℓ=1}^{n−r−1} eℓ eℓ'

where eℓ'ek = 0, ℓ ≠ k, and eℓ'eℓ = 1. Set

Vℓ = ε'eℓ = [ε(1)'eℓ, ε(2)'eℓ, ..., ε(m)'eℓ]' = eℓ1 ε1 + eℓ2 ε2 + ... + eℓn εn

Because Vℓ, ℓ = 1, 2, ..., n − r − 1, are linear combinations of the elements of ε, they have a joint normal distribution with E(Vℓ) = E(ε')eℓ = 0. Also, by Result 4.8, Vℓ and Vk have covariance matrix (eℓ'ek)Σ = (0)Σ = 0 if ℓ ≠ k. Consequently, the Vℓ are independently distributed as Nm(0, Σ). Finally,

ε̂'ε̂ = ε'[I − Z(Z'Z)⁻¹Z']ε = Σ_{ℓ=1}^{n−r−1} ε'eℓ eℓ'ε = Σ_{ℓ=1}^{n−r−1} Vℓ Vℓ'

which has the W_{p,n−r−1}(Σ) distribution, by (4-22). •
Result 7.10 provides additional support for using least squares estimates. When the errors are normally distributed, β̂ and n⁻¹ε̂'ε̂ are the maximum likelihood estimators of β and Σ, respectively. Therefore, for large samples, they have nearly the smallest possible variances.

Comment. The multivariate multiple regression model poses no new computational problems. Least squares (maximum likelihood) estimates, β̂(i) = (Z'Z)⁻¹Z'y(i), are computed individually for each response variable. Note, however, that the model requires that the same predictor variables be used for all responses.

Once a multivariate multiple regression model has been fit to the data, it should be subjected to the diagnostic checks described in Section 7.6 for the single-response model. The residual vectors [ε̂j1, ε̂j2, ..., ε̂jm] can be examined for normality or outliers using the techniques in Section 4.6.

The remainder of this section is devoted to brief discussions of inference for the normal theory multivariate multiple regression model. Extended accounts of these procedures appear in [1] and [22].
Likelihood Ratio Tests for Regression Parameters

The multiresponse analog of (7-15), the hypothesis that the responses do not depend on z_{q+1}, z_{q+2}, ..., z_r, becomes

H0: β(2) = 0,   where β = [ β(1) ((q+1) × m) ; β(2) ((r−q) × m) ]        (7-43)

Setting Z = [ Z1 (n × (q+1)) | Z2 (n × (r−q)) ], we can write the general model as

E(Y) = Zβ = [Z1 | Z2] [ β(1) ; β(2) ] = Z1 β(1) + Z2 β(2)

Under H0: β(2) = 0, Y = Z1 β(1) + ε, and the likelihood ratio test of H0 is based on the quantities involved in the

extra sum of squares and cross products
   = (Y − Z1 β̂(1))'(Y − Z1 β̂(1)) − (Y − Zβ̂)'(Y − Zβ̂) = n(Σ̂1 − Σ̂)

where β̂(1) = (Z1'Z1)⁻¹Z1'Y and Σ̂1 = n⁻¹(Y − Z1 β̂(1))'(Y − Z1 β̂(1)). From (7-42), the likelihood ratio, Λ, can be expressed in terms of generalized variances:
Λ = [ max_{β(1),Σ} L(β(1), Σ) ] / [ max_{β,Σ} L(β, Σ) ] = ( |Σ̂| / |Σ̂1| )^{n/2}        (7-44)

Equivalently, Wilks' lambda statistic

Λ^{2/n} = |Σ̂| / |Σ̂1|

can be used.

Result 7.11. Let the multivariate multiple regression model of (7-26) hold with Z of full rank r + 1 and (r + 1) + m ≤ n. Let the errors ε be normally distributed. Under H0: β(2) = 0, nΣ̂ is distributed as W_{p,n−r−1}(Σ) independently of n(Σ̂1 − Σ̂), which, in turn, is distributed as W_{p,r−q}(Σ). The likelihood ratio test of H0 is equivalent to rejecting H0 for large values of

−2 ln Λ = −n ln( |Σ̂| / |Σ̂1| ) = −n ln( |nΣ̂| / |nΣ̂ + n(Σ̂1 − Σ̂)| )

For n large, the modified statistic

−[ n − r − 1 − ½(m − r + q + 1) ] ln( |Σ̂| / |Σ̂1| )

has, to a close approximation, a chi-square distribution with m(r − q) d.f. (Technically, both n − r and n − m should also be large to obtain a good chi-square approximation.)

Proof. (See Supplement 7A.) •

If Z is not of full rank, but has rank r1 + 1, then β̂ = (Z'Z)⁻Z'Y, where (Z'Z)⁻ is the generalized inverse discussed in [19]. (See also Exercise 7.6.) The distributional conclusions stated in Result 7.11 remain the same, provided that r is replaced by r1 and q + 1 by rank(Z1). However, not all hypotheses concerning β can
be tested due to the lack of uniqueness in the identification of $\beta$ caused by the linear dependencies among the columns of $Z$. Nevertheless, the generalized inverse allows all of the important MANOVA models to be analyzed as special cases of the multivariate multiple regression model.

Example 7.9 (Testing the importance of additional predictors with a multivariate response)

The service in three locations of a large restaurant chain was rated according to two measures of quality by male and female patrons. The first service-quality index was introduced in Example 7.5. Suppose we consider a regression model that allows for the effects of location, gender, and the location-gender interaction on both service-quality indices. The design matrix (see Example 7.5) remains the same for the two-response situation. We shall illustrate the test of no location-gender interaction in either response using Result 7.11. A computer program provides
$$n\hat{\Sigma} = \begin{pmatrix} \text{residual sum of squares} \\ \text{and cross products} \end{pmatrix} = \begin{bmatrix} 2977.39 & 1021.72 \\ 1021.72 & 2050.95 \end{bmatrix}$$

$$n(\hat{\Sigma}_1 - \hat{\Sigma}) = \begin{pmatrix} \text{extra sum of squares} \\ \text{and cross products} \end{pmatrix} = \begin{bmatrix} 441.76 & 246.16 \\ 246.16 & 366.12 \end{bmatrix}$$
Let $\beta_{(2)}$ be the matrix of interaction parameters for the two responses. Although the sample size $n = 18$ is not large, we shall illustrate the calculations involved in the test of $H_0: \beta_{(2)} = 0$ given in Result 7.11. Setting $\alpha = .05$, we test $H_0$ by referring

$$-\left[ n - r_1 - 1 - \tfrac{1}{2}(m - r_1 + q_1 + 1) \right] \ln\left( \frac{|n\hat{\Sigma}|}{|n\hat{\Sigma} + n(\hat{\Sigma}_1 - \hat{\Sigma})|} \right) = -\left[ 18 - 5 - 1 - \tfrac{1}{2}(2 - 5 + 3 + 1) \right] \ln(.7605) = 3.28$$

to a chi-square percentage point with $m(r_1 - q_1) = 2(2) = 4$ d.f. Since $3.28 < \chi_4^2(.05) = 9.49$, we do not reject $H_0$ at the 5% level. The interaction terms are not needed. ∎
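The test in Example 7.9 is easy to reproduce numerically. The minimal pure-Python sketch below recomputes the Wilks' lambda ratio $|n\hat{\Sigma}|/|n\hat{\Sigma} + n(\hat{\Sigma}_1 - \hat{\Sigma})| = .7605$ and the test decision from the matrices as printed; because the printed matrices are rounded to two decimals, the modified chi-square statistic comes out near, but not exactly equal to, the value reported in the text.

```python
# Check of Example 7.9 using the printed (rounded) matrices.
import math

E = [[2977.39, 1021.72], [1021.72, 2050.95]]   # n * Sigma-hat (residual SSP)
H = [[441.76, 246.16], [246.16, 366.12]]       # n * (Sigma1-hat - Sigma-hat) (extra SSP)

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

EH = [[E[i][j] + H[i][j] for j in range(2)] for i in range(2)]
wilks = det2(E) / det2(EH)                     # Wilks' lambda ratio, about .7605

n, r1, q1, m = 18, 5, 3, 2
stat = -(n - r1 - 1 - 0.5 * (m - r1 + q1 + 1)) * math.log(wilks)
# Compare stat with the chi-square(4) upper 5% point, 9.49: do not reject H0.
```

Since the statistic is well below 9.49, the no-interaction conclusion of the example is confirmed.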
More generally, we could consider a null hypothesis of the form $H_0: C\beta = \Gamma_0$, where $C$ is $(r-q) \times (r+1)$ and is of full rank $(r-q)$. For the choices $C = [\,0 \mid I_{(r-q)\times(r-q)}\,]$ and $\Gamma_0 = 0$, this null hypothesis becomes $H_0: C\beta = \beta_{(2)} = 0$, the case considered earlier.

It can be shown that the extra sum of squares and cross products generated by the hypothesis $H_0$ is

$$n(\hat{\Sigma}_1 - \hat{\Sigma}) = (C\hat{\beta} - \Gamma_0)'\left( C(Z'Z)^{-1}C' \right)^{-1}(C\hat{\beta} - \Gamma_0)$$

Under the null hypothesis, the statistic $n(\hat{\Sigma}_1 - \hat{\Sigma})$ is distributed as $W_{r-q}(\Sigma)$ independently of $\hat{\Sigma}$. This distribution theory can be employed to develop a test of $H_0: C\beta = \Gamma_0$ similar to the test discussed in Result 7.11. (See, for example, [22].)
Other Multivariate Test Statistics
Tests other than the likelihood ratio test have been proposed for testing $H_0: \beta_{(2)} = 0$ in the multivariate multiple regression model. Popular computer-package programs routinely calculate four multivariate test statistics. To connect with their output, we introduce some alternative notation. Let $E$ be the $p \times p$ error, or residual, sum of squares and cross products matrix

$$E = n\hat{\Sigma}$$

that results from fitting the full model. The $p \times p$ hypothesis, or extra, sum of squares and cross-products matrix is

$$H = n(\hat{\Sigma}_1 - \hat{\Sigma})$$

The statistics can be defined in terms of $E$ and $H$ directly, or in terms of the nonzero eigenvalues $\eta_1 \ge \eta_2 \ge \cdots \ge \eta_s$ of $HE^{-1}$, where $s = \min(p, r-q)$. Equivalently, they are the roots of $|(\hat{\Sigma}_1 - \hat{\Sigma}) - \eta\hat{\Sigma}| = 0$. The definitions are

$$\text{Wilks' lambda} = \prod_{i=1}^{s} \frac{1}{1+\eta_i} = \frac{|E|}{|E+H|}$$
$$\text{Pillai's trace} = \sum_{i=1}^{s} \frac{\eta_i}{1+\eta_i} = \mathrm{tr}[H(H+E)^{-1}]$$
$$\text{Hotelling-Lawley trace} = \sum_{i=1}^{s} \eta_i = \mathrm{tr}[HE^{-1}]$$
$$\text{Roy's greatest root} = \frac{\eta_1}{1+\eta_1}$$
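For two responses, the eigenvalues of $HE^{-1}$ are the roots of a quadratic, so all four statistics can be computed by hand. A pure-Python sketch reusing the $E$ and $H$ matrices printed in Example 7.9; the product form of Wilks' lambda is checked against its determinant form.

```python
# Four multivariate test statistics from the eigenvalues of H E^{-1} (2x2 case).
import math

E = [[2977.39, 1021.72], [1021.72, 2050.95]]   # error SSP matrix
H = [[441.76, 246.16], [246.16, 366.12]]       # hypothesis SSP matrix

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Eigenvalues solve det(H - eta*E) = 0, i.e. a*eta^2 + b*eta + c = 0.
a = det2(E)
c = det2(H)
b = -(H[0][0] * E[1][1] + H[1][1] * E[0][0] - H[0][1] * E[1][0] - H[1][0] * E[0][1])

disc = math.sqrt(b * b - 4 * a * c)
etas = sorted([(-b + disc) / (2 * a), (-b - disc) / (2 * a)], reverse=True)

wilks = 1.0
for eta in etas:
    wilks *= 1 / (1 + eta)                     # product form of Wilks' lambda
pillai = sum(eta / (1 + eta) for eta in etas)
hotelling_lawley = sum(etas)
roy = etas[0] / (1 + etas[0])

EH = [[E[i][j] + H[i][j] for j in range(2)] for i in range(2)]
# Identity check: wilks equals |E| / |E + H|.
```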
Roy's test selects the coefficient vector $a$ so that the univariate $F$-statistic based on $a'Y_j$ has its maximum possible value. When several of the eigenvalues $\eta_i$ are moderately large, Roy's test will perform poorly relative to the other three. Simulation studies suggest that its power will be best when there is only one large eigenvalue.

Charts and tables of critical values are available for Roy's test. (See [18] and [16].) Wilks' lambda, Roy's greatest root, and the Hotelling-Lawley trace test are nearly equivalent for large sample sizes.

If there is a large discrepancy in the reported P-values for the four tests, the eigenvalues and vectors may lead to an interpretation. In this text, we report Wilks' lambda, which is the likelihood ratio test.

Predictions from Multivariate Multiple Regressions
Suppose the model $Y = Z\beta + \varepsilon$, with normal errors $\varepsilon$, has been fit and checked for any inadequacies. If the model is adequate, it can be employed for predictive purposes.

One problem is to predict the mean responses corresponding to fixed values $z_0$ of the predictor variables. Inferences about the mean responses can be made using the distribution theory in Result 7.10. From this result, we determine that

$$\hat{\beta}'z_0 \quad \text{is distributed as} \quad N_m\!\left(\beta'z_0,\; z_0'(Z'Z)^{-1}z_0\,\Sigma\right)$$
and

$$n\hat{\Sigma} \quad \text{is independently distributed as} \quad W_{n-r-1}(\Sigma)$$

The unknown value of the regression function at $z_0$ is $\beta'z_0$. So, from the discussion of the $T^2$-statistic in Section 5.2, we can write

$$T^2 = \left( \frac{\hat{\beta}'z_0 - \beta'z_0}{\sqrt{z_0'(Z'Z)^{-1}z_0}} \right)' \left( \frac{n}{n-r-1}\hat{\Sigma} \right)^{-1} \left( \frac{\hat{\beta}'z_0 - \beta'z_0}{\sqrt{z_0'(Z'Z)^{-1}z_0}} \right)$$  (7-45)

and the $100(1-\alpha)\%$ confidence ellipsoid for $\beta'z_0$ is provided by the inequality

$$(\beta'z_0 - \hat{\beta}'z_0)' \left( \frac{n}{n-r-1}\hat{\Sigma} \right)^{-1} (\beta'z_0 - \hat{\beta}'z_0) \le z_0'(Z'Z)^{-1}z_0 \left[ \left( \frac{m(n-r-1)}{n-r-m} \right) F_{m,n-r-m}(\alpha) \right]$$  (7-46)

where $F_{m,n-r-m}(\alpha)$ is the upper $(100\alpha)$th percentile of an $F$-distribution with $m$ and $n-r-m$ d.f.

The $100(1-\alpha)\%$ simultaneous confidence intervals for $E(Y_i) = z_0'\beta_{(i)}$ are

$$z_0'\hat{\beta}_{(i)} \pm \sqrt{\left( \frac{m(n-r-1)}{n-r-m} \right) F_{m,n-r-m}(\alpha)} \; \sqrt{z_0'(Z'Z)^{-1}z_0 \left( \frac{n}{n-r-1} \right) \hat{\sigma}_{ii}}, \quad i = 1, 2, \ldots, m$$  (7-47)

where $\hat{\beta}_{(i)}$ is the $i$th column of $\hat{\beta}$ and $\hat{\sigma}_{ii}$ is the $i$th diagonal element of $\hat{\Sigma}$.
The second prediction problem is concerned with forecasting new responses $Y_0 = \beta'z_0 + \varepsilon_0$ at $z_0$. Here $\varepsilon_0$ is independent of $\varepsilon$. Now,

$$Y_0 - \hat{\beta}'z_0 = (\beta - \hat{\beta})'z_0 + \varepsilon_0 \quad \text{is distributed as} \quad N_m\!\left(0,\; (1 + z_0'(Z'Z)^{-1}z_0)\,\Sigma\right)$$

independently of $n\hat{\Sigma}$, so the $100(1-\alpha)\%$ prediction ellipsoid for $Y_0$ becomes

$$(Y_0 - \hat{\beta}'z_0)' \left( \frac{n}{n-r-1}\hat{\Sigma} \right)^{-1} (Y_0 - \hat{\beta}'z_0) \le (1 + z_0'(Z'Z)^{-1}z_0) \left[ \left( \frac{m(n-r-1)}{n-r-m} \right) F_{m,n-r-m}(\alpha) \right]$$  (7-48)

The $100(1-\alpha)\%$ simultaneous prediction intervals for the individual responses $Y_{0i}$ are

$$z_0'\hat{\beta}_{(i)} \pm \sqrt{\left( \frac{m(n-r-1)}{n-r-m} \right) F_{m,n-r-m}(\alpha)} \; \sqrt{(1 + z_0'(Z'Z)^{-1}z_0) \left( \frac{n}{n-r-1} \right) \hat{\sigma}_{ii}}, \quad i = 1, 2, \ldots, m$$  (7-49)

where $\hat{\beta}_{(i)}$, $\hat{\sigma}_{ii}$, and $F_{m,n-r-m}(\alpha)$ are the same quantities appearing in (7-47). Comparing (7-47) and (7-49), we see that the prediction intervals for the actual values of the response variables are wider than the corresponding intervals for the expected values. The extra width reflects the presence of the random error $\varepsilon_{0i}$.
Example 7.10 (Constructing a confidence ellipse and a prediction ellipse for bivariate responses)
A second response variable was measured for the computer-requirement problem discussed in Example 7.6. Measurements on the response $Y_2$, disk input/output capacity, corresponding to the $z_1$ and $z_2$ values in that example were

$$y_2' = [301.8,\; 396.1,\; 328.2,\; 307.4,\; 362.4,\; 369.5,\; 229.1]$$

Obtain the 95% confidence ellipse for $\beta'z_0$ and the 95% prediction ellipse for $Y_0' = [Y_{01}, Y_{02}]$ for a site with the configuration $z_0' = [1, 130, 7.5]$.

Computer calculations provide the fitted equation

$$\hat{y}_2 = 14.14 + 2.25z_1 + 5.67z_2$$

with $s = 1.812$. Thus, $\hat{\beta}_{(2)}' = [14.14, 2.25, 5.67]$. From Example 7.6,

$$\hat{\beta}_{(1)}' = [8.42, 1.08, .42], \quad z_0'\hat{\beta}_{(1)} = 151.97, \quad \text{and} \quad z_0'(Z'Z)^{-1}z_0 = .34725$$

We find that

$$z_0'\hat{\beta}_{(2)} = 14.14 + 2.25(130) + 5.67(7.5) = 349.17$$

and

$$\hat{\beta}'z_0 = \begin{bmatrix} z_0'\hat{\beta}_{(1)} \\ z_0'\hat{\beta}_{(2)} \end{bmatrix} = \begin{bmatrix} 151.97 \\ 349.17 \end{bmatrix}, \qquad n\hat{\Sigma} = \begin{bmatrix} 5.80 & 5.30 \\ 5.30 & 13.13 \end{bmatrix}$$

Since $n = 7$, $r = 2$, and $m = 2$, a 95% confidence ellipse for $\beta'z_0 = [\,z_0'\beta_{(1)},\; z_0'\beta_{(2)}\,]'$ is, from (7-46), the set

$$[\,z_0'\beta_{(1)} - 151.97,\; z_0'\beta_{(2)} - 349.17\,]\,(4)\begin{bmatrix} 5.80 & 5.30 \\ 5.30 & 13.13 \end{bmatrix}^{-1}\begin{bmatrix} z_0'\beta_{(1)} - 151.97 \\ z_0'\beta_{(2)} - 349.17 \end{bmatrix} \le (.34725)\left[ \frac{2(4)}{3} F_{2,3}(.05) \right]$$

with $F_{2,3}(.05) = 9.55$. This ellipse is centered at $(151.97, 349.17)$. Its orientation and the lengths of the major and minor axes can be determined from the eigenvalues and eigenvectors of $n\hat{\Sigma}$.

Comparing (7-46) and (7-48), we see that the only change required for the calculation of the 95% prediction ellipse is to replace $z_0'(Z'Z)^{-1}z_0 = .34725$ with $1 + z_0'(Z'Z)^{-1}z_0 = 1.34725$. Thus, the 95% prediction ellipse for $Y_0' = [Y_{01}, Y_{02}]$ is also centered at $(151.97, 349.17)$, but is larger than the confidence ellipse. Both ellipses are sketched in Figure 7.5.
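The ellipse centers in Example 7.10 are simply the two fitted values at $z_0$, and moving from the confidence to the prediction ellipse widens each axis by the factor $\sqrt{1.34725/.34725}$. A quick pure-Python check using the fitted coefficients printed above:

```python
# Centers of the ellipses in Example 7.10: fitted values at z0 = [1, 130, 7.5].
b1 = [8.42, 1.08, 0.42]    # beta-hat(1): CPU time equation
b2 = [14.14, 2.25, 5.67]   # beta-hat(2): disk I/O equation
z0 = [1.0, 130.0, 7.5]

center1 = sum(b * z for b, z in zip(b1, z0))   # about 151.97
center2 = sum(b * z for b, z in zip(b2, z0))   # about 349.17

# Each axis of the prediction ellipse exceeds the confidence ellipse by this factor:
widening = ((1 + 0.34725) / 0.34725) ** 0.5    # close to 2
```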
Figure 7.5  95% confidence and prediction ellipses for the computer data with two responses. (The prediction ellipse is the larger of the two.)
It is the prediction ellipse that is relevant to the determination of computer requirements for a particular site with the given $z_0$. ∎

7.8 THE CONCEPT OF LINEAR REGRESSION
The classical linear regression model is concerned with the association between a single dependent variable $Y$ and a collection of predictor variables $z_1, z_2, \ldots, z_r$. The regression model that we have considered treats $Y$ as a random variable whose mean depends upon fixed values of the $z_i$'s. This mean is assumed to be a linear function of the regression coefficients $\beta_0, \beta_1, \ldots, \beta_r$.

The linear regression model also arises in a different setting. Suppose all the variables $Y, Z_1, Z_2, \ldots, Z_r$ are random and have a joint distribution, not necessarily normal, with mean vector $\mu$ $((r+1)\times 1)$ and covariance matrix $\Sigma$ $((r+1)\times(r+1))$. Partitioning $\mu$ and $\Sigma$ in an obvious fashion, we write

$$\mu = \begin{bmatrix} \mu_Y \; (1\times 1) \\ \mu_Z \; (r\times 1) \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} \sigma_{YY} \; (1\times 1) & \sigma_{ZY}' \; (1\times r) \\ \sigma_{ZY} \; (r\times 1) & \Sigma_{ZZ} \; (r\times r) \end{bmatrix}$$  (7-50)

$\Sigma_{ZZ}$ can be taken to have full rank.⁶ Consider the problem of predicting $Y$ using the linear predictor

$$b_0 + b_1Z_1 + \cdots + b_rZ_r = b_0 + b'Z$$  (7-51)

⁶ If $\Sigma_{ZZ}$ is not of full rank, one variable, for example $Z_k$, can be written as a linear combination of the other $Z_i$'s and thus is redundant in forming the linear regression function $Z'\beta$. That is, $Z$ may be replaced by any subset of components whose nonsingular covariance matrix has the same rank as $\Sigma_{ZZ}$.
For a given predictor of the form of (7-51), the error in the prediction of $Y$ is

$$\text{prediction error} = Y - b_0 - b_1Z_1 - \cdots - b_rZ_r = Y - b_0 - b'Z$$  (7-52)

Because this error is random, it is customary to select $b_0$ and $b$ to minimize the

$$\text{mean square error} = E(Y - b_0 - b'Z)^2$$  (7-53)

Now the mean square error depends on the joint distribution of $Y$ and $Z$ only through the parameters $\mu$ and $\Sigma$. It is possible to express the "optimal" linear predictor in terms of these latter quantities.

Result 7.12. The linear predictor $\beta_0 + \beta'Z$ with coefficients

$$\beta = \Sigma_{ZZ}^{-1}\sigma_{ZY}, \qquad \beta_0 = \mu_Y - \beta'\mu_Z$$

has minimum mean square among all linear predictors of the response $Y$. Its mean square error is

$$E(Y - \beta_0 - \beta'Z)^2 = E(Y - \mu_Y - \sigma_{ZY}'\Sigma_{ZZ}^{-1}(Z - \mu_Z))^2 = \sigma_{YY} - \sigma_{ZY}'\Sigma_{ZZ}^{-1}\sigma_{ZY}$$

Also, $\beta_0 + \beta'Z = \mu_Y + \sigma_{ZY}'\Sigma_{ZZ}^{-1}(Z - \mu_Z)$ is the linear predictor having maximum correlation with $Y$; that is,

$$\mathrm{Corr}^2(Y, \beta_0 + \beta'Z) = \max_{b_0, b} \mathrm{Corr}^2(Y, b_0 + b'Z) = \frac{\beta'\Sigma_{ZZ}\beta}{\sigma_{YY}} = \frac{\sigma_{ZY}'\Sigma_{ZZ}^{-1}\sigma_{ZY}}{\sigma_{YY}}$$

Proof. Writing $b_0 + b'Z = b_0 + b'Z - (\mu_Y - b'\mu_Z) + (\mu_Y - b'\mu_Z)$, we get

$$E(Y - b_0 - b'Z)^2 = E[Y - \mu_Y - (b'Z - b'\mu_Z) + (\mu_Y - b_0 - b'\mu_Z)]^2$$
$$= E(Y - \mu_Y)^2 + E(b'(Z - \mu_Z))^2 + (\mu_Y - b_0 - b'\mu_Z)^2 - 2E[b'(Z - \mu_Z)(Y - \mu_Y)]$$
$$= \sigma_{YY} + b'\Sigma_{ZZ}b + (\mu_Y - b_0 - b'\mu_Z)^2 - 2b'\sigma_{ZY}$$

Adding and subtracting $\sigma_{ZY}'\Sigma_{ZZ}^{-1}\sigma_{ZY}$, we obtain

$$E(Y - b_0 - b'Z)^2 = \sigma_{YY} - \sigma_{ZY}'\Sigma_{ZZ}^{-1}\sigma_{ZY} + (\mu_Y - b_0 - b'\mu_Z)^2 + (b - \Sigma_{ZZ}^{-1}\sigma_{ZY})'\Sigma_{ZZ}(b - \Sigma_{ZZ}^{-1}\sigma_{ZY})$$

The mean square error is minimized by taking $b = \Sigma_{ZZ}^{-1}\sigma_{ZY} = \beta$, making the last term zero, and then choosing $b_0 = \mu_Y - (\Sigma_{ZZ}^{-1}\sigma_{ZY})'\mu_Z = \beta_0$ to make the third term zero. The minimum mean square error is thus $\sigma_{YY} - \sigma_{ZY}'\Sigma_{ZZ}^{-1}\sigma_{ZY}$.

Next, we note that $\mathrm{Cov}(b_0 + b'Z, Y) = \mathrm{Cov}(b'Z, Y) = b'\sigma_{ZY}$ so

$$[\mathrm{Corr}(b_0 + b'Z, Y)]^2 = \frac{[b'\sigma_{ZY}]^2}{\sigma_{YY}(b'\Sigma_{ZZ}b)}, \quad \text{for all } b_0, b$$

Employing the extended Cauchy-Schwarz inequality of (2-49) with $B = \Sigma_{ZZ}$, we obtain
or

$$[\mathrm{Corr}(b_0 + b'Z, Y)]^2 \le \frac{\sigma_{ZY}'\Sigma_{ZZ}^{-1}\sigma_{ZY}}{\sigma_{YY}}$$

with equality for $b = \Sigma_{ZZ}^{-1}\sigma_{ZY} = \beta$. The alternative expression for the maximum correlation follows from the equation $\sigma_{ZY}'\Sigma_{ZZ}^{-1}\sigma_{ZY} = \sigma_{ZY}'\beta = \beta'\Sigma_{ZZ}\beta$. ∎

The correlation between $Y$ and its best linear predictor is called the population multiple correlation coefficient

$$\rho_{Y(Z)} = +\sqrt{\frac{\sigma_{ZY}'\Sigma_{ZZ}^{-1}\sigma_{ZY}}{\sigma_{YY}}}$$  (7-54)

The square of the population multiple correlation coefficient, $\rho_{Y(Z)}^2$, is called the population coefficient of determination. Note that, unlike other correlation coefficients, the multiple correlation coefficient is a positive square root, so $0 \le \rho_{Y(Z)} \le 1$.

The population coefficient of determination has an important interpretation. From Result 7.12, the mean square error in using $\beta_0 + \beta'Z$ to forecast $Y$ is

$$\sigma_{YY} - \sigma_{ZY}'\Sigma_{ZZ}^{-1}\sigma_{ZY} = \sigma_{YY} - \sigma_{YY}\left( \frac{\sigma_{ZY}'\Sigma_{ZZ}^{-1}\sigma_{ZY}}{\sigma_{YY}} \right) = \sigma_{YY}(1 - \rho_{Y(Z)}^2)$$  (7-55)
If $\rho_{Y(Z)}^2 = 0$, there is no predictive power in $Z$. At the other extreme, $\rho_{Y(Z)}^2 = 1$ implies that $Y$ can be predicted with no error.

Example 7.11 (Determining the best linear predictor, its mean square error, and the multiple correlation coefficient)
Given the mean vector and covariance matrix of $Y$, $Z_1$, $Z_2$,

$$\mu = \begin{bmatrix} \mu_Y \\ \mu_Z \end{bmatrix} = \begin{bmatrix} 5 \\ 2 \\ 0 \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} \sigma_{YY} & \sigma_{ZY}' \\ \sigma_{ZY} & \Sigma_{ZZ} \end{bmatrix} = \begin{bmatrix} 10 & 1 & -1 \\ 1 & 7 & 3 \\ -1 & 3 & 2 \end{bmatrix}$$

determine (a) the best linear predictor $\beta_0 + \beta_1Z_1 + \beta_2Z_2$, (b) its mean square error, and (c) the multiple correlation coefficient. Also, verify that the mean square error equals $\sigma_{YY}(1 - \rho_{Y(Z)}^2)$.

First,

$$\beta = \Sigma_{ZZ}^{-1}\sigma_{ZY} = \begin{bmatrix} 7 & 3 \\ 3 & 2 \end{bmatrix}^{-1}\begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} .4 & -.6 \\ -.6 & 1.4 \end{bmatrix}\begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 1 \\ -2 \end{bmatrix}$$

$$\beta_0 = \mu_Y - \beta'\mu_Z = 5 - [1, -2]\begin{bmatrix} 2 \\ 0 \end{bmatrix} = 3$$

so the best linear predictor is $\beta_0 + \beta'Z = 3 + Z_1 - 2Z_2$. The mean square error is

$$\sigma_{YY} - \sigma_{ZY}'\Sigma_{ZZ}^{-1}\sigma_{ZY} = 10 - [1, -1]\begin{bmatrix} .4 & -.6 \\ -.6 & 1.4 \end{bmatrix}\begin{bmatrix} 1 \\ -1 \end{bmatrix} = 10 - 3 = 7$$
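The arithmetic of Example 7.11 is easily verified. A pure-Python sketch with the $2\times 2$ inverse written out explicitly:

```python
# Best linear predictor, mean square error, and multiple correlation (Example 7.11).
import math

mu_y, mu_z = 5.0, [2.0, 0.0]
s_yy = 10.0
s_zy = [1.0, -1.0]
S_zz = [[7.0, 3.0], [3.0, 2.0]]

d = S_zz[0][0] * S_zz[1][1] - S_zz[0][1] * S_zz[1][0]      # determinant = 5
S_inv = [[S_zz[1][1] / d, -S_zz[0][1] / d],
         [-S_zz[1][0] / d, S_zz[0][0] / d]]

beta = [sum(S_inv[i][k] * s_zy[k] for k in range(2)) for i in range(2)]   # [1, -2]
beta0 = mu_y - sum(b * m for b, m in zip(beta, mu_z))                     # 3

quad = sum(s * b for s, b in zip(s_zy, beta))   # s_zy' Szz^{-1} s_zy = 3
mse = s_yy - quad                               # 7
rho = math.sqrt(quad / s_yy)                    # about .548
```

Note that `s_yy * (1 - rho**2)` reproduces the mean square error, illustrating (7-55).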
and the multiple correlation coefficient is

$$\rho_{Y(Z)} = +\sqrt{\frac{\sigma_{ZY}'\Sigma_{ZZ}^{-1}\sigma_{ZY}}{\sigma_{YY}}} = \sqrt{\frac{3}{10}} = .548$$

Note that $\sigma_{YY}(1 - \rho_{Y(Z)}^2) = 10(1 - \tfrac{3}{10}) = 7$ is the mean square error. ∎

It is possible to show (see Exercise 7.5) that

$$1 - \rho_{Y(Z)}^2 = \frac{1}{\rho^{YY}}$$  (7-56)

where $\rho^{YY}$ is the upper left-hand corner of the inverse of the correlation matrix determined from $\Sigma$.

The restriction to linear predictors is closely connected to the assumption of normality. Specifically, if we take

$$[Y, Z_1, Z_2, \ldots, Z_r]' \quad \text{to be distributed as} \quad N_{r+1}(\mu, \Sigma)$$

then the conditional distribution of $Y$ with $z_1, z_2, \ldots, z_r$ fixed (see Result 4.6) is

$$N\!\left(\mu_Y + \sigma_{ZY}'\Sigma_{ZZ}^{-1}(z - \mu_Z),\; \sigma_{YY} - \sigma_{ZY}'\Sigma_{ZZ}^{-1}\sigma_{ZY}\right)$$

The mean of this conditional distribution is the linear predictor in Result 7.12. That is,

$$E(Y \mid z_1, z_2, \ldots, z_r) = \mu_Y + \sigma_{ZY}'\Sigma_{ZZ}^{-1}(z - \mu_Z) = \beta_0 + \beta'z$$  (7-57)

and we conclude that $E(Y \mid z_1, z_2, \ldots, z_r)$ is the best linear predictor of $Y$ when the population is $N_{r+1}(\mu, \Sigma)$. The conditional expectation of $Y$ in (7-57) is called the linear regression function.

When the population is not normal, the regression function $E(Y \mid z_1, z_2, \ldots, z_r)$ need not be of the form $\beta_0 + \beta'z$. Nevertheless, it can be shown (see [19]) that $E(Y \mid z_1, z_2, \ldots, z_r)$, whatever its form, predicts $Y$ with the smallest mean square error. Fortunately, this wider optimality among all estimators is possessed by the linear predictor when the population is normal.

Result 7.13. Suppose the joint distribution of $Y$ and $Z$ is $N_{r+1}(\mu, \Sigma)$. Let

$$\hat{\mu} = \begin{bmatrix} \bar{y} \\ \bar{z} \end{bmatrix} \quad \text{and} \quad S = \begin{bmatrix} s_{YY} & s_{ZY}' \\ s_{ZY} & S_{ZZ} \end{bmatrix}$$

be the sample mean vector and sample covariance matrix, respectively, for a random sample of size $n$ from this population. Then the maximum likelihood estimators of the coefficients in the linear predictor are

$$\hat{\beta} = S_{ZZ}^{-1}s_{ZY}, \qquad \hat{\beta}_0 = \bar{y} - s_{ZY}'S_{ZZ}^{-1}\bar{z} = \bar{y} - \hat{\beta}'\bar{z}$$
Consequently, the maximum likelihood estimator of the linear regression function is

$$\hat{\beta}_0 + \hat{\beta}'z = \bar{y} + s_{ZY}'S_{ZZ}^{-1}(z - \bar{z})$$

and the maximum likelihood estimator of the mean square error $E[Y - \beta_0 - \beta'Z]^2 = \sigma_{YY\cdot Z} = \sigma_{YY} - \sigma_{ZY}'\Sigma_{ZZ}^{-1}\sigma_{ZY}$ is

$$\hat{\sigma}_{YY\cdot Z} = \frac{n-1}{n}\left( s_{YY} - s_{ZY}'S_{ZZ}^{-1}s_{ZY} \right)$$

Proof. We use Result 4.11 and the invariance property of maximum likelihood estimators. [See (4-20).] Since, from Result 7.12,

$$\beta_0 = \mu_Y - (\Sigma_{ZZ}^{-1}\sigma_{ZY})'\mu_Z, \qquad \beta_0 + \beta'z = \mu_Y + \sigma_{ZY}'\Sigma_{ZZ}^{-1}(z - \mu_Z)$$

and

$$\text{mean square error} = \sigma_{YY\cdot Z} = \sigma_{YY} - \sigma_{ZY}'\Sigma_{ZZ}^{-1}\sigma_{ZY}$$

the conclusions follow upon substitution of the maximum likelihood estimators

$$\hat{\mu} = \begin{bmatrix} \bar{y} \\ \bar{z} \end{bmatrix} \quad \text{and} \quad \hat{\Sigma} = \begin{bmatrix} \hat{\sigma}_{YY} & \hat{\sigma}_{ZY}' \\ \hat{\sigma}_{ZY} & \hat{\Sigma}_{ZZ} \end{bmatrix} = \left( \frac{n-1}{n} \right) S$$

for $\mu$ and $\Sigma$. ∎

It is customary to change the divisor from $n$ to $n - (r + 1)$ in the estimator of the mean square error, $\sigma_{YY\cdot Z} = E(Y - \beta_0 - \beta'Z)^2$, in order to obtain the unbiased estimator

$$\frac{n-1}{n-r-1}\left( s_{YY} - s_{ZY}'S_{ZZ}^{-1}s_{ZY} \right) = \frac{\displaystyle\sum_{j=1}^{n}(Y_j - \hat{\beta}_0 - \hat{\beta}'z_j)^2}{n-r-1}$$  (7-58)
Example 7.12 (Maximum likelihood estimate of the regression function, single response)

For the computer data of Example 7.6, the $n = 7$ observations on $Y$ (CPU time), $Z_1$ (orders), and $Z_2$ (add-delete items) give the sample mean vector and sample covariance matrix:

$$\hat{\mu} = \begin{bmatrix} \bar{y} \\ \bar{z} \end{bmatrix} = \begin{bmatrix} 150.44 \\ 130.24 \\ 3.547 \end{bmatrix}$$

$$S = \begin{bmatrix} s_{YY} & s_{ZY}' \\ s_{ZY} & S_{ZZ} \end{bmatrix} = \begin{bmatrix} 467.913 & 418.763 & 35.983 \\ 418.763 & 377.200 & 28.034 \\ 35.983 & 28.034 & 13.657 \end{bmatrix}$$
Assuming that $Y$, $Z_1$, and $Z_2$ are jointly normal, obtain the estimated regression function and the estimated mean square error.

Result 7.13 gives the maximum likelihood estimates

$$\hat{\beta} = S_{ZZ}^{-1}s_{ZY} = \begin{bmatrix} .003128 & -.006422 \\ -.006422 & .086404 \end{bmatrix}\begin{bmatrix} 418.763 \\ 35.983 \end{bmatrix} = \begin{bmatrix} 1.079 \\ .420 \end{bmatrix}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}'\bar{z} = 150.44 - [1.079, .420]\begin{bmatrix} 130.24 \\ 3.547 \end{bmatrix} = 150.44 - 142.019 = 8.421$$

and the estimated regression function

$$\hat{\beta}_0 + \hat{\beta}'z = 8.42 + 1.08z_1 + .42z_2$$

The maximum likelihood estimate of the mean square error arising from the prediction of $Y$ with this regression function is

$$\frac{n-1}{n}\left( s_{YY} - s_{ZY}'S_{ZZ}^{-1}s_{ZY} \right) = \frac{6}{7}\left( 467.913 - [418.763, 35.983]\begin{bmatrix} .003128 & -.006422 \\ -.006422 & .086404 \end{bmatrix}\begin{bmatrix} 418.763 \\ 35.983 \end{bmatrix} \right) = .894 \; \blacksquare$$

Prediction of Several Variables

The extension of the previous results to the prediction of several responses $Y_1, Y_2, \ldots, Y_m$ is almost immediate. We present this extension for normal populations. Suppose

$$\begin{bmatrix} Y \; (m\times 1) \\ Z \; (r\times 1) \end{bmatrix} \quad \text{is distributed as} \quad N_{m+r}(\mu, \Sigma)$$

with

$$\mu = \begin{bmatrix} \mu_Y \; (m\times 1) \\ \mu_Z \; (r\times 1) \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} \Sigma_{YY} \; (m\times m) & \Sigma_{YZ} \; (m\times r) \\ \Sigma_{ZY} \; (r\times m) & \Sigma_{ZZ} \; (r\times r) \end{bmatrix}$$

By Result 4.6, the conditional expectation of $[Y_1, Y_2, \ldots, Y_m]'$, given the fixed values $z_1, z_2, \ldots, z_r$ of the predictor variables, is

$$E(Y \mid z_1, z_2, \ldots, z_r) = \mu_Y + \Sigma_{YZ}\Sigma_{ZZ}^{-1}(z - \mu_Z)$$  (7-59)

This conditional expected value, considered as a function of $z_1, z_2, \ldots, z_r$, is called the multivariate regression of the vector $Y$ on $Z$. It is composed of $m$ univariate regressions. For instance, the first component of the conditional mean vector is $\mu_{Y_1} + \Sigma_{Y_1Z}\Sigma_{ZZ}^{-1}(z - \mu_Z) = E(Y_1 \mid z_1, z_2, \ldots, z_r)$, which minimizes the mean square error for the prediction of $Y_1$. The $m \times r$ matrix $\beta = \Sigma_{YZ}\Sigma_{ZZ}^{-1}$ is called the matrix of regression coefficients.
The error of prediction vector

$$Y - \mu_Y - \Sigma_{YZ}\Sigma_{ZZ}^{-1}(Z - \mu_Z)$$

has the expected squares and cross-products matrix

$$\Sigma_{YY\cdot Z} = E[Y - \mu_Y - \Sigma_{YZ}\Sigma_{ZZ}^{-1}(Z - \mu_Z)][Y - \mu_Y - \Sigma_{YZ}\Sigma_{ZZ}^{-1}(Z - \mu_Z)]'$$
$$= \Sigma_{YY} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY} + \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY} = \Sigma_{YY} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY}$$  (7-60)

Because $\mu$ and $\Sigma$ are typically unknown, they must be estimated from a random sample in order to construct the multivariate linear predictor and determine expected prediction errors.

Result 7.14. Suppose $Y$ and $Z$ are distributed as $N_{m+r}(\mu, \Sigma)$. Then the regression of the vector $Y$ on $Z$ is

$$\beta_0 + \beta z = \mu_Y - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\mu_Z + \Sigma_{YZ}\Sigma_{ZZ}^{-1}z = \mu_Y + \Sigma_{YZ}\Sigma_{ZZ}^{-1}(z - \mu_Z)$$

The expected squares and cross-products matrix for the errors is

$$E(Y - \beta_0 - \beta Z)(Y - \beta_0 - \beta Z)' = \Sigma_{YY\cdot Z} = \Sigma_{YY} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY}$$

Based on a random sample of size $n$, the maximum likelihood estimator of the regression function is

$$\hat{\beta}_0 + \hat{\beta}z = \bar{y} + S_{YZ}S_{ZZ}^{-1}(z - \bar{z})$$

and the maximum likelihood estimator of $\Sigma_{YY\cdot Z}$ is

$$\hat{\Sigma}_{YY\cdot Z} = \frac{n-1}{n}\left( S_{YY} - S_{YZ}S_{ZZ}^{-1}S_{ZY} \right)$$

Proof. The regression function and the covariance matrix for the prediction errors follow from Result 4.6. Using the relationships

$$\beta_0 = \mu_Y - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\mu_Z, \qquad \beta = \Sigma_{YZ}\Sigma_{ZZ}^{-1}, \qquad \beta_0 + \beta z = \mu_Y + \Sigma_{YZ}\Sigma_{ZZ}^{-1}(z - \mu_Z)$$
$$\Sigma_{YY\cdot Z} = \Sigma_{YY} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY} = \Sigma_{YY} - \beta\Sigma_{ZZ}\beta'$$

we deduce the maximum likelihood statements from the invariance property of maximum likelihood estimators [see (4-20)] upon substitution of the maximum likelihood estimators of $\mu$ and $\Sigma$. ∎

It can be shown that an unbiased estimator of $\Sigma_{YY\cdot Z}$ is

$$\frac{n-1}{n-r-1}\left( S_{YY} - S_{YZ}S_{ZZ}^{-1}S_{ZY} \right)$$  (7-61)
Example 7.13 (Maximum likelihood estimates of the regression functions, two responses)
We return to the computer data given in Examples 7.6 and 7.10. For $Y_1 = $ CPU time, $Y_2 = $ disk input/output capacity, $Z_1 = $ orders, and $Z_2 = $ add-delete items, we have $n = 7$ and

$$\hat{\mu} = \begin{bmatrix} \bar{y} \\ \bar{z} \end{bmatrix} = \begin{bmatrix} 150.44 \\ 327.79 \\ 130.24 \\ 3.547 \end{bmatrix}$$

$$S = \begin{bmatrix} S_{YY} & S_{YZ} \\ S_{ZY} & S_{ZZ} \end{bmatrix} = \begin{bmatrix} 467.913 & 1148.536 & 418.763 & 35.983 \\ 1148.536 & 3072.491 & 1008.976 & 140.558 \\ 418.763 & 1008.976 & 377.200 & 28.034 \\ 35.983 & 140.558 & 28.034 & 13.657 \end{bmatrix}$$
Assuming normality, we find that the estimated regression function is

$$\hat{\beta}_0 + \hat{\beta}z = \bar{y} + S_{YZ}S_{ZZ}^{-1}(z - \bar{z})$$
$$= \begin{bmatrix} 150.44 \\ 327.79 \end{bmatrix} + \begin{bmatrix} 418.763 & 35.983 \\ 1008.976 & 140.558 \end{bmatrix}\begin{bmatrix} .003128 & -.006422 \\ -.006422 & .086404 \end{bmatrix}\begin{bmatrix} z_1 - 130.24 \\ z_2 - 3.547 \end{bmatrix}$$
$$= \begin{bmatrix} 150.44 + 1.079(z_1 - 130.24) + .420(z_2 - 3.547) \\ 327.79 + 2.254(z_1 - 130.24) + 5.665(z_2 - 3.547) \end{bmatrix}$$

Thus, the minimum mean square error predictor of $Y_1$ is

$$150.44 + 1.079(z_1 - 130.24) + .420(z_2 - 3.547) = 8.42 + 1.08z_1 + .42z_2$$

Similarly, the best predictor of $Y_2$ is

$$14.14 + 2.25z_1 + 5.67z_2$$

The maximum likelihood estimate of the expected squared errors and cross-products matrix $\Sigma_{YY\cdot Z}$ is given by

$$\frac{n-1}{n}\left( S_{YY} - S_{YZ}S_{ZZ}^{-1}S_{ZY} \right)$$
$$= \frac{6}{7}\left( \begin{bmatrix} 467.913 & 1148.536 \\ 1148.536 & 3072.491 \end{bmatrix} - \begin{bmatrix} 418.763 & 35.983 \\ 1008.976 & 140.558 \end{bmatrix}\begin{bmatrix} .003128 & -.006422 \\ -.006422 & .086404 \end{bmatrix}\begin{bmatrix} 418.763 & 1008.976 \\ 35.983 & 140.558 \end{bmatrix} \right)$$
$$= \frac{6}{7}\begin{bmatrix} 1.043 & 1.042 \\ 1.042 & 2.572 \end{bmatrix} = \begin{bmatrix} .894 & .893 \\ .893 & 2.205 \end{bmatrix}$$
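Both rows of the estimated coefficient matrix in Example 7.13 (and hence the single-response fit of Example 7.12, which is the first row) can be reproduced from the printed sample moments. A pure-Python sketch; because the printed values are rounded to three decimals, the last digit can drift slightly from the text's figures.

```python
# Coefficient matrix B = S_YZ S_ZZ^{-1} and intercepts for Example 7.13 (pure Python).
ybar = [150.44, 327.79]
zbar = [130.24, 3.547]
S_yz = [[418.763, 35.983], [1008.976, 140.558]]     # Cov(Y, Z) block of S
S_zz = [[377.200, 28.034], [28.034, 13.657]]        # Cov(Z) block of S

d = S_zz[0][0] * S_zz[1][1] - S_zz[0][1] * S_zz[1][0]
S_inv = [[S_zz[1][1] / d, -S_zz[0][1] / d],
         [-S_zz[1][0] / d, S_zz[0][0] / d]]

# One row of slopes per response.
B = [[sum(S_yz[i][k] * S_inv[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]
b0 = [ybar[i] - sum(B[i][k] * zbar[k] for k in range(2)) for i in range(2)]
# Row 1 gives the CPU-time equation (about 8.42 + 1.08 z1 + .42 z2);
# row 2 the disk I/O equation (about 14.14 + 2.25 z1 + 5.67 z2).
```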
The first estimated regression function, $8.42 + 1.08z_1 + .42z_2$, and the associated mean square error, .894, are the same as those in Example 7.12 for the single-response case. Similarly, the second estimated regression function, $14.14 + 2.25z_1 + 5.67z_2$, is the same as that given in Example 7.10.

We see that the data enable us to predict the first response, $Y_1$, with smaller error than the second response, $Y_2$. The positive covariance .893 indicates that overprediction (underprediction) of CPU time tends to be accompanied by overprediction (underprediction) of disk capacity. ∎

Comment. Result 7.14 states that the assumption of a joint normal distribution for the whole collection $Y_1, Y_2, \ldots, Y_m, Z_1, Z_2, \ldots, Z_r$ leads to the prediction equations

$$\hat{y}_1 = \hat{\beta}_{01} + \hat{\beta}_{11}z_1 + \cdots + \hat{\beta}_{r1}z_r$$
$$\hat{y}_2 = \hat{\beta}_{02} + \hat{\beta}_{12}z_1 + \cdots + \hat{\beta}_{r2}z_r$$
$$\vdots$$
$$\hat{y}_m = \hat{\beta}_{0m} + \hat{\beta}_{1m}z_1 + \cdots + \hat{\beta}_{rm}z_r$$

We note the following:

1. The same values, $z_1, z_2, \ldots, z_r$, are used to predict each $Y_i$.
2. The $\hat{\beta}_{ik}$ are estimates of the $(i, k)$th entry of the regression coefficient matrix $\beta = \Sigma_{YZ}\Sigma_{ZZ}^{-1}$ for $i, k \ge 1$.

Partial Correlation Coefficient

We conclude this discussion of the regression problem by introducing one further correlation coefficient. Consider the pair of errors

$$Y_1 - \mu_{Y_1} - \Sigma_{Y_1Z}\Sigma_{ZZ}^{-1}(Z - \mu_Z)$$
$$Y_2 - \mu_{Y_2} - \Sigma_{Y_2Z}\Sigma_{ZZ}^{-1}(Z - \mu_Z)$$

obtained from using the best linear predictors to predict $Y_1$ and $Y_2$. Their correlation, determined from the error covariance matrix $\Sigma_{YY\cdot Z} = \Sigma_{YY} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY}$, measures the association between $Y_1$ and $Y_2$ after eliminating the effects of $Z_1, Z_2, \ldots, Z_r$.

We define the partial correlation coefficient between $Y_1$ and $Y_2$, eliminating $Z_1, Z_2, \ldots, Z_r$, by

$$\rho_{Y_1Y_2\cdot Z} = \frac{\sigma_{Y_1Y_2\cdot Z}}{\sqrt{\sigma_{Y_1Y_1\cdot Z}}\sqrt{\sigma_{Y_2Y_2\cdot Z}}}$$  (7-62)

where $\sigma_{Y_iY_k\cdot Z}$ is the $(i, k)$th entry in the matrix $\Sigma_{YY\cdot Z} = \Sigma_{YY} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY}$. The corresponding sample partial correlation coefficient is

$$r_{Y_1Y_2\cdot Z} = \frac{s_{Y_1Y_2\cdot Z}}{\sqrt{s_{Y_1Y_1\cdot Z}}\sqrt{s_{Y_2Y_2\cdot Z}}}$$  (7-63)
with $s_{Y_iY_k\cdot Z}$ the $(i, k)$th element of $S_{YY} - S_{YZ}S_{ZZ}^{-1}S_{ZY}$. Assuming that $Y$ and $Z$ have a joint multivariate normal distribution, we find that the sample partial correlation coefficient in (7-63) is the maximum likelihood estimator of the partial correlation coefficient in (7-62).
Example 7.14 (Calculating a partial correlation)

From the computer data in Example 7.13,

$$S_{YY} - S_{YZ}S_{ZZ}^{-1}S_{ZY} = \begin{bmatrix} 1.043 & 1.042 \\ 1.042 & 2.572 \end{bmatrix}$$

Therefore, by (7-63),

$$r_{Y_1Y_2\cdot Z} = \frac{s_{Y_1Y_2\cdot Z}}{\sqrt{s_{Y_1Y_1\cdot Z}}\sqrt{s_{Y_2Y_2\cdot Z}}} = \frac{1.042}{\sqrt{1.043}\sqrt{2.572}} = .64$$

Calculating the ordinary correlation coefficient, we obtain $r_{Y_1Y_2} = .96$. Comparing the two correlation coefficients, we see that the association between $Y_1$ and $Y_2$ has been sharply reduced after eliminating the effects of the variables $Z$ on both responses. ∎
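The partial correlation in Example 7.14 uses only the entries of the residual covariance matrix, so the check is a one-liner:

```python
# Sample partial correlation r_{Y1 Y2 . Z} (Example 7.14), pure Python.
import math

resid_cov = [[1.043, 1.042], [1.042, 2.572]]   # S_YY - S_YZ S_ZZ^{-1} S_ZY
r_partial = resid_cov[0][1] / math.sqrt(resid_cov[0][0] * resid_cov[1][1])
# about .64, versus the ordinary correlation r_{Y1 Y2} = .96
```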
7.9 COMPARING THE TWO FORMULATIONS OF THE REGRESSION MODEL
In Sections 7.2 and 7.7, we presented the multiple regression models for one and several response variables, respectively. In these treatments, the predictor variables had fixed values $z_j$ at the $j$th trial. Alternatively, we can start, as in Section 7.8, with a set of variables that have a joint normal distribution. The process of conditioning on one subset of variables in order to predict values of the other set leads to a conditional expectation, the multivariate regression model. The two approaches to multiple regression are related. To show this relationship explicitly, we introduce two minor variants of the regression model formulation.

Mean Corrected Form of the Regression Model

For any response variable $Y$, the multiple regression model asserts that

$$Y_j = \beta_0 + \beta_1z_{1j} + \cdots + \beta_rz_{rj} + \varepsilon_j$$

The predictor variables can be "centered" by subtracting their means. For instance, $\beta_1z_{1j} = \beta_1(z_{1j} - \bar{z}_1) + \beta_1\bar{z}_1$, and we can write

$$Y_j = (\beta_0 + \beta_1\bar{z}_1 + \cdots + \beta_r\bar{z}_r) + \beta_1(z_{1j} - \bar{z}_1) + \cdots + \beta_r(z_{rj} - \bar{z}_r) + \varepsilon_j$$
$$= \beta_* + \beta_1(z_{1j} - \bar{z}_1) + \cdots + \beta_r(z_{rj} - \bar{z}_r) + \varepsilon_j$$  (7-64)
with $\beta_* = \beta_0 + \beta_1\bar{z}_1 + \cdots + \beta_r\bar{z}_r$. The mean corrected design matrix corresponding to the reparameterization in (7-64) is

$$Z_c = \begin{bmatrix} 1 & z_{11} - \bar{z}_1 & \cdots & z_{1r} - \bar{z}_r \\ 1 & z_{21} - \bar{z}_1 & \cdots & z_{2r} - \bar{z}_r \\ \vdots & \vdots & & \vdots \\ 1 & z_{n1} - \bar{z}_1 & \cdots & z_{nr} - \bar{z}_r \end{bmatrix}$$

where the last $r$ columns are each perpendicular to the first column, since

$$\sum_{j=1}^{n} 1(z_{ji} - \bar{z}_i) = 0, \quad i = 1, 2, \ldots, r$$

Further, setting $Z_c = [\,1 \mid Z_{c2}\,]$ with $Z_{c2}'1 = 0$, we obtain

$$\begin{bmatrix} \hat{\beta}_* \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_r \end{bmatrix} = (Z_c'Z_c)^{-1}Z_c'y = \begin{bmatrix} n & 0' \\ 0 & Z_{c2}'Z_{c2} \end{bmatrix}^{-1}\begin{bmatrix} 1'y \\ Z_{c2}'y \end{bmatrix} = \begin{bmatrix} \bar{y} \\ (Z_{c2}'Z_{c2})^{-1}Z_{c2}'y \end{bmatrix}$$  (7-65)

That is, the regression coefficients $[\beta_1, \beta_2, \ldots, \beta_r]'$ are unbiasedly estimated by $(Z_{c2}'Z_{c2})^{-1}Z_{c2}'y$ and $\beta_*$ is estimated by $\bar{y}$. Because the definitions of $\beta_1, \beta_2, \ldots, \beta_r$ remain unchanged by the reparameterization in (7-64), their best estimates computed from the design matrix $Z_c$ are exactly the same as the best estimates computed from the design matrix $Z$. Thus, setting $\hat{\beta}_c' = [\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_r]$, the linear predictor of $Y$ can be written as

$$\hat{y} = \hat{\beta}_* + \hat{\beta}_c'(z - \bar{z}) = \bar{y} + y'Z_{c2}(Z_{c2}'Z_{c2})^{-1}(z - \bar{z})$$  (7-66)

with $(z - \bar{z}) = [z_1 - \bar{z}_1, z_2 - \bar{z}_2, \ldots, z_r - \bar{z}_r]'$. Finally, the estimators $\hat{\beta}_*$ and $\hat{\beta}_c$ are uncorrelated, with

$$\mathrm{Var}(\hat{\beta}_*) = \frac{\sigma^2}{n}, \qquad \mathrm{Cov}(\hat{\beta}_c) = \sigma^2(Z_{c2}'Z_{c2})^{-1}$$  (7-67)
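The content of (7-65) and (7-66) is easy to verify numerically in the simplest case of one predictor: the slope is unchanged by mean correcting, and the intercept of the centered fit is $\bar{y}$. A small pure-Python sketch; the data values are made up purely for illustration.

```python
# Centered vs. uncentered least squares for one predictor (illustrative data only).
z = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(z)
zbar = sum(z) / n
ybar = sum(y) / n

# Uncentered fit y = b0 + b1*z from the raw 2x2 normal equations.
sz, sy = sum(z), sum(y)
szz_raw = sum(zi * zi for zi in z)
szy_raw = sum(zi * yi for zi, yi in zip(z, y))
b1 = (n * szy_raw - sz * sy) / (n * szz_raw - sz * sz)
b0 = (sy - b1 * sz) / n

# Mean corrected fit y = bstar + b1c*(z - zbar):
# the slope uses only the centered column (Zc2), and bstar is ybar, as in (7-65).
b1c = sum((zi - zbar) * yi for zi, yi in zip(z, y)) / sum((zi - zbar) ** 2 for zi in z)
bstar = ybar
# b1c equals b1, and bstar equals b0 + b1*zbar, matching (7-66).
```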
The multivariate multiple regression model yields the same mean corrected design matrix for each response. The least squares estimates of the coefficient vectors for the $i$th response are given by

$$\hat{\beta}_{(i)} = \begin{bmatrix} \bar{y}_i \\ (Z_{c2}'Z_{c2})^{-1}Z_{c2}'y_{(i)} \end{bmatrix}, \quad i = 1, 2, \ldots, m$$  (7-68)

Sometimes, for even further numerical stability, "standardized" input variables

$$\frac{z_{ji} - \bar{z}_i}{\sqrt{\sum_{j=1}^{n}(z_{ji} - \bar{z}_i)^2}} = \frac{z_{ji} - \bar{z}_i}{\sqrt{(n-1)s_{z_iz_i}}}$$

are used. In this case, the slope coefficients $\beta_i$ in the regression model are replaced by $\tilde{\beta}_i = \beta_i\sqrt{(n-1)s_{z_iz_i}}$. The least squares estimates of the beta coefficients $\tilde{\beta}_i$ become $\hat{\tilde{\beta}}_i = \hat{\beta}_i\sqrt{(n-1)s_{z_iz_i}}$, $i = 1, 2, \ldots, r$. These relationships hold for each response in the multivariate multiple regression situation.

Relating the Formulations
When the variables $Y, Z_1, Z_2, \ldots, Z_r$ are jointly normal, the estimated predictor of $Y$ (see Result 7.13) is

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}'z = \bar{y} + s_{ZY}'S_{ZZ}^{-1}(z - \bar{z}) = \hat{\mu}_Y + \hat{\sigma}_{ZY}'\hat{\Sigma}_{ZZ}^{-1}(z - \hat{\mu}_Z)$$  (7-69)

where the estimation procedure leads naturally to the introduction of centered $z_i$'s.

Recall from the mean corrected form of the regression model that the best linear predictor of $Y$ [see (7-66)] is

$$\hat{y} = \hat{\beta}_* + \hat{\beta}_c'(z - \bar{z})$$

with $\hat{\beta}_* = \bar{y}$ and $\hat{\beta}_c' = y'Z_{c2}(Z_{c2}'Z_{c2})^{-1}$. Comparing (7-66) and (7-69), we see that $\hat{\beta}_0 + \hat{\beta}'\bar{z} = \bar{y} = \hat{\beta}_*$ and $\hat{\beta}_c = \hat{\beta}$, since⁷

$$s_{ZY}'S_{ZZ}^{-1} = y'Z_{c2}(Z_{c2}'Z_{c2})^{-1}$$  (7-70)

Therefore, both the normal theory conditional mean and the classical regression model approaches yield exactly the same linear predictors.
Example 7.15 (Two approaches yield the same linear predictor)

The computer data with the single response $Y_1 = $ CPU time were analyzed in Example 7.6 using the classical linear regression model. The same data were analyzed again in Example 7.12, assuming that the variables $Y_1$, $Z_1$, and $Z_2$ were jointly normal, so that the best predictor of $Y_1$ is the conditional mean of $Y_1$ given $z_1$ and $z_2$. Both approaches yielded the same predictor,

$$\hat{y}_1 = 8.42 + 1.08z_1 + .42z_2 \; \blacksquare$$
⁷ The identity in (7-70) is established by writing $y = (y - \bar{y}1) + \bar{y}1$ so that

$$y'Z_{c2} = (y - \bar{y}1)'Z_{c2} + \bar{y}1'Z_{c2} = (y - \bar{y}1)'Z_{c2} + 0' = (y - \bar{y}1)'Z_{c2}$$

Consequently,

$$y'Z_{c2}(Z_{c2}'Z_{c2})^{-1} = (y - \bar{y}1)'Z_{c2}(Z_{c2}'Z_{c2})^{-1} = (n-1)s_{ZY}'[(n-1)S_{ZZ}]^{-1} = s_{ZY}'S_{ZZ}^{-1}$$
Although the two formulations of the linear prediction problem yield the same predictor values, conceptually they are quite different. For the model in (7-3) or (7-26), the values of the input variables are assumed to be set by the experimenter. In the conditional mean model (7-57) or (7-59), the values of the predictor variables are random variables that are observed along with the values of the response variables. The assumptions underlying the second approach are more stringent, but with them the predictor is optimal among all choices, rather than merely among linear predictors.

We close by noting that the multivariate regression calculations in either case can be couched in terms of the sample mean vectors $\bar{y}$ and $\bar{z}$ and the sample sums of squares and cross-products matrix:

$$\begin{bmatrix} \displaystyle\sum_{j=1}^{n}(y_j - \bar{y})(y_j - \bar{y})' & \displaystyle\sum_{j=1}^{n}(y_j - \bar{y})(z_j - \bar{z})' \\ \displaystyle\sum_{j=1}^{n}(z_j - \bar{z})(y_j - \bar{y})' & \displaystyle\sum_{j=1}^{n}(z_j - \bar{z})(z_j - \bar{z})' \end{bmatrix}$$
This is the only information necessary to compute the estimated regression coefficients and their estimated covariances. Of course, an important part of regression analysis is model checking. This requires the residuals (errors), which must be calculated using all the original data.

7.10 MULTIPLE REGRESSION MODELS WITH TIME DEPENDENT ERRORS

For data collected over time, observations in different time periods are often related, or autocorrelated. Consequently, in a regression context, the observations on the dependent variable or, equivalently, the errors, cannot be independent. As indicated in our discussion of independence in Section 5.8, time dependence in the observations can invalidate inferences made using the usual independence assumption. Similarly, inferences in regression can be misleading when regression models are fit to time ordered data and the standard regression assumptions are used. This issue is important, so in the example that follows, we not only show how to detect the presence of time dependence, but also how to incorporate this dependence into the multiple regression model.

Example 7.16 (Incorporating time dependent errors in a regression model)
Rather than use the daily average temperature, it is customary to use degree heating days (DHD) = 65 deg − daily average temperature. A large number for DHD indicates a cold day. Wind speed, again a daily average, can also be a factor in the sendout amount. Because many businesses close for the weekend, the demand for natural gas is typically less on a weekend day. Data on these variables for one winter in a major northern city are shown, in part, in Table 7.4. (See the CD-ROM for the complete data set. There are n = 63 observations.)

TABLE 7.4  NATURAL GAS DATA

     y         z1     z2        z3          z4
     Sendout   DHD    DHDLag    Windspeed   Weekend
     227       32     30        12          1
     236       31     32         8          1
     228       30     31         8          0
     252       34     30         8          0
     238       28     34        12          0
      ⋮         ⋮      ⋮          ⋮          ⋮
     333       46     41         8          0
     266       33     46         8          0
     280       38     33        18          0
     386       52     38        22          0
     415       57     52        18          0
Initially, we developed a regression model relating gas sendout to degree heating days, wind speed, and a weekend dummy variable. Other variables likely to have some effect on natural gas consumption, like percent cloud cover, are subsumed in the error term. After several attempts, we decided to include not only the current DHD but also the DHD of the previous day. (The degree heating days lagged one time period are denoted by DHDLag in Table 7.4.) The fitted model is

    Sendout = 1.858 + 5.874 DHD + 1.315 DHDLag + 1.405 Windspeed − 15.857 Weekend

with R² = .952. All the coefficients, with the exception of the intercept, are significant, and it looks like we have a very good fit. (The intercept term could be dropped. When this is done, the results do not change substantially.) However, if we calculate the lag 1 autocorrelation of the residuals, we get

    r₁(ε̂) = Σ_{j=2}^{n} ε̂j ε̂_{j−1} / Σ_{j=1}^{n} ε̂j² = .52
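The lag 1 autocorrelation diagnostic above is easy to compute once residuals are in hand. The following Python/NumPy sketch uses simulated data with AR(1) errors (the full Table 7.4 values are not reproduced here), so the predictor and the numbers are illustrative only:

```python
import numpy as np

def lag_autocorrelation(e, lag=1):
    """r_k(e) = sum_{j>k} e_j e_{j-k} / sum_j e_j^2, as in the text."""
    e = np.asarray(e, dtype=float)
    return float(np.sum(e[lag:] * e[:-lag]) / np.sum(e ** 2))

# Simulated sendout-like data: linear in one predictor, AR(1) errors.
rng = np.random.default_rng(0)
n = 63
z = rng.normal(30.0, 8.0, size=n)        # stand-in for DHD
noise = np.zeros(n)
for j in range(1, n):
    noise[j] = 0.5 * noise[j - 1] + rng.normal(0.0, 5.0)
y = 2.0 + 6.0 * z + noise

# Ordinary least squares fit and its residuals
Z = np.column_stack([np.ones(n), z])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
resid = y - Z @ beta

r1 = lag_autocorrelation(resid, lag=1)
```

A value of r1 well outside ±2/√n (about ±.25 for n = 63) signals time dependence, which is exactly the situation diagnosed in the example.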
Chapter 7 Multivariate Linear Regression Models
The value, .52, of the lag 1 autocorrelation is too large to be ignored. A plot of the residual autocorrelations for the first 15 lags also shows that there might be some dependence among the errors 7 time periods, or one week, apart. This amount of dependence invalidates the t-tests and P-values associated with the coefficients in the model.

The first step toward correcting the model is to replace the presumed independent errors in the regression model for sendout by possibly dependent noise terms Nj. That is, we relate each Nj to its value N_{j−1} one time period ago, its value N_{j−7} one week ago, and an independent error εj. Thus, we consider

    Nj = φ₁ N_{j−1} + φ₂ N_{j−7} + εj

where the εj are independent normal random variables with mean 0 and variance σ². The form of this equation for Nj is known as an autoregressive model. (See [7].) The SAS commands and part of the output from fitting this combined regression model for sendout, with an autoregressive model for the noise, are shown in Panel 7.3 on page 413.

The fitted model is

    Sendout = 2.1 + 5.810 DHD + 1.426 DHDLag + 1.207 Windspeed − 10.109 Weekend

and the time dependence in the noise terms is estimated by

    N̂j = .470 N̂_{j−1} + .240 N̂_{j−7}

The variance of ε is estimated to be σ̂² = 228.89.

From Panel 7.3, we see that the autocorrelations of the residuals from the enriched model are all negligible. Each is within two estimated standard errors of 0. Also, the sums of squares of groups of consecutive residual autocorrelations are not large, as judged by the P-values for the chi-square statistics that test whether groups of consecutive autocorrelations are simultaneously equal to 0. The groups examined in Panel 7.3 are those of lags 1-6, 1-12, 1-18, and 1-24. The noise is now adequately modeled. The tests concerning the coefficient of each predictor variable, the significance of the regression, and so forth, are now valid.⁸ The intercept term in the final model can be dropped. When this is done, there is very little change in the results. The enriched model, which incorporates the time dependence of the errors, has a better chance of adequately representing natural gas sendout and can now be used to forecast sendout given values of the potential predictors. We will not pursue prediction here, since it involves ideas beyond the scope of this book. (See [7].) ■
8These tests are obtained by the extra sum of squares procedure but applied to the regression plus autoregressive noise model. The tests are those described in the computer output.
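Without access to PROC ARIMA's maximum likelihood machinery, the combined regression-plus-autoregressive-noise model can be approximated by alternating between the regression and the autoregression, in the spirit of the Cochrane-Orcutt procedure. The Python/NumPy sketch below runs on simulated data (not the Table 7.4 values), and these two-stage estimates will generally differ slightly from full maximum likelihood:

```python
import numpy as np

def fit_ols(Z, y):
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta

def regression_with_ar_noise(Z, y, lags=(1, 7), n_iter=10):
    """Alternate between (1) estimating the AR coefficients phi from the
    current residuals and (2) re-estimating beta by OLS on data
    quasi-differenced with those phi (Cochrane-Orcutt style sketch)."""
    n = len(y)
    m = max(lags)
    beta = fit_ols(Z, y)                      # start from ordinary OLS
    phi = np.zeros(len(lags))
    for _ in range(n_iter):
        resid = y - Z @ beta
        # N_j = phi1 N_{j-1} + phi2 N_{j-7} + eps_j, fit by least squares
        R = np.column_stack([resid[m - L:n - L] for L in lags])
        phi = fit_ols(R, resid[m:])
        # quasi-difference so the transformed errors are (nearly) independent
        y_star = y[m:] - sum(p * y[m - L:n - L] for p, L in zip(phi, lags))
        Z_star = Z[m:] - sum(p * Z[m - L:n - L] for p, L in zip(phi, lags))
        beta = fit_ols(Z_star, y_star)
    return beta, phi

# Simulated data with known structure: beta = (2, 6), phi = (.45, .25)
rng = np.random.default_rng(0)
n = 400
z = rng.normal(30.0, 8.0, size=n)
noise = np.zeros(n)
for j in range(7, n):
    noise[j] = 0.45 * noise[j - 1] + 0.25 * noise[j - 7] + rng.normal(0.0, 2.0)
Z = np.column_stack([np.ones(n), z])
y = 2.0 + 6.0 * z + noise

beta, phi = regression_with_ar_noise(Z, y)
```

The recovered slope and AR coefficients should be close to the values used in the simulation, illustrating how the regression and noise structures can be estimated jointly.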
When modeling relationships using time ordered data, regression models with noise structures that allow for the time dependence are often useful. Modern software packages, like SAS, allow the analyst to easily fit these expanded models.
PANEL 7.3  SAS ANALYSIS FOR EXAMPLE 7.16 USING PROC ARIMA

PROGRAM COMMANDS:

    data a;
      infile 'T7-4.dat';
      time = _n_;
      input obsend dhd dhdlag wind xweekend;
    proc arima data = a;
      identify var = obsend crosscor = ( dhd dhdlag wind xweekend );
      estimate p = (1 7) method = ml input = ( dhd dhdlag wind xweekend ) plot;
      estimate p = (1 7) noconstant method = ml
               input = ( dhd dhdlag wind xweekend ) plot;

OUTPUT (ARIMA Procedure):

Maximum Likelihood Estimation

    Parameter   Estimate    Approx. Std Error   T Ratio   Lag   Variable    Shift
    MU             2.1         13.12340           0.16     0    OBSEND        0
    AR1,1           .470        0.11779           3.99     1    OBSEND        0
    AR1,2           .240        0.11528           2.08     7    OBSEND        0
    NUM1           5.810        0.24047          24.16     0    DHD           0
    NUM2           1.426        0.24932           5.72     0    DHDLAG        0
    NUM3           1.207        0.44681           2.70     0    WIND          0
    NUM4         −10.109        6.03445          −1.68     0    XWEEKEND      0

    Constant Estimate   =   0.61770069
    Std Error Estimate  =  15.1292441
    AIC                 = 528.490321
    SBC                 = 543.492264
    Number of Residuals =  63

Autocorrelation Check of Residuals

    To Lag   Chi Square   DF   Autocorrelations
       6        6.04       4    0.079   0.012   0.022   0.192  −0.127   0.161
      12       10.27      10    0.144  −0.067  −0.111  −0.056  −0.056  −0.108
      18       15.92      16    0.013   0.106  −0.137  −0.170  −0.079   0.018
      24       23.44      22    0.018   0.004   0.250  −0.080  −0.069  −0.051

Autocorrelation Plot of Residuals (bar chart omitted; "." marks two standard errors)

    Lag   Covariance     Correlation
     0    228.894         1.00000
     1     18.194945      0.07949
     2      2.763255      0.01207
     3      5.038727      0.02201
     4     44.059835      0.19249
     5    −29.118892     −0.12722
     6     36.904291      0.16123
     7     33.008858      0.14421
     8    −15.424015     −0.06738
     9    −25.379057     −0.11088
    10    −12.890888     −0.05632
    11    −12.777280     −0.05582
    12    −24.825623     −0.10846
    13      2.970197      0.01298
    14     24.150168      0.10551
    15    −31.407314     −0.13721
SUPPLEMENT 7A
The Distribution of the Likelih ood Ratio for the Multivariate Multiple Regression Model
The development in this supplement establishes Result 7.11.

We know that nΣ̂ = Y'(I − Z(Z'Z)⁻¹Z')Y and, under H₀, nΣ̂₁ = Y'(I − Z₁(Z₁'Z₁)⁻¹Z₁')Y with Y = Z₁β₍₁₎ + ε. Set P = I − Z(Z'Z)⁻¹Z'. Since

    0 = [I − Z(Z'Z)⁻¹Z']Z = [I − Z(Z'Z)⁻¹Z'][Z₁  Z₂] = [PZ₁  PZ₂]

the columns of Z are perpendicular to P. Thus, we can write

    nΣ̂ = (Zβ + ε)'P(Zβ + ε) = ε'Pε
    nΣ̂₁ = (Z₁β₍₁₎ + ε)'P₁(Z₁β₍₁₎ + ε) = ε'P₁ε

where P₁ = I − Z₁(Z₁'Z₁)⁻¹Z₁'. We then use the Gram-Schmidt process (see Result 2A.3) to construct the orthonormal vectors [g₁, g₂, ..., g_{q+1}] = G from the columns of Z₁. Then we continue, obtaining the orthonormal set from [G, Z₂], and finally complete the set to n dimensions by constructing an arbitrary orthonormal set of n − r − 1 vectors orthogonal to the previous vectors. Consequently, we have

    g₁, g₂, ..., g_{q+1}:          from columns of Z₁
    g_{q+2}, g_{q+3}, ..., g_{r+1}: from columns of Z₂ but perpendicular to columns of Z₁
    g_{r+2}, g_{r+3}, ..., g_n:     arbitrary set of orthonormal vectors orthogonal to columns of Z

Let (λ, e) be an eigenvalue-eigenvector pair of Z₁(Z₁'Z₁)⁻¹Z₁'. Then, since [Z₁(Z₁'Z₁)⁻¹Z₁'][Z₁(Z₁'Z₁)⁻¹Z₁'] = Z₁(Z₁'Z₁)⁻¹Z₁',

    λe = Z₁(Z₁'Z₁)⁻¹Z₁'e = (Z₁(Z₁'Z₁)⁻¹Z₁')²e = λ Z₁(Z₁'Z₁)⁻¹Z₁'e = λ²e

so λ(λ − 1) = 0, and the eigenvalues of Z₁(Z₁'Z₁)⁻¹Z₁' are 0 or 1. Moreover,

    tr(Z₁(Z₁'Z₁)⁻¹Z₁') = tr((Z₁'Z₁)⁻¹Z₁'Z₁) = tr(I_{(q+1)×(q+1)}) = q + 1 = λ₁ + λ₂ + ··· + λ_{q+1}

where λ₁ ≥ λ₂ ≥ ··· ≥ λ_{q+1} > 0 are the nonzero eigenvalues of Z₁(Z₁'Z₁)⁻¹Z₁'. This shows that Z₁(Z₁'Z₁)⁻¹Z₁' has q + 1 eigenvalues equal to 1. Now, (Z₁(Z₁'Z₁)⁻¹Z₁')Z₁ = Z₁, so any linear combination Z₁bₗ of unit length is an eigenvector corresponding to the eigenvalue 1. The orthonormal vectors gₗ, ℓ = 1, 2, ..., q + 1, are therefore eigenvectors of Z₁(Z₁'Z₁)⁻¹Z₁', since they are formed by taking particular linear combinations of the columns of Z₁. By the spectral decomposition (2-16), we have

    Z₁(Z₁'Z₁)⁻¹Z₁' = Σ_{ℓ=1}^{q+1} gₗgₗ'

Similarly, by writing (Z(Z'Z)⁻¹Z')Z = Z, we readily see that the linear combination Zbₗ = gₗ, for example, is an eigenvector of Z(Z'Z)⁻¹Z' with eigenvalue λ = 1, so that

    Z(Z'Z)⁻¹Z' = Σ_{ℓ=1}^{r+1} gₗgₗ'

Continuing, we have PZ = [I − Z(Z'Z)⁻¹Z']Z = Z − Z = 0, so gₗ = Zbₗ, ℓ ≤ r + 1, are eigenvectors of P with eigenvalue λ = 0. Also, from the way the gₗ, ℓ > r + 1, were constructed, Z'gₗ = 0, so that Pgₗ = gₗ. Consequently, these gₗ are eigenvectors of P corresponding to the n − r − 1 unit eigenvalues. By the spectral decomposition (2-16), P = Σ_{ℓ=r+2}^{n} gₗgₗ' and

    nΣ̂ = ε'Pε = Σ_{ℓ=r+2}^{n} (ε'gₗ)(ε'gₗ)' = Σ_{ℓ=r+2}^{n} VₗVₗ'

where Vₗ = ε'gₗ. Because Cov(Vₗᵢ, Vⱼₖ) = E(gₗ'ε₍ᵢ₎ε₍ₖ₎'gⱼ) = σᵢₖ gₗ'gⱼ = 0 for ℓ ≠ j, the Vₗ = [Vₗ₁, Vₗ₂, ..., Vₗₘ]' are independently distributed as N_m(0, Σ). Consequently, by (4-22), nΣ̂ is distributed as W_{p, n−r−1}(Σ). In the same manner, P₁gₗ = gₗ for ℓ > q + 1, so

    P₁ = Σ_{ℓ=q+2}^{n} gₗgₗ'

We can write the extra sum of squares and cross-products as

    n(Σ̂₁ − Σ̂) = ε'(P₁ − P)ε = Σ_{ℓ=q+2}^{r+1} (ε'gₗ)(ε'gₗ)' = Σ_{ℓ=q+2}^{r+1} VₗVₗ'

where the Vₗ are independently distributed as N_m(0, Σ). By (4-22), n(Σ̂₁ − Σ̂) is distributed as W_{p, r−q}(Σ) independently of nΣ̂, since n(Σ̂₁ − Σ̂) involves a different set of independent Vₗ's.

The large sample distribution for

    −[n − r − 1 − ½(m − r + q + 1)] ln(|Σ̂| / |Σ̂₁|)

follows from Result 5.2, with

    ν − ν₀ = m(m + 1)/2 + m(r + 1) − [m(m + 1)/2 + m(q + 1)] = m(r − q)   d.f.

The use of (n − r − 1 − ½(m − r + q + 1)) instead of n in the statistic is due to Bartlett [3], following Box [6], and it improves the chi-square approximation.
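The facts driving the proof, namely that the projection matrix Z1(Z1'Z1)⁻¹Z1' has eigenvalues 0 or 1 and trace q + 1, can be confirmed numerically. A Python/NumPy sketch with an arbitrary full-rank Z1 (illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 20, 2
# n x (q + 1) design matrix: intercept column plus q random predictors
Z1 = np.column_stack([np.ones(n), rng.normal(size=(n, q))])

# projection (hat) matrix Z1 (Z1'Z1)^{-1} Z1'
H1 = Z1 @ np.linalg.inv(Z1.T @ Z1) @ Z1.T
eig = np.linalg.eigvalsh(H1)     # eigenvalues in ascending order
```

Exactly q + 1 eigenvalues equal 1 and the rest are 0, so the trace is q + 1, the dimension count used for the Wishart degrees of freedom.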
E X E RCI SES
7.1. Given the data

    z1 | 10   5   7  19  11   8
    y  | 15   9   3  25   7  13

fit the linear regression model Yj = β₀ + β₁ zj1 + εj, j = 1, 2, ..., 6. Specifically, calculate the least squares estimates β̂, the fitted values ŷ, the residuals ε̂, and the residual sum of squares ε̂'ε̂.

7.2. Given the data

    z1 | 10   5   7  19  11   8
    z2 |  2   3   3   6   7   9
    y  | 15   9   3  25   7  13

fit the regression model

    Yj = β₁ zj1 + β₂ zj2 + εj,   j = 1, 2, ..., 6

to the standardized form (see page 409) of the variables y, z1, and z2. From this fit, deduce the corresponding fitted regression equation for the original (not standardized) variables.

7.3. (Weighted least squares estimators.) Let Y (n × 1) = Z (n × (r + 1)) β ((r + 1) × 1) + ε (n × 1), where E(ε) = 0 but E(εε') = σ²V, with V (n × n) known and positive definite. For V of full rank, show that the weighted least squares estimator is

    β̂_W = (Z'V⁻¹Z)⁻¹ Z'V⁻¹ Y

If σ² is unknown, it may be estimated, unbiasedly, by

    (n − r − 1)⁻¹ (Y − Zβ̂_W)' V⁻¹ (Y − Zβ̂_W)

Hint: V^{-1/2}Y = (V^{-1/2}Z)β + V^{-1/2}ε is of the classical linear regression form Y* = Z*β + ε*, with E(ε*) = 0 and E(ε*ε*') = σ²I. Thus, β̂_W = β̂* = (Z*'Z*)⁻¹Z*'Y*.

7.4. Use the weighted least squares estimator in Exercise 7.3 to derive an expression for the estimate of the slope β in the model Yj = β zj + εj, j = 1, 2, ..., n, when (a) Var(εj) = σ², (b) Var(εj) = σ² zj, and (c) Var(εj) = σ² zj². Comment on the manner in which the unequal variances for the errors influence the optimal choice of β̂_W.

7.5. Establish (7-56): ρ²_{Y(Z)} = 1 − 1/ρ^{YY}, where ρ^{YY} is the entry of ρ⁻¹ in the first row and first column.
Hint: From (7-55) and Exercise 4.11,

    1 − ρ²_{Y(Z)} = (σ_{YY} − σ'_{ZY} Σ⁻¹_{ZZ} σ_{ZY}) / σ_{YY} = |Σ| / (σ_{YY} |Σ_{ZZ}|)
From Result 2A.8(c), σ^{YY} = |Σ_{ZZ}| / |Σ| is the entry of Σ⁻¹ in the first row and first column. Since (see Exercise 2.23) ρ = V^{-1/2} Σ V^{-1/2} and ρ⁻¹ = (V^{-1/2} Σ V^{-1/2})⁻¹ = V^{1/2} Σ⁻¹ V^{1/2}, the entry in the (1, 1) position of ρ⁻¹ is ρ^{YY} = σ_{YY} σ^{YY}.

7.6. (Generalized inverse of Z'Z.) Let λ₁ ≥ λ₂ ≥ ··· ≥ λ_{r₁+1} > 0 be the nonzero eigenvalues of Z'Z, with corresponding eigenvectors e₁, e₂, ..., e_{r₁+1}, where r₁ + 1 = rank(Z).
(a) A matrix (Z'Z)⁻ is called a generalized inverse of Z'Z if Z'Z(Z'Z)⁻Z'Z = Z'Z. Show that

    (Z'Z)⁻ = Σ_{i=1}^{r₁+1} λᵢ⁻¹ eᵢeᵢ'

is a generalized inverse of Z'Z.
(b) The coefficients β̂ that minimize the sum of squared errors (y − Zβ)'(y − Zβ) satisfy the normal equations (Z'Z)β̂ = Z'y. Show that these equations are satisfied for any β̂ such that Zβ̂ is the projection of y on the columns of Z.
(c) Show that Zβ̂ = Z(Z'Z)⁻Z'y is the projection of y on the columns of Z. (See Footnote 2 in this chapter.)
(d) Show directly that β̂ = (Z'Z)⁻Z'y is a solution to the normal equations (Z'Z)[(Z'Z)⁻Z'y] = Z'y.
Hint: (b) If Zβ̂ is the projection, then y − Zβ̂ is perpendicular to the columns of Z. (d) The eigenvalue-eigenvector requirement implies that (Z'Z)(λᵢ⁻¹eᵢ) = eᵢ for i ≤ r₁ + 1 and eᵢ'(Z'Z)eᵢ = 0 for i > r₁ + 1. Therefore, (Z'Z)(λᵢ⁻¹eᵢ)eᵢ'Z' = eᵢeᵢ'Z'. Summing over i gives

    (Z'Z)(Z'Z)⁻Z' = ( Σ_{i=1}^{r₁+1} eᵢeᵢ' ) Z' = ( Σ_{i=1}^{r+1} eᵢeᵢ' ) Z' = IZ' = Z'

since eᵢ'Z' = 0' for i > r₁ + 1.
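The defining property in Exercise 7.6(a) and the projection in part (c) can be checked numerically. In this Python/NumPy sketch, the rank-deficient Z is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 12
A = rng.normal(size=(n, 2))
Z = np.column_stack([A, A.sum(axis=1)])   # third column = sum of first two, so rank(Z) = 2
ZtZ = Z.T @ Z

# generalized inverse built from the nonzero eigenvalues, as in part (a)
lam, E = np.linalg.eigh(ZtZ)
keep = lam > 1e-8
ZtZ_ginv = (E[:, keep] / lam[keep]) @ E[:, keep].T

# part (c): Z (Z'Z)^- Z'y is the projection of y on the columns of Z
y = rng.normal(size=n)
y_hat = Z @ (ZtZ_ginv @ Z.T @ y)
beta_ls, *_ = np.linalg.lstsq(Z, y, rcond=None)   # another route to the projection
```

Both routes produce the same fitted values, since each computes the projection of y onto the column space of Z even though Z'Z is singular.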
7.7. Suppose the classical regression model is, with rank(Z) = r + 1, written as

    Y = Z₁ β₍₁₎ + Z₂ β₍₂₎ + ε

where Z₁ is n × (q + 1) with rank(Z₁) = q + 1, β₍₁₎ is (q + 1) × 1, Z₂ is n × (r − q) with rank(Z₂) = r − q, and β₍₂₎ is (r − q) × 1. If the parameters β₍₂₎ are identified beforehand as being of primary interest, show that a 100(1 − α)% confidence region for β₍₂₎ is given by

    (β̂₍₂₎ − β₍₂₎)' [Z₂'Z₂ − Z₂'Z₁(Z₁'Z₁)⁻¹Z₁'Z₂] (β̂₍₂₎ − β₍₂₎) ≤ s²(r − q) F_{r−q, n−r−1}(α)

Hint: By Exercise 4.12, with 1's and 2's interchanged,

    C²² = [Z₂'Z₂ − Z₂'Z₁(Z₁'Z₁)⁻¹Z₁'Z₂]⁻¹,   where   (Z'Z)⁻¹ = [ C¹¹  C¹² ; C²¹  C²² ]

Multiply by the square-root matrix (C²²)^{-1/2}, and conclude that (C²²)^{-1/2}(β̂₍₂₎ − β₍₂₎)/σ is N(0, I), so that (β̂₍₂₎ − β₍₂₎)'(C²²)⁻¹(β̂₍₂₎ − β₍₂₎) is σ² χ²_{r−q}.

7.8. Recall that the hat matrix is defined by H = Z(Z'Z)⁻¹Z' with diagonal elements hjj.
(a) Show that H is an idempotent matrix. [See Result 7.1 and (7-6).]
(b) Show that 0 < hjj < 1, j = 1, 2, ..., n, and that Σ_{j=1}^{n} hjj = r + 1, where r is the number of independent variables in the regression model. (In fact, (1/n) ≤ hjj < 1.)
(c) Verify, for the simple linear regression model with one independent variable z, that the leverage, hjj, is given by

    hjj = 1/n + (zj − z̄)² / Σ_{j=1}^{n} (zj − z̄)²
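The hat-matrix properties in Exercise 7.8 are quick to verify numerically; the simulated one-predictor design below is illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
z = rng.normal(size=n)
Z = np.column_stack([np.ones(n), z])      # one independent variable, so r = 1

H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T      # hat matrix
h = np.diag(H)                            # leverages h_jj

# part (c): closed form for the simple linear regression leverages
h_formula = 1.0 / n + (z - z.mean()) ** 2 / np.sum((z - z.mean()) ** 2)
```

H is idempotent, the leverages lie strictly between 0 and 1, they sum to r + 1 = 2, and they match the closed form of part (c).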
7.9. Consider the following data on one predictor variable z1 and two responses Y1 and Y2:

    z1 | −2  −1   0   1   2
    y1 |  5   3   4   2   1
    y2 | −3  −1  −1   2   3

Determine the least squares estimates of the parameters in the straight-line regression models

    Yj1 = β₀₁ + β₁₁ zj1 + εj1
    Yj2 = β₀₂ + β₁₂ zj1 + εj2,   j = 1, 2, 3, 4, 5

Also, calculate the matrices of fitted values Ŷ and residuals ε̂ with Y = [y1 ⋮ y2]. Verify the sum of squares and cross-products decomposition

    Y'Y = Ŷ'Ŷ + ε̂'ε̂
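For this small data set, the least squares quantities and the decomposition can be computed directly (Python/NumPy sketch):

```python
import numpy as np

z1 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
Y = np.column_stack([[5.0, 3.0, 4.0, 2.0, 1.0],      # y1
                     [-3.0, -1.0, -1.0, 2.0, 3.0]])  # y2
Z = np.column_stack([np.ones(5), z1])

B = np.linalg.inv(Z.T @ Z) @ Z.T @ Y   # columns: (b01, b11) and (b02, b12)
Y_hat = Z @ B
E = Y - Y_hat
```

The decomposition Y'Y = Ŷ'Ŷ + ε̂'ε̂ then holds exactly, since the residual columns are orthogonal to the columns of Z.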
7.10. Using the results from Exercise 7.9, calculate each of the following.
(a) A 95% confidence interval for the mean response E ( Ycn ) = f3o 1 + f3 1 1 Zo 1 corresponding to z0 1 = 0 . 5 (b) A 95% prediction interval for the response Y0 1 corresponding to z0 1 = 0.5 (c) A 95% prediction region for the responses Y0 1 and Y0 2 corresponding to z0 1 = 0.5 7.11. (Generalized least squares for multivariate multiple regression . ) Let A be a pos itive definite matrix, so that dj (B) = ( yj - B'z j ) ' A ( yj - B'z j ) is a squared sta tistical distance from the jth observation
yj to its regression B' z j .
Show that
the choice B = β̂ = (Z'Z)⁻¹Z'Y minimizes the sum of squared statistical distances, Σ_{j=1}^{n} dj²(B), for any choice of positive definite A. Choices for A include Σ⁻¹ and I.
Hint: Repeat the steps in (7-40) and (7-41) with Σ⁻¹ replaced by A.
7.12. Given the mean vector and covariance matrix of Y, Z1, and Z2,
determine each of the following.
(a) The best linear predictor β₀ + β₁Z1 + β₂Z2 of Y
(b) The mean square error of the best linear predictor
(c) The population multiple correlation coefficient
(d) The partial correlation coefficient ρ_{YZ1·Z2}
7.13. The test scores for college students described in Example 5.5 have

    z̄ = [z̄₁, z̄₂, z̄₃]' = [527.74, 54.69, 25.13]'

    S = [ 5691.34   600.51   217.25
           600.51   126.05    23.37
           217.25    23.37    23.11 ]

Assume joint normality.
(a) Obtain the maximum likelihood estimates of the parameters for predicting Z1 from Z2 and Z3.
(b) Evaluate the estimated multiple correlation coefficient R_{Z1(Z2,Z3)}.
(c) Determine the estimated partial correlation coefficient R_{Z1,Z2·Z3}.
7.14. Twenty-five portfolio managers were evaluated in terms of their performance. Suppose Y represents the rate of return achieved over a period of time, Z1 is the manager's attitude toward risk measured on a five-point scale from "very conservative" to "very risky," and Z2 is years of experience in the investment business. The observed correlation coefficients between pairs of variables are
            Y      Z1     Z2
    R = [  1.0   −.35    .82
          −.35   1.0    −.60
           .82   −.60   1.0 ]

(a) Interpret the sample correlation coefficients r_{YZ1} = −.35 and r_{YZ2} = .82.
(b) Calculate the partial correlation coefficient r_{YZ1·Z2} and interpret this quantity with respect to the interpretation provided for r_{YZ1} in Part a.
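Part b follows from the first-order partial correlation formula; a quick Python computation using the correlation values as reconstructed above:

```python
import numpy as np

r_yz1, r_yz2, r_z1z2 = -0.35, 0.82, -0.60   # entries of R above

# first-order partial correlation r_{YZ1.Z2}
r_partial = (r_yz1 - r_yz2 * r_z1z2) / np.sqrt((1 - r_yz2 ** 2) * (1 - r_z1z2 ** 2))
```

Marginally, return and risk attitude are negatively correlated, but holding years of experience fixed the association turns positive, which is the kind of reversal Part b asks you to interpret.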
The following exercises may require the use of a computer. 7.15. Use the real-estate data in Table 7.1 and the linear regression model in
Example 7.4.
(a) Verify the results in Example 7.4.
(b) Analyze the residuals to check the adequacy of the model. (See Section 7.6.)
(c) Generate a 95% prediction interval for the selling price (Y0) corresponding to total dwelling size z1 = 17 and assessed value z2 = 46.
(d) Carry out a likelihood ratio test of H0: β2 = 0 with a significance level of α = .05. Should the original model be modified? Discuss.
7.16. Calculate a Cp plot corresponding to the possible linear regressions involving the real-estate data in Table 7.1.
7.17. Consider the Fortune 500 data in Exercise 1.4.
(a) Fit a linear regression model to these data using profits as the dependent variable and sales and assets as the independent variables.
(b) Analyze the residuals to check the adequacy of the model. Compute the leverages associated with the data points. Does one (or more) of these companies stand out as an outlier in the set of independent variable data points?
(c) Generate a 95% prediction interval for profits corresponding to sales of 40,000 (millions of dollars) and assets of 70,000 (millions of dollars).
(d) Carry out a likelihood ratio test of H0: β2 = 0 with a significance level of α = .05. Should the original model be modified? Discuss.
7.18. Calculate a Cp plot corresponding to the possible regressions involving the Fortune 500 data in Exercise 1.4.
7.19. Satellite applications motivated the development of a silver-zinc battery. Table 7.5 contains failure data collected to characterize the performance of the battery during its life cycle. Use these data.
(a) Find the estimated linear regression of ln(Y) on an appropriate ("best") subset of predictor variables.
(b) Plot the residuals from the fitted model chosen in Part a to check the normal assumption.
7.20. Using the battery-failure data in Table 7.5, regress ln(Y) on the first principal component of the predictor variables z1, z2, ..., z5. (See Section 8.3.) Compare the result with the fitted model obtained in Exercise 7.19(a).
7.21. Consider the air-pollution data in Table 1.5. Let Y1 = NO2 and Y2 = O3 be the two responses (pollutants) corresponding to the predictor variables Z1 = wind and Z2 = solar radiation.
(a) Perform a regression analysis using only the first response Y1.
(i) Suggest and fit appropriate linear regression models.
(ii) Analyze the residuals.
(iii) Construct a 95% prediction interval for NO2 corresponding to z1 = 10 and z2 = 80.
(b) Perform a multivariate multiple regression analysis using both responses Y1 and Y2.
(i) Suggest and fit appropriate linear regression models.
(ii) Analyze the residuals.
(iii) Construct a 95% prediction ellipse for both NO2 and O3 for z1 = 10 and z2 = 80. Compare this ellipse with the prediction interval in Part a (iii). Comment.
TABLE 7.5  BATTERY-FAILURE DATA

    z1        z2          z3              z4        z5        y
    Charge    Discharge   Depth of        Temper-   End of    Cycles
    rate      rate        discharge       ature     charge    to
    (amps)    (amps)      (% of rated     (°C)      voltage   failure
                          ampere-hours)             (volts)
     .375     3.13         60.0           40        2.00      101
    1.000     3.13         76.8           30        1.99      141
    1.000     3.13         60.0           20        2.00       96
    1.000     3.13         60.0           20        1.98      125
    1.625     3.13         43.2           10        2.01       43
    1.625     3.13         60.0           20        2.00       16
    1.625     3.13         60.0           20        2.02      188
     .375     5.00         76.8           10        2.01       10
    1.000     5.00         43.2           10        1.99        3
    1.000     5.00         43.2           30        2.01      386
    1.000     5.00        100.0           20        2.00       45
    1.625     5.00         76.8           10        1.99        2
     .375     1.25         76.8           10        2.01       76
    1.000     1.25         43.2           10        1.99       78
    1.000     1.25         76.8           30        2.00      160
    1.000     1.25         60.0            0        2.00        3
    1.625     1.25         43.2           30        1.99      216
    1.625     1.25         60.0           20        2.00       73
     .375     3.13         76.8           30        1.99      314
     .375     3.13         60.0           20        2.00      170
Source: Selected from S. Sidik, H. Leibecki, and J. Bozek, Failure of Silver-Zinc Cells with Competing Failure Modes-Preliminary Data Analysis, NASA Technical Memorandum 81556 (Cleveland: Lewis Research Center, 1 980) .
7.22. Using the data on bone mineral content in Table 1.8:
(a) Perform a regression analysis by fitting the response for the dominant ra dius bone to the measurements on the last four bones. (i) Suggest and fit appropriate linear regression models. (ii) Analyze the residuals. (b) Perform a multivariate multiple regression analysis by fitting the respons es from both radius bones. 7.23. Using the data on the characteristics of bulls sold at auction in Table 1.10: (a) Perform a regression analysis using the response Y1 = SalePr and the pre dictor variables Breed, YrHgt, FtFrBody, PrctFFB, Frame, BkFat, SaleHt, and SaleWt. (i) Determine the "best" regression equation by retaining only those pre dictor variables that are individually significant. (ii) Using the best fitting model, construct a 95% prediction interval for selling price for the set of predictor variable values (in the order listed above) 5, 48.7, 990, 74.0, 7, .18, 54.2 and 1450.
(iii) Examine the residuals from the best fitting model. (b) Repeat the analysis in Part a, using the natural logarithm of the sales price as the response. That is, set Yi = Ln (SalePr). Which analysis do you prefer? Why? 7.24. Using the data on the characteristics of bulls sold at auction in Table 1.10: (a) Perform a regression analysis, using only the response Yi = SaleHt and the predictor variables Z1 = YrHgt and Z2 = FtFrBody. (i) Fit an appropriate model and analyze the residuals. (ii) Construct a 95% prediction interval for SaleHt corresponding to z 1 = 50.5 and z2 = 970. (b) Perform a multivariate regression analysis with the responses Y1 = SaleHt and Y2 = SaleWt and the predictors Z1 = YrHgt and Z2 = FtFrBody. (i) Fit an appropriate multivariate model and analyze the residuals. (ii) Construct a 95% prediction ellipse for both SaleHt and SaleWt for z 1 = 50.5 and z2 = 970. Compare this ellipse with the prediction in terval in Part a (ii). Comment. 7.25. Amitriptyline is prescribed by some physicians as an antidepressant. Howev er, there are also conjectured side effects that seem to be related to the use of the drug: irregular heartbeat, abnormal blood pressures, and irregular waves on the electrocardiogram, among other things. Data gathered on 17 patients who were admitted to the hospital after an amitriptyline overdose are given in Table 7.6. The two response variables are
TABLE 7.6  AMITRIPTYLINE DATA

     y1      y2      z1     z2      z3     z4      z5
     TOT     AMI     GEN    AMT     PR     DIAP    QRS
    3389    3149      1     7500    220      0     140
    1101     653      1     1975    200      0     100
    1131     810      0     3600    205     60     111
     596     448      1      675    160     60     120
     896     844      1      750    185     70      83
    1767    1450      1     2500    180     60      80
     807     493      1      350    154     80      98
    1111     941      0     1500    200     70      93
     645     547      1      375    137     60     105
     628     392      1     1050    167     60      74
    1360    1283      1     3000    180     60      80
     652     458      1      450    160     64      60
     860     722      1     1750    135     90      79
     500     384      0     2000    160     60      80
     781     501      0     4500    180      0     100
    1070     405      0     1500    170     90     120
    1754    1520      1     3000    180      0     129

Source: See [21].
Y1 = Total TCAD plasma level (TOT)
Y2 = Amount of amitriptyline present in TCAD plasma level (AMI)
The five predictor variables are
Z1 = Gender: 1 if female, 0 if male (GEN)
Z2 = Amount of antidepressants taken at time of overdose (AMT)
Z3 = PR wave measurement (PR)
Z4 = Diastolic blood pressure (DIAP)
Z5 = QRS wave measurement (QRS)
(a) Perform a regression analysis using only the first response Y1.
(i) Suggest and fit appropriate linear regression models.
(ii) Analyze the residuals.
(iii) Construct a 95% prediction interval for Total TCAD for z1 = 1, z2 = 1200, z3 = 140, z4 = 70, and z5 = 85.
(b) Repeat Part a using the second response Y2.
(c) Perform a multivariate multiple regression analysis using both responses Y1 and Y2.
(i) Suggest and fit appropriate linear regression models.
(ii) Analyze the residuals.
(iii) Construct a 95% prediction ellipse for both Total TCAD and Amount of amitriptyline for z1 = 1, z2 = 1200, z3 = 140, z4 = 70, and z5 = 85. Compare this ellipse with the prediction intervals in Parts a and b. Comment.
REFERENCES
1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (2d ed.). New York: John Wiley, 1984.
2. Atkinson, A. C. Plots, Transformations and Regression. Oxford, England: Oxford University Press, 1985.
3. Bartlett, M. S. "A Note on Multiplying Factors for Various Chi-Squared Approximations." Journal of the Royal Statistical Society (B), 16 (1954), 296-298.
4. Belsley, D. A., E. Kuh, and R. E. Welsch. Regression Diagnostics. New York: John Wiley, 1980.
5. Bowerman, B. L., and R. T. O'Connell. Linear Statistical Models: An Applied Approach (2d ed.). Boston: PWS-Kent, 1990.
6. Box, G. E. P. "A General Distribution Theory for a Class of Likelihood Criteria." Biometrika, 36 (1949), 317-346.
7. Box, G. E. P., G. M. Jenkins, and G. C. Reinsel. Time Series Analysis: Forecasting and Control (3d ed.). Englewood Cliffs, NJ: Prentice Hall, 1994.
8. Chatterjee, S., and B. Price. Regression Analysis by Example. New York: John Wiley, 1977.
9. Cook, R. D., and S. Weisberg. Applied Regression Including Computing and Graphics. New York: John Wiley, 1999.
10. Cook, R. D., and S. Weisberg. Residuals and Influence in Regression. London: Chapman and Hall, 1982.
11. Daniel, C., and F. S. Wood. Fitting Equations to Data (2d ed.). New York: John Wiley, 1980.
12. Draper, N. R., and H. Smith. Applied Regression Analysis (3d ed.). New York: John Wiley, 1998.
13. Durbin, J., and G. S. Watson. "Testing for Serial Correlation in Least Squares Regression, II." Biometrika, 38 (1951), 159-178.
14. Galton, F. "Regression Toward Mediocrity in Heredity Stature." Journal of the Anthropological Institute, 15 (1885), 246-263.
15. Goldberger, A. S. Econometric Theory. New York: John Wiley, 1964.
16. Heck, D. L. "Charts of Some Upper Percentage Points of the Distribution of the Largest Characteristic Root." Annals of Mathematical Statistics, 31 (1960), 625-642.
17. Neter, J., W. Wasserman, M. Kutner, and C. Nachtsheim. Applied Linear Regression Models (3d ed.). Chicago: Richard D. Irwin, 1996.
18. Pillai, K. C. S. "Upper Percentage Points of the Largest Root of a Matrix in Multivariate Analysis." Biometrika, 54 (1967), 189-193.
19. Rao, C. R. Linear Statistical Inference and Its Applications (2d ed.). New York: John Wiley, 1973.
20. Seber, G. A. F. Linear Regression Analysis. New York: John Wiley, 1977.
21. Rudorfer, M. V. "Cardiovascular Changes and Plasma Drug Levels after Amitriptyline Overdose." Journal of Toxicology - Clinical Toxicology, 19 (1982), 67-71.
22. Timm, N. H. Multivariate Analysis with Applications in Education and Psychology. Monterey, CA: Brooks/Cole, 1975.
CHAPTER 8

Principal Components

8.1 INTRODUCTION
A principal component analysis is concerned with explaining the variance-covariance structure of a set of variables through a few linear combinations of these variables. Its general objectives are (1) data reduction and (2) interpretation.

Although p components are required to reproduce the total system variability, often much of this variability can be accounted for by a small number k of the principal components. If so, there is (almost) as much information in the k components as there is in the original p variables. The k principal components can then replace the initial p variables, and the original data set, consisting of n measurements on p variables, is reduced to a data set consisting of n measurements on k principal components.

An analysis of principal components often reveals relationships that were not previously suspected and thereby allows interpretations that would not ordinarily result. A good example of this is provided by the stock market data discussed in Example 8.5.

Analyses of principal components are more of a means to an end rather than an end in themselves, because they frequently serve as intermediate steps in much larger investigations. For example, principal components may be inputs to a multiple regression (see Chapter 7) or cluster analysis (see Chapter 12). Moreover, (scaled) principal components are one "factoring" of the covariance matrix for the factor analysis model considered in Chapter 9.

8.2 POPULATION PRINCIPAL COMPONENTS
Algebraically, principal components are particular linear combinations of the p random variables X1, X2, ..., Xp. Geometrically, these linear combinations represent the selection of a new coordinate system obtained by rotating the original system
with X1, X2, ..., Xp as the coordinate axes. The new axes represent the directions with maximum variability and provide a simpler and more parsimonious description of the covariance structure.

As we shall see, principal components depend solely on the covariance matrix Σ (or the correlation matrix ρ) of X1, X2, ..., Xp. Their development does not require a multivariate normal assumption. On the other hand, principal components derived for multivariate normal populations have useful interpretations in terms of the constant density ellipsoids. Further, inferences can be made from the sample components when the population is multivariate normal. (See Section 8.5.)

Let the random vector X' = [X1, X2, ..., Xp] have the covariance matrix Σ with eigenvalues λ₁ ≥ λ₂ ≥ ··· ≥ λp ≥ 0. Consider the linear combinations
    Y₁ = a₁'X = a₁₁X₁ + a₁₂X₂ + ··· + a₁ₚXₚ
    Y₂ = a₂'X = a₂₁X₁ + a₂₂X₂ + ··· + a₂ₚXₚ
     ⋮
    Yₚ = aₚ'X = aₚ₁X₁ + aₚ₂X₂ + ··· + aₚₚXₚ        (8-1)

Then, using (2-45), we obtain

    Var(Yᵢ) = aᵢ'Σaᵢ,        i = 1, 2, ..., p        (8-2)
    Cov(Yᵢ, Yₖ) = aᵢ'Σaₖ,    i, k = 1, 2, ..., p     (8-3)
The principal components are those uncorrelated linear combinations Yi , r;, . . , YP whose variances in (8-2 ) are as large as possible . The first principal component is the linear combination with maximum vari ance. That is, it maximizes Var ( Yi ) == a1Ia1 . It is clear that Var (Y1) == a1Ia1 can be increased by multiplying any a1 by some constant . To eliminate this indeterminacy, it is convenient to restrict attention to coefficient vectors of unit length. We there fore define .
First principal component = linear combination a1'X that maximizes
                            Var(a1'X) subject to a1'a1 = 1

Second principal component = linear combination a2'X that maximizes
                             Var(a2'X) subject to a2'a2 = 1 and
                             Cov(a1'X, a2'X) = 0

At the ith step,

ith principal component = linear combination ai'X that maximizes
                          Var(ai'X) subject to ai'ai = 1 and
                          Cov(ai'X, ak'X) = 0 for k < i
Result 8.1. Let Σ be the covariance matrix associated with the random vector X' = [X1, X2, ..., Xp]. Let Σ have the eigenvalue-eigenvector pairs (λ1, e1), (λ2, e2), ..., (λp, ep) where λ1 ≥ λ2 ≥ ··· ≥ λp ≥ 0. Then the ith principal component is given by

    Yi = ei'X = ei1X1 + ei2X2 + ··· + eipXp,        i = 1, 2, ..., p        (8-4)

With these choices,

    Var(Yi) = ei'Σei = λi        i = 1, 2, ..., p
    Cov(Yi, Yk) = ei'Σek = 0     i ≠ k                                      (8-5)

If some λi are equal, the choices of the corresponding coefficient vectors, ei, and hence Yi, are not unique.
Proof. We know from (2-51), with B = Σ, that

    max_{a≠0} a'Σa / a'a = λ1    (attained when a = e1)

But e1'e1 = 1 since the eigenvectors are normalized. Thus,

    max_{a≠0} a'Σa / a'a = λ1 = e1'Σe1 / e1'e1 = e1'Σe1 = Var(Y1)

Similarly, using (2-52), we get

    max_{a ⊥ e1, ..., ek} a'Σa / a'a = λ(k+1),        k = 1, 2, ..., p - 1

For the choice a = e(k+1), with e(k+1)'ei = 0, for i = 1, 2, ..., k and k = 1, 2, ..., p - 1,

    e(k+1)'Σe(k+1) / e(k+1)'e(k+1) = e(k+1)'Σe(k+1) = Var(Y(k+1))

But e(k+1)'(Σe(k+1)) = λ(k+1) e(k+1)'e(k+1) = λ(k+1), so Var(Y(k+1)) = λ(k+1). It remains to show that ei perpendicular to ek (that is, ei'ek = 0, i ≠ k) gives Cov(Yi, Yk) = 0. Now, the eigenvectors of Σ are orthogonal if all the eigenvalues λ1, λ2, ..., λp are distinct. If the eigenvalues are not all distinct, the eigenvectors corresponding to common eigenvalues may be chosen to be orthogonal. Therefore, for any two eigenvectors ei and ek, ei'ek = 0, i ≠ k. Since Σek = λkek, premultiplication by ei' gives

    Cov(Yi, Yk) = ei'Σek = λk ei'ek = 0

for any i ≠ k, and the proof is complete. ■
From Result 8.1, the principal components are uncorrelated and have variances equal to the eigenvalues of Σ.
Result 8.2. Let X' = [X1, X2, ..., Xp] have covariance matrix Σ, with eigenvalue-eigenvector pairs (λ1, e1), (λ2, e2), ..., (λp, ep) where λ1 ≥ λ2 ≥ ··· ≥ λp ≥ 0. Let Y1 = e1'X, Y2 = e2'X, ..., Yp = ep'X be the principal components. Then

    σ11 + σ22 + ··· + σpp = Σ(i=1 to p) Var(Xi) = λ1 + λ2 + ··· + λp = Σ(i=1 to p) Var(Yi)
Proof. From Definition 2A.28, σ11 + σ22 + ··· + σpp = tr(Σ). From (2-20) with A = Σ, we can write Σ = PΛP', where Λ is the diagonal matrix of eigenvalues and P = [e1, e2, ..., ep] so that PP' = P'P = I. Using Result 2A.12(c), we have

    tr(Σ) = tr(PΛP') = tr(ΛP'P) = tr(Λ) = λ1 + λ2 + ··· + λp

Thus,

    Σ(i=1 to p) Var(Xi) = tr(Σ) = tr(Λ) = Σ(i=1 to p) Var(Yi)    ■
Result 8.2 says that

    Total population variance = σ11 + σ22 + ··· + σpp
                              = λ1 + λ2 + ··· + λp        (8-6)

and consequently, the proportion of total variance due to (explained by) the kth principal component is

    Proportion of total population variance
    due to kth principal component = λk / (λ1 + λ2 + ··· + λp),        k = 1, 2, ..., p        (8-7)
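As a quick numerical illustration of Result 8.2 and (8-7), the trace of Σ equals the sum of its eigenvalues, and dividing each eigenvalue by that sum gives the proportions in (8-7). The sketch below uses NumPy; the covariance matrix is hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical covariance matrix, used only for illustration.
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.0],
                  [0.5, 0.0, 2.0]])

lam = np.linalg.eigvalsh(Sigma)[::-1]   # eigenvalues, largest first

# Result 8.2: total population variance tr(Sigma) = lambda_1 + ... + lambda_p.
total = np.trace(Sigma)
assert np.isclose(total, lam.sum())

# (8-7): proportion of total variance due to each component.
prop = lam / total
```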
If most (for instance, 80 to 90%) of the total population variance, for large p, can be attributed to the first one, two, or three components, then these components can "replace" the original p variables without much loss of information.

Each component of the coefficient vector ei' = [ei1, ..., eik, ..., eip] also merits inspection. The magnitude of eik measures the importance of the kth variable to the ith principal component, irrespective of the other variables. In particular, eik is proportional to the correlation coefficient between Yi and Xk.
Result 8.3. If Y1 = e1'X, Y2 = e2'X, ..., Yp = ep'X are the principal components obtained from the covariance matrix Σ, then

    ρ(Yi, Xk) = eik √λi / √σkk        i, k = 1, 2, ..., p        (8-8)

are the correlation coefficients between the components Yi and the variables Xk. Here (λ1, e1), (λ2, e2), ..., (λp, ep) are the eigenvalue-eigenvector pairs for Σ.
Proof. Set ak' = [0, ..., 0, 1, 0, ..., 0] so that Xk = ak'X and Cov(Xk, Yi) = Cov(ak'X, ei'X) = ak'Σei, according to (2-45). Since Σei = λiei, Cov(Xk, Yi) = ak'λiei = λieik. Then Var(Yi) = λi [see (8-5)] and Var(Xk) = σkk yield

    ρ(Yi, Xk) = Cov(Yi, Xk) / (√Var(Yi) √Var(Xk)) = λieik / (√λi √σkk) = eik √λi / √σkk,
                i, k = 1, 2, ..., p    ■
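Formula (8-8) can be checked numerically. The sketch below (NumPy; the 2 × 2 covariance matrix is hypothetical) compares the correlations given by (8-8) with correlations computed directly from Cov(Y, X) = E'Σ, where the columns of E are the eigenvectors:

```python
import numpy as np

# Hypothetical covariance matrix, used only to illustrate Result 8.3.
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

lam, E = np.linalg.eigh(Sigma)
lam, E = lam[::-1], E[:, ::-1]          # order eigenvalues largest first

# (8-8): rho(Y_i, X_k) = e_ik * sqrt(lambda_i) / sqrt(sigma_kk);
# rho_formula[i, k] indexes component i and variable k.
rho_formula = (E * np.sqrt(lam)).T / np.sqrt(np.diag(Sigma))

# Direct route: Cov(Y, X) = E' Sigma, then divide by both standard deviations.
cov_YX = E.T @ Sigma
rho_direct = cov_YX / np.sqrt(lam)[:, None] / np.sqrt(np.diag(Sigma))[None, :]

assert np.allclose(rho_formula, rho_direct)
```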
Although the correlations of the variables with the principal components often help to interpret the components, they measure only the univariate contribution of an individual X to a component Y. That is, they do not indicate the importance of an X to a component Y in the presence of the other X's. For this reason, some statisticians (see, for example, Rencher [17]) recommend that only the coefficients eik, and not the correlations, be used to interpret the components. Although the coefficients and the correlations can lead to different rankings as measures of the importance of the variables to a given component, it is our experience that these rankings are often not appreciably different. In practice, variables with relatively large coefficients (in absolute value) tend to have relatively large correlations, so the two measures of importance, the first multivariate and the second univariate, frequently give similar results. We recommend that both the coefficients and the correlations be examined to help interpret the principal components.

The following hypothetical example illustrates the contents of Results 8.1, 8.2, and 8.3.

Example 8.1
(Calculating the population principal components)
[ -� -� �]
Suppose the random variables X1 , X2 and X3 have the covariance matrix
I=
It may be verified that the eigenvalue-eigenvector pairs are A1 = 5.83, A2 = 2.00, A3 = 0.17,
e1 = [ .383, -.924, OJ e2 = [0, 0, 1 J e3 = [ .924, .383, OJ
Therefore, the principal components become Yi = e1X = .383X1 - .924X2 Y2 = e2X = X3 Y3 = e3X = .924X1 + .383X2 The variable X3 is one of the principal components, because it is uncorrelated with the other two variables. Equation (8-5) can be demonstrated from first principles. For example, Var ( Yi ) = Var ( .383X1 - .924X2 ) = ( .383 ) 2 Var (X1 ) + ( -.924) 2 Var (X2 ) + 2( .383 ) ( - .924) Cov ( X1 , X2 ) = .147 ( 1 ) + .854(5) - .708 ( -2) = 5.83 = A1 Cov ( Yi , ¥2) = Cov ( .383X1 - .924X2 , X3 ) = .383 Cov (X1 , X3 ) - .924 Cov (X2 , X3 ) = .383 (0) - .924(0) = 0 It is also readily apparent that o-1 1 + o-22 + o-3 3 = 1 + 5 + 2 = A1 + A2 + A3 = 5.83 + 2.00 + .17
validating Equation (8-6) for this example. The proportion of total variance accounted for by the first principal component is λ1/(λ1 + λ2 + λ3) = 5.83/8 = .73. Further, the first two components account for a proportion (5.83 + 2)/8 = .98 of the population variance. In this case, the components Y1 and Y2 could replace the original three variables with little loss of information.

Next, using (8-8), we obtain

    ρ(Y1, X1) = e11√λ1 / √σ11 = .383 √5.83 / √1 = .925
    ρ(Y1, X2) = e12√λ1 / √σ22 = -.924 √5.83 / √5 = -.998

Notice here that the variable X2, with coefficient -.924, receives the greatest weight in the component Y1. It also has the largest correlation (in absolute value) with Y1. The correlation of X1 with Y1, .925, is almost as large as that for X2, indicating that the variables are about equally important to the first principal component. The relative sizes of the coefficients of X1 and X2 suggest, however, that X2 contributes more to the determination of Y1 than does X1. Since, in this case, both coefficients are reasonably large and they have opposite signs, we would argue that both variables aid in the interpretation of Y1. Finally,

    ρ(Y2, X3) = √λ2 / √σ33 = √2 / √2 = 1    (as it should)

The remaining correlations can be neglected, since the third component is unimportant. ■

It is informative to consider principal components derived from multivariate normal random variables. Suppose X is distributed as Np(μ, Σ). We know from (4-7) that the density of X is constant on the μ-centered ellipsoids which have axes ±c√λi ei, i = 1, 2, ..., p, where the (λi, ei) are the eigenvalue-eigenvector pairs of Σ. A point lying on the ith axis of the ellipsoid will have coordinates proportional to ei' = [ei1, ei2, ..., eip] in the coordinate system that has origin μ and axes that are parallel to the original axes x1, x2, ..., xp. It will be convenient to set μ = 0 in the argument that follows.¹ From our discussion in Section 2.3 with A = Σ⁻¹, we can write

¹ This can be done without loss of generality because the normal random vector X can always be translated to the normal random vector W = X - μ and E(W) = 0. However, Cov(X) = Cov(W).
    c² = x'Σ⁻¹x = (e1'x)²/λ1 + (e2'x)²/λ2 + ··· + (ep'x)²/λp

where e1'x, e2'x, ..., ep'x are recognized as the principal components of x. Setting y1 = e1'x, y2 = e2'x, ..., yp = ep'x, we have

    c² = y1²/λ1 + y2²/λ2 + ··· + yp²/λp

and this equation defines an ellipsoid (since λ1, λ2, ..., λp are positive) in a coordinate system with axes y1, y2, ..., yp lying in the directions of e1, e2, ..., ep, respectively. If λ1 is the largest eigenvalue, then the major axis lies in the direction e1. The remaining minor axes lie in the directions defined by e2, ..., ep.

To summarize, the principal components y1 = e1'x, y2 = e2'x, ..., yp = ep'x lie in the directions of the axes of a constant density ellipsoid. Therefore, any point on the ith ellipsoid axis has x coordinates proportional to ei' = [ei1, ei2, ..., eip] and, necessarily, principal component coordinates of the form [0, ..., 0, yi, 0, ..., 0]. When μ ≠ 0, it is the mean-centered principal component yi = ei'(x - μ) that has mean 0 and lies in the direction ei.

A constant density ellipse and the principal components for a bivariate normal random vector with μ = 0 and ρ = .75 are shown in Figure 8.1. We see that the principal components are obtained by rotating the original coordinate axes through an angle θ until they coincide with the axes of the constant density ellipse. This result holds for p > 2 dimensions as well.

Figure 8.1 The constant density ellipse x'Σ⁻¹x = c² and the principal components y1, y2 for a bivariate normal random vector X having mean 0 and ρ = .75.
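The eigenvalue-eigenvector calculations in Example 8.1 can be reproduced numerically. A NumPy sketch (the signs of computed eigenvectors are arbitrary, so absolute values are compared):

```python
import numpy as np

# Covariance matrix of Example 8.1.
Sigma = np.array([[ 1.0, -2.0, 0.0],
                  [-2.0,  5.0, 0.0],
                  [ 0.0,  0.0, 2.0]])

lam, E = np.linalg.eigh(Sigma)
lam, E = lam[::-1], E[:, ::-1]          # lambda_1 >= lambda_2 >= lambda_3

assert np.allclose(lam, [5.83, 2.00, 0.17], atol=0.005)
# e_1 is proportional to [.383, -.924, 0]; compare up to sign.
assert np.allclose(np.abs(E[:, 0]), [0.383, 0.924, 0.0], atol=0.001)
# Total variance, validating (8-6): 1 + 5 + 2 = 8.
assert np.isclose(lam.sum(), 8.0)
```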
Principal Components Obtained from Standardized Variables
Principal components may also be obtained for the standardized variables

    Z1 = (X1 - μ1)/√σ11
    Z2 = (X2 - μ2)/√σ22
      ⋮
    Zp = (Xp - μp)/√σpp        (8-9)
In matrix notation,

    Z = (V^(1/2))⁻¹(X - μ)        (8-10)

where the diagonal standard deviation matrix V^(1/2) is defined in (2-35). Clearly, E(Z) = 0 and

    Cov(Z) = (V^(1/2))⁻¹ Σ (V^(1/2))⁻¹ = ρ

by (2-37). The principal components of Z may be obtained from the eigenvectors of the correlation matrix ρ of X. All our previous results apply, with some simplifications, since the variance of each Zi is unity. We shall continue to use the notation Yi to refer to the ith principal component and (λi, ei) for the eigenvalue-eigenvector pair from either ρ or Σ. However, the (λi, ei) derived from Σ are, in general, not the same as the ones derived from ρ.
Result 8.4. The ith principal component of the standardized variables Z' = [Z1, Z2, ..., Zp] with Cov(Z) = ρ is given by

    Yi = ei'Z = ei'(V^(1/2))⁻¹(X - μ),        i = 1, 2, ..., p

Moreover,

    Σ(i=1 to p) Var(Yi) = Σ(i=1 to p) Var(Zi) = p        (8-11)

and

    ρ(Yi, Zk) = eik√λi,        i, k = 1, 2, ..., p

In this case, (λ1, e1), (λ2, e2), ..., (λp, ep) are the eigenvalue-eigenvector pairs for ρ, with λ1 ≥ λ2 ≥ ··· ≥ λp ≥ 0.

Proof. Result 8.4 follows from Results 8.1, 8.2, and 8.3, with Z1, Z2, ..., Zp in place of X1, X2, ..., Xp and ρ in place of Σ. ■
We see from (8-11) that the total (standardized variables) population variance is simply p, the sum of the diagonal elements of the matrix ρ. Using (8-7) with Z in place of X, we find that the proportion of total variance explained by the kth principal component of Z is

    Proportion of (standardized) population variance
    due to kth principal component = λk / p,        k = 1, 2, ..., p        (8-12)

where the λk's are the eigenvalues of ρ.

Example 8.2
(Principal components obtained from covariance and correlation matrices are different)
Consider the covariance matrix

    Σ = [ 1    4
          4  100 ]

and the derived correlation matrix

    ρ = [  1   .4
          .4    1 ]

The eigenvalue-eigenvector pairs from Σ are

    λ1 = 100.16,    e1' = [.040, .999]
    λ2 = .84,       e2' = [.999, -.040]

Similarly, the eigenvalue-eigenvector pairs from ρ are

    λ1 = 1 + ρ = 1.4,    e1' = [.707, .707]
    λ2 = 1 - ρ = .6,     e2' = [.707, -.707]

The respective principal components become

    Σ:    Y1 = .040X1 + .999X2
          Y2 = .999X1 - .040X2

and

    ρ:    Y1 = .707Z1 + .707Z2 = .707 (X1 - μ1)/1 + .707 (X2 - μ2)/10
             = .707(X1 - μ1) + .0707(X2 - μ2)
          Y2 = .707Z1 - .707Z2 = .707 (X1 - μ1)/1 - .707 (X2 - μ2)/10
             = .707(X1 - μ1) - .0707(X2 - μ2)

Because of its large variance, X2 completely dominates the first principal component determined from Σ. Moreover, this first principal component explains a proportion

    λ1/(λ1 + λ2) = 100.16/101 = .992

of the total population variance.

When the variables X1 and X2 are standardized, however, the resulting variables contribute equally to the principal components determined from ρ. Using Result 8.4, we obtain

    ρ(Y1, Z1) = e11√λ1 = .707 √1.4 = .837

and

    ρ(Y1, Z2) = e12√λ1 = .707 √1.4 = .837

In this case, the first principal component explains a proportion

    λ1/p = 1.4/2 = .7

of the total (standardized) population variance.

Most strikingly, we see that the relative importance of the variables to, for instance, the first principal component is greatly affected by the standardization.
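The contrast just computed between the eigenvalues of Σ and those of ρ is easy to reproduce (a NumPy sketch):

```python
import numpy as np

Sigma = np.array([[1.0,   4.0],
                  [4.0, 100.0]])

d = np.sqrt(np.diag(Sigma))
rho = Sigma / np.outer(d, d)            # derived correlation matrix, rho = .4

lam_S = np.linalg.eigvalsh(Sigma)[::-1]
lam_R = np.linalg.eigvalsh(rho)[::-1]

# From Sigma, the first component dominates; from rho, the split is .7/.3.
assert np.allclose(np.round(lam_S, 2), [100.16, 0.84])
assert np.allclose(lam_R, [1.4, 0.6])
assert np.isclose(lam_S[0] / lam_S.sum(), 0.992, atol=0.001)
assert np.isclose(lam_R[0] / lam_R.sum(), 0.7)
```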
When the first principal component obtained from ρ is expressed in terms of X1 and X2, the relative magnitudes of the weights .707 and .0707 are in direct opposition to those of the weights .040 and .999 attached to these variables in the principal component obtained from Σ. ■

The preceding example demonstrates that the principal components derived from Σ are different from those derived from ρ. Furthermore, one set of principal components is not a simple function of the other. This suggests that the standardization is not inconsequential.

Variables should probably be standardized if they are measured on scales with widely differing ranges or if the units of measurement are not commensurate. For example, if X1 represents annual sales in the $10,000 to $350,000 range and X2 is the ratio (net annual income)/(total assets) that falls in the .01 to .60 range, then the total variation will be due almost exclusively to dollar sales. In this case, we would expect a single (important) principal component with a heavy weighting of X1. Alternatively, if both variables are standardized, their subsequent magnitudes will be of the same order, and X2 (or Z2) will play a larger role in the construction of the principal components. This behavior was observed in Example 8.2.

Principal Components for Covariance Matrices with Special Structures
There are certain patterned covariance and correlation matrices whose principal components can be expressed in simple forms. Suppose Σ is the diagonal matrix

    Σ = [ σ11   0  ···   0
           0   σ22 ···   0
           ⋮    ⋮   ⋱    ⋮
           0    0  ···  σpp ]        (8-13)

Setting ei' = [0, ..., 0, 1, 0, ..., 0], with 1 in the ith position, we observe that Σei picks out the ith column of Σ; that is,

    Σei = [0, ..., 0, σii, 0, ..., 0]' = σii ei

and we conclude that (σii, ei) is the ith eigenvalue-eigenvector pair. Since the linear combination ei'X = Xi, the set of principal components is just the original set of uncorrelated random variables.

For a covariance matrix with the pattern of (8-13), nothing is gained by extracting the principal components. From another point of view, if X is distributed as Np(μ, Σ), the contours of constant density are ellipsoids whose axes already lie in the directions of maximum variation. Consequently, there is no need to rotate the coordinate system.
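A two-line check of this special case, using an arbitrary diagonal matrix (NumPy sketch): the eigenvalues of a diagonal covariance matrix are the diagonal variances, and each eigenvector has a single nonzero entry, so the components coincide with the original variables.

```python
import numpy as np

Sigma = np.diag([3.0, 7.0, 1.0])        # arbitrary diagonal covariance matrix

lam, E = np.linalg.eigh(Sigma)

# Eigenvalues are the variances sigma_ii (in some order) ...
assert np.allclose(np.sort(lam), np.sort(np.diag(Sigma)))
# ... and each eigenvector is +/- a coordinate axis: one nonzero per column.
assert np.count_nonzero(np.abs(E) > 1e-8) == Sigma.shape[0]
```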
Standardization does not substantially alter the situation for the Σ in (8-13). In that case, ρ = I, the p × p identity matrix. Clearly, ρei = 1·ei, so the eigenvalue 1 has multiplicity p, and ei' = [0, ..., 0, 1, 0, ..., 0], i = 1, 2, ..., p, are convenient choices for the eigenvectors. Consequently, the principal components determined from ρ are also the original variables Z1, ..., Zp. Moreover, in this case of equal eigenvalues, the multivariate normal ellipsoids of constant density are spheroids.

Another patterned covariance matrix, which often describes the correspondence among certain biological variables such as the sizes of living things, has the general form

    Σ = [ σ²    ρσ²  ···  ρσ²
          ρσ²   σ²   ···  ρσ²
           ⋮     ⋮    ⋱    ⋮
          ρσ²   ρσ²  ···   σ² ]        (8-14)

The resulting correlation matrix

    ρ = [ 1  ρ  ···  ρ
          ρ  1  ···  ρ
          ⋮  ⋮   ⋱   ⋮
          ρ  ρ  ···  1 ]        (8-15)

is also the covariance matrix of the standardized variables. The matrix in (8-15) implies that the variables X1, X2, ..., Xp are equally correlated.

It is not difficult to show (see Exercise 8.5) that the p eigenvalues of the correlation matrix (8-15) can be divided into two groups. When ρ is positive, the largest is

    λ1 = 1 + (p - 1)ρ        (8-16)

with associated eigenvector

    e1' = [1/√p, 1/√p, ..., 1/√p]        (8-17)

The remaining p - 1 eigenvalues are

    λ2 = λ3 = ··· = λp = 1 - ρ

and one choice for their eigenvectors is

    e2' = [1/√(1·2), -1/√(1·2), 0, ..., 0]
    e3' = [1/√(2·3), 1/√(2·3), -2/√(2·3), 0, ..., 0]
      ⋮
    ei' = [1/√((i-1)i), ..., 1/√((i-1)i), -(i-1)/√((i-1)i), 0, ..., 0]
      ⋮
    ep' = [1/√((p-1)p), ..., 1/√((p-1)p), -(p-1)/√((p-1)p)]
The first principal component

    Y1 = e1'Z = (1/√p)(Z1 + Z2 + ··· + Zp)

is proportional to the sum of the p standardized variables. It might be regarded as an "index" with equal weights. This principal component explains a proportion

    λ1/p = (1 + (p - 1)ρ)/p = ρ + (1 - ρ)/p        (8-18)

of the total population variation. We see that λ1/p ≈ ρ for ρ close to 1 or p large. For example, if ρ = .80 and p = 5, the first component explains 84% of the total variance. When ρ is near 1, the last p - 1 components collectively contribute very little to the total variance and can often be neglected. In this special case, the first principal component for the original variables, X, is the same; that is, Y1 = (1/√p)[1, 1, ..., 1]X is a measure of total size, and it explains the same proportion (8-18) of the total variance.

If the standardized variables Z1, Z2, ..., Zp have a multivariate normal distribution with a covariance matrix given by (8-15), then the ellipsoids of constant density are "cigar shaped," with the major axis proportional to the first principal component Y1 = (1/√p)[1, 1, ..., 1]Z. This principal component is the projection of Z on the equiangular line 1' = [1, 1, ..., 1]. The minor axes (and remaining principal components) occur in spherically symmetric directions perpendicular to the major axis (and first principal component).
S U M MARIZING SAM PLE VARIATI O N BY PRI NCIPAL CO M PO N E NTS
We now have the framework necessary to study the problem of summarizing the variation in n measurements on p variables with a few judiciously chosen linear combinations. Suppose the data x 1 , x 2 , . . . , x n represent n independent drawings from some p-dimensional population with mean vector IL and covariance matrix I. These data yield the sample mean vector x , the sample covariance matrix S , and the sample cor relation matrix R . Our objective in this section will be to construct uncorrelated linear combina tions of the measured characteristics that account for much of the variation in the sam ple . The uncorrelated combinations with the largest variances will be called the
sample principal components. Recall that the n values of any linear combination
j ==
1, 2, . . . , n
have sample mean a1 x and sample variance a1 Sa 1 . Also, the pairs of values (a1xj , a2xj ) , for two linear combinations, have sample covariance a1 Sa 2 [see (3-36)]. The sample principal components are defined as those linear combinations
The sample principal components are defined as those linear combinations which have maximum sample variance. As with the population quantities, we restrict the coefficient vectors ai to satisfy ai'ai = 1. Specifically,

First sample principal component = linear combination a1'xj that maximizes
                                   the sample variance of a1'xj subject to
                                   a1'a1 = 1

Second sample principal component = linear combination a2'xj that maximizes
                                    the sample variance of a2'xj subject to
                                    a2'a2 = 1 and zero sample covariance for
                                    the pairs (a1'xj, a2'xj)

At the ith step, we have

ith sample principal component = linear combination ai'xj that maximizes
                                 the sample variance of ai'xj subject to
                                 ai'ai = 1 and zero sample covariance for
                                 all pairs (ai'xj, ak'xj), k < i
The first principal component maximizes a1'Sa1 or, equivalently,

    a1'Sa1 / a1'a1        (8-19)

By (2-51), the maximum is the largest eigenvalue λ̂1, attained for the choice a1 = ê1, the first eigenvector of S. Successive choices of ai maximize (8-19) subject to 0 = ai'Sêk = ai'λ̂kêk, or ai perpendicular to êk. Thus, as in the proofs of Results 8.1-8.3, we obtain the following results concerning sample principal components:

If S = {sik} is the p × p sample covariance matrix with eigenvalue-eigenvector pairs (λ̂1, ê1), (λ̂2, ê2), ..., (λ̂p, êp), the ith sample principal component is given by

    ŷi = êi'x = êi1x1 + êi2x2 + ··· + êipxp,        i = 1, 2, ..., p

where λ̂1 ≥ λ̂2 ≥ ··· ≥ λ̂p ≥ 0 and x is any observation on the variables X1, X2, ..., Xp. Also,

    Sample variance(ŷk) = λ̂k,         k = 1, 2, ..., p
    Sample covariance(ŷi, ŷk) = 0,    i ≠ k

In addition,

    Total sample variance = Σ(i=1 to p) sii = λ̂1 + λ̂2 + ··· + λ̂p

and

    r(ŷi, xk) = êik √λ̂i / √skk,        i, k = 1, 2, ..., p        (8-20)
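The construction just described (eigenvectors of S give the coefficient vectors, eigenvalues give the component variances) can be sketched as follows; the data matrix here is simulated, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) @ np.array([[2.0, 0.5, 0.0],
                                         [0.0, 1.0, 0.3],
                                         [0.0, 0.0, 0.7]])   # 50 obs., 3 vars

S = np.cov(X, rowvar=False)             # sample covariance matrix (divisor n-1)
lam, E = np.linalg.eigh(S)
lam, E = lam[::-1], E[:, ::-1]          # lambda-hats, largest first

Y = (X - X.mean(axis=0)) @ E            # sample principal component scores

# Each component has sample mean zero and sample variance lambda-hat_i;
# distinct components have zero sample covariance.
assert np.allclose(Y.mean(axis=0), 0.0)
assert np.allclose(np.var(Y, axis=0, ddof=1), lam)
```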
We shall denote the sample principal components by ŷ1, ŷ2, ..., ŷp, irrespective of whether they are obtained from S or R.² The components constructed from S and R are not the same, in general, but it will be clear from the context which matrix is being used, and the single notation ŷi is convenient. It is also convenient to label the component coefficient vectors êi and the component variances λ̂i for both situations.

The observations xj are often "centered" by subtracting x̄. This has no effect on the sample covariance matrix S and gives the ith principal component

    ŷi = êi'(x - x̄),        i = 1, 2, ..., p        (8-21)

for any observation vector x. If we consider the values of the ith component

    ŷji = êi'(xj - x̄),        j = 1, 2, ..., n        (8-22)

generated by substituting each observation xj for the arbitrary x in (8-21), then

    ȳ̂i = (1/n) Σ(j=1 to n) êi'(xj - x̄) = (1/n) êi' [ Σ(j=1 to n) (xj - x̄) ] = (1/n) êi'0 = 0        (8-23)

That is, the sample mean of each principal component is zero. The sample variances are still given by the λ̂i's, as in (8-20).

Example 8.3 (Summarizing sample variability with two sample principal components)
A census provided information, by tract, on five socioeconomic variables for the Madison, Wisconsin, area. The data from 14 tracts are listed in Table 8.5 in the exercises at the end of this chapter. These data produced the following summary statistics:

    x̄' = [4.32, 14.01, 1.95, 2.17, 2.45]
          total population (thousands), median school years, total employment (thousands),
          health services employment (hundreds), median home value ($10,000s)

and

    S = [ 4.308  1.683  1.803  2.155  -.253
          1.683  1.768   .588   .177   .176
          1.803   .588   .801  1.065  -.158
          2.155   .177  1.065  1.970  -.357
          -.253   .176  -.158  -.357   .504 ]

Can the sample variation be summarized by one or two principal components?

² Sample principal components can also be obtained from Σ̂ = Sn, the maximum likelihood estimate of the covariance matrix Σ, if the Xj are normally distributed. (See Result 4.11.) In this case, provided that the eigenvalues of Σ are distinct, the sample principal components can be viewed as the maximum likelihood estimates of the corresponding population counterparts. (See [1].) We shall not consider Σ̂ because the assumption of normality is not required in this section. Also, Σ̂ has eigenvalues [(n - 1)/n]λ̂i and corresponding eigenvectors êi, where (λ̂i, êi) are the eigenvalue-eigenvector pairs for S. Thus, both S and Σ̂ give the same sample principal components êi'x [see (8-20)] and the same proportion of explained variance λ̂i/(λ̂1 + λ̂2 + ··· + λ̂p). Finally, both S and Σ̂ give the same sample correlation matrix R, so if the variables are standardized, the choice of S or Σ̂ is irrelevant.
We find the following:

COEFFICIENTS FOR THE PRINCIPAL COMPONENTS
(Correlation Coefficients in Parentheses)

Variable                     ê1 (r(ŷ1,xk))   ê2 (r(ŷ2,xk))    ê3      ê4      ê5
Total population              .781 (.99)     -.071 (-.04)    .004    .542   -.302
Median school years           .306 (.61)     -.764 (-.76)   -.162   -.545   -.010
Total employment              .334 (.98)      .083 (.12)     .015    .050    .937
Health services employment    .426 (.80)      .579 (.55)     .220   -.636   -.173
Median home value            -.054 (-.20)    -.262 (-.49)    .962   -.051    .024

Variance (λ̂i):                6.931          1.786           .390    .230    .014
Cumulative percentage
of total variance             74.1           93.2            97.4    99.9    100

The first principal component explains 74.1% of the total sample variance. The first two principal components, collectively, explain 93.2% of the total sample variance. Consequently, sample variation is summarized very well by two principal components, and a reduction in the data from 14 observations on 5 variables to 14 observations on 2 principal components is reasonable.

Given the foregoing component coefficients, the first principal component appears to be essentially a weighted average of the first four variables. The second principal component appears to contrast health services employment with a weighted average of median school years and median home value. ■

As we said in our discussion of the population components, the component coefficients êik and the correlations r(ŷi, xk) should both be examined to interpret the principal components. The correlations allow for differences in the variances of the original variables, but only measure the importance of an individual X without regard to the other X's making up the component. We notice in Example 8.3, however, that the correlation coefficients displayed in the table confirm the interpretation provided by the component coefficients.

The Number of Principal Components
There is always the question of how many components to retain. There is no definitive answer to this question. Things to consider include the amount of total sample variance explained, the relative sizes of the eigenvalues (the variances of the sample components), and the subject-matter interpretations of the components. In addition, as we discuss later, a component associated with an eigenvalue near zero and, hence, deemed unimportant, may indicate an unsuspected linear dependency in the data.
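One simple way to weigh the "amount of total sample variance explained" is to look at cumulative proportions of the eigenvalues. A sketch, using the sample eigenvalues from Example 8.3 and an assumed 90% threshold chosen only for illustration:

```python
import numpy as np

# Sample eigenvalues (variances of the components) from Example 8.3.
lam = np.array([6.931, 1.786, 0.390, 0.230, 0.014])

cum = np.cumsum(lam) / lam.sum()        # cumulative proportion of variance
# Cumulative percentages: 74.1, 93.2, 97.4, 99.9, 100.

# Smallest k whose components account for at least 90% of the variance.
k = int(np.searchsorted(cum, 0.90) + 1)
assert k == 2
```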
Figure 8.2 A scree plot.
A useful visual aid to determining an appropriate number of principal components is a scree plot.³ With the eigenvalues ordered from largest to smallest, a scree plot is a plot of λ̂i versus i (the magnitude of an eigenvalue versus its number). To determine the appropriate number of components, we look for an elbow (bend) in the scree plot. The number of components is taken to be the point at which the remaining eigenvalues are relatively small and all about the same size. Figure 8.2 shows a scree plot for a situation with six principal components.

An elbow occurs in the plot in Figure 8.2 at about i = 3. That is, the eigenvalues after λ̂2 are all relatively small and about the same size. In this case, it appears, without any other evidence, that two (or perhaps three) sample principal components effectively summarize the total sample variance.

Example 8.4
(Summarizing sample variability with one sample principal component)
In a study of size and shape relationships for painted turtles, Jolicoeur and Mosimann [11] measured carapace length, width, and height. Their data, reproduced in Exercise 6.18, Table 6.9, suggest an analysis in terms of logarithms. (Jolicoeur [10] generally suggests a logarithmic transformation in studies of size-and-shape relationships.) Perform a principal component analysis.

The natural logarithms of the dimensions of 24 male turtles have sample mean vector x̄' = [4.725, 4.478, 3.703] and covariance matrix

³ Scree is the rock debris at the bottom of a cliff.
    S = 10⁻³ [ 11.072   8.019   8.160
                8.019   6.417   6.005
                8.160   6.005   6.773 ]

A principal component analysis (see Panel 8.1 for the output from the SAS statistical software package) yields the following summary:
COEFFICIENTS FOR PRINCIPAL COMPONENTS
(Correlation Coefficients in Parentheses)

Variable       ê1 (r(ŷ1,xk))      ê2       ê3
ln(length)      .683 (.99)      -.159    -.713
ln(width)       .510 (.97)      -.594     .622
ln(height)      .523 (.97)       .788     .324

Variance (λ̂i):  23.30 × 10⁻³   .60 × 10⁻³   .36 × 10⁻³
Cumulative percentage
of total variance    96.1          98.5        100

A scree plot is shown in Figure 8.3. The very distinct elbow in this plot occurs at i = 2. There is clearly one dominant principal component.

The first principal component, which explains 96% of the total variance, has an interesting subject-matter interpretation. Since

    ŷ1 = .683 ln(length) + .510 ln(width) + .523 ln(height)
       = ln[(length)^.683 (width)^.510 (height)^.523]
Figure 8.3 A scree plot for the turtle data.
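The eigenvalues reported for the turtle data can be recovered directly from the matrix S given above (NumPy sketch; small discrepancies come from S being stated to three decimals):

```python
import numpy as np

# Sample covariance matrix of the log-transformed turtle measurements.
S = 1e-3 * np.array([[11.072, 8.019, 8.160],
                     [ 8.019, 6.417, 6.005],
                     [ 8.160, 6.005, 6.773]])

lam = np.linalg.eigvalsh(S)[::-1]

assert np.allclose(lam, [23.30e-3, 0.60e-3, 0.36e-3], atol=5e-5)
# The first component explains about 96% of the total sample variance.
assert np.isclose(lam[0] / lam.sum(), 0.961, atol=0.002)
```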
PANEL 8.1  SAS ANALYSIS FOR EXAMPLE 8.4 USING PROC PRINCOMP.

PROGRAM COMMANDS

    title 'Principal Component Analysis';
    data turtle;
    infile 'E8-4.dat';
    input length width height;
    x1 = log(length); x2 = log(width); x3 = log(height);
    proc princomp cov data = turtle out = result;
    var x1 x2 x3;

OUTPUT

    Principal Components Analysis
    24 Observations
    3 Variables

    Simple Statistics
                     X1              X2              X3
    Mean    4.725443647     4.477573765     3.703185794
    StD     0.105223590     0.080104466     0.082296771

    Covariance Matrix
                     X1              X2              X3
    X1        0.0110720    0.0080191419    0.0081596480
    X2     0.0080191419       0.0064170    0.0060052707
    X3     0.0081596480    0.0060052707       0.0067730

    Total Variance = 0.024261488

    Eigenvalues of the Covariance Matrix
             Eigenvalue    Difference    Proportion
    PRIN1      0.023303      0.022705      0.960508
    PRIN2      0.000598      0.000238      0.024661
    PRIN3      0.000360                    0.014832

    Eigenvectors
            PRIN1     PRIN2     PRIN3
    X1       .683     -.159     -.713
    X2       .510     -.594      .622
    X3       .523      .788      .324
the first principal component may be viewed as the ln(volume) of a box with adjusted dimensions. For instance, the adjusted height is (height)^.523, which accounts, in some sense, for the rounded shape of the carapace. ■

Interpretation of the Sample Principal Components
The sample principal components have several interpretations. First, suppose th e underlying distribution of X is nearly Np ( JL , I ) . Then the sample principal comp o nents, .Yz == ei(x - x) are realizations of population principal compone n ts li == ej (X - p.. ) , which have an Np (O, A ) distribution. The diagonal matrix A h as entries A. 1 , A.2 , . . . , AP and ( "-z , e i) are the eigenvalue-eigenvector pairs of I . Also, from the sample values x1 , we can approximate IL by x and I by S. If S is positive definite, the contour consisting of all p X 1 vectors x satisfying (8-24 ) (x - x) ' s - 1 (x - x) == c 2
estimates the constant density contour (x - IL ) 'I - 1 (x - IL ) == c2 of the underlying normal density. The approximate contours can be drawn on the scatter plot to ind i cate the normal distribution that generated the data. The normality assumption is useful for the inference procedures discussed in Section 8.5, but it is not required for the development of the properties of the sample principal components summa rized in (8-20). Even when the normal assumption is suspect and the scatter plot may depart some what from an elliptical pattern, we can still extract the eigenvalues from S and obtain the sample principal components. Geometrically, the data may be plotted as n points in p-space. The data can then be expressed in the new coordinates, which coincide with the axes of the contour of (8-24). Now, (8-24) defines a hyperellipsoid that is centered at x and whose axes are given by the eigenvectors of s- 1 or, equivalently, of S. (See Section 2.3 and Result 4.1, with S in place of I.) The lengths of these hyperellipsoid AP 0 are the axes are proportional to \lA;, i = 1, 2, . . . , p, where A 1 A2 eigenvalues of S. Because e i has length 1, the absolute value of the ith principal component, 1 .Yi 1 1 ei(x - x) I, gives the length of the projection of the vector (x - x) on the unit vector e i . [See (2-8) and (2-9) .] Thus, the sample principal components .Yi == ei(x - x) , i 1, 2, . . . , p, lie along the axes of the hyperellipsoid, and their ab solute values are the lengths of the projections of x - x in the directions of the axes e i . Consequently, the sample principal components can be viewed as the result of translating the origin of the original coordinate system to x and then rotating the coordinate axes until they pass through the scatter in the directions of maximum variance. The geometrical interpretation of the sample principal components is illustrat ed in Figure 8.4 f2r p � 2. 
Figure 8.4(a) shows an ellipse of constant distance, centered at x̄, with λ̂₁ > λ̂₂. The sample principal components are well determined. They lie along the axes of the ellipse in the perpendicular directions of maximum sample variance. Figure 8.4(b) shows a constant distance ellipse, centered at x̄, with λ̂₁ ≈ λ̂₂. If λ̂₁ = λ̂₂, the axes of the ellipse (circle) of constant distance are not uniquely determined and can lie in any two perpendicular directions, including the directions of the original coordinate axes. Similarly, the sample principal components
[Figure 8.4 Sample principal components and ellipses of constant distance (x − x̄)′S⁻¹(x − x̄) = c²: (a) λ̂₁ > λ̂₂; (b) λ̂₁ ≈ λ̂₂.]
can lie in any two perpendicular directions, including those of the original coordinate axes. When the contours of constant distance are nearly circular or, equivalently, when the eigenvalues of S are nearly equal, the sample variation is homogeneous in all directions. It is then not possible to represent the data well in fewer than p dimensions.

If the last few eigenvalues λ̂ᵢ are sufficiently small such that the variation in the corresponding êᵢ directions is negligible, the last few sample principal components can often be ignored, and the data can be adequately approximated by their representations in the space of the retained components. (See Section 8.4.)

Finally, Supplement 8A gives a further result concerning the role of the sample principal components when directly approximating the mean-centered data x_j − x̄.
Standardizing the Sample Principal Components
Sample principal components are, in general, not invariant with respect to changes in scale. (See Exercises 8.6 and 8.7.) As we mentioned in the treatment of population components, variables measured on different scales or on a common scale with widely differing ranges are often standardized. For the sample, standardization is accomplished by constructing
  z_j′ = [ (x_{j1} − x̄₁)/√s₁₁ , (x_{j2} − x̄₂)/√s₂₂ , … , (x_{jp} − x̄_p)/√s_pp ] ,  j = 1, 2, …, n    (8-25)
The n × p data matrix of standardized observations

      [ z₁′ ]   [ (x₁₁ − x̄₁)/√s₁₁   (x₁₂ − x̄₂)/√s₂₂   ⋯   (x₁p − x̄_p)/√s_pp ]
  Z = [ z₂′ ] = [ (x₂₁ − x̄₁)/√s₁₁   (x₂₂ − x̄₂)/√s₂₂   ⋯   (x₂p − x̄_p)/√s_pp ]    (8-26)
      [  ⋮  ]   [        ⋮                   ⋮                     ⋮          ]
      [ zₙ′ ]   [ (xₙ₁ − x̄₁)/√s₁₁   (xₙ₂ − x̄₂)/√s₂₂   ⋯   (xₙp − x̄_p)/√s_pp ]

yields the sample mean vector [see (3-24)]

  z̄ = (1/n) Z′1 = 0    (8-27)

and sample covariance matrix [see (3-27)]

  S_z = (1/(n − 1)) (Z − (1/n)11′Z)′(Z − (1/n)11′Z) = (1/(n − 1)) (Z − 1z̄′)′(Z − 1z̄′) = (1/(n − 1)) Z′Z = R    (8-28)

since the (i, k)th entry of (1/(n − 1)) Z′Z is (n − 1)s_{ik} / ((n − 1)√s_{ii}√s_{kk}) = r_{ik}.
The sample principal components of the standardized observations are given by (8-20), with the matrix R in place of S. Since the observations are already "centered" by construction, there is no need to write the components in the form of (8-21).
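The identities (8-26) through (8-28) are easy to verify numerically. The following sketch (Python with NumPy; the data matrix and the mixing matrix that generates it are illustrative, not from the text) standardizes each column and checks that the standardized observations have mean vector 0 and covariance matrix equal to the sample correlation matrix R:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative data matrix X (n = 100 observations on p = 3 variables);
# the mixing matrix below is arbitrary, chosen only to give unequal scales.
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.7]])

# (8-25): z_jk = (x_jk - xbar_k) / sqrt(s_kk), with s_kk using divisor n - 1
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

z_bar = Z.mean(axis=0)          # (8-27): the zero vector
S_z = np.cov(Z, rowvar=False)   # (8-28): the sample covariance of Z
R = np.corrcoef(X, rowvar=False)

print(np.allclose(z_bar, 0), np.allclose(S_z, R))  # True True
```

Because correlations are unaffected by the choice of variance divisor, `np.corrcoef` and the covariance of the standardized data agree up to floating-point error.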
The ith sample principal component of the standardized observations is ŷᵢ = êᵢ′z, i = 1, 2, …, p, where (λ̂ᵢ, êᵢ) is the ith eigenvalue-eigenvector pair of R with λ̂₁ ≥ λ̂₂ ≥ ⋯ ≥ λ̂_p ≥ 0, and the total (standardized) sample variance is tr(R) = p.    (8-29)
Using (8-29), we see that the proportion of the total sample variance explained by the ith sample principal component is

  ( proportion of (standardized) sample variance due to ith sample principal component ) = λ̂ᵢ / p ,  i = 1, 2, …, p    (8-30)
A rule of thumb suggests retaining only those components whose variances λ̂ᵢ are greater than unity or, equivalently, only those components which, individually, explain at least a proportion 1/p of the total variance. This rule does not have a great deal of theoretical support, however, and it should not be applied blindly. As we have mentioned, a scree plot is also useful for selecting the appropriate number of components.
Example 8.5 (Sample principal components from standardized data)
The weekly rates of return for five stocks (Allied Chemical, du Pont, Union Carbide, Exxon, and Texaco) listed on the New York Stock Exchange were determined for the period January 1975 through December 1976. The weekly rates of return are defined as (current Friday closing price − previous Friday closing price)/(previous Friday closing price), adjusted for stock splits and dividends. The data are listed in Table 8.4 in the Exercises. The observations in 100 successive weeks appear to be independently distributed, but the rates of return across stocks are correlated, since, as one expects, stocks tend to move together in response to general economic conditions. Let x₁, x₂, …, x₅ denote observed weekly rates of return for Allied Chemical, du Pont, Union Carbide, Exxon, and Texaco, respectively. Then
  x̄′ = [.0054, .0048, .0057, .0063, .0037]
and

      [ 1.000   .577   .509   .387   .462 ]
      [  .577  1.000   .599   .389   .322 ]
  R = [  .509   .599  1.000   .436   .426 ]
      [  .387   .389   .436  1.000   .523 ]
      [  .462   .322   .426   .523  1.000 ]

We note that R is the covariance matrix of the standardized observations

  z₁ = (x₁ − x̄₁)/√s₁₁ ,  z₂ = (x₂ − x̄₂)/√s₂₂ ,  … ,  z₅ = (x₅ − x̄₅)/√s₅₅
The eigenvalues and corresponding normalized eigenvectors of R, determined by a computer, are

  λ̂₁ = 2.857,  ê₁′ = [ .464,  .457,  .470,  .421,  .421]
  λ̂₂ = .809,   ê₂′ = [ .240,  .509,  .260, −.526, −.582]
  λ̂₃ = .540,   ê₃′ = [−.612,  .178,  .335,  .541, −.435]
  λ̂₄ = .452,   ê₄′ = [ .387,  .206, −.662,  .472, −.382]
  λ̂₅ = .343,   ê₅′ = [−.451,  .676, −.400, −.176,  .385]

Using the standardized variables, we obtain the first two sample principal components:

  ŷ₁ = ê₁′z = .464z₁ + .457z₂ + .470z₃ + .421z₄ + .421z₅
  ŷ₂ = ê₂′z = .240z₁ + .509z₂ + .260z₃ − .526z₄ − .582z₅

These components, which account for

  ( (λ̂₁ + λ̂₂)/p )100% = ( (2.857 + .809)/5 )100% = 73%

of the total (standardized) sample variance, have interesting interpretations. The first component is a roughly equally weighted sum, or "index," of the five stocks. This component might be called a general stock-market component, or simply a market component.

The second component represents a contrast between the chemical stocks (Allied Chemical, du Pont, and Union Carbide) and the oil stocks (Exxon and Texaco). It might be called an industry component. Thus, we see that most of the variation in these stock returns is due to market activity and uncorrelated industry activity. This interpretation of stock price behavior has also been suggested by King [12].

The remaining components are not easy to interpret and, collectively, represent variation that is probably specific to each stock. In any event, they do not explain much of the total sample variance. This example provides a case where it seems sensible to retain a component (ŷ₂) associated with an eigenvalue less than unity.  ■
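The eigenvalue computations in this example can be reproduced from the correlation matrix given above; a sketch (Python with NumPy; `eigh` is appropriate because R is symmetric):

```python
import numpy as np

# Correlation matrix of the five weekly stock returns, as given above
R = np.array([[1.000, .577, .509, .387, .462],
              [ .577, 1.000, .599, .389, .322],
              [ .509, .599, 1.000, .436, .426],
              [ .387, .389, .436, 1.000, .523],
              [ .462, .322, .426, .523, 1.000]])

lams, vecs = np.linalg.eigh(R)            # eigh returns ascending order
lams, vecs = lams[::-1], vecs[:, ::-1]    # reorder: largest eigenvalue first

print(np.round(lams, 3))                  # approx. 2.857, .809, .540, .452, .343
print(round(100 * lams[:2].sum() / 5))    # approx. 73 (% from first two)
```

The eigenvalues sum to tr(R) = p = 5, so the printed proportions exhaust the total standardized variance.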
Example 8.6 (Components from a correlation matrix with a special structure)
Geneticists are often concerned with the inheritance of characteristics that can be measured several times during an animal's lifetime. Body weight (in grams) for n = 150 female mice was obtained immediately after the birth of their first four litters.⁴ The sample mean vector and sample correlation matrix were, respectively,

  x̄′ = [39.88, 48.11, 45.08, 49.95]

and

      [ 1.000  .7501  .6329  .6363 ]
  R = [ .7501  1.000  .6925  .7386 ]
      [ .6329  .6925  1.000  .6625 ]
      [ .6363  .7386  .6625  1.000 ]

The eigenvalues of this matrix are

  λ̂₁ = 3.058,  λ̂₂ = .382,  λ̂₃ = .342,  λ̂₄ = .217

We note that the first eigenvalue is nearly equal to 1 + (p − 1)r̄ = 1 + (4 − 1)(.6854) = 3.056, where r̄ is the arithmetic average of the off-diagonal elements of R. The remaining eigenvalues are small and about equal, although λ̂₄ is somewhat smaller than λ̂₂ and λ̂₃. Thus, there is some evidence that the corresponding population correlation matrix ρ may be of the "equal correlation" form of (8-15). This notion is explored further in Example 8.9.

The first principal component

  ŷ₁ = ê₁′z = .49z₁ + .52z₂ + .49z₃ + .50z₄

accounts for 100(λ̂₁/p)% = 100(3.058/4)% = 76% of the total variance. Although the average postbirth weights increase over time, the variation in weights is fairly well explained by the first principal component with (nearly) equal coefficients.  ■
Comment. An unusually small value for the last eigenvalue from either the sample covariance or correlation matrix can indicate an unnoticed linear dependency in the data set. If this occurs, one (or more) of the variables is redundant and should be deleted. Consider a situation where x₁, x₂, and x₃ are subtest scores and the total score x₄ is the sum x₁ + x₂ + x₃. Then, although the linear combination e′x = [1, 1, 1, −1]x = x₁ + x₂ + x₃ − x₄ is always zero, rounding error in the computation of eigenvalues may lead to a small nonzero value. If the linear expression relating x₄ to (x₁, x₂, x₃) was initially overlooked, the smallest eigenvalue-eigenvector pair should provide a clue to its existence.

Thus, although "large" eigenvalues and the corresponding eigenvectors are important in a principal component analysis, eigenvalues very close to zero should not be routinely ignored. The eigenvectors associated with these latter eigenvalues may point out linear dependencies in the data set that can cause interpretive and computational problems in a subsequent analysis.  ■

⁴ Data courtesy of J. J. Rutledge.
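The subtest-score situation just described is easy to simulate. In the sketch below (Python with NumPy; the scores are artificial), the smallest eigenvalue of S is zero up to rounding error, and its eigenvector is proportional to [1, 1, 1, −1], exposing the dependency:

```python
import numpy as np

rng = np.random.default_rng(1)
x123 = rng.normal(size=(50, 3))          # three subtest scores (artificial)
x4 = x123.sum(axis=1, keepdims=True)     # total score: x4 = x1 + x2 + x3
X = np.hstack([x123, x4])

S = np.cov(X, rowvar=False)
lams, vecs = np.linalg.eigh(S)           # ascending: lams[0] is the smallest

print(lams[0] < 1e-10)                   # True: zero up to rounding error
# The corresponding eigenvector is proportional to [1, 1, 1, -1]:
print(np.round(vecs[:, 0] / vecs[-1, 0], 6))
```

With the remaining eigenvalues of order one, the null direction is unique, so the last eigenvector recovers the overlooked linear relation exactly.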
8.4 GRAPHING THE PRINCIPAL COMPONENTS
Plots of the principal components can reveal suspect observations, as well as provide checks on the assumption of normality. Since the principal components are linear combinations of the original variables, it is not unreasonable to expect them to be nearly normal. It is often necessary to verify that the first few principal components are approximately normally distributed when they are to be used as the input data for additional analyses. The last principal components can help pinpoint suspect observations. Each observation can be expressed as a linear combination
Each observation can be expressed as a linear combination

  x_j = (x_j′ê₁)ê₁ + (x_j′ê₂)ê₂ + ⋯ + (x_j′ê_p)ê_p = ŷ_{j1}ê₁ + ŷ_{j2}ê₂ + ⋯ + ŷ_{jp}ê_p

of the complete set of eigenvectors ê₁, ê₂, …, ê_p of S. Thus, the magnitudes of the last principal components determine how well the first few fit the observations. That is, ŷ_{j1}ê₁ + ŷ_{j2}ê₂ + ⋯ + ŷ_{j,q−1}ê_{q−1} differs from x_j by ŷ_{jq}ê_q + ⋯ + ŷ_{jp}ê_p, the square of whose length is ŷ²_{jq} + ⋯ + ŷ²_{jp}. Suspect observations will often be such that at least one of the coordinates ŷ_{jq}, …, ŷ_{jp} contributing to this squared length will be large. (See Supplement 8A for more general approximation results.)

The following statements summarize these ideas.

1. To help check the normal assumption, construct scatter diagrams for pairs of the first few principal components. Also, make Q-Q plots from the sample values generated by each principal component.

2. Construct scatter diagrams and Q-Q plots for the last few principal components. These help identify suspect observations.
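The role of the last components in point 2 can be sketched numerically (Python with NumPy; the data are artificial, with one observation deliberately pushed off the plane of the first two components). The squared length of an observation's last components equals its squared distance from the fitted component plane, so the suspect point stands out:

```python
import numpy as np

rng = np.random.default_rng(2)
# Artificial data: most variation lies in the first two coordinates
X = rng.normal(size=(40, 4)) @ np.diag([3.0, 2.0, 0.5, 0.2])
X[7] += np.array([0.0, 0.0, 4.0, 3.0])   # perturb observation 7 off that plane

xbar = X.mean(axis=0)
lams, E = np.linalg.eigh(np.cov(X, rowvar=False))
E = E[:, ::-1]                           # eigenvectors, largest eigenvalue first

Y = (X - xbar) @ E                       # sample principal components y_ji
q = 2
resid_sq = (Y[:, q:] ** 2).sum(axis=1)   # squared length of the last components

print(int(resid_sq.argmax()))            # index of the suspect observation: 7
```

The same `resid_sq` values would drive the Q-Q plots and scatter diagrams of the last few components recommended above.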
Example 8.7 (Plotting the principal components for the turtle data)

We illustrate the plotting of principal components for the data on male turtles discussed in Example 8.4. The three sample principal components are

  ŷ₁ =  .683(x₁ − 4.725) + .510(x₂ − 4.478) + .523(x₃ − 3.703)
  ŷ₂ = −.159(x₁ − 4.725) − .594(x₂ − 4.478) + .788(x₃ − 3.703)
  ŷ₃ = −.713(x₁ − 4.725) + .622(x₂ − 4.478) + .324(x₃ − 3.703)

where x₁ = ln(length), x₂ = ln(width), and x₃ = ln(height), respectively.

Figure 8.5 shows the Q-Q plot for ŷ₂ and Figure 8.6 shows the scatter plot of (ŷ₁, ŷ₂). The observation for the first turtle is circled and lies in the lower right corner of the scatter plot and in the upper right corner of the Q-Q plot; it may be suspect. This point should have been checked for recording errors, or the turtle should have been examined for structural anomalies. Apart from the first turtle, the scatter plot appears to be reasonably elliptical. The plots for the other sets of principal components do not indicate any substantial departures from normality.  ■
[Figure 8.5 A Q-Q plot for the second principal component ŷ₂ from the data on male turtles.]

[Figure 8.6 Scatter plot of the principal components ŷ₁ and ŷ₂ of the data on male turtles.]
The diagnostics involving principal components apply equally well to the checking of assumptions for a multivariate multiple regression model. In fact, having fit any model by any method of estimation, it is prudent to consider the

  residual vector = (observation vector) − (vector of predicted (estimated) values)

or

  ε̂_j = y_j − β̂′z_j ,  j = 1, 2, …, n    (8-31)

where ε̂_j and y_j are p × 1, for the multivariate linear model. Principal components, derived from the covariance matrix of the residuals,

  (1/(n − p)) Σ_{j=1}^{n} (ε̂_j − ε̄̂)(ε̂_j − ε̄̂)′    (8-32)

can be scrutinized in the same manner as those determined from a random sample. You should be aware that there are linear dependencies among the residuals from a linear regression analysis, so the last eigenvalues will be zero, within rounding error.
8.5 LARGE SAMPLE INFERENCES
We have seen that the eigenvalues and eigenvectors of the covariance (correlation) matrix are the essence of a principal component analysis. The eigenvectors determine the directions of maximum variability, and the eigenvalues specify the variances. When the first few eigenvalues are much larger than the rest, most of the total variance can be "explained" in fewer than p dimensions.

In practice, decisions regarding the quality of the principal component approximation must be made on the basis of the eigenvalue-eigenvector pairs (λ̂ᵢ, êᵢ) extracted from S or R. Because of sampling variation, these eigenvalues and eigenvectors will differ from their underlying population counterparts. The sampling distributions of λ̂ᵢ and êᵢ are difficult to derive and beyond the scope of this book. If you are interested, you can find some of these derivations for multivariate normal populations in [1], [2], and [5]. We shall simply summarize the pertinent large sample results.

Large Sample Properties of λ̂ᵢ and êᵢ
Currently available results concerning large sample confidence intervals for λ̂ᵢ and êᵢ assume that the observations X₁, X₂, …, Xₙ are a random sample from a normal population. It must also be assumed that the (unknown) eigenvalues of Σ are distinct and positive, so that λ₁ > λ₂ > ⋯ > λ_p > 0. The one exception is the case where the number of equal eigenvalues is known. Usually the conclusions for distinct eigenvalues are applied, unless there is a strong reason to believe that Σ has a special structure that yields equal eigenvalues. Even when the normal assumption is violated, the confidence intervals obtained in this manner still provide some indication of the uncertainty in λ̂ᵢ and êᵢ.

Anderson [2] and Girshick [5] have established the following large sample distribution theory for the eigenvalues λ̂′ = [λ̂₁, …, λ̂_p] and eigenvectors ê₁, …, ê_p of S:

1. Let Λ be the diagonal matrix of eigenvalues λ₁, …, λ_p of Σ. Then √n(λ̂ − λ) is approximately N_p(0, 2Λ²).

2. Let

  Eᵢ = λᵢ Σ_{k=1, k≠i}^{p} ( λ_k / (λ_k − λᵢ)² ) e_k e_k′

then √n(êᵢ − eᵢ) is approximately N_p(0, Eᵢ).

3. Each λ̂ᵢ is distributed independently of the elements of the associated êᵢ.

Result 1 implies that, for n large, the λ̂ᵢ are independently distributed. Moreover, λ̂ᵢ has an approximate N(λᵢ, 2λᵢ²/n) distribution. Using this normal distribution, we obtain P[|λ̂ᵢ − λᵢ| ≤ z(α/2)λᵢ√(2/n)] = 1 − α. A large sample 100(1 − α)% confidence interval for λᵢ is thus provided by

  λ̂ᵢ / (1 + z(α/2)√(2/n))  ≤  λᵢ  ≤  λ̂ᵢ / (1 − z(α/2)√(2/n))    (8-33)
where z(α/2) is the upper 100(α/2)th percentile of a standard normal distribution. Bonferroni-type simultaneous 100(1 − α)% intervals for m λᵢ's are obtained by replacing z(α/2) with z(α/2m). (See Section 5.4.)

Result 2 implies that the êᵢ's are normally distributed about the corresponding eᵢ's for large samples. The elements of each êᵢ are correlated, and the correlation depends to a large extent on the separation of the eigenvalues λ₁, λ₂, …, λ_p (which is unknown) and the sample size n. Approximate standard errors for the coefficients ê_{ik} are given by the square roots of the diagonal elements of (1/n)Êᵢ, where Êᵢ is derived from Eᵢ by substituting λ̂ᵢ's for the λᵢ's and êᵢ's for the eᵢ's.
Example 8.8 (Constructing a confidence interval for λ₁)

We shall obtain a 95% confidence interval for λ₁, the variance of the first population principal component, using the stock price data listed in Table 8.4 in the Exercises.

Assume that the stock rates of return represent independent drawings from an N₅(μ, Σ) population, where Σ is positive definite with distinct eigenvalues λ₁ > λ₂ > ⋯ > λ₅ > 0. Since n = 100 is large, we can use (8-33) with i = 1 to construct a 95% confidence interval for λ₁. From Exercise 8.10, λ̂₁ = .0036, and in addition, z(.025) = 1.96. Therefore, with 95% confidence,

  .0036 / (1 + 1.96√(2/100))  ≤  λ₁  ≤  .0036 / (1 − 1.96√(2/100))   or   .0028 ≤ λ₁ ≤ .0050  ■
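The interval in Example 8.8 follows directly from (8-33); a short Python sketch:

```python
from math import sqrt

# (8-33) applied as in Example 8.8: lam1_hat = .0036, n = 100, 95% confidence
lam_hat, n, z = 0.0036, 100, 1.96

half_width = z * sqrt(2.0 / n)
lower = lam_hat / (1 + half_width)
upper = lam_hat / (1 - half_width)

print(round(lower, 4), round(upper, 4))   # 0.0028 0.005
```

Note the asymmetry of the interval about λ̂₁: the upper limit is farther from .0036 than the lower limit, a consequence of dividing by 1 ± z(α/2)√(2/n).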
Whenever an eigenvalue is large, such as 100 or even 1000, the intervals generated by (8-33) can be quite wide, for reasonable confidence levels, even though n is fairly large. In general, the confidence interval gets wider at the same rate that λ̂ᵢ gets larger. Consequently, some care must be exercised in dropping or retaining principal components based on an examination of the λ̂ᵢ's.

Testing for the Equal Correlation Structure
The special correlation structure Cov(Xᵢ, X_k) = √(σᵢᵢσ_kk) ρ, or Corr(Xᵢ, X_k) = ρ, all i ≠ k, is one important structure in which the eigenvalues of Σ are not distinct and the previous results do not apply. To test for this structure, let

              [ 1  ρ  ⋯  ρ ]
  H₀: ρ = ρ₀ = [ ρ  1  ⋯  ρ ]   and   H₁: ρ ≠ ρ₀
       (p×p)  [ ⋮  ⋮  ⋱  ⋮ ]
              [ ρ  ρ  ⋯  1 ]

A test of H₀ versus H₁ may be based on a likelihood ratio statistic, but Lawley [14] has demonstrated that an equivalent test procedure can be constructed from the off-diagonal elements of R.
Lawley's procedure requires the quantities

  r̄_k = (1/(p − 1)) Σ_{i=1, i≠k}^{p} r_{ik} ,  k = 1, 2, …, p ;   r̄ = (2/(p(p − 1))) Σ Σ_{i<k} r_{ik}    (8-34)

  γ̂ = (p − 1)² [1 − (1 − r̄)²] / [ p − (p − 2)(1 − r̄)² ]

It is evident that r̄_k is the average of the off-diagonal elements in the kth column (or row) of R and r̄ is the overall average of the off-diagonal elements.

The large sample approximate α-level test is to reject H₀ in favor of H₁ if

  T = ((n − 1)/(1 − r̄)²) [ Σ Σ_{i<k} (r_{ik} − r̄)² − γ̂ Σ_{k=1}^{p} (r̄_k − r̄)² ]  >  χ²_{(p+1)(p−2)/2}(α)    (8-35)

where χ²_{(p+1)(p−2)/2}(α) is the upper (100α)th percentile of a chi-square distribution with (p + 1)(p − 2)/2 d.f.

Example 8.9 (Testing for equicorrelation structure)
From Example 8.6, the sample correlation matrix constructed from the n = 150 post-birth weights of female mice is

      [ 1.0    .7501  .6329  .6363 ]
  R = [ .7501  1.0    .6925  .7386 ]
      [ .6329  .6925  1.0    .6625 ]
      [ .6363  .7386  .6625  1.0   ]

We shall use this correlation matrix to illustrate the large sample test in (8-35). Here p = 4, and we set

              [ 1  ρ  ρ  ρ ]
  H₀: ρ = ρ₀ = [ ρ  1  ρ  ρ ]   and   H₁: ρ ≠ ρ₀
              [ ρ  ρ  1  ρ ]
              [ ρ  ρ  ρ  1 ]

Using (8-34) and (8-35), we obtain

  r̄₁ = (1/3)(.7501 + .6329 + .6363) = .6731,   r̄₂ = .7271,   r̄₃ = .6626,   r̄₄ = .6791

  r̄ = (2/(4(3)))(.7501 + .6329 + .6363 + .6925 + .7386 + .6625) = .6855

  Σ Σ_{i<k} (r_{ik} − r̄)² = (.7501 − .6855)² + (.6329 − .6855)² + ⋯ + (.6625 − .6855)² = .01277
and

  Σ_{k=1}^{4} (r̄_k − r̄)² = (.6731 − .6855)² + ⋯ + (.6791 − .6855)² = .00245

  γ̂ = (4 − 1)² [1 − (1 − .6855)²] / [4 − (4 − 2)(1 − .6855)²] = 2.1329

  T = ((150 − 1)/(1 − .6855)²) [ .01277 − (2.1329)(.00245) ] = 11.4

Since (p + 1)(p − 2)/2 = 5(2)/2 = 5, the 5% critical value for the test in (8-35) is χ²₅(.05) = 11.07. The value of our test statistic is approximately equal to the large sample 5% critical point, so the evidence against H₀ (equal correlations) is strong, but not overwhelming. As we saw in Example 8.6, the smallest eigenvalues λ̂₂, λ̂₃, and λ̂₄ are slightly different, with λ̂₄ being somewhat smaller than the other two. Consequently, with the large sample size in this problem, small differences from the equal correlation structure show up as statistically significant.  ■
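Lawley's statistic involves only the off-diagonal elements of R, so Example 8.9 can be reproduced in a few lines. A Python sketch:

```python
# Lawley's test (8-34)-(8-35) for the mice correlation matrix (p = 4, n = 150)
R = [[1.0,   .7501, .6329, .6363],
     [.7501, 1.0,   .6925, .7386],
     [.6329, .6925, 1.0,   .6625],
     [.6363, .7386, .6625, 1.0]]
p, n = 4, 150

# Column averages r_bar_k and overall average r_bar of off-diagonal elements
rk = [sum(R[i][k] for i in range(p) if i != k) / (p - 1) for k in range(p)]
off = [R[i][k] for i in range(p) for k in range(i + 1, p)]
rbar = sum(off) / len(off)

gam = (p - 1) ** 2 * (1 - (1 - rbar) ** 2) / (p - (p - 2) * (1 - rbar) ** 2)
T = ((n - 1) / (1 - rbar) ** 2) * (sum((r - rbar) ** 2 for r in off)
                                   - gam * sum((r - rbar) ** 2 for r in rk))

print(round(rbar, 4), round(gam, 4), round(T, 1))   # 0.6855 2.1329 11.4
```

Comparing T = 11.4 with χ²₅(.05) = 11.07 reproduces the borderline rejection of H₀ discussed above.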
8.6 MONITORING QUALITY WITH PRINCIPAL COMPONENTS
In Section 5.6, we introduced multivariate control charts, including the quality ellipse and the T² chart. Today, with electronic and other automated methods of data collection, it is not uncommon for data to be collected on 10 or 20 process variables. Major chemical and drug companies report measuring over 100 process variables, including temperature, pressure, concentration, and weight, at various positions along the production process. Even with 10 variables to monitor, there are 45 pairs for which to create quality ellipses. Clearly, another approach is required to both visually display important quantities and still have the sensitivity to detect special causes of variation.
Checking a Given Set of Measurements for Stability
Let X₁, X₂, …, Xₙ be a random sample from a multivariate normal distribution with mean μ and covariance matrix Σ. We consider the first two sample principal components, ŷ_{j1} = ê₁′(x_j − x̄) and ŷ_{j2} = ê₂′(x_j − x̄). Additional principal components could be considered, but two are easier to inspect visually and, of any two components, the first two explain the largest cumulative proportion of the total sample variance.

If a process is stable over time, so that the measured characteristics are influenced only by variations in common causes, then the values of the first two principal components should be stable. Conversely, if the principal components remain stable over time, the common effects that influence the process are likely to remain constant. To monitor quality using principal components, we consider a two-part procedure. The first part of the procedure is to construct an ellipse format chart for the pairs of values (ŷ_{j1}, ŷ_{j2}) for j = 1, 2, …, n.

By (8-20), the sample variance of the first principal component ŷ₁ is given by the largest eigenvalue λ̂₁, and the sample variance of the second principal component ŷ₂ is the second-largest eigenvalue λ̂₂. The two sample components are uncorrelated,
so the quality ellipse for n large (see Section 5.6) reduces to the collection of pairs of possible values (ŷ₁, ŷ₂) such that

  ŷ₁²/λ̂₁ + ŷ₂²/λ̂₂ ≤ χ₂²(α)    (8-36)

Example 8.10 (An ellipse format chart based on the first two principal components)
Refer to the police department overtime data given in Table 5.8. Table 8.1 contains the five normalized eigenvectors and eigenvalues of the sample covariance matrix S. The first two sample components explain 82% of the total variance. The sample values for all five components are displayed in Table 8.2.

TABLE 8.1
EIGENVECTORS AND EIGENVALUES OF SAMPLE COVARIANCE MATRIX FOR POLICE DEPARTMENT DATA

  Variable                        ê₁         ê₂        ê₃        ê₄       ê₅
  Appearances overtime (x₁)      .046      −.048      .629     −.643     .432
  Extraordinary event (x₂)       .039       .985     −.077     −.151    −.007
  Holdover hours (x₃)           −.658       .107      .582      .250    −.392
  COA hours (x₄)                 .734       .069      .503      .397    −.213
  Meeting hours (x₅)            −.155       .107      .081      .586     .784
  λ̂ᵢ                        2,770,226  1,429,206   628,129   221,138   99,824
TABLE 8.2 VALUES OF THE PRINCIPAL COMPONENTS FOR THE POLICE DEPARTMENT DATA

  Period    ŷ_{j1}     ŷ_{j2}     ŷ_{j3}     ŷ_{j4}     ŷ_{j5}
    1       2044.9      588.2      425.8     −189.1     −209.8
    2      −2143.7     −686.2      883.6     −565.9     −441.5
    3       −177.8     −464.6      707.5      736.3       38.2
    4      −2186.2      450.5     −184.0      443.7     −325.3
    5       −878.6     −545.7      115.7      296.4      437.5
    6        563.2    −1045.4      281.2      620.5      142.7
    7        403.1       66.8      340.6     −135.5      521.2
    8      −1988.9     −801.8    −1437.3     −148.8       61.6
    9        132.8      563.7      125.3       68.2      611.5
   10      −2787.3     −213.4        7.8      169.4     −202.3
   11        283.4     3936.9       −0.9      276.2     −159.6
   12        761.6      256.0    −2153.6     −418.8       28.2
   13       −498.3      244.7      966.5    −1142.3      182.6
   14       2366.2    −1193.7     −165.5      270.6     −344.9
   15       1917.8     −782.0      −82.9     −196.8      −89.9
   16       2187.7     −373.8      170.1      −84.1     −250.2
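The ellipse check (8-36) can be applied to these tabled values directly; in the Python sketch below, only period 11 yields ŷ₁²/λ̂₁ + ŷ₂²/λ̂₂ greater than χ₂²(.05) = 5.99:

```python
# Ellipse format chart check (8-36) for the 16 component pairs of Table 8.2,
# with lam1, lam2 from Table 8.1 and chi-square(2, .05) = 5.99
lam1, lam2 = 2770226.0, 1429206.0
y1 = [2044.9, -2143.7, -177.8, -2186.2, -878.6, 563.2, 403.1, -1988.9,
      132.8, -2787.3, 283.4, 761.6, -498.3, 2366.2, 1917.8, 2187.7]
y2 = [588.2, -686.2, -464.6, 450.5, -545.7, -1045.4, 66.8, -801.8,
      563.7, -213.4, 3936.9, 256.0, 244.7, -1193.7, -782.0, -373.8]

d2 = [a * a / lam1 + b * b / lam2 for a, b in zip(y1, y2)]
out_of_control = [j + 1 for j, d in enumerate(d2) if d > 5.99]

print(out_of_control)   # [11]
```

The flagged period is the one with the extreme second-component value 3936.9.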
[Figure 8.7 The 95% control ellipse based on the first two principal components of overtime hours.]
Let us construct a 95% ellipse format chart using the first two sample principal components and plot the 16 pairs of component values in Table 8.2. Although n = 16 is not large, we use χ₂²(.05) = 5.99, and the ellipse becomes

  ŷ₁²/λ̂₁ + ŷ₂²/λ̂₂ ≤ 5.99
This ellipse is shown in Figure 8.7, along with the data. One point is out of control, because the second principal component for this point has a large value. Scanning Table 8.2, we see that this is the value 3936.9 for period 11. According to the entries of ê₂ in Table 8.1, the second principal component is essentially extraordinary event overtime hours. The principal component approach has led us to the same conclusion we came to in Example 5.9.  ■

In the event that special causes are likely to produce shocks to the system, the second part of our two-part procedure, that is, a second chart, is required. This chart is created from the information in the principal components not involved in the ellipse format chart.

Consider the deviation vector X − μ, and assume that X is distributed as N_p(μ, Σ). Even without the normal assumption, X − μ can be expressed as the sum of its projections on the eigenvectors of Σ:

  X − μ = (X − μ)′e₁e₁ + (X − μ)′e₂e₂ + (X − μ)′e₃e₃ + ⋯ + (X − μ)′e_p e_p
or
  X − μ = Y₁e₁ + Y₂e₂ + Y₃e₃ + ⋯ + Y_p e_p    (8-37)

where Yᵢ = (X − μ)′eᵢ is the population ith principal component centered to have mean 0. The approximation to X − μ by the first two principal components has the form Y₁e₁ + Y₂e₂. This leaves an unexplained component of

  X − μ − Y₁e₁ − Y₂e₂

Let E = [e₁, e₂, …, e_p] be the orthogonal matrix whose columns are the eigenvectors of Σ. The orthogonal transformation of the unexplained part is

  E′(X − μ − Y₁e₁ − Y₂e₂) = [Y₁, Y₂, Y₃, …, Y_p]′ − [Y₁, Y₂, 0, …, 0]′ = [0, 0, Y₃, …, Y_p]′

so the last p − 2 principal components are obtained as an orthogonal transformation of the approximation errors. Rather than base the T² chart on the approximation errors, we can, equivalently, base it on these last principal components. Recall that

  Var(Yᵢ) = λᵢ for i = 1, 2, …, p   and   Cov(Yᵢ, Y_k) = 0 for i ≠ k

Consequently, the statistic Y₍₂₎′ Σ_{Y₍₂₎}⁻¹ Y₍₂₎, based on the last p − 2 population principal components, becomes

  Y₃²/λ₃ + Y₄²/λ₄ + ⋯ + Y_p²/λ_p    (8-38)

This is just the sum of the squares of p − 2 independent standard normal variables, λ_k^{−1/2} Y_k, and so has a chi-square distribution with p − 2 degrees of freedom.

In terms of the sample data, the principal components and eigenvalues must be estimated. Because the coefficients of the linear combinations êᵢ are also estimates, the principal components do not have a normal distribution even when the population is normal. However, it is customary to create a T²-chart based on the statistic

  T_j² = ŷ_{j3}²/λ̂₃ + ŷ_{j4}²/λ̂₄ + ⋯ + ŷ_{jp}²/λ̂_p

which involves the estimated eigenvalues and vectors. Further, it is usual to appeal to the large sample approximation described by (8-38) and set the upper control limit of the T²-chart as UCL = c² = χ²_{p−2}(α).

This T²-statistic is based on high-dimensional data. For example, when p = 20 variables are measured, it uses the information in the 18-dimensional space perpendicular to the first two eigenvectors ê₁ and ê₂. Still, this T² based on the unexplained variation in the original observations is reported as highly effective in picking up special causes of variation.
Example 8.11 (A T²-chart for the unexplained (orthogonal) overtime hours)
Consider the quality control analysis of the police department overtime hours in Example 8.10. The first part of the quality monitoring procedure, the quality ellipse based on the first two principal components, was shown in Figure 8.7. To illustrate the second step of the two-step monitoring procedure, we create the chart for the other principal components.

Since p = 5, this chart is based on 5 − 2 = 3 dimensions, and the upper control limit is χ₃²(.05) = 7.81. Using the eigenvalues and the values of the principal components, given in Example 8.10, we plot the time sequence of values

  T_j² = ŷ_{j3}²/λ̂₃ + ŷ_{j4}²/λ̂₄ + ŷ_{j5}²/λ̂₅

where the first value is T² = .891 and so on. The T²-chart is shown in Figure 8.8.
[Figure 8.8 A T²-chart based on the last three principal components of overtime hours.]
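The T² sequence plotted in Figure 8.8 can be recomputed from the eigenvalues of Table 8.1 and the component values of Table 8.2; a Python sketch:

```python
# T^2 from the last three components, using Tables 8.1 and 8.2
lam3, lam4, lam5 = 628129.0, 221138.0, 99824.0
y3 = [425.8, 883.6, 707.5, -184.0, 115.7, 281.2, 340.6, -1437.3,
      125.3, 7.8, -0.9, -2153.6, 966.5, -165.5, -82.9, 170.1]
y4 = [-189.1, -565.9, 736.3, 443.7, 296.4, 620.5, -135.5, -148.8,
      68.2, 169.4, 276.2, -418.8, -1142.3, 270.6, -196.8, -84.1]
y5 = [-209.8, -441.5, 38.2, -325.3, 437.5, 142.7, 521.2, 61.6,
      611.5, -202.3, -159.6, 28.2, 182.6, -344.9, -89.9, -250.2]

T2 = [a * a / lam3 + b * b / lam4 + c * c / lam5
      for a, b, c in zip(y3, y4, y5)]

print(round(T2[0], 3))                               # 0.891
print([j + 1 for j, t in enumerate(T2) if t > 7.0])  # [12, 13]
```

Period 12 exceeds the UCL of 7.81 and period 13 falls just below it, which is the pattern discussed next.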
Since points 12 and 13 exceed or are near the upper control limit, something has happened during these periods. We note that they are just beyond the period in which the extraordinary event overtime hours peaked. From Table 8.2, ŷ_{j3} is large in period 12, and from Table 8.1, the large coefficients in ê₃ belong to legal appearances, holdover, and COA hours. Was there some adjusting of these other categories following the period in which extraordinary hours peaked?  ■

Controlling Future Values

Previously, we considered checking whether a given series of multivariate observations was stable by considering separately the first two principal components and then the last p − 2. Because the chi-square distribution was used to approximate
Previously, we considered checking whether a given series of multivariate observa tions was stable by considering separately the first two principal components and then the last p - 2 . Because the chi-square distribution was used to approximate
460
Chapter 8
Principa l Components
the UCL of the T²-chart and the critical distance for the ellipse format chart, no further modifications are necessary for monitoring future values.

Example 8.12 (Control ellipse for future principal components)
In Example 8.10, we determined that case 11 was out of control. We drop this point and recalculate the eigenvalues and eigenvectors based on the covariance of the remaining 15 observations. The results are shown in Table 8.3.

TABLE 8.3 EIGENVECTORS AND EIGENVALUES FROM THE 15 STABLE OBSERVATIONS
  Variable                         ê₁         ê₂         ê₃         ê₄        ê₅
  Appearances overtime (x₁)       .049       .304       .530       .479      .629
  Extraordinary event (x₂)        .007      −.260       .939      −.078     −.212
  Holdover hours (x₃)            −.662      −.089      −.158      −.437      .582
  COA hours (x₄)                  .731      −.123      −.336       .503     −.291
  Meeting hours (x₅)             −.159      −.752       .081      −.058      .632
  λ̂ᵢ                        2,964,749.9  672,995.1  396,596.5  194,401.0  92,760.3
The principal components have changed. The component consisting primarily of extraordinary event overtime is now the third principal component and is not included in the chart of the first two. Because our initial sample size is only 16, dropping a single case can make a substantial difference. Usually, at least 50 or more observations are needed, from stable operation of the process, in order to set future limits. Figure 8.9 on page 461 gives the 99% prediction ellipse (8-36) for future pairs of values for the new first two principal components of overtime. The 15 stable pairs of principal components are also shown.

In some applications of multivariate control in the chemical and pharmaceutical industries, more than 100 variables are monitored simultaneously. These include numerous process variables as well as quality variables. Typically, the space orthogonal to the first few principal components has a dimension greater than 100, and some of the eigenvalues are very small. An alternative approach (see [13]) to constructing a control chart, one that avoids the difficulty caused by dividing a small squared principal component by a very small eigenvalue, has been successfully applied. To implement this approach, we proceed as follows.

For each stable observation, take the sum of squares of its unexplained component

d²_Uj = (xj - x̄ - ŷj1 ê1 - ŷj2 ê2)'(xj - x̄ - ŷj1 ê1 - ŷj2 ê2)

Note that, by inserting ÊÊ' = I, we also have

d²_Uj = ŷ²j3 + ŷ²j4 + ... + ŷ²jp

which is just the sum of squares of the neglected principal components.
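The equivalence of the two forms of d²_Uj can be checked numerically. The sketch below is only an illustration: it uses simulated stand-in data (not the overtime data) and assumes numpy is available. It computes the residual form and the neglected-components form for the first two sample principal components and confirms they agree.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for 15 stable observations on 5 variables
X = rng.normal(size=(15, 5)) @ np.diag([5.0, 3.0, 2.0, 1.0, 0.5])

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)             # sample covariance (divisor n - 1)
eigvals, E = np.linalg.eigh(S)          # eigh returns ascending order
eigvals, E = eigvals[::-1], E[:, ::-1]  # reorder: largest eigenvalue first

Y = (X - xbar) @ E                      # principal component scores y_jk

# Residual form: d2_U[j] = (x_j - xbar - y_j1 e1 - y_j2 e2)'(same)
resid = (X - xbar) - Y[:, :2] @ E[:, :2].T
d2_residual = (resid ** 2).sum(axis=1)

# Neglected-components form: sum of squares of the last p - 2 scores
d2_neglected = (Y[:, 2:] ** 2).sum(axis=1)

assert np.allclose(d2_residual, d2_neglected)
```

Because E is orthogonal, subtracting the first two component contributions leaves exactly the neglected scores, so the two vectors agree to floating-point precision.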
Section 8.6 Monitoring Quality with Principal Components 461

Figure 8.9 A 99% ellipse format chart for the first two principal components of future overtime.
Using either form, the d²_Uj are plotted versus j to create a control chart. The lower limit of the chart is 0, and the upper limit is set by approximating the distribution of d²_Uj as the distribution of a constant c times a chi-square random variable with ν degrees of freedom. For the chi-square approximation, the constant c and degrees of freedom ν are chosen to match the sample mean and variance of the d²_Uj, j = 1, 2, ..., n. In particular, we set

d̄²_U = (1/n) Σ_{j=1}^n d²_Uj = cν

and determine

s²(d²_U) = (1/(n - 1)) Σ_{j=1}^n (d²_Uj - d̄²_U)² = 2c²ν

The upper control limit is then cχ²_ν(α), where α = .05 or .01.
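The moment matching above gives c and ν in closed form: c = s²/(2d̄²) and ν = 2(d̄²)²/s². A minimal sketch (hypothetical d²_U values; numpy assumed, with the final χ² percentile left to a library such as scipy.stats.chi2):

```python
import numpy as np

# Hypothetical d2_U values from n stable observations
d2 = np.array([1.2, 0.8, 2.5, 1.9, 0.6, 1.4, 3.1, 0.9, 1.7, 2.2])

mean = d2.mean()            # must equal c * nu under the approximation
var = d2.var(ddof=1)        # must equal 2 * c**2 * nu

c = var / (2.0 * mean)      # solve the two moment equations
nu = 2.0 * mean**2 / var

# Check: the fitted c * chi-square(nu) reproduces the sample moments
assert np.isclose(c * nu, mean)
assert np.isclose(2.0 * c**2 * nu, var)

# The upper control limit would be c * chi2.ppf(1 - alpha, nu),
# e.g. using scipy.stats.chi2, with alpha = .05 or .01
```

Solving cν = d̄² and 2c²ν = s² jointly yields the two formulas used above; the assertions confirm the algebra.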
SUPPLEMENT 8A

The Geometry of the Sample Principal Component Approximation
In this supplement, we shall present interpretations for approximations to the data based on the first r sample principal components. The interpretations of both the p-dimensional scatter plot and the n-dimensional representation rely on the algebraic result that follows. We consider approximations of the form A (n×p) = [a1, a2, ..., an]' to the mean-corrected data matrix

[x1 - x̄, x2 - x̄, ..., xn - x̄]'

The error of approximation is quantified as the sum of the np squared errors

Σ_{j=1}^n (xj - x̄ - aj)'(xj - x̄ - aj) = Σ_{j=1}^n Σ_{i=1}^p (xji - x̄i - aji)²     (8A-1)

Result 8A.1. Let A (n×p) be any matrix with rank(A) ≤ r < min(p, n). Let Êr = [ê1, ê2, ..., êr], where êi is the ith eigenvector of S. The error of approximation sum of squares in (8A-1) is minimized by the choice

Â = [ÊrÊ'r(x1 - x̄), ÊrÊ'r(x2 - x̄), ..., ÊrÊ'r(xn - x̄)]'

so the jth column of its transpose Â' is

âj = ŷj1 ê1 + ŷj2 ê2 + ... + ŷjr êr

where

[ŷj1, ŷj2, ..., ŷjr]' = [ê'1(xj - x̄), ê'2(xj - x̄), ..., ê'r(xj - x̄)]'

are the values of the first r sample principal components for the jth unit. Moreover,

Σ_{j=1}^n (xj - x̄ - âj)'(xj - x̄ - âj) = (n - 1)(λ̂_{r+1} + ... + λ̂_p)

where λ̂_{r+1} > ... > λ̂_p are the smallest eigenvalues of S.
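Before turning to the proof, Result 8A.1 can be checked numerically. This sketch uses simulated stand-in data and numpy: it forms the rank-r approximation from the first r sample principal components, confirms that its error equals (n - 1)(λ̂_{r+1} + ... + λ̂_p), and confirms that an arbitrary rank-r projection does no better.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, r = 20, 4, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # hypothetical data

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
eigvals, E = np.linalg.eigh(S)
eigvals, E = eigvals[::-1], E[:, ::-1]        # descending eigenvalues

Er = E[:, :r]
A = (X - xbar) @ Er @ Er.T                    # row j is (E_r E_r'(x_j - xbar))'
err = ((X - xbar - A) ** 2).sum()

# Result 8A.1: minimized error equals (n - 1)(lambda_{r+1} + ... + lambda_p)
assert np.isclose(err, (n - 1) * eigvals[r:].sum())

# Any other rank-r projection does at least as badly
U, _ = np.linalg.qr(rng.normal(size=(p, r)))  # random orthonormal r-frame
B = (X - xbar) @ U @ U.T
assert ((X - xbar - B) ** 2).sum() >= err
```

The first assertion is exact up to floating-point error because Σ_j ŷ²jk = (n - 1)λ̂k for each component k.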
Proof. Consider first any A whose transpose A' has columns aj that are a linear combination of a fixed set of r perpendicular vectors u1, u2, ..., ur, so that U = [u1, u2, ..., ur] satisfies U'U = I. For fixed U, xj - x̄ is best approximated by its projection on the space spanned by u1, u2, ..., ur (see Result 2A.3), or

(xj - x̄)'u1 u1 + (xj - x̄)'u2 u2 + ... + (xj - x̄)'ur ur
    = [u1, u2, ..., ur][u'1(xj - x̄), u'2(xj - x̄), ..., u'r(xj - x̄)]' = UU'(xj - x̄)     (8A-2)

This follows because, for an arbitrary vector bj,

xj - x̄ - Ubj = xj - x̄ - UU'(xj - x̄) + UU'(xj - x̄) - Ubj
             = (I - UU')(xj - x̄) + U(U'(xj - x̄) - bj)

so the error sum of squares is

(xj - x̄ - Ubj)'(xj - x̄ - Ubj) = (xj - x̄)'(I - UU')(xj - x̄) + 0 + (U'(xj - x̄) - bj)'(U'(xj - x̄) - bj)

where the cross product vanishes because (I - UU')U = U - UU'U = U - U = 0. The last term is positive unless bj is chosen so that bj = U'(xj - x̄) and Ubj = UU'(xj - x̄) is the projection of xj - x̄ on the plane. Further, with the choice aj = Ubj = UU'(xj - x̄), (8A-1) becomes

Σ_{j=1}^n (xj - x̄ - UU'(xj - x̄))'(xj - x̄ - UU'(xj - x̄)) = Σ_{j=1}^n (xj - x̄)'(I - UU')(xj - x̄)
    = Σ_{j=1}^n (xj - x̄)'(xj - x̄) - Σ_{j=1}^n (xj - x̄)'UU'(xj - x̄)     (8A-3)

We are now in a position to minimize the error over choices of U by maximizing the last term in (8A-3). By the properties of trace (see Result 2A.12),

Σ_{j=1}^n (xj - x̄)'UU'(xj - x̄) = Σ_{j=1}^n tr[(xj - x̄)'UU'(xj - x̄)]
    = Σ_{j=1}^n tr[UU'(xj - x̄)(xj - x̄)'] = (n - 1) tr[UU'S] = (n - 1) tr[U'SU]     (8A-4)

That is, the best choice for U maximizes the sum of the diagonal elements of U'SU. From (8-19), selecting u1 to maximize u'1Su1, the first diagonal element of U'SU, gives u1 = ê1. For u2 perpendicular to ê1, u'2Su2 is maximized by ê2. [See (2-52).] Continuing, we find that Û = [ê1, ê2, ..., êr] = Êr and Â' = ÊrÊ'r[x1 - x̄, x2 - x̄, ..., xn - x̄], as asserted.

With this choice the ith diagonal element of Û'SÛ is ê'iSêi = ê'i(λ̂iêi) = λ̂i, so tr[Û'SÛ] = λ̂1 + λ̂2 + ... + λ̂r. Also,

Σ_{j=1}^n (xj - x̄)'(xj - x̄) = tr[Σ_{j=1}^n (xj - x̄)(xj - x̄)'] = (n - 1) tr(S) = (n - 1)(λ̂1 + λ̂2 + ... + λ̂p)

Let U = Û in (8A-3), and the error bound follows. ∎

The p-Dimensional Geometrical Interpretation
The geometrical interpretations involve the determination of best approximating planes to the p-dimensional scatter plot. The plane through the origin, determined by u1, u2, ..., ur, consists of all points x with

x = b1u1 + b2u2 + ... + brur = Ub    for some b

This plane, translated to pass through a, becomes a + Ub for some b. We want to select the r-dimensional plane a + Ub that minimizes the sum of squared distances Σ_{j=1}^n d²j between the observations xj and the plane. If xj is approximated by a + Ubj with Σ_{j=1}^n bj = 0,⁵ then

Σ_{j=1}^n (xj - a - Ubj)'(xj - a - Ubj)
    = Σ_{j=1}^n (xj - x̄ - Ubj + x̄ - a)'(xj - x̄ - Ubj + x̄ - a)
    = Σ_{j=1}^n (xj - x̄ - Ubj)'(xj - x̄ - Ubj) + n(x̄ - a)'(x̄ - a)
    ≥ Σ_{j=1}^n (xj - x̄ - ÊrÊ'r(xj - x̄))'(xj - x̄ - ÊrÊ'r(xj - x̄))

by Result 8A.1, since [Ub1, ..., Ubn]' = A' has rank(A) ≤ r. The lower bound is reached by taking a = x̄, so the plane passes through the sample mean. This plane is determined by ê1, ê2, ..., êr. The coefficients of êk are ê'k(xj - x̄) = ŷjk, the kth sample principal component evaluated at the jth observation.

The approximating plane interpretation of sample principal components is illustrated in Figure 8.10.

⁵ If Σ_{j=1}^n bj = nb̄ ≠ 0, use a + Ubj = (a + Ub̄) + U(bj - b̄) = a* + Ub*j.
Figure 8.10 The r = 2-dimensional plane that approximates the scatter plot by minimizing Σ_{j=1}^n d²j.
An alternative interpretation can be given. The investigator places a plane through x̄ and moves it about to obtain the largest spread among the shadows of the observations. From (8A-2), the projection of the deviation xj - x̄ on the plane Ub is vj = UU'(xj - x̄). Now, v̄ = 0 and the sum of the squared lengths of the projection deviations

Σ_{j=1}^n v'j vj = Σ_{j=1}^n (xj - x̄)'UU'(xj - x̄) = (n - 1) tr[U'SU]

is maximized by U = Êr. Also, since v̄ = 0,

(n - 1)Sv = Σ_{j=1}^n (vj - v̄)(vj - v̄)' = Σ_{j=1}^n vj v'j

and this plane also maximizes the total variance

tr(Sv) = (1/(n - 1)) tr[Σ_{j=1}^n vj v'j]
The n-Dimensional Geometrical Interpretation
Let us now consider, by columns, the approximation of the mean-centered data matrix by A. For r = 1, the ith column [x1i - x̄i, x2i - x̄i, ..., xni - x̄i]' is approximated by a multiple ci b' of a fixed vector b' = [b1, b2, ..., bn]. The square of the length of the error of approximation is

L²i = Σ_{j=1}^n (xji - x̄i - ci bj)²

Considering A (n×p) to be of rank one, we conclude from Result 8A.1 that

Â = [ê1ê'1(x1 - x̄), ê1ê'1(x2 - x̄), ..., ê1ê'1(xn - x̄)]'

that is, the jth row of Â is ŷj1ê'1, where ŷj1 = ê'1(xj - x̄). This choice
Figure 8.11 The first sample principal component, ŷ1, minimizes the sum of the squares of the distances, L²i, from the deviation vectors, d'i = [x1i - x̄i, x2i - x̄i, ..., xni - x̄i], to a line. (a) Principal component of S. (b) Principal component of R.
minimizes the sum of squared lengths Σ_{i=1}^p L²i. That is, the best direction is determined by the vector of values of the first principal component. This is illustrated in Figure 8.11(a). Note that the longer deviation vectors (the larger sii's) have the most influence on the minimization of Σ_{i=1}^p L²i.

If the variables are first standardized, the resulting vector [(x1i - x̄i)/√sii, (x2i - x̄i)/√sii, ..., (xni - x̄i)/√sii] has length √(n - 1) for all variables, and each vector exerts equal influence on the choice of direction. [See Figure 8.11(b).]

In either case, the vector b is moved around in n-space to minimize the sum of the squares of the distances Σ_{i=1}^p L²i. In the former case L²i is the squared distance between [x1i - x̄i, x2i - x̄i, ..., xni - x̄i]' and its projection on the line determined by b. The second principal component minimizes the same quantity among all vectors perpendicular to the first choice.
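The n-dimensional picture can be sketched numerically: among all lines through the origin of n-space, the direction given by the vector of first principal component scores minimizes the total squared distance from the deviation vectors. This illustration uses simulated stand-in data and numpy.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 12, 3
X = rng.normal(size=(n, p)) * np.array([3.0, 1.0, 0.5])  # hypothetical data
D = X - X.mean(axis=0)          # column i is the ith deviation vector in n-space

S = np.cov(X, rowvar=False)
eigvals, E = np.linalg.eigh(S)
e1 = E[:, -1]                   # eigenvector of the largest eigenvalue of S

def line_error(b):
    """Total squared distance from each deviation vector to its projection on span{b}."""
    b = b / np.linalg.norm(b)
    proj = np.outer(b, b @ D)   # projection of every column of D onto the line
    return ((D - proj) ** 2).sum()

best = line_error(D @ e1)       # line determined by the first component scores
for _ in range(200):
    assert best <= line_error(rng.normal(size=n)) + 1e-9
```

The optimal line is spanned by the first left singular vector of the centered data matrix, which is proportional to D ê1, the vector of first principal component values.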
EXERCISES
8.1. Determine the population principal components Y1 and Y2 for the covariance matrix

Σ = [ 5  2
      2  2 ]

Also, calculate the proportion of the total population variance explained by the first principal component.

8.2. Convert the covariance matrix in Exercise 8.1 to a correlation matrix ρ.
(a) Determine the principal components Y1 and Y2 from ρ and compute the proportion of total population variance explained by Y1.
(b) Compare the components calculated in Part a with those obtained in Exercise 8.1. Are they the same? Should they be?
(c) Compute the correlations ρ(Y1, Z1), ρ(Y1, Z2), and ρ(Y2, Z1).

8.3. Let

Σ = [ 2  0  0
      0  4  0
      0  0  4 ]

Determine the principal components Y1, Y2, and Y3. What can you say about the eigenvectors (and principal components) associated with eigenvalues that are not distinct?

8.4. Find the principal components and the proportion of the total population variance explained by each when the covariance matrix is

Σ = [ σ²    σ²ρ   0
      σ²ρ   σ²    σ²ρ
      0     σ²ρ   σ²  ],    -1/√2 < ρ < 1/√2

8.5. (a) Find the eigenvalues of the correlation matrix

ρ = [ 1  ρ  ρ
      ρ  1  ρ
      ρ  ρ  1 ]

Are your results consistent with (8-16) and (8-17)?
(b) Verify the eigenvalue-eigenvector pairs for the p × p matrix ρ given in (8-15).

8.6. Data on x1 = sales and x2 = profits for the 10 largest U.S. industrial corporations were listed in Exercise 1.4 of Chapter 1. From Example 4.12,

x̄ = [ 62,309        S = [ 10,005.20   255.76 ] × 10⁵
       2,927 ],           [   255.76    14.30 ]

(a) Determine the sample principal components and their variances for these data. (You may need the quadratic formula to solve for the eigenvalues of S.)
(b) Find the proportion of the total sample variance explained by ŷ1.
(c) Sketch the constant density ellipse (x - x̄)'S⁻¹(x - x̄) = 1.4, and indicate the principal components ŷ1 and ŷ2 on your graph.
(d) Compute the correlation coefficients r(ŷ1, xk), k = 1, 2. What interpretation, if any, can you give to the first principal component?

8.7. Convert the covariance matrix S in Exercise 8.6 to a sample correlation matrix R.
(a) Find the sample principal components ŷ1, ŷ2 and their variances.
(b) Compute the proportion of the total sample variance explained by ŷ1.
(c) Compute the correlation coefficients r(ŷ1, zk), k = 1, 2. Interpret ŷ1.
(d) Compare the components obtained in Part a with those obtained in Exercise 8.6(a). Given the original data displayed in Exercise 1.4, do you feel that it is better to determine principal components from the sample covariance matrix or sample correlation matrix? Explain.
8.8. Use the results in Example 8.5.
(a) Compute the correlations r(ŷi, zk) for i = 1, 2 and k = 1, 2, ..., 5. Do these correlations reinforce the interpretations given to the first two components? Explain.
(b) Test the hypothesis

H0: ρ = ρ0 = [ 1  ρ  ρ  ρ  ρ
               ρ  1  ρ  ρ  ρ
               ρ  ρ  1  ρ  ρ
               ρ  ρ  ρ  1  ρ
               ρ  ρ  ρ  ρ  1 ]

versus H1: ρ ≠ ρ0 at the 5% level of significance. List any assumptions required in carrying out this test.

8.9. (A test that all variables are independent.)
(a) Consider that the normal theory likelihood ratio test of H0: Σ is the diagonal matrix

Σ0 = [ σ11   0   ...   0
        0   σ22  ...   0
        ⋮         ⋱    ⋮
        0    0   ...  σpp ],    σii > 0

Show that the test is as follows: Reject H0 if

Λ = |S|^(n/2) / Π_{i=1}^p s_ii^(n/2) = |R|^(n/2) < c

For a large sample size, -2 ln Λ is approximately χ² with p(p - 1)/2 degrees of freedom. Bartlett [3] suggests that the test statistic -2[1 - (2p + 11)/6n] ln Λ be used in place of -2 ln Λ. This results in an improved chi-square approximation. The large sample α critical point is χ²_{p(p-1)/2}(α). Note that testing Σ = Σ0 is the same as testing ρ = I.
(b) Show that the likelihood ratio test of H0: Σ = σ²I rejects H0 if

Λ = |S|^(n/2) / (tr(S)/p)^(np/2) = [ Π_{i=1}^p λ̂i / ((1/p) Σ_{i=1}^p λ̂i)^p ]^(n/2)
  = [ geometric mean λ̂i / arithmetic mean λ̂i ]^(np/2) < c

For a large sample size, Bartlett [3] suggests that -2[1 - (2p² + p + 2)/6pn] ln Λ is approximately χ² with (p + 2)(p - 1)/2 degrees of freedom. Thus, the large sample α critical point is χ²_{(p+2)(p-1)/2}(α). This test is called a sphericity test, because the constant density contours are spheres when Σ = σ²I.

Hint:
(a) max L(μ, Σ) over μ, Σ is given by (5-10), and max L(μ, Σ0) over μ, Σ0 is the product of the p univariate likelihoods, max (2π)^(-n/2) σii^(-n/2) exp[-Σ_{j=1}^n (xji - μi)²/2σii]. Hence, μ̂i = (1/n) Σ_{j=1}^n xji and σ̂ii = (1/n) Σ_{j=1}^n (xji - x̄i)². The divisor n cancels in Λ, so S may be used.
(b) Verify σ̂² = [Σ_{j=1}^n (xj1 - x̄1)² + ... + Σ_{j=1}^n (xjp - x̄p)²]/np under H0. Again, the divisors n cancel in the statistic, so S may be used. Use Result 5.2 to calculate the chi-square degrees of freedom.
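The independence test of Exercise 8.9(a) reduces to a statistic in |R|: since Λ = |R|^(n/2), we have -2 ln Λ = -n ln|R|. A minimal numerical sketch (simulated data with H0 true; numpy assumed, and the χ² comparison left to a library such as scipy.stats.chi2):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 4
X = rng.normal(size=(n, p))          # hypothetical sample; variables independent

S = np.cov(X, rowvar=False)
d = np.sqrt(np.diag(S))
R = S / np.outer(d, d)               # sample correlation matrix

# Lambda = |R|^(n/2), so -2 ln Lambda = -n ln|R|
log_lambda = (n / 2.0) * np.log(np.linalg.det(R))
bartlett = -2.0 * (1.0 - (2 * p + 11) / (6.0 * n)) * log_lambda

# bartlett would be compared with the chi-square critical value
# with p(p - 1)/2 degrees of freedom
assert 0.0 < np.linalg.det(R) <= 1.0  # Hadamard: |R| <= 1 for a correlation matrix
assert bartlett >= 0.0
```

Because |R| ≤ 1, ln Λ ≤ 0 and the Bartlett statistic is nonnegative, as a test statistic should be.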
The following exercises require the use of a computer.

8.10. The weekly rates of return for five stocks listed on the New York Stock Exchange are given in Table 8.4. (See the stock-price data on the CD-ROM.)

TABLE 8.4 STOCK-PRICE DATA (WEEKLY RATE OF RETURN)

Week   Allied Chemical   Du Pont    Union Carbide   Exxon      Texaco
  1     .000000           .000000    .000000         .039473   -.000000
  2     .027027          -.044855   -.003030        -.014466    .043478
  3     .122807           .060773    .088146         .086238    .078124
  4     .057031           .029948    .066808         .013513    .019512
  5     .063670          -.003793   -.039788        -.018644   -.024154
  6     .003521           .050761    .082873         .074265    .049504
  7    -.045614          -.033007    .002551        -.009646   -.028301
  8     .058823           .041719    .081425        -.014610    .014563
  9     .000000          -.019417    .002353         .001647   -.028708
 10     .006944          -.025990    .007042        -.041118   -.024630
  ⋮          ⋮                ⋮           ⋮               ⋮          ⋮
 91    -.044068           .020704   -.006224        -.018518    .004694
 92     .039007           .038540    .024988        -.028301    .032710
 93    -.039457          -.029297   -.065844        -.015837   -.045758
 94     .039568           .024145   -.006608         .028423   -.009661
 95    -.031142          -.007941    .011080         .007537    .014634
 96     .000000          -.020080   -.006579         .029925   -.004807
 97     .021429           .049180    .006622        -.002421    .028985
 98     .045454           .046375    .074561         .014563    .018779
 99     .050167           .036380    .004082        -.011961    .009216
100     .019108          -.033303    .008362         .033898    .004566

(a) Construct the sample covariance matrix S, and find the sample principal components in (8-20). (Note that the sample mean vector x̄ is displayed in Example 8.5.)
(b) Determine the proportion of the total sample variance explained by the first three principal components. Interpret these components.
(c) Construct Bonferroni simultaneous 90% confidence intervals for the variances λ1, λ2, and λ3 of the first three population components Y1, Y2, and Y3.
(d) Given the results in Parts a-c, do you feel that the stock rates-of-return data can be summarized in fewer than five dimensions? Explain.

8.11. Consider the census-tract data listed in Table 8.5. Suppose the observations on X5 = median value home were recorded in thousands, rather than ten thousands, of dollars; that is, multiply all the numbers listed in the sixth column of the table by 10.
(a) Construct the sample covariance matrix S for the census-tract data when X5 = median value home is recorded in thousands of dollars. (Note that this covariance matrix can be obtained from the covariance matrix given in Example 8.3 by multiplying the off-diagonal elements in the fifth column and row by 10 and the diagonal element s55 by 100. Why?)
(b) Obtain the eigenvalue-eigenvector pairs and the first two sample principal components for the covariance matrix in Part a.
(c) Compute the proportion of total variance explained by the first two principal components obtained in Part b. Calculate the correlation coefficients, r(ŷi, xk), and interpret these components if possible. Compare your results with the results in Example 8.3. What can you say about the effects of this change in scale on the principal components?

8.12. Consider the air-pollution data listed in Table 1.5. Your job is to summarize these data in fewer than p = 7 dimensions if possible. Conduct a principal

TABLE 8.5
CENSUS-TRACT DATA

Tract   Total population   Median school   Total employment   Health services         Median value
        (thousands)        years           (thousands)        employment (hundreds)   home ($10,000s)
  1     5.935              14.2            2.265              2.27                    2.91
  2     1.523              13.1             .597               .75                    2.62
  3     2.599              12.7            1.237              1.11                    1.72
  4     4.009              15.2            1.649               .81                    3.02
  5     4.687              14.7            2.312              2.50                    2.22
  6     8.044              15.6            3.641              4.51                    2.36
  7     2.766              13.3            1.244              1.03                    1.97
  8     6.538              17.0            2.618              2.39                    1.85
  9     6.451              12.9            3.147              5.52                    2.01
 10     3.314              12.2            1.606              2.18                    1.82
 11     3.777              13.0            2.119              2.83                    1.80
 12     1.530              13.8             .798               .84                    4.25
 13     2.768              13.6            1.336              1.75                    2.64
 14     6.585              14.9            2.763              1.91                    3.17

Note: Observations from adjacent census tracts are likely to be correlated. That is, these 14 observations may not constitute a random sample.
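The covariance rescaling claimed in Exercise 8.11(a) can be verified directly: changing the units of one variable by a factor of 10 multiplies its covariances with the other variables by 10 and its own variance by 100. A sketch with simulated stand-in data (numpy assumed; the actual census values are not used here):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(14, 5))     # hypothetical stand-in for 14 tracts, 5 variables

S = np.cov(X, rowvar=False)

X2 = X.copy()
X2[:, 4] *= 10.0                 # record x5 in thousands instead of ten-thousands
S2 = np.cov(X2, rowvar=False)

# Off-diagonal entries of the fifth row/column scale by 10; s55 scales by 100;
# the rest of the matrix is unchanged
assert np.allclose(S2[4, :4], 10.0 * S[4, :4])
assert np.isclose(S2[4, 4], 100.0 * S[4, 4])
assert np.allclose(S2[:4, :4], S[:4, :4])
```

This follows from Cov(aX5, Xk) = a Cov(X5, Xk) and Var(aX5) = a² Var(X5) with a = 10.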
component analysis of the data using both the covariance matrix S and the correlation matrix R. What have you learned? Does it make any difference which matrix is chosen for analysis? Can the data be summarized in three or fewer dimensions? Can you interpret the principal components?

8.13. In the radiotherapy data listed in Table 1.7 (see also the radiotherapy data on the CD-ROM), the n = 98 observations on p = 6 variables represent patients' reactions to radiotherapy.
(a) Obtain the covariance and correlation matrices S and R for these data.
(b) Pick one of the matrices S or R (justify your choice), and determine the eigenvalues and eigenvectors. Prepare a table showing, in decreasing order of size, the percent that each eigenvalue contributes to the total sample variance.
(c) Given the results in Part b, decide on the number of important sample principal components. Is it possible to summarize the radiotherapy data with a single reaction-index component? Explain.
(d) Prepare a table of the correlation coefficients between each principal component you decide to retain and the original variables. If possible, interpret the components.

8.14. Perform a principal component analysis using the sample covariance matrix of the sweat data given in Example 5.2. Construct a Q-Q plot for each of the important principal components. Are there any suspect observations? Explain.

8.15. The four sample standard deviations for the postbirth weights discussed in Example 8.6 are

√s11 = 32.9909,  √s22 = 33.5918,  √s33 = 36.5534,  and  √s44 = 37.3517

Use these and the correlations given in Example 8.6 to construct the sample covariance matrix S. Perform a principal component analysis using S.

8.16. Over a period of five years in the 1990s, yearly samples of fishermen on 28 lakes in Wisconsin were asked to report the time they spent fishing and how many of each type of game fish they caught. Their responses were then converted to a catch rate per hour for

x1 = Bluegill        x2 = Black crappie   x3 = Smallmouth bass
x4 = Largemouth bass x5 = Walleye         x6 = Northern pike

The estimated correlation matrix (courtesy of Jodi Barnet)

      [ 1       .4919   .2636   .4653  -.2277   .0652
        .4919  1        .3127   .3506  -.1917   .2045
R =     .2636   .3127  1        .4108   .0647   .2493
        .4653   .3506   .4108  1       -.2249   .2293
       -.2277  -.1917   .0647  -.2249  1       -.2144
        .0652   .2045   .2493   .2293  -.2144  1      ]

is based on a sample of about 120. (There were a few missing values.)
Fish caught by the same fisherman live alongside of each other, so the data should provide some evidence on how the fish group. The first four fish belong to the centrarchids, the most plentiful family. The walleye is the most popular fish to eat.
(a) Comment on the pattern of correlation within the centrarchid family x1 through x4. Does the walleye appear to group with the other fish?
(b) Perform a principal component analysis using only x1 through x4. Interpret your results.
(c) Perform a principal component analysis using all six variables. Interpret your results.

8.17. Using the data on bone mineral content in Table 1.8, perform a principal component analysis of S.

8.18. The data on national track records for women are listed in Table 1.9.
(a) Obtain the sample correlation matrix R for these data, and determine its eigenvalues and eigenvectors.
(b) Determine the first two principal components for the standardized variables. Prepare a table showing the correlations of the standardized variables with the components, and the cumulative percentage of the total (standardized) sample variance explained by the two components.
(c) Interpret the two principal components obtained in Part b. (Note that the first component is essentially a normalized unit vector and might measure the athletic excellence of a given nation. The second component might measure the relative strength of a nation at the various running distances.)
(d) Rank the nations based on their score on the first principal component. Does this ranking correspond with your intuitive notion of athletic excellence for the various countries?

8.19. Refer to Exercise 8.18. Convert the national track records for women in Table 1.9 to speeds measured in meters per second. Notice that the records for 800 m, 1500 m, 3000 m, and the marathon are given in minutes. The marathon is 26.2 miles, or 42,195 meters, long. Perform a principal components analysis using the covariance matrix S of the speed data. Compare the results with the results in Exercise 8.18. Do your interpretations of the components differ? If the nations are ranked on the basis of their score on the first principal component, does the subsequent ranking differ from that in Exercise 8.18? Which analysis do you prefer? Why?

8.20. The data on national track records for men are listed in Table 8.6. (See also the data on national track records for men on the CD-ROM.) Repeat the principal component analysis outlined in Exercise 8.18 for the men. Are the results consistent with those obtained from the women's data?

8.21. Refer to Exercise 8.20. Convert the national track records for men in Table 8.6 to speeds measured in meters per second. Notice that the records for 800 m, 1500 m, 5000 m, 10,000 m and the marathon are given in minutes. The marathon is 26.2 miles, or 42,195 meters, long. Perform a principal component analysis using the covariance matrix S of the speed data. Compare the results with the results in Exercise 8.20. Which analysis do you prefer? Why?
TABLE 8.6 NATIONAL TRACK RECORDS FOR MEN
Country                                100 m   200 m   400 m   800 m   1500 m   5000 m   10,000 m   Marathon
                                       (s)     (s)     (s)     (min)   (min)    (min)    (min)      (min)
Argentina                              10.39   20.81   46.84   1.81    3.70     14.04    29.36      137.72
Australia                              10.31   20.06   44.84   1.74    3.57     13.28    27.66      128.30
Austria                                10.44   20.81   46.82   1.79    3.60     13.26    27.72      135.90
Belgium                                10.34   20.68   45.04   1.73    3.60     13.22    27.45      129.95
Bermuda                                10.28   20.58   45.91   1.80    3.75     14.68    30.55      146.62
Brazil                                 10.22   20.43   45.21   1.73    3.66     13.62    28.62      133.13
Burma                                  10.64   21.52   48.30   1.80    3.85     14.45    30.28      139.95
Canada                                 10.17   20.22   45.68   1.76    3.63     13.55    28.09      130.15
Chile                                  10.34   20.80   46.20   1.79    3.71     13.61    29.30      134.03
China                                  10.51   21.04   47.30   1.81    3.73     13.90    29.13      133.53
Colombia                               10.43   21.05   46.10   1.82    3.74     13.49    27.88      131.35
Cook Islands                           12.18   23.20   52.94   2.02    4.24     16.70    35.38      164.70
Costa Rica                             10.94   21.90   48.66   1.87    3.84     14.03    28.81      136.58
Czechoslovakia                         10.35   20.65   45.64   1.76    3.58     13.42    28.19      134.32
Denmark                                10.56   20.52   45.89   1.78    3.61     13.50    28.11      130.78
Dominican Republic                     10.14   20.65   46.80   1.82    3.82     14.91    31.45      154.12
Finland                                10.43   20.69   45.49   1.74    3.61     13.27    27.52      130.87
France                                 10.11   20.38   45.28   1.73    3.57     13.34    27.97      132.30
German Democratic Republic             10.12   20.33   44.87   1.73    3.56     13.17    27.42      129.92
Federal Republic of Germany            10.16   20.37   44.50   1.73    3.53     13.21    27.61      132.23
Great Britain and Northern Ireland     10.11   20.21   44.93   1.70    3.51     13.01    27.51      129.13
Greece                                 10.22   20.71   46.56   1.78    3.64     14.59    28.45      134.60
Guatemala                              10.98   21.82   48.40   1.89    3.80     14.16    30.11      139.33
Hungary                                10.26   20.62   46.02   1.77    3.62     13.49    28.44      132.58
India                                  10.60   21.42   45.73   1.76    3.73     13.77    28.81      131.98
Indonesia                              10.59   21.49   47.80   1.84    3.92     14.73    30.79      148.83
Ireland                                10.61   20.96   46.30   1.79    3.56     13.32    27.81      132.35
Israel                                 10.71   21.00   47.80   1.77    3.72     13.66    28.93      137.55
Italy                                  10.01   19.72   45.26   1.73    3.60     13.23    27.52      131.08
Japan                                  10.34   20.81   45.86   1.79    3.64     13.41    27.72      128.63
Kenya                                  10.46   20.66   44.92   1.73    3.55     13.10    27.38      129.75
Korea                                  10.34   20.89   46.90   1.79    3.77     13.96    29.23      136.25
Democratic People's Republic of Korea  10.91   21.94   47.30   1.85    3.77     14.13    29.67      130.87
Luxembourg                             10.35   20.77   47.40   1.82    3.67     13.64    29.08      141.27
Malaysia                               10.40   20.92   46.30   1.82    3.80     14.64    31.01      154.10
Mauritius                              11.19   22.45   47.70   1.88    3.83     15.06    31.77      152.23
Mexico                                 10.42   21.30   46.10   1.80    3.65     13.46    27.95      129.20
Netherlands                            10.52   20.95   45.10   1.74    3.62     13.36    27.61      129.02

(continues on next page)
TABLE 8.6 (continued)

Country            100 m   200 m   400 m   800 m   1500 m   5000 m   10,000 m   Marathon
                   (s)     (s)     (s)     (min)   (min)    (min)    (min)      (min)
New Zealand        10.51   20.88   46.10   1.74    3.54     13.21    27.70      128.98
Norway             10.55   21.16   46.71   1.76    3.62     13.34    27.69      131.48
Papua New Guinea   10.96   21.78   47.90   1.90    4.01     14.72    31.36      148.22
Philippines        10.78   21.64   46.24   1.81    3.83     14.74    30.64      145.27
Poland             10.16   20.24   45.36   1.76    3.60     13.29    27.89      131.58
Portugal           10.53   21.17   46.70   1.79    3.62     13.13    27.38      128.65
Rumania            10.41   20.98   45.87   1.76    3.64     13.25    27.67      132.50
Singapore          10.38   21.28   47.40   1.88    3.89     15.11    31.32      157.77
Spain              10.42   20.77   45.98   1.76    3.55     13.31    27.73      131.57
Sweden             10.25   20.61   45.63   1.77    3.61     13.29    27.94      130.63
Switzerland        10.37   20.46   45.78   1.78    3.55     13.22    27.91      131.20
Taipei             10.59   21.29   46.80   1.79    3.77     14.07    30.07      139.27
Thailand           10.39   21.09   47.91   1.83    3.84     15.23    32.65      149.90
Turkey             10.71   21.43   47.60   1.79    3.67     13.56    28.58      131.50
USA                 9.93   19.75   43.86   1.73    3.53     13.20    27.43      128.22
USSR               10.07   20.00   44.60   1.75    3.59     13.20    27.53      130.55
Western Samoa      10.82   21.86   49.00   2.02    4.24     16.28    34.71      161.83

Source: IAAF/ATFS Track and Field Statistics Handbook for the 1984 Los Angeles Olympics.
8.22. Consider the data on bulls in Table 1.10. Utilizing the seven variables YrHgt, FtFrBody, PrctFFB, Frame, BkFat, SaleHt, and SaleWt, perform a principal component analysis using the covariance matrix S and the correlation matrix R. Your analysis should include the following:
(a) Determine the appropriate number of components to effectively summarize the sample variability. Construct a scree plot to aid your determination.
(b) Interpret the sample principal components.
(c) Do you think it is possible to develop a "body size" or "body configuration" index from the data on the seven variables above? Explain.
(d) Using the values for the first two principal components, plot the data in a two-dimensional space with ŷ1 along the vertical axis and ŷ2 along the horizontal axis. Can you distinguish groups representing the three breeds of cattle? Are there any outliers?
(e) Construct a Q-Q plot using the first principal component. Interpret the plot.

8.23. A naturalist for the Alaska Fish and Game Department studies grizzly bears with the goal of maintaining a healthy population. Measurements on n = 61 bears provided the following summary statistics:

Variable          Weight   Body length   Neck    Girth   Head length   Head width
                  (kg)     (cm)          (cm)    (cm)    (cm)          (cm)
Sample mean x̄     95.52    164.38        55.69   93.39   31.13         17.98
Covariance matrix

      [ 3266.46   1343.97   731.54   1175.50   162.68   238.37
        1343.97    721.91   324.25    537.35    80.17   117.73
S =      731.54    324.25   179.28    281.17    39.15    56.80
        1175.50    537.35   281.17    474.98    63.73    94.85
         162.68     80.17    39.15     63.73     9.95    13.88
         238.37    117.73    56.80     94.85    13.88    21.26 ]

(a) Perform a principal component analysis using the covariance matrix. Can the data be effectively summarized in fewer than six dimensions?
(b) Perform a principal component analysis using the correlation matrix.
(c) Comment on the similarities and differences between the two analyses.

8.24. Refer to Example 8.10 and the data in Table 5.8, page 240. Add the variable x6 = regular overtime hours whose values are (read across)

6187   7336   6988   6964   8425   6778   5922   7307
7679   8259  10954   9353   6291   4969   4825   6019

and redo Example 8.10.

8.25. Refer to the police overtime hours data in Example 8.10. Construct an alternate control chart, based on the sum of squares d²_Uj, to monitor the unexplained variation in the original observations summarized by the additional principal components.
REFERENCES

1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (2d ed.). New York: John Wiley, 1984.
2. Anderson, T. W. "Asymptotic Theory for Principal Components Analysis." Annals of Mathematical Statistics, 34 (1963), 122-148.
3. Bartlett, M. S. "A Note on Multiplying Factors for Various Chi-Squared Approximations." Journal of the Royal Statistical Society (B), 16 (1954), 296-298.
4. Dawkins, B. "Multivariate Analysis of National Track Records." The American Statistician, 43 (1989), 110-115.
5. Girschick, M. A. "On the Sampling Theory of Roots of Determinantal Equations." Annals of Mathematical Statistics, 10 (1939), 203-224. 6. Hotelling, H. "Analysis of a Complex of Statistical Variables into Principal Components." Journal of Educational Psychology, 24 (1933) , 417-441 , 498-520.
7. Hotelling, H. "The Most Predictable Criterion." Journal of Educational Psychology, 26 (1935), 139-142. 8. Hotelling, H. " Simplified Calculation of Principal Components." Psychometrika , 1 (1936), 27-35. 9. Hotelling, H. "Relations between Two Sets of Variates." Biometrika, 28 (1936), 321-377.
10. Jolicoeur, P. "The Multivariate Generalization of the Allometry Equation." Biometrics, 19 (1963), 497-499.
11. Jolicoeur, P., and J. E. Mosimann. "Size and Shape Variation in the Painted Turtle: A Principal Component Analysis." Growth, 24 (1960), 339-354.
12. King, B. "Market and Industry Factors in Stock Price Behavior." Journal of Business, 39 (1966), 139-190.
13. Kourti, T., and J. McGregor. "Multivariate SPC Methods for Process and Product Monitoring." Journal of Quality Technology, 28 (1996), 409-428.
14. Lawley, D. N. "On Testing a Set of Correlation Coefficients for Equality." Annals of Mathematical Statistics, 34 (1963), 149-151.
15. Maxwell, A. E. Multivariate Analysis in Behavioural Research. London: Chapman and Hall, 1977.
16. Rao, C. R. Linear Statistical Inference and Its Applications (2d ed.). New York: John Wiley, 1973.
17. Rencher, A. C. "Interpretation of Canonical Discriminant Functions, Canonical Variates and Principal Components." The American Statistician, 46 (1992), 217-225.
CHAPTER 9

Factor Analysis and Inference for Structured Covariance Matrices

9.1 INTRODUCTION

Factor analysis has provoked rather turbulent controversy throughout its history. Its modern beginnings lie in the early 20th-century attempts of Karl Pearson, Charles Spearman, and others to define and measure intelligence. Because of this early association with constructs such as intelligence, factor analysis was nurtured and developed primarily by scientists interested in psychometrics. Arguments over the psychological interpretations of several early studies and the lack of powerful computing facilities impeded its initial development as a statistical method. The advent of high-speed computers has generated a renewed interest in the theoretical and computational aspects of factor analysis. Most of the original techniques have been abandoned and early controversies resolved in the wake of recent developments. It is still true, however, that each application of the technique must be examined on its own merits to determine its success.

The essential purpose of factor analysis is to describe, if possible, the covariance relationships among many variables in terms of a few underlying, but unobservable, random quantities called factors. Basically, the factor model is motivated by the following argument: Suppose variables can be grouped by their correlations. That is, suppose all variables within a particular group are highly correlated among themselves, but have relatively small correlations with variables in a different group. Then it is conceivable that each group of variables represents a single underlying construct, or factor, that is responsible for the observed correlations. For example, correlations from the group of test scores in classics, French, English, mathematics, and music collected by Spearman suggested an underlying "intelligence" factor. A second group of variables, representing physical-fitness scores, if available, might correspond to another factor. It is this type of structure that factor analysis seeks to confirm.
Chapter 9: Factor Analysis and Inference for Structured Covariance Matrices
Factor analysis can be considered an extension of principal component analysis. Both can be viewed as attempts to approximate the covariance matrix Σ. However, the approximation based on the factor analysis model is more elaborate. The primary question in factor analysis is whether the data are consistent with a prescribed structure.

9.2 THE ORTHOGONAL FACTOR MODEL

The observable random vector X, with p components, has mean μ and covariance matrix Σ. The factor model postulates that X is linearly dependent upon a few unobservable random variables F₁, F₂, ..., Fₘ, called common factors, and p additional sources of variation ε₁, ε₂, ..., εₚ, called errors or, sometimes, specific factors.¹ In particular, the factor analysis model is

    X₁ − μ₁ = ℓ₁₁F₁ + ℓ₁₂F₂ + ⋯ + ℓ₁ₘFₘ + ε₁
    X₂ − μ₂ = ℓ₂₁F₁ + ℓ₂₂F₂ + ⋯ + ℓ₂ₘFₘ + ε₂
       ⋮
    Xₚ − μₚ = ℓₚ₁F₁ + ℓₚ₂F₂ + ⋯ + ℓₚₘFₘ + εₚ                    (9-1)

or, in matrix notation,

    X − μ  =   L    F   +   ε
    (p×1)    (p×m)(m×1)   (p×1)                                  (9-2)

The coefficient ℓᵢⱼ is called the loading of the ith variable on the jth factor, so the matrix L is the matrix of factor loadings. Note that the ith specific factor εᵢ is associated only with the ith response Xᵢ. The p deviations X₁ − μ₁, X₂ − μ₂, ..., Xₚ − μₚ are expressed in terms of p + m random variables F₁, F₂, ..., Fₘ, ε₁, ε₂, ..., εₚ, which are unobservable. This distinguishes the factor model of (9-2) from the multivariate regression model in (7-26), in which the independent variables [whose position is occupied by F in (9-2)] can be observed.
With so many unobservable quantities, a direct verification of the factor model from observations on X₁, X₂, ..., Xₚ is hopeless. However, with some additional assumptions about the random vectors F and ε, the model in (9-2) implies certain covariance relationships, which can be checked. We assume that

    E(F) = 0,    Cov(F) = E[FF′] = I   (the m × m identity matrix)
    E(ε) = 0,    Cov(ε) = E[εε′] = Ψ = diag(ψ₁, ψ₂, ..., ψₚ)        (9-3)

¹ As Maxwell [22] points out, in many investigations the εᵢ tend to be combinations of measurement error and factors that are uniquely associated with the individual variables.
and that F and ε are independent, so

    Cov(ε, F) = E(εF′) =   0
                         (p×m)

These assumptions and the relation in (9-2) constitute the orthogonal factor model.²

    Orthogonal Factor Model with m Common Factors

        X  =  μ  +  L   F  +  ε
      (p×1) (p×1) (p×m)(m×1) (p×1)

    where μᵢ is the mean of variable i, εᵢ is the ith specific factor, Fⱼ is the jth common factor, and ℓᵢⱼ is the loading of the ith variable on the jth factor. The unobservable random vectors F and ε satisfy E(F) = 0, Cov(F) = I, E(ε) = 0, Cov(ε) = Ψ (a diagonal matrix), and F and ε are independent.        (9-4)

The orthogonal factor model implies a covariance structure for X. From the model in (9-4),

    (X − μ)(X − μ)′ = (LF + ε)(LF + ε)′
                    = (LF + ε)((LF)′ + ε′)
                    = LF(LF)′ + ε(LF)′ + LFε′ + εε′

so that

    Σ = Cov(X) = E(X − μ)(X − μ)′
      = L E(FF′)L′ + E(εF′)L′ + L E(Fε′) + E(εε′)
      = LL′ + Ψ

according to (9-3) and the independence of F and ε, which gives Cov(ε, F) = E(εF′) = 0. Also, by the model in (9-4), (X − μ)F′ = (LF + ε)F′ = LFF′ + εF′, so

    Cov(X, F) = E(X − μ)F′ = L E(FF′) + E(εF′) = L

Collecting these results, the covariance structure for the orthogonal factor model is

    1. Cov(X) = LL′ + Ψ, or
          Var(Xᵢ) = ℓᵢ₁² + ⋯ + ℓᵢₘ² + ψᵢ
          Cov(Xᵢ, Xₖ) = ℓᵢ₁ℓₖ₁ + ⋯ + ℓᵢₘℓₖₘ
    2. Cov(X, F) = L, or Cov(Xᵢ, Fⱼ) = ℓᵢⱼ                        (9-5)

² Allowing the factors F to be correlated so that Cov(F) is not diagonal gives the oblique factor model. The oblique model presents some additional estimation difficulties and will not be discussed in this book. (See [20].)
The model X − μ = LF + ε is linear in the common factors. If the p responses X are, in fact, related to underlying factors, but the relationship is nonlinear, such as in X₁ − μ₁ = ℓ₁₁F₁F₃ + ε₁, X₂ − μ₂ = ℓ₂₁F₂F₃ + ε₂, and so forth, then the covariance structure LL′ + Ψ given by (9-5) may not be adequate. The very important assumption of linearity is inherent in the formulation of the traditional factor model.

That portion of the variance of the ith variable contributed by the m common factors is called the ith communality. That portion of Var(Xᵢ) = σᵢᵢ due to the specific factor is often called the uniqueness, or specific variance. Denoting the ith communality by hᵢ², we see from (9-5) that

    σᵢᵢ    =   ℓᵢ₁² + ℓᵢ₂² + ⋯ + ℓᵢₘ²   +   ψᵢ
  Var(Xᵢ)  =      communality           + specific variance

or

    hᵢ² = ℓᵢ₁² + ℓᵢ₂² + ⋯ + ℓᵢₘ²
    σᵢᵢ = hᵢ² + ψᵢ,     i = 1, 2, ..., p                          (9-6)

The ith communality is the sum of squares of the loadings of the ith variable on the
m common factors.

Example 9.1 (Verifying the relation Σ = LL′ + Ψ for two factors)

Consider the covariance matrix

    Σ = [19 30  2 12]
        [30 57  5 23]
        [ 2  5 38 47]
        [12 23 47 68]

The equality

    [19 30  2 12]   [ 4 1]
    [30 57  5 23] = [ 7 2] [4 7 −1 1]   [2 0 0 0]
    [ 2  5 38 47]   [−1 6] [1 2  6 8] + [0 4 0 0]
    [12 23 47 68]   [ 1 8]              [0 0 1 0]
                                        [0 0 0 3]

or

    Σ = LL′ + Ψ

may be verified by matrix algebra. Therefore, Σ has the structure produced by an m = 2 orthogonal factor model. Since

        [ℓ₁₁ ℓ₁₂]   [ 4 1]        [ψ₁ 0  0  0 ]   [2 0 0 0]
    L = [ℓ₂₁ ℓ₂₂] = [ 7 2] ,  Ψ = [0  ψ₂ 0  0 ] = [0 4 0 0]
        [ℓ₃₁ ℓ₃₂]   [−1 6]        [0  0  ψ₃ 0 ]   [0 0 1 0]
        [ℓ₄₁ ℓ₄₂]   [ 1 8]        [0  0  0  ψ₄]   [0 0 0 3]

the communality of X₁ is, from (9-6),

    h₁² = ℓ₁₁² + ℓ₁₂² = 4² + 1² = 17

and the variance of X₁ can be decomposed as

    σ₁₁ = (ℓ₁₁² + ℓ₁₂²) + ψ₁ = h₁² + ψ₁

or

      19    =   4² + 1²    +   2   =   17 + 2
   variance   communality    specific variance

A similar breakdown occurs for the other variables. •
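The matrix identity in Example 9.1 is easy to check numerically. A short sketch using NumPy (ours, not part of the text):

```python
import numpy as np

# Covariance matrix, loadings, and specific variances from Example 9.1.
Sigma = np.array([[19., 30.,  2., 12.],
                  [30., 57.,  5., 23.],
                  [ 2.,  5., 38., 47.],
                  [12., 23., 47., 68.]])
L = np.array([[ 4., 1.],
              [ 7., 2.],
              [-1., 6.],
              [ 1., 8.]])
Psi = np.diag([2., 4., 1., 3.])

# The factorization Sigma = L L' + Psi of (9-5) holds exactly here.
assert np.allclose(Sigma, L @ L.T + Psi)

# Communality of X1 from (9-6): h1^2 = l11^2 + l12^2, and sigma_11 = h1^2 + psi_1.
h1_sq = L[0, 0]**2 + L[0, 1]**2
print(h1_sq, h1_sq + Psi[0, 0])   # 17.0 19.0
```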
The factor model assumes that the p + p(p − 1)/2 = p(p + 1)/2 variances and covariances for X can be reproduced from the pm factor loadings ℓᵢⱼ and the p specific variances ψᵢ. When m = p, any covariance matrix Σ can be reproduced exactly as LL′ [see (9-11)], so Ψ can be the zero matrix. However, it is when m is small relative to p that factor analysis is most useful. In this case, the factor model provides a "simple" explanation of the covariation in X with fewer parameters than the p(p + 1)/2 parameters in Σ. For example, if X contains p = 12 variables, and the factor model in (9-4) with m = 2 is appropriate, then the p(p + 1)/2 = 12(13)/2 = 78 elements of Σ are described in terms of the mp + p = 12(2) + 12 = 36 parameters ℓᵢⱼ and ψᵢ of the factor model.
Unfortunately for the factor analyst, most covariance matrices cannot be factored as LL′ + Ψ, where the number of factors m is much less than p. The following example demonstrates one of the problems that can arise when attempting to determine the parameters ℓᵢⱼ and ψᵢ from the variances and covariances of the observable variables.

Example 9.2 (Nonexistence of a proper solution)
Let p = 3 and m = 1, and suppose the random variables X₁, X₂, and X₃ have the positive definite covariance matrix

    Σ = [1.0  .9  .7]
        [ .9 1.0  .4]
        [ .7  .4 1.0]

Using the factor model in (9-4), we obtain

    X₁ − μ₁ = ℓ₁₁F₁ + ε₁
    X₂ − μ₂ = ℓ₂₁F₁ + ε₂
    X₃ − μ₃ = ℓ₃₁F₁ + ε₃

The covariance structure in (9-5) implies that

    Σ = LL′ + Ψ

or

    1 = ℓ₁₁² + ψ₁      .90 = ℓ₁₁ℓ₂₁      .70 = ℓ₁₁ℓ₃₁
    1 = ℓ₂₁² + ψ₂      .40 = ℓ₂₁ℓ₃₁
    1 = ℓ₃₁² + ψ₃

The pair of equations

    .70 = ℓ₁₁ℓ₃₁
    .40 = ℓ₂₁ℓ₃₁

implies that

    ℓ₂₁ = (.40/.70) ℓ₁₁

Substituting this result for ℓ₂₁ in the equation .90 = ℓ₁₁ℓ₂₁ yields ℓ₁₁² = 1.575, or ℓ₁₁ = ±1.255. Since Var(F₁) = 1 (by assumption) and Var(X₁) = 1, ℓ₁₁ = Cov(X₁, F₁) = Corr(X₁, F₁). Now, a correlation coefficient cannot be greater than unity (in absolute value), so, from this point of view, |ℓ₁₁| = 1.255 is too large. Also, the equation

    1 = ℓ₁₁² + ψ₁,   or   ψ₁ = 1 − ℓ₁₁²

gives

    ψ₁ = 1 − 1.575 = −.575

which is unsatisfactory, since it gives a negative value for Var(ε₁) = ψ₁.

Thus, for this example with m = 1, it is possible to get a unique numerical solution to the equations Σ = LL′ + Ψ. However, the solution is not consistent with the statistical interpretation of the coefficients, so it is not a proper solution. •
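The little algebra in Example 9.2 can be replayed in a few lines; the sketch below (ours) just solves the three off-diagonal equations and then the diagonal one:

```python
import math

# Off-diagonal equations of Sigma = L L' + Psi for p = 3, m = 1 (Example 9.2):
#   .90 = l11*l21,  .70 = l11*l31,  .40 = l21*l31
l11_sq = (0.90 * 0.70) / 0.40   # multiply the first two, divide by the third
l11 = math.sqrt(l11_sq)
psi1 = 1.0 - l11_sq             # diagonal equation: 1 = l11^2 + psi1

# |l11| exceeds 1 (an impossible correlation) and psi1 is a negative variance.
print(round(l11, 3), round(psi1, 3))   # 1.255 -0.575
```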
When m > 1, there is always some inherent ambiguity associated with the factor model. To see this, let T be any m × m orthogonal matrix, so that TT′ = T′T = I. Then the expression in (9-2) can be written

    X − μ = LF + ε = LTT′F + ε = L*F* + ε                        (9-7)

where

    L* = LT   and   F* = T′F

Since

    E(F*) = T′E(F) = 0

and

    Cov(F*) = T′Cov(F)T = T′T =   I
                                (m×m)

it is impossible, on the basis of observations on X, to distinguish the loadings L from the loadings L*. That is, the factors F and F* = T′F have the same statistical properties, and even though the loadings L* are, in general, different from the loadings L, they both generate the same covariance matrix Σ. That is,

    Σ = LL′ + Ψ = LTT′L′ + Ψ = (L*)(L*)′ + Ψ                     (9-8)

This ambiguity provides the rationale for "factor rotation," since orthogonal matrices correspond to rotations (and reflections) of the coordinate system for X.
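The invariance in (9-8) is easy to see numerically; a sketch (ours) using the loadings of Example 9.1 and an arbitrary rotation:

```python
import numpy as np

# Loadings and specific variances from Example 9.1 (any L and Psi would do).
L = np.array([[4., 1.], [7., 2.], [-1., 6.], [1., 8.]])
Psi = np.diag([2., 4., 1., 3.])

# Any 2x2 orthogonal matrix T; here a rotation by 30 degrees.
theta = np.pi / 6
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

L_star = L @ T   # rotated loadings, as in (9-7)

# (9-8): L and L* = LT generate exactly the same covariance matrix.
assert np.allclose(L @ L.T + Psi, L_star @ L_star.T + Psi)
```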
The analysis of the factor model proceeds by imposing conditions that allow one to uniquely estimate L and Ψ. The loading matrix is then rotated (multiplied by an orthogonal matrix), where the rotation is determined by some "ease-of-interpretation" criterion. Once the loadings and specific variances are obtained, factors are identified, and estimated values for the factors themselves (called factor scores) are frequently constructed.
9.3 METHODS OF ESTIMATION

Given observations x₁, x₂, ..., xₙ on p generally correlated variables, factor analysis seeks to answer the question, Does the factor model of (9-4), with a small number of factors, adequately represent the data? In essence, we tackle this statistical model-building problem by trying to verify the covariance relationship in (9-5).

The sample covariance matrix S is an estimator of the unknown population covariance matrix Σ. If the off-diagonal elements of S are small or those of the sample correlation matrix R essentially zero, the variables are not related, and a factor analysis will not prove useful. In these circumstances, the specific factors play the dominant role, whereas the major aim of factor analysis is to determine a few important common factors.

If Σ appears to deviate significantly from a diagonal matrix, then a factor model can be entertained, and the initial problem is one of estimating the factor loadings ℓᵢⱼ and specific variances ψᵢ. We shall consider two of the most popular methods of parameter estimation, the principal component (and the related principal factor) method and the maximum likelihood method. The solution from either method can be rotated in order to simplify the interpretation of factors, as described in Section 9.4. It is always prudent to try more than one method of solution; if the factor model is appropriate for the problem at hand, the solutions should be consistent with one another.

Current estimation and rotation methods require iterative calculations that must be done on a computer. Several computer programs are now available for this purpose.

The Principal Component (and Principal Factor) Method
The spectral decomposition of (2-20) provides us with one factoring of the covariance matrix Σ. Let Σ have eigenvalue-eigenvector pairs (λᵢ, eᵢ) with λ₁ ≥ λ₂ ≥ ⋯ ≥ λₚ ≥ 0. Then

    Σ = λ₁e₁e₁′ + λ₂e₂e₂′ + ⋯ + λₚeₚeₚ′

      = [√λ₁ e₁ ⋮ √λ₂ e₂ ⋮ ⋯ ⋮ √λₚ eₚ] [√λ₁ e₁′]
                                        [√λ₂ e₂′]
                                        [   ⋮   ]
                                        [√λₚ eₚ′]                 (9-10)

This fits the prescribed covariance structure for the factor analysis model having as many factors as variables (m = p) and specific variances ψᵢ = 0 for all i. The loading matrix has jth column given by √λⱼ eⱼ. That is, we can write

      Σ   =   L    L′  +   0
    (p×p)   (p×p)(p×p)   (p×p)                                    (9-11)

Apart from the scale factor √λⱼ, the factor loadings on the jth factor are the coefficients for the jth principal component of the population.

Although the factor analysis representation of Σ in (9-11) is exact, it is not particularly useful: It employs as many common factors as there are variables and does not allow for any variation in the specific factors ε in (9-4). We prefer models that explain the covariance structure in terms of just a few common factors. One approach,
when the last p − m eigenvalues are small, is to neglect the contribution of λₘ₊₁eₘ₊₁eₘ₊₁′ + ⋯ + λₚeₚeₚ′ to Σ in (9-10). Neglecting this contribution, we obtain the approximation

    Σ ≐ [√λ₁ e₁ ⋮ √λ₂ e₂ ⋮ ⋯ ⋮ √λₘ eₘ] [√λ₁ e₁′]  =   L    L′
                                        [√λ₂ e₂′]    (p×m)(m×p)
                                        [   ⋮   ]
                                        [√λₘ eₘ′]                 (9-12)

The approximate representation in (9-12) assumes that the specific factors ε in (9-4) are of minor importance and can also be ignored in the factoring of Σ. If specific factors are included in the model, their variances may be taken to be the diagonal elements of Σ − LL′, where LL′ is as defined in (9-12).

Allowing for specific factors, we find that the approximation becomes

    Σ ≐ LL′ + Ψ
      = [√λ₁ e₁ ⋮ √λ₂ e₂ ⋮ ⋯ ⋮ √λₘ eₘ] [√λ₁ e₁′]
                                        [√λ₂ e₂′]  +  diag(ψ₁, ψ₂, ..., ψₚ)
                                        [   ⋮   ]
                                        [√λₘ eₘ′]                 (9-13)

where ψᵢ = σᵢᵢ − (ℓᵢ₁² + ℓᵢ₂² + ⋯ + ℓᵢₘ²) for i = 1, 2, ..., p.

To apply this approach to a data set x₁, x₂, ..., xₙ, it is customary first to center the observations by subtracting the sample mean x̄. The centered observations

    xⱼ − x̄ = (xⱼ₁ − x̄₁, xⱼ₂ − x̄₂, ..., xⱼₚ − x̄ₚ)′,   j = 1, 2, ..., n        (9-14)

have the same sample covariance matrix S as the original observations.

In cases where the units of the variables are not commensurate, it is usually desirable to work with the standardized variables

    zⱼ = ((xⱼ₁ − x̄₁)/√s₁₁, (xⱼ₂ − x̄₂)/√s₂₂, ..., (xⱼₚ − x̄ₚ)/√sₚₚ)′,   j = 1, 2, ..., n

whose sample covariance matrix is the sample correlation matrix R of the observations x₁, x₂, ..., xₙ. Standardization avoids the problems of having one variable with large variance unduly influencing the determination of factor loadings.
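The effect of centering and standardizing can be checked on simulated data; a NumPy sketch (ours, with arbitrary made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) @ np.array([[2., 0., 0.],
                                         [0., 5., 1.],
                                         [0., 1., .5]])   # 50 obs., 3 variables

# Centering as in (9-14) leaves the sample covariance matrix S unchanged ...
Xc = X - X.mean(axis=0)
S = np.cov(X, rowvar=False)
assert np.allclose(S, np.cov(Xc, rowvar=False))

# ... while full standardization turns S into the sample correlation matrix R.
Z = Xc / X.std(axis=0, ddof=1)
assert np.allclose(np.cov(Z, rowvar=False), np.corrcoef(X, rowvar=False))
```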
The representation in (9-13), when applied to the sample covariance matrix S or the sample correlation matrix R, is known as the principal component solution. The name follows from the fact that the factor loadings are the scaled coefficients of the first few sample principal components. (See Chapter 8.)

    Principal Component Solution of the Factor Model

    The principal component factor analysis of the sample covariance matrix S is specified in terms of its eigenvalue-eigenvector pairs (λ̂₁, ê₁), (λ̂₂, ê₂), ..., (λ̂ₚ, êₚ), where λ̂₁ ≥ λ̂₂ ≥ ⋯ ≥ λ̂ₚ. Let m < p be the number of common factors. Then the matrix of estimated factor loadings {ℓ̃ᵢⱼ} is given by

        L̃ = [√λ̂₁ ê₁ ⋮ √λ̂₂ ê₂ ⋮ ⋯ ⋮ √λ̂ₘ êₘ]                      (9-15)

    The estimated specific variances are provided by the diagonal elements of the matrix S − L̃L̃′, so

        ψ̃ᵢ = sᵢᵢ − (ℓ̃ᵢ₁² + ℓ̃ᵢ₂² + ⋯ + ℓ̃ᵢₘ²)                     (9-16)

    and communalities are estimated as

        h̃ᵢ² = ℓ̃ᵢ₁² + ℓ̃ᵢ₂² + ⋯ + ℓ̃ᵢₘ²                            (9-17)

    The principal component factor analysis of the sample correlation matrix is obtained by starting with R in place of S.

For the principal component solution, the estimated loadings for a given factor do not change as the number of factors is increased. For example, if m = 1, L̃ = [√λ̂₁ ê₁], and if m = 2, L̃ = [√λ̂₁ ê₁ ⋮ √λ̂₂ ê₂], where (λ̂₁, ê₁) and (λ̂₂, ê₂) are the first two eigenvalue-eigenvector pairs for S (or R).

By the definition of ψ̃ᵢ, the diagonal elements of S are equal to the diagonal elements of L̃L̃′ + Ψ̃. However, the off-diagonal elements of S are not usually reproduced by L̃L̃′ + Ψ̃. How, then, do we select the number of factors m?

If the number of common factors is not determined by a priori considerations, such as by theory or the work of other researchers, the choice of m can be based on the estimated eigenvalues in much the same manner as with principal components. Consider the residual matrix

    S − (L̃L̃′ + Ψ̃)                                               (9-18)

resulting from the approximation of S by the principal component solution. The diagonal elements are zero, and if the other elements are also small, we may subjectively take the m factor model to be appropriate. Analytically, we have (see Exercise 9.5)

    Sum of squared entries of (S − (L̃L̃′ + Ψ̃)) ≤ λ̂ₘ₊₁² + ⋯ + λ̂ₚ²     (9-19)
Consequently, a small value for the sum of the squares of the neglected eigenvalues implies a small value for the sum of the squared errors of approximation.

Ideally, the contributions of the first few factors to the sample variances of the variables should be large. The contribution to the sample variance sᵢᵢ from the first common factor is ℓ̃ᵢ₁². The contribution to the total sample variance, s₁₁ + s₂₂ + ⋯ + sₚₚ = tr(S), from the first common factor is then

    ℓ̃₁₁² + ℓ̃₂₁² + ⋯ + ℓ̃ₚ₁² = (√λ̂₁ ê₁)′(√λ̂₁ ê₁) = λ̂₁

since the eigenvector ê₁ has unit length. In general,

    Proportion of total sample variance due to jth factor
        = λ̂ⱼ / (s₁₁ + s₂₂ + ⋯ + sₚₚ)   for a factor analysis of S
        = λ̂ⱼ / p                        for a factor analysis of R        (9-20)

Criterion (9-20) is frequently used as a heuristic device for determining the appropriate number of common factors. The number of common factors retained in the model is increased until a "suitable proportion" of the total sample variance has been explained.

Another convention, frequently encountered in packaged computer programs, is to set m equal to the number of eigenvalues of R greater than one if the sample correlation matrix is factored, or equal to the number of positive eigenvalues of S if the sample covariance matrix is factored. These rules of thumb should not be applied indiscriminately. For example, m = p if the rule for S is obeyed, since all the eigenvalues are expected to be positive for large sample sizes. The best approach is to retain few rather than many factors, assuming that they provide a satisfactory interpretation of the data and yield a satisfactory fit to S or R.

Example 9.3 (Factor analysis of consumer-preference data)
In a consumer-preference study, a random sample of customers were asked to rate several attributes of a new product. The responses, on a 7-point semantic differential scale, were tabulated and the attribute correlation matrix constructed. The correlation matrix is presented next (the entries circled in the original display are marked * here):

    Attribute (Variable)                1      2      3      4      5
    Taste                      1      1.00    .02    .96*   .42    .01
    Good buy for money         2       .02   1.00    .13    .71    .85*
    Flavor                     3       .96*   .13   1.00    .50    .11
    Suitable for snack         4       .42    .71    .50   1.00    .79*
    Provides lots of energy    5       .01    .85*   .11    .79*  1.00

It is clear from the marked entries in the correlation matrix that variables 1 and 3 and variables 2 and 5 form groups. Variable 4 is "closer" to the (2, 5) group than the (1, 3) group. Given these results and the small number of variables, we might expect that the apparent linear relationships between the variables can be explained in terms of, at most, two or three common factors.
The first two eigenvalues, λ̂₁ = 2.85 and λ̂₂ = 1.81, of R are the only eigenvalues greater than unity. Moreover, m = 2 common factors will account for a cumulative proportion

    (λ̂₁ + λ̂₂)/p = (2.85 + 1.81)/5 = .93

of the total (standardized) sample variance. The estimated factor loadings, communalities, and specific variances, obtained using (9-15), (9-16), and (9-17), are given in Table 9.1.

TABLE 9.1

                                 Estimated factor
                                 loadings
                                 ℓ̃ᵢⱼ = √λ̂ⱼ êᵢⱼ      Communalities   Specific variances
    Variable                       F₁      F₂          h̃ᵢ²             ψ̃ᵢ = 1 − h̃ᵢ²
    1. Taste                      .56     .82          .98             .02
    2. Good buy for money         .78    −.53          .88             .12
    3. Flavor                     .65     .75          .98             .02
    4. Suitable for snack         .94    −.10          .89             .11
    5. Provides lots of energy    .80    −.54          .93             .07
    Eigenvalues                  2.85    1.81
    Cumulative proportion of
    total (standardized)
    sample variance              .571    .932

Now,

               [.56  .82]
               [.78 −.53] [.56  .78  .65  .94  .80]
    L̃L̃′ + Ψ̃ = [.65  .75] [.82 −.53  .75 −.10 −.54] + diag(.02, .12, .02, .11, .07)
               [.94 −.10]
               [.80 −.54]

             = [1.00  .01  .97  .44  .00]
               [ .01 1.00  .11  .79  .91]
               [ .97  .11 1.00  .53  .11]
               [ .44  .79  .53 1.00  .81]
               [ .00  .91  .11  .81 1.00]

nearly reproduces the correlation matrix R. Thus, on a purely descriptive basis, we would judge a two-factor model with the factor loadings displayed in Table 9.1 as providing a good fit to the data. The communalities (.98, .88, .98, .89, .93) indicate that the two factors account for a large percentage of the sample variance of each variable.

We shall not interpret the factors at this point. As we noted in Section 9.2, the factors (and loadings) are unique up to an orthogonal rotation. A rotation of the factors often reveals a simple structure and aids interpretation. We shall consider this example again (see Example 9.9 and Panel 9.1) after factor rotation has been discussed. •

Example 9.4 (Factor analysis of stock-price data)
Stock-price data consisting of n = 100 weekly rates of return on p = 5 stocks were introduced in Example 8.5. In that example, the first two sample principal components were obtained from R. Taking m = 1 and m = 2, we can easily obtain principal component solutions to the orthogonal factor model. Specifically, the estimated factor loadings are the sample principal component coefficients (eigenvectors of R), scaled by the square root of the corresponding eigenvalues. The estimated factor loadings, communalities, specific variances, and proportion of total (standardized) sample variance explained by each factor for the m = 1 and m = 2 factor solutions are available in Table 9.2. The communalities are given by (9-17). So, for example, with m = 2,

    h̃₁² = ℓ̃₁₁² + ℓ̃₁₂² = (.783)² + (−.217)² = .66

TABLE 9.2

                          One-factor solution          Two-factor solution
                          Estimated   Specific         Estimated        Specific
                          factor      variances        factor           variances
                          loadings                     loadings
    Variable                F₁        ψ̃ᵢ = 1 − h̃ᵢ²      F₁      F₂      ψ̃ᵢ = 1 − h̃ᵢ²
    1. Allied Chemical     .783         .39            .783   −.217       .34
    2. Du Pont             .773         .40            .773   −.458       .19
    3. Union Carbide       .794         .37            .794   −.234       .31
    4. Exxon               .713         .49            .713    .472       .27
    5. Texaco              .712         .49            .712    .524       .22
    Cumulative proportion
    of total (standardized)
    sample variance
    explained              .571                        .571    .733

The residual matrix corresponding to the solution for m = 2 factors is

                   [  0   −.127 −.164 −.069  .017]
                   [−.127   0   −.122  .055  .012]
    R − L̃L̃′ − Ψ̃ = [−.164 −.122   0   −.019 −.017]
                   [−.069  .055 −.019   0   −.232]
                   [ .017  .012 −.017 −.232   0  ]

The proportion of the total variance explained by the two-factor solution is appreciably larger than that for the one-factor solution. However, for m = 2, L̃L̃′ produces numbers that are, in general, larger than the sample correlations. This is particularly true for r₄₅.

It seems fairly clear that the first factor, F₁, represents general economic conditions and might be called a market factor. All of the stocks load highly on this factor, and the loadings are about equal. The second factor contrasts the chemical stocks with the oil stocks. (The chemicals have relatively large negative loadings, and the oils have large positive loadings, on the factor.) Thus, F₂ seems to differentiate stocks in different industries and might be called an industry factor. To summarize, rates of return appear to be determined by general market conditions and activities that are unique to the different industries, as well as a residual or firm-specific factor. This is essentially the conclusion reached by an examination of the sample principal components in Example 8.5. •
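The principal component solution of (9-15)-(9-17), used in Examples 9.3 and 9.4, reduces to a few lines of linear algebra. A sketch (the function name and the small demo matrix are ours):

```python
import numpy as np

def pc_factor_solution(S, m):
    """Principal component solution: loadings (9-15), specific
    variances (9-16), and proportion of variance per factor (9-20)."""
    eigval, eigvec = np.linalg.eigh(S)               # ascending order
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]   # largest first
    L = eigvec[:, :m] * np.sqrt(eigval[:m])          # columns sqrt(lambda_j) e_j
    psi = np.diag(S) - np.sum(L**2, axis=1)          # s_ii minus communality
    prop = eigval[:m] / np.trace(S)
    return L, psi, prop

# Demo on a small correlation-style matrix (made-up values).
R = np.array([[1.0, .63, .45],
              [.63, 1.0, .35],
              [.45, .35, 1.0]])
L, psi, prop = pc_factor_solution(R, m=1)
residual = R - (L @ L.T + np.diag(psi))              # (9-18)
assert np.allclose(np.diag(residual), 0.0)           # diagonal reproduced exactly
```

By construction, the diagonal of the residual matrix (9-18) is zero; only the off-diagonal fit depends on the choice of m.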
A Modified Approach: The Principal Factor Solution

A modification of the principal component approach is sometimes considered. We describe the reasoning in terms of a factor analysis of R, although the procedure is also appropriate for S. If the factor model ρ = LL′ + Ψ is correctly specified, the m common factors should account for the off-diagonal elements of ρ, as well as the communality portions of the diagonal elements

    ρᵢᵢ = 1 = hᵢ² + ψᵢ

If the specific factor contribution ψᵢ is removed from the diagonal or, equivalently, the 1 replaced by hᵢ², the resulting matrix is ρ − Ψ = LL′.

Suppose, now, that initial estimates ψᵢ* of the specific variances are available. Then replacing the ith diagonal element of R by hᵢ*² = 1 − ψᵢ*, we obtain a "reduced" sample correlation matrix

         [h₁*²  r₁₂   ⋯   r₁ₚ ]
    Rᵣ = [r₁₂   h₂*²  ⋯   r₂ₚ ]
         [ ⋮     ⋮    ⋱    ⋮  ]
         [r₁ₚ   r₂ₚ   ⋯   hₚ*²]
Now, apart from sampling variation, all of the elements of the reduced sample correlation matrix Rᵣ should be accounted for by the m common factors. In particular, Rᵣ is factored as

    Rᵣ ≐ Lᵣ*Lᵣ*′                                                 (9-21)

where Lᵣ* = {ℓᵢⱼ*} are the estimated loadings. The principal factor method of factor analysis employs the estimates

    Lᵣ* = [√λ₁* ê₁* ⋮ √λ₂* ê₂* ⋮ ⋯ ⋮ √λₘ* êₘ*]

    ψᵢ* = 1 − (ℓᵢ₁*² + ℓᵢ₂*² + ⋯ + ℓᵢₘ*²)                        (9-22)

where (λᵢ*, êᵢ*), i = 1, 2, ..., m, are the (largest) eigenvalue-eigenvector pairs determined from Rᵣ. In turn, the communalities would then be (re)estimated by

    h̃ᵢ*² = ℓᵢ₁*² + ℓᵢ₂*² + ⋯ + ℓᵢₘ*²                             (9-23)

The principal factor solution can be obtained iteratively, with the communality estimates of (9-23) becoming the initial estimates for the next stage.

In the spirit of the principal component solution, consideration of the estimated eigenvalues λ₁*, λ₂*, ..., λₚ* helps determine the number of common factors to retain. An added complication is that now some of the eigenvalues may be negative, due to the use of initial communality estimates. Ideally, we should take the number of common factors equal to the rank of the reduced population matrix. Unfortunately, this rank is not always well determined from Rᵣ, and some judgment is necessary.

Although there are many choices for initial estimates of specific variances, the most popular choice, when one is working with a correlation matrix, is ψᵢ* = 1/rⁱⁱ, where rⁱⁱ is the ith diagonal element of R⁻¹. The initial communality estimates then become

    hᵢ*² = 1 − ψᵢ* = 1 − 1/rⁱⁱ                                   (9-24)

which is equal to the square of the multiple correlation coefficient between Xᵢ and the other p − 1 variables. The relation to the multiple correlation coefficient means that hᵢ*² can be calculated even when R is not of full rank. For factoring S, the initial specific variance estimates use sⁱⁱ, the diagonal elements of S⁻¹. Further discussion of these and other initial estimates is contained in [12].

Although the principal component method for R can be regarded as a principal factor method with initial communality estimates of unity, or specific variances equal to zero, the two are philosophically and geometrically different. (See [12].) In practice, however, the two frequently produce comparable factor loadings if the number of variables is large and the number of common factors is small.

We do not pursue the principal factor solution, since, to our minds, the solution methods that have the most to recommend them are the principal component method and the maximum likelihood method, which we discuss next.
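The iterated principal factor cycle, start from the squared-multiple-correlation communalities (9-24), factor the reduced matrix (9-21)-(9-22), re-estimate communalities (9-23), and repeat, can be sketched as follows (our own minimal implementation, on a made-up matrix that happens to fit one factor exactly):

```python
import numpy as np

def principal_factor(R, m, n_iter=500):
    """Iterated principal factor sketch: SMC start (9-24), then cycle
    the reduced-matrix factorization (9-21)-(9-23)."""
    R = np.asarray(R, dtype=float)
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))      # initial communalities
    for _ in range(n_iter):
        Rr = R.copy()
        np.fill_diagonal(Rr, h2)                    # reduced correlation matrix
        eigval, eigvec = np.linalg.eigh(Rr)
        idx = np.argsort(eigval)[::-1][:m]          # m largest eigenvalues
        # Eigenvalues of Rr can be negative; clip before taking square roots.
        L = eigvec[:, idx] * np.sqrt(np.clip(eigval[idx], 0.0, None))
        h2 = np.sum(L**2, axis=1)                   # re-estimate via (9-23)
    return L, 1.0 - h2

R = np.array([[1.0, .63, .45],
              [.63, 1.0, .35],
              [.45, .35, 1.0]])
L, psi = principal_factor(R, m=1)
fit = L @ L.T
print(np.round(fit[0, 1], 2), np.round(fit[0, 2], 2))
```

Because this R is exactly of one-factor form (loadings .9, .7, .5), the iteration reproduces the off-diagonal correlations closely.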
The Maximum Likelihood Method

If the common factors F and the specific factors ε can be assumed to be normally distributed, then maximum likelihood estimates of the factor loadings and specific variances may be obtained. When Fⱼ and εⱼ are jointly normal, the observations Xⱼ − μ = LFⱼ + εⱼ are then normal, and from (4-16), the likelihood is

    L(μ, Σ) = (2π)^(−np/2) |Σ|^(−n/2) exp{ −(1/2) tr[ Σ⁻¹( Σⱼ₌₁ⁿ (xⱼ − x̄)(xⱼ − x̄)′ + n(x̄ − μ)(x̄ − μ)′ ) ] }

            = (2π)^(−(n−1)p/2) |Σ|^(−(n−1)/2) exp{ −(1/2) tr[ Σ⁻¹ Σⱼ₌₁ⁿ (xⱼ − x̄)(xⱼ − x̄)′ ] }
              × (2π)^(−p/2) |Σ|^(−1/2) exp{ −(n/2)(x̄ − μ)′Σ⁻¹(x̄ − μ) }        (9-25)

which depends on L and Ψ through Σ = LL′ + Ψ. This model is still not well defined, because of the multiplicity of choices for L made possible by orthogonal transformations. It is desirable to make L well defined by imposing the computationally convenient uniqueness condition

    L′Ψ⁻¹L = Δ,   a diagonal matrix                               (9-26)

The maximum likelihood estimates L̂ and Ψ̂ must be obtained by numerical maximization of (9-25). Fortunately, efficient computer programs now exist that enable one to get these estimates rather easily. We summarize some facts about maximum likelihood estimators and, for now, rely on a computer to perform the numerical details.

Result 9.1. Let X₁, X₂, ..., Xₙ be a random sample from Nₚ(μ, Σ), where Σ = LL′ + Ψ is the covariance matrix for the m common factor model of (9-4). The maximum likelihood estimators L̂, Ψ̂, and μ̂ = x̄ maximize (9-25) subject to L̂′Ψ̂⁻¹L̂ being diagonal.

The maximum likelihood estimates of the communalities are

    ĥᵢ² = ℓ̂ᵢ₁² + ℓ̂ᵢ₂² + ⋯ + ℓ̂ᵢₘ²,   for i = 1, 2, ..., p         (9-27)

so

    Proportion of total sample variance due to jth factor
        = (ℓ̂₁ⱼ² + ℓ̂₂ⱼ² + ⋯ + ℓ̂ₚⱼ²) / (s₁₁ + s₂₂ + ⋯ + sₚₚ)       (9-28)

Proof. By the invariance property of maximum likelihood estimates (see Section 4.3), functions of L and Ψ are estimated by the same functions of L̂ and Ψ̂. In particular, the communalities hᵢ² = ℓᵢ₁² + ⋯ + ℓᵢₘ² have maximum likelihood estimates ĥᵢ² = ℓ̂ᵢ₁² + ⋯ + ℓ̂ᵢₘ². ■

If, as in (8-10), the variables are standardized so that Z = V^(−1/2)(X − μ), then the covariance matrix ρ of Z has the representation

    ρ = V^(−1/2)ΣV^(−1/2) = (V^(−1/2)L)(V^(−1/2)L)′ + V^(−1/2)ΨV^(−1/2)       (9-29)
Thus, ρ has a factorization analogous to (9-5) with loading matrix Lz = V^(−1/2)L and specific variance matrix Ψz = V^(−1/2)ΨV^(−1/2). By the invariance property of maximum likelihood estimators, the maximum likelihood estimator of ρ is

    ρ̂ = (V̂^(−1/2)L̂)(V̂^(−1/2)L̂)′ + V̂^(−1/2)Ψ̂V̂^(−1/2) = L̂zL̂z′ + Ψ̂z            (9-30)

where V̂^(−1/2) and L̂ are the maximum likelihood estimators of V^(−1/2) and L, respectively. (See Supplement 9A.)

As a consequence of the factorization of (9-30), whenever the maximum likelihood analysis pertains to the correlation matrix, we call

    ĥᵢ² = ℓ̂ᵢ₁² + ℓ̂ᵢ₂² + ⋯ + ℓ̂ᵢₘ²,   i = 1, 2, ..., p             (9-31)

the maximum likelihood estimates of the communalities, and we evaluate the importance of the factors on the basis of

    Proportion of total (standardized) sample variance due to jth factor
        = (ℓ̂₁ⱼ² + ℓ̂₂ⱼ² + ⋯ + ℓ̂ₚⱼ²) / p                           (9-32)

To avoid more tedious notation, the preceding ℓ̂ᵢⱼ's denote the elements of L̂z.

Comment. Ordinarily, the observations are standardized, and a sample correlation matrix is factor analyzed. The sample correlation matrix R is inserted for [(n − 1)/n]S in the likelihood function of (9-25), and the maximum likelihood estimates L̂z and Ψ̂z are obtained using a computer. Although the likelihood in (9-25) is appropriate for S, not R, surprisingly, this practice is equivalent to obtaining the maximum likelihood estimates L̂ and Ψ̂ based on the sample covariance matrix S, setting L̂z = V̂^(−1/2)L̂ and Ψ̂z = V̂^(−1/2)Ψ̂V̂^(−1/2). Here V̂^(−1/2) is the diagonal matrix with the reciprocal of the sample standard deviations (computed with the divisor √n) on the main diagonal.

Going in the other direction, given the estimated loadings L̂z and specific variances Ψ̂z obtained from R, we find that the resulting maximum likelihood estimates for a factor analysis of the covariance matrix [(n − 1)/n]S are L̂ = V̂^(1/2)L̂z and Ψ̂ = V̂^(1/2)Ψ̂zV̂^(1/2), where σ̂ᵢᵢ is the sample variance computed with divisor n. The distinction between divisors can be ignored with principal component solutions.

The equivalency between factoring S and R has apparently been confused in many published discussions of factor analysis. (See Supplement 9A.)

Example 9.5 (Factor analysis of stock-price data using the maximum likelihood method)
The stock-price data of Examples 8.5 and 9.4 were reanalyzed assuming an m = 2 factor model and using the maximum likelihood method. The estimated factor loadings, communalities, specific variances, and proportion of total (standardized) sample variance explained by each factor are in Table 9.3. The corresponding
494
Chapter 9
Factor Ana lysis and I nference for Structu red Cova ria nce Matrices
TABLE 9.3 -
Principal components
Maximum likelihood
-
Variable 1. 2. 3. 4. 5.
Allied Chemical Du Pont Union Carbide Exxon Texaco Cumulative proportion of total ( standarc.lized) sample variance explained
Specific variances �l· = 1 - it?l
Estimated factor loadings Fl F2 .189 .517 .248 - .073 -.442
.684 .694 .681 .621 .792
.50 .25 .47 .61 .18
Estimated factor loadings F2 Fl .783 .773 .794 .713 .712
- .217 - .458 - .234 .412 .524
Specific variances �·l = 1 h2l -
-
.34 .19 .31 .27 .22
-
.485
.571
.598
.733
figures for the m = 2 factor solution obtained by the principal component method (see Example 9.4) are also provided. The communalities corresponding to the maximum likelihood factoring of R are of the form [see (9-31)] hf = efl + fT2 · So, for example, hy = ( .684 ) 2 + ( .189) 2 = .50 The residual matrix is
R
-
LL '
W "
-
0 .005 - .004 - .024 - .004 .005 - 003 - .004 0 .000 .031 - .004 0 -.004 - .003 0 .031 - .000 - .024 - .004 .000 - .004 - .000 - .004 0 .
=
The elements of R − L̂L̂′ − Ψ̂ are much smaller than those of the residual matrix corresponding to the principal component factoring of R presented in Example 9.4. On this basis, we prefer the maximum likelihood approach and typically feature it in subsequent examples.

The cumulative proportion of the total sample variance explained by the factors is larger for principal component factoring than for maximum likelihood factoring. It is not surprising that this criterion typically favors principal component factoring. Loadings obtained by a principal component factor analysis are related to the principal components, which have, by design, a variance-optimizing property. [See the discussion preceding (8-19).]

Focusing attention on the maximum likelihood solution, we see that all variables have large positive loadings on F1. We call this factor the market factor, as we did in the principal component solution. The interpretation of the second factor, however, is not as clear as it appeared to be in the principal
Section 9.3 Methods of Estimation 495
component solution. The signs of the factor loadings are consistent with a contrast, or industry factor, but the magnitudes are small in some cases, and one might identify this factor as a comparison between Du Pont and Texaco.

The patterns of the initial factor loadings for the maximum likelihood solution are constrained by the uniqueness condition that L̂′Ψ̂⁻¹L̂ be a diagonal matrix. Therefore, useful factor patterns are often not revealed until the factors are rotated (see Section 9.4). ■
Example 9.6 (Factor analysis of Olympic decathlon data)
Linden [21] conducted a factor analytic study of Olympic decathlon scores since World War II. Altogether, 160 complete starts were made by 139 athletes.³ The scores for each of the 10 decathlon events were standardized, and a sample correlation matrix was factor analyzed by the methods of principal components and maximum likelihood. Linden reports that the distributions of standard scores were normal or approximately normal for each of the ten decathlon events. The sample correlation matrix, based on n = 160 starts, is (upper triangle shown; variables in the order 100-m run, long jump, shot put, high jump, 400-m run, 110-m hurdles, discus, pole vault, javelin, 1500-m run)

R =  1.0   .59   .35   .34   .63   .40   .28   .20   .11  −.07
           1.0   .42   .51   .49   .52   .31   .36   .21   .09
                 1.0   .38   .19   .36   .73   .24   .44  −.08
                       1.0   .29   .46   .27   .39   .17   .18
                             1.0   .34   .17   .23   .13   .39
                                   1.0   .32   .33   .18   .00
                                         1.0   .24   .34  −.02
                                               1.0   .24   .17
                                                     1.0   .00
                                                           1.0

From a principal component factor analysis perspective, the first four eigenvalues, 3.78, 1.52, 1.11, .91, of R suggest a factor solution with m = 3 or m = 4. A subsequent interpretation of the factor loadings reinforces the choice m = 4. The principal component and maximum likelihood solution methods were applied to Linden's correlation matrix and yielded the estimated factor loadings, communalities, and specific variance contributions in Table 9.4.⁴

³Because of the potential correlation between successive scores by athletes who competed in more than one Olympic game, an analysis was also done using 139 scores representing different athletes. The score for an athlete who participated more than once was selected at random. The results were virtually identical to those based on all 160 scores.

⁴The output of this table was produced by the BMDP statistical software package. The output from the SAS program is identical for the principal component solution and very similar for the maximum likelihood solution. For this example, the solution to the likelihood equations produces a Heywood case.
That is, the estimated loadings are such that some specific variances are negative. Consequently, the software package may not run unless the Heywood case option is selected. With that option, the program obtains a feasible solution by slightly adjusting the loadings so that all specific variance estimates are nonnegative. A Heywood case is suggested in this example by the .00 values for the specific variances for the shot put and the 1500-m run.
TABLE 9.4

                              Principal component                        Maximum likelihood
                     Estimated factor loadings    Specific      Estimated factor loadings    Specific
                                                  variances                                  variances
Variable              F1     F2     F3     F4   ψ̃ᵢ = 1 − h̃ᵢ²    F1     F2     F3     F4   ψ̂ᵢ = 1 − ĥᵢ²
1. 100-m run         .691   .217  −.520  −.206      .16        −.090   .341   .830  −.169      .16
2. Long jump         .789   .184  −.193   .092      .30         .065   .433   .595   .275      .38
3. Shot put          .702  −.535   .047  −.175      .19        −.139   .990   .000   .000      .00
4. High jump         .674   .134   .139   .396      .35         .156   .406   .336   .445      .50
5. 400-m run         .620   .551  −.084  −.419      .13         .376   .245   .671  −.137      .33
6. 110-m hurdles     .687   .042  −.161   .345      .38        −.021   .361   .425   .388      .54
7. Discus            .621  −.521   .109  −.234      .28        −.063   .728   .030   .019      .46
8. Pole vault        .538   .087   .411   .440      .34         .155   .264   .229   .394      .70
9. Javelin           .434  −.439   .372  −.235      .43        −.026   .441  −.010   .098      .80
10. 1500-m run       .147   .596   .658  −.279      .11         .998   .059   .000   .000      .00
Cumulative
proportion of
total variance
explained            .38    .53    .64    .73                   .12    .37    .55    .61
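The principal component loadings in Tables 9.3 and 9.4 come directly from the spectral decomposition of R: by (9-15), the jth column of L̃ is √λ̂ⱼ times the jth eigenvector of R. A minimal sketch of this computation (Python; the function name is ours, and the small correlation matrix below is hypothetical, used only so the example stays short, rather than Linden's 10 × 10 matrix):

```python
import numpy as np

def pc_loadings(R, m):
    """Principal component factoring: column j of the loadings matrix is
    sqrt(lambda_j) times the j-th eigenvector of R [see (9-15)]."""
    vals, vecs = np.linalg.eigh(R)          # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:m]      # indices of the m largest
    return vecs[:, order] * np.sqrt(vals[order])

# Hypothetical 3-variable correlation matrix, m = 1 factor
R = np.array([[1.00, 0.63, 0.45],
              [0.63, 1.00, 0.35],
              [0.45, 0.35, 1.00]])
L = pc_loadings(R, 1)
print(np.round(L, 3))
```

The proportion of total (standardized) variance explained by the jth factor is λ̂ⱼ/p, which is how the "cumulative proportion" rows of Tables 9.3 and 9.4 are obtained.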
In this case, the two solution methods produced very different results. For the principal component factorization, all events except the 1500-meter run have large positive loadings on the first factor. This factor might be labeled general athletic ability. The remaining factors cannot be easily interpreted to our minds. Factor 2 appears to contrast running ability with throwing ability, or "arm strength." Factor 3 appears to contrast running endurance (1500-meter run) with running speed (100-meter run), although there is a relatively high pole-vault loading on this factor. Factor 4 is a mystery at this point.

For the maximum likelihood method, the 1500-meter run is the only variable with a large loading on the first factor. This factor might be called a running endurance factor. The second factor appears to be primarily a strength factor (discus and shot put load highly on this factor), and the third factor might be running speed, since the 100-meter and 400-meter runs load highly on this factor. Again, the fourth factor is not easily identified, although it may have something to do with jumping ability or leg strength. We shall return to an interpretation of the factors in Example 9.11 after a discussion of factor rotation.

The four-factor principal component solution accounts for much of the total (standardized) sample variance, although the estimated specific variances are large in some cases (for example, the javelin and hurdles). This suggests that some events might require unique or specific attributes not required for the other events. The four-factor maximum likelihood solution accounts for less of the total sample variance, but, as the following residual matrices indicate, the maximum likelihood estimates L̂ and Ψ̂ do a better job of reproducing R than the principal component estimates L̃ and Ψ̃:

Principal component: R − L̃L̃′ − Ψ̃ =
(lower triangle of the symmetric residual matrix)

   0
 −.075   0
 −.030  −.010   0
 −.001  −.056   .042   0
 −.047  −.077  −.020  −.024   0
 −.096  −.092  −.032  −.122   .009   0
 −.027  −.041  −.031   .022   .067   .041   0
  .114  −.042  −.034   .014  −.129  −.254   .062   0
  .051   .042  −.158  −.001  −.215  −.005  −.109   .076   0
 −.016   .017   .056  −.017   .036  −.022  −.112   .020  −.091   0
498 Chapter 9 Factor Analysis and Inference for Structured Covariance Matrices
Maximum likelihood: R − L̂L̂′ − Ψ̂ =

(lower triangle of the symmetric residual matrix)

   0
  .000   0
  .000   .000   0
  .012   .002   .000   0
  .000  −.002   .000  −.000   0
 −.012   .006  −.033  −.034  −.000   0
  .004  −.025  −.000  −.002  −.000   .000   0
  .000  −.009   .001   .006  −.045   .000   .043   0
 −.018  −.000   .028   .008   .052   .000   .016   .091   0
  .000   .000   .036  −.012  −.013   .000   .000   .000   .000   0
                                                                        ■
A Large Sample Test for the Number of Common Factors
The assumption of a normal population leads directly to a test of the adequacy of the model. Suppose the m common factor model holds. In this case Σ = LL′ + Ψ, and testing the adequacy of the m common factor model is equivalent to testing

H₀:   Σ    =    L     L′    +    Ψ
    (p×p)    (p×m) (m×p)      (p×p)                          (9-33)

versus H₁: Σ any other positive definite matrix. When Σ does not have any special
form, the maximum of the likelihood function [see (4-18) and Result 4.11 with Σ̂ = ((n − 1)/n)S = Sₙ] is proportional to

|Σ̂|^(−n/2) exp(−np/2)                                       (9-34)
Under H₀, Σ is restricted to have the form of (9-33). In this case, the maximum of the likelihood function [see (9-25) with μ̂ = x̄ and Σ̂ = L̂L̂′ + Ψ̂, where L̂ and Ψ̂ are the maximum likelihood estimates of L and Ψ, respectively] is proportional to

|Σ̂|^(−n/2) exp( −½ tr[ Σ̂⁻¹ Σⱼ₌₁ⁿ (xⱼ − x̄)(xⱼ − x̄)′ ] )
    = |L̂L̂′ + Ψ̂|^(−n/2) exp( −(n/2) tr[ (L̂L̂′ + Ψ̂)⁻¹ Sₙ ] )   (9-35)
Using Result 5.2, (9-34), and (9-35), we find that the likelihood ratio statistic for testing H₀ is

−2 ln Λ = −2 ln [ (maximized likelihood under H₀) / (maximized likelihood) ]
        = n ln( |Σ̂| / |Sₙ| ) + n [ tr(Σ̂⁻¹Sₙ) − p ]           (9-36)
with degrees of freedom

ν − ν₀ = ½p(p + 1) − [p(m + 1) − ½m(m − 1)]
       = ½[(p − m)² − p − m]                                 (9-37)
Supplement 9A indicates that tr(Σ̂⁻¹Sₙ) − p = 0 provided that Σ̂ = L̂L̂′ + Ψ̂ is the maximum likelihood estimate of Σ = LL′ + Ψ. Thus, we have

−2 ln Λ = n ln( |Σ̂| / |Sₙ| )                                 (9-38)
Bartlett [4] has shown that the chi-square approximation to the sampling distribution of −2 ln Λ can be improved by replacing n in (9-38) with the multiplicative factor (n − 1 − (2p + 4m + 5)/6). Using Bartlett's correction,⁵ we reject H₀ at the α level of significance if

(n − 1 − (2p + 4m + 5)/6) ln( |L̂L̂′ + Ψ̂| / |Sₙ| ) > χ²₍₍ₚ₋ₘ₎²₋ₚ₋ₘ₎/₂(α)    (9-39)
provided that n and n − p are large. Since the number of degrees of freedom, ½[(p − m)² − p − m], must be positive, it follows that

m < ½(2p + 1 − √(8p + 1))                                    (9-40)

in order to apply the test (9-39).
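The test in (9-39) reduces to a few lines of code. A sketch (Python, standard library only; the function name is ours) that also reproduces the arithmetic of Example 9.7 below, where the determinant ratio is .194414/.193163:

```python
import math

def bartlett_factor_test(n, p, m, det_fitted, det_sn):
    """Bartlett-corrected likelihood ratio statistic (9-39) for
    H0: Sigma = LL' + Psi with m common factors.
    det_fitted = |LL' + Psi| (or its standardized version), det_sn = |S_n| (or |R|).
    Returns (statistic, degrees of freedom)."""
    df = ((p - m) ** 2 - p - m) // 2
    if df <= 0:
        raise ValueError("m too large: need m < (2p + 1 - sqrt(8p + 1))/2, per (9-40)")
    correction = n - 1 - (2 * p + 4 * m + 5) / 6
    stat = correction * math.log(det_fitted / det_sn)
    return stat, df

# Stock-price data (Example 9.7): n = 100, p = 5, m = 2
stat, df = bartlett_factor_test(100, 5, 2, .194414, .193163)
print(round(stat, 2), df)   # statistic near .6 on 1 degree of freedom
```

The statistic is then compared with the chi-square critical value, here χ₁²(.05) = 3.84; the small observed value leads to retention of the two-factor model, matching the conclusion of Example 9.7 (which reports .62 after rounding the determinant ratio to 1.0065).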
Comment. In implementing the test in (9-39), we are testing for the adequacy of the m common factor model by comparing the generalized variances |L̂L̂′ + Ψ̂| and |Sₙ|. If n is large and m is small relative to p, the hypothesis H₀ will usually be rejected, leading to a retention of more common factors. However, Σ̂ = L̂L̂′ + Ψ̂ may be close enough to Sₙ so that adding more factors does not provide additional insights, even though those factors are "significant." Some judgment must be exercised in the choice of m.

Example 9.7 (Testing for two common factors)
The two-factor maximum likelihood analysis of the stock-price data was presented in Example 9.5. The residual matrix there suggests that a two-factor solution may be adequate. Test the hypothesis H₀: Σ = LL′ + Ψ, with m = 2, at level α = .05.

⁵Many factor analysts obtain an approximate maximum likelihood estimate by replacing Sₙ with the unbiased estimate S = [n/(n − 1)]Sₙ and then minimizing ln|Σ| + tr[Σ⁻¹S]. The dual substitution of S and the approximate maximum likelihood estimator into the test statistic of (9-39) does not affect its large sample properties.
The test statistic in (9-39) is based on the ratio of generalized variances

|Σ̂| / |Sₙ| = |L̂L̂′ + Ψ̂| / |Sₙ|

Let V^(1/2) be the diagonal matrix such that V^(−1/2) Sₙ V^(−1/2) = R. By the properties of determinants (see Result 2A.11),

|V^(−1/2)| |Sₙ| |V^(−1/2)| = |V^(−1/2) Sₙ V^(−1/2)|

and

|V^(−1/2)| |L̂L̂′ + Ψ̂| |V^(−1/2)| = |V^(−1/2) L̂L̂′ V^(−1/2) + V^(−1/2) Ψ̂ V^(−1/2)|

Consequently,

|Σ̂| / |Sₙ| = ( |V^(−1/2)| |L̂L̂′ + Ψ̂| |V^(−1/2)| ) / ( |V^(−1/2)| |Sₙ| |V^(−1/2)| )
           = |L̂zL̂z′ + Ψ̂z| / |R|                              (9-41)

by (9-30). From Example 9.5, we determine (lower triangles shown)

L̂zL̂z′ + Ψ̂z:                            R:
1.000                                  1.000
 .572  1.000                            .577  1.000
 .513   .602  1.000                     .509   .599  1.000
 .411   .393   .405  1.000              .387   .389   .436  1.000
 .458   .322   .430   .523  1.000       .462   .322   .426   .523  1.000

and

|L̂zL̂z′ + Ψ̂z| / |R| = .194414 / .193163 = 1.0065

Using Bartlett's correction, we evaluate the test statistic in (9-39):

[n − 1 − (2p + 4m + 5)/6] ln( |L̂L̂′ + Ψ̂| / |Sₙ| )
    = [100 − 1 − (10 + 8 + 5)/6] ln(1.0065) = .62
Section 9.4 Factor Rotation 501
Since ½[(p − m)² − p − m] = ½[(5 − 2)² − 5 − 2] = 1, the 5% critical value χ₁²(.05) = 3.84 is not exceeded, and we fail to reject H₀. We conclude that the data do not contradict a two-factor model. In fact, the observed significance level, or P-value, P[χ₁² > .62] ≈ .43 implies that H₀ would not be rejected at any reasonable level. ■

Large sample variances and covariances for the maximum likelihood estimates ℓ̂ᵢⱼ, ψ̂ᵢ have been derived when these estimates have been determined from the sample covariance matrix S. (See [20].) The expressions are, in general, quite complicated.

9.4 FACTOR ROTATION
As we indicated in Section 9.2, all factor loadings obtained from the initial loadings by an orthogonal transformation have the same ability to reproduce the covariance (or correlation) matrix. [See (9-8).] From matrix algebra, we know that an orthogonal transformation corresponds to a rigid rotation (or reflection) of the coordinate axes. For this reason, an orthogonal transformation of the factor loadings, as well as the implied orthogonal transformation of the factors, is called factor rotation.

If L̂ is the p × m matrix of estimated factor loadings obtained by any method (principal component, maximum likelihood, and so forth), then

L̂* = L̂T,  where TT′ = T′T = I                               (9-42)

is a p × m matrix of "rotated" loadings. Moreover, the estimated covariance (or correlation) matrix remains unchanged, since

L̂L̂′ + Ψ̂ = L̂TT′L̂′ + Ψ̂ = L̂*L̂*′ + Ψ̂                           (9-43)

Equation (9-43) indicates that the residual matrix, Sₙ − L̂L̂′ − Ψ̂ = Sₙ − L̂*L̂*′ − Ψ̂, remains unchanged. Moreover, the specific variances ψ̂ᵢ, and hence the communalities ĥᵢ², are unaltered. Thus, from a mathematical viewpoint, it is immaterial whether L̂ or L̂* is obtained.

Since the original loadings may not be readily interpretable, it is usual practice to rotate them until a "simpler structure" is achieved. The rationale is very much akin to sharpening the focus of a microscope in order to see the detail more clearly. Ideally, we should like to see a pattern of loadings such that each variable loads highly on a single factor and has small to moderate loadings on the remaining factors. However, it is not always possible to get this simple structure, although the rotated loadings for the decathlon data discussed in Example 9.11 provide a nearly ideal pattern.

We shall concentrate on graphical and analytical methods for determining an orthogonal rotation to a simple structure.
When m = 2, or the common factors are considered two at a time, the transformation to a simple structure can frequently be determined graphically. The uncorrelated common factors are regarded as unit vectors along perpendicular coordinate axes. A plot of the pairs of factor loadings (ℓ̂ᵢ₁, ℓ̂ᵢ₂) yields p points, each point corresponding to a variable. The coordinate axes can then be visually rotated through an angle (call it φ), and the new rotated loadings determined from the relationships
L̂*   =   L̂    T
(p×2)   (p×2)(2×2)

where

T =   cos φ    sin φ      (clockwise rotation)
     −sin φ    cos φ
                                                             (9-44)
T =   cos φ   −sin φ      (counterclockwise rotation)
      sin φ    cos φ
The relationship in (9-44) is rarely implemented in a two-dimensional graphical analysis. In this situation, clusters of variables are often apparent by eye, and these clusters enable one to identify the common factors without having to inspect the magnitudes of the rotated loadings. On the other hand, for m > 2, orientations are not easily visualized, and the magnitudes of the rotated loadings must be inspected to find a meaningful interpretation of the original data. The choice of an orthogonal matrix T that satisfies an analytical measure of simple structure will be considered shortly.

Example 9.8
(A first look at factor rotation)
Lawley and Maxwell [20] present the sample correlation matrix of examination scores in p = 6 subject areas for n = 220 male students. The correlation matrix is

        Gaelic  English  History  Arithmetic  Algebra  Geometry
R =      1.0     .439     .410      .288       .329      .248
                 1.0      .351      .354       .320      .329
                          1.0       .164       .190      .181
                                    1.0        .595      .470
                                               1.0       .464
                                                         1.0

and a maximum likelihood solution for m = 2 common factors yields the estimates in Table 9.5.

TABLE 9.5

                   Estimated factor loadings    Communalities
Variable               F1        F2                 ĥᵢ²
1. Gaelic             .553      .429               .490
2. English            .568      .288               .406
3. History            .392      .450               .356
4. Arithmetic         .740     −.273               .623
5. Algebra            .724     −.211               .569
6. Geometry           .595     −.132               .372
Section 9 .4
Factor Rotation
503
All the variables have positive loadings on the first factor. Lawley and Maxwell suggest that this factor reflects the overall response of the students to instruction and might be labeled a general intelligence factor. Half the loadings are positive and half are negative on the second factor. A factor with this pattern of loadings is called a bipolar factor. (The assignment of negative and positive poles is arbitrary, because the signs of the loadings on a factor can be reversed without affecting the analysis.) This factor is not easily identified, but is such that individuals who get above-average scores on the verbal tests get above-average scores on the factor. Individuals with above-average scores on the mathematical tests get below-average scores on the factor. Perhaps this factor can be classified as a "math-nonmath" factor.

The factor loading pairs (ℓ̂ᵢ₁, ℓ̂ᵢ₂) are plotted as points in Figure 9.1. The points are labeled with the numbers of the corresponding variables. Also shown is a clockwise orthogonal rotation of the coordinate axes through an angle of about 20°. This angle was chosen so that one of the new axes passes through (ℓ̂₄₁, ℓ̂₄₂). When this is done, all the points fall in the first quadrant (the factor loadings are all positive), and the two distinct clusters of variables are more clearly revealed.

The mathematical test variables load highly on F₁* and have negligible loadings on F₂*. The first factor might be called a mathematical-ability factor. Similarly, the three verbal test variables have high loadings on F₂* and moderate to small loadings on F₁*. The second factor might be labeled a verbal-ability factor. The general-intelligence factor identified initially is submerged in the factors F₁* and F₂*.

The rotated factor loadings obtained from (9-44) with φ ≈ 20° and the corresponding communality estimates are shown in Table 9.6.
The magnitudes of the rotated factor loadings reinforce the interpretation of the factors suggested by Figure 9.1. The communality estimates are unchanged by the orthogonal rotation, since L̂L̂′ = L̂TT′L̂′ = L̂*L̂*′, and the communalities are the diagonal elements of these matrices.

We point out that Figure 9.1 suggests an oblique rotation of the coordinates. One new axis would pass through the cluster {1, 2, 3} and the other through the {4, 5, 6} group. Oblique rotations are so named because they correspond to a nonrigid rotation of coordinate axes leading to new axes that are
Figure 9.1 Factor rotation for test scores.
TABLE 9.6

                 Estimated rotated factor loadings    Communalities
Variable               F₁*       F₂*                    ĥᵢ*² = ĥᵢ²
1. Gaelic             .369      .594                      .490
2. English            .433      .467                      .406
3. History            .211      .558                      .356
4. Arithmetic         .789      .001                      .623
5. Algebra            .752      .054                      .568
6. Geometry           .604      .083                      .372
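As a quick check, rotating the Table 9.5 loadings clockwise through 20° essentially reproduces the Table 9.6 entries. A sketch (Python; the function name is ours, and the two rows shown are the Gaelic and Arithmetic loadings):

```python
import numpy as np

def rotate_loadings(L, phi_deg):
    """L* = L T, with T the clockwise rotation matrix of (9-44)."""
    phi = np.radians(phi_deg)
    T = np.array([[ np.cos(phi), np.sin(phi)],
                  [-np.sin(phi), np.cos(phi)]])
    return L @ T

L = np.array([[.553,  .429],    # Gaelic     (Table 9.5)
              [.740, -.273]])   # Arithmetic (Table 9.5)
print(np.round(rotate_loadings(L, 20), 2))
# Gaelic -> about (.37, .59); Arithmetic -> about (.79, .00); compare Table 9.6
```

The small discrepancies from Table 9.6 reflect the fact that the angle actually used was chosen to pass exactly through (ℓ̂₄₁, ℓ̂₄₂), which is close to, but not exactly, 20°.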
not perpendicular. It is apparent, however, that the interpretation of the oblique factors for this example would be much the same as that given previously for an orthogonal rotation. ■

Kaiser [19] has suggested an analytical measure of simple structure known as the varimax (or normal varimax) criterion. Define ℓ̃ᵢⱼ* = ℓ̂ᵢⱼ*/ĥᵢ to be the rotated coefficients scaled by the square root of the communalities. Then the (normal) varimax procedure selects the orthogonal transformation T that makes

V = (1/p) Σⱼ₌₁ᵐ [ Σᵢ₌₁ᵖ (ℓ̃ᵢⱼ*)⁴ − ( Σᵢ₌₁ᵖ (ℓ̃ᵢⱼ*)² )² / p ]    (9-45)

as large as possible. Scaling the rotated coefficients ℓ̂ᵢⱼ* has the effect of giving variables with small communalities relatively more weight in the determination of simple structure. After the transformation T is determined, the loadings ℓ̃ᵢⱼ* are multiplied by ĥᵢ so that the original communalities are preserved.

Although (9-45) looks rather forbidding, it has a simple interpretation. In words,

V ∝ Σⱼ₌₁ᵐ ( variance of squares of (scaled) loadings for jth factor )    (9-46)

Effectively, maximizing V corresponds to "spreading out" the squares of the loadings on each factor as much as possible. Therefore, we hope to find groups of large and negligible coefficients in any column of the rotated loadings matrix L̂*.

Computing algorithms exist for maximizing V, and most popular factor analysis computer programs (for example, the statistical software packages SAS, SPSS, BMDP, and MINITAB) provide varimax rotations. As might be expected, varimax rotations of factor loadings obtained by different solution methods (principal components, maximum likelihood, and so forth) will not, in general, coincide. Also, the pattern of rotated loadings may change considerably if additional common factors are included in the rotation. If a dominant single factor exists, it will generally be obscured by any orthogonal rotation. On the other hand, it can always be held fixed and the remaining factors rotated.
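One standard computing algorithm for maximizing a varimax-type criterion iterates singular value decompositions of a small m × m matrix. The sketch below (Python) implements the raw (unnormalized) varimax criterion; Kaiser's normal varimax of (9-45) would additionally divide each row of L by √(communality) before rotating and rescale afterwards. The function name is ours, and the demonstration uses the consumer-preference loadings of Table 9.7:

```python
import numpy as np

def varimax(L, max_iter=100, tol=1e-8):
    """Raw varimax rotation of a p x m loadings matrix.
    Returns (rotated loadings, orthogonal matrix T) with L* = L T."""
    p, m = L.shape
    T = np.eye(m)
    v_old = 0.0
    for _ in range(max_iter):
        Lr = L @ T
        # Gradient-style update: SVD of L' (Lr^3 - Lr diag(column sums of Lr^2)/p)
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - Lr * (Lr ** 2).sum(axis=0) / p))
        T = u @ vt
        if s.sum() - v_old < tol:
            break
        v_old = s.sum()
    return L @ T, T

# Principal component loadings for the consumer-preference data (Table 9.7)
L = np.array([[.56,  .82],
              [.78, -.52],
              [.65,  .75],
              [.94, -.10],
              [.80, -.54]])
Lrot, T = varimax(L)
print(np.round(Lrot, 2))   # compare Table 9.7, up to column order and sign
```

Because T is orthogonal, the rotation leaves L̂L̂′, and hence the communalities and residual matrix, unchanged, exactly as (9-43) requires.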
TABLE 9.7

                             Estimated         Rotated estimated
                           factor loadings      factor loadings     Communalities
Variable                     F1      F2           F₁*     F₂*            ĥᵢ²
1. Taste                    .56     .82           .02     .99            .98
2. Good buy for money       .78    −.52           .94    −.01            .88
3. Flavor                   .65     .75           .13     .98            .98
4. Suitable for snack       .94    −.10           .84     .43            .89
5. Provides lots of energy  .80    −.54           .97    −.02            .93
Cumulative proportion
of total (standardized)
sample variance explained   .571    .932          .507    .932

Example 9.9 (Rotated loadings for the consumer-preference data)
Let us return to the marketing data discussed in Example 9.3. The original factor loadings (obtained by the principal component method), the communalities, and the (varimax) rotated factor loadings are shown in Table 9.7. (See the SAS statistical software output in Panel 9.1.)

It is clear that variables 2, 4, and 5 define factor 1 (high loadings on factor 1, small or negligible loadings on factor 2), while variables 1 and 3 define factor 2 (high loadings on factor 2, small or negligible loadings on factor 1). Variable 4 is most closely aligned with factor 1, although it has aspects of the trait represented by factor 2. We might call factor 1 a nutritional factor and factor 2 a taste factor.

The factor loadings for the variables are pictured with respect to the original and (varimax) rotated factor axes in Figure 9.2. ■
Figure 9.2 Factor rotation for hypothetical marketing data.
PANEL 9.1 SAS ANALYSIS FOR EXAMPLE 9.9 USING PROC FACTOR.

PROGRAM COMMANDS:

title 'Factor Analysis';
data consumer(type=corr);
_type_='CORR';
input _name_$ taste money flavor snack energy;
cards;
taste   1.00   .     .     .     .
money    .02  1.00   .     .     .
flavor   .96   .13  1.00   .     .
snack    .42   .71   .50  1.00   .
energy   .01   .85   .11   .79  1.00
;
proc factor res data=consumer method=prin nfact=2 rotate=varimax preplot plot;
var taste money flavor snack energy;

OUTPUT:

Prior Communality Estimates: ONE

Eigenvalues of the Correlation Matrix: Total = 5   Average = 1

                 1          2          3          4          5
Eigenvalue   2.853090   1.806332   0.204490   0.102409   0.033677
Difference   1.046758   1.601842   0.102081   0.068732
Proportion    0.5706     0.3613     0.0409     0.0205     0.0067
Cumulative    0.5706     0.9319     0.9728     0.9933     1.0000

2 factors will be retained by the NFACTOR criterion.

Factor Pattern: TASTE, MONEY, FLAVOR, SNACK, ENERGY (loadings as reported in Table 9.7)
Variance explained by the two factors: Total = 4.659423

(continues on next page)
PANEL 9.1 (continued)

TASTE  MONEY  FLAVOR  SNACK  ENERGY

Variance explained by each factor:   FACTOR1 2.537396   FACTOR2 2.122027
Rotation of factor loadings is recommended particularly for loadings obtained by maximum likelihood, since the initial values are constrained to satisfy the uniqueness condition that L̂′Ψ̂⁻¹L̂ be a diagonal matrix. This condition is convenient for computational purposes, but may not lead to factors that can easily be interpreted.

Example 9.10
(Rotated loadings for the stock-price data)

Table 9.8 shows the initial and rotated maximum likelihood estimates of the factor loadings for the stock-price data of Examples 8.5 and 9.5. An m = 2 factor model is assumed. The estimated specific variances and cumulative proportions of the total (standardized) sample variance explained by each factor are also given.

TABLE 9.8

                        Maximum likelihood       Rotated estimated      Specific
                        estimates of factor      factor loadings        variances
                            loadings
Variable                  F1        F2             F₁*      F₂*        ψ̂ᵢ = 1 − ĥᵢ²
Allied Chemical          .684      .189            .601     .377           .50
Du Pont                  .694      .517            .850     .164           .25
Union Carbide            .681      .248            .643     .335           .47
Exxon                    .621     −.073            .365     .508           .61
Texaco                   .792     −.442            .208     .883           .18
Cumulative proportion
of total sample
variance explained       .485      .598            .335     .598
508
Chapter 9
Factor Ana lysis and I nference for Structu red Covaria nce Matrices
An interpretation of the factors suggested by the unrotated loadings was presented in Example 9.5. We identified market and industry factors.

The rotated loadings indicate that the chemical stocks (Allied Chemical, Du Pont, and Union Carbide) load highly on the first factor, while the oil stocks (Exxon and Texaco) load highly on the second factor. (Although the rotated loadings obtained from the principal component solution are not displayed, the same phenomenon is observed for them.) The two rotated factors, together, differentiate the industries. It is difficult for us to label these factors intelligently. Factor 1 represents those unique economic forces that cause chemical stocks to move together. Factor 2 appears to represent economic conditions affecting oil stocks.

As we have noted, a general factor (that is, one on which all the variables load highly) tends to be "destroyed after rotation." For this reason, in cases where a general factor is evident, an orthogonal rotation is sometimes performed with the general factor loadings fixed.⁶ ■

Example 9.11
(Rotated loadings for the Olympic decathlon data)

The estimated factor loadings and specific variances for the Olympic decathlon data were presented in Example 9.6. These quantities were derived for an m = 4 factor model, using both principal component and maximum likelihood solution methods. The interpretation of all the underlying factors was not immediately evident. A varimax rotation [see (9-45)] was performed to see whether the rotated factor loadings would provide additional insights. The varimax rotated loadings for the m = 4 factor solutions are displayed in Table 9.9, along with the specific variances. Apart from the estimated loadings, rotation will affect only the distribution of the proportions of the total sample variance explained by each factor. The cumulative proportion of the total sample variance explained for all factors does not change.

The rotated factor loadings for both methods of solution point to the same underlying attributes, although factors 1 and 2 are not in the same order. We see that shot put, discus, and javelin load highly on a factor, and, following Linden [21], this factor might be called explosive arm strength. Similarly, high jump, 110-meter hurdles, pole vault, and, to some extent, long jump load highly on another factor. Linden labeled this factor explosive leg strength. The 100-meter run, 400-meter run, and, again to some extent, the long jump load highly on a third factor. This factor could be called running speed. Finally, the 1500-meter run loads highly and the 400-meter run loads moderately on the fourth factor. Linden called this factor running endurance. As he notes, "The basic functions indicated in this study are mainly consistent with the traditional classification of track and field athletics."

Plots of rotated maximum likelihood loadings for factor pairs (1, 2) and (1, 3) are displayed in Figure 9.3 on page 510. The points are generally grouped along the factor axes. Plots of rotated principal component loadings are very similar. ■
⁶Some general-purpose factor analysis programs allow one to fix loadings associated with certain factors and to rotate the remaining factors.
TABLE 9.9

                              Principal component                        Maximum likelihood
                     Estimated rotated factor     Specific      Estimated rotated factor     Specific
                         loadings, ℓ̃ᵢⱼ*           variances        loadings, ℓ̂ᵢⱼ*            variances
Variable              F₁*    F₂*    F₃*    F₄*  ψ̃ᵢ = 1 − h̃ᵢ²    F₁*    F₂*    F₃*    F₄*  ψ̂ᵢ = 1 − ĥᵢ²
1. 100-m run         .136   .156   .884  −.113      .16         .167   .246   .857  −.138      .16
2. Long jump         .631   .194   .514  −.006      .30         .240   .580   .477   .011      .38
3. Shot put          .223   .825   .245  −.148      .19         .966   .154   .200  −.058      .00
4. High jump         .750   .150   .239   .076      .35         .242   .632   .173   .113      .50
5. 400-m run         .102   .075   .797   .468      .13         .055   .236   .709   .330      .33
6. 110-m hurdles     .635   .153   .404  −.170      .38         .261   .589  −.071   .205      .54
7. Discus            .147   .814   .186  −.079      .28         .697   .133   .180  −.009      .46
8. Pole vault        .762   .217   .176   .045      .34         .078   .513   .175   .019      .70
9. Javelin           .141   .735   .110  −.041      .43         .411   .116   .002   .112      .80
10. 1500-m run      −.036   .137  −.048   .934      .11        −.055   .056   .113   .990      .00
Cumulative
proportion of
total sample
variance explained   .21    .42    .61    .73                   .18    .34    .50    .61
Oblique Rotations

Orthogonal rotations are appropriate for a factor model in which the common factors are assumed to be independent. Many investigators in social sciences consider oblique (nonorthogonal) rotations, as well as orthogonal rotations. The former are often suggested after one views the estimated factor loadings and do not follow from our postulated model. Nevertheless, an oblique rotation is frequently a useful aid in factor analysis.
Figure 9.3 Rotated maximum likelihood loadings for factor pairs (1, 2) and (1, 3), decathlon data. (The numbers in the figures correspond to variables.)
If we regard the m common factors as coordinate axes, the point with the m coordinates (ℓ̂ᵢ₁, ℓ̂ᵢ₂, ..., ℓ̂ᵢₘ) represents the position of the ith variable in the factor space. Assuming that the variables are grouped into nonoverlapping clusters, an orthogonal rotation to a simple structure corresponds to a rigid rotation of the coordinate axes such that the axes, after rotation, pass as closely to the clusters as possible. An oblique rotation to a simple structure corresponds to a nonrigid rotation of the coordinate system such that the rotated axes (no longer perpendicular) pass (nearly) through the clusters. An oblique rotation seeks to express each variable in terms of a minimum number of factors, preferably a single factor. Oblique rotations are discussed in several sources (see, for example, [12] or [20]) and will not be pursued in this book.

9.5 FACTOR SCORES
In factor analysis, interest is usually centered on the parameters in the factor model. However, the estimated values of the common factors, called factor scores, may also be required. These quantities are often used for diagnostic purposes, as well as inputs to a subsequent analysis.

Factor scores are not estimates of unknown parameters in the usual sense. Rather, they are estimates of values for the unobserved random factor vectors Fⱼ, j = 1, 2, ..., n. That is, factor scores

f̂ⱼ = estimate of the values fⱼ attained by Fⱼ (jth case)
Section 9.5 Factor Scores 511
The estimation situation is complicated by the fact that the unobserved quantities fⱼ and εⱼ outnumber the observed xⱼ. To overcome this difficulty, some rather heuristic, but reasoned, approaches to the problem of estimating factor values have been advanced. We describe two of these approaches. Both of the factor score approaches have two elements in common:
1. They treat the estimated factor loadings ℓ̂ᵢⱼ and specific variances ψ̂ᵢ as if they were the true values.
2. They involve linear transformations of the original data, perhaps centered or standardized. Typically, the estimated rotated loadings, rather than the original estimated loadings, are used to compute factor scores. The computational formulas, as given in this section, do not change when rotated loadings are substituted for unrotated loadings, so we will not differentiate between them.

The Weighted Least Squares Method
Suppose first that the mean vector μ, the factor loadings L, and the specific variances Ψ are known for the factor model

  X  −  μ   =    L     F    +    ε
(p×1) (p×1)   (p×m) (m×1)     (p×1)
Further, regard the specific factors ε′ = [ε₁, ε₂, ..., εₚ] as errors. Since Var(εᵢ) = ψᵢ, i = 1, 2, ..., p, need not be equal, Bartlett [3] has suggested that weighted least squares be used to estimate the common factor values. The sum of the squares of the errors, weighted by the reciprocal of their variances, is
(9-47) A
Bartlett proposed choosing the estimates f of f to minimize (9-47). The solution (see Exercise 7.3) is (9-48) Motivated by (9-48), we take the estimates L , 4-, and jL obtain the factor scores for the jth case as
=
x as the true values and (9-49)
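The weighted least squares recipe in (9-48) and (9-49) is easy to compute directly. A minimal sketch (the loadings, specific variances, and function name here are our own hypothetical illustration, not from the text):

```python
import numpy as np

def wls_factor_scores(L, psi, x, xbar):
    """Bartlett (weighted least squares) factor scores, as in (9-49):
    f = (L' Psi^{-1} L)^{-1} L' Psi^{-1} (x - xbar), for diagonal Psi."""
    Pinv_L = L / psi[:, None]            # Psi^{-1} L for diagonal Psi
    A = L.T @ Pinv_L                     # L' Psi^{-1} L  (m x m)
    return np.linalg.solve(A, Pinv_L.T @ (x - xbar))

# Illustration: if x - xbar lies exactly in the factor space (no specific
# error), the weighted least squares scores recover the true factor values.
L = np.array([[.9, .1], [.8, .2], [.1, .7], [.2, .8]])
psi = np.array([.2, .3, .4, .3])
f0 = np.array([1.5, -0.5])
x = L @ f0                               # "observation" with zero error, xbar = 0
print(np.round(wls_factor_scores(L, psi, x, np.zeros(4)), 6))  # [ 1.5 -0.5]
```

The exact recovery in the error-free case follows because $(\mathbf{L}'\boldsymbol{\Psi}^{-1}\mathbf{L})^{-1}\mathbf{L}'\boldsymbol{\Psi}^{-1}\mathbf{L}\mathbf{f}_0 = \mathbf{f}_0$.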
When $\hat{\mathbf{L}}$ and $\hat{\boldsymbol{\Psi}}$ are determined by the maximum likelihood method, these estimates must satisfy the uniqueness condition, $\hat{\mathbf{L}}'\hat{\boldsymbol{\Psi}}^{-1}\hat{\mathbf{L}} = \hat{\boldsymbol{\Delta}}$, a diagonal matrix. We then have the following:

Factor Scores Obtained by Weighted Least Squares from the Maximum Likelihood Estimates

$$\hat{\mathbf{f}}_j = (\hat{\mathbf{L}}'\hat{\boldsymbol{\Psi}}^{-1}\hat{\mathbf{L}})^{-1}\hat{\mathbf{L}}'\hat{\boldsymbol{\Psi}}^{-1}(\mathbf{x}_j - \bar{\mathbf{x}}) = \hat{\boldsymbol{\Delta}}^{-1}\hat{\mathbf{L}}'\hat{\boldsymbol{\Psi}}^{-1}(\mathbf{x}_j - \bar{\mathbf{x}}), \qquad j = 1, 2, \ldots, n$$

or, if the correlation matrix is factored,

$$\hat{\mathbf{f}}_j = (\hat{\mathbf{L}}_z'\hat{\boldsymbol{\Psi}}_z^{-1}\hat{\mathbf{L}}_z)^{-1}\hat{\mathbf{L}}_z'\hat{\boldsymbol{\Psi}}_z^{-1}\mathbf{z}_j = \hat{\boldsymbol{\Delta}}_z^{-1}\hat{\mathbf{L}}_z'\hat{\boldsymbol{\Psi}}_z^{-1}\mathbf{z}_j, \qquad j = 1, 2, \ldots, n \tag{9-50}$$

where $\mathbf{z}_j$ denotes the standardized observations.

The factor scores generated by (9-50) have sample mean vector $\mathbf{0}$ and zero sample covariances. (See Exercise 9.16.) If rotated loadings $\hat{\mathbf{L}}^* = \hat{\mathbf{L}}\mathbf{T}$ are used in place of the original loadings in (9-50), the subsequent factor scores, $\hat{\mathbf{f}}_j^*$, are related to $\hat{\mathbf{f}}_j$ by $\hat{\mathbf{f}}_j^* = \mathbf{T}'\hat{\mathbf{f}}_j$, $j = 1, 2, \ldots, n$.
Comment. If the factor loadings are estimated by the principal component method, it is customary to generate factor scores using an unweighted (ordinary) least squares procedure. Implicitly, this amounts to assuming that the $\psi_i$ are equal or nearly equal. The factor scores are then

$$\hat{\mathbf{f}}_j = (\hat{\mathbf{L}}'\hat{\mathbf{L}})^{-1}\hat{\mathbf{L}}'(\mathbf{x}_j - \bar{\mathbf{x}}) \qquad \text{or} \qquad \hat{\mathbf{f}}_j = (\hat{\mathbf{L}}_z'\hat{\mathbf{L}}_z)^{-1}\hat{\mathbf{L}}_z'\mathbf{z}_j$$

for standardized data. Since $\hat{\mathbf{L}} = [\sqrt{\hat{\lambda}_1}\,\hat{\mathbf{e}}_1 \mid \sqrt{\hat{\lambda}_2}\,\hat{\mathbf{e}}_2 \mid \cdots \mid \sqrt{\hat{\lambda}_m}\,\hat{\mathbf{e}}_m]$ [see (9-15)], we have

$$\hat{\mathbf{f}}_j = \begin{bmatrix} \dfrac{1}{\sqrt{\hat{\lambda}_1}}\,\hat{\mathbf{e}}_1'(\mathbf{x}_j - \bar{\mathbf{x}}) \\ \vdots \\ \dfrac{1}{\sqrt{\hat{\lambda}_m}}\,\hat{\mathbf{e}}_m'(\mathbf{x}_j - \bar{\mathbf{x}}) \end{bmatrix} \tag{9-51}$$

For these factor scores,

$$\frac{1}{n}\sum_{j=1}^{n}\hat{\mathbf{f}}_j = \mathbf{0} \quad \text{(sample mean)}$$

and

$$\frac{1}{n-1}\sum_{j=1}^{n}\hat{\mathbf{f}}_j\hat{\mathbf{f}}_j' = \mathbf{I} \quad \text{(sample covariance)}$$
Comparing (9-51) with (8-21), we see that the $\hat{\mathbf{f}}_j$ are nothing more than the first $m$ (scaled) principal components, evaluated at $\mathbf{x}_j$.

The Regression Method

Starting again with the original factor model $\mathbf{X} - \boldsymbol{\mu} = \mathbf{L}\mathbf{F} + \boldsymbol{\varepsilon}$, we initially treat the loadings matrix $\mathbf{L}$ and specific variance matrix $\boldsymbol{\Psi}$ as known. When the common factors $\mathbf{F}$ and the specific factors (or errors) $\boldsymbol{\varepsilon}$ are jointly normally distributed with means and covariances given by (9-3), the linear combination $\mathbf{X} - \boldsymbol{\mu} = \mathbf{L}\mathbf{F} + \boldsymbol{\varepsilon}$ has an $N_p(\mathbf{0}, \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi})$ distribution. (See Result 4.3.) Moreover, the joint distribution of $(\mathbf{X} - \boldsymbol{\mu})$ and $\mathbf{F}$ is $N_{m+p}(\mathbf{0}, \boldsymbol{\Sigma}^*)$, where

$$\boldsymbol{\Sigma}^*_{(m+p)\times(m+p)} = \left[\begin{array}{c|c} \boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi} & \mathbf{L} \\ \hline \mathbf{L}' & \mathbf{I} \end{array}\right] \tag{9-52}$$

and $\mathbf{0}$ is an $(m + p) \times 1$ vector of zeros. Using Result 4.6, we find that the conditional distribution of $\mathbf{F} \mid \mathbf{x}$ is multivariate normal with

$$\text{mean} = E(\mathbf{F} \mid \mathbf{x}) = \mathbf{L}'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) = \mathbf{L}'(\mathbf{L}\mathbf{L}' + \boldsymbol{\Psi})^{-1}(\mathbf{x} - \boldsymbol{\mu}) \tag{9-53}$$

and

$$\text{covariance} = \operatorname{Cov}(\mathbf{F} \mid \mathbf{x}) = \mathbf{I} - \mathbf{L}'\boldsymbol{\Sigma}^{-1}\mathbf{L} = \mathbf{I} - \mathbf{L}'(\mathbf{L}\mathbf{L}' + \boldsymbol{\Psi})^{-1}\mathbf{L} \tag{9-54}$$

The quantities $\mathbf{L}'(\mathbf{L}\mathbf{L}' + \boldsymbol{\Psi})^{-1}$ in (9-53) are the coefficients in a (multivariate) regression of the factors on the variables. Estimates of these coefficients produce factor scores that are analogous to the estimates of the conditional mean values in multivariate regression analysis. (See Chapter 7.) Consequently, given any vector of observations $\mathbf{x}_j$, and taking the maximum likelihood estimates $\hat{\mathbf{L}}$ and $\hat{\boldsymbol{\Psi}}$ as the true values, we see that the $j$th factor score vector is given by

$$\hat{\mathbf{f}}_j = \hat{\mathbf{L}}'\hat{\boldsymbol{\Sigma}}^{-1}(\mathbf{x}_j - \bar{\mathbf{x}}) = \hat{\mathbf{L}}'(\hat{\mathbf{L}}\hat{\mathbf{L}}' + \hat{\boldsymbol{\Psi}})^{-1}(\mathbf{x}_j - \bar{\mathbf{x}}), \qquad j = 1, 2, \ldots, n \tag{9-55}$$

The calculation of $\hat{\mathbf{f}}_j$ in (9-55) can be simplified by using the matrix identity (see Exercise 9.6)

$$\mathbf{L}'(\mathbf{L}\mathbf{L}' + \boldsymbol{\Psi})^{-1} = (\mathbf{I} + \mathbf{L}'\boldsymbol{\Psi}^{-1}\mathbf{L})^{-1}\mathbf{L}'\boldsymbol{\Psi}^{-1} \tag{9-56}$$

This identity allows us to compare the factor scores in (9-55), generated by the regression argument, with those generated by the weighted least squares procedure [see (9-50)]. Temporarily, we denote the former by $\hat{\mathbf{f}}_j^{R}$ and the latter by $\hat{\mathbf{f}}_j^{LS}$. Then, using (9-56), we obtain

$$\hat{\mathbf{f}}_j^{R} = \left(\mathbf{I} + (\hat{\mathbf{L}}'\hat{\boldsymbol{\Psi}}^{-1}\hat{\mathbf{L}})^{-1}\right)^{-1}\hat{\mathbf{f}}_j^{LS} \tag{9-57}$$

For maximum likelihood estimates $(\hat{\mathbf{L}}'\hat{\boldsymbol{\Psi}}^{-1}\hat{\mathbf{L}})^{-1} = \hat{\boldsymbol{\Delta}}^{-1}$, and if the elements of this diagonal matrix are close to zero, the regression and generalized least squares methods will give nearly the same factor scores.
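Both the identity (9-56) and the relation (9-57) between the two kinds of scores can be checked numerically. A small sketch with arbitrary, hypothetical $\mathbf{L}$ and $\boldsymbol{\Psi}$ (all names here are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 5, 2
L = rng.normal(size=(p, m))
Psi = np.diag(rng.uniform(.2, 1.0, size=p))
Pinv = np.linalg.inv(Psi)

# Identity (9-56): L'(LL' + Psi)^{-1} = (I + L' Psi^{-1} L)^{-1} L' Psi^{-1}
lhs = L.T @ np.linalg.inv(L @ L.T + Psi)
rhs = np.linalg.solve(np.eye(m) + L.T @ Pinv @ L, L.T @ Pinv)
print(np.allclose(lhs, rhs))     # True

# Relation (9-57): regression scores = (I + (L' Psi^{-1} L)^{-1})^{-1} (WLS scores)
x = rng.normal(size=p)
A = L.T @ Pinv @ L
f_ls = np.linalg.solve(A, L.T @ Pinv @ x)     # weighted least squares, (9-48)
f_r = lhs @ x                                  # regression, (9-55)
print(np.allclose(f_r, np.linalg.solve(np.eye(m) + np.linalg.inv(A), f_ls)))  # True
```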
In an attempt to reduce the effects of a (possibly) incorrect determination of the number of factors, practitioners tend to calculate the factor scores in (9-55) by using $\mathbf{S}$ (the original sample covariance matrix) instead of $\hat{\boldsymbol{\Sigma}} = \hat{\mathbf{L}}\hat{\mathbf{L}}' + \hat{\boldsymbol{\Psi}}$. We then have the following:

Factor Scores Obtained by Regression

$$\hat{\mathbf{f}}_j = \hat{\mathbf{L}}'\mathbf{S}^{-1}(\mathbf{x}_j - \bar{\mathbf{x}}), \qquad j = 1, 2, \ldots, n$$

or, if a correlation matrix is factored,

$$\hat{\mathbf{f}}_j = \hat{\mathbf{L}}_z'\mathbf{R}^{-1}\mathbf{z}_j, \qquad j = 1, 2, \ldots, n \tag{9-58}$$

Again, if rotated loadings $\hat{\mathbf{L}}^* = \hat{\mathbf{L}}\mathbf{T}$ are used in place of the original loadings in (9-58), the subsequent factor scores $\hat{\mathbf{f}}_j^*$ are related to $\hat{\mathbf{f}}_j$ by

$$\hat{\mathbf{f}}_j^* = \mathbf{T}'\hat{\mathbf{f}}_j, \qquad j = 1, 2, \ldots, n$$

A numerical measure of agreement between the factor scores generated from two different calculation methods is provided by the sample correlation coefficient between scores on the same factor. Of the methods presented, none is recommended as uniformly superior.

Example 9.12 (Computing factor scores)
We shall illustrate the computation of factor scores by the least squares and regression methods using the stock-price data discussed in Example 9.10. A maximum likelihood solution from $\mathbf{R}$ gave the estimated rotated loadings and specific variances

$$\hat{\mathbf{L}}_z^* = \begin{bmatrix} .601 & .377 \\ .850 & .164 \\ .643 & .335 \\ .365 & .507 \\ .208 & .883 \end{bmatrix} \quad \text{and} \quad \hat{\boldsymbol{\Psi}}_z = \begin{bmatrix} .50 & 0 & 0 & 0 & 0 \\ 0 & .25 & 0 & 0 & 0 \\ 0 & 0 & .47 & 0 & 0 \\ 0 & 0 & 0 & .61 & 0 \\ 0 & 0 & 0 & 0 & .18 \end{bmatrix}$$

The vector of standardized observations,

$$\mathbf{z}' = [.50, -1.40, -.20, -.70, 1.40]$$

yields the following scores on factors 1 and 2:

Weighted least squares (9-50):

$$\hat{\mathbf{f}} = (\hat{\mathbf{L}}_z^{*\prime}\hat{\boldsymbol{\Psi}}_z^{-1}\hat{\mathbf{L}}_z^*)^{-1}\hat{\mathbf{L}}_z^{*\prime}\hat{\boldsymbol{\Psi}}_z^{-1}\mathbf{z} = \begin{bmatrix} -1.8 \\ 2.0 \end{bmatrix}$$

Regression (9-58):

$$\hat{\mathbf{f}} = \hat{\mathbf{L}}_z^{*\prime}\mathbf{R}^{-1}\mathbf{z} = \begin{bmatrix} .187 & .657 & .222 & .050 & -.210 \\ .037 & -.185 & .013 & .107 & .864 \end{bmatrix}\begin{bmatrix} .50 \\ -1.40 \\ -.20 \\ -.70 \\ 1.40 \end{bmatrix} = \begin{bmatrix} -1.2 \\ 1.4 \end{bmatrix}$$
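The two score vectors in Example 9.12 can be reproduced numerically. A sketch using the loadings and specific variances above; since $\mathbf{R}$ itself is not reprinted in the example, the regression scores are computed from $\hat{\boldsymbol{\Sigma}}_z = \hat{\mathbf{L}}\hat{\mathbf{L}}' + \hat{\boldsymbol{\Psi}}$ as in (9-55), which gives nearly the same result here:

```python
import numpy as np

# Rotated loadings and specific variances from Example 9.12
L = np.array([[.601, .377], [.850, .164], [.643, .335],
              [.365, .507], [.208, .883]])
psi = np.array([.50, .25, .47, .61, .18])
z = np.array([.50, -1.40, -.20, -.70, 1.40])

# Weighted least squares scores, (9-50)
Pinv_L = L / psi[:, None]
f_wls = np.linalg.solve(L.T @ Pinv_L, Pinv_L.T @ z)

# Regression scores as in (9-55), with Sigma = LL' + Psi in place of R
Sigma = L @ L.T + np.diag(psi)
f_reg = L.T @ np.linalg.solve(Sigma, z)

print(np.round(f_wls, 1))   # [-1.8  2. ]  (weighted least squares)
print(np.round(f_reg, 1))   # [-1.2  1.4]  (regression)
```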
In this case, the two methods produce somewhat different results. All of the regression factor scores, obtained using (9-58), are plotted in Figure 9.4. ■

[Figure 9.4: Factor scores using (9-58) for factors 1 and 2 of the stock-price data (maximum likelihood estimates of the factor loadings).]

Comment. Factor scores with a rather pleasing intuitive property can be constructed very simply. Group the variables with high (say, greater than .40 in absolute value) loadings on a factor. The scores for factor 1 are then formed by
summing the (standardized) observed values of the variables in the group, combined according to the sign of the loadings. The factor scores for factor 2 are the sums of the standardized observations corresponding to variables with high loadings on factor 2, and so forth. Data reduction is accomplished by replacing the standardized data by these simple factor scores. The simple factor scores are frequently highly correlated with the factor scores obtained by the more complex least squares and regression methods.

Example 9.13 (Creating simple summary scores from factor analysis groupings)
The principal component factor analysis of the stock price data in Example 9.4 produced the estimated loadings

$$\hat{\mathbf{L}} = \begin{bmatrix} .784 & -.216 \\ .773 & -.458 \\ .795 & -.234 \\ .712 & .473 \\ .712 & .524 \end{bmatrix} \quad \text{and} \quad \hat{\mathbf{L}}^* = \hat{\mathbf{L}}\mathbf{T} = \begin{bmatrix} .746 & .323 \\ .889 & .128 \\ .766 & .316 \\ .258 & .815 \\ .226 & .854 \end{bmatrix}$$

For each factor, take the loadings with largest absolute value in $\hat{\mathbf{L}}$ as equal in magnitude, and neglect the smaller loadings. Thus, we create the linear combination

$$\hat{f}_1 = x_1 + x_2 + x_3 + x_4 + x_5$$

as a summary. In practice, we would standardize these new variables. If, instead of $\hat{\mathbf{L}}$, we start with the varimax rotated loadings $\hat{\mathbf{L}}^*$, the simple factor scores would be

$$\hat{f}_1 = x_1 + x_2 + x_3 \qquad \hat{f}_2 = x_4 + x_5$$

The identification of high loadings and negligible loadings is really quite subjective. Linear compounds that make subject-matter sense are preferable. ■

Although multivariate normality is often assumed for the variables in a factor analysis, it is very difficult to justify the assumption for a large number of variables. As we pointed out in Chapter 4, marginal transformations may help. Similarly, the factor scores may or may not be normally distributed. Bivariate scatter plots of factor scores can produce all sorts of nonelliptical shapes. Plots of factor scores should be examined prior to using these scores in other analyses. They can reveal outlying values and the extent of the (possible) nonnormality.
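The grouping rule described in the Comment above (keep loadings greater than .40 in absolute value and use only their signs) can be mechanized. A sketch applying it to the rotated loadings of Example 9.13 (the function name and threshold argument are our own):

```python
import numpy as np

def simple_score_weights(L, threshold=0.40):
    """Sign weights for simple summary scores: +1 or -1 where |loading|
    exceeds the threshold, 0 otherwise (one column of weights per factor)."""
    return np.where(np.abs(L) > threshold, np.sign(L), 0.0)

# Varimax rotated loadings from Example 9.13
L_star = np.array([[.746, .323], [.889, .128], [.766, .316],
                   [.258, .815], [.226, .854]])
W = simple_score_weights(L_star)
print(W.T)
# Factor 1 weights: [1. 1. 1. 0. 0.]  ->  f1 = x1 + x2 + x3
# Factor 2 weights: [0. 0. 0. 1. 1.]  ->  f2 = x4 + x5
```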
9.6 PERSPECTIVES AND A STRATEGY FOR FACTOR ANALYSIS
There are many decisions that must be made in any factor analytic study. Probably the most important decision is the choice of m, the number of common factors. Although a large sample test of the adequacy of a model is available for a given m, it is suitable only for data that are approximately normally distributed. Moreover, the test will most assuredly reject the model for small m if the number of variables and observations is large. Yet this is the situation when factor analysis provides a useful approximation. Most often, the final choice of m is based on some combination of (1) the proportion of the sample variance explained, (2) subject-matter knowledge, and (3) the "reasonableness" of the results.

The choice of the solution method and type of rotation is a less crucial decision. In fact, the most satisfactory factor analyses are those in which rotations are tried with more than one method and all the results substantially confirm the same factor structure.

At the present time, factor analysis still maintains the flavor of an art, and no single strategy should yet be "chiseled into stone." We suggest and illustrate one reasonable option:

1. Perform a principal component factor analysis. This method is particularly appropriate for a first pass through the data. (It is not required that R or S be nonsingular.)
   (a) Look for suspicious observations by plotting the factor scores. Also, calculate standardized scores for each observation and squared distances as described in Section 4.6.
   (b) Try a varimax rotation.
2. Perform a maximum likelihood factor analysis, including a varimax rotation.
3. Compare the solutions obtained from the two factor analyses.
   (a) Do the loadings group in the same manner?
   (b) Plot factor scores obtained for principal components against scores from the maximum likelihood analysis.
4. Repeat the first three steps for other numbers of common factors m. Do extra factors necessarily contribute to the understanding and interpretation of the data?
5. For large data sets, split them in half and perform a factor analysis on each part. Compare the two results with each other and with that obtained from the complete data set to check the stability of the solution. (The data might be divided at random or by placing the first half of the cases in one group and the second half of the cases in the other group.)
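The split-half stability check in Step 5 can be sketched as follows. The one-factor model, sample sizes, and function name below are our own synthetic illustration, not data from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 6
true_loadings = np.array([.9, .8, .8, .7, .6, .5])   # hypothetical one-factor structure
f = rng.normal(size=n)
X = np.outer(f, true_loadings) + rng.normal(scale=.5, size=(n, p))

def first_pc_loadings(X):
    """First principal-component factor loading vector from the sample
    correlation matrix: sqrt(lambda_1) * e_1."""
    R = np.corrcoef(X, rowvar=False)
    vals, vecs = np.linalg.eigh(R)
    l1 = np.sqrt(vals[-1]) * vecs[:, -1]              # largest eigenvalue is last
    return l1 * np.sign(l1.sum())                     # fix the arbitrary sign

l_half1 = first_pc_loadings(X[: n // 2])
l_half2 = first_pc_loadings(X[n // 2 :])

# Loadings from the two halves should agree closely for a stable solution
print(np.round(np.abs(l_half1 - l_half2).max(), 3))
```

A large maximum discrepancy between the two half-sample loading vectors would suggest an unstable solution, echoing the advice in Step 5.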
Example 9.14 (Factor analysis of chicken-bone data)

We present the results of several factor analyses on bone and skull measurements of white leghorn fowl. The original data were taken from Dunn [10]. Factor analysis of Dunn's data was originally considered by Wright [26], who started his analysis from a different correlation matrix than the one we use.
The full data set consists of n = 276 measurements on bone dimensions:

Head: x₁ = skull length, x₂ = skull breadth
Leg:  x₃ = femur length, x₄ = tibia length
Wing: x₅ = humerus length, x₆ = ulna length

The sample correlation matrix

$$\mathbf{R} = \begin{bmatrix} 1.000 & .505 & .569 & .602 & .621 & .603 \\ .505 & 1.000 & .422 & .467 & .482 & .450 \\ .569 & .422 & 1.000 & .926 & .877 & .878 \\ .602 & .467 & .926 & 1.000 & .874 & .894 \\ .621 & .482 & .877 & .874 & 1.000 & .937 \\ .603 & .450 & .878 & .894 & .937 & 1.000 \end{bmatrix}$$

was factor analyzed by the principal component and maximum likelihood methods for an m = 3 factor model. The results are given in Table 9.10.⁷

After rotation, the two methods of solution appear to give somewhat different results. Focusing our attention on the principal component method and the cumulative proportion of the total sample variance explained, we see that a three-factor solution appears to be warranted. The third factor explains a "significant" amount of additional sample variation. The first factor appears to be a body-size factor dominated by wing and leg dimensions. The second and third factors, collectively, represent skull dimension and might be given the same names as the variables, skull breadth and skull length, respectively.

The rotated maximum likelihood factor loadings are consistent with those generated by the principal component method for the first factor, but not for factors 2 and 3. For the maximum likelihood method, the second factor appears to represent head size. The meaning of the third factor is unclear, and it is probably not needed.

Further support for retaining three or fewer factors is provided by the residual matrix obtained from the maximum likelihood estimates:
$$\mathbf{R} - \hat{\mathbf{L}}_z\hat{\mathbf{L}}_z' - \hat{\boldsymbol{\Psi}}_z = \begin{bmatrix} .000 & & & & & \\ .000 & -.000 & & & & \\ .001 & .000 & -.003 & & & \\ .000 & .000 & .000 & .000 & & \\ .000 & .000 & .000 & .000 & -.001 & \\ .004 & -.001 & -.001 & .000 & -.000 & .000 \end{bmatrix}$$

⁷ Notice the estimated specific variance of .00 for tibia length in the maximum likelihood solution. This suggests that maximizing the likelihood function may produce a Heywood case. Readers attempting to replicate our results should try the Hey(wood) option if SAS or similar software is used.
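The principal component part of Table 9.10 can be reproduced directly from $\mathbf{R}$, since the loadings are just the first $m$ eigenvectors scaled by the square roots of their eigenvalues. A sketch using the correlation matrix of Example 9.14:

```python
import numpy as np

# Sample correlation matrix of the chicken-bone data (Example 9.14)
R = np.array([
    [1.000, .505, .569, .602, .621, .603],
    [.505, 1.000, .422, .467, .482, .450],
    [.569, .422, 1.000, .926, .877, .878],
    [.602, .467, .926, 1.000, .874, .894],
    [.621, .482, .877, .874, 1.000, .937],
    [.603, .450, .878, .894, .937, 1.000]])

vals, vecs = np.linalg.eigh(R)
vals, vecs = vals[::-1], vecs[:, ::-1]          # largest eigenvalues first

m = 3
L_hat = vecs[:, :m] * np.sqrt(vals[:m])         # principal component loadings
L_hat *= np.sign(L_hat.sum(axis=0))             # fix arbitrary eigenvector signs

cumprop = np.cumsum(vals[:m]) / R.shape[0]      # cumulative proportion of variance
print(np.round(cumprop, 3))                     # approximately [0.743 0.873 0.95]
```

The cumulative proportions agree with the principal component column of Table 9.10.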
TABLE 9.10  FACTOR ANALYSIS OF CHICKEN-BONE DATA

Principal Component
                        Estimated factor loadings     Rotated estimated loadings
Variable                  F1      F2      F3          F1*     F2*     F3*       ψ̃i
1. Skull length          .741    .350    .573        .355    .244   (.902)     .00
2. Skull breadth         .604    .720   -.340        .235   (.949)   .211      .00
3. Femur length          .929   -.233   -.075       (.921)   .164    .218      .08
4. Tibia length          .943   -.175   -.067       (.904)   .212    .252      .08
5. Humerus length        .948   -.143   -.045       (.888)   .228    .283      .08
6. Ulna length           .945   -.189   -.047       (.908)   .192    .264      .07
Cumulative proportion
of total (standardized)
sample variance
explained                .743    .873    .950        .576    .763    .950

Maximum Likelihood
                        Estimated factor loadings     Rotated estimated loadings
Variable                  F1      F2      F3          F1*     F2*     F3*       ψ̂i
1. Skull length          .602    .214    .286        .467   (.506)   .128      .51
2. Skull breadth         .467    .177    .652        .211   (.789)   .084      .33
3. Femur length          .926    .145   -.057       (.890)   .289   -.073      .12
4. Tibia length         1.000    .000   -.000       (.936)   .345    .050      .00
5. Humerus length        .874    .463    .012       (.831)   .362    .396      .02
6. Ulna length           .894    .336   -.039       (.857)   .325    .272      .09
Cumulative proportion
of total (standardized)
sample variance
explained                .667    .738    .823        .559    .779    .823
All the entries in this matrix are very small. We shall pursue the m = 3 factor model in this example. An m = 2 factor model is considered in Exercise 9.10.

Factor scores for factors 1 and 2 produced from (9-58) with the rotated maximum likelihood estimates are plotted in Figure 9.5. Plots of this kind allow us to identify observations that, for one reason or another, are not consistent with the remaining observations. Potential outliers are circled in the figure.

[Figure 9.5: Factor scores for the first two factors of chicken-bone data.]

It is also of interest to plot pairs of factor scores obtained using the principal component and maximum likelihood estimates of factor loadings. For the chicken-bone data, plots of pairs of factor scores are given in Figure 9.6. If the loadings on a particular factor agree, the pairs of scores should cluster tightly
about the 45° line through the origin. Sets of loadings that do not agree will produce factor scores that deviate from this pattern. If the latter occurs, it is usually associated with the last factors and may suggest that the number of factors is too large. That is, the last factors are not meaningful. This seems to be the case with the third factor in the chicken-bone data, as indicated by Plot (c) in Figure 9.6.

Plots of pairs of factor scores using estimated loadings from two solution methods are also good tools for detecting outliers. If the sets of loadings for a factor tend to agree, outliers will appear as points in the neighborhood of the 45° line, but far from the origin and the cluster of the remaining points. It is clear from Plot (b) in Figure 9.6 that one of the 276 observations is not consistent with the others. It has an unusually large F₂-score. When this point, [39.1, 39.3, 75.7, 115, 73.4, 69.1], was removed and the analysis repeated, the loadings were not altered appreciably.

When the data set is large, it should be divided into two (roughly) equal sets, and a factor analysis should be performed on each half. The results of these analyses can be compared with each other and with the analysis for the full data set to test the stability of the solution. If the results are consistent with one another, confidence in the solution is increased.
[Figure 9.6(a): Pairs of factor scores for the chicken-bone data, first factor. (Loadings are estimated by principal component and maximum likelihood methods.)]
The chicken-bone data were divided into two sets of n₁ = 137 and n₂ = 139 observations, respectively. The resulting sample correlation matrices were

$$\mathbf{R}_1 = \begin{bmatrix} 1.000 & & & & & \\ .696 & 1.000 & & & & \\ .588 & .540 & 1.000 & & & \\ .639 & .575 & .901 & 1.000 & & \\ .694 & .606 & .844 & .835 & 1.000 & \\ .660 & .584 & .866 & .863 & .931 & 1.000 \end{bmatrix}$$
[Figure 9.6(b): Pairs of factor scores for the chicken-bone data, second factor.]
and

$$\mathbf{R}_2 = \begin{bmatrix} 1.000 & & & & & \\ .366 & 1.000 & & & & \\ .572 & .352 & 1.000 & & & \\ .587 & .406 & .950 & 1.000 & & \\ .587 & .420 & .909 & .911 & 1.000 & \\ .598 & .386 & .894 & .927 & .940 & 1.000 \end{bmatrix}$$

The rotated estimated loadings, specific variances, and proportion of the total (standardized) sample variance explained for a principal component solution of an m = 3 factor model are given in Table 9.11 on page 524.
[Figure 9.6(c): Pairs of factor scores for the chicken-bone data, third factor.]
The results for the two halves of the chicken-bone measurements are very similar. Factors $\hat{F}_2^*$ and $\hat{F}_3^*$ interchange with respect to their labels, skull length and skull breadth, but they collectively seem to represent head size. The first factor, $\hat{F}_1^*$, again appears to be a body-size factor dominated by leg and wing dimensions. These are the same interpretations we gave to the results from a principal component factor analysis of the entire set of data. The solution is remarkably stable, and we can be fairly confident that the large loadings are "real." As we have pointed out, however, three factors are probably too many. A one- or two-factor model is surely sufficient for the chicken-bone data, and you are encouraged to repeat the analyses here with fewer factors and alternative solution methods. (See Exercise 9.10.) ■
TABLE 9.11

                    First set (n₁ = 137 observations)    Second set (n₂ = 139 observations)
                    Rotated estimated factor loadings    Rotated estimated factor loadings
Variable              F1*     F2*     F3*      ψ̃i         F1*     F2*     F3*      ψ̃i
1. Skull length      .360    .361   (.853)    .01        .352   (.921)   .167     .00
2. Skull breadth     .303   (.899)   .312     .00        .203    .145   (.968)    .00
3. Femur length     (.915)   .238    .175     .08       (.925)   .239    .168     .06
4. Tibia length     (.830)   .242    .395     .10       (.933)   .248    .130     .05
5. Humerus length   (.871)   .270    .247     .11       (.912)   .252    .208     .06
6. Ulna length      (.877)   .231    .332     .08       (.914)   .272    .187     .06
Cumulative proportion
of total (standardized)
sample variance
explained            .546    .743    .940                .593    .780    .962
Factor analysis has a tremendous intuitive appeal for the behavioral and social sciences. In these areas, it is natural to regard multivariate observations on animal and human processes as manifestations of underlying unobservable "traits." Factor analysis provides a way of explaining the observed variability in behavior in terms of these traits.

Still, when all is said and done, factor analysis remains very subjective. Our examples, in common with most published sources, consist of situations in which the factor analysis model provides reasonable explanations in terms of a few interpretable factors. In practice, the vast majority of attempted factor analyses do not yield such clear-cut results. Unfortunately, the criterion for judging the quality of any factor analysis has not been well quantified. Rather, that quality seems to depend on a WOW criterion: if, while scrutinizing the factor analysis, the investigator can shout "Wow, I understand these factors," the application is deemed successful.

9.7 STRUCTURAL EQUATION MODELS
Structural equation models are sets of linear equations used to specify phenomena in terms of their presumed cause-and-effect variables. In their most general form, the models allow for variables that cannot be measured directly. Structural equation models are particularly helpful in the social and behavioral sciences and have been used to study the relationship between social status and achievement, the determinants of firm profitability, discrimination in employment, the efficacy of social action programs, and other interesting mechanisms. (See [2], [5], [6], [7], [9], [11], [13], [14], [15], [17], and [23] for additional theory and applications.)

Computer software for specifying, fitting, and evaluating structural equation models has been developed by Joreskog and Sorbom and is now widely available as the LISREL (Linear Structural Relationships) system. (See [18].)

The LISREL Model
Using the notation of Joreskog and Sorbom [18], the LISREL model is given by the equations

$$\boldsymbol{\eta}_{(m\times 1)} = \mathbf{B}_{(m\times m)}\boldsymbol{\eta}_{(m\times 1)} + \boldsymbol{\Gamma}_{(m\times n)}\boldsymbol{\xi}_{(n\times 1)} + \boldsymbol{\zeta}_{(m\times 1)} \tag{9-59}$$

with

$$\mathbf{Y}_{(p\times 1)} = \boldsymbol{\Lambda}_{y\,(p\times m)}\boldsymbol{\eta}_{(m\times 1)} + \boldsymbol{\varepsilon}_{(p\times 1)}, \qquad \mathbf{X}_{(q\times 1)} = \boldsymbol{\Lambda}_{x\,(q\times n)}\boldsymbol{\xi}_{(n\times 1)} + \boldsymbol{\delta}_{(q\times 1)} \tag{9-60}$$

and

$$E(\boldsymbol{\zeta}) = \mathbf{0}, \ \operatorname{Cov}(\boldsymbol{\zeta}) = \boldsymbol{\Psi}; \qquad E(\boldsymbol{\varepsilon}) = \mathbf{0}, \ \operatorname{Cov}(\boldsymbol{\varepsilon}) = \boldsymbol{\Theta}_{\varepsilon}; \qquad E(\boldsymbol{\delta}) = \mathbf{0}, \ \operatorname{Cov}(\boldsymbol{\delta}) = \boldsymbol{\Theta}_{\delta} \tag{9-61}$$

Here $\boldsymbol{\zeta}$, $\boldsymbol{\varepsilon}$, and $\boldsymbol{\delta}$ are mutually uncorrelated; $\operatorname{Cov}(\boldsymbol{\xi}) = \boldsymbol{\Phi}$; $\boldsymbol{\zeta}$ is uncorrelated with $\boldsymbol{\xi}$; $\boldsymbol{\varepsilon}$ is uncorrelated with $\boldsymbol{\eta}$; $\boldsymbol{\delta}$ is uncorrelated with $\boldsymbol{\xi}$; $\mathbf{B}$ has zeros on the diagonal; and $\mathbf{I} - \mathbf{B}$ is nonsingular. In addition to the assumptions (9-61), we take $E(\boldsymbol{\xi}) = \mathbf{0}$ and $E(\boldsymbol{\eta}) = \mathbf{0}$.

The quantities $\boldsymbol{\xi}$ and $\boldsymbol{\eta}$ in (9-59) are the cause-and-effect variables, respectively, and, ordinarily, are not directly observed. They are sometimes called latent variables. The quantities $\mathbf{Y}$ and $\mathbf{X}$ in (9-60) are variables that are linearly related to $\boldsymbol{\eta}$ and $\boldsymbol{\xi}$ through the coefficient matrices $\boldsymbol{\Lambda}_y$ and $\boldsymbol{\Lambda}_x$, and these variables can be measured. Their observed values constitute the data. Equations (9-60) are sometimes called the measurement equations.

Construction of a Path Diagram

A distinction is made between variables that are not influenced by other variables in the system (exogenous variables) and variables that are affected by others (endogenous variables). With each of the latter dependent variables is associated a residual. Certain conventions govern the drawing of a path diagram, which is constructed as follows (directed arrows represent a path):

1. A straight arrow is drawn to each dependent (endogenous) variable from each of its sources.
2. A straight arrow is also drawn to each dependent variable from its residual.
3. A curved, double-headed arrow is drawn between each pair of independent (exogenous) variables thought to have nonzero correlation.

The curved arrow for correlation is indicative of the symmetrical nature of a correlation coefficient. The other connections are directional, as indicated by the single-headed arrow.
Example 9.15 (A path diagram and the structural equation)

Consider the following path diagram for the cause-and-effect variables:

[Path diagram for Example 9.15: straight arrows run from ξ₁ and ξ₂ to η₁, from ξ₂ and ξ₃ to η₂, and from η₁ to η₂; a curved, double-headed arrow connects ξ₁ and ξ₂.]

This path diagram, with m = 2 and n = 3, corresponds to the structural equation [see (9-59)]

$$\begin{bmatrix} \eta_1 \\ \eta_2 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ \beta & 0 \end{bmatrix}\begin{bmatrix} \eta_1 \\ \eta_2 \end{bmatrix} + \begin{bmatrix} \gamma_1 & \gamma_2 & 0 \\ 0 & \gamma_3 & \gamma_4 \end{bmatrix}\begin{bmatrix} \xi_1 \\ \xi_2 \\ \xi_3 \end{bmatrix} + \begin{bmatrix} \zeta_1 \\ \zeta_2 \end{bmatrix}$$

with $\operatorname{Cov}(\xi_1, \xi_2) = \phi_{12}$. ■

Because $\boldsymbol{\eta}$ and $\boldsymbol{\xi}$ are not observed, the LISREL model cannot be verified directly. However, as in factor analysis, the model and assumptions imply a certain covariance structure, which can be checked. Define the data vector $[\mathbf{Y}', \mathbf{X}']'$. Then
$$\operatorname{Cov}\begin{pmatrix}\mathbf{Y}\\\mathbf{X}\end{pmatrix} = \boldsymbol{\Sigma}_{(p+q)\times(p+q)} = \left[\begin{array}{c|c} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \hline \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{array}\right] = \left[\begin{array}{c|c} \operatorname{Cov}(\mathbf{Y}) & \operatorname{Cov}(\mathbf{Y},\mathbf{X}) \\ \hline \operatorname{Cov}(\mathbf{X},\mathbf{Y}) & \operatorname{Cov}(\mathbf{X}) \end{array}\right] \tag{9-62}$$
where, with $\mathbf{B} = \mathbf{0}$ to simplify the discussion,

$$\operatorname{Cov}(\mathbf{Y}) = E(\mathbf{Y}\mathbf{Y}') = \boldsymbol{\Lambda}_y\operatorname{Cov}(\boldsymbol{\eta})\boldsymbol{\Lambda}_y' + \boldsymbol{\Theta}_{\varepsilon} = \boldsymbol{\Lambda}_y(\boldsymbol{\Gamma}\boldsymbol{\Phi}\boldsymbol{\Gamma}' + \boldsymbol{\Psi})\boldsymbol{\Lambda}_y' + \boldsymbol{\Theta}_{\varepsilon}$$
$$\operatorname{Cov}(\mathbf{X}) = E(\mathbf{X}\mathbf{X}') = \boldsymbol{\Lambda}_x\operatorname{Cov}(\boldsymbol{\xi})\boldsymbol{\Lambda}_x' + \boldsymbol{\Theta}_{\delta} = \boldsymbol{\Lambda}_x\boldsymbol{\Phi}\boldsymbol{\Lambda}_x' + \boldsymbol{\Theta}_{\delta} \tag{9-63}$$
$$\operatorname{Cov}(\mathbf{Y},\mathbf{X}) = E(\mathbf{Y}\mathbf{X}') = E[(\boldsymbol{\Lambda}_y(\boldsymbol{\Gamma}\boldsymbol{\xi} + \boldsymbol{\zeta}) + \boldsymbol{\varepsilon})(\boldsymbol{\Lambda}_x\boldsymbol{\xi} + \boldsymbol{\delta})'] = \boldsymbol{\Lambda}_y\boldsymbol{\Gamma}\boldsymbol{\Phi}\boldsymbol{\Lambda}_x' = [\operatorname{Cov}(\mathbf{X},\mathbf{Y})]'$$

The covariances are (nonlinear) functions of the model parameters $\boldsymbol{\Lambda}_x$, $\boldsymbol{\Lambda}_y$, $\boldsymbol{\Gamma}$, $\boldsymbol{\Phi}$, $\boldsymbol{\Psi}$, $\boldsymbol{\Theta}_{\varepsilon}$, and $\boldsymbol{\Theta}_{\delta}$. (Recall that we set $\mathbf{B} = \mathbf{0}$.)

Given $n$ multivariate observations $[\mathbf{y}_j', \mathbf{x}_j']'$, $j = 1, 2, \ldots, n$, the sample covariance matrix [see (3-11)]

$$\mathbf{S}_{(p+q)\times(p+q)} = \left[\begin{array}{c|c} \mathbf{S}_{11} & \mathbf{S}_{12} \\ \hline \mathbf{S}_{21} & \mathbf{S}_{22} \end{array}\right]$$

can be constructed and partitioned in a manner conformable to $\boldsymbol{\Sigma}$. The information in $\mathbf{S}$ is used to estimate the model parameters. Specifically, we set

$$\hat{\boldsymbol{\Sigma}} = \mathbf{S} \tag{9-64}$$

and solve the resulting equations.

Estimation
Unfortunately, the equations in (9-64) often cannot be solved explicitly. An iterative search routine that begins with initial parameter estimates must be used to produce a matrix $\hat{\boldsymbol{\Sigma}}$ that closely approximates $\mathbf{S}$. The search routines use a criterion function that measures the discrepancy between $\hat{\boldsymbol{\Sigma}}$ and $\mathbf{S}$. The LISREL program currently uses a "least squares" criterion and a "maximum likelihood" criterion to estimate the model parameters. (See [18] for details.)

The next example, with hypothetical data, illustrates a case where the parameter estimates can be obtained from (9-64).

Example 9.16
(Estimation for a structural equation model—artificial data)

Take m = 1, n = 1, p = 2, q = 2, and $\mathbf{B} = \mathbf{0}$ in (9-59) and (9-60). Then

$$\eta = \gamma\xi + \zeta$$

$$\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} = \begin{bmatrix} 1 \\ \lambda_1 \end{bmatrix}\eta + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix} \qquad \mathbf{X} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} 1 \\ \lambda_2 \end{bmatrix}\xi + \begin{bmatrix} \delta_1 \\ \delta_2 \end{bmatrix}$$

Additionally, let $\operatorname{Var}(\xi) = \phi$, $\operatorname{Var}(\zeta) = \psi$, $\operatorname{Var}(\varepsilon_i) = \theta_i$, and $\operatorname{Var}(\delta_i) = \theta_{2+i}$, $i = 1, 2$. Using (9-63), the covariance matrix of $[Y_1, Y_2, X_1, X_2]'$ is

$$\boldsymbol{\Sigma} = \left[\begin{array}{cc|cc} \gamma^2\phi + \psi + \theta_1 & \lambda_1(\gamma^2\phi + \psi) & \gamma\phi & \lambda_2\gamma\phi \\ \lambda_1(\gamma^2\phi + \psi) & \lambda_1^2(\gamma^2\phi + \psi) + \theta_2 & \lambda_1\gamma\phi & \lambda_1\lambda_2\gamma\phi \\ \hline \gamma\phi & \lambda_1\gamma\phi & \phi + \theta_3 & \lambda_2\phi \\ \lambda_2\gamma\phi & \lambda_1\lambda_2\gamma\phi & \lambda_2\phi & \lambda_2^2\phi + \theta_4 \end{array}\right]$$

Suppose the sample covariance matrix, based on hypothetical data, is

$$\mathbf{S} = \begin{bmatrix} 14.3 & -27.6 & 3.2 & 6.4 \\ -27.6 & 55.4 & -6.4 & -12.8 \\ 3.2 & -6.4 & 1.1 & 1.6 \\ 6.4 & -12.8 & 1.6 & 3.7 \end{bmatrix}$$

Setting $\hat{\boldsymbol{\Sigma}} = \mathbf{S}$, we obtain the equations

(i) $\hat{\gamma}^2\hat{\phi} + \hat{\psi} + \hat{\theta}_1 = 14.3$
(ii) $\hat{\lambda}_1(\hat{\gamma}^2\hat{\phi} + \hat{\psi}) = -27.6$
(iii) $\hat{\lambda}_1^2(\hat{\gamma}^2\hat{\phi} + \hat{\psi}) + \hat{\theta}_2 = 55.4$
(iv) $\hat{\gamma}\hat{\phi} = 3.2$
(v) $\hat{\lambda}_2\hat{\gamma}\hat{\phi} = 6.4$
(vi) $\hat{\lambda}_1\hat{\gamma}\hat{\phi} = -6.4$
(vii) $\hat{\lambda}_1\hat{\lambda}_2\hat{\gamma}\hat{\phi} = -12.8$
(viii) $\hat{\phi} + \hat{\theta}_3 = 1.1$
(ix) $\hat{\lambda}_2\hat{\phi} = 1.6$
(x) $\hat{\lambda}_2^2\hat{\phi} + \hat{\theta}_4 = 3.7$

From (iv) and (v), $\hat{\lambda}_2 = 2.0$; from (iv) and (vi), $\hat{\lambda}_1 = -2.0$; then (ix) gives $\hat{\phi} = .8$, so that $\hat{\gamma} = 4$, and (ii) gives $\hat{\psi} = 1$. The remaining equations yield $\hat{\theta}_1 = .5$, $\hat{\theta}_2 = .2$, $\hat{\theta}_3 = .3$, and $\hat{\theta}_4 = .5$. Consequently,

$$\hat{\boldsymbol{\Lambda}}_y = \begin{bmatrix} 1 \\ -2 \end{bmatrix} \qquad \hat{\boldsymbol{\Lambda}}_x = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \qquad \hat{\boldsymbol{\Theta}}_{\varepsilon} = \begin{bmatrix} .5 & 0 \\ 0 & .2 \end{bmatrix} \qquad \hat{\boldsymbol{\Theta}}_{\delta} = \begin{bmatrix} .3 & 0 \\ 0 & .5 \end{bmatrix}$$
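The covariance structure (9-63) and the solution of Example 9.16 can be verified numerically: a sketch that rebuilds the model-implied $\boldsymbol{\Sigma}$ from the parameter estimates and compares it with $\mathbf{S}$ (variable names are our own):

```python
import numpy as np

# Parameter estimates obtained by solving equations (i)-(x) in Example 9.16
gamma, lam1, lam2 = 4.0, -2.0, 2.0
phi, psi = 0.8, 1.0
th1, th2, th3, th4 = 0.5, 0.2, 0.3, 0.5

Ly = np.array([[1.0], [lam1]])          # Lambda_y (scale of eta fixed by the 1)
Lx = np.array([[1.0], [lam2]])          # Lambda_x (scale of xi fixed by the 1)
v_eta = gamma**2 * phi + psi            # Var(eta) = gamma^2 phi + psi

# Model-implied covariance matrix of [Y1, Y2, X1, X2] from (9-63), with B = 0
Syy = v_eta * (Ly @ Ly.T) + np.diag([th1, th2])
Sxx = phi * (Lx @ Lx.T) + np.diag([th3, th4])
Syx = gamma * phi * (Ly @ Lx.T)
Sigma = np.block([[Syy, Syx], [Syx.T, Sxx]])

S = np.array([[14.3, -27.6, 3.2, 6.4],
              [-27.6, 55.4, -6.4, -12.8],
              [3.2, -6.4, 1.1, 1.6],
              [6.4, -12.8, 1.6, 3.7]])
print(np.allclose(Sigma, S))            # True: the estimates reproduce S exactly
```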
The columns of the $\boldsymbol{\Lambda}_y$ and $\boldsymbol{\Lambda}_x$ matrices contain a 1 in order to fix the scale of the unobserved $\eta$ and $\xi$. In this example, the units of $\eta$ have been chosen to be those of $Y_1$, and the units of $\xi$ are taken to be those of $X_1$. (Any constants can be used to remove the scale ambiguity associated with the variables $\eta$ and $\xi$. The number 1 is a convenient choice.)

In this example, the structural variables $\eta$ and $\xi$ might be the performance of the firm and managerial talent, respectively. The firm's performance cannot be measured directly, but indicators of performance, for example, profit ($Y_1$) and common stock price ($Y_2$), can be measured. Similarly, indicators of managerial talent, say, $X_1$ = years of chief executive experience and $X_2$ = membership on boards of directors, can be measured. The model assumes that a firm's performance is caused, to a large degree, by managerial talent. ■

To estimate the model parameters, we need more equations than unknown parameters. In general, if $t$ is the total number of unknown parameters, $t$ must be such that

$$t < \frac{(p + q)(p + q + 1)}{2} \tag{9-65}$$

Condition (9-65), however, does not guarantee that all parameters can be estimated uniquely.

The final fit of the model must be assessed carefully. Individual parameter estimates, along with the entries in the residual matrix $\mathbf{S} - \hat{\boldsymbol{\Sigma}}$, should be examined. Parameter estimates should have appropriate signs and magnitudes. For example, iterative parameter estimation routines may not yield variance estimates that are positive. Entries in the residual matrix should be uniformly small.

In linear structural equation models, interest is often centered on the values of the parameters and the associated "effects." Predicted values for the variables are not easily obtained, unless the model reduces to a variant of the multivariate linear regression model.

Model-Fitting Strategy

A useful model-fitting strategy consists of the following:

1. If possible, generate parameter estimates using several criteria (for example, least squares and maximum likelihood), and compare the estimates.
   (a) Are the signs and magnitudes consistent?
   (b) Are all variance estimates positive?
   (c) Are the residual matrices, $\mathbf{S} - \hat{\boldsymbol{\Sigma}}$, similar?
2. Do the analysis with both $\mathbf{S}$ and the sample correlation matrix $\mathbf{R}$. What effect does standardizing the observable variables have on the outcome?
3. Split large data sets in half, and perform the analyses on each half. Compare the results with each other and with the results for the complete data set to check the stability of the solution.

A model that gives consistent results for Steps 1–3 is probably a reasonable one. Inconsistent results point toward a model that is not completely supported by the data and needs to be modified or abandoned.
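Step 2 of the strategy (repeating the analysis with the sample correlation matrix) only requires rescaling $\mathbf{S}$. A minimal sketch (the function name and the small example matrix are our own illustration):

```python
import numpy as np

def cov_to_corr(S):
    """Convert a covariance matrix to the corresponding correlation matrix:
    R = D^{-1/2} S D^{-1/2}, where D holds the diagonal of S."""
    d = 1.0 / np.sqrt(np.diag(S))
    return S * np.outer(d, d)

S = np.array([[4.0, 1.2, -0.6],
              [1.2, 1.0, 0.3],
              [-0.6, 0.3, 9.0]])
R = cov_to_corr(S)
print(np.round(R, 3))
# The result has a unit diagonal; the off-diagonal entries are the correlations.
```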
SUPPLEMENT 9A

Some Computational Details for Maximum Likelihood Estimation
Although a simple analytical expression cannot be obtained for the maximum likelihood estimators $\hat{\mathbf{L}}$ and $\hat{\boldsymbol{\Psi}}$, they can be shown to satisfy certain equations. Not surprisingly, the conditions are stated in terms of the maximum likelihood estimator $\mathbf{S}_n = (1/n)\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'$ of an unstructured covariance matrix. Some factor analysts employ the usual sample covariance $\mathbf{S}$, but still use the title maximum likelihood to refer to resulting estimates. This modification, referenced in Footnote 5 of this chapter, amounts to employing the likelihood obtained from the Wishart distribution of $\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'$ and ignoring the minor contribution due to the normal density for $\bar{\mathbf{x}}$. The factor analysis of $\mathbf{R}$ is, of course, unaffected by the choice of $\mathbf{S}_n$ or $\mathbf{S}$, since they both produce the same correlation matrix.

Result 9A.1. Let $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ be a random sample from a normal population. The maximum likelihood estimates $\hat{\mathbf{L}}$ and $\hat{\boldsymbol{\Psi}}$ are obtained by maximizing (9-25) subject to the uniqueness condition in (9-26). They satisfy

$$(\hat{\boldsymbol{\Psi}}^{-1/2}\mathbf{S}_n\hat{\boldsymbol{\Psi}}^{-1/2})(\hat{\boldsymbol{\Psi}}^{-1/2}\hat{\mathbf{L}}) = (\hat{\boldsymbol{\Psi}}^{-1/2}\hat{\mathbf{L}})(\mathbf{I} + \hat{\boldsymbol{\Delta}}) \tag{9A-1}$$

so the $j$th column of $\hat{\boldsymbol{\Psi}}^{-1/2}\hat{\mathbf{L}}$ is the (nonnormalized) eigenvector of $\hat{\boldsymbol{\Psi}}^{-1/2}\mathbf{S}_n\hat{\boldsymbol{\Psi}}^{-1/2}$ corresponding to eigenvalue $1 + \hat{\Delta}_j$. Here

$$\mathbf{S}_n = \frac{1}{n}\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' = \frac{n-1}{n}\,\mathbf{S} \quad \text{and} \quad \hat{\Delta}_1 \geq \hat{\Delta}_2 \geq \cdots \geq \hat{\Delta}_m$$

Also, at convergence,

$$\hat{\psi}_i = i\text{th diagonal element of } \mathbf{S}_n - \hat{\mathbf{L}}\hat{\mathbf{L}}' \tag{9A-2}$$
We avoid the details of the proof. However, it is evident that μ̂ = x̄ and a consideration of the log-likelihood leads to the maximization of −(n/2)[ln|Σ| + tr(Σ⁻¹Sₙ)] over L and Ψ. Equivalently, since Sₙ and p are constant with respect to the maximization, we minimize

h(L, Ψ) = ln|Σ| − ln|Sₙ| + tr(Σ⁻¹Sₙ) − p    (9A-3)

subject to L′Ψ⁻¹L = Δ, a diagonal matrix.

Comment. Lawley and Maxwell [20], along with many others who do factor analysis, use the unbiased estimate S of the covariance matrix instead of the maximum likelihood estimate Sₙ. Now, (n − 1)S has, for normal data, a Wishart distribution. [See (4-21) and (4-23).] If we ignore the contribution to the likelihood in (9-25) from the second term involving (μ − x̄), then maximizing the reduced likelihood over L and Ψ is equivalent to maximizing the Wishart likelihood

Likelihood ∝ |Σ|^{−(n−1)/2} e^{−[(n−1)/2] tr(Σ⁻¹S)}

over L and Ψ. Equivalently, we can minimize

ln|Σ| + tr(Σ⁻¹S)

or, as in (9A-3),

ln|Σ| − ln|S| + tr(Σ⁻¹S) − p

Under these conditions, Result 9A.1 holds with S in place of Sₙ. Also, for large n, S and Sₙ are almost identical, and the corresponding maximum likelihood estimates, L̂ and Ψ̂, would be similar. For testing the factor model [see (9-39)], |L̂L̂′ + Ψ̂| should be compared with |Sₙ| if the actual likelihood of (9-25) is employed, and |L̂L̂′ + Ψ̂| should be compared with |S| if the foregoing Wishart likelihood is used to derive L̂ and Ψ̂.

Recommended Computational Scheme

For m > 1, the condition L′Ψ⁻¹L = Δ effectively imposes m(m − 1)/2 constraints on the elements of L and Ψ, and the likelihood equations are solved, subject to these constraints, in an iterative fashion. One procedure is the following:

1. Compute initial estimates of the specific variances ψ₁, ψ₂, ..., ψₚ. Jöreskog [16] suggests setting

ψ̂ᵢ = (1 − (1/2)(m/p))(1/sⁱⁱ)    (9A-4)

where sⁱⁱ is the ith diagonal element of S⁻¹.

2. Given Ψ̂, compute the first m distinct eigenvalues, λ̂₁ > λ̂₂ > ··· > λ̂ₘ > 1, and corresponding eigenvectors, ê₁, ê₂, ..., êₘ, of the "uniqueness-rescaled" covariance matrix

S* = Ψ̂^{-1/2} Sₙ Ψ̂^{-1/2}    (9A-5)

Let Ê = [ê₁ ⋮ ê₂ ⋮ ··· ⋮ êₘ] be the p × m matrix of normalized eigenvectors and Λ̂ = diag[λ̂₁, λ̂₂, ..., λ̂ₘ] be the m × m diagonal matrix of eigenvalues.
532
Factor Ana lysis and I nference for Structu red Cova ria nce Matrices
Chapter 9
From (9A-1), Λ̂ = I + Δ̂ and Ê = Ψ̂^{-1/2} L̂ Δ̂^{-1/2}. Thus, we obtain the estimates

L̂ = Ψ̂^{1/2} Ê Δ̂^{1/2} = Ψ̂^{1/2} Ê (Λ̂ − I)^{1/2}    (9A-6)

3. Substitute L̂ obtained in (9A-6) into the likelihood function (9A-3), and minimize the result with respect to ψ̂₁, ψ̂₂, ..., ψ̂ₚ. A numerical search routine must be used. The values ψ̂₁, ψ̂₂, ..., ψ̂ₚ obtained from this minimization are employed at Step (2) to create a new L̂. Steps (2) and (3) are repeated until convergence; that is, until the differences between successive values of ℓ̂ᵢⱼ and ψ̂ᵢ are negligible.

Comment. It often happens that the objective function in (9A-3) has a relative minimum corresponding to negative values for some ψ̂ᵢ. This solution is clearly inadmissible and is said to be improper, or a Heywood case. For most packaged computer programs, negative ψ̂ᵢ, if they occur on a particular iteration, are changed to small positive numbers before proceeding with the next step.

Maximum Likelihood Estimators of ρ = Lz Lz′ + Ψz

When Σ has the factor analysis structure Σ = LL′ + Ψ, ρ can be factored as

ρ = V^{-1/2} Σ V^{-1/2} = (V^{-1/2}L)(V^{-1/2}L)′ + V^{-1/2} Ψ V^{-1/2} = Lz Lz′ + Ψz

The loading matrix for the standardized variables is Lz = V^{-1/2}L, and the corresponding specific variance matrix is Ψz = V^{-1/2} Ψ V^{-1/2}, where V^{-1/2} is the diagonal matrix with ith diagonal element σᵢᵢ^{-1/2}. If R is substituted for Sₙ in the objective function of (9A-3), the investigator minimizes

ln(|Lz Lz′ + Ψz| / |R|) + tr[(Lz Lz′ + Ψz)⁻¹ R] − p    (9A-7)

Introducing the diagonal matrix V̂^{1/2}, whose ith diagonal element is the square root of the ith diagonal element of Sₙ, we can write the objective function in (9A-7) as

ln(|V̂^{1/2} Lz Lz′ V̂^{1/2} + V̂^{1/2} Ψz V̂^{1/2}| / |Sₙ|) + tr[(V̂^{1/2} Lz Lz′ V̂^{1/2} + V̂^{1/2} Ψz V̂^{1/2})⁻¹ Sₙ] − p

≥ ln(|L̂L̂′ + Ψ̂| / |Sₙ|) + tr[(L̂L̂′ + Ψ̂)⁻¹ Sₙ] − p    (9A-8)

The last inequality follows because the maximum likelihood estimates L̂ and Ψ̂ minimize the objective function (9A-3). [Equality holds in (9A-8) for L̂z = V̂^{-1/2}L̂ and Ψ̂z = V̂^{-1/2}Ψ̂V̂^{-1/2}.] Therefore, minimizing (9A-7) over Lz and Ψz is equivalent to obtaining L̂z and Ψ̂z from L̂ and Ψ̂ by

L̂z = V̂^{-1/2}L̂  and  Ψ̂z = V̂^{-1/2}Ψ̂V̂^{-1/2}

The rationale for the latter procedure comes from the invariance property of maximum likelihood estimators. [See (4-20).]
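Steps 1 and 2 of the computational scheme above translate directly into code. The sketch below is an illustrative implementation of (9A-4) through (9A-6) for a single pass, using a small made-up covariance matrix; the numerical search of Step 3, and hence the full iteration, is omitted.

```python
import numpy as np

def ml_loading_step(S, psi, m):
    """Steps (9A-5)-(9A-6): given specific variances psi, rescale S,
    take the m largest eigenpairs, and back-transform to loadings."""
    psi_inv_sqrt = np.diag(psi ** -0.5)
    S_star = psi_inv_sqrt @ S @ psi_inv_sqrt           # (9A-5)
    vals, vecs = np.linalg.eigh(S_star)
    idx = np.argsort(vals)[::-1][:m]
    lam, E = vals[idx], vecs[:, idx]
    # L_hat = Psi^{1/2} E (Lambda - I)^{1/2}           # (9A-6)
    return np.diag(psi ** 0.5) @ E @ np.diag(np.sqrt(lam - 1.0))

# Joreskog's starting value (9A-4): psi_i = (1 - m/(2p)) * (1/s^{ii})
S = np.array([[2.0, 0.8, 0.6],
              [0.8, 1.5, 0.5],
              [0.6, 0.5, 1.0]])   # illustrative covariance matrix
p, m = 3, 1
psi0 = (1 - 0.5 * m / p) / np.diag(np.linalg.inv(S))
L1 = ml_loading_step(S, psi0, m)
print(np.round(L1, 3))
```

In a full implementation, one would alternate this step with a numerical minimization of (9A-3) over the specific variances until the changes in the estimates are negligible.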
EXERCISES

9.1. Show that the covariance matrix

ρ = [1.00  .63  .45]
    [ .63 1.00  .35]
    [ .45  .35 1.00]

for the p = 3 standardized random variables Z₁, Z₂, and Z₃ can be generated by the m = 1 factor model

Z₁ = .9F₁ + ε₁
Z₂ = .7F₁ + ε₂
Z₃ = .5F₁ + ε₃

where Var(F₁) = 1, Cov(ε, F₁) = 0, and

Ψ = Cov(ε) = [.19   0    0 ]
             [ 0   .51   0 ]
             [ 0    0   .75]

That is, write ρ in the form ρ = LL′ + Ψ.

9.2. Use the information in Exercise 9.1.
(a) Calculate communalities hᵢ², i = 1, 2, 3, and interpret these quantities.
(b) Calculate Corr(Zᵢ, F₁) for i = 1, 2, 3. Which variable might carry the greatest weight in "naming" the common factor? Why?

9.3. The eigenvalues and eigenvectors of the correlation matrix ρ in Exercise 9.1 are

λ₁ = 1.96,  e₁′ = [.625, .593, .507]
λ₂ = .68,   e₂′ = [−.219, −.491, .843]
λ₃ = .36,   e₃′ = [.749, −.638, −.177]

(a) Assuming an m = 1 factor model, calculate the loading matrix L and matrix of specific variances Ψ using the principal component solution method. Compare the results with those in Exercise 9.1.
(b) What proportion of the total population variance is explained by the first common factor?

9.4. Given ρ and Ψ in Exercise 9.1 and an m = 1 factor model, calculate the reduced correlation matrix ρ̃ = ρ − Ψ and the principal factor solution for the loading matrix L. Is the result consistent with the information in Exercise 9.1? Should it be?

9.5. Establish the inequality (9-19).
Hint: Since S − L̃L̃′ − Ψ̃ has zeros on the diagonal,

(sum of squared entries of S − L̃L̃′ − Ψ̃) ≤ (sum of squared entries of S − L̃L̃′)

Now, S − L̃L̃′ = λ̂ₘ₊₁êₘ₊₁ê′ₘ₊₁ + ··· + λ̂ₚêₚê′ₚ = P₍₂₎Λ₍₂₎P′₍₂₎, where P₍₂₎ = [êₘ₊₁ ⋮ ··· ⋮ êₚ] and Λ₍₂₎ is the diagonal matrix with elements λ̂ₘ₊₁, ..., λ̂ₚ. Use (sum of squared entries of A) = tr AA′ and tr[P₍₂₎Λ₍₂₎Λ₍₂₎P′₍₂₎] = tr[Λ₍₂₎Λ₍₂₎].
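The factorization asked for in Exercise 9.1 can be confirmed by direct arithmetic; this quick numerical check uses the loadings and specific variances stated in the exercise.

```python
import numpy as np

L = np.array([[0.9], [0.7], [0.5]])        # loadings on the single factor
Psi = np.diag([0.19, 0.51, 0.75])          # specific variances
rho = np.array([[1.00, 0.63, 0.45],
                [0.63, 1.00, 0.35],
                [0.45, 0.35, 1.00]])

# the m = 1 factor model reproduces rho exactly
assert np.allclose(L @ L.T + Psi, rho)
h2 = (L**2).sum(axis=1)                    # communalities h_i^2 = l_i1^2
print("communalities:", h2)
```

Since the loadings here equal Corr(Zᵢ, F₁), the communalities .81, .49, and .25 are the squared correlations of each variable with the common factor, which is what Exercise 9.2 asks you to interpret.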
9.6. Verify the following matrix identities.
(a) (I + L′Ψ⁻¹L)⁻¹L′Ψ⁻¹L = I − (I + L′Ψ⁻¹L)⁻¹
Hint: Premultiply both sides by (I + L′Ψ⁻¹L).
(b) (LL′ + Ψ)⁻¹ = Ψ⁻¹ − Ψ⁻¹L(I + L′Ψ⁻¹L)⁻¹L′Ψ⁻¹
Hint: Postmultiply both sides by (LL′ + Ψ) and use (a).
(c) L′(LL′ + Ψ)⁻¹ = (I + L′Ψ⁻¹L)⁻¹L′Ψ⁻¹
Hint: Postmultiply the result in (b) by L, use (a), and take the transpose, noting that (LL′ + Ψ)⁻¹, Ψ⁻¹, and (I + L′Ψ⁻¹L)⁻¹ are symmetric matrices.

9.7. (The factor model parameterization need not be unique.) Let the factor model with p = 2 and m = 1 prevail. Show that

σ₁₁ = ℓ₁₁² + ψ₁,  σ₁₂ = σ₂₁ = ℓ₁₁ℓ₂₁,  σ₂₂ = ℓ₂₁² + ψ₂

and, for given σ₁₁, σ₂₂, and σ₁₂, there is an infinity of choices for L and Ψ.

9.8. (Unique but improper solution: Heywood case.) Consider an m = 1 factor model for the population with covariance matrix

Σ = [1  .4  .9]
    [.4  1  .7]
    [.9 .7   1]

Show that there is a unique choice of L and Ψ with Σ = LL′ + Ψ, but that ψ₃ < 0, so the choice is not admissible.

9.9. In a study of liquor preference in France, Stoetzel [25] collected preference rankings of p = 9 liquor types from n = 1442 individuals. A factor analysis of the 9 × 9 sample correlation matrix of rank orderings gave the following estimated loadings:

                Estimated factor loadings
Variable (X₁)     F₁     F₂     F₃
Liquors          .64    .02    .16
Kirsch           .50   −.06   −.10
Mirabelle        .46   −.24   −.19
Rum              .17    .74    .97*
Marc            −.29    .66   −.39
Whiskey         −.29   −.08    .09
Calvados        −.49    .20   −.04
Cognac          −.52   −.03    .42
Armagnac        −.60   −.17    .14

* This figure is too high. It exceeds the maximum value of .64, as a result of an approximation method for obtaining the estimated factor loadings used by Stoetzel.
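Identities (b) and (c) of Exercise 9.6 can be spot-checked numerically before proving them; the L and Ψ below are borrowed from Exercise 9.1 purely as convenient test values.

```python
import numpy as np

L = np.array([[0.9], [0.7], [0.5]])
Psi = np.diag([0.19, 0.51, 0.75])
Psi_inv = np.linalg.inv(Psi)

lhs = np.linalg.inv(L @ L.T + Psi)
middle = np.linalg.inv(np.eye(1) + L.T @ Psi_inv @ L)

# identity (b): (LL' + Psi)^{-1} = Psi^{-1} - Psi^{-1} L (I + L'Psi^{-1}L)^{-1} L' Psi^{-1}
rhs = Psi_inv - Psi_inv @ L @ middle @ L.T @ Psi_inv
assert np.allclose(lhs, rhs)

# identity (c): L'(LL' + Psi)^{-1} = (I + L'Psi^{-1}L)^{-1} L' Psi^{-1}
assert np.allclose(L.T @ lhs, middle @ L.T @ Psi_inv)
print("identities (b) and (c) verified")
```

Identity (b) is the reason factor-analysis software can invert the p × p matrix LL′ + Ψ by inverting only an m × m matrix, which matters when p is large and m is small.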
Given these results, Stoetzel concluded the following: The major principle of liquor preference in France is the distinction between sweet and strong liquors. The second motivating element is price, which can be understood by remembering that liquor is both an expensive commodity and an item of conspicuous consumption. Except in the case of the two most popular and least expensive liquors (rum and marc), this second factor plays a much smaller role in producing preference judgments. The third factor concerns the sociological and primarily the regional, variability of the judgments. (See [25], p. 11.)

(a) Given what you know about the various liquors involved, does Stoetzel's interpretation seem reasonable?
(b) Plot the loading pairs for the first two factors. Conduct a graphical orthogonal rotation of the factor axes. Generate approximate rotated loadings. Interpret the rotated loadings for the first two factors. Does your interpretation agree with Stoetzel's interpretation of these factors from the unrotated loadings? Explain.

9.10. The correlation matrix for chicken-bone measurements (see Example 9.14) is

1.000
 .505  1.000
 .569   .422  1.000
 .602   .467   .926  1.000
 .621   .482   .877   .874  1.000
 .603   .450   .878   .894   .937  1.000

The following estimated factor loadings were extracted by the maximum likelihood procedure:

                    Estimated          Rotated estimated
                    factor loadings    factor loadings
Variable              F₁      F₂         F₁*     F₂*
1. Skull length      .602    .200        .484    .411
2. Skull breadth     .467    .154        .375    .319
3. Femur length      .926    .143        .603    .717
4. Tibia length      .874    .476        .861    .499
5. Humerus length   1.000    .000        .519    .855
6. Ulna length       .894    .327        .744    .594

Using the unrotated estimated factor loadings, obtain the maximum likelihood estimates of the following.
(a) The specific variances.
(b) The communalities.
(c) The proportion of variance explained by each factor.
(d) The residual matrix R − L̂z L̂′z − Ψ̂z.
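Exercise 9.11 below asks for the value of the varimax criterion; it can be computed directly from the loading table above. The function follows the chapter's definition of the criterion, with squared loadings scaled by the communalities; the numerical values are transcribed from the table above.

```python
import numpy as np

def varimax_criterion(L):
    """V = (1/p) * sum_j [ sum_i lt_ij^4 - (sum_i lt_ij^2)^2 / p ],
    where lt_ij = l_ij / h_i scales each row by its communality."""
    L = np.asarray(L, dtype=float)
    p = L.shape[0]
    h = np.sqrt((L**2).sum(axis=1))
    Lt = L / h[:, None]
    return ((Lt**4).sum(axis=0) - (Lt**2).sum(axis=0)**2 / p).sum() / p

L_unrot = np.array([[.602, .200], [.467, .154], [.926, .143],
                    [.874, .476], [1.000, .000], [.894, .327]])
L_rot = np.array([[.484, .411], [.375, .319], [.603, .717],
                  [.861, .499], [.519, .855], [.744, .594]])
print(varimax_criterion(L_unrot), varimax_criterion(L_rot))
```

Because a varimax rotation maximizes this criterion, the rotated loadings should give the larger value, which is what the comparison asked for in Exercise 9.11 illustrates.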
9.11. Refer to Exercise 9.10. Compute the value of the varimax criterion using both unrotated and rotated estimated factor loadings. Comment on the results.

9.12. The covariance matrix for the logarithms of turtle measurements (see Example 8.4) is

S = 10⁻³ × [11.072              ]
           [ 8.019  6.417       ]
           [ 8.160  6.005  6.773]

The following maximum likelihood estimates of the factor loadings for an m = 1 model were obtained:

                 Estimated factor
Variable         loadings F₁
1. ln(length)      .1022
2. ln(width)       .0752
3. ln(height)      .0765

Using the estimated factor loadings, obtain the maximum likelihood estimates of each of the following.
(a) Specific variances.
(b) Communalities.
(c) Proportion of variance explained by the factor.
(d) The residual matrix Sₙ − L̂L̂′ − Ψ̂.
Hint: Convert S to Sₙ.

9.13. Refer to Exercise 9.12. Compute the test statistic in (9-39). Indicate why a test of H₀: Σ = LL′ + Ψ (with m = 1) versus H₁: Σ unrestricted cannot be carried out for this example. [See (9-40).]

9.14. The maximum likelihood factor loading estimates are given in (9A-6) by L̂ = Ψ̂^{1/2} Ê Δ̂^{1/2}. Verify, for this choice, that L̂′Ψ̂⁻¹L̂ = Δ̂, where Δ̂ = Λ̂ − I is a diagonal matrix.

9.15. Hirschey and Wichern [15] investigate the consistency, determinants, and uses of accounting and market-value measures of profitability. As part of their study, a factor analysis of accounting profit measures and market estimates of economic profits was conducted. The correlation matrix of accounting historical, accounting replacement, and market-value measures of profitability for a sample of firms operating in 1977 is as follows:
Variable                                HRA    HRE    HRS    RRA    RRE    RRS    Q      REV
Historical return on assets, HRA       1.000
Historical return on equity, HRE        .738  1.000
Historical return on sales, HRS         .731   .520  1.000
Replacement return on assets, RRA       .828   .688   .652  1.000
Replacement return on equity, RRE       .681   .831   .513   .887  1.000
Replacement return on sales, RRS        .712   .543   .826   .867   .692  1.000
Market Q ratio, Q                       .625   .322   .579   .639   .419   .608  1.000
Market relative excess value, REV       .604   .303   .617   .563   .352   .610   .937  1.000

The following rotated principal component estimates of factor loadings for an m = 3 factor model were obtained:

                                   Estimated factor loadings
Variable                             F₁      F₂      F₃
Historical return on assets         .433    .612    .499
Historical return on equity         .125    .892    .234
Historical return on sales          .296    .238    .887
Replacement return on assets        .406    .708    .483
Replacement return on equity        .198    .895    .283
Replacement return on sales         .331    .414    .789
Market Q ratio                      .928    .160    .294
Market relative excess value        .910    .079    .355
Cumulative proportion of
total variance explained            .287    .628    .908

(a) Using the estimated factor loadings, determine the specific variances and communalities.
(b) Determine the residual matrix R − L̂z L̂′z − Ψ̂z. Given this information and the cumulative proportion of total variance explained in the preceding table, does an m = 3 factor model appear appropriate for these data?
(c) Assuming that estimated loadings less than .4 are small, interpret the three factors. Does it appear, for example, that market-value measures provide evidence of profitability distinct from that provided by accounting measures? Can you separate accounting historical measures of profitability from accounting replacement measures?

9.16. Verify that factor scores constructed according to (9-50) have sample mean vector 0 and zero sample covariances.

9.17. Consider the LISREL model in Example 9.16. Interchange 1 and λ₁ in the parameter matrix Λy, and interchange λ₂ and 1 in the parameter matrix Λx. Using the information provided in the example, solve for the model parameters. Explain why the scales of the structural variables η and ξ must be fixed.
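Returning to Exercise 9.15(a): the communalities and specific variances come straight from the rotated loadings, since hᵢ² is the row sum of squared loadings and, on the correlation scale, ψ̂ᵢ = 1 − hᵢ². The values below are transcribed from the loading table above.

```python
import numpy as np

# rotated loadings for the eight profitability measures
L = np.array([[.433, .612, .499], [.125, .892, .234],
              [.296, .238, .887], [.406, .708, .483],
              [.198, .895, .283], [.331, .414, .789],
              [.928, .160, .294], [.910, .079, .355]])
h2 = (L**2).sum(axis=1)       # communalities
psi = 1 - h2                  # specific variances
print(np.round(h2, 3))
print(np.round(psi, 3))
```

As a consistency check, the average communality equals the cumulative proportion of total variance explained by all three factors, about .908, matching the last entry of the table.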
The following exercises require the use of a computer.

9.18. Refer to Exercise 8.16 concerning the numbers of fish caught.
(a) Using only the measurements x₁ − x₄, obtain the principal component solution for factor models with m = 1 and m = 2.
(b) Using only the measurements x₁ − x₄, obtain the maximum likelihood solution for factor models with m = 1 and m = 2.
(c) Rotate your solutions in Parts (a) and (b). Compare the solutions and comment on them. Interpret each factor.
(d) Perform a factor analysis using the measurements x₁ − x₆. Determine a reasonable number of factors m, and compare the principal component and maximum likelihood solutions after rotation. Interpret the factors.

9.19. A firm is attempting to evaluate the quality of its sales staff and is trying to find an examination or series of tests that may reveal the potential for good performance in sales. The firm has selected a random sample of 50 salespeople and has evaluated each on 3 measures of performance: growth of sales, profitability of sales, and new-account sales. These measures have been converted to a scale, on which 100 indicates "average" performance. Each of the 50 individuals took each of 4 tests, which purported to measure creativity, mechanical reasoning, abstract reasoning, and mathematical ability, respectively. The n = 50 observations on p = 7 variables are listed in Table 9.12.
(a) Assume an orthogonal factor model for the standardized variables Zᵢ = (Xᵢ − μᵢ)/√σᵢᵢ, i = 1, 2, ..., 7. Obtain either the principal component solution or the maximum likelihood solution for m = 2 and m = 3 common factors.
(b) Given your solution in (a), obtain the rotated loadings for m = 2 and m = 3. Compare the two sets of rotated loadings. Interpret the m = 2 and m = 3 factor solutions.
(c) List the estimated communalities, specific variances, and L̂L̂′ + Ψ̂ for the m = 2 and m = 3 solutions. Compare the results. Which choice of m do you prefer at this point? Why?
(d) Conduct a test of H₀: Σ = LL′ + Ψ versus H₁: Σ ≠ LL′ + Ψ for both m = 2 and m = 3 at the α = .01 level. With these results and those in Parts b and c, which choice of m appears to be the best?
(e) Suppose a new salesperson, selected at random, obtains the test scores x′ = [x₁, x₂, ..., x₇] = [110, 98, 105, 15, 18, 12, 35]. Calculate the salesperson's factor score using the weighted least squares method and the regression method.
Note: The components of x must be standardized using the sample means and variances calculated from the original data.

9.20. Using the air-pollution variables X₁, X₂, X₅, and X₆ given in Table 1.5, generate the sample covariance matrix.
(a) Obtain the principal component solution to a factor model with m = 1 and m = 2.
(b) Find the maximum likelihood estimates of L and Ψ for m = 1 and m = 2.
(c) Compare the factorization obtained by the principal component and maximum likelihood methods.
TABLE 9.12 SALESPEOPLE DATA

Salesperson 1-50. Columns: Index of sales growth (x₁); Index of sales profitability (x₂); Index of new-account sales (x₃); Score on creativity test (x₄); Score on mechanical reasoning test (x₅); Score on abstract reasoning test (x₆); Score on mathematics test (x₇). (Data for the 50 salespeople omitted.)

9.21. Perform a varimax rotation of both m = 2 solutions in Exercise 9.20. Interpret the results. Are the principal component and maximum likelihood solutions consistent with each other?

9.22. Refer to Exercise 9.20.
(a) Calculate the factor scores from the m = 2 maximum likelihood estimates by (i) weighted least squares in (9-50) and (ii) the regression approach of (9-58).
(b) Find the factor scores from the principal component solution, using (9-51).
(c) Compare the three sets of factor scores.

9.23. Repeat Exercise 9.20, starting from the sample correlation matrix. Interpret the factors for the m = 1 and m = 2 solutions. Does it make a difference if R, rather than S, is factored? Explain.

9.24. Perform a factor analysis of the census-tract data in Table 8.5. Start with R and obtain both the maximum likelihood and principal component solutions. Comment on your choice of m. Your analysis should include factor rotation and the computation of factor scores.

9.25. Perform a factor analysis of the "stiffness" measurements given in Table 4.3 and discussed in Example 4.14. Compute factor scores, and check for outliers in the data. Use the sample covariance matrix S.

9.26. Consider the mice-weight data in Example 8.6. Start with the sample covariance matrix. (See Exercise 8.15 for √sᵢᵢ.)
(a) Obtain the principal component solution to the factor model with m = 1 and m = 2.
(b) Find the maximum likelihood estimates of the loadings and specific variances for m = 1 and m = 2.
(c) Perform a varimax rotation of the solutions in Parts a and b.
9.27. Repeat Exercise 9.26 by factoring R instead of the sample covariance matrix S. Also, for the mouse with standardized weights [.8, −.2, −.6, 1.5], obtain the factor scores using the maximum likelihood estimates of the loadings and Equation (9-58).

9.28. Perform a factor analysis of the national track records for women given in Table 1.9. Use the sample covariance matrix S and interpret the factors. Compute factor scores, and check for outliers in the data. Repeat the analysis with the sample correlation matrix R. Does it make a difference if R, rather than S, is factored? Explain.

9.29. Refer to Exercise 9.28. Convert the national track records for women to speeds measured in meters per second. (See Exercise 8.19.) Perform a factor analysis of the speed data. Use the sample covariance matrix S and interpret the factors. Compute factor scores, and check for outliers in the data. Repeat the analysis with the sample correlation matrix R. Does it make a difference if R, rather than S, is factored? Explain. Compare your results with the results in Exercise 9.28. Which analysis do you prefer? Why?

9.30. Perform a factor analysis of the national track records for men given in Table 8.6. Repeat the steps given in Exercise 9.28. Is the appropriate factor model for the men's data different from the one for the women's data? If not, are the interpretations of the factors roughly the same? If the models are different, explain the differences.

9.31. Refer to Exercise 9.30. Convert the national track records for men to speeds measured in meters per second. (See Exercise 8.21.) Perform a factor analysis of the speed data. Use the sample covariance matrix S and interpret the factors. Compute factor scores, and check for outliers in the data. Repeat the analysis with the sample correlation matrix R. Does it make a difference if R, rather than S, is factored? Explain. Compare your results with the results in Exercise 9.30. Which analysis do you prefer? Why?

9.32. Perform a factor analysis of the data on bulls given in Table 1.10. Use the seven variables YrHgt, FtFrBody, PrctFFB, Frame, BkFat, SaleHt, and SaleWt. Factor the sample covariance matrix S and interpret the factors. Compute factor scores, and check for outliers. Repeat the analysis with the sample correlation matrix R. Compare the results obtained from S with the results from R. Does it make a difference if R, rather than S, is factored? Explain.
REFERENCES
1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (2d ed.). New York: John Wiley, 1984.
2. Bartholomew, D. J. Latent Variable Models and Factor Analysis. London: Griffin, 1987.
3. Bartlett, M. S. "The Statistical Conception of Mental Factors." British Journal of Psychology, 28 (1937), 97-104.
4. Bartlett, M. S. "A Note on Multiplying Factors for Various Chi-Squared Approxima tions." Journal of the Royal Statistical Society (B) , 16 (1954), 296-298.
5. Bentler, P. M. "Multivariate Analysis with Latent Variables: Causal Models." Annual Review of Psychology, 31 (1980), 419-456.
6. Bielby, W. T., and R. M. Hauser. "Structural Equation Models." Annual Review of Sociology, 3 (1977), 137-161.
7. Bollen, K. A. Structural Equations with Latent Variables. New York: John Wiley, 1989.
8. Dixon, W. J., ed. BMDP Biomedical Computer Programs. Berkeley, CA: University of California Press, 1979.
9. Duncan, O. D. Introduction to Structural Equation Models. New York: Academic Press, 1975.
10. Dunn, L. C. "The Effect of Inbreeding on the Bones of the Fowl." Storrs Agricultural Experimental Station Bulletin, 52 (1928), 1-112.
11. Goldberger, A. S. "Structural Equation Methods in the Social Sciences." Econometrica, 40 (1972), 979-1001.
12. Harmon, H. H. Modern Factor Analysis. Chicago: The University of Chicago Press, 1967.
13. Hayduk, L. A. Structural Equation Modeling with LISREL. Baltimore: The Johns Hopkins University Press, 1987.
14. Heise, D. R. Causal Analysis. New York: John Wiley, 1975.
15. Hirschey, M., and D. W. Wichern. "Accounting and Market-Value Measures of Profitability: Consistency, Determinants and Uses." Journal of Business and Economic Statistics, 2, no. 4 (1984), 375-383.
16. Joreskog, K. G. "Factor Analysis by Least Squares and Maximum Likelihood." In Sta tistical Methods for Digital Computers, edited by K. Enslein, A. Ralston, and H. S. Wilf. New York: John Wiley, 1975.
17. Joreskog, K. G., and D. Sorbom. Advances in Factor Analysis and Structural Equation Models. Cambridge, MA: Abt Books, 1979.
18. Joreskog, K., and D. Sorbom. LISREL 8: User's Reference Guide. Chicago: Scientific Software International, 1996.
19. Kaiser, H. F. "The Varimax Criterion for Analytic Rotation in Factor Analysis." Psychometrika, 23 (1958), 187-200.
20. Lawley, D. N., and A. E. Maxwell. Factor Analysis as a Statistical Method (2d ed.). New York: American Elsevier Publishing Co., 1971.
21. Linden, M. "A Factor Analytic Study of Olympic Decathlon Data." Research Quarterly, 48, no. 3 (1977), 562-568.
22. Maxwell, A. E. Multivariate Analysis in Behavioral Research. London: Chapman and Hall, 1977.
23. Miller, D. "Stable in the Saddle, CEO Tenure and the Match between Organization and Environment." Management Science, 37, no. 1 (1991), 34-52.
24. Morrison, D. F. Multivariate Statistical Methods (2d ed.). New York: McGraw-Hill, 1976.
25. Stoetzel, J. "A Factor Analysis of Liquor Preference." Journal of Advertising Research, 1 (1960), 7-11.
26. Wright, S. "The Interpretation of Multivariate Systems." In Statistics and Mathematics in Biology, edited by O. Kempthorne and others. Ames, IA: Iowa State University Press, 1954, 11-33.
CHAPTER 10

Canonical Correlation Analysis

10.1 INTRODUCTION
Canonical correlation analysis seeks to identify and quantify the associations between two sets of variables. H. Hotelling ([5], [6]), who initially developed the technique, provided the example of relating arithmetic speed and arithmetic power to reading speed and reading power. (See Exercise 10.9.) Other examples include relating governmental policy variables with economic goal variables and relating college "performance" variables with precollege "achievement" variables.

Canonical correlation analysis focuses on the correlation between a linear combination of the variables in one set and a linear combination of the variables in another set. The idea is first to determine the pair of linear combinations having the largest correlation. Next, we determine the pair of linear combinations having the largest correlation among all pairs uncorrelated with the initially selected pair, and so on. The pairs of linear combinations are called the canonical variables, and their correlations are called canonical correlations. The canonical correlations measure the strength of association between the two sets of variables. The maximization aspect of the technique represents an attempt to concentrate a high-dimensional relationship between two sets of variables into a few pairs of canonical variables.

10.2 CANONICAL VARIATES AND CANONICAL CORRELATIONS
We shall be interested in measures of association between two groups of variables. The first group, of p variables, is represented by the (p × 1) random vector X⁽¹⁾. The second group, of q variables, is represented by the (q × 1) random vector X⁽²⁾. We assume, in the theoretical development, that X⁽¹⁾ represents the smaller set, so that p ≤ q.
For the random vectors X⁽¹⁾ and X⁽²⁾, let

E(X⁽¹⁾) = μ⁽¹⁾;  Cov(X⁽¹⁾) = Σ₁₁
E(X⁽²⁾) = μ⁽²⁾;  Cov(X⁽²⁾) = Σ₂₂
Cov(X⁽¹⁾, X⁽²⁾) = Σ₁₂ = Σ′₂₁    (10-1)

It will be convenient to consider X⁽¹⁾ and X⁽²⁾ jointly, so, using results (2-38) through (2-40) and (10-1), we find that the ((p + q) × 1) random vector

X = [X⁽¹⁾]
    [X⁽²⁾]    (10-2)

has mean vector

μ = E(X) = [μ⁽¹⁾]
           [μ⁽²⁾]    (10-3)

and ((p + q) × (p + q)) covariance matrix

Σ = E(X − μ)(X − μ)′ = [Σ₁₁  Σ₁₂]
                       [Σ₂₁  Σ₂₂]    (10-4)

The covariances between pairs of variables from different sets, one variable from X⁽¹⁾ and one variable from X⁽²⁾, are contained in Σ₁₂ or, equivalently, in Σ₂₁. That is, the pq elements of Σ₁₂ measure the association between the two sets. When p and q are relatively large, interpreting the elements of Σ₁₂ collectively is ordinarily hopeless. Moreover, it is often linear combinations of variables that are interesting and useful for predictive or comparative purposes. The main task of canonical correlation analysis is to summarize the associations between the X⁽¹⁾ and X⁽²⁾ sets in terms of a few carefully chosen covariances (or correlations) rather than the pq covariances in Σ₁₂.
Linear combinations provide simple summary measures of a set of variables. Set

U = a′X⁽¹⁾,  V = b′X⁽²⁾    (10-5)

for some pair of coefficient vectors a and b. Then, using (10-5) and (2-45), we obtain

Var(U) = a′ Cov(X⁽¹⁾) a = a′Σ₁₁a
Var(V) = b′ Cov(X⁽²⁾) b = b′Σ₂₂b    (10-6)
Cov(U, V) = a′ Cov(X⁽¹⁾, X⁽²⁾) b = a′Σ₁₂b

We shall seek coefficient vectors a and b such that

Corr(U, V) = a′Σ₁₂b / (√(a′Σ₁₁a) √(b′Σ₂₂b))    (10-7)

is as large as possible. We define the following:

The first pair of canonical variables, or first canonical variate pair, is the pair of linear combinations U₁, V₁ having unit variances, which maximize the correlation (10-7);

The second pair of canonical variables, or second canonical variate pair, is the pair of linear combinations U₂, V₂ having unit variances, which maximize the correlation (10-7) among all choices that are uncorrelated with the first pair of canonical variables.

At the kth step,

The kth pair of canonical variables, or kth canonical variate pair, is the pair of linear combinations Uₖ, Vₖ having unit variances, which maximize the correlation (10-7) among all choices uncorrelated with the previous k − 1 canonical variable pairs.

The correlation between the kth pair of canonical variables is called the kth canonical correlation.

The following result gives the necessary details for obtaining the canonical variables and their correlations.
attained by the linear combinations (first canonical variate pair)

    U1 = e1' Sigma11^{-1/2} X(1)   and   V1 = f1' Sigma22^{-1/2} X(2)

The kth pair of canonical variates, k = 2, 3, ..., p,

    Uk = ek' Sigma11^{-1/2} X(1)     Vk = fk' Sigma22^{-1/2} X(2)

maximizes

    Corr(Uk, Vk) = rhok*

among those linear combinations uncorrelated with the preceding 1, 2, ..., k - 1 canonical variables.

Here rho1*^2 >= rho2*^2 >= ... >= rhop*^2 are the eigenvalues of Sigma11^{-1/2} Sigma12 Sigma22^{-1} Sigma21 Sigma11^{-1/2}, and e1, e2, ..., ep are the associated (p x 1) eigenvectors. The quantities rho1*^2, rho2*^2, ..., rhop*^2 are also the p largest eigenvalues of the matrix Sigma22^{-1/2} Sigma21 Sigma11^{-1} Sigma12 Sigma22^{-1/2}, with corresponding (q x 1) eigenvectors f1, f2, ..., fp. Each fi is proportional to Sigma22^{-1/2} Sigma21 Sigma11^{-1/2} ei.

The canonical variates have the properties

    Var(Uk) = Var(Vk) = 1
    Cov(Uk, Ul) = Corr(Uk, Ul) = 0,   k != l
    Cov(Vk, Vl) = Corr(Vk, Vl) = 0,   k != l
    Cov(Uk, Vl) = Corr(Uk, Vl) = 0,   k != l

for k, l = 1, 2, ..., p.

Proof. We assume that Sigma11 and Sigma22 are nonsingular.(1) Introduce the symmetric square-root matrices Sigma11^{1/2} and Sigma22^{1/2} with Sigma11 = Sigma11^{1/2} Sigma11^{1/2} and Sigma11^{-1} = Sigma11^{-1/2} Sigma11^{-1/2} [see (2-22)]. Set c = Sigma11^{1/2} a and d = Sigma22^{1/2} b, so a = Sigma11^{-1/2} c and b = Sigma22^{-1/2} d. Then

    Corr(a'X(1), b'X(2)) = a' Sigma12 b / (sqrt(a' Sigma11 a) sqrt(b' Sigma22 b))
                         = c' Sigma11^{-1/2} Sigma12 Sigma22^{-1/2} d / (sqrt(c'c) sqrt(d'd))     (10-8)

By the Cauchy-Schwarz inequality (2-48),

    c' Sigma11^{-1/2} Sigma12 Sigma22^{-1/2} d
        <= [c' Sigma11^{-1/2} Sigma12 Sigma22^{-1} Sigma21 Sigma11^{-1/2} c]^{1/2} [d'd]^{1/2}    (10-9)

Since Sigma11^{-1/2} Sigma12 Sigma22^{-1} Sigma21 Sigma11^{-1/2} is a (p x p) symmetric matrix, the maximization result (2-51) yields

    c' Sigma11^{-1/2} Sigma12 Sigma22^{-1} Sigma21 Sigma11^{-1/2} c <= lambda1 c'c               (10-10)

where lambda1 is the largest eigenvalue of Sigma11^{-1/2} Sigma12 Sigma22^{-1} Sigma21 Sigma11^{-1/2}.

(1) If Sigma11 or Sigma22 is singular, one or more variables may be deleted from the appropriate set, and the linear combinations a'X(1) and b'X(2) can be expressed in terms of the reduced set. If p > rank(Sigma12) = p1, then the nonzero canonical correlations are rho1*, ..., rho_{p1}*.
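The eigen-characterization above maps directly onto code. Below is a minimal sketch (NumPy assumed; the function name `pop_canonical` and the illustrative blocks `S11`, `S12`, `S22` are mine, not from the text) of computing rhok*, ak, and bk exactly as Result 10.1 prescribes:

```python
import numpy as np

def pop_canonical(S11, S12, S22):
    """Canonical correlations rho*_k and coefficient vectors per Result 10.1."""
    def inv_sqrt(M):
        # symmetric inverse square root: M^{-1/2} = P diag(lam^{-1/2}) P'
        lam, P = np.linalg.eigh(M)
        return P @ np.diag(lam ** -0.5) @ P.T

    S11_ih, S22_ih = inv_sqrt(S11), inv_sqrt(S22)
    K = S11_ih @ S12 @ np.linalg.inv(S22) @ S12.T @ S11_ih   # symmetric p x p
    lam, E = np.linalg.eigh(K)
    order = np.argsort(lam)[::-1]                            # descending
    lam, E = lam[order], E[:, order]
    rho = np.sqrt(np.clip(lam, 0.0, None))                   # rho*_1 >= rho*_2 >= ...
    A = (S11_ih @ E).T                                       # rows a_k' = e_k' S11^{-1/2}
    # f_k proportional to S22^{-1/2} S21 S11^{-1/2} e_k; dividing by rho_k gives unit length
    # (assumes S12 has full rank p, so no rho_k is zero)
    F = S22_ih @ S12.T @ S11_ih @ E / rho
    B = (S22_ih @ F).T                                       # rows b_k' = f_k' S22^{-1/2}
    return rho, A, B

# Illustrative blocks (not from the text)
S11 = np.array([[2.0, 0.5], [0.5, 1.0]])
S12 = np.array([[0.7, 0.3], [0.2, 0.4]])
S22 = np.array([[1.5, 0.2], [0.2, 1.0]])
rho, A, B = pop_canonical(S11, S12, S22)
# Unit variances and Corr(U1, V1) = rho*_1, as Result 10.1 asserts
```

The self-consistency checks a1' Sigma11 a1 = 1, b1' Sigma22 b1 = 1, and a1' Sigma12 b1 = rho1* hold for any valid input, which makes this routine easy to validate.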
Equality occurs in (10-10) for c = e1, the normalized eigenvector associated with lambda1. Equality also holds in (10-9) if d is proportional to Sigma22^{-1/2} Sigma21 Sigma11^{-1/2} e1. Thus,

    max over a, b of Corr(a'X(1), b'X(2)) = sqrt(lambda1)               (10-11)

with equality occurring for a = Sigma11^{-1/2} c = Sigma11^{-1/2} e1 and b = Sigma22^{-1/2} d, with d = f1 proportional to Sigma22^{-1/2} Sigma21 Sigma11^{-1/2} e1, where the sign is selected to give positive correlation. This last correspondence follows by multiplying both sides of

    (Sigma11^{-1/2} Sigma12 Sigma22^{-1} Sigma21 Sigma11^{-1/2}) e1 = lambda1 e1    (10-12)

by Sigma22^{-1/2} Sigma21 Sigma11^{-1/2}. Thus, if (lambda1, e1) is an eigenvalue-eigenvector pair for Sigma11^{-1/2} Sigma12 Sigma22^{-1} Sigma21 Sigma11^{-1/2}, then (lambda1, f1), with f1 proportional to Sigma22^{-1/2} Sigma21 Sigma11^{-1/2} e1, is an eigenvalue-eigenvector pair for Sigma22^{-1/2} Sigma21 Sigma11^{-1} Sigma12 Sigma22^{-1/2}. We have demonstrated that U1 = e1' Sigma11^{-1/2} X(1) and V1 = f1' Sigma22^{-1/2} X(2) are the first pair of canonical variables and that their correlation is rho1* = sqrt(lambda1). Also,

    Var(U1) = e1' Sigma11^{-1/2} Sigma11 Sigma11^{-1/2} e1 = e1'e1 = 1

and, similarly, Var(V1) = 1.

Continuing, we note that U1 and an arbitrary linear combination c' Sigma11^{-1/2} X(1) are uncorrelated if c is perpendicular to e1. At the kth stage, we require that c be perpendicular to e1, e2, ..., e_{k-1}. The maximization result (2-52) then yields

    c' Sigma11^{-1/2} Sigma12 Sigma22^{-1} Sigma21 Sigma11^{-1/2} c <= lambdak c'c

and, by (10-8),

    Corr(a'X(1), b'X(2)) = c' Sigma11^{-1/2} Sigma12 Sigma22^{-1/2} d / (sqrt(c'c) sqrt(d'd)) <= sqrt(lambdak)

with equality for c = ek, d = fk, or a = Sigma11^{-1/2} ek and b = Sigma22^{-1/2} fk, as before. Thus, Uk = ek' Sigma11^{-1/2} X(1) and Vk = fk' Sigma22^{-1/2} X(2) are the kth canonical pair, and they have correlation sqrt(lambdak) = rhok*. Although we did not explicitly require Uk and Vl to be uncorrelated for k != l,

    Cov(Uk, Vl) = ek' Sigma11^{-1/2} Sigma12 Sigma22^{-1/2} fl = rhol* ek'el = 0,   k != l

since Sigma11^{-1/2} Sigma12 Sigma22^{-1/2} fl = rhol* el.
If the original variables are standardized with Z(1) = [Z1(1), Z2(1), ..., Zp(1)]' and Z(2) = [Z1(2), Z2(2), ..., Zq(2)]', from first principles, the canonical variates are of the form

    Uk = ak' Z(1) = ek' rho11^{-1/2} Z(1)
    Vk = bk' Z(2) = fk' rho22^{-1/2} Z(2)                               (10-13)

Here, Cov(Z(1)) = rho11, Cov(Z(2)) = rho22, Cov(Z(1), Z(2)) = rho12 = rho21', and ek and fk are the eigenvectors of rho11^{-1/2} rho12 rho22^{-1} rho21 rho11^{-1/2} and rho22^{-1/2} rho21 rho11^{-1} rho12 rho22^{-1/2}, respectively. The canonical correlations, rhok*, satisfy

    Corr(Uk, Vk) = rhok*,   k = 1, 2, ..., p                            (10-14)

where rho1*^2 >= rho2*^2 >= ... >= rhop*^2 are the nonzero eigenvalues of the matrix rho11^{-1/2} rho12 rho22^{-1} rho21 rho11^{-1/2} (or, equivalently, the largest eigenvalues of rho22^{-1/2} rho21 rho11^{-1} rho12 rho22^{-1/2}).

Comment. Notice that

    ak'(X(1) - mu(1)) = ak1 (X1(1) - mu1(1)) + ak2 (X2(1) - mu2(1)) + ... + akp (Xp(1) - mup(1))
                      = ak1 sqrt(sigma11) (X1(1) - mu1(1))/sqrt(sigma11) + ...
                        + akp sqrt(sigmapp) (Xp(1) - mup(1))/sqrt(sigmapp)

where Var(Xi(1)) = sigmaii, i = 1, 2, ..., p. Therefore, the canonical coefficients for the standardized variables, Zi(1) = (Xi(1) - mui(1))/sqrt(sigmaii), are simply related to the canonical coefficients attached to the original variables Xi(1). Specifically, if ak is the coefficient vector for the kth canonical variate Uk, then ak' V11^{1/2} is the coefficient vector for the kth canonical variate constructed from the standardized variables Z(1). Here V11^{1/2} is the diagonal matrix with ith diagonal element sqrt(sigmaii). Similarly, bk' V22^{1/2} is the coefficient vector for the canonical variate constructed from the set of standardized variables Z(2). In this case V22^{1/2} is the diagonal matrix with ith diagonal element sqrt(sigmaii) = [Var(Xi(2))]^{1/2}. The canonical correlations are unchanged by the standardization. However, the choice of the coefficient vectors ak, bk will not be unique if rhok*^2 = rho*^2_{k+1}.

The relationship between the canonical coefficients of the standardized variables and the canonical coefficients of the original variables follows from the special structure of the matrix [see also (10-16)]

    Sigma11^{-1/2} Sigma12 Sigma22^{-1} Sigma21 Sigma11^{-1/2}   or   rho11^{-1/2} rho12 rho22^{-1} rho21 rho11^{-1/2}

and, in this book, is unique to canonical correlation analysis. For example, in principal component analysis, if ak is the coefficient vector for the kth principal component obtained from Sigma, then ak'(X - mu) = ak' V^{1/2} Z, but we cannot infer that ak' V^{1/2} is the coefficient vector for the kth principal component derived from rho.
Example 10.1 (Calculating canonical variates and canonical correlations for standardized variables)
1 1 Z (l) == [ z i ) , Z� ) ] '
Z ( 2 ) == [ z i2 ) , Z�2 ) ] '
Suppos are alsoestandardized variablarese. sLettandardized1.var0 ia.bl4]esandand.5 Cov ( [H�-�-l-P�-�J - - ..45- - -1.-0- - - 1.- 0- - -1..4-0. 4 . 6 Then [ 1.0681 1.0681 ] [ 1.0417 1.008417 ] and [ 4 71 ..21096178 ] The e1genva ues, of are obta1ned from 0 .4371_2178 _1096.2178 ( .4371 ( .1096 - (2.178)2 - .5467 .0005 yitoerldequating ion .5458 and .0009. The eigenvector e1 fol ows from the vec [ .4217871 .1096 ] ( .5458) Thus, [.8947, .4466] and - [ ..28776561 ] ] [ .8561 ] [ .4054436 ] [ We must scale so that Z ) ==
Z == [ Z (l ) , Z ( 2 ) '
P2 1 l P22
p-1 11 /2 -
-- --
==
p*1 2 , p2* 2
_
.2
- .2 3
- . 2083
. 3 . 21 78
A
·
- A)
=
== A2
pf 2
- --
- .2229
-1 p -1 11 /2 p 1 2 p-I 22 p 2 1 p 1 1/2
,
-A
=
- -
.3
.2
1 - 1 P 1P1-11/2 P -1 1 /2 P1 2 P22 2
1
.3
-
- . 2229
p221 -
·
.6
- A)
A+
p! 2 =
=
. 3
e 1 ==
. 2178
el
==
el
a 1 - P1- 11 /2 e 1 -
b1
ex
_1 P22 p2 1 8 1 -
b1
_
. 3959 .2292 . 520 9 . 3542
. 277 6
==
. 2
550
Chapter 1 0
Ca nonica l Correlation Ana lysis
The vector .4026, .5443 gives [.4026, .5443] [1 .·°2 1.·20] [ .·45443026 ] .5460 Using .7389, we take 1.7389 [ ..54443026] [ ..57448366 ] The first pair of canonical variates i2s 1) 2 1) . 8 6 i . 2 8 � V b1Z(2) .542l2) .742�2) and their canonical correlation is .74 Thifroms isththee largandest corZ(2r)esleattsio. n possible between linear combinations of variables corr ofrecanoni lation,cal variates, although.03,uncor is verryeslmatealdl,wiandth consmembereThequents ofselycondt,htehefircanoni ssetcondpairc,paialconveys ver y l i t l e i n f o r m at i o n about t h e as s o ci a t i o n betsiderweeend inseExerts. (Tcihese 10.cal5cul.) ation of the second pair of canonical variates is con We not e t h at and V , apar t f r o m a s c al e change, ar e not much di f e r e nt 1 from the pair [
J'
=
v3460 =
bl =
=
+
U1 = a1Z (l ) = = 1 = Pi =
+
v'Pf2 =
z(l)
V3458 =
Pi = \1.00()9 =
U1
For these variates, Var(U ) 12.4 1 2. 4 Var b b' P 2 4. 0 Cov( U , 1 and Corr , 12.4.40 2.4 .73 ThelinearcorcombirelatnioatniobetnswUeen1 , thisealrmatoshert tshiemmaxiple and,mumpervalhuapse , easil.y74.interpretable = a' p1 1a =
( Vr ) =
=
Vr ) = a ' p 1 2 b =
( U1 Vi) = v'IT.4 v'2.4 = �
V1
�
Pi =
M
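The arithmetic of Example 10.1 is easy to check by machine. A short sketch (NumPy assumed) that reproduces rho1* ~ .74 and rho2* ~ .03:

```python
import numpy as np

# Correlation blocks from Example 10.1
r11 = np.array([[1.0, 0.4], [0.4, 1.0]])
r22 = np.array([[1.0, 0.2], [0.2, 1.0]])
r12 = np.array([[0.5, 0.6], [0.3, 0.4]])

# Symmetric inverse square root of rho11
lam, P = np.linalg.eigh(r11)
r11_ih = P @ np.diag(lam ** -0.5) @ P.T

# The symmetric matrix whose eigenvalues are the squared canonical correlations
M = r11_ih @ r12 @ np.linalg.inv(r22) @ r12.T @ r11_ih
eigvals = np.sort(np.linalg.eigvalsh(M))[::-1]
rho = np.sqrt(eigvals)            # canonical correlations

print(np.round(rho, 2))           # -> [0.74 0.03]
```

The intermediate matrix `M` matches the [.4371 .2178; .2178 .1096] computed by hand in the example.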
The procedure for obtaining the canonical variates presented in Result 10.1 has certain advantages. The symmetric matrices, whose eigenvectors determine the canonical coefficients, are readily handled by computer routines. Moreover, writing the coefficient vectors as ak = Sigma11^{-1/2} ek and bk = Sigma22^{-1/2} fk facilitates analytic descriptions and their geometric interpretations. To ease the computational burden, many people prefer to get the canonical correlations from the eigenvalue equation

    | Sigma11^{-1} Sigma12 Sigma22^{-1} Sigma21 - rho*^2 I | = 0        (10-15)

The coefficient vectors a and b follow directly from the eigenvector equations

    Sigma11^{-1} Sigma12 Sigma22^{-1} Sigma21 a = rho*^2 a
    Sigma22^{-1} Sigma21 Sigma11^{-1} Sigma12 b = rho*^2 b              (10-16)

The matrices Sigma11^{-1} Sigma12 Sigma22^{-1} Sigma21 and Sigma22^{-1} Sigma21 Sigma11^{-1} Sigma12 are, in general, not symmetric. (See Exercise 10.4 for more details.)
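Equations (10-15) and (10-16) can be handed to a general eigensolver; the matrix is not symmetric, but its eigenvalues are the same rho*^2 as in the symmetric formulation. A sketch (NumPy assumed) using the Example 10.1 correlation blocks:

```python
import numpy as np

r11 = np.array([[1.0, 0.4], [0.4, 1.0]])
r22 = np.array([[1.0, 0.2], [0.2, 1.0]])
r12 = np.array([[0.5, 0.6], [0.3, 0.4]])

# (10-16): Sigma11^{-1} Sigma12 Sigma22^{-1} Sigma21 a = rho*^2 a  (not symmetric)
M1 = np.linalg.inv(r11) @ r12 @ np.linalg.inv(r22) @ r12.T
rho_sq, vecs = np.linalg.eig(M1)
order = np.argsort(rho_sq.real)[::-1]
rho_sq, vecs = rho_sq.real[order], vecs.real[:, order]

a1 = vecs[:, 0]
a1 = a1 / np.sqrt(a1 @ r11 @ a1)       # scale so Var(U1) = a1' rho11 a1 = 1
print(np.round(np.sqrt(rho_sq), 2))    # -> [0.74 0.03]
```

Up to an arbitrary sign, the rescaled eigenvector recovers the a1 = [.8561, .2776]' of Example 10.1.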
10.3 INTERPRETING THE POPULATION CANONICAL VARIABLES
Canonical variables are, in general, artificial. That is, they have no physical meaning. If the original variables X(1) and X(2) are used, the canonical coefficients a and b have units proportional to those of the X(1) and X(2) sets. If the original variables are standardized to have zero means and unit variances, the canonical coefficients have no units of measurement, and they must be interpreted in terms of the standardized variables.

Result 10.1 gives the technical definitions of the canonical variables and canonical correlations. In this section, we concentrate on interpreting these quantities.

Identifying the Canonical Variables

Even though the canonical variables are artificial, they can often be "identified" in terms of the subject-matter variables. Many times this identification is aided by computing the correlations between the canonical variates and the original variables. These correlations, however, must be interpreted with caution. They provide only univariate information, in the sense that they do not indicate how the original variables contribute jointly to the canonical analyses. (See, for example, [11].) For this reason, many investigators prefer to assess the contributions of the original variables directly from the standardized coefficients (10-13).

Let

    A = [a1, a2, ..., ap]'     B = [b1, b2, ..., bq]'
    (p x p)                    (q x q)

so that the vectors of canonical variables are

    U = A X(1)     V = B X(2)                                           (10-17)
    (p x 1)        (q x 1)

where we are primarily interested in the first p canonical variables in V. Then

    Cov(U, X(1)) = Cov(A X(1), X(1)) = A Sigma11                        (10-18)
Because Var(Ui) = 1, Corr(Ui, Xk(1)) is obtained by dividing Cov(Ui, Xk(1)) by [Var(Xk(1))]^{1/2} = sigmakk^{1/2}. Equivalently, Corr(Ui, Xk(1)) = Cov(Ui, sigmakk^{-1/2} Xk(1)). Introducing the (p x p) diagonal matrix V11^{-1/2} with kth diagonal element sigmakk^{-1/2}, we have, in matrix terms,

    rho_{U,X(1)} = Corr(U, X(1)) = Cov(U, V11^{-1/2} X(1)) = Cov(A X(1), V11^{-1/2} X(1))
     (p x p)                     = A Sigma11 V11^{-1/2}

Similar calculations for the pairs (U, X(2)), (V, X(2)), and (V, X(1)) yield

    rho_{U,X(1)} = A Sigma11 V11^{-1/2}     rho_{V,X(2)} = B Sigma22 V22^{-1/2}
     (p x p)                                 (q x q)
                                                                        (10-19)
    rho_{U,X(2)} = A Sigma12 V22^{-1/2}     rho_{V,X(1)} = B Sigma21 V11^{-1/2}
     (p x q)                                 (q x p)

where V22^{-1/2} is the (q x q) diagonal matrix with ith diagonal element [Var(Xi(2))]^{-1/2}.

Canonical variables derived from standardized variables are sometimes interpreted by computing the correlations. Thus,

    rho_{U,Z(1)} = Az rho11     rho_{V,Z(2)} = Bz rho22
    rho_{U,Z(2)} = Az rho12     rho_{V,Z(1)} = Bz rho21                 (10-20)

where Az (p x p) and Bz (q x q) are the matrices whose rows contain the canonical coefficients for the Z(1) and Z(2) sets, respectively. The correlations in the matrices displayed in (10-20) have the same numerical values as those appearing in (10-19); that is, rho_{U,X(1)} = rho_{U,Z(1)}, and so forth. This follows because, for example,

    rho_{U,X(1)} = A Sigma11 V11^{-1/2} = A V11^{1/2} V11^{-1/2} Sigma11 V11^{-1/2} = Az rho11 = rho_{U,Z(1)}

The correlations are unaffected by the standardization.

Example 10.2 (Computing correlations between canonical variates and their component variables)

Compute the correlations between the first pair of canonical variates and their component variables for the situation considered in Example 10.1.

The variables in Example 10.1 are already standardized, so equation (10-20) is applicable. For the standardized variables,

    rho11 = [1.0  .4; .4  1.0]     rho22 = [1.0  .2; .2  1.0]

and

    rho12 = [.5  .6; .3  .4]

Restricting attention to the first pair of canonical variates, we take the first rows

    Az = [.86, .28]     Bz = [.54, .74]
so

    rho_{U1,Z(1)} = Az rho11 = [.86, .28] [1.0  .4; .4  1.0] = [.97, .62]

and

    rho_{V1,Z(2)} = Bz rho22 = [.54, .74] [1.0  .2; .2  1.0] = [.69, .85]

We conclude that, of the two variables in the set Z(1), the first is most closely associated with the canonical variate U1. Of the two variables in the set Z(2), the second is most closely associated with V1. In this case, the correlations reinforce the information supplied by the standardized coefficients Az and Bz. However, the correlations elevate the relative importance of Z2(1) in the first set and Z1(2) in the second set because they ignore the contribution of the remaining variable in each set.

From (10-20), we also obtain the correlations

    rho_{U1,Z(2)} = Az rho12 = [.86, .28] [.5  .6; .3  .4] = [.51, .63]

and

    rho_{V1,Z(1)} = Bz rho21 = Bz rho12' = [.54, .74] [.5  .3; .6  .4] = [.71, .46]

Later, in our discussion of the sample canonical variates, we shall comment on the interpretation of these last correlations.

The correlations rho_{U,X(1)} and rho_{V,X(2)} can help supply meanings for the canonical variates. The spirit is the same as in principal component analysis when the correlations between the principal components and their associated variables may provide subject-matter interpretations for the components.
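The products in (10-20) are one-line computations. A sketch (NumPy assumed) reproducing the Example 10.2 correlations:

```python
import numpy as np

r11 = np.array([[1.0, 0.4], [0.4, 1.0]])
r22 = np.array([[1.0, 0.2], [0.2, 1.0]])
r12 = np.array([[0.5, 0.6], [0.3, 0.4]])

Az = np.array([[0.86, 0.28]])   # first row of A_z from Example 10.1
Bz = np.array([[0.54, 0.74]])   # first row of B_z

print(np.round(Az @ r11, 2))    # rho_{U1,Z(1)} -> [[0.97 0.62]]
print(np.round(Bz @ r22, 2))    # rho_{V1,Z(2)} -> [[0.69 0.85]]
print(np.round(Az @ r12, 2))    # rho_{U1,Z(2)} -> [[0.51 0.63]]
print(np.round(Bz @ r12.T, 2))  # rho_{V1,Z(1)} -> [[0.71 0.46]]
```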
Canonical Correlations as Generalizations of Other Correlation Coefficients
First, the canonical correlation generalizes the correlation between two variables. When X(1) and X(2) each consist of a single variable, so that p = q = 1,

    rho1* = | Corr(X1(1), X1(2)) |

Therefore, the "canonical variates" U1 = X1(1) and V1 = X1(2) have correlation rho1* = | Corr(X1(1), X1(2)) |. When X(1) and X(2) have more components, setting a' = [0, ..., 0, 1, 0, ..., 0] with 1 in the ith position and b' = [0, ..., 0, 1, 0, ..., 0] with 1 in the kth position yields

    | Corr(Xi(1), Xk(2)) | = | Corr(a'X(1), b'X(2)) |
                           <= max over a, b of Corr(a'X(1), b'X(2)) = rho1*      (10-21)

That is, the first canonical correlation is larger than the absolute value of any entry in rho12 = V11^{-1/2} Sigma12 V22^{-1/2}.

Second, the multiple correlation coefficient rho1(X(2)) [see (7-54)] is a special case of a canonical correlation when X(1) has the single element X1(1) (p = 1). Recall that

    rho1(X(2)) = rho1*   for p = 1                                      (10-22)

When p > 1, rho1* is larger than each of the multiple correlations of Xi(1) with X(2) or the multiple correlations of Xi(2) with X(1).

Finally, we note that

    rho_{Uk}(X(2)) = max over b of Corr(Uk, b'X(2)) = Corr(Uk, Vk) = rhok*,   k = 1, 2, ..., p    (10-23)

from the proof of Result 10.1. Similarly,

    rho_{Vk}(X(1)) = max over a of Corr(a'X(1), Vk) = Corr(Uk, Vk) = rhok*,   k = 1, 2, ..., p    (10-24)

That is, the canonical correlations are also the multiple correlation coefficients of Uk with X(2) or the multiple correlation coefficients of Vk with X(1).

Because of its multiple correlation coefficient interpretation, the kth squared canonical correlation rhok*^2 is the proportion of the variance of canonical variate Uk "explained" by the set X(2). It is also the proportion of the variance of canonical variate Vk "explained" by the set X(1). Therefore, rhok*^2 is often called the shared variance between the two sets X(1) and X(2). The largest value, rho1*^2, is sometimes regarded as a measure of set "overlap."
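Two of these facts can be confirmed numerically with the Example 10.1 correlation blocks. A sketch (NumPy assumed): the first canonical correlation dominates every entry of rho12, as in (10-21), and with p = 1 the analysis reduces to the multiple correlation coefficient, as in (10-22):

```python
import numpy as np

r11 = np.array([[1.0, 0.4], [0.4, 1.0]])
r22 = np.array([[1.0, 0.2], [0.2, 1.0]])
r12 = np.array([[0.5, 0.6], [0.3, 0.4]])

# Squared canonical correlations: eigenvalues of rho11^{-1} rho12 rho22^{-1} rho21
rho_sq = np.linalg.eigvals(np.linalg.inv(r11) @ r12 @ np.linalg.inv(r22) @ r12.T)
rho1 = np.sqrt(rho_sq.real.max())

# (10-21): the first canonical correlation dominates every entry of rho12
assert rho1 >= np.abs(r12).max()

# (10-22): keeping only X1(1) (so p = 1), the lone canonical correlation is the
# multiple correlation of X1(1) with X(2): rho^2 = rho12 rho22^{-1} rho21
mult_r = np.sqrt((r12[[0]] @ np.linalg.inv(r22) @ r12[[0]].T).item())
print(round(rho1, 2), round(mult_r, 2))   # -> 0.74 0.71
```

As the text states for p > 1, the full canonical correlation (.74) exceeds the single-variable multiple correlation (.71).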
The First r Canonical Variables as a Summary of Variability
The change of coordinates from X(1) to U = A X(1) and from X(2) to V = B X(2) is chosen to maximize Corr(U1, V1) and, successively, Corr(Ui, Vi), where the pair (Ui, Vi) has zero correlation with the previous pairs (U1, V1), (U2, V2), ..., (U_{i-1}, V_{i-1}). Correlation between the sets X(1) and X(2) has been isolated in the pairs of canonical variables. By design, the coefficient vectors ai, bi are selected to maximize correlations, not necessarily to provide variables that (approximately) account for the subset covariances Sigma11 and Sigma22. When the first few pairs of canonical variables provide poor summaries of the variability in Sigma11 and Sigma22, it is not clear how a high canonical correlation should be interpreted.
Example 10.3 (Canonical correlation as a poor summary of variability)

Consider the covariance matrix

    Cov([X1(1); X2(1); X1(2); X2(2)]) = [Sigma11 | Sigma12; Sigma21 | Sigma22] =

        [100   0  |  0    0
           0   1  | .95   0
         ------------------
           0  .95 |  1    0
           0   0  |  0  100]

The reader may verify (see Exercise 10.1) that the first pair of canonical variates U1 = X2(1) and V1 = X1(2) has correlation

    rho1* = Corr(U1, V1) = .95

Yet U1 = X2(1) provides a very poor summary of the variability in the first set. Most of the variability in this set is in X1(1), which is uncorrelated with U1. The same situation is true for V1 = X1(2) in the second set.
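Running the numbers of Example 10.3 confirms both claims: the canonical correlation is .95, yet U1 puts all of its weight on the low-variance variable X2(1). Sketch (NumPy assumed):

```python
import numpy as np

S11 = np.array([[100.0, 0.0], [0.0, 1.0]])
S22 = np.array([[1.0, 0.0], [0.0, 100.0]])
S12 = np.array([[0.0, 0.0], [0.95, 0.0]])   # Cov(X2(1), X1(2)) = .95

M = np.linalg.inv(S11) @ S12 @ np.linalg.inv(S22) @ S12.T
rho_sq, vecs = np.linalg.eig(M)
k = int(np.argmax(rho_sq.real))
rho1 = np.sqrt(rho_sq.real[k])
a1 = vecs.real[:, k]

print(round(rho1, 2))            # -> 0.95
print(np.round(np.abs(a1), 2))   # -> [0. 1.]  (U1 is X2(1) alone)
```

The eigenvector shows that U1 ignores X1(1) entirely, even though X1(1) carries almost all of the first set's variance.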
A Geometrical Interpretation of the Population Canonical Correlation Analysis

A geometrical interpretation of the procedure for selecting canonical variables provides some valuable insights into the nature of a canonical correlation analysis.

The transformation U = A X(1) from X(1) to U gives

    Cov(U) = A Sigma11 A' = I

From Result 10.1 and (2-22), A = E' Sigma11^{-1/2} = E' P1 Lambda1^{-1/2} P1', where E' is an orthogonal matrix with rows ek', and Sigma11 = P1 Lambda1 P1'. Now, P1' X(1) is the set of principal components derived from X(1) alone. The matrix Lambda1^{-1/2} P1' X(1) has ith row (1/sqrt(lambdai)) pi' X(1), which is the ith principal component scaled to have unit variance. That is,

    Cov(Lambda1^{-1/2} P1' X(1)) = Lambda1^{-1/2} P1' Sigma11 P1 Lambda1^{-1/2}
                                 = Lambda1^{-1/2} Lambda1 Lambda1^{-1/2} = I

Consequently, U = A X(1) = E' P1 Lambda1^{-1/2} P1' X(1) can be interpreted as (1) a transformation of X(1) to uncorrelated standardized principal components, followed by (2) a rigid (orthogonal) rotation P1 determined by Sigma11 and then (3) another rotation E' determined from the full covariance matrix Sigma. A similar interpretation applies to V = B X(2).
10.4 THE SAMPLE CANONICAL VARIATES AND SAMPLE CANONICAL CORRELATIONS

A random sample of n observations on each of the (p + q) variables X(1), X(2) can be assembled into the n x (p + q) data matrix

    X = [X(1) | X(2)] =
        [x11(1) x12(1) ... x1p(1) | x11(2) x12(2) ... x1q(2)]
        [x21(1) x22(1) ... x2p(1) | x21(2) x22(2) ... x2q(2)]
        [ ...                     |  ...                    ]
        [xn1(1) xn2(1) ... xnp(1) | xn1(2) xn2(2) ... xnq(2)]

The vector of sample means can be organized as

    xbar = [xbar(1); xbar(2)]   where   xbar(1) = (1/n) sum over j of xj(1),
                                        xbar(2) = (1/n) sum over j of xj(2)          (10-25)

Similarly, the sample covariance matrix can be arranged analogous to the representation (10-4). Thus,

    S = [S11 | S12; S21 | S22]                                           (10-26)
    ((p+q) x (p+q))

where

    Skl = (1/(n - 1)) sum from j = 1 to n of (xj(k) - xbar(k))(xj(l) - xbar(l))',   k, l = 1, 2    (10-27)

The linear combinations

    Uhat = ahat' x(1);   Vhat = bhat' x(2)                               (10-28)

have sample correlation [see (3-36)]

    r_{Uhat,Vhat} = ahat' S12 bhat / (sqrt(ahat' S11 ahat) sqrt(bhat' S22 bhat))     (10-29)

The first pair of sample canonical variates is the pair of linear combinations Uhat1, Vhat1 having unit sample variances that maximize the ratio (10-29).

In general, the kth pair of sample canonical variates is the pair of linear combinations Uhatk, Vhatk having unit sample variances that maximize the ratio (10-29) among those linear combinations uncorrelated with the previous k - 1 sample canonical variates. The sample correlation between Uhatk and Vhatk is called the kth sample canonical correlation.

The sample canonical variates and the sample canonical correlations can be obtained from the sample covariance matrices S11, S12 = S21', and S22 in a manner consistent with the population case described in Result 10.1.
Result 10.2. Let rhohat1*^2 >= rhohat2*^2 >= ... >= rhohatp*^2 be the p ordered eigenvalues of S11^{-1/2} S12 S22^{-1} S21 S11^{-1/2} with corresponding eigenvectors ehat1, ehat2, ..., ehatp, where the Skl are defined in (10-27) and p <= q. Let fhat1, fhat2, ..., fhatp be the eigenvectors of S22^{-1/2} S21 S11^{-1} S12 S22^{-1/2}, where the first p fhat's may be obtained from

    fhatk = (1/rhohatk*) S22^{-1/2} S21 S11^{-1/2} ehatk,   k = 1, 2, ..., p

Then the kth sample canonical variate pair(2) is

    Uhatk = ehatk' S11^{-1/2} x(1)     Vhatk = fhatk' S22^{-1/2} x(2)

where x(1) and x(2) are the values of the variables X(1) and X(2) for a particular experimental unit. Also, the first sample canonical variate pair has the maximum sample correlation

    r_{Uhat1,Vhat1} = rhohat1*

and for the kth pair,

    r_{Uhatk,Vhatk} = rhohatk*

is the largest possible correlation among linear combinations uncorrelated with the preceding k - 1 sample canonical variates.

The quantities rhohat1*, rhohat2*, ..., rhohatp* are the sample canonical correlations.(3)

Proof. The proof of this result follows the proof of Result 10.1, with Skl substituted for Sigmakl, k, l = 1, 2.

The sample canonical variates have unit sample variances

    s_{Uhatk,Uhatk} = s_{Vhatk,Vhatk} = 1                                (10-30)

and their sample correlations are

    r_{Uhatk,Uhatl} = r_{Vhatk,Vhatl} = 0,   k != l
    r_{Uhatk,Vhatl} = 0,   k != l                                        (10-31)

The interpretation of Uhatk, Vhatk is often aided by computing the sample correlations between the canonical variates and the variables in the sets X(1) and X(2). We define the matrices

    Ahat = [ahat1, ahat2, ..., ahatp]'     Bhat = [bhat1, bhat2, ..., bhatq]'     (10-32)
    (p x p)                                (q x q)

whose rows are the coefficient vectors for the sample canonical variates.(4) Analogous to (10-17), we have

    Uhat = Ahat x(1)     Vhat = Bhat x(2)                                (10-33)
    (p x 1)              (q x 1)

(2) When the distribution is normal, the maximum likelihood method can be employed using Sigmahat = Sn in place of S. The sample canonical correlations rhohatk* are, therefore, the maximum likelihood estimates of rhok*, and sqrt(n/(n - 1)) ahatk, sqrt(n/(n - 1)) bhatk are the maximum likelihood estimates of ak and bk, respectively.
(3) If p > rank(S12) = p1, the nonzero sample canonical correlations are rhohat1*, ..., rhohat_{p1}*.
(4) The vectors bhat_{p1+1} = S22^{-1/2} fhat_{p1+1}, bhat_{p1+2} = S22^{-1/2} fhat_{p1+2}, ..., bhatq = S22^{-1/2} fhatq are determined from a choice of the last q - p1 mutually orthogonal eigenvectors fhat associated with the zero eigenvalue of S22^{-1/2} S21 S11^{-1} S12 S22^{-1/2}.
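Result 10.2 is the population recipe applied to S. A minimal sketch (NumPy assumed; the function name and the simulated data are illustrative, not from the text):

```python
import numpy as np

def sample_cancorr(X1, X2):
    """Sample canonical correlations and coefficient rows, per Result 10.2."""
    n, p = X1.shape
    X = np.hstack([X1, X2])
    S = np.cov(X, rowvar=False)                  # divisor n - 1, as in (10-27)
    S11, S12, S22 = S[:p, :p], S[:p, p:], S[p:, p:]

    lam, P = np.linalg.eigh(S11)
    S11_ih = P @ np.diag(lam ** -0.5) @ P.T      # S11^{-1/2}
    K = S11_ih @ S12 @ np.linalg.inv(S22) @ S12.T @ S11_ih
    lam, E = np.linalg.eigh(K)
    order = np.argsort(lam)[::-1]
    rho = np.sqrt(np.clip(lam[order], 0.0, None))  # rhohat_1* >= rhohat_2* >= ...
    A = (S11_ih @ E[:, order]).T                   # rows ahat_k'
    return rho, A

# Two independent sets: the sample canonical correlations should be near zero
rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 4))
rho, A = sample_cancorr(Z[:, :2], Z[:, 2:])
```

For genuinely unrelated sets, rhohat1* is small but not exactly zero; this sampling variability motivates the large-sample tests discussed later in the chapter.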
and we can define

    R_{Uhat,x(1)} = matrix of sample correlations of Uhat with x(1)
    R_{Vhat,x(2)} = matrix of sample correlations of Vhat with x(2)
    R_{Uhat,x(2)} = matrix of sample correlations of Uhat with x(2)
    R_{Vhat,x(1)} = matrix of sample correlations of Vhat with x(1)

Corresponding to (10-19), we have

    R_{Uhat,x(1)} = Ahat S11 D11^{-1/2}
    R_{Vhat,x(2)} = Bhat S22 D22^{-1/2}
    R_{Uhat,x(2)} = Ahat S12 D22^{-1/2}                                  (10-34)
    R_{Vhat,x(1)} = Bhat S21 D11^{-1/2}

where D11^{-1/2} is the (p x p) diagonal matrix with ith diagonal element (sample var(xi(1)))^{-1/2} and D22^{-1/2} is the (q x q) diagonal matrix with ith diagonal element (sample var(xi(2)))^{-1/2}.

Comment. If the observations are standardized [see (8-25)], the data matrix becomes

    Z = [Z(1) | Z(2)] = [z1(1)', z1(2)'; ...; zn(1)', zn(2)']

and the sample canonical variates become

    Uhat = Ahatz z(1)     Vhat = Bhatz z(2)                              (10-35)
    (p x 1)               (q x 1)

where Ahatz = Ahat D11^{1/2} and Bhatz = Bhat D22^{1/2}. The sample canonical correlations are unaffected by the standardization. The correlations displayed in (10-34) remain unchanged and may be calculated, for standardized observations, by substituting Ahatz for Ahat, Bhatz for Bhat, and R for S. Note that D11^{-1/2} = I (p x p) and D22^{-1/2} = I (q x q) for standardized observations.

Example 10.4 (Canonical correlation analysis of the chicken-bone data)

In Example 9.14, data consisting of bone and skull measurements of white leghorn fowl were described. From this example, the chicken-bone measurements for

    Head (X(1)):  X1(1) = skull length,  X2(1) = skull breadth
    Leg (X(2)):   X1(2) = femur length,  X2(2) = tibia length
have the sample correlation matrix

    R = [R11 | R12; R21 | R22] =
        [1.0   .505 | .569  .602
         .505 1.0   | .422  .467
         ------------------------
         .569  .422 | 1.0   .926
         .602  .467 | .926  1.0 ]

A canonical correlation analysis of the head and leg sets of variables using R produces the two canonical correlations and corresponding pairs of variables

    Uhat1 = .781 z1(1) + .345 z2(1)
    Vhat1 = .060 z1(2) + .944 z2(2)        rhohat1* = .631

and

    Uhat2 = -.856 z1(1) + 1.106 z2(1)
    Vhat2 = -2.648 z1(2) + 2.475 z2(2)     rhohat2* = .057

Here zi(1), i = 1, 2 and zi(2), i = 1, 2 are the standardized data values for sets 1 and 2, respectively. The preceding results were taken from the SAS statistical software output shown in Panel 10.1. In addition, the correlations of the original variables with the canonical variables are highlighted in that panel.
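The Panel 10.1 figures can be reproduced from the correlation matrix alone, using the eigenvalue route (10-15) with R substituted for Sigma. Sketch (NumPy assumed):

```python
import numpy as np

R11 = np.array([[1.0, 0.505], [0.505, 1.0]])     # skull length, skull breadth
R22 = np.array([[1.0, 0.926], [0.926, 1.0]])     # femur length, tibia length
R12 = np.array([[0.569, 0.602], [0.422, 0.467]])

M = np.linalg.inv(R11) @ R12 @ np.linalg.inv(R22) @ R12.T
rho_sq = np.sort(np.linalg.eigvals(M).real)[::-1]
rho = np.sqrt(rho_sq)

print(np.round(rho, 3))   # -> [0.631 0.057]
```

The squared values agree with the "Squared Canonical Correlation" column of the SAS output.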
Example 10.5 (Canonical correlation analysis of job satisfaction)

As part of a larger study of the effects of organizational structure on "job satisfaction," Dunham [4] investigated the extent to which measures of job satisfaction are related to job characteristics. Using a survey instrument, Dunham obtained measurements of p = 5 job characteristics and q = 7 job satisfaction variables for n = 784 executives from the corporate branch of a large retail merchandising corporation. Are measures of job satisfaction associated with job characteristics? The answer may have implications for job design.

PANEL 10.1  SAS ANALYSIS FOR EXAMPLE 10.4 USING PROC CANCORR

PROGRAM COMMANDS

    title 'Canonical Correlation Analysis';
    data skull (type = corr);
    _type_ = 'CORR';
    input _name_$ x1 x2 x3 x4;
    cards;
    x1  1.0    .     .     .
    x2   .505 1.0    .     .
    x3   .569  .422 1.0    .
    x4   .602  .467  .926 1.0
    ;
    proc cancorr data = skull vprefix = head wprefix = leg;
    var x1 x2;
    with x3 x4;

(continues on next page)
PANEL 10.1 (continued)

OUTPUT

    Canonical Correlation Analysis

         Canonical     Adjusted      Approx        Squared
         Correlation   Canonical     Standard      Canonical
                       Correlation   Error         Correlation
    1    0.631085      0.628291      0.036286      0.398268
    2    0.056798          .         0.060108      0.003226

    Canonical Structure

    Correlations Between the VAR Variables and Their Canonical Variables
              HEAD1       HEAD2
    X1        0.9548     -0.2974
    X2        0.7388      0.6739          [see (10-34)]

    Correlations Between the WITH Variables and Their Canonical Variables
              LEG1        LEG2
    X3        0.9343     -0.3564
    X4        0.9997      0.0227          [see (10-34)]

    Correlations Between the VAR Variables and the Canonical Variables of the WITH Variables
              LEG1        LEG2
    X1        0.6025     -0.0169
    X2        0.4663      0.0383          [see (10-34)]

    Correlations Between the WITH Variables and the Canonical Variables of the VAR Variables
              HEAD1       HEAD2
    X3        0.5897     -0.0202
    X4        0.6309      0.0013          [see (10-34)]
The original job characteristic variables, X(1), and job satisfaction variables, X(2), were respectively defined as

    X(1) = [X1(1); X2(1); X3(1); X4(1); X5(1)]
         = [feedback; task significance; task variety; task identity; autonomy]

    X(2) = [X1(2); X2(2); X3(2); X4(2); X5(2); X6(2); X7(2)]
         = [supervisor satisfaction; career-future satisfaction; financial satisfaction;
            workload satisfaction; company identification; kind-of-work satisfaction;
            general satisfaction]

Responses for variables X(1) and X(2) were recorded on a scale and then standardized. The sample correlation matrix based on 784 responses is

    R = [R11 | R12; R21 | R22] =

        [1.0   .49   .53   .49   .51 | .33   .32   .20   .19   .30   .37   .21
          .49 1.0    .57   .46   .53 | .30   .21   .16   .08   .27   .35   .20
          .53  .57  1.0    .48   .57 | .31   .23   .14   .07   .24   .37   .18
          .49  .46   .48  1.0    .57 | .24   .22   .12   .19   .21   .29   .16
          .51  .53   .57   .57  1.0  | .38   .32   .17   .23   .32   .36   .27
         ------------------------------------------------------------------
          .33  .30   .31   .24   .38 |1.0    .43   .27   .24   .34   .37   .40
          .32  .21   .23   .22   .32 | .43  1.0    .33   .26   .54   .32   .58
          .20  .16   .14   .12   .17 | .27   .33  1.0    .25   .46   .29   .45
          .19  .08   .07   .19   .23 | .24   .26   .25  1.0    .28   .30   .27
          .30  .27   .24   .21   .32 | .34   .54   .46   .28  1.0    .35   .59
          .37  .35   .37   .29   .36 | .37   .32   .29   .30   .35  1.0    .31
          .21  .20   .18   .16   .27 | .40   .58   .45   .27   .59   .31  1.0 ]

The min(p, q) = min(5, 7) = 5 sample canonical correlations and the sample canonical variate coefficient vectors (from Dunham [4]) are displayed in the following table:
CANONICAL VARIATE COEFFICIENTS AND CANONICAL CORRELATIONS

                 Standardized variables, set 1
              z1(1)   z2(1)   z3(1)   z4(1)   z5(1)     rhohatk*
    ahat1':    .42     .21     .17    -.02     .44        .55
    ahat2':   -.30     .65     .85    -.29    -.81        .23
    ahat3':   -.86     .47    -.19    -.49     .95        .12
    ahat4':    .76    -.06    -.12   -1.14    -.25        .08
    ahat5':    .27    1.01   -1.04     .16     .32        .05

                 Standardized variables, set 2
              z1(2)   z2(2)   z3(2)   z4(2)   z5(2)   z6(2)   z7(2)
    bhat1':    .42     .22    -.03     .01     .29     .52    -.12
    bhat2':    .03    -.42     .08    -.91     .14     .59    -.02
    bhat3':    .58    -.76    -.41    -.07     .19    -.43     .92
    bhat4':    .23     .49     .52    -.47     .34    -.69    -.37
    bhat5':   -.52    -.63     .41     .21     .76     .02     .10
For example, the first sample canonical variate pair is

    Uhat1 = .42 z1(1) + .21 z2(1) + .17 z3(1) - .02 z4(1) + .44 z5(1)
    Vhat1 = .42 z1(2) + .22 z2(2) - .03 z3(2) + .01 z4(2) + .29 z5(2) + .52 z6(2) - .12 z7(2)

with sample canonical correlation rhohat1* = .55.

According to the coefficients, Uhat1 is primarily a feedback and autonomy variable, while Vhat1 represents supervisor, career-future, and kind-of-work satisfaction, along with company identification.

To provide interpretations for Uhat1 and Vhat1, the sample correlations between Uhat1 and its component variables and between Vhat1 and its component variables were computed. Also, the following table shows the sample correlations between variables in one set and the first sample canonical variate of the other set. These correlations can be calculated using (10-34).

SAMPLE CORRELATIONS BETWEEN ORIGINAL VARIABLES AND CANONICAL VARIABLES

                              Sample canonical                                      Sample canonical
                                  variates                                              variates
    X(1) variables            Uhat1    Vhat1      X(2) variables                    Uhat1    Vhat1
    1. Feedback                .83      .46       1. Supervisor satisfaction         .42      .75
    2. Task significance       .74      .41       2. Career-future satisfaction      .35      .65
    3. Task variety            .75      .42       3. Financial satisfaction          .21      .39
    4. Task identity           .62      .34       4. Workload satisfaction           .21      .37
    5. Autonomy                .85      .48       5. Company identification          .36      .65
                                                  6. Kind-of-work satisfaction       .44      .80
                                                  7. General satisfaction            .28      .50

All five job characteristic variables have roughly the same correlations with the first canonical variate Uhat1. From this standpoint, Uhat1 might be interpreted as a job characteristic "index." This differs from the preferred interpretation, based on coefficients, where the task variables are not important.

The other member of the first canonical variate pair, Vhat1, seems to be representing, primarily, supervisor satisfaction, career-future satisfaction, company identification, and kind-of-work satisfaction. As the variables suggest, Vhat1 might be regarded as a job satisfaction-company identification index. This agrees with the preceding interpretation based on the canonical coefficients of the z(2)'s. The sample correlation between the two indices Uhat1 and Vhat1 is rhohat1* = .55. There appears to be some overlap between job characteristics and job satisfaction. We explore this issue further in Example 10.7.

Scatter plots of the first (Uhat1, Vhat1) pair may reveal atypical observations xj requiring further study. If the canonical correlations rhohat2*, rhohat3*, ... are also moderately
564
Chapter 1 0
Ca non ica l Corre lation Ana lysis
cu2 , v2 ) , cu3 , v3 ) ,
lsaprectge,. sMany catter planalotsysofts tshueggespaitrsplot ing "significant. " canoni may alcsalo varbe ihelatepsfuagail inntsht itshreier cthoemponent var i a bl e s as an ai d i n s u bj e ct m at t e r i n t e r p r e t a t i o n. Thes e pl o t s r e i n f o r c e corIfrtehlaetisoanmplcoefe sfiizceieints slairnge, it is often desirable to split the sample in half. The fcalirstvarihalaftofes tandhe sampl e c a n be us e d t o cons t r u ct and eval u at e t h e s a mpl e canoni ions. The(if rany)esultisncanthethnatenubere "valof thideatcanoni ed" wictalh tanalhe ryesmaiis winlinprg oobsvicanoni edervatancialionnsdicor.catThereiolantchange of t h e s a mpl i n g var i a bi l i t y and t h e s t a bi l i t y of the conclusions. . .
(10-34) .
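Computations like those in this example are easy to sketch in code. The following is a minimal illustration, not from the text, using simulated data with hypothetical sizes: it obtains the sample canonical correlations from the eigenvalues of R11⁻¹R12R22⁻¹R21 and checks that the first variate pair actually attains the largest correlation.

```python
import numpy as np

# Illustrative sketch (simulated data; sizes and names are hypothetical).
rng = np.random.default_rng(0)
n, p, q = 200, 2, 3
raw = rng.standard_normal((n, p + q)) @ rng.standard_normal((p + q, p + q))
Z = (raw - raw.mean(0)) / raw.std(0)          # standardized observations
R = np.corrcoef(Z, rowvar=False)
R11, R12 = R[:p, :p], R[:p, p:]
R21, R22 = R[p:, :p], R[p:, p:]

# Eigenvalues of R11^-1 R12 R22^-1 R21 are the squared sample
# canonical correlations; eigenvectors give the first-set coefficients.
M = np.linalg.inv(R11) @ R12 @ np.linalg.inv(R22) @ R21
eigvals, eigvecs = np.linalg.eig(M)
order = np.argsort(eigvals.real)[::-1]
rho = np.sqrt(np.clip(eigvals.real[order], 0, 1))

# First pair: a for set 1; b is proportional to R22^-1 R21 a for set 2.
a = eigvecs[:, order[0]].real
b = np.linalg.inv(R22) @ R21 @ a
U1, V1 = Z[:, :p] @ a, Z[:, p:] @ b

# |corr(U1, V1)| equals the largest sample canonical correlation.
print(rho, abs(np.corrcoef(U1, V1)[0, 1]))
```

The sign of an eigenvector is arbitrary, so only the absolute value of the correlation of (U1, V1) is meaningful; scatter plots of such computed pairs are exactly the diagnostic plots suggested above.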
10.5 ADDITIONAL SAMPLE DESCRIPTIVE MEASURES
If the canonical variates are "good" summaries of their respective sets of variables, then the associations between variables can be described in terms of the canonical variates and their correlations. It is useful to have summary measures of the extent to which the canonical variates account for the variation in their respective sets. It is also useful, on occasion, to calculate the proportion of variance in one set of variables explained by the canonical variates of the other set.

Matrices of Errors of Approximations

Given the matrices Â and B̂ defined in (10-32), let â(i) and b̂(i) denote the ith columns of Â⁻¹ and B̂⁻¹, respectively. Since Û = Â x(1) and V̂ = B̂ x(2), we can write

x(1)  =  Â⁻¹  Û            x(2)  =  B̂⁻¹  V̂            (10-36)
(p×1)   (p×p)(p×1)         (q×1)   (q×q)(q×1)

Because sample Cov(Û, V̂) = Â S12 B̂′ is the (p×q) matrix [diag(ρ̂1*, ρ̂2*, ..., ρ̂p*) ⋮ 0], sample Cov(Û) = Â S11 Â′ = I (p×p), and sample Cov(V̂) = B̂ S22 B̂′ = I (q×q), we have

S12 = Â⁻¹ [diag(ρ̂1*, ρ̂2*, ..., ρ̂p*) ⋮ 0] (B̂⁻¹)′
    = ρ̂1* â(1) b̂(1)′ + ρ̂2* â(2) b̂(2)′ + ... + ρ̂p* â(p) b̂(p)′

S11 = Â⁻¹ (Â⁻¹)′ = â(1) â(1)′ + â(2) â(2)′ + ... + â(p) â(p)′            (10-37)

S22 = B̂⁻¹ (B̂⁻¹)′ = b̂(1) b̂(1)′ + b̂(2) b̂(2)′ + ... + b̂(q) b̂(q)′

Since x(1) = Â⁻¹ Û and Û has sample covariance I, the first r columns of Â⁻¹ contain the sample covariances of the first r canonical variates Û1, Û2, ..., Ûr with their component variables X1(1), X2(1), ..., Xp(1). Similarly, the first r columns of B̂⁻¹ contain the sample covariances of V̂1, V̂2, ..., V̂r with their component variables.
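The identities in (10-37) can be checked numerically. A minimal sketch, using the standardized coefficient matrices reported later in Example 10.6 for the white leghorn fowl data (so the head and leg correlation matrices R11 and R22 play the roles of S11 and S22):

```python
import numpy as np

# Check of (10-37) for standardized observations, using the coefficient
# matrices reported in Example 10.6 for the white leghorn fowl data.
A_z = np.array([[0.781, 0.345],
                [-0.856, 1.106]])
B_z = np.array([[0.060, 0.944],
                [-2.648, 2.475]])
R11 = np.array([[1.0, 0.505], [0.505, 1.0]])    # head variables
R22 = np.array([[1.0, 0.926], [0.926, 1.0]])    # leg variables

A_inv = np.linalg.inv(A_z)    # columns: sample Cov(z(1), U_i)
B_inv = np.linalg.inv(B_z)    # columns: sample Cov(z(2), V_i)

# R11 = sum_i a(i) a(i)' and R22 = sum_i b(i) b(i)', i.e. the inverse
# coefficient matrices reproduce the correlation matrices, up to the
# rounding of the published coefficients.
print(np.round(A_inv @ A_inv.T, 2))
print(np.round(B_inv @ B_inv.T, 2))
```

The agreement is only to a few decimal places here because the published coefficients are themselves rounded.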
If only the first r canonical pairs are used, so that for instance,

x̂(1) = [â(1) ⋮ â(2) ⋮ ... ⋮ â(r)] [Û1, Û2, ..., Ûr]′
and                                                            (10-38)
x̂(2) = [b̂(1) ⋮ b̂(2) ⋮ ... ⋮ b̂(r)] [V̂1, V̂2, ..., V̂r]′

then S12 is approximated by sample Cov(x̂(1), x̂(2)).
Continuing, we see that the matrices of errors of approximation are

S11 − (â(1)â(1)′ + â(2)â(2)′ + ... + â(r)â(r)′) = â(r+1)â(r+1)′ + ... + â(p)â(p)′

S22 − (b̂(1)b̂(1)′ + b̂(2)b̂(2)′ + ... + b̂(r)b̂(r)′) = b̂(r+1)b̂(r+1)′ + ... + b̂(q)b̂(q)′            (10-39)

S12 − (ρ̂1* â(1)b̂(1)′ + ρ̂2* â(2)b̂(2)′ + ... + ρ̂r* â(r)b̂(r)′) = ρ̂*r+1 â(r+1)b̂(r+1)′ + ... + ρ̂p* â(p)b̂(p)′

The approximation error matrices (10-39) may be interpreted as descriptive summaries of how well the first r sample canonical variates reproduce the sample covariance matrices. Patterns of large entries in the rows and/or columns of the approximation error matrices indicate a poor "fit" to the corresponding variable(s).
Ordinarily, the first r variates do a better job of reproducing the elements of S12 = S21′ than the elements of S11 or S22. Mathematically, this occurs because the residual matrix in the former case is directly related to the smallest p − r sample canonical correlations. These correlations are usually all close to zero. On the other hand, the residual matrices associated with the approximations to S11 and S22 depend only on the last p − r and q − r coefficient vectors â(i), b̂(i). The elements in these vectors may be relatively large, and hence, the residual matrices can have "large" entries.
For standardized observations, Rkl replaces Skl and â_z(k), b̂_z(l) replace â(k), b̂(l) in (10-39).

Example 10.6 (Calculating matrices of errors of approximation)

In Example 10.4, we obtained the canonical correlations between the two head and the two leg variables for white leghorn fowl. Starting with the sample correlation matrix

R = [R11 ⋮ R12; R21 ⋮ R22] =
[ 1.0    .505  ⋮  .569   .602
  .505  1.0    ⋮  .422   .467
  .569   .422  ⋮ 1.0     .926
  .602   .467  ⋮  .926  1.0  ]
we obtained the two sets of canonical correlations and variables

ρ̂1* = .631      Û1 = .781 z1(1) + .345 z2(1)
                V̂1 = .060 z1(2) + .944 z2(2)
and
ρ̂2* = .057      Û2 = −.856 z1(1) + 1.106 z2(1)
                V̂2 = −2.648 z1(2) + 2.475 z2(2)

where zi(1), i = 1, 2 and zi(2), i = 1, 2 are the standardized data values for sets 1 and 2, respectively.
We first calculate (see Panel 10.1)

Â_z⁻¹ = [ .781   .345 ]⁻¹ = [ .9548  −.2974 ]
        [−.856  1.106 ]     [ .7388   .6739 ]

B̂_z⁻¹ = [ .9343  −.3564 ]
        [ .9997   .0227 ]

Consequently, the matrices of errors of approximation created by using only the first canonical pair are

R12 − sample Cov(ẑ(1), ẑ(2)) = (.057) [−.2974] [−.3564  .0227] = [ .006  −.000 ]
                                      [ .6739]                   [−.014   .001 ]

R11 − sample Cov(ẑ(1)) = [−.2974] [−.2974  .6739] = [ .088  −.200 ]
                         [ .6739]                   [−.200   .454 ]

R22 − sample Cov(ẑ(2)) = [−.3564] [−.3564  .0227] = [ .127  −.008 ]
                         [ .0227]                   [−.008   .001 ]

where ẑ(1), ẑ(2) are given by (10-38) with r = 1 and â_z(i), b̂_z(i) replacing â(i), b̂(i), respectively.
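The computation in this example can be reproduced directly from the quantities reported above; a minimal sketch:

```python
import numpy as np

# Sketch of the r = 1 error matrices in (10-39), with the Example 10.6
# quantities for the standardized fowl data.
rho1, rho2 = 0.631, 0.057
A_inv = np.linalg.inv(np.array([[0.781, 0.345], [-0.856, 1.106]]))
B_inv = np.linalg.inv(np.array([[0.060, 0.944], [-2.648, 2.475]]))
R12 = np.array([[0.569, 0.602], [0.422, 0.467]])

a1, a2 = A_inv[:, 0], A_inv[:, 1]
b1, b2 = B_inv[:, 0], B_inv[:, 1]

# Error left after approximating R12 with the first canonical pair only;
# by (10-39) it equals rho2 * a(2) b(2)'.
err12 = R12 - rho1 * np.outer(a1, b1)
print(np.round(err12, 3))
```

The small entries of err12 confirm that the first pair reproduces R12 well, while the analogous residuals for R11 and R22 (the outer products of a2 and b2 with themselves) are much larger, as discussed above.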
We see that the first pair of canonical variables effectively summarizes (reproduces) the intraset correlations in R12. However, the individual variates are not particularly effective summaries of the sample variability in the original z(1) and z(2) sets, respectively. This is especially true for Û1. •

Proportions of Explained Sample Variance

When the observations are standardized, the sample covariance matrices Skl are correlation matrices Rkl. The canonical coefficient vectors are the rows of the matrices Â_z and B̂_z, and the columns of Â_z⁻¹ and B̂_z⁻¹ are the sample correlations between the canonical variates and their component variables.
Specifically,

sample Cov(z(1), Û) = sample Cov(Â_z⁻¹ Û, Û) = Â_z⁻¹
sample Cov(z(2), V̂) = sample Cov(B̂_z⁻¹ V̂, V̂) = B̂_z⁻¹

so

Â_z⁻¹ = [â_z(1) ⋮ â_z(2) ⋮ ... ⋮ â_z(p)] =
  [ r(Û1, z1(1))   r(Û2, z1(1))   ...   r(Ûp, z1(1)) ]
  [ r(Û1, z2(1))   r(Û2, z2(1))   ...   r(Ûp, z2(1)) ]
  [     ...            ...                  ...      ]
  [ r(Û1, zp(1))   r(Û2, zp(1))   ...   r(Ûp, zp(1)) ]
                                                            (10-40)
B̂_z⁻¹ = [b̂_z(1) ⋮ b̂_z(2) ⋮ ... ⋮ b̂_z(q)] =
  [ r(V̂1, z1(2))   r(V̂2, z1(2))   ...   r(V̂q, z1(2)) ]
  [     ...            ...                  ...      ]
  [ r(V̂1, zq(2))   r(V̂2, zq(2))   ...   r(V̂q, zq(2)) ]

where r(Ûi, zk(1)) and r(V̂i, zk(2)) are the sample correlation coefficients between the quantities with subscripts.
Using (10-37) with standardized observations, we obtain

Total (standardized) sample variance in first set
  = tr(R11) = tr(â_z(1)â_z(1)′ + â_z(2)â_z(2)′ + ... + â_z(p)â_z(p)′) = p            (10-41a)

Total (standardized) sample variance in second set
  = tr(R22) = tr(b̂_z(1)b̂_z(1)′ + b̂_z(2)b̂_z(2)′ + ... + b̂_z(q)b̂_z(q)′) = q            (10-41b)

Since the correlations in the first r columns of Â_z⁻¹ and B̂_z⁻¹ involve only the sample canonical variates Û1, Û2, ..., Ûr and V̂1, V̂2, ..., V̂r, respectively, we define the contributions of the first r canonical variates to the total (standardized) sample variances as

tr(â_z(1)â_z(1)′ + ... + â_z(r)â_z(r)′) = Σ(i = 1 to r) Σ(k = 1 to p) r²(Ûi, zk(1))
and
tr(b̂_z(1)b̂_z(1)′ + ... + b̂_z(r)b̂_z(r)′) = Σ(i = 1 to r) Σ(k = 1 to q) r²(V̂i, zk(2))

The proportions of total (standardized) sample variances "explained by" the first r canonical variates then become

R²(z(1) | Û1, Û2, ..., Ûr) = (proportion of total standardized sample variance in first set explained by Û1, Û2, ..., Ûr)
  = tr(â_z(1)â_z(1)′ + ... + â_z(r)â_z(r)′) / p
  = (1/p) Σ(i = 1 to r) Σ(k = 1 to p) r²(Ûi, zk(1))
and                                                            (10-42)
R²(z(2) | V̂1, V̂2, ..., V̂r) = (proportion of total standardized sample variance in second set explained by V̂1, V̂2, ..., V̂r)
  = tr(b̂_z(1)b̂_z(1)′ + ... + b̂_z(r)b̂_z(r)′) / q
  = (1/q) Σ(i = 1 to r) Σ(k = 1 to q) r²(V̂i, zk(2))

Descriptive measures (10-42) provide some indication of how well the canonical variates represent their respective sets. They provide single-number descriptions of the matrices of errors. In particular,

(1/p) tr[R11 − â_z(1)â_z(1)′ − â_z(2)â_z(2)′ − ... − â_z(r)â_z(r)′] = 1 − R²(z(1) | Û1, Û2, ..., Ûr)

according to (10-41) and (10-42).
Example 10.7 (Calculating proportions of sample variance explained by canonical variates)
Consider the job characteristics-job satisfaction data discussed in Example 10.5. Using the table of sample correlation coefficients presented in that example, we find that

R²(z(1) | Û1) = (1/5) Σ(k = 1 to 5) r²(Û1, zk(1))
  = [(.83)² + (.74)² + (.75)² + (.62)² + (.85)²] / 5 = .58

and

R²(z(2) | V̂1) = (1/7) Σ(k = 1 to 7) r²(V̂1, zk(2))
  = [(.75)² + (.65)² + (.39)² + (.37)² + (.65)² + (.80)² + (.50)²] / 7 = .37

The first sample canonical variate Û1 of the job characteristics set accounts for 58% of the set's total sample variance. The first sample canonical variate V̂1 of the job satisfaction set explains 37% of the set's total sample variance. We might thus infer that Û1 is a "better" representative of its set than V̂1 is of its set. The interested reader may wish to see how well Û1 and V̂1 reproduce the correlation matrices R11 and R22, respectively. [See (10-39).] •
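The two proportions in this example follow directly from the correlation table of Example 10.5; a minimal sketch:

```python
# Sketch of (10-42) with r = 1, using the correlations tabulated in
# Example 10.5 between each variable and its set's first canonical variate.
r_u1 = [0.83, 0.74, 0.75, 0.62, 0.85]                   # corr(U1, z_k(1))
r_v1 = [0.75, 0.65, 0.39, 0.37, 0.65, 0.80, 0.50]       # corr(V1, z_k(2))

prop_set1 = sum(r * r for r in r_u1) / len(r_u1)
prop_set2 = sum(r * r for r in r_v1) / len(r_v1)
print(prop_set1, prop_set2)   # roughly .58 and .37, matching Example 10.7
```

Each proportion is just the average squared correlation between the variate and the variables of its own set, which is why (10-42) reduces to such a short calculation.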
10.6 LARGE SAMPLE INFERENCES

When Σ12 = 0, a′X(1) and b′X(2) have covariance a′Σ12 b = 0 for all vectors a and b. Consequently, all the canonical correlations must be zero, and there is no point in pursuing a canonical correlation analysis. The next result provides a way of testing Σ12 = 0, for large samples.

Result 10.3. Let Xj, j = 1, 2, ..., n be a random sample from an N(p+q)(μ, Σ) population with

Σ = [ Σ11 (p×p) ⋮ Σ12 (p×q) ]
    [ Σ21 (q×p) ⋮ Σ22 (q×q) ]

Then the likelihood ratio test of

H0: Σ12 = 0 (p×q)    versus    H1: Σ12 ≠ 0 (p×q)

rejects H0 for large values of

−2 ln Λ = n ln( |S11| |S22| / |S| ) = −n ln Π(i = 1 to p) (1 − ρ̂i*²)            (10-43)

where

S = [ S11 ⋮ S12 ]
    [ S21 ⋮ S22 ]

is the unbiased estimator of Σ. For large n, the test statistic (10-43) is approximately distributed as a chi-square random variable with pq d.f.

Proof. See Kshirsagar [8]. •

The likelihood ratio statistic (10-43) compares the sample generalized variance under H0, namely,

| S11   0  |
|  0′  S22 | = |S11| |S22|

with the unrestricted generalized variance |S|.
Bartlett [3] suggests replacing the multiplicative factor n in the likelihood ratio statistic with the factor n − 1 − (p + q + 1)/2 to improve the χ² approximation to the sampling distribution of −2 ln Λ. Thus, for n large, we

Reject H0: Σ12 = 0 (ρ1* = ρ2* = ... = ρp* = 0) at significance level α if

−(n − 1 − (p + q + 1)/2) ln Π(i = 1 to p) (1 − ρ̂i*²) > χ²pq(α)            (10-44)

where χ²pq(α) is the upper (100α)th percentile of a chi-square distribution with pq d.f.
If the null hypothesis H0: Σ12 = 0 (ρ1* = ρ2* = ... = ρp* = 0) is rejected, it is natural to examine the "significance" of the individual canonical correlations. Since the canonical correlations are ordered from the largest to the smallest, we can begin by assuming that the first canonical correlation is nonzero and the remaining p − 1 canonical correlations are zero. If this hypothesis is rejected, we assume that the first two canonical correlations are nonzero, but the remaining p − 2 canonical correlations are zero, and so forth.
Let the implied sequence of hypotheses be

H0(k): ρ1* ≠ 0, ρ2* ≠ 0, ..., ρk* ≠ 0, ρ*k+1 = ... = ρp* = 0
                                                            (10-45)
H1(k): ρi* ≠ 0, for some i ≥ k + 1

Bartlett [2] has argued that the kth hypothesis in (10-45) can be tested by the likelihood ratio criterion. Specifically,

Reject H0(k) at significance level α if

−(n − 1 − (p + q + 1)/2) ln Π(i = k+1 to p) (1 − ρ̂i*²) > χ²(p−k)(q−k)(α)            (10-46)

where χ²(p−k)(q−k)(α) is the upper (100α)th percentile of a chi-square distribution with (p − k)(q − k) d.f. We point out that the test statistic in (10-46) involves Π(i = k+1 to p) (1 − ρ̂i*²), the "residual" after the first k sample canonical correlations have been removed from the total criterion Λ^(2/n) = Π(i = 1 to p) (1 − ρ̂i*²).
If the members of the sequence H0, H0(1), H0(2), and so forth, are tested one at a time until H0(k) is not rejected, the overall significance level is not α and, in fact, would be difficult to determine. Another defect of this procedure is the tendency it induces to conclude that a null hypothesis is correct simply because it is not rejected.
To summarize, the overall test of significance in Result 10.3 is useful for multivariate normal data. The sequential tests implied by (10-46) should be interpreted with caution and are, perhaps, best regarded as rough guides for selecting the number of important canonical variates.
Example 10.8 (Testing the significance of the canonical correlations for the job satisfaction data)
Test the significance of the canonical correlations exhibited by the job characteristics-job satisfaction data introduced in Example 10.5.
All the test statistics of immediate interest are summarized in the table on page 572. From Example 10.5, n = 784, p = 5, q = 7, ρ̂1* = .55, ρ̂2* = .23, ρ̂3* = .12, ρ̂4* = .08, and ρ̂5* = .05.
Assuming multivariate normal data, we find that the first two canonical correlations, ρ1* and ρ2*, appear to be nonzero, although with the very large sample size, small deviations from zero will show up as statistically significant. From a practical point of view, the second (and subsequent) sample canonical correlations can probably be ignored, since (1) they are reasonably small in magnitude and (2) the corresponding canonical variates explain very little of the sample variation in the variable sets X(1) and X(2). •

The distribution theory associated with the sample canonical correlations and the sample canonical variate coefficients is extremely complex (apart from the p = 1 and q = 1 situations), even in the null case, Σ12 = 0. The reader interested in the distribution theory is referred to Kshirsagar [8].
TEST RESULTS

Null hypothesis: H0: Σ12 = 0 (ρ1* = ρ2* = ... = ρ5* = 0)
  Observed test statistic (Bartlett correction):
    −(784 − 1 − (5 + 7 + 1)/2) ln Π(i = 1 to 5) (1 − ρ̂i*²) = −776.5 ln(.6453) = 340.1
  Degrees of freedom: pq = 5(7) = 35
  Upper 1% point of χ² distribution: χ²35(.01) = 57.34
  Conclusion: Reject H0

Null hypothesis: H0(1): ρ1* ≠ 0, ρ2* = ρ3* = ρ4* = ρ5* = 0
  Observed test statistic (Bartlett correction):
    −(784 − 1 − (5 + 7 + 1)/2) ln Π(i = 2 to 5) (1 − ρ̂i*²) = 60.4
  Degrees of freedom: (p − 1)(q − 1) = 24
  Upper 1% point of χ² distribution: χ²24(.01) = 42.98
  Conclusion: Reject H0(1)

Null hypothesis: H0(2): ρ1* ≠ 0, ρ2* ≠ 0, ρ3* = ρ4* = ρ5* = 0
  Observed test statistic (Bartlett correction):
    −(784 − 1 − (5 + 7 + 1)/2) ln Π(i = 3 to 5) (1 − ρ̂i*²) = 18.2
  Degrees of freedom: (p − 2)(q − 2) = 15
  Upper 1% point of χ² distribution: χ²15(.01) = 30.58
  Conclusion: Do not reject H0(2)
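The statistics in the table can be reproduced from (10-46). A sketch using the summary values of Example 10.8; the upper 1% chi-square points are standard table values:

```python
from math import log

# Sketch of the sequential Bartlett tests (10-46) for the job satisfaction
# data of Example 10.8: n = 784, p = 5, q = 7.
n, p, q = 784, 5, 7
rho = [0.55, 0.23, 0.12, 0.08, 0.05]
crit_01 = {35: 57.34, 24: 42.98, 15: 30.58}   # upper 1% chi-square points

factor = n - 1 - (p + q + 1) / 2              # Bartlett's multiplier, 776.5
for k in range(3):
    stat = -factor * sum(log(1 - r * r) for r in rho[k:])
    df = (p - k) * (q - k)
    print(k, round(stat, 1), df,
          "reject" if stat > crit_01[df] else "do not reject")
```

The loop reproduces the three rows of the table: the k = 0 and k = 1 statistics exceed their 1% critical values, while the k = 2 statistic does not.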
EXERCISES

10.1. Consider the covariance matrix given in Example 10.3:

Cov([X1(1); X2(1); X1(2); X2(2)]) =
[ 100    0  ⋮   0     0
    0    1  ⋮  .95    0
    0   .95 ⋮   1     0
    0    0  ⋮   0   100 ]

Verify that the first pair of canonical variates are U1 = X2(1), V1 = X1(2), with canonical correlation ρ1* = .95.

10.2. The (2×1) random vectors X(1) and X(2) have the joint mean vector and joint covariance matrix
μ = [μ(1); μ(2)] = [−3; 2; 0; 1]

Σ = [ Σ11 ⋮ Σ12 ] = [  8    2  ⋮   3    1
      Σ21 ⋮ Σ22 ]     2    5  ⋮  −1    3
                      3   −1  ⋮   6   −2
                      1    3  ⋮  −2    7 ]
(a) Calculate the canonical correlations ρ1*, ρ2*.
(b) Determine the canonical variate pairs (U1, V1) and (U2, V2).
(c) Let U = [U1, U2]′ and V = [V1, V2]′. From first principles, evaluate

E([U; V])   and   Cov([U; V])

Compare your results with the properties in Result 10.1.

10.3. Let Z(1) = V11^(-1/2) (X(1) − μ(1)) and Z(2) = V22^(-1/2) (X(2) − μ(2)) be the standardized variables. If ρ1*, ρ2*, ..., ρp* are the canonical correlations for the sets X(1), X(2) and (Ui, Vi) = (ai′X(1), bi′X(2)), i = 1, 2, ..., p, are the associated canonical variates, determine the canonical correlations and canonical variates for the sets Z(1), Z(2). That is, express the canonical correlations and canonical variate coefficient vectors for the Z(1), Z(2) sets in terms of those for the X(1), X(2) sets.

10.4. (Alternative calculation of canonical correlations and variates.) Show that, if λi is an eigenvalue of Σ11^(-1/2) Σ12 Σ22⁻¹ Σ21 Σ11^(-1/2) with associated eigenvector ei, then λi is also an eigenvalue of Σ11⁻¹ Σ12 Σ22⁻¹ Σ21, with eigenvector Σ11^(-1/2) ei.
Hint: |Σ11^(-1/2) Σ12 Σ22⁻¹ Σ21 Σ11^(-1/2) − λi I| = 0 implies that

0 = |Σ11^(-1/2)| |Σ11^(-1/2) Σ12 Σ22⁻¹ Σ21 Σ11^(-1/2) − λi I| |Σ11^(1/2)| = |Σ11⁻¹ Σ12 Σ22⁻¹ Σ21 − λi I|
10.5. Use the information in Example 10.1.
(a) Find the eigenvalues of Σ11⁻¹ Σ12 Σ22⁻¹ Σ21 and verify that these eigenvalues are the same as the eigenvalues of Σ11^(-1/2) Σ12 Σ22⁻¹ Σ21 Σ11^(-1/2).
(b) Determine the second pair of canonical variates (U2, V2) and verify, from first principles, that their correlation is the second canonical correlation ρ2* = .03.

10.6. Show that the canonical correlations are invariant under nonsingular linear transformations of the X(1), X(2) variables of the form C X(1) and D X(2), where C (p×p) and D (q×q) are nonsingular.
Hint: Consider

Cov([C X(1); D X(2)]) = [ C Σ11 C′ ⋮ C Σ12 D′ ]
                        [ D Σ21 C′ ⋮ D Σ22 D′ ]

Consider any linear combination a1′(C X(1)) = a′X(1) with a′ = a1′C. Similarly, consider b1′(D X(2)) = b′X(2) with b′ = b1′D. The choices a1′ = e′ Σ11^(-1/2) C⁻¹ and b1′ = f′ Σ22^(-1/2) D⁻¹ give the maximum correlation.

10.7. Let ρ12 = [ρ ρ; ρ ρ] and ρ11 = ρ22 = [1 ρ; ρ 1], corresponding to the equal-correlation structure where X(1) and X(2) each have two components.
(a) Determine the canonical variates corresponding to the nonzero canonical correlation.
(b) Generalize the results in Part a to the case where X(1) has p components and X(2) has q components.
Hint: ρ12 = ρ 1 1′, where 1 is a (p×1) column vector of 1's and 1′ is a (1×q) row vector of 1's. Note that ρ11 1 = [1 + (p − 1)ρ] 1 so ρ11^(-1/2) 1 = [1 + (p − 1)ρ]^(-1/2) 1.

10.8. (Correlation for angular measurement.) Some observations, such as wind direction, are in the form of angles. An angle θ2 can be represented as the pair X(2) = [cos(θ2), sin(θ2)]′.
(a) Show that b′X(2) = √(b1² + b2²) cos(θ2 − β), where b1/√(b1² + b2²) = cos(β) and b2/√(b1² + b2²) = sin(β).
Hint: cos(θ2 − β) = cos(θ2) cos(β) + sin(θ2) sin(β).
(b) Let X(1) have a single component X1(1). Show that the single canonical correlation is

ρ1* = max over β of Corr(X1(1), cos(θ2 − β))

Selecting the canonical variable V1 amounts to selecting a new origin β for the angle θ2. (See Johnson and Wehrly [7].)
(c) Let X1(1) be ozone (in parts per million) and θ2 = wind direction measured from the north. Nineteen observations made in downtown Milwaukee, Wisconsin, give the sample correlation matrix. Find the sample canonical correlation ρ̂1* and the canonical variate V̂1 representing the new origin β̂.
X (l)
Exercises
575
X(l)
(d) Suppose the observations on X(1) are also angular measurements, of the form X(1) = [cos(θ1), sin(θ1)]′. Then a′X(1) = √(a1² + a2²) cos(θ1 − α). Show that

ρ1* = max over α, β of Corr(cos(θ1 − α), cos(θ2 − β))

(e) Twenty-one observations on the 6:00 A.M. and noon wind directions give the correlation matrix

        cos(θ1)  sin(θ1)  cos(θ2)  sin(θ2)
R =   [  1.0     −.291     .440     .372
        −.291     1.0       ·       .243
         .440      ·       1.0      .181
         .372     .243     .181    1.0   ]

Find the sample canonical correlation ρ̂1* and Û1, V̂1.

The following exercises may require a computer.

10.9. H. Hotelling [5] reports that n = 140 seventh-grade children received four tests on X1(1) = reading speed, X2(1) = reading power, X1(2) = arithmetic speed, and X2(2) = arithmetic power. The correlations for performance are

R = [ R11 ⋮ R12 ] = [ 1.0     .6328 ⋮  .2412   .0586
      R21 ⋮ R22 ]     .6328  1.0    ⋮ −.0553   .0655
                      .2412 −.0553  ⋮  1.0     .4248
                      .0586   .0655 ⋮  .4248  1.0   ]

(a) Find all the sample canonical correlations and the sample canonical variates.
(b) Stating any assumptions you make, test the hypotheses

H0: Σ12 = ρ12 = 0    (ρ1* = ρ2* = 0)
H1: Σ12 = ρ12 ≠ 0

at the α = .05 level of significance. If H0 is rejected, test

H0(1): ρ1* ≠ 0, ρ2* = 0
H1(1): ρ2* ≠ 0

with a significance level of α = .05. Does reading ability (as measured by the two tests) correlate with arithmetic ability (as measured by the two tests)? Discuss.
(c) Evaluate the matrices of approximation errors for R11, R22, and R12 determined by the first sample canonical variate pair Û1, V̂1.

10.10. In a study of poverty, crime, and deterrence, Parker and Smith [10] report certain summary crime statistics in various states for the years 1970 and 1973. A portion of their sample correlation matrix is

R = [ R11 ⋮ R12 ] = [  1.0    .615 ⋮ −.111  −.266
      R21 ⋮ R22 ]      .615  1.0   ⋮ −.195  −.085
                      −.111 −.195  ⋮  1.0   −.269
                      −.266 −.085  ⋮ −.269   1.0  ]
The variables are

X1(1) = 1973 nonprimary homicides
X2(1) = 1973 primary homicides (homicides involving family or acquaintances)
X1(2) = 1970 severity of punishment (median months served)
X2(2) = 1970 certainty of punishment (number of admissions to prison divided by number of homicides)

(a) Find the sample canonical correlations.
(b) Determine the first canonical pair Û1, V̂1 and interpret these quantities.

10.11. Example 8.5 presents the correlation matrix obtained from n = 100 successive weekly rates of return for five stocks. Perform a canonical correlation analysis with X(1) = [X1(1), X2(1), X3(1)]′, the rates of return for the chemical companies, and X(2) = [X1(2), X2(2)]′, the rates of return for the oil companies.

10.12. A random sample of n = 70 families will be surveyed to determine the association between certain "demographic" variables and certain "consumption" variables. Let

Criterion set:   X1(1) = annual frequency of dining at a restaurant
                 X2(1) = annual frequency of attending a movie
Predictor set:   X1(2) = age of head of household
                 X2(2) = annual family income
                 X3(2) = educational level of head of household

Suppose 70 observations on the preceding variables give the sample correlation matrix

R = [ R11 ⋮ R12 ] = [ 1.0
      R21 ⋮ R22 ]     .80  1.0
                      .26   .33  1.0
                      .67   .59   .37  1.0
                      .34   .34   .21   .35  1.0 ]

(a) Determine the sample canonical correlations, and test the hypothesis H0: Σ12 = 0 (or, equivalently, ρ12 = 0) at the α = .05 level. If H0 is rejected, test for the significance (α = .05) of the first canonical correlation.
(b) Using standardized variables, construct the canonical variates corresponding to the "significant" canonical correlation(s).
(c) Using the results in Parts a and b, prepare a table showing the canonical variate coefficients (for "significant" canonical correlations) and the sample correlations of the canonical variates with their component variables.
(d) Given the information in (c), interpret the canonical variates.
(e) Do the demographic variables have something to say about the consumption variables? Do the consumption variables provide much information about the demographic variables?
10.13. Waugh [12] provides information about n = 138 samples of Canadian hard red spring wheat and the flour made from the samples. The p = 5 wheat measurements (in standardized form) were

z1(1) = kernel texture
z2(1) = test weight
z3(1) = damaged kernels
z4(1) = foreign material
z5(1) = crude protein in the wheat

The q = 4 (standardized) flour measurements were

z1(2) = wheat per barrel of flour
z2(2) = ash in flour
z3(2) = crude protein in flour
z4(2) = gluten quality index

The sample correlation matrix was

R = [ R11 ⋮ R12 ] =
    [ R21 ⋮ R22 ]

[ 1.0
  .754  1.0
 −.690 −.712  1.0
 −.446 −.515   .323  1.0
  .692  .412  −.444 −.334  1.0
 −.605 −.722   .737   .527 −.383  1.0
 −.479 −.419   .361   .461 −.505   .251  1.0
  .780  .542  −.546  −.393   .737  −.490 −.434  1.0
 −.152 −.102   .172  −.019  −.148   .250 −.079 −.163  1.0 ]

(a) Find the sample canonical variates corresponding to significant (at the α = .01 level) canonical correlations.
(b) Interpret the first sample canonical variates Û1, V̂1. Do they in some sense represent the overall quality of the wheat and flour, respectively?
(c) What proportion of the total sample variance of the first set z(1) is explained by the canonical variate Û1? What proportion of the total sample variance of the z(2) set is explained by the canonical variate V̂1? Discuss your answers.
10.14. Consider the correlation matrix of profitability measures given in Exercise 9.15. Let X(1) be the vector of variables representing accounting measures of profitability, and let X(2) = [X1(2), X2(2)]′ be the vector of variables representing the two market measures of profitability. Partition the sample correlation matrix accordingly, and perform a canonical correlation analysis. Specifically,
(a) Determine the first sample canonical variates Û1, V̂1 and their correlation. Interpret these canonical variates.
(b) Let Z(1) and Z(2) be the sets of standardized variables corresponding to X(1) and X(2), respectively. What proportion of the total sample variance of Z(1) is explained by the canonical variate Û1? What proportion of the total sample variance of Z(2) is explained by the canonical variate V̂1? Discuss your answers.

10.15. Observations on four measures of stiffness are given in Table 4.3 and discussed in Example 4.14. Use the data in the table to construct the sample covariance matrix S. Let X(1) = [X1(1), X2(1)]′ be the vector of variables representing the dynamic measures of stiffness (shock wave, vibration), and let X(2) = [X1(2), X2(2)]′ be the vector of variables representing the static measures of stiffness. Perform a canonical correlation analysis of these data.

10.16. Andrews and Herzberg [1] give data obtained from a study of a comparison of nondiabetic and diabetic patients. Three primary variables,

X1(1) = glucose intolerance
X2(1) = insulin response to oral glucose
X3(1) = insulin resistance

and two secondary variables,

X1(2) = relative weight
X2(2) = fasting plasma glucose

were measured. The data for n = 46 nondiabetic patients yield the covariance matrix

S = [ S11 ⋮ S12 ] = [ 1106.000   396.700   108.400 ⋮    .787    26.230
      S21 ⋮ S22 ]      396.700  2382.000  1143.000 ⋮  −.214   −23.960
                       108.400  1143.000  2136.000 ⋮   2.189  −20.840
                          .787     −.214     2.189 ⋮    .016      .216
                        26.230   −23.960   −20.840 ⋮    .216    70.560 ]

Determine the sample canonical variates and their correlations. Interpret these quantities. Are the first sample canonical variates good summary measures of their respective sets of variables? Explain. Test for the significance of the canonical relations with α = .05.

10.17. Data concerning a person's desire to smoke and psychological and physical state were collected for n = 110 subjects. The data were responses, coded 1
to 5, to each of 12 questions (variables). The four standardized measurements related to the desire to smoke are defined as

z1(1) = smoking 1 (first wording)
z2(1) = smoking 2 (second wording)
z3(1) = smoking 3 (third wording)
z4(1) = smoking 4 (fourth wording)

The eight standardized measurements related to the psychological and physical state are given by

z1(2) = concentration
z2(2) = annoyance
z3(2) = sleepiness
z4(2) = tenseness
z5(2) = alertness
z6(2) = irritability
z7(2) = tiredness
z8(2) = contentedness

The correlation matrix constructed from the data is

R = [ R11 ⋮ R12 ]
    [ R21 ⋮ R22 ]

where

R11 = [ 1.000   .785   .810   .775
         .785  1.000   .816   .813
         .810   .816  1.000   .845
         .775   .813   .845  1.000 ]

and the entries of the R12 and R22 blocks are not legible in the source.
Determine the sample canonical variates and their correlations. Interpret these quantities. Are the first sample canonical variates good summary measures of their respective sets of variables? Explain.
REFERENCES
1. Andrews, D. F., and A. M. Herzberg. Data. New York: Springer-Verlag, 1985.
2. Bartlett, M. S. "Further Aspects of the Theory of Multiple Regression." Proceedings of the Cambridge Philosophical Society, 34 (1938), 33-40.
3. Bartlett, M. S. "A Note on Tests of Significance in Multivariate Analysis." Proceedings of the Cambridge Philosophical Society, 35 (1939), 180-185.
4. Dunham, R. B. "Reaction to Job Characteristics: Moderating Effects of the Organization." Academy of Management Journal, 20, no. 1 (1977), 42-65. 5. Hotelling, H. "The Most Predictable Criterion." Journal of Educational Psychology , 26 (1935), 139-142. 6. Hotelling, H. "Relations between Two Sets of Variables." Biometrika, 28 (1936), 321-377. 7. Johnson, R. A. , and T. Wehrly. "Measures and Models for Angular Correlation and Angular-Linear Correlation." Journal of the Royal Statistical Society (B) , 39 (1977), 222-229. 8. Kshirsagar, A. M. Multivariate Analysis. New York: Marcel Dekker, Inc. , 1972. 9. Lawley, D. N. "Tests of Significance in Canonical Analysis." Biometrika, 46 (1959), 59-66. 10. Parker, R. N. , and M. D. Smith. "Deterrence, Poverty, and Type of Homicide." American Journal of Sociology, 85 (1979), 614-624. 11. Rencher, A. C. "Interpretation of Canonical Discriminant Functions, Canonical Variates and Principal Components." The American Statistician, 46 (1992) , 217-225. 12. Waugh, F. W. "Regression between Sets of Variates." Econometrica, 10 (1942), 290-310.
CHAPTER 11

Discrimination and Classification

11.1 INTRODUCTION
Discrimination and classification are multivariate techniques concerned with separating distinct sets of objects (or observations) and with allocating new objects (observations) to previously defined groups. Discriminant analysis is rather exploratory in nature. As a separative procedure, it is often employed on a one-time basis in order to investigate observed differences when causal relationships are not well understood. Classification procedures are less exploratory in the sense that they lead to well-defined rules, which can be used for assigning new objects. Classification ordinarily requires more problem structure than discrimination does.
Thus, the immediate goals of discrimination and classification, respectively, are as follows:

Goal 1. To describe, either graphically (in three or fewer dimensions) or algebraically, the differential features of objects (observations) from several known collections (populations). We try to find "discriminants" whose numerical values are such that the collections are separated as much as possible.

Goal 2. To sort objects (observations) into two or more labeled classes. The emphasis is on deriving a rule that can be used to optimally assign new objects to the labeled classes.

We shall follow convention and use the term discrimination to refer to Goal 1. This terminology was introduced by R. A. Fisher [9] in the first modern treatment of separative problems. A more descriptive term for this goal, however, is separation. We shall refer to the second goal as classification or allocation.
A function that separates objects may sometimes serve as an allocator, and, conversely, a rule that allocates objects may suggest a discriminatory procedure. In practice, Goals 1 and 2 frequently overlap, and the distinction between separation and allocation becomes blurred.
11.2 SEPARATION AND CLASSIFICATION FOR TWO POPULATIONS
To fix ideas, let us list situations in which one may be interested in (1) separating two classes of objects or (2) assigning a new object to one of two classes (or both). It is convenient to label the classes π1 and π2. The objects are ordinarily separated or classified on the basis of measurements on, for instance, p associated random variables X' = [X1, X2, ..., Xp]. The observed values of X differ to some extent from one class to the other.1 We can think of the totality of values from the first class as being the population of x values for π1 and those from the second class as the population of x values for π2. These two populations can then be described by probability density functions f1(x) and f2(x), and consequently, we can talk of assigning observations to populations or objects to classes interchangeably.

You may recall that some of the following examples of classification situations were introduced in Chapter 1.

    Populations π1 and π2                       Measured variables
    1. Solvent and distressed                   Total assets, cost of stocks and bonds,
       property-liability insurance             market value of stocks and bonds, loss
       companies.                               expenses, surplus, amount of premiums
                                                written.
    2. Nonulcer dyspeptics (those with          Measures of anxiety, dependence, guilt,
       upset stomach problems) and              perfectionism.
       controls ("normals").
    3. Federalist Papers written by James       Frequencies of different words and
       Madison and those written by             lengths of sentences.
       Alexander Hamilton.
    4. Two species of chickweed.                Sepal and petal length, petal cleft depth,
                                                bract length, scarious tip length, pollen
                                                diameter.
    5. Purchasers of a new product and          Education, income, family size, amount
       laggards (those "slow" to purchase).     of previous brand switching.
    6. Successful or unsuccessful (fail to      Entrance examination scores, high school
       graduate) college students.              grade-point average, number of high
                                                school activities.
    7. Males and females.                       Anthropological measurements, like
                                                circumference and volume on ancient
                                                skulls.
    8. Good and poor credit risks.              Income, age, number of credit cards,
                                                family size.
    9. Alcoholics and nonalcoholics.            Activity of monoamine oxidase enzyme,
                                                activity of adenylate cyclase enzyme.

We see from item 5, for example, that objects (consumers) are to be separated into two labeled classes ("purchasers" and "laggards") on the basis of observed values

1 If the values of X were not very different for objects in π1 and π2, there would be no problem; that is, the classes would be indistinguishable, and new objects could be assigned to either class indiscriminately.
of presumably relevant variables (education, income, and so forth). In the terminology of observation and population, we want to identify an observation of the form x' = [x1 (education), x2 (income), x3 (family size), x4 (amount of brand switching)] as population π1, purchasers, or population π2, laggards.

At this point, we shall concentrate on classification for two populations, returning to separation in Section 11.5.

Allocation or classification rules are usually developed from "learning" samples. Measured characteristics of randomly selected objects known to come from each of the two populations are examined for differences. Essentially, the set of all possible sample outcomes is divided into two regions, R1 and R2, such that if a new observation falls in R1, it is allocated to population π1, and if it falls in R2, we allocate it to population π2. Thus, one set of observed values favors π1, while the other set of values favors π2.

You may wonder at this point how it is we know that some observations belong to a particular population, but we are unsure about others. (This, of course, is what makes classification a problem!) Several conditions can give rise to this apparent anomaly (see [19]):

1. Incomplete knowledge of future performance. Examples: In the past, extreme values of certain financial variables were observed 2 years prior to a firm's subsequent bankruptcy. Classifying another firm as sound or distressed on the basis of observed values of these leading indicators may allow the officers to take corrective action, if necessary, before it is too late. A medical school applications office might want to classify an applicant as likely to become M.D. or unlikely to become M.D. on the basis of test scores and other college records. Here the actual determination can be made only at the end of several years of training.

2. "Perfect" information requires destroying the object. Example: The lifetime of a calculator battery is determined by using it until it fails, and the strength of a piece of lumber is obtained by loading it until it breaks. Failed products cannot be sold. One would like to classify products as good or bad (not meeting specifications) on the basis of certain preliminary measurements.

3. Unavailable or expensive information. Examples: It is assumed that certain of the Federalist Papers were written by James Madison or Alexander Hamilton because they signed them. Other Papers, however, were unsigned, and it is of interest to determine which of the two men wrote the unsigned Papers. Word frequencies and sentence lengths may help classify the disputed Papers. Many medical problems can be identified conclusively only by conducting an expensive operation. Usually, one would like to diagnose an illness from easily observed, yet potentially fallible, external symptoms. This approach helps avoid needless (and expensive) operations.

It should be clear from these examples that classification rules cannot usually provide an error-free method of assignment. This is because there may not be a clear distinction between the measured characteristics of the populations; that is, the groups
may overlap. It is then possible, for example, to incorrectly classify a π2 object as belonging to π1 or a π1 object as belonging to π2.

Example 11.1 (Discriminating owners from nonowners of riding mowers) Consider two groups in a city: π1, riding-mower owners, and π2, those without riding mowers, that is, nonowners. In order to identify the best sales prospects for an intensive sales campaign, a riding-mower manufacturer is interested in classifying families as prospective owners or nonowners on the basis of x1 = income and x2 = lot size. Random samples of n1 = 12 current owners and n2 = 12 current nonowners yield the values in Table 11.1.

    TABLE 11.1
    π1: Riding-mower owners              π2: Nonowners
    Income          Lot size             Income          Lot size
    (in $1000s)     (in 1000 sq ft)      (in $1000s)     (in 1000 sq ft)
     60.0            18.4                 75.0            19.6
     85.5            16.8                 52.8            20.8
     64.8            21.6                 64.8            17.2
     61.5            20.8                 43.2            20.4
     87.0            23.6                 84.0            17.6
    110.1            19.2                 49.2            17.6
    108.0            17.6                 66.0            16.0
     82.8            22.4                 59.4            18.4
     69.0            20.0                 47.4            16.4
     93.0            20.8                 33.0            18.8
     51.0            22.0                 51.0            14.0
     81.0            20.0                 63.0            14.8

These data are plotted in Figure 11.1. We see that riding-mower owners tend to have larger incomes and bigger lots than nonowners, although income seems to be a better "discriminator" than lot size. On the other hand, there is some overlap between the two groups. If, for example, we were to allocate those values of (x1, x2) that fall into region R1 (as determined by the solid line in the figure) to π1, mower owners, and those (x1, x2) values that fall into R2 to π2, nonowners, we would make some mistakes. Some riding-mower owners would be incorrectly classified as nonowners and, conversely, some nonowners as owners. The idea is to create a rule (regions R1 and R2) that minimizes the chances of making these mistakes. (See Exercise 11.2.) ∎

A good classification procedure should result in few misclassifications. In other words, the chances, or probabilities, of misclassification should be small. As we shall see, there are additional features that an "optimal" classification rule should possess. It may be that one class or population has a greater likelihood of occurrence than another because one of the two populations is relatively much larger than the other. For example, there tend to be more financially sound firms than bankrupt firms.
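The tendency visible in Table 11.1 can be checked numerically. A minimal sketch in Python with NumPy follows; the arrays restate the Table 11.1 values as decoded from this copy of the table, so verify them against the printed table before relying on them.

```python
import numpy as np

# Table 11.1: income in $1000s, lot size in 1000 sq ft.
owners = np.array([
    [60.0, 18.4], [85.5, 16.8], [64.8, 21.6], [61.5, 20.8],
    [87.0, 23.6], [110.1, 19.2], [108.0, 17.6], [82.8, 22.4],
    [69.0, 20.0], [93.0, 20.8], [51.0, 22.0], [81.0, 20.0],
])
nonowners = np.array([
    [75.0, 19.6], [52.8, 20.8], [64.8, 17.2], [43.2, 20.4],
    [84.0, 17.6], [49.2, 17.6], [66.0, 16.0], [59.4, 18.4],
    [47.4, 16.4], [33.0, 18.8], [51.0, 14.0], [63.0, 14.8],
])

x_bar1 = owners.mean(axis=0)     # sample mean vector for pi_1 (owners)
x_bar2 = nonowners.mean(axis=0)  # sample mean vector for pi_2 (nonowners)
# Owners average higher income and larger lots, as Figure 11.1 suggests.
```

Both components of x_bar1 exceed those of x_bar2, which is the numerical counterpart of the visual separation (and overlap) in the scatter plot.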
Figure 11.1 Income and lot size for riding-mower owners and nonowners. [Scatter plot: income in thousands of dollars on the horizontal axis, lot size in thousands of square feet on the vertical axis; solid dots mark riding-mower owners, open circles mark nonowners, and a solid line separates the regions R1 and R2.]
As another example, one species of chickweed may be more prevalent than another. An optimal classification rule should take these "prior probabilities of occurrence" into account. If we really believe that the (prior) probability of a financially distressed and ultimately bankrupted firm is very small, then one should classify a randomly selected firm as nonbankrupt unless the data overwhelmingly favors bankruptcy.

Another aspect of classification is cost. Suppose that classifying a π1 object as belonging to π2 represents a more serious error than classifying a π2 object as belonging to π1. Then one should be cautious about making the former assignment. As an example, failing to diagnose a potentially fatal illness is substantially more "costly" than concluding that the disease is present when, in fact, it is not. An optimal classification procedure should, whenever possible, account for the costs associated with misclassification.

Let f1(x) and f2(x) be the probability density functions associated with the p × 1 vector random variable X for the populations π1 and π2, respectively. An object with associated measurements x must be assigned to either π1 or π2. Let Ω be the sample space, that is, the collection of all possible observations x. Let R1 be that set of x values for which we classify objects as π1 and R2 = Ω − R1 be the remaining x values for which we classify objects as π2. Since every object must be assigned to one and only one of the two populations, the sets R1 and R2 are mutually exclusive and exhaustive. For p = 2, we might have a case like the one pictured in Figure 11.2.

The conditional probability, P(2|1), of classifying an object as π2 when, in fact, it is from π1 is

    P(2|1) = P(X ∈ R2 | π1) = ∫_{R2} f1(x) dx                              (11-1)

Similarly, the conditional probability, P(1|2), of classifying an object as π1 when it is really from π2 is

    P(1|2) = P(X ∈ R1 | π2) = ∫_{R1} f2(x) dx                              (11-2)
Figure 11.2 Classification regions for two populations.
The integral sign in (11-1) represents the volume formed by the density function f1(x) over the region R2. Similarly, the integral sign in (11-2) represents the volume formed by f2(x) over the region R1. This is illustrated in Figure 11.3 for the univariate case, p = 1.

Let p1 be the prior probability of π1 and p2 be the prior probability of π2, where p1 + p2 = 1. Then the overall probabilities of correctly or incorrectly classifying objects can be derived as the product of the prior and conditional classification probabilities:

    P(observation is correctly classified as π1)
        = P(observation comes from π1 and is correctly classified as π1)
        = P(X ∈ R1 | π1) P(π1) = P(1|1) p1

    P(observation is misclassified as π1)
        = P(observation comes from π2 and is misclassified as π1)
        = P(X ∈ R1 | π2) P(π2) = P(1|2) p2

    P(observation is correctly classified as π2)
        = P(observation comes from π2 and is correctly classified as π2)
        = P(X ∈ R2 | π2) P(π2) = P(2|2) p2
Figure 11.3 Misclassification probabilities for hypothetical classification regions when p = 1. [Two overlapping univariate densities f1(x) and f2(x); observations in R1 are classified as π1 and those in R2 as π2, and the overlapping tail areas give the misclassification probabilities P(2|1) and P(1|2).]
    P(observation is misclassified as π2)
        = P(observation comes from π1 and is misclassified as π2)
        = P(X ∈ R2 | π1) P(π1) = P(2|1) p1                                 (11-3)

Classification schemes are often evaluated in terms of their misclassification probabilities (see Section 11.4), but this ignores misclassification cost. For example, even a seemingly small probability such as .06 = P(2|1) may be too large if the cost of making an incorrect assignment to π2 is extremely high. A rule that ignores costs may cause problems.

The costs of misclassification can be defined by a cost matrix:

                                      Classify as:
                                      π1          π2
        True population:    π1        0           c(2|1)
                            π2        c(1|2)      0                        (11-4)

The costs are (1) zero for correct classification, (2) c(1|2) when an observation from π2 is incorrectly classified as π1, and (3) c(2|1) when a π1 observation is incorrectly classified as π2.

For any rule, the average, or expected cost of misclassification (ECM), is provided by multiplying the off-diagonal entries in (11-4) by their probabilities of occurrence, obtained from (11-3). Consequently,

    ECM = c(2|1) P(2|1) p1 + c(1|2) P(1|2) p2                              (11-5)

A reasonable classification rule should have an ECM as small, or nearly as small, as possible.

Result 11.1. The regions R1 and R2 that minimize the ECM are defined by the values x for which the following inequalities hold:

    R1:  f1(x)/f2(x)  ≥  (c(1|2)/c(2|1)) (p2/p1)

         (density ratio) ≥ (cost ratio)(prior probability ratio)

    R2:  f1(x)/f2(x)  <  (c(1|2)/c(2|1)) (p2/p1)                           (11-6)

         (density ratio) < (cost ratio)(prior probability ratio)

Proof. See Exercise 11.3. ∎

It is clear from (11-6) that the implementation of the minimum ECM rule requires (1) the density ratio evaluated at a new observation x0, (2) the cost ratio, and (3) the prior probability ratio.
The appearance of ratios in the definition of the optimal classification regions is significant. Often, it is much easier to specify the ratios than their component parts.

For example, it may be difficult to specify the costs (in appropriate units) of classifying a student as college material when, in fact, he or she is not, and classifying a student as not college material when, in fact, he or she is. The cost to taxpayers of educating a college dropout for 2 years, for instance, can be roughly assessed. The cost to the university and society of not educating a capable student is more difficult to determine. However, it may be that a realistic number for the ratio of these misclassification costs can be obtained. Whatever the units of measurement, not admitting a prospective college graduate may be five times more costly, over a suitable time horizon, than admitting an eventual dropout. In this case, the cost ratio is five.

It is interesting to consider the classification regions defined in (11-6) for some special cases:

(a) p2/p1 = 1 (equal prior probabilities):

        R1: f1(x)/f2(x) ≥ c(1|2)/c(2|1)        R2: f1(x)/f2(x) < c(1|2)/c(2|1)

(b) c(1|2)/c(2|1) = 1 (equal misclassification costs):

        R1: f1(x)/f2(x) ≥ p2/p1                R2: f1(x)/f2(x) < p2/p1

(c) p2/p1 = c(1|2)/c(2|1) = 1 or p2/p1 = 1/(c(1|2)/c(2|1)) (equal prior probabilities and equal misclassification costs):

        R1: f1(x)/f2(x) ≥ 1                    R2: f1(x)/f2(x) < 1          (11-7)

When the prior probabilities are unknown, they are often taken to be equal, and the minimum ECM rule involves comparing the ratio of the population densities to the ratio of the appropriate misclassification costs. If the misclassification cost ratio is indeterminate, it is usually taken to be unity, and the population density ratio is compared with the ratio of the prior probabilities. (Note that the prior probabilities are in the reverse order of the densities.) Finally, when both the prior probability and misclassification cost ratios are unity, or one ratio is the reciprocal of the other, the optimal classification regions are determined simply by comparing the values of the two density functions. In this case, if f1(x0)/f2(x0) ≥ 1, x0 is assigned to π1, and if f1(x0)/f2(x0) < 1, x0 is assigned to π2.

It is common practice to arbitrarily use case (c) in (11-7) for classification. This is tantamount to assuming equal prior probabilities and equal misclassification costs for the minimum ECM rule.2

2 This is the justification generally provided. It is also equivalent to assuming the prior probability ratio to be the reciprocal of the misclassification cost ratio.
Example 11.2 (Classifying a new observation into one of the two populations) A researcher has enough data available to estimate the density functions f1(x) and f2(x) associated with populations π1 and π2, respectively. Suppose c(2|1) = 5 units and c(1|2) = 10 units. In addition, it is known that about 20% of all objects (for which the measurements x can be recorded) belong to π2. Thus, the prior probabilities are p1 = .8 and p2 = .2.

Given the prior probabilities and costs of misclassification, we can use (11-6) to derive the classification regions R1 and R2. Specifically, we have

    R1:  f1(x)/f2(x) ≥ (10/5) (.2/.8) = .5

    R2:  f1(x)/f2(x) < (10/5) (.2/.8) = .5

Suppose the density functions evaluated at a new observation x0 give f1(x0) = .3 and f2(x0) = .4. Do we classify the new observation as π1 or π2? To answer the question, we form the ratio

    f1(x0)/f2(x0) = .3/.4 = .75

and compare it with .5 obtained before. Since

    f1(x0)/f2(x0) = .75 > (c(1|2)/c(2|1)) (p2/p1) = .5

we find that x0 ∈ R1 and classify it as belonging to π1. ∎

Criteria other than the expected cost of misclassification can be used to derive "optimal" classification procedures. For example, one might ignore the costs of misclassification and choose R1 and R2 to minimize the total probability of misclassification (TPM):

    TPM = P(misclassifying a π1 observation or misclassifying a π2 observation)
        = P(observation comes from π1 and is misclassified)
          + P(observation comes from π2 and is misclassified)
        = p1 ∫_{R2} f1(x) dx + p2 ∫_{R1} f2(x) dx                          (11-8)
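The minimum ECM comparison of Result 11.1, as applied in Example 11.2, reduces to a single inequality. A minimal sketch in Python (the function name `classify_min_ecm` is ours; the numbers are exactly those of Example 11.2):

```python
# Minimum ECM rule (11-6): allocate x0 to pi_1 when the density ratio
# f1(x0)/f2(x0) is at least the cost ratio times the prior ratio.
def classify_min_ecm(f1_x0, f2_x0, c12, c21, p1, p2):
    threshold = (c12 / c21) * (p2 / p1)
    return "pi_1" if f1_x0 / f2_x0 >= threshold else "pi_2"

# Example 11.2: c(1|2) = 10, c(2|1) = 5, p1 = .8, p2 = .2,
# and densities f1(x0) = .3, f2(x0) = .4 at the new observation.
label = classify_min_ecm(.3, .4, c12=10, c21=5, p1=.8, p2=.2)
# The density ratio .75 exceeds the threshold (10/5)(.2/.8) = .5,
# so x0 is allocated to pi_1, matching the example.
```

Only the two ratios matter, which is the practical point made above: the rule can be used even when the individual costs and priors are hard to pin down, provided their ratios can be specified.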
Mathematically, this problem is equivalent to minimizing the expected cost of misclassification when the costs of misclassification are equal. Consequently, the optimal regions in this case are given by (b) in (11-7).

We could also allocate a new observation x0 to the population with the largest "posterior" probability P(πi | x0). By Bayes's rule, the posterior probabilities are
P( occurP(wseandobsewerveobserve P( w e obs e r v e P(we observe P(we observe ( P P is equievaltheentdenomi to using Cltheas(bify) irnugleanfoobsr toetravlatprioonbabiaslity ofwhenmisP(clas ification P(in becaus nattionsors in and arafe ttehre obssame.erviHowever ,encecomputtheinname g the probabilitieprobabi s of thelipopul ais n g ( h t i e s ) frequently useful for purposes of identifying the les clear-cut assignments. P(
1T 1 I Xo )
=
1r 1
x0)
x0)
--------
x0 l 1r 1 ) P( 1T 1 ) x0 l 1r 1 )P( 1r 1 ) + x0 l 1r2 ) P( 1r2 )
P1 f1 (x o ) P1 !1 (x o ) + P2f2 (xo )
( 1T2 I Xo )
=
1 -
( 1T 1 I Xo )
x0 1r 1
1r 1
1 1 .3
(11-9) 1r2
=
P2f2 (x o ) P1 f1 ( X o ) + P2 f2 ( X o ) 1r 1 I x0) > 1r 2 l x0) (11-7)
x0
11-9 )
posterior
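The posterior probabilities in (11-9) are easy to compute once the priors and density values are in hand. A sketch using the quantities of Example 11.2 (Python; the helper name `posteriors` is ours):

```python
# Posterior probabilities (11-9) by Bayes's rule.
def posteriors(f1_x0, f2_x0, p1, p2):
    denom = p1 * f1_x0 + p2 * f2_x0
    return p1 * f1_x0 / denom, p2 * f2_x0 / denom

# Example 11.2 quantities: p1 = .8, p2 = .2, f1(x0) = .3, f2(x0) = .4.
post1, post2 = posteriors(.3, .4, p1=.8, p2=.2)
# post1 = .24/.32 = .75 and post2 = .25, so x0 again goes to pi_1;
# the .75 figure also indicates how clear-cut the assignment is.
```

The allocation agrees with the equal-cost rule, as the text notes, but the magnitude of the posterior (here .75 rather than, say, .99) is the extra information that flags borderline cases.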
11.3 CLASSIFICATION WITH TWO MULTIVARIATE NORMAL POPULATIONS
Classification procedures based on normal populations predominate in statistical practice because of their simplicity and reasonably high efficiency across a wide variety of population models. We now assume that f1(x) and f2(x) are multivariate normal densities, the first with mean vector μ1 and covariance matrix Σ1 and the second with mean vector μ2 and covariance matrix Σ2.

The special case of equal covariance matrices leads to a particularly simple linear classification statistic.

Classification of Normal Populations When Σ1 = Σ2 = Σ

Suppose that the joint densities of X' = [X1, X2, ..., Xp] for populations π1 and π2 are given by

    fi(x) = (1 / ((2π)^{p/2} |Σ|^{1/2})) exp[ −(1/2)(x − μi)' Σ⁻¹ (x − μi) ]    for i = 1, 2      (11-10)

Suppose also that the population parameters μ1, μ2, and Σ are known. Then, after cancellation of the terms (2π)^{p/2} |Σ|^{1/2}, the minimum ECM regions in (11-6) become

    R1: exp[ −(1/2)(x − μ1)' Σ⁻¹ (x − μ1) + (1/2)(x − μ2)' Σ⁻¹ (x − μ2) ] ≥ (c(1|2)/c(2|1)) (p2/p1)

    R2: exp[ −(1/2)(x − μ1)' Σ⁻¹ (x − μ1) + (1/2)(x − μ2)' Σ⁻¹ (x − μ2) ] < (c(1|2)/c(2|1)) (p2/p1)      (11-11)

Given these regions R1 and R2, we can construct the classification rule given in the following result.
Result 11.2. Let the populations π1 and π2 be described by multivariate normal densities of the form (11-10). Then the allocation rule that minimizes the ECM is as follows:

Allocate x0 to π1 if

    (μ1 − μ2)' Σ⁻¹ x0 − (1/2)(μ1 − μ2)' Σ⁻¹ (μ1 + μ2) ≥ ln[ (c(1|2)/c(2|1)) (p2/p1) ]      (11-12)

Allocate x0 to π2 otherwise.

Proof. Since the quantities in (11-11) are nonnegative for all x, we can take natural logarithms and preserve the order of the inequalities. Moreover (see Exercise 11.5),

    −(1/2)(x − μ1)' Σ⁻¹ (x − μ1) + (1/2)(x − μ2)' Σ⁻¹ (x − μ2)
        = (μ1 − μ2)' Σ⁻¹ x − (1/2)(μ1 − μ2)' Σ⁻¹ (μ1 + μ2)                 (11-13)

and, consequently,

    R1: (μ1 − μ2)' Σ⁻¹ x − (1/2)(μ1 − μ2)' Σ⁻¹ (μ1 + μ2) ≥ ln[ (c(1|2)/c(2|1)) (p2/p1) ]

    R2: (μ1 − μ2)' Σ⁻¹ x − (1/2)(μ1 − μ2)' Σ⁻¹ (μ1 + μ2) < ln[ (c(1|2)/c(2|1)) (p2/p1) ]      (11-14)

The minimum ECM classification rule follows. ∎

In most practical situations, the population quantities μ1, μ2, and Σ are unknown, so the rule (11-12) must be modified. Wald [27] and Anderson [2] have suggested replacing the population parameters by their sample counterparts.

Suppose, then, that we have n1 observations of the multivariate random variable X' = [X1, X2, ..., Xp] from π1 and n2 measurements of this quantity from π2, with n1 + n2 − 2 ≥ p. Then the respective data matrices are

    X1 ((n1 × p)) = [ x'11 ; x'12 ; … ; x'1n1 ]

    X2 ((n2 × p)) = [ x'21 ; x'22 ; … ; x'2n2 ]                            (11-15)
From these data matrices, the sample mean vectors and covariance matrices are determined by
    x̄i = (1/ni) Σ_{j=1}^{ni} xij ,      Si = (1/(ni − 1)) Σ_{j=1}^{ni} (xij − x̄i)(xij − x̄i)' ,      i = 1, 2      (11-16)

Since it is assumed that the parent populations have the same covariance matrix Σ, the sample covariance matrices S1 and S2 are combined (pooled) to derive a single, unbiased estimate of Σ as in (6-21). In particular, the weighted average

    S_pooled = [ (n1 − 1) / ((n1 − 1) + (n2 − 1)) ] S1 + [ (n2 − 1) / ((n1 − 1) + (n2 − 1)) ] S2      (11-17)

is an unbiased estimate of Σ if the data matrices X1 and X2 contain random samples from the populations π1 and π2, respectively.

Substituting x̄1 for μ1, x̄2 for μ2, and S_pooled for Σ in (11-12) gives the "sample" classification rule:

The Estimated Minimum ECM Rule for Two Normal Populations

Allocate x0 to π1 if

    (x̄1 − x̄2)' S_pooled⁻¹ x0 − (1/2)(x̄1 − x̄2)' S_pooled⁻¹ (x̄1 + x̄2) ≥ ln[ (c(1|2)/c(2|1)) (p2/p1) ]      (11-18)

Allocate x0 to π2 otherwise.

If, in (11-18),

    (c(1|2)/c(2|1)) (p2/p1) = 1

then ln(1) = 0, and the estimated minimum ECM rule for two normal populations amounts to comparing the scalar variable

    ŷ = (x̄1 − x̄2)' S_pooled⁻¹ x = â' x                                    (11-19)

evaluated at x0, with the number

    m̂ = (1/2)(x̄1 − x̄2)' S_pooled⁻¹ (x̄1 + x̄2) = (1/2)(ȳ1 + ȳ2)            (11-20)

where â = S_pooled⁻¹ (x̄1 − x̄2), ȳ1 = (x̄1 − x̄2)' S_pooled⁻¹ x̄1 = â' x̄1, and ȳ2 = (x̄1 − x̄2)' S_pooled⁻¹ x̄2 = â' x̄2.
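The estimated rule in (11-17) through (11-20) can be assembled directly from the two data matrices. A minimal sketch in Python with NumPy; the data matrices `X1` and `X2` below are small hypothetical learning samples, not data from the text:

```python
import numpy as np

def estimated_min_ecm(X1, X2):
    """Return (a_hat, m_hat) of (11-19)-(11-20) from data matrices X1, X2."""
    n1, n2 = len(X1), len(X2)
    x_bar1, x_bar2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False)   # divisor n1 - 1, as in (11-16)
    S2 = np.cov(X2, rowvar=False)
    S_pooled = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)   # (11-17)
    a_hat = np.linalg.solve(S_pooled, x_bar1 - x_bar2)           # (11-19)
    m_hat = 0.5 * a_hat @ (x_bar1 + x_bar2)                      # (11-20)
    return a_hat, m_hat

# Hypothetical learning samples with p = 2 variables:
X1 = np.array([[5.0, 3.0], [6.0, 4.0], [7.0, 3.5], [6.5, 4.5]])
X2 = np.array([[2.0, 1.0], [3.0, 2.2], [2.5, 1.1], [3.5, 2.4]])
a_hat, m_hat = estimated_min_ecm(X1, X2)

# Equal costs and priors, rule (11-18): classify a point at the pi_1
# sample mean; its score a_hat @ x is at least m_hat by construction.
new_label = 1 if a_hat @ X1.mean(axis=0) >= m_hat else 2
```

Because S_pooled is positive definite, ȳ1 − ȳ2 = (x̄1 − x̄2)' S_pooled⁻¹ (x̄1 − x̄2) ≥ 0, so the two projected group means always fall on their own sides of the midpoint m̂, which is the geometry described just below.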
That is, the estimated minimum ECM rule for two normal populations is tantamount to creating two univariate populations for the y values by taking an appropriate linear combination of the observations from populations π1 and π2 and then assigning a new observation x0 to π1 or π2, depending upon whether ŷ0 = â'x0 falls to the right or to the left of the midpoint m̂ between the two univariate means ȳ1 and ȳ2.

Once parameter estimates are inserted for the corresponding unknown population quantities, there is no assurance that the resulting rule will minimize the expected cost of misclassification in a particular application. This is because the optimal rule in (11-12) was derived assuming that the multivariate normal densities f1(x) and f2(x) were known completely. Expression (11-18) is simply an estimate of the optimal rule. However, it seems reasonable to expect that it should perform well if the sample sizes are large.3

To summarize, if the data appear to be multivariate normal,4 the classification statistic to the left of the inequality in (11-18) can be calculated for each new observation x0. These observations are classified by comparing the values of the statistic with the value of ln[ (c(1|2)/c(2|1)) (p2/p1) ].

Example 11.3 (Classification with two normal populations, common Σ and equal costs) This example is adapted from a study [4] concerned with the detection of hemophilia A carriers. (See also Exercise 11.32.)

To construct a procedure for detecting potential hemophilia A carriers, blood samples were assayed for two groups of women and measurements on the two variables,

    X1 = log10(AHF activity)
    X2 = log10(AHF-like antigen)

recorded. ("AHF" denotes antihemophilic factor.) The first group of n1 = 30 women were selected from a population of women who did not carry the hemophilia gene. This group was called the normal group. The second group of n2 = 22 women was selected from known hemophilia A carriers (daughters of hemophiliacs, mothers with more than one hemophilic son, and mothers with one hemophilic son and other hemophilic relatives). This group was called the obligatory carriers.

Pairs of observations (x1, x2) for the two groups are plotted in Figure 11.4. Also shown are estimated contours containing 50% and 95% of the probability for bivariate normal distributions centered at x̄1 and x̄2, respectively. Their common covariance matrix was taken as the pooled sample covariance matrix S_pooled. In this example, bivariate normal distributions seem to fit the data fairly well.
3 As the sample sizes increase, x̄1, x̄2, and S_pooled become, with probability approaching 1, indistinguishable from μ1, μ2, and Σ, respectively [see (4-26) and (4-27)].
4 At the very least, the marginal frequency distributions of the observations on each variable can be checked for normality. This must be done for the samples from both populations. Often, some variables must be transformed in order to make them more "normal looking." (See Sections 4.6 and 4.8.)
Figure 11.4 Scatter plots of [log10(AHF activity), log10(AHF-like antigen)] for the normal group and obligatory hemophilia A carriers. [Horizontal axis: x1 = log10(AHF activity); vertical axis: x2 = log10(AHF-like antigen); solid dots mark normals and open circles mark obligatory carriers.]
The investigators [4] provide the information

    x̄1 = [ −.0065, −.0390 ]'      x̄2 = [ −.2483, .0262 ]'

and

    S_pooled⁻¹ = [  131.158   −90.423
                    −90.423   108.147 ]

Therefore, the equal costs and equal priors discriminant function [see (11-19)] is

    ŷ = â'x = (x̄1 − x̄2)' S_pooled⁻¹ x
      = [ .2418   −.0652 ] [  131.158   −90.423
                              −90.423   108.147 ] [ x1
                                                    x2 ]
      = 37.61 x1 − 28.92 x2

Moreover,

    ȳ1 = â' x̄1 = [ 37.61   −28.92 ] [ −.0065
                                       −.0390 ] = .88

    ȳ2 = â' x̄2 = [ 37.61   −28.92 ] [ −.2483
                                        .0262 ] = −10.10

and the midpoint between these means [see (11-20)] is

    m̂ = (1/2)(ȳ1 + ȳ2) = (1/2)(.88 − 10.10) = −4.61
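These computations can be verified from the published summary statistics. A sketch in Python with NumPy (the inverse of S_pooled and the two mean vectors are the values quoted above from [4]):

```python
import numpy as np

S_pooled_inv = np.array([[131.158, -90.423],
                         [-90.423, 108.147]])
x_bar1 = np.array([-.0065, -.0390])   # normal group
x_bar2 = np.array([-.2483,  .0262])   # obligatory carriers

a_hat = S_pooled_inv @ (x_bar1 - x_bar2)   # approx [37.61, -28.92]
y_bar1 = a_hat @ x_bar1                    # approx .88
y_bar2 = a_hat @ x_bar2                    # approx -10.10
m_hat = 0.5 * (y_bar1 + y_bar2)            # approx -4.61, as in (11-20)
```

The computed coefficients reproduce the rounded values .88, −10.10, and −4.61 reported in the example.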
Measurements of AHF activity and AHF-like antigen on a woman who may be a hemophilia A carrier give x1 = −.210 and x2 = −.044. Should this woman be classified as π1 (normal) or π2 (obligatory carrier)?

Using (11-18) with equal costs and equal priors, so that ln(1) = 0, we obtain

    Allocate x0 to π1 if ŷ0 = â'x0 ≥ m̂ = −4.61
    Allocate x0 to π2 if ŷ0 = â'x0 < m̂ = −4.61

where x0' = [−.210, −.044]. Since

    ŷ0 = â'x0 = [ 37.61   −28.92 ] [ −.210
                                     −.044 ] = −6.62 < −4.61

we classify the woman as π2, an obligatory carrier. The new observation is indicated by a star in Figure 11.4. We see that it falls within the estimated .95 probability contour of population π1 and about on the estimated .50 probability contour of population π2. Thus, the classification is not clear cut.

Suppose now that the prior probabilities of group membership are known. For example, suppose the blood yielding the foregoing x1 and x2 measurements is drawn from the maternal first cousin of a hemophiliac. Then the genetic chance of being a hemophilia A carrier in this case is .25. Consequently, the prior probabilities of group membership are p1 = .75 and p2 = .25. Assuming, somewhat unrealistically, that the costs of misclassification are equal, so that c(1|2) = c(2|1), and using the classification statistic

    ŵ = (x̄1 − x̄2)' S_pooled⁻¹ x0 − (1/2)(x̄1 − x̄2)' S_pooled⁻¹ (x̄1 + x̄2)

or ŵ = â'x0 − m̂ with x0' = [−.210, −.044], m̂ = −4.61, and â'x0 = −6.62, we have

    ŵ = −6.62 − (−4.61) = −2.01

Applying (11-18), we see that

    ŵ = −2.01 < ln[ p2/p1 ] = ln[ .25/.75 ] = −1.10

and we classify the woman as π2, an obligatory carrier. ∎
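Both allocations in Example 11.3, first with equal priors and then with p1 = .75 and p2 = .25, can be reproduced from the rounded coefficients computed above. A minimal sketch (Python):

```python
import math

a_hat = (37.61, -28.92)   # rounded discriminant coefficients from Example 11.3
m_hat = -4.61
x0 = (-.210, -.044)       # the new woman's measurements

y0 = a_hat[0] * x0[0] + a_hat[1] * x0[1]   # about -6.6

# Equal priors and costs: compare y0 with m_hat.
equal_prior_label = "normal" if y0 >= m_hat else "carrier"

# Priors p1 = .75, p2 = .25, equal costs: compare w = y0 - m_hat
# with ln(p2/p1), per rule (11-18).
w = y0 - m_hat                             # about -2.0
prior_label = "normal" if w >= math.log(.25 / .75) else "carrier"
```

Both comparisons assign the woman to the obligatory-carrier group, in agreement with the example (small rounding differences from the text's −6.62 and −2.01 come from using the rounded coefficients).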
Scaling

The coefficient vector â = S_pooled⁻¹ (x̄1 − x̄2) is unique only up to a multiplicative constant, so, for c ≠ 0, any vector câ will also serve as discriminant coefficients. The vector â is frequently "scaled" or "normalized" to ease the interpretation of its elements. Two of the most commonly employed normalizations are:

1. Set

    â* = â / √(â'â)                                                        (11-21)

so that â* has unit length.
2. Set

    â* = â / â1                                                            (11-22)

so that the first element of the new coefficient vector â* is 1.

In both cases, â* is of the form câ. For normalization (1), c = (â'â)^{−1/2}, and for (2), c = â1⁻¹.

The magnitudes of â*1, â*2, ..., â*p in (11-21) all lie in the interval [−1, 1]. In (11-22), â*1 = 1 and â*2, ..., â*p are expressed as multiples of â*1. Constraining the â*i to the interval [−1, 1] usually facilitates a visual comparison of the coefficients. Similarly, expressing the coefficients as multiples of â*1 allows one to readily assess the relative importance (vis-a-vis X1) of variables X2, ..., Xp as discriminators.

Normalizing â is recommended only if the X variables have been standardized. If this is not the case, a great deal of care must be exercised in interpreting the results.
Classificatio n of Normal Populations When
XP
• • •
I1
-:/=
X
I2
Ii , i
(11-10)
I.
f1 (x) //2 (x) , [ /1 (x) //2 (x) ] = [/1 (x) ] f1 (x)
[/2 (x) ].
I1
f2 (x)
=
=
1, 2,
I i 1 1/2
I2 .
(11-13).
(11-6)
R
1:
_
R
2:
_
11.15),
_!_ x ' ( � �-1 1
_
_!_ x ' ( � � -1 1
_
2
2
a;'
[ - 1, 1 J
k
� � -1 ��2- 1 ) x + ( #Ll�l
_
� � -l 1 ��2- 1 ) x + ( #Ll�
_
11
-2 _
I1 l I2 l
+
�x' (I1 1 - I2 1 )x, (11-14).
, �-1 ) x IL2�2
_
I
� � -1 ) x #L2�2
_
k
>
-
k <
1
c( 1 1 2 ) c(2 1 1 ) c( 1 1 2 ) c(2 1 1 )
1
, � 2- 1 #L2 ) , �� -1 1 #L 1 - #L2� ( 1 #L 2 quadratic
P2 Pl P2 Pl
(11-23)
(11-24)
x.
I1
=
I2 ,
(11-23)
Section 11.3 Classification with Two Multivariate Normal Populations
The classification rule for general multivariate normal populations follows directly from (11-23).

Result 11.3. Let the populations \pi_1 and \pi_2 be described by multivariate normal densities with mean vectors and covariance matrices \mu_1, \Sigma_1 and \mu_2, \Sigma_2, respectively. The allocation rule that minimizes the expected cost of misclassification is:

Allocate x_0 to \pi_1 if

    -\tfrac{1}{2}\, x_0'(\Sigma_1^{-1} - \Sigma_2^{-1})x_0 + (\mu_1'\Sigma_1^{-1} - \mu_2'\Sigma_2^{-1})x_0 - k \ \ge\ \ln\!\left[\left(\frac{c(1|2)}{c(2|1)}\right)\!\left(\frac{p_2}{p_1}\right)\right]

Allocate x_0 to \pi_2 otherwise. Here k is set out in (11-24).

In practice, the classification rule in Result 11.3 is implemented by substituting the sample quantities \bar{x}_1, \bar{x}_2, S_1, and S_2 (see (11-16)) for \mu_1, \mu_2, \Sigma_1, and \Sigma_2, respectively.^5 The resulting sample analog is the quadratic classification rule: allocate x_0 to \pi_1 if

    -\tfrac{1}{2}\, x_0'(S_1^{-1} - S_2^{-1})x_0 + (\bar{x}_1'S_1^{-1} - \bar{x}_2'S_2^{-1})x_0 - \hat{k} \ \ge\ \ln\!\left[\left(\frac{c(1|2)}{c(2|1)}\right)\!\left(\frac{p_2}{p_1}\right)\right]        (11-25)

and allocate x_0 to \pi_2 otherwise.

Classification with quadratic functions is rather awkward in more than two dimensions and can lead to some strange results. This is particularly true when the data are not (essentially) multivariate normal. Figure 11.5(a) shows the equal costs and equal priors rule based on the idealized case of two normal distributions with different variances. This quadratic rule leads to a region R_1 consisting of two disjoint sets of points. In many applications, the lower tail of the \pi_1 distribution will be smaller than that prescribed by a normal distribution. Then, as shown in Figure 11.5(b), the lower part of the region R_1, produced by the quadratic procedure, does not line up well with the population distributions and can lead to large error rates. A serious weakness of the quadratic rule is that it is sensitive to departures from normality.

If the data are not multivariate normal, two options are available. First, the nonnormal data can be transformed to data more nearly normal, and a test for the equality of covariance matrices can be conducted to see whether the linear rule (11-18)

^5 The inequalities n_1 > p and n_2 > p must both hold for S_1^{-1} and S_2^{-1} to exist. These quantities are used in place of \Sigma_1^{-1} and \Sigma_2^{-1}, respectively, in the sample analog (11-25).
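The sample quadratic rule (11-25) is mechanical enough to sketch in code. The book gives no program for it, so the following is only an illustrative sketch (assuming numpy; the function names `quadratic_score` and `classify` are my own):

```python
import numpy as np

def quadratic_score(x0, xbar1, xbar2, S1, S2):
    """Sample quadratic classification statistic of (11-25):
    -0.5 x0'(S1^-1 - S2^-1)x0 + (xbar1'S1^-1 - xbar2'S2^-1)x0 - k."""
    S1inv, S2inv = np.linalg.inv(S1), np.linalg.inv(S2)
    k = 0.5 * np.log(np.linalg.det(S1) / np.linalg.det(S2)) \
        + 0.5 * (xbar1 @ S1inv @ xbar1 - xbar2 @ S2inv @ xbar2)
    return (-0.5 * x0 @ (S1inv - S2inv) @ x0
            + (xbar1 @ S1inv - xbar2 @ S2inv) @ x0 - k)

def classify(x0, xbar1, xbar2, S1, S2, p1=0.5, p2=0.5, c12=1.0, c21=1.0):
    """Allocate x0 to population 1 when the score meets the log-cost threshold."""
    threshold = np.log((c12 / c21) * (p2 / p1))
    return 1 if quadratic_score(x0, xbar1, xbar2, S1, S2) >= threshold else 2
```

With equal covariance matrices the quadratic term vanishes and the rule reduces to the linear rule, as the text notes for (11-23).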
Figure 11.5 Quadratic rules for (a) two normal distributions with unequal variances and (b) two distributions, one of which is nonnormal: rule not appropriate.
or the quadratic rule (11-25) is appropriate. Transformations are discussed in Chapter 4. (The usual tests for covariance homogeneity are greatly affected by nonnormality. The conversion of nonnormal data to normal data must be done before this testing is carried out.)

Second, we can use a linear (or quadratic) rule without worrying about the form of the parent populations and hope that it will work reasonably well. Studies (see [20] and [21]) have shown, however, that there are nonnormal cases where a linear classification function performs poorly, even though the population covariance matrices are the same. The moral is to always check the performance of any classification procedure. At the very least, this should be done with the data sets used to build the classifier. Ideally, there will be enough data available to provide for "training" samples and "validation" samples. The training samples can be used to develop the classification function, and the validation samples can be used to evaluate its performance.

11.4 EVALUATING CLASSIFICATION FUNCTIONS

One important way of judging the performance of any classification procedure is to calculate its "error rates," or misclassification probabilities. When the forms of the parent populations are known completely, misclassification probabilities can be calculated with relative ease, as we show in Example 11.4.
Because parent populations are rarely known, we shall concentrate on the error rates associated with the sample classification function. Once this classification function is constructed, a measure of its performance in future samples is of interest.

From (11-8), the total probability of misclassification is

    TPM = p_1 \int_{R_2} f_1(x)\,dx + p_2 \int_{R_1} f_2(x)\,dx

The smallest value of this quantity, obtained by a judicious choice of R_1 and R_2, is called the optimum error rate (OER).
Thus, the OER is the error rate for the minimum TPM classification rule.

Example 11.4 (Calculating misclassification probabilities)

Let us derive an expression for the optimum error rate when p_1 = p_2 = 1/2 and f_1(x) and f_2(x) are the multivariate normal densities in (11-10). Now, the minimum ECM and minimum TPM classification rules coincide when c(1|2) = c(2|1). Because the prior probabilities are also equal, the minimum TPM classification regions are defined for normal populations by (11-12), with \ln[(c(1|2)/c(2|1))(p_2/p_1)] = 0. We find that

    R_1:\ (\mu_1 - \mu_2)'\Sigma^{-1}x - \tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 + \mu_2) \ge 0
    R_2:\ (\mu_1 - \mu_2)'\Sigma^{-1}x - \tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 + \mu_2) < 0

These sets can be expressed in terms of Y = (\mu_1 - \mu_2)'\Sigma^{-1}X = a'X as

    R_1(y):\ y \ge \tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 + \mu_2)
    R_2(y):\ y < \tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 + \mu_2)

But Y is a linear combination of normal random variables, so the probability densities of Y, f_1(y) and f_2(y), are univariate normal (see Result 4.2) with means and a variance given by

    \mu_{1Y} = a'\mu_1 = (\mu_1 - \mu_2)'\Sigma^{-1}\mu_1
    \mu_{2Y} = a'\mu_2 = (\mu_1 - \mu_2)'\Sigma^{-1}\mu_2
    \sigma_Y^2 = a'\Sigma a = (\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2) = \Delta^2

Now,

    TPM = \tfrac{1}{2} P[\text{misclassifying a } \pi_1 \text{ observation as } \pi_2] + \tfrac{1}{2} P[\text{misclassifying a } \pi_2 \text{ observation as } \pi_1]
Figure 11.6 The misclassification probabilities, P(1|2) and P(2|1), based on Y.
But, as shown in Figure 11.6,

    P[\text{misclassifying a } \pi_1 \text{ observation as } \pi_2] = P(2|1)
        = P[Y < \tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 + \mu_2)]
        = P\!\left[\frac{Y - \mu_{1Y}}{\sigma_Y} < \frac{\tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 + \mu_2) - (\mu_1 - \mu_2)'\Sigma^{-1}\mu_1}{\Delta}\right]
        = P\!\left(Z < -\frac{\Delta}{2}\right) = \Phi\!\left(-\frac{\Delta}{2}\right)

where \Phi(\cdot) is the cumulative distribution function of a standard normal random variable. Similarly,

    P[\text{misclassifying a } \pi_2 \text{ observation as } \pi_1] = P(1|2)
        = P[Y \ge \tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 + \mu_2)]
        = P\!\left(Z > \frac{\Delta}{2}\right) = 1 - \Phi\!\left(\frac{\Delta}{2}\right) = \Phi\!\left(-\frac{\Delta}{2}\right)

Therefore, the optimum error rate is

    OER = \text{minimum TPM} = \tfrac{1}{2}\Phi\!\left(-\frac{\Delta}{2}\right) + \tfrac{1}{2}\Phi\!\left(-\frac{\Delta}{2}\right) = \Phi\!\left(-\frac{\Delta}{2}\right)        (11-27)

where \Delta^2 = (\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2). If, for example, \Delta^2 = 2.56, then \Delta = \sqrt{2.56} = 1.6, and, using Table 1 in the appendix, we obtain

    \text{Minimum TPM} = \Phi\!\left(-\frac{1.6}{2}\right) = \Phi(-.8) = .2119

The optimal classification rule here will incorrectly allocate about 21% of the items to one population or the other. ■

Example 11.4 illustrates how the optimum error rate can be calculated when the population density functions are known. If, as is usually the case, certain population parameters appearing in allocation rules must be estimated from the sample, then the evaluation of error rates is not straightforward.
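The closed form (11-27) is a one-line computation. As a quick check of Example 11.4 (not from the book, and using only the Python standard library):

```python
from math import sqrt
from statistics import NormalDist

def optimum_error_rate(delta_sq):
    """OER = Phi(-Delta/2), per (11-27), where delta_sq is the squared
    Mahalanobis distance (mu1 - mu2)' Sigma^-1 (mu1 - mu2)."""
    return NormalDist().cdf(-sqrt(delta_sq) / 2)

print(round(optimum_error_rate(2.56), 4))  # 0.2119, matching Example 11.4
```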
The performance of sample classification functions can, in principle, be evaluated by calculating the actual error rate (AER),

    AER = p_1 \int_{\hat{R}_2} f_1(x)\,dx + p_2 \int_{\hat{R}_1} f_2(x)\,dx        (11-28)

where \hat{R}_1 and \hat{R}_2 represent the classification regions determined by samples of size n_1 and n_2, respectively. For example, if the classification function in (11-18) is employed, the regions \hat{R}_1 and \hat{R}_2 are defined by the set of x's for which the following inequalities are satisfied:

    \hat{R}_1:\ (\bar{x}_1 - \bar{x}_2)'S_{pooled}^{-1}x - \tfrac{1}{2}(\bar{x}_1 - \bar{x}_2)'S_{pooled}^{-1}(\bar{x}_1 + \bar{x}_2) \ \ge\ \ln\!\left[\left(\frac{c(1|2)}{c(2|1)}\right)\!\left(\frac{p_2}{p_1}\right)\right]

    \hat{R}_2:\ (\bar{x}_1 - \bar{x}_2)'S_{pooled}^{-1}x - \tfrac{1}{2}(\bar{x}_1 - \bar{x}_2)'S_{pooled}^{-1}(\bar{x}_1 + \bar{x}_2) \ <\ \ln\!\left[\left(\frac{c(1|2)}{c(2|1)}\right)\!\left(\frac{p_2}{p_1}\right)\right]

The AER indicates how the sample classification function will perform in future samples. Like the optimal error rate, it cannot, in general, be calculated, because it depends on the unknown density functions f_1(x) and f_2(x). However, an estimate of a quantity related to the actual error rate can be calculated, and this estimate will be discussed shortly.

There is a measure of performance that does not depend on the form of the parent populations and that can be calculated for any classification procedure. This measure, called the apparent error rate (APER), is defined as the fraction of observations in the training sample that are misclassified by the sample classification function.

The apparent error rate can be easily calculated from the confusion matrix, which shows actual versus predicted group membership. For n_1 observations from \pi_1 and n_2 observations from \pi_2, the confusion matrix has the form

                                Predicted membership
                                \pi_1                      \pi_2
    Actual       \pi_1       n_{1C}                    n_{1M} = n_1 - n_{1C}        n_1        (11-29)
    membership   \pi_2       n_{2M} = n_2 - n_{2C}     n_{2C}                       n_2

where

    n_{1C} = number of \pi_1 items correctly classified as \pi_1 items
    n_{1M} = number of \pi_1 items misclassified as \pi_2 items
    n_{2C} = number of \pi_2 items correctly classified
    n_{2M} = number of \pi_2 items misclassified

The apparent error rate is then

    APER = \frac{n_{1M} + n_{2M}}{n_1 + n_2}        (11-30)

which is recognized as the proportion of items in the training set that are misclassified.
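Formula (11-30) is simple enough to code directly. A minimal sketch (mine, not the book's; it uses the misclassification counts of the riding-mower data discussed in the example that follows):

```python
def aper(n1M, n2M, n1, n2):
    """Apparent error rate (11-30): fraction of training items misclassified."""
    return (n1M + n2M) / (n1 + n2)

# Riding-mower counts: 2 of 12 owners and 2 of 12 nonowners misclassified.
print(round(100 * aper(2, 2, 12, 12), 1))  # 16.7 (percent)
```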
Example 11.5 (Calculating the apparent error rate)

Consider the classification regions R_1 and R_2 shown in Figure 11.1 for the riding-mower data. In this case, observations northeast of the solid line are classified as \pi_1, mower owners; observations southwest of the solid line are classified as \pi_2, nonowners. Notice that some observations are misclassified. The confusion matrix is

                                            Predicted membership
                                            \pi_1: riding-mower owners    \pi_2: nonowners
    Actual       \pi_1: riding-mower owners            10                         2
    membership   \pi_2: nonowners                       2                        10

The apparent error rate, expressed as a percentage, is

    APER = \left(\frac{2 + 2}{12 + 12}\right) 100\% = \left(\frac{4}{24}\right) 100\% = 16.7\%    ■

The APER is intuitively appealing and easy to calculate. Unfortunately, it tends to underestimate the AER, and the problem does not disappear unless the sample sizes n_1 and n_2 are very large. Essentially, this optimistic estimate occurs because the data used to build the classification function are also used to evaluate it.

Error-rate estimates can be constructed that are better than the apparent error rate, remain relatively easy to calculate, and do not require distributional assumptions. One procedure is to split the total sample into a training sample and a validation sample. The training sample is used to construct the classification function, and the validation sample is used to evaluate it. The error rate is determined by the proportion misclassified in the validation sample. Although this method overcomes the bias problem by not using the same data to both build and judge the classification function, it suffers from two main defects:

(i) It requires large samples.
(ii) The function evaluated is not the function of interest. Ultimately, almost all of the data must be used to construct the classification function. If not, valuable information may be lost.

A second approach that seems to work well is called Lachenbruch's "holdout" procedure^6 (see also Lachenbruch and Mickey [22]):

1. Start with the \pi_1 group of observations. Omit one observation from this group, and develop a classification function based on the remaining n_1 - 1, n_2 observations.

^6 Lachenbruch's holdout procedure is sometimes referred to as jackknifing or cross-validation.
2. Classify the "holdout" observation, using the function constructed in Step 1.

3. Repeat Steps 1 and 2 until all of the \pi_1 observations are classified. Let n_{1M}^{(H)} be the number of holdout (H) observations misclassified in this group.

4. Repeat Steps 1 through 3 for the \pi_2 observations. Let n_{2M}^{(H)} be the number of holdout observations misclassified in this group.

Estimates \hat{P}(2|1) and \hat{P}(1|2) of the conditional misclassification probabilities in (11-1) and (11-2) are then given by

    \hat{P}(2|1) = \frac{n_{1M}^{(H)}}{n_1}  and  \hat{P}(1|2) = \frac{n_{2M}^{(H)}}{n_2}        (11-31)

and the total proportion misclassified,

    \hat{E}(AER) = \frac{n_{1M}^{(H)} + n_{2M}^{(H)}}{n_1 + n_2}

is, for moderate samples, a nearly unbiased estimate of the expected actual error rate, E(AER).

Lachenbruch's holdout method is computationally feasible when used in conjunction with the linear classification statistics in (11-18) or (11-19). It is offered as an option in some readily available discriminant analysis computer programs.

Example 11.6 (Calculating an estimate of the error rate using the holdout procedure)
We shall illustrate Lachenbruch's holdout procedure and the calculation of error-rate estimates for the equal costs and equal priors version of (11-18). Consider the following data matrices and descriptive statistics. (We shall assume that the n_1 = n_2 = 3 bivariate observations were selected randomly from two populations \pi_1 and \pi_2 with a common covariance matrix.)

    X_1 = \begin{bmatrix} 2 & 12 \\ 4 & 10 \\ 3 & 8 \end{bmatrix},\quad \bar{x}_1 = \begin{bmatrix} 3 \\ 10 \end{bmatrix},\quad 2S_1 = \begin{bmatrix} 2 & -2 \\ -2 & 8 \end{bmatrix}

    X_2 = \begin{bmatrix} 5 & 7 \\ 3 & 9 \\ 4 & 5 \end{bmatrix},\quad \bar{x}_2 = \begin{bmatrix} 4 \\ 7 \end{bmatrix},\quad 2S_2 = \begin{bmatrix} 2 & -2 \\ -2 & 8 \end{bmatrix}

The pooled covariance matrix is

    S_{pooled} = \frac{1}{4}[2S_1 + 2S_2] = \begin{bmatrix} 1 & -1 \\ -1 & 4 \end{bmatrix}

Using S_{pooled}, the rest of the data, and Rule (11-18) with equal costs and equal priors, we may classify the sample observations. You may then verify (see Exercise 11.19) that the confusion matrix is

                        Classify as:
                        \pi_1    \pi_2
    True  \pi_1:          2        1
          \pi_2:          1        2

and consequently,

    APER (apparent error rate) = \frac{2}{6} = .33

Holding out the first observation x_H = [2, 12]' from X_1, we calculate

    X_{1H} = \begin{bmatrix} 4 & 10 \\ 3 & 8 \end{bmatrix},\quad \bar{x}_{1H} = \begin{bmatrix} 3.5 \\ 9 \end{bmatrix},\quad \text{and}\quad 1S_{1H} = \begin{bmatrix} .5 & 1 \\ 1 & 2 \end{bmatrix}

The new pooled covariance matrix, S_{H,pooled}, is

    S_{H,pooled} = \frac{1}{3}[1S_{1H} + 2S_2] = \frac{1}{3}\begin{bmatrix} 2.5 & -1 \\ -1 & 10 \end{bmatrix}

with inverse^7

    S_{H,pooled}^{-1} = \frac{3}{24}\begin{bmatrix} 10 & 1 \\ 1 & 2.5 \end{bmatrix}

It is computationally quicker to classify the holdout observation x_H on the basis of its squared distances from the group means \bar{x}_{1H} and \bar{x}_2. This procedure is equivalent to computing the value of the linear function \hat{y} = \hat{a}_H' x_H = (\bar{x}_{1H} - \bar{x}_2)' S_{H,pooled}^{-1} x_H and comparing it to the midpoint \hat{m}_H = \tfrac{1}{2}(\bar{x}_{1H} - \bar{x}_2)' S_{H,pooled}^{-1}(\bar{x}_{1H} + \bar{x}_2). [See (11-19) and (11-20).]

Thus with x_H = [2, 12]' we have

    \text{Squared distance from } \bar{x}_{1H} = (x_H - \bar{x}_{1H})' S_{H,pooled}^{-1}(x_H - \bar{x}_{1H})
        = [2 - 3.5,\ 12 - 9]\,\frac{3}{24}\begin{bmatrix} 10 & 1 \\ 1 & 2.5 \end{bmatrix}\begin{bmatrix} 2 - 3.5 \\ 12 - 9 \end{bmatrix} = 4.5

    \text{Squared distance from } \bar{x}_2 = (x_H - \bar{x}_2)' S_{H,pooled}^{-1}(x_H - \bar{x}_2)
        = [2 - 4,\ 12 - 7]\,\frac{3}{24}\begin{bmatrix} 10 & 1 \\ 1 & 2.5 \end{bmatrix}\begin{bmatrix} 2 - 4 \\ 12 - 7 \end{bmatrix} = 10.3

Since the distance from x_H to \bar{x}_{1H} is smaller than the distance from x_H to \bar{x}_2, we classify x_H as a \pi_1 observation. In this case, the classification is correct.

If x_H = [4, 10]' is withheld, \bar{x}_{1H} and S_{H,pooled}^{-1} become

    \bar{x}_{1H} = \begin{bmatrix} 2.5 \\ 10 \end{bmatrix}\quad \text{and}\quad S_{H,pooled}^{-1} = \frac{3}{24}\begin{bmatrix} 16 & 4 \\ 4 & 2.5 \end{bmatrix}

^7 A matrix identity due to Bartlett [3] allows for the quick calculation of S_{H,pooled}^{-1} directly from S_{pooled}^{-1}. Thus one does not have to recompute the inverse after withholding each observation. (See Exercise 11.20.)
We find that

    (x_H - \bar{x}_{1H})' S_{H,pooled}^{-1}(x_H - \bar{x}_{1H}) = [4 - 2.5,\ 10 - 10]\,\frac{3}{24}\begin{bmatrix} 16 & 4 \\ 4 & 2.5 \end{bmatrix}\begin{bmatrix} 4 - 2.5 \\ 10 - 10 \end{bmatrix} = 4.5

    (x_H - \bar{x}_2)' S_{H,pooled}^{-1}(x_H - \bar{x}_2) = [4 - 4,\ 10 - 7]\,\frac{3}{24}\begin{bmatrix} 16 & 4 \\ 4 & 2.5 \end{bmatrix}\begin{bmatrix} 4 - 4 \\ 10 - 7 \end{bmatrix} = 2.8

and consequently, we would incorrectly assign this observation to \pi_2. Holding out x_H = [3, 8]' leads to incorrectly assigning this observation to \pi_2 as well. Thus, n_{1M}^{(H)} = 2.

Turning to the second group, suppose x_H = [5, 7]' is withheld. Then

    X_{2H} = \begin{bmatrix} 3 & 9 \\ 4 & 5 \end{bmatrix},\quad \bar{x}_{2H} = \begin{bmatrix} 3.5 \\ 7 \end{bmatrix},\quad \text{and}\quad 1S_{2H} = \begin{bmatrix} .5 & -2 \\ -2 & 8 \end{bmatrix}

The new pooled covariance matrix is

    S_{H,pooled} = \frac{1}{3}[2S_1 + 1S_{2H}] = \frac{1}{3}\begin{bmatrix} 2.5 & -4 \\ -4 & 16 \end{bmatrix}

with inverse

    S_{H,pooled}^{-1} = \frac{3}{24}\begin{bmatrix} 16 & 4 \\ 4 & 2.5 \end{bmatrix}

We find that

    (x_H - \bar{x}_1)' S_{H,pooled}^{-1}(x_H - \bar{x}_1) = [5 - 3,\ 7 - 10]\,\frac{3}{24}\begin{bmatrix} 16 & 4 \\ 4 & 2.5 \end{bmatrix}\begin{bmatrix} 5 - 3 \\ 7 - 10 \end{bmatrix} = 4.8

    (x_H - \bar{x}_{2H})' S_{H,pooled}^{-1}(x_H - \bar{x}_{2H}) = [5 - 3.5,\ 7 - 7]\,\frac{3}{24}\begin{bmatrix} 16 & 4 \\ 4 & 2.5 \end{bmatrix}\begin{bmatrix} 5 - 3.5 \\ 7 - 7 \end{bmatrix} = 4.5

and x_H = [5, 7]' is correctly assigned to \pi_2.

When x_H = [3, 9]' is withheld, \bar{x}_{2H} = [4.5, 6]' and

    (x_H - \bar{x}_1)' S_{H,pooled}^{-1}(x_H - \bar{x}_1) = [3 - 3,\ 9 - 10]\,\frac{3}{24}\begin{bmatrix} 10 & 1 \\ 1 & 2.5 \end{bmatrix}\begin{bmatrix} 3 - 3 \\ 9 - 10 \end{bmatrix} = .3

    (x_H - \bar{x}_{2H})' S_{H,pooled}^{-1}(x_H - \bar{x}_{2H}) = [3 - 4.5,\ 9 - 6]\,\frac{3}{24}\begin{bmatrix} 10 & 1 \\ 1 & 2.5 \end{bmatrix}\begin{bmatrix} 3 - 4.5 \\ 9 - 6 \end{bmatrix} = 4.5

and x_H = [3, 9]' is incorrectly assigned to \pi_1. Finally, withholding x_H = [4, 5]' leads to correctly classifying this observation as \pi_2. Thus, n_{2M}^{(H)} = 1.

An estimate of the expected actual error rate is provided by

    \hat{E}(AER) = \frac{n_{1M}^{(H)} + n_{2M}^{(H)}}{n_1 + n_2} = \frac{2 + 1}{3 + 3} = .5

Hence, we see that the apparent error rate APER = .33 is an optimistic measure of performance. Of course, in practice, sample sizes are larger than those we have considered here, and the difference between the APER and \hat{E}(AER) may not be as large. ■

If you are interested in pursuing the approaches to estimating classification error rates, see [21].

The next example illustrates a difficulty that can arise when the variance of the discriminant is not the same for both populations.
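The holdout arithmetic of Example 11.6 can be reproduced programmatically. The sketch below is mine, not the book's (assuming numpy; `linear_rule` and `holdout_error` are illustrative names); it rebuilds the equal-cost, equal-prior linear rule with each observation omitted in turn:

```python
import numpy as np

def linear_rule(X1, X2):
    """Equal-cost, equal-prior linear rule (11-18)/(11-19):
    classify x to pi_1 when a'x >= m, with a = S_pooled^-1 (xbar1 - xbar2)."""
    x1, x2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    Sp = ((n1 - 1) * np.cov(X1.T) + (n2 - 1) * np.cov(X2.T)) / (n1 + n2 - 2)
    a = np.linalg.solve(Sp, x1 - x2)
    m = a @ (x1 + x2) / 2
    return lambda x: 1 if a @ x >= m else 2

def holdout_error(X1, X2):
    """Lachenbruch's holdout estimate of E(AER): refit with one observation
    omitted, classify the omitted observation, and pool the errors."""
    miss = 0
    for i in range(len(X1)):
        miss += linear_rule(np.delete(X1, i, axis=0), X2)(X1[i]) != 1
    for j in range(len(X2)):
        miss += linear_rule(X1, np.delete(X2, j, axis=0))(X2[j]) != 2
    return miss / (len(X1) + len(X2))

X1 = np.array([[2., 12.], [4., 10.], [3., 8.]])   # pi_1 sample (Example 11.6)
X2 = np.array([[5., 7.], [3., 9.], [4., 5.]])     # pi_2 sample

rule = linear_rule(X1, X2)
apparent = (sum(rule(x) != 1 for x in X1) + sum(rule(x) != 2 for x in X2)) / 6
print(round(apparent, 2))       # 0.33  (APER)
print(holdout_error(X1, X2))    # 0.5   (holdout estimate of E(AER))
```

Both figures agree with the hand calculations above, illustrating how the holdout estimate exceeds the optimistic APER.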
Example 11.7 (Classifying Alaskan and Canadian salmon)

The salmon fishery is a valuable resource for both the United States and Canada. Because it is a limited resource, it must be managed efficiently. Moreover, since more than one country is involved, problems must be solved equitably. That is, Alaskan commercial fishermen cannot catch too many Canadian salmon and vice versa.

These fish have a remarkable life cycle. They are born in freshwater streams, and after a year or two, they swim into the ocean. After a couple of years in salt water, they return to their place of birth to spawn and die. At the time they are about to return as mature fish, they are harvested while still in the ocean. To help regulate catches, samples of fish taken during the harvest must be identified as coming from Alaskan or Canadian waters. The fish carry some information about their birthplace in the growth rings on their scales. Typically, the rings associated with freshwater growth are smaller for the Alaskan-born than for the Canadian-born salmon. Table 11.2 gives the diameters of the growth-ring regions, magnified 100 times, where

    x_1 = diameter of rings for the first-year freshwater growth (hundredths of an inch)
    x_2 = diameter of rings for the first-year marine growth (hundredths of an inch)

In addition, females are coded as 1 and males are coded as 2.

Training samples of sizes n_1 = 50 Alaskan-born salmon and n_2 = 50 Canadian-born salmon yield the summary statistics

    \bar{x}_1 = \begin{bmatrix} 98.380 \\ 429.660 \end{bmatrix},\quad S_1 = \begin{bmatrix} 260.608 & -188.093 \\ -188.093 & 1399.086 \end{bmatrix}

    \bar{x}_2 = \begin{bmatrix} 137.460 \\ 366.620 \end{bmatrix},\quad S_2 = \begin{bmatrix} 326.090 & 133.505 \\ 133.505 & 893.261 \end{bmatrix}
TABLE 11.2 SALMON DATA (GROWTH-RING DIAMETERS)

[Table 11.2 lists, for each of the 50 Alaskan and 50 Canadian salmon, the gender code and the freshwater and marine growth-ring diameters; the individual observations are summarized by the sample statistics reported in the text.]

Gender key: 1 = female; 2 = male.
Source: Data courtesy of K. A. Jensen and B. Van Alen of the State of Alaska Department of Fish and Game.
(
TheExerdatcisea appearbutto sathtiesfcovar y the iasancesumptmatiornicofesbimayvardiiafteenorr. However mal distri,btuto iilounsstrsaetee aficpoiatinont concer n i n g mi s c l a s i f i c at i o n pr o babi l i t i e s , we wi l us e t h e l i n ear cl a s i prThe clocedur e . a s i f i c at i o n pr o cedur e , us i n g equal cos t s and equal pr i o r probabi l i ties, yields the holdout estimated error rates Predicted membership Al a s k an Actmemberual ship AlCanadiaskanan based on the linear clas ification function see and Ther e i s s o me di f e r e nce i n t h e s a mpl e s t a ndar d devi a t i o ns of f o r t h e two populations: Sampl e Sampl e Mean St a ndar d Devi a t i o n AlCanadiaskanan Al t h ough t h e over a l er r o r r a t e or i s qui t e l o w, t h er e i s an un fasairAlnesasskheran eborn,. It is rleastherlikelthyanthatvicaeCanadi versa. aFin-bgurorne salmonwhiwiclhbeshmiowsscltahseiftiweod 11.31),
'1T
'1T
1:
44 1
1: 2:
'1T
[
w ==
y-
m ==
6 49
(11-19)
- 5.54121 - .12839x 1
+
(11-20)]
.05194x2
w
n
50 50
4.144 - 4.147 (7/100, 7%)
11 .7,
3.253 2.450
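The reported coefficients can be recovered from the summary statistics alone. A minimal check (mine, assuming numpy; with n_1 = n_2 = 50, the pooled covariance matrix is just the average of S_1 and S_2):

```python
import numpy as np

# Summary statistics from Example 11.7
x1 = np.array([98.380, 429.660])
x2 = np.array([137.460, 366.620])
S1 = np.array([[260.608, -188.093], [-188.093, 1399.086]])
S2 = np.array([[326.090, 133.505], [133.505, 893.261]])

Sp = (49 * S1 + 49 * S2) / 98       # pooled covariance matrix
a = np.linalg.solve(Sp, x1 - x2)    # coefficients, close to (-.12839, .05194)
m = a @ (x1 + x2) / 2               # midpoint, close to 5.5412

# Evaluating w-hat = a'x - m at each group mean recovers the sample means
# of w-hat reported in the table (about 4.144 and -4.147).
print(a, m, a @ x1 - m, a @ x2 - m)
```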
Figure 11.7 Schematic of normal densities for the linear discriminant: salmon data.
Figure 11.7, which shows the normal densities for the linear discriminant, explains this phenomenon. Use of the midpoint between the two sample means does not make the two misclassification probabilities equal. It clearly penalizes the population with the largest variance. Thus, blind adherence to the linear classification procedure can be unwise. ■

It should be intuitively clear that good classification (low error rates) will depend upon the separation of the populations. The farther apart the groups, the more likely it is that a useful classification rule can be developed. This separative goal, alluded to in Section 11.1, is explored further in Sections 11.5 and 11.7. As we shall see, allocation rules appropriate for the case involving equal prior probabilities and equal misclassification costs correspond to functions designed to maximally separate populations. It is in this situation that we begin to lose the distinction between classification and separation.

11.5 FISHER'S DISCRIMINANT FUNCTION-SEPARATION OF POPULATIONS

Fisher [9] actually arrived at the linear classification statistic (11-19) using an entirely different argument from the one in Section 11.3. Fisher's idea was to transform the multivariate observations x to univariate observations y such that the y's derived from populations \pi_1 and \pi_2 were separated as much as possible. Fisher suggested taking linear combinations of x to create y's because they are simple enough functions of x to be handled easily. Fisher's approach does not assume that the populations are normal. It does, however, implicitly assume that the population covariance matrices are equal, because a pooled estimate of the common covariance matrix is used.

A fixed linear combination of the x's takes the values y_{11}, y_{12}, \ldots, y_{1n_1} for the observations from the first population and the values y_{21}, y_{22}, \ldots, y_{2n_2} for the observations from the second population. The separation of these two sets of univariate y's is assessed in terms of the difference between \bar{y}_1 and \bar{y}_2 expressed in standard deviation units. That is,

    \text{separation} = \frac{|\bar{y}_1 - \bar{y}_2|}{s_y},\qquad \text{where}\qquad s_y^2 = \frac{\sum_{j=1}^{n_1}(y_{1j} - \bar{y}_1)^2 + \sum_{j=1}^{n_2}(y_{2j} - \bar{y}_2)^2}{n_1 + n_2 - 2}

is the pooled estimate of the variance. The objective is to select the linear combination of the x's to achieve maximum separation of the sample means \bar{y}_1 and \bar{y}_2.
Result 11.4. The linear combination \hat{y} = \hat{a}'x = (\bar{x}_1 - \bar{x}_2)' S_{pooled}^{-1}\, x maximizes the ratio

    \frac{\left(\begin{array}{c}\text{squared distance between}\\ \text{sample means of } y\end{array}\right)}{\left(\text{sample variance of } y\right)} = \frac{(\bar{y}_1 - \bar{y}_2)^2}{s_y^2} = \frac{(\hat{a}'\bar{x}_1 - \hat{a}'\bar{x}_2)^2}{\hat{a}' S_{pooled}\, \hat{a}} = \frac{(\hat{a}'d)^2}{\hat{a}' S_{pooled}\, \hat{a}}        (11-33)

over all possible coefficient vectors \hat{a}, where d = (\bar{x}_1 - \bar{x}_2). The maximum of the ratio (11-33) is D^2 = (\bar{x}_1 - \bar{x}_2)' S_{pooled}^{-1}(\bar{x}_1 - \bar{x}_2).

Proof. The maximum of the ratio in (11-33) is given by applying (2-50) directly. Thus, setting d = (\bar{x}_1 - \bar{x}_2), we have

    \max_{\hat{a}} \frac{(\hat{a}'d)^2}{\hat{a}' S_{pooled}\, \hat{a}} = d' S_{pooled}^{-1}\, d = (\bar{x}_1 - \bar{x}_2)' S_{pooled}^{-1}(\bar{x}_1 - \bar{x}_2) = D^2

where D^2 is the sample squared distance between the two means. ■

Note that s_y^2 in (11-33) may be calculated as

    s_y^2 = \frac{\sum_{j=1}^{n_1}(y_{1j} - \bar{y}_1)^2 + \sum_{j=1}^{n_2}(y_{2j} - \bar{y}_2)^2}{n_1 + n_2 - 2}        (11-34)

with y_{1j} = \hat{a}' x_{1j} and y_{2j} = \hat{a}' x_{2j}.

Example 11.8 (Fisher's linear discriminant for the hemophilia data)

A study concerned with the detection of hemophilia A carriers was introduced in Example 11.3. Recall that the equal costs and equal priors linear discriminant function was

    \hat{y} = \hat{a}'x = (\bar{x}_1 - \bar{x}_2)' S_{pooled}^{-1}\, x = 37.61\,x_1 - 28.92\,x_2

This linear discriminant function is Fisher's linear function, which maximally separates the two populations, and the maximum separation in the samples is

    D^2 = (\bar{x}_1 - \bar{x}_2)' S_{pooled}^{-1}(\bar{x}_1 - \bar{x}_2)
        = [.2418,\ -.0652]\begin{bmatrix} 131.158 & -90.423 \\ -90.423 & 108.147 \end{bmatrix}\begin{bmatrix} .2418 \\ -.0652 \end{bmatrix} = 10.98    ■
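The two quantities in Example 11.8 follow directly from the displayed inputs. A quick numerical check (mine, not the book's; assuming numpy):

```python
import numpy as np

# Quantities displayed in Example 11.8
d = np.array([0.2418, -0.0652])                      # xbar1 - xbar2
Sp_inv = np.array([[131.158, -90.423],
                   [-90.423, 108.147]])              # S_pooled^-1

a = Sp_inv @ d       # Fisher coefficients: close to 37.61 and -28.92
D2 = d @ Sp_inv @ d  # maximum separation D^2: close to 10.98
print(a, D2)
```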
Fisher's solution to the separation problem can also be used to classify new observations.
An allocation rule based on Fisher's discriminant function:

Allocate x_0 to \pi_1 if

    \hat{y}_0 = (\bar{x}_1 - \bar{x}_2)' S_{pooled}^{-1}\, x_0 \ \ge\ \hat{m} = \tfrac{1}{2}(\bar{x}_1 - \bar{x}_2)' S_{pooled}^{-1}(\bar{x}_1 + \bar{x}_2)        (11-35)

Allocate x_0 to \pi_2 otherwise.

The procedure (11-35) is illustrated, schematically, for p = 2 in Figure 11.8. All points in the scatter plots are projected onto a line in the direction \hat{a}, and this direction is varied until the samples are maximally separated.^8

Figure 11.8 A pictorial representation of Fisher's procedure for two populations with p = 2.

Fisher's linear discriminant function in (11-35) was developed under the assumption that the two populations, whatever their form, have a common covariance matrix. Consequently, it may not be surprising that Fisher's method corresponds to a particular case of the minimum expected-cost-of-misclassification rule. The first term, \hat{y} = (\bar{x}_1 - \bar{x}_2)' S_{pooled}^{-1}\, x, in the classification rule (11-18) is the linear function obtained by Fisher that maximizes the univariate "between" samples variability relative to the "within" samples variability. [See (11-33).] The entire expression

^8 We must have (n_1 + n_2 - 2) \ge p; otherwise S_{pooled} is singular, and the usual inverse, S_{pooled}^{-1}, does not exist.
    \hat{w} = (\bar{x}_1 - \bar{x}_2)' S_{pooled}^{-1}\, x - \tfrac{1}{2}(\bar{x}_1 - \bar{x}_2)' S_{pooled}^{-1}(\bar{x}_1 + \bar{x}_2)
            = (\bar{x}_1 - \bar{x}_2)' S_{pooled}^{-1}\left[x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\right]        (11-36)

is frequently called Anderson's classification function (statistic). Once again, if [(c(1|2)/c(2|1))(p_2/p_1)] = 1, so that \ln[(c(1|2)/c(2|1))(p_2/p_1)] = 0, Rule (11-18) is comparable to Rule (11-35), based on Fisher's linear discriminant function. Thus, provided that the two normal populations have the same covariance matrix, Fisher's classification rule is equivalent to the minimum ECM rule with equal prior probabilities and equal costs of misclassification.

Comment. In sum, for two populations, the maximum relative separation that can be obtained by considering linear combinations of the multivariate observations is equal to the distance D^2. This is convenient because D^2 can be used, in certain situations, to test whether the population means \mu_1 and \mu_2 differ significantly. Consequently, a test for differences in mean vectors can be viewed as a test for the "significance" of the separation that can be achieved.

Suppose the populations \pi_1 and \pi_2 are multivariate normal with a common covariance matrix \Sigma. Then, as in Section 6.3, a test of H_0: \mu_1 = \mu_2 versus H_1: \mu_1 \neq \mu_2 is accomplished by referring

    \left(\frac{n_1 + n_2 - p - 1}{(n_1 + n_2 - 2)\,p}\right)\left(\frac{n_1 n_2}{n_1 + n_2}\right) D^2

to an F-distribution with \nu_1 = p and \nu_2 = n_1 + n_2 - p - 1 d.f. If H_0 is rejected, we can conclude that the separation between the two populations \pi_1 and \pi_2 is significant.

Significant separation does not necessarily imply good classification. As we have seen in Section 11.4, the efficacy of a classification procedure can be evaluated independently of any test of separation. On the other hand, if the separation is not significant, the search for a useful classification rule will probably prove fruitless.
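The F statistic above is a simple function of D^2 and the sample sizes. A small sketch (mine; the inputs here are hypothetical, reusing D^2 = 10.98 with assumed samples of n_1 = n_2 = 50 on p = 2 variables):

```python
def separation_F(D2, n1, n2, p):
    """F statistic for H0: mu1 = mu2 based on the sample distance D^2.
    Compare with an F(p, n1 + n2 - p - 1) critical value."""
    return ((n1 + n2 - p - 1) / ((n1 + n2 - 2) * p)) * (n1 * n2 / (n1 + n2)) * D2

# Hypothetical illustration: D^2 = 10.98, n1 = n2 = 50, p = 2,
# giving an F value on (2, 97) d.f. far beyond any usual critical value.
print(round(separation_F(10.98, 50, 50, 2), 2))  # 135.85
```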
CLASSIFICATI O N WITH SEVERAL POPU LATI ONS
The generalization of classification procedures from two to g >= 2 groups is straightforward. However, not much is known about the properties of the corresponding sample classification functions, and, in particular, their error rates have not been fully investigated. The "robustness" of the two group linear classification statistics to, for instance, unequal covariances or nonnormal distributions can be studied with computer-generated sampling experiments.9 For more than two populations, this approach does not lead to general conclusions, because the properties depend on where the populations are located, and there are far too many configurations to study conveniently.

9 Here robustness refers to the deterioration in error rates caused by using a classification procedure with data that do not conform to the assumptions on which the procedure was based. It is very difficult to study the robustness of classification procedures analytically. However, data from a wide variety of distributions with different covariance structures can be easily generated on a computer. The performance of various classification rules can then be evaluated using computer-generated "samples" from these distributions.

Section 11.6  Classification with Several Populations  613

As before, our approach in this section will be to develop the theoretically optimal rules first and then indicate the modifications required for real-world applications.

The Minimum Expected Cost of Misclassification Method

Let f_i(x) be the density associated with population pi_i, i = 1, 2, ..., g. [For the most part, we shall take f_i(x) to be a multivariate normal density, but this is unnecessary for the development of the general theory.] Let

    p_i = the prior probability of population pi_i,  i = 1, 2, ..., g
    c(k | i) = the cost of allocating an item to pi_k when, in fact, it belongs to pi_i,  for k, i = 1, 2, ..., g

For k = i, c(i | i) = 0. Finally, let R_k be the set of x's classified as pi_k and

    P(k | i) = P(classifying item as pi_k | pi_i) = integral over R_k of f_i(x) dx   for k, i = 1, 2, ..., g

with P(i | i) = 1 - sum_{k=1, k != i}^{g} P(k | i).

The conditional expected cost of misclassifying an x from pi_1 into pi_2, or pi_3, ..., or pi_g is

    ECM(1) = P(2 | 1)c(2 | 1) + P(3 | 1)c(3 | 1) + ... + P(g | 1)c(g | 1) = sum_{k=2}^{g} P(k | 1)c(k | 1)

This conditional expected cost occurs with prior probability p_1, the probability of pi_1. In a similar manner, we can obtain the conditional expected costs of misclassification ECM(2), ..., ECM(g). Multiplying each conditional expected cost by its prior probability and summing gives the overall ECM:

    ECM = p_1 ECM(1) + p_2 ECM(2) + ... + p_g ECM(g)
        = p_1 ( sum_{k=2}^{g} P(k | 1)c(k | 1) ) + p_2 ( sum_{k=1, k != 2}^{g} P(k | 2)c(k | 2) )
          + ... + p_g ( sum_{k=1}^{g-1} P(k | g)c(k | g) )
        = sum_{i=1}^{g} p_i ( sum_{k=1, k != i}^{g} P(k | i)c(k | i) )                          (11-37)

Determining an optimal classification procedure amounts to choosing the mutually exclusive and exhaustive classification regions R_1, R_2, ..., R_g such that (11-37) is a minimum.
614  Chapter 11  Discrimination and Classification
Result 11.5. The classification regions that minimize the ECM (11-37) are defined by allocating x to that population pi_k, k = 1, 2, ..., g, for which

    sum_{i=1, i != k}^{g} p_i f_i(x) c(k | i)                                                   (11-38)

is smallest. If a tie occurs, x can be assigned to any of the tied populations.

Proof. See Anderson [2].

Suppose all the misclassification costs are equal, in which case the minimum expected cost of misclassification rule is the minimum total probability of misclassification rule. (Without loss of generality, we can set all the misclassification costs equal to 1.) Using the argument leading to (11-38), we would allocate x to that population pi_k, k = 1, 2, ..., g, for which

    sum_{i=1, i != k}^{g} p_i f_i(x)                                                            (11-39)

is smallest. Now, (11-39) will be smallest when the omitted term, p_k f_k(x), is largest. Consequently, when the misclassification costs are the same, the minimum expected cost of misclassification rule has the following rather simple form.

Minimum ECM Classification Rule with Equal Misclassification Costs
    Allocate x_0 to pi_k if
        p_k f_k(x) > p_i f_i(x)   for all i != k                                                (11-40)

It is interesting to note that the classification rule in (11-40) is identical to the one that maximizes the "posterior" probability P(pi_k | x) = P(x comes from pi_k, given that x was observed), where

    P(pi_k | x) = p_k f_k(x) / sum_{i=1}^{g} p_i f_i(x)
               = (prior) x (likelihood) / sum [ (prior) x (likelihood) ]   for k = 1, 2, ..., g  (11-42)

Equation (11-42) is the generalization of Equation (11-9) to g >= 2 groups.
You should keep in mind that, in general, the minimum ECM rules have three components: prior probabilities, misclassification costs, and density functions. These components must be specified (or estimated) before the rules can be implemented.

Example 11.9 (Classifying a new observation into one of three known populations)

Let us assign an observation x_0 to one of g = 3 populations, given the following hypothetical prior probabilities, misclassification costs, and density values:

                                        True population
                               pi_1            pi_2            pi_3
    Classify as:  pi_1     c(1|1) = 0      c(1|2) = 500    c(1|3) = 100
                  pi_2     c(2|1) = 10     c(2|2) = 0      c(2|3) = 50
                  pi_3     c(3|1) = 50     c(3|2) = 200    c(3|3) = 0
    Prior probabilities:   p_1 = .05       p_2 = .60       p_3 = .35
    Densities at x_0:      f_1(x_0) = .01  f_2(x_0) = .85  f_3(x_0) = 2

We shall use the minimum ECM procedure. The values of sum_{i=1, i != k}^{3} p_i f_i(x_0) c(k | i) [see (11-38)] are

    k = 1:  p_2 f_2(x_0)c(1 | 2) + p_3 f_3(x_0)c(1 | 3) = (.60)(.85)(500) + (.35)(2)(100) = 325
    k = 2:  p_1 f_1(x_0)c(2 | 1) + p_3 f_3(x_0)c(2 | 3) = (.05)(.01)(10) + (.35)(2)(50) = 35.005
    k = 3:  p_1 f_1(x_0)c(3 | 1) + p_2 f_2(x_0)c(3 | 2) = (.05)(.01)(50) + (.60)(.85)(200) = 102.025

Since the expected cost sum is smallest for k = 2, we would allocate x_0 to pi_2.

If all costs of misclassification were equal, we would assign x_0 according to (11-40), which requires only the products

    p_1 f_1(x_0) = (.05)(.01) = .0005
    p_2 f_2(x_0) = (.60)(.85) = .510
    p_3 f_3(x_0) = (.35)(2)   = .700

Since p_3 f_3(x_0) = .700 is the largest, we should allocate x_0 to pi_3. Equivalently, calculating the posterior probabilities [see (11-42)] with

    sum_{i=1}^{3} p_i f_i(x_0) = (.05)(.01) + (.60)(.85) + (.35)(2) = 1.2105

we obtain

    P(pi_1 | x_0) = p_1 f_1(x_0) / sum p_i f_i(x_0) = .0005/1.2105 = .0004
    P(pi_2 | x_0) = p_2 f_2(x_0) / sum p_i f_i(x_0) = .510/1.2105  = .421
    P(pi_3 | x_0) = p_3 f_3(x_0) / sum p_i f_i(x_0) = .700/1.2105  = .578

We see that x_0 is allocated to pi_3, the population with the largest posterior probability.
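The minimum ECM arithmetic in this example is mechanical enough to script. The sketch below (Python with NumPy; the function and variable names are ours) evaluates the expected-cost sums of (11-38) for each k, and then the equal-cost posterior probabilities of (11-42).

```python
import numpy as np

def min_ecm_allocation(priors, densities, costs):
    """Minimum ECM rule (11-38): for each k, sum p_i f_i(x0) c(k|i)
    over i != k and allocate to the population with the smallest sum.
    costs[k, i] = c(k|i), the cost of classifying into population k
    an item that actually belongs to population i."""
    p = np.asarray(priors, dtype=float)
    f = np.asarray(densities, dtype=float)
    g = len(p)
    scores = np.array([sum(p[i] * f[i] * costs[k, i]
                           for i in range(g) if i != k)
                       for k in range(g)])
    return int(np.argmin(scores)), scores

# Example 11.9 inputs (rows of `costs` index the "classify as" group k)
costs = np.array([[0.0, 500.0, 100.0],
                  [10.0, 0.0, 50.0],
                  [50.0, 200.0, 0.0]])
priors = [0.05, 0.60, 0.35]
dens = [0.01, 0.85, 2.0]

k, scores = min_ecm_allocation(priors, dens, costs)
# scores -> [325.0, 35.005, 102.025], so x0 goes to pi_2 (index k = 1)

# With equal misclassification costs, maximize the posterior (11-42):
posterior = np.array(priors) * np.array(dens)
posterior = posterior / posterior.sum()
# posterior -> approximately [.0004, .421, .578], so x0 goes to pi_3
```

With all costs equal, minimizing (11-39) and maximizing the posterior agree, which is why the two rules coincide.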
Classification with Normal Populations

An important special case occurs when the

    f_i(x) = (2 pi)^{-p/2} |Sigma_i|^{-1/2} exp( -(1/2)(x - mu_i)' Sigma_i^{-1} (x - mu_i) ),   i = 1, 2, ..., g     (11-43)

are multivariate normal densities with mean vectors mu_i and covariance matrices Sigma_i. If, further, c(i | i) = 0 and c(k | i) = 1 for k != i (or, equivalently, the misclassification costs are all equal), then the minimum ECM rule (11-40) becomes: Allocate x to pi_k if

    ln p_k f_k(x) = ln p_k - (p/2) ln(2 pi) - (1/2) ln |Sigma_k| - (1/2)(x - mu_k)' Sigma_k^{-1} (x - mu_k)
                  = max over i of ln p_i f_i(x)                                                  (11-44)

The constant (p/2) ln(2 pi) can be ignored in (11-44), since it is the same for all populations. We therefore define the quadratic discrimination score for the ith population to be

    d_i^Q(x) = -(1/2) ln |Sigma_i| - (1/2)(x - mu_i)' Sigma_i^{-1} (x - mu_i) + ln p_i,   i = 1, 2, ..., g     (11-45)

The quadratic score d_i^Q(x) is composed of contributions from the generalized variance |Sigma_i|, the prior probability p_i, and the square of the distance from x to the population mean mu_i. Note, however, that a different distance function, with a different orientation and size of the constant-distance ellipsoid, must be used for each population.
Using discriminant scores, we find that the classification rule (11-44) becomes the following:

Minimum TPM Rule for Normal Populations (Unequal Sigma_i)
    Allocate x to pi_k if the quadratic score
        d_k^Q(x) = largest of d_1^Q(x), d_2^Q(x), ..., d_g^Q(x)                                 (11-46)
    where d_i^Q(x) is given by (11-45).

In practice, the mu_i and Sigma_i are unknown, but a training set of correctly classified observations is often available for the construction of estimates. The relevant sample quantities for population pi_i are

    xbar_i = sample mean vector,  S_i = sample covariance matrix,  and  n_i = sample size

The estimate of the quadratic discrimination score is then

    dhat_i^Q(x) = -(1/2) ln |S_i| - (1/2)(x - xbar_i)' S_i^{-1} (x - xbar_i) + ln p_i,   i = 1, 2, ..., g     (11-47)

and the classification rule based on the sample is as follows:

Estimated Minimum TPM Rule for Normal Populations (Unequal Sigma_i)
    Allocate x to pi_k if the quadratic score
        dhat_k^Q(x) = largest of dhat_1^Q(x), dhat_2^Q(x), ..., dhat_g^Q(x)                     (11-48)
    where dhat_i^Q(x) is given by (11-47).
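Because each group contributes its own covariance matrix to the score, (11-47) is naturally coded as a per-group function. A minimal sketch in Python with NumPy; the two groups below are invented for illustration and are not from the text.

```python
import numpy as np

def quadratic_score(x, xbar, S, prior):
    """Estimated quadratic discrimination score (11-47):
    -0.5 ln|S| - 0.5 (x - xbar)' S^{-1} (x - xbar) + ln(prior)."""
    dev = x - xbar
    _, logdet = np.linalg.slogdet(S)          # stable log-determinant
    return -0.5 * logdet - 0.5 * dev @ np.linalg.solve(S, dev) + np.log(prior)

# Two hypothetical bivariate groups (our numbers, for illustration only)
x = np.array([0.0, 0.0])
groups = [(np.array([1.0, 1.0]), np.eye(2), 0.5),
          (np.array([-2.0, 0.0]), 2.0 * np.eye(2), 0.5)]
scores = [quadratic_score(x, m, S, p) for m, S, p in groups]
best = int(np.argmax(scores))                 # allocate x by (11-48)
```

Using `slogdet` and `solve` (rather than an explicit determinant and inverse) keeps the score numerically stable when a group covariance matrix is nearly singular.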
A simplification is possible if the population covariance matrices, Sigma_i, are equal. When Sigma_i = Sigma for i = 1, 2, ..., g, the discriminant score in (11-45) becomes

    d_i^Q(x) = -(1/2) ln |Sigma| - (1/2) x' Sigma^{-1} x + mu_i' Sigma^{-1} x - (1/2) mu_i' Sigma^{-1} mu_i + ln p_i

The first two terms are the same for d_1^Q(x), d_2^Q(x), ..., d_g^Q(x), and, consequently, they can be ignored for allocative purposes. The remaining terms consist of a constant c_i = ln p_i - (1/2) mu_i' Sigma^{-1} mu_i and a linear combination of the components of x.

Next, define the linear discriminant score

    d_i(x) = mu_i' Sigma^{-1} x - (1/2) mu_i' Sigma^{-1} mu_i + ln p_i   for i = 1, 2, ..., g     (11-49)
An estimate dhat_i(x) of the linear discriminant score d_i(x) is based on the pooled estimate of Sigma,

    S_pooled = (1/(n_1 + n_2 + ... + n_g - g)) ( (n_1 - 1) S_1 + (n_2 - 1) S_2 + ... + (n_g - 1) S_g )     (11-50)

and is given by

    dhat_i(x) = xbar_i' S_pooled^{-1} x - (1/2) xbar_i' S_pooled^{-1} xbar_i + ln p_i   for i = 1, 2, ..., g     (11-51)

Consequently, we have the following:

Estimated Minimum TPM Rule for Equal-Covariance Normal Populations
    Allocate x to pi_k if the linear discriminant score
        dhat_k(x) = largest of dhat_1(x), dhat_2(x), ..., dhat_g(x)                             (11-52)
    where dhat_i(x) is given by (11-51).
Comment. Expression (11-49) is a convenient linear function of x. An equivalent classifier for the equal-covariance case can be obtained from (11-45) by ignoring the constant term, -(1/2) ln |Sigma|. The result, with sample estimates inserted for unknown population quantities, can then be interpreted in terms of the squared distances

    D_i^2(x) = (x - xbar_i)' S_pooled^{-1} (x - xbar_i)                                         (11-53)

from x to the sample mean vector xbar_i. The allocatory rule is then

    Assign x to the population pi_i for which -(1/2) D_i^2(x) + ln p_i is largest.              (11-54)

We see that this rule, or, equivalently, (11-52), assigns x to the "closest" population. (The distance measure is penalized by ln p_i.) If the prior probabilities are unknown, the usual procedure is to set p_1 = p_2 = ... = p_g = 1/g. An observation is then assigned to the closest population.
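The computations in (11-50) and (11-51) are easy to script. A minimal sketch in Python with NumPy, using the group summaries that appear in Example 11.10 below (the helper names are ours):

```python
import numpy as np

# Group summaries for three bivariate populations (Example 11.10)
means = [np.array([-1.0, 3.0]), np.array([1.0, 4.0]), np.array([0.0, -2.0])]
covs = [np.array([[1.0, -1.0], [-1.0, 4.0]]),
        np.array([[1.0, -1.0], [-1.0, 4.0]]),
        np.array([[1.0, 1.0], [1.0, 4.0]])]
n = [3, 3, 3]
priors = np.array([0.25, 0.25, 0.50])
g = len(means)

# Pooled covariance estimate (11-50)
S_pooled = sum((ni - 1) * Si for ni, Si in zip(n, covs)) / (sum(n) - g)
S_inv = np.linalg.inv(S_pooled)

def linear_scores(x):
    """Estimated linear discriminant scores dhat_i(x) of (11-51)."""
    return np.array([m @ S_inv @ x - 0.5 * m @ S_inv @ m + np.log(p)
                     for m, p in zip(means, priors)])

x0 = np.array([-2.0, -1.0])
scores = linear_scores(x0)
# scores approx [-1.943, -8.158, -0.350]; allocate x0 to pi_3 by (11-52)
```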
Example 11.10 (Calculating sample discriminant scores, assuming a common covariance matrix)

Let us calculate the linear discriminant scores based on data from g = 3 populations assumed to be bivariate normal with a common covariance matrix. Random samples from the populations pi_1, pi_2, and pi_3, along with the sample mean vectors and covariance matrices, are as follows:

    pi_1:  X_1 = [ -2  5 ]    so  n_1 = 3,  xbar_1 = [ -1 ],  and  S_1 = [  1  -1 ]
                 [  0  3 ]                           [  3 ]              [ -1   4 ]
                 [ -1  1 ]

    pi_2:  X_2 = [  0  6 ]    so  n_2 = 3,  xbar_2 = [  1 ],  and  S_2 = [  1  -1 ]
                 [  2  4 ]                           [  4 ]              [ -1   4 ]
                 [  1  2 ]

    pi_3:  X_3 = [  1 -2 ]    so  n_3 = 3,  xbar_3 = [  0 ],  and  S_3 = [  1   1 ]
                 [  0  0 ]                           [ -2 ]              [  1   4 ]
                 [ -1 -4 ]

Given that p_1 = p_2 = .25 and p_3 = .50, let us classify the observation x_0' = [x_01, x_02] = [-2, -1] according to (11-52). From (11-50),

    S_pooled = ((3 - 1)/(9 - 3)) S_1 + ((3 - 1)/(9 - 3)) S_2 + ((3 - 1)/(9 - 3)) S_3
             = (1/3) ( [ 1 -1; -1 4 ] + [ 1 -1; -1 4 ] + [ 1 1; 1 4 ] )
             = [  1    -1/3 ]
               [ -1/3    4  ]

so

    S_pooled^{-1} = (1/35) [ 36  3 ]
                           [  3  9 ]

Next,

    xbar_1' S_pooled^{-1} = [-1, 3] (1/35) [ 36 3; 3 9 ] = (1/35) [-27, 24]

and

    xbar_1' S_pooled^{-1} xbar_1 = (1/35) [-27, 24] [ -1; 3 ] = 99/35

so

    dhat_1(x_0) = ln p_1 + xbar_1' S_pooled^{-1} x_0 - (1/2) xbar_1' S_pooled^{-1} xbar_1
                = ln(.25) + (-27/35) x_01 + (24/35) x_02 - (1/2)(99/35)

In a similar manner,

    xbar_2' S_pooled^{-1} = [1, 4] (1/35) [ 36 3; 3 9 ] = (1/35) [48, 39],   xbar_2' S_pooled^{-1} xbar_2 = 204/35

so

    dhat_2(x_0) = ln(.25) + (48/35) x_01 + (39/35) x_02 - (1/2)(204/35)

Finally,

    xbar_3' S_pooled^{-1} = [0, -2] (1/35) [ 36 3; 3 9 ] = (1/35) [-6, -18],   xbar_3' S_pooled^{-1} xbar_3 = 36/35

so

    dhat_3(x_0) = ln(.50) + (-6/35) x_01 + (-18/35) x_02 - (1/2)(36/35)

Substituting the numerical values x_01 = -2 and x_02 = -1 gives

    dhat_1(x_0) = -1.386 + (-27/35)(-2) + (24/35)(-1) - 99/70  = -1.943
    dhat_2(x_0) = -1.386 + (48/35)(-2)  + (39/35)(-1) - 204/70 = -8.158
    dhat_3(x_0) = -.693  + (-6/35)(-2)  + (-18/35)(-1) - 36/70 = -.350

Since dhat_3(x_0) = -.350 is the largest discriminant score, we allocate x_0 to pi_3.

Example 11.11 (Classifying a potential business-school graduate student)

The admission officer of a business school has used an "index" of undergraduate grade point average (GPA) and graduate management aptitude test (GMAT) scores to help decide which applicants should be admitted to the school's graduate programs. Figure 11.9 shows pairs of x_1 = GPA, x_2 = GMAT values for groups of recent applicants who have been categorized as pi_1: admit; pi_2: do not admit; pi_3: borderline.10 The data pictured are listed in Table 11.6. (See Exercise 11.29.) These data yield (see the SAS statistical software output in Panel 11.1)

    n_1 = 31,  xbar_1 = [3.40, 561.23]'
    n_2 = 28,  xbar_2 = [2.48, 447.07]'
    n_3 = 26,  xbar_3 = [2.99, 446.23]'
               xbar   = [2.97, 488.45]'

and

    S_pooled = [   .0361    -2.0188  ]
               [ -2.0188   3655.9011 ]
10 In this case, the populations are artificial in the sense that they have been created by the admissions officer. On the other hand, experience has shown that applicants with high GPA and high GMAT scores generally do well in a graduate program; those with low readings on these variables generally experience difficulty.
Figure 11.9  Scatter plot of (x_1 = GPA, x_2 = GMAT) for applicants to a graduate school of business who have been classified as admit (A, n_1), do not admit (B, n_2), or borderline (C, n_3).
PANEL 11.1  SAS ANALYSIS FOR ADMISSION DATA USING PROC DISCRIM.

PROGRAM COMMANDS:

    title 'Discriminant Analysis';
    data gpa;
       infile 'T11-6.dat';
       input gpa gmat admit $;
    proc discrim data = gpa method = normal pool = yes
       manova wcov pcov listerr crosslisterr;
       priors 'admit' = .3333 'notadmit' = .3333 'border' = .3333;
       class admit;
       var gpa gmat;

OUTPUT:

    DISCRIMINANT ANALYSIS
       85 Observations    84 DF Total
        2 Variables       82 DF Within Classes
        3 Classes          2 DF Between Classes

    Class Level Information
       ADMIT      Weight     Proportion
       admit      31.0000    0.364706
       border     26.0000    0.305882
       notadmit   28.0000    0.329412

    WITHIN-CLASS COVARIANCE MATRICES
       ADMIT = admit     DF = 30
       Variable    GPA          GMAT
       GPA         0.043558     0.058097
       GMAT        0.058097     4618.247312

       ADMIT = border    DF = 25
       Variable    GPA          GMAT
       GPA         0.029692     -5.403846
       GMAT        -5.403846    2246.904615

       ADMIT = notadmit  DF = 27
       Variable    GPA          GMAT
       GPA         0.033649     -1.192037
       GMAT        -1.192037    3891.253968

    Multivariate Statistics and F Approximations
       S = 2   M = -0.5   N = 39.5
       Statistic                 Value         F          Num DF   Den DF   Pr > F
       Wilks' Lambda             0.12637661    73.4257    4        162      0.0001
       Pillai's Trace            1.00963002    41.7973    4        164      0.0001
       Hotelling-Lawley Trace    5.83665601    116.7331   4        160      0.0001
       Roy's Greatest Root       5.64604452    231.4878   2        82       0.0001
       NOTE: F Statistic for Roy's Greatest Root is an upper bound.
       NOTE: F Statistic for Wilks' Lambda is exact.

    LINEAR DISCRIMINANT FUNCTION
       Constant = -.5 Xbar_j' COV^{-1} Xbar_j + ln PRIOR_j    Coefficient Vector = COV^{-1} Xbar_j
                    admit         border        notadmit
       CONSTANT     -241.47030    -134.99753    -178.41437
       GPA          106.24991     78.08637      92.66953
       GMAT         0.21218       0.16541       0.17323

    Generalized Squared Distance Function: D_j^2(X) = (X - Xbar_j)' COV^{-1} (X - Xbar_j)
    Posterior Probability of Membership in each ADMIT:
       Pr(j | X) = exp(-.5 D_j^2(X)) / SUM_k exp(-.5 D_k^2(X))

    Posterior Probability of Membership in ADMIT:
       Obs   From ADMIT   Classified into ADMIT   admit    border   notadmit
        2    admit        border *                0.1202   0.8778   0.0020
        3    admit        border *                0.3654   0.6342   0.0004
       24    admit        border *                0.4766   0.5234   0.0000
       31    admit        border *                0.2964   0.7032   0.0004
       58    notadmit     border *                0.0001   0.7550   0.2450
       59    notadmit     border *                0.0001   0.8673   0.1326
       66    border       admit *                 0.5336   0.4664   0.0000
       * Misclassified observation

    Generalized Squared Distance Function (cross-validation):
       D_j^2(X) = (X - Xbar_(X)j)' COV_(X)^{-1} (X - Xbar_(X)j)
    Posterior Probability of Membership in each ADMIT:
       Pr(j | X) = exp(-.5 D_j^2(X)) / SUM_k exp(-.5 D_k^2(X))

    Number of Observations and Percent Classified into ADMIT:
       From ADMIT   admit        border       notadmit     Total
       admit        26 (83.87)   5 (16.13)    0 (0.00)     31 (100.00)
       border       1 (3.85)     24 (92.31)   1 (3.85)     26 (100.00)
       notadmit     0 (0.00)     2 (7.14)     26 (92.86)   28 (100.00)
       Total        27 (31.76)   31 (36.47)   27 (31.76)   85 (100.00)
       Priors       0.3333       0.3333       0.3333

    Error Count Estimates for ADMIT:
                 admit     border    notadmit   Total
       Rate      0.1613    0.0769    0.0714     0.1032
       Priors    0.3333    0.3333    0.3333
Suppose a new applicant has an undergraduate GPA of x_1 = 3.21 and a GMAT score of x_2 = 497. Let us classify this applicant using the rule in (11-54) with equal prior probabilities. With x_0' = [3.21, 497], the sample squared distances are

    Dhat_1^2(x_0) = (x_0 - xbar_1)' S_pooled^{-1} (x_0 - xbar_1)
                  = [3.21 - 3.40, 497 - 561.23] [ 28.6096  .0158 ] [ 3.21 - 3.40  ]
                                                [  .0158   .0003 ] [ 497 - 561.23 ]
                  = 2.58
    Dhat_2^2(x_0) = (x_0 - xbar_2)' S_pooled^{-1} (x_0 - xbar_2) = 17.10
    Dhat_3^2(x_0) = (x_0 - xbar_3)' S_pooled^{-1} (x_0 - xbar_3) = 2.47

Since the distance from x_0' = [3.21, 497] to the group mean xbar_3 is smallest, we assign this applicant to pi_3, borderline.

The linear discriminant scores (11-49) can be compared, two at a time. Using these quantities, we see that the condition that d_k(x) is the largest linear discriminant score among d_1(x), d_2(x), ..., d_g(x) is equivalent to

    0 <= d_k(x) - d_i(x) = (mu_k - mu_i)' Sigma^{-1} x - (1/2)(mu_k - mu_i)' Sigma^{-1} (mu_k + mu_i) + ln(p_k/p_i)

for all i = 1, 2, ..., g. Adding -ln(p_k/p_i) = ln(p_i/p_k) to both sides of the preceding inequality gives the alternative form of the classification rule that minimizes the total probability of misclassification. Thus, we

Allocate x to pi_k if

    (mu_k - mu_i)' Sigma^{-1} x - (1/2)(mu_k - mu_i)' Sigma^{-1} (mu_k + mu_i) >= ln(p_i/p_k)   for all i = 1, 2, ..., g     (11-55)

Now, denote the left-hand side of (11-55) by d_ki(x). Then the conditions in (11-55) define classification regions R_1, R_2, ..., R_g, which are separated by (hyper)planes. This follows because d_ki(x) is a linear combination of the components of x. For example, when g = 3, the classification region R_1 consists of all x satisfying

    R_1: d_1i(x) >= ln(p_i/p_1)   for i = 2, 3

That is, R_1 consists of those x for which

    d_12(x) = (mu_1 - mu_2)' Sigma^{-1} x - (1/2)(mu_1 - mu_2)' Sigma^{-1} (mu_1 + mu_2) >= ln(p_2/p_1)
Figure 11.10  The classification regions R_1, R_2, and R_3 for the linear minimum TPM rule (p_1 = 1/4, p_2 = 1/2, p_3 = 1/4).
and, simultaneously,

    d_13(x) = (mu_1 - mu_3)' Sigma^{-1} x - (1/2)(mu_1 - mu_3)' Sigma^{-1} (mu_1 + mu_3) >= ln(p_3/p_1)

Assuming that mu_1, mu_2, and mu_3 do not lie along a straight line, the equations d_12(x) = ln(p_2/p_1) and d_13(x) = ln(p_3/p_1) define two intersecting hyperplanes that delineate R_1 in the p-dimensional variable space. The term ln(p_2/p_1) places the plane closer to mu_1 than to mu_2 if p_2 is greater than p_1. The regions R_1, R_2, and R_3 are shown in Figure 11.10 for the case of two variables. The picture is the same for more variables if we graph the plane that contains the three mean vectors.

The sample version of the alternative form in (11-55) is obtained by substituting xbar_i for mu_i and inserting the pooled sample covariance matrix S_pooled for Sigma. When sum_{i=1}^{g} (n_i - 1) >= p, so that S_pooled^{-1} exists, this sample analog becomes:

Allocate x to pi_k if

    dhat_ki(x) = (xbar_k - xbar_i)' S_pooled^{-1} x - (1/2)(xbar_k - xbar_i)' S_pooled^{-1} (xbar_k + xbar_i) >= ln(p_i/p_k)
    for all i != k                                                                              (11-56)

Given the fixed training set values xbar_i and S_pooled, dhat_ki(x) is a linear function of the components of x. Therefore, the classification regions defined by (11-56), or, equivalently, by (11-52), are also bounded by hyperplanes, as in Figure 11.10. As with the sample linear discriminant rule of (11-52), if the prior probabilities are difficult to assess, they are frequently all taken to be equal. In this case, ln(p_i/p_k) = 0 for all pairs.
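With the summary statistics of Example 11.11, the pairwise rule (11-56) takes only a few lines of NumPy. In this sketch (the helper name dhat is ours), equal priors make every threshold ln(p_i/p_k) zero.

```python
import numpy as np

# Summary statistics for the admission data (Example 11.11)
xbar = [np.array([3.40, 561.23]),    # admit
        np.array([2.48, 447.07]),    # do not admit
        np.array([2.99, 446.23])]    # borderline
S_pooled = np.array([[0.0361, -2.0188],
                     [-2.0188, 3655.9011]])
S_inv = np.linalg.inv(S_pooled)
priors = np.full(3, 1 / 3)

def dhat(k, i, x):
    """Pairwise linear classification function dhat_ki(x) of (11-56)."""
    diff, s = xbar[k] - xbar[i], xbar[k] + xbar[i]
    return diff @ S_inv @ x - 0.5 * diff @ S_inv @ s

x0 = np.array([3.21, 497.0])
# Allocate x0 to the k with dhat_ki(x0) >= ln(p_i/p_k) for every i != k
k = next(k for k in range(3)
         if all(dhat(k, i, x0) >= np.log(priors[i] / priors[k])
                for i in range(3) if i != k))
# k -> 2, the borderline group, in agreement with Example 11.11
```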
Because they employ estimates of population parameters, the sample classification rules (11-48) and (11-52) may no longer be optimal. Their performance, however, can be evaluated using Lachenbruch's holdout procedure. If n_i^(H) is the number of misclassified holdout observations in the ith group, i = 1, 2, ..., g, then an estimate of the expected actual error rate, E(AER), is provided by

    Ehat(AER) = sum_{i=1}^{g} n_i^(H) / sum_{i=1}^{g} n_i                                       (11-57)
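Lachenbruch's holdout procedure is mechanical: each observation is withheld in turn, the rule is refit on the rest, and misclassifications are tallied as in (11-57). A sketch, assuming the linear rule (11-52); the data here are simulated for illustration and are not from any data set in the text.

```python
import numpy as np

def linear_scores(x, means, S_inv, priors):
    """Linear discriminant scores (11-51) for one observation."""
    return np.array([m @ S_inv @ x - 0.5 * m @ S_inv @ m + np.log(p)
                     for m, p in zip(means, priors)])

def holdout_aer(samples, priors):
    """Lachenbruch holdout estimate of E(AER), (11-57): hold out each
    observation, refit the group means and pooled covariance, and
    count how often the held-out point is misclassified by (11-52)."""
    miss, total = 0, 0
    g = len(samples)
    for i, Xi in enumerate(samples):
        for j in range(len(Xi)):
            held = Xi[j]
            reduced = [np.delete(X, j, axis=0) if k == i else X
                       for k, X in enumerate(samples)]
            means = [X.mean(axis=0) for X in reduced]
            n = sum(len(X) for X in reduced)
            S_pooled = sum((len(X) - 1) * np.cov(X, rowvar=False)
                           for X in reduced) / (n - g)
            S_inv = np.linalg.inv(S_pooled)
            if linear_scores(held, means, S_inv, priors).argmax() != i:
                miss += 1
            total += 1
    return miss / total

# Illustration on simulated, well-separated normal groups
rng = np.random.default_rng(0)
samples = [rng.multivariate_normal(m, np.eye(2), size=20)
           for m in ([0, 0], [3, 0], [0, 3])]
aer_hat = holdout_aer(samples, priors=np.full(3, 1 / 3))
```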
Example 11.12 (Effective classification with fewer variables)

In his pioneering work on discriminant functions, Fisher [8] presented an analysis of data collected by Anderson [1] on three species of iris flowers. (See Table 11.5, Exercise 11.27.) Let the classes be defined as

    pi_1: Iris setosa;   pi_2: Iris versicolor;   pi_3: Iris virginica

The following four variables were measured from 50 plants of each species:

    x_1 = sepal length,   x_2 = sepal width,   x_3 = petal length,   x_4 = petal width

Using all the data in Table 11.5, a linear discriminant analysis produced the confusion matrix

                              Predicted membership
    Actual             pi_1:      pi_2:          pi_3:        Percent
    membership         Setosa     Versicolor     Virginica    correct
    pi_1: Setosa       50          0              0           100
    pi_2: Versicolor    0         48              2            96
    pi_3: Virginica     0          1             49            98

The elements in this matrix were generated using the holdout procedure, so [see (11-57)]

    Ehat(AER) = 3/150 = .02

The error rate, 2%, is low. Often, it is possible to achieve effective classification with fewer variables. It is good practice to try all the variables one at a time, two at a time, three at a time, and so forth, to see how well they classify compared to the discriminant function, which uses all the variables.

If we adopt the holdout estimate of the expected AER as our criterion, we find for the data on irises:

    Single variable    Misclassification rate        Pairs of variables    Misclassification rate
    x_1                .253                          x_1, x_2              .207
    x_2                .480                          x_1, x_3              .040
    x_3                .053                          x_1, x_4              .040
    x_4                .040                          x_2, x_3              .047
                                                     x_2, x_4              .040
                                                     x_3, x_4              .040

We see that the single variable x_4 = petal width does a very good job of distinguishing the three species of iris. Moreover, very little is gained by including more variables. Box plots of x_4 = petal width are shown in Figure 11.11 for the three groups. It is clear from the figure that petal width separates the three species quite well, with, for example, the petal widths for Iris setosa much smaller than the petal widths for Iris virginica.

Darroch and Mosimann [6] have suggested that these species of iris may be discriminated on the basis of "shape" or scale-free information alone.
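The holdout estimate here is just the off-diagonal total of the confusion matrix divided by the total count; in code:

```python
import numpy as np

# Holdout confusion matrix for the iris data (Example 11.12):
# rows = actual class, columns = predicted class
# (setosa, versicolor, virginica)
confusion = np.array([[50, 0, 0],
                      [0, 48, 2],
                      [0, 1, 49]])
misclassified = confusion.sum() - np.trace(confusion)
aer_hat = misclassified / confusion.sum()    # (11-57): 3/150 = .02
```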
Figure 11.11  Box plots of petal width for the three species of iris.
Let Y_1 = X_1/X_2 be the sepal shape and Y_2 = X_3/X_4 be the petal shape. The use of the variables Y_1 and Y_2 for discrimination is explored in Exercise 11.28.

The selection of appropriate variables to use in discriminant analysis is often difficult. A summary such as the one in this example allows the investigator to make reasonable and simple choices based on the ultimate criteria of how well the procedure classifies its target objects.

Our discussion has tended to emphasize the linear discriminant rule of (11-52) or (11-56), and many commercial computer programs are based upon it. Although the linear discriminant rule has a simple structure, you must remember that it was derived under the rather strong assumptions of multivariate normality and equal covariances. Before implementing a linear classification rule, these tentative assumptions should be checked, in the order multivariate normality and then equality of covariances. If one or both of these assumptions is violated, improved classification may be possible if the data are first suitably transformed.

The quadratic rules are an alternative to classification with linear discriminant functions. They are appropriate if normality appears to hold, but the assumption of equal covariance matrices is seriously violated. However, the assumption of normality seems to be more critical for quadratic rules than linear rules. If doubt exists as to the appropriateness of a linear or quadratic rule, both rules can be constructed and their error rates examined using Lachenbruch's holdout procedure.
11.7  FISHER'S METHOD FOR DISCRIMINATING AMONG SEVERAL POPULATIONS
Fisher also proposed an extension of his discriminant method, discussed in Section 11.5, to several populations. The motivation behind the Fisher discriminant analysis is the need to obtain a reasonable representation of the populations that involves only a few linear combinations of the observations, such as a_1'x, a_2'x, and a_3'x. His approach has several advantages when one is interested in separating several populations for (1) visual inspection or (2) graphical descriptive purposes. It allows for the following:

1. Convenient representations of the g populations that reduce the dimension from a very large number of characteristics to a relatively few linear combinations. Of course, some information, needed for optimal classification, may be lost, unless the population means lie completely in the lower dimensional space selected.
2. Plotting of the means of the first two or three linear combinations (discriminants). This helps display the relationships and possible groupings of the populations.
3. Scatter plots of the sample values of the first two discriminants, which can indicate outliers or other abnormalities in the data.

The primary purpose of Fisher's discriminant analysis is to separate populations. It can, however, also be used to classify, and we shall indicate this use. It is not necessary to assume that the g populations are multivariate normal. However, we do assume that the p x p population covariance matrices are equal and of full rank.11 That is, Sigma_1 = Sigma_2 = ... = Sigma_g = Sigma.

Let mubar denote the mean vector of the combined populations and B_mu the between groups sums of cross products, so that

    B_mu = sum_{i=1}^{g} (mu_i - mubar)(mu_i - mubar)'   where   mubar = (1/g) sum_{i=1}^{g} mu_i     (11-58)

We consider the linear combination

    Y = a'X

which has expected value

    E(Y) = a' E(X | pi_i) = a' mu_i = mu_iY   for population pi_i

and variance

    Var(Y) = a' Cov(X) a = a' Sigma a = sigma_Y^2   for all populations

Consequently, the expected value mu_iY = a' mu_i changes as the population from which X is selected changes. We first define the overall mean

    mubar_Y = (1/g) sum_{i=1}^{g} mu_iY = (1/g) sum_{i=1}^{g} a' mu_i = a' ( (1/g) sum_{i=1}^{g} mu_i ) = a' mubar

and form the ratio

    (sum of squared distances from populations to overall mean of Y) / (variance of Y)

    = sum_{i=1}^{g} (mu_iY - mubar_Y)^2 / sigma_Y^2
    = sum_{i=1}^{g} (a' mu_i - a' mubar)^2 / a' Sigma a
    = a' ( sum_{i=1}^{g} (mu_i - mubar)(mu_i - mubar)' ) a / a' Sigma a

or

    sum_{i=1}^{g} (mu_iY - mubar_Y)^2 / sigma_Y^2 = a' B_mu a / a' Sigma a                      (11-59)

The ratio in (11-59) measures the variability between the groups of Y-values relative to the common variability within groups. We can then select a to maximize this ratio.

Ordinarily, Sigma and the mu_i are unavailable, but we have a training set consisting of correctly classified observations. Suppose the training set consists of a random sample of size n_i from population pi_i, i = 1, 2, ..., g. Denote the n_i x p data set, from population pi_i, by X_i and its jth row by x'_ij. After first constructing the sample mean vectors xbar_i and the covariance matrices S_i, i = 1, 2, ..., g, we define the "overall average" vector

    xbar = sum_{i=1}^{g} n_i xbar_i / sum_{i=1}^{g} n_i = sum_{i=1}^{g} sum_{j=1}^{n_i} x_ij / sum_{i=1}^{g} n_i

11 If not, we let P = [e_1, ..., e_q] be the eigenvectors of Sigma corresponding to its nonzero eigenvalues [lambda_1, ..., lambda_q]. Then we replace X by P'X, which has a full rank covariance matrix P' Sigma P.
which is the p x 1 vector average taken over all of the sample observations in the training set. Next, we define the sample between groups matrix B, which includes the sample sizes. Let

    B = sum_{i=1}^{g} n_i (xbar_i - xbar)(xbar_i - xbar)'                                       (11-60)

Also, an estimate of Sigma is based on the sample within groups matrix

    W = sum_{i=1}^{g} (n_i - 1) S_i = sum_{i=1}^{g} sum_{j=1}^{n_i} (x_ij - xbar_i)(x_ij - xbar_i)'     (11-61)

Consequently, W/(n_1 + n_2 + ... + n_g - g) = S_pooled is the estimate of Sigma. Before presenting the sample discriminants, we note that W is the constant (n_1 + n_2 + ... + n_g - g) times S_pooled, so the same ahat that maximizes ahat' B ahat / ahat' S_pooled ahat also maximizes ahat' B ahat / ahat' W ahat. Moreover, we can present the optimizing ahat in the more customary form as an eigenvector e_i of W^{-1}B, because if W^{-1}B e = lambdahat e, then S_pooled^{-1} B e = lambdahat (n_1 + n_2 + ... + n_g - g) e.

Fisher's Sample Linear Discriminants
    Let lambdahat_1, lambdahat_2, ..., lambdahat_s > 0 denote the s <= min(g - 1, p) nonzero eigenvalues of W^{-1}B and e_1, ..., e_s be the corresponding eigenvectors (scaled so that e' S_pooled e = 1). Then the vector of coefficients ahat that maximizes the ratio ahat' B ahat / ahat' W ahat is given by ahat_1 = e_1. The linear combination ahat_1' x is the sample first discriminant. The choice ahat_2 = e_2 produces the sample second discriminant, ahat_2' x, and continuing, we obtain ahat_k' x, the sample kth discriminant, k <= s.     (11-62)

Exercise 11.21 outlines the derivation of the Fisher discriminants. The discriminants will not have zero covariance for each random sample X_i. Rather, the condition

    ahat_i' S_pooled ahat_k = { 1   if i = k <= s
                              { 0   otherwise                                                   (11-63)
will be satisfied. The use of S_pooled is appropriate here because we tentatively assumed that the g population covariance matrices were equal.

Example 11.13 (Calculating Fisher's sample discriminants for three populations)

Consider the observations on p = 2 variables from g = 3 populations given in Example 11.10. Assuming that the populations have a common covariance matrix Sigma, let us obtain the Fisher discriminants. From Example 11.10, n_1 = n_2 = n_3 = 3 and

    xbar_1 = [-1, 3]',   xbar_2 = [1, 4]',   xbar_3 = [0, -2]'

so

    xbar = (1/3)(xbar_1 + xbar_2 + xbar_3) = [0, 5/3]'

and

    B = sum_{i=1}^{3} n_i (xbar_i - xbar)(xbar_i - xbar)' = [ 6   3 ]
                                                            [ 3  62 ]

Also,

    W = sum_{i=1}^{3} sum_{j=1}^{n_i} (x_ij - xbar_i)(x_ij - xbar_i)' = (n_1 + n_2 + n_3 - 3) S_pooled = [  6  -2 ]
                                                                                                         [ -2  24 ]

so

    W^{-1} = (1/140) [ 24  2 ]
                     [  2  6 ]

and

    W^{-1}B = [ 1.07143  1.4 ]
              [  .21429  2.7 ]

To solve for the s <= min(g - 1, p) = min(2, 2) = 2 nonzero eigenvalues of W^{-1}B, we must solve | W^{-1}B - lambda I | = 0, or

    (1.07143 - lambda)(2.7 - lambda) - (1.4)(.21429) = lambda^2 - 3.77143 lambda + 2.5929 = 0

Using the quadratic formula, we find that lambdahat_1 = 2.8671 and lambdahat_2 = .9044. The normalized eigenvectors ahat_1 and ahat_2 are obtained by solving

    (W^{-1}B - lambdahat_i I) ahat_i = 0,   i = 1, 2

and scaling the results such that ahat_i' S_pooled ahat_i = 1. For example, the solution of

    (W^{-1}B - lambdahat_1 I) ahat_1 = [ 1.07143 - 2.8671        1.4      ] [ a_11 ] = [ 0 ]
                                       [      .21429       2.7 - 2.8671   ] [ a_12 ]   [ 0 ]
is, after the normalization ahat_1' S_pooled ahat_1 = 1,

    ahat_1' = [.386, .495]

Similarly,

    ahat_2' = [.938, -.112]

The two discriminants are

    yhat_1 = ahat_1' x = [.386, .495] [x_1, x_2]' = .386 x_1 + .495 x_2
    yhat_2 = ahat_2' x = [.938, -.112] [x_1, x_2]' = .938 x_1 - .112 x_2

Example 11.14
(Fisher's discriminants for the crude-oil data)

Gerrild and Lantz [12] collected crude-oil samples from sandstone in the Elk Hills, California, petroleum reserve. These crude oils can be assigned to one of the three stratigraphic units (populations)

    pi_1: Wilhelm sandstone
    pi_2: Sub-Mulinia sandstone
    pi_3: Upper sandstone

on the basis of their chemistry. For illustrative purposes, we consider only the five variables

    x_1 = vanadium (in percent ash)
    x_2 = sqrt(iron) (in percent ash)
    x_3 = sqrt(beryllium) (in percent ash)
    x_4 = 1/[aromatic hydrocarbons (in percent area)]
    x_5 = saturated hydrocarbons (in percent area)

The first three variables are trace elements, and the last two are determined from a segment of the curve produced by a gas chromatograph chemical analysis. Table 11.7 (see Exercise 11.30) gives the values of the five original variables (vanadium, iron, beryllium, saturated hydrocarbons, and aromatic hydrocarbons) for the 56 cases whose population assignment was certain.

A computer calculation yields the summary statistics

    xbar_1 = [3.229, 6.587, .303, .150, 11.540]'
    xbar_2 = [4.445, 5.667, .344, .157,  5.484]'
    xbar_3 = [7.226, 4.634, .598, .223,  5.768]'
    xbar   = [6.180, 5.081, .511, .201,  6.434]'

and

    (n_1 + n_2 + n_3 - 3) S_pooled = (38 + 11 + 7 - 3) S_pooled = W

      = [ 187.575
            1.957   41.789
           -4.031    2.128    3.580                 (lower triangle; W is symmetric)
            1.092    -.143    -.284   .077
           79.672  -28.243    2.559  -.996  338.023 ]

There are at most s = min(g - 1, p) = min(2, 5) = 2 positive eigenvalues of W^{-1}B, and they are 4.354 and .559. The centered Fisher linear discriminants are

    yhat_1 = .312(x_1 - 6.180) - .710(x_2 - 5.081) + 2.764(x_3 - .511) + 11.809(x_4 - .201) - .235(x_5 - 6.434)
    yhat_2 = .169(x_1 - 6.180) - .245(x_2 - 5.081) - 2.046(x_3 - .511) - 24.453(x_4 - .201) - .378(x_5 - 6.434)

The separation of the three group means is fully explained in the two-dimensional "discriminant space." The group means and the scatter of the individual observations in the discriminant coordinate system are shown in Figure 11.12. The separation is quite good.
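The eigenanalysis behind Examples 11.13 and 11.14 is a few lines of NumPy. The sketch below (variable names are ours) reproduces the Example 11.13 quantities: B, W, the eigenvalues of W^{-1}B, and the discriminant coefficient vectors scaled so that a' S_pooled a = 1.

```python
import numpy as np

# Summaries from Examples 11.10 and 11.13 (g = 3 bivariate groups)
means = [np.array([-1.0, 3.0]), np.array([1.0, 4.0]), np.array([0.0, -2.0])]
n = [3, 3, 3]
S_pooled = np.array([[1.0, -1/3], [-1/3, 4.0]])
g = len(means)

xbar = sum(ni * m for ni, m in zip(n, means)) / sum(n)
B = sum(ni * np.outer(m - xbar, m - xbar) for ni, m in zip(n, means))
W = (sum(n) - g) * S_pooled                    # (11-61)

eigvals, eigvecs = np.linalg.eig(np.linalg.inv(W) @ B)
eigvals, eigvecs = eigvals.real, eigvecs.real
order = np.argsort(eigvals)[::-1]              # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scale each eigenvector a so that a' S_pooled a = 1, as in (11-63)
A = np.array([v / np.sqrt(v @ S_pooled @ v) for v in eigvecs.T])
# eigvals approx [2.8671, .9044]; rows of A approx [.386, .495] and
# [.938, -.112], up to an overall sign
```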
Figure 11.12  Crude-oil samples in discriminant space (yhat_1 versus yhat_2). Plotting symbols distinguish the Wilhelm, Sub-Mulinia, and Upper sandstone groups, and the mean coordinates of each group are marked.
Chapter 11 Discrimination and Classification
Example 11.15 (Plotting sports data in two-dimensional discriminant space)
Investigators interested in sports psychology administered the Minnesota Multiphasic Personality Inventory (MMPI) to 670 letter winners at the University of Wisconsin in Madison. The sports involved and the coefficients in the two discriminant functions are given in Table 11.3.

A plot of the group means using the first two discriminant scores is shown in Figure 11.13. Here the separation on the basis of the MMPI scores is not good, although a test for the equality of means is significant at the 5% level. (This is due to the large sample sizes.)

While the discriminant coefficients suggest that the first discriminant is most closely related to the L and Pa scales, and the second discriminant is most closely associated with the Pt scale, we will give the interpretation provided by the investigators.

The first discriminant, which accounted for 34.4% of the common variance, was highly correlated with the Mf scale (r = .78). The second discriminant, which accounted for an additional 18.3% of the variance, was most highly related to scores on the Sc, F, and D scales (r's = .66, .54, and .50, respectively). The investigators suggest that the first discriminant best represents an interest dimension; the second discriminant reflects psychological adjustment.

Ideally, the standardized discriminant function coefficients should be examined to assess the importance of a variable in the presence of other variables. (See [25].) Correlation coefficients indicate only how each variable by itself distinguishes the groups, ignoring the contributions of the other variables. Unfortunately, in this case, the standardized discriminant coefficients were unavailable.

TABLE 11.3 MMPI scale coefficients for the first and second discriminants, by sport. Sports and sample sizes (total n = 670): Football 158, Basketball 42, Baseball 79, Crew 61, Fencing 50, Golf 28, Gymnastics 26, Hockey 28, Swimming 51, Tennis 31, Track 52, Wrestling 64. MMPI scales: ?, L, F, K, Hs, D, Hy, Pd, Mf, Pa, Pt, Sc, Ma, Si, each with a first- and a second-discriminant coefficient. Source: W. Morgan and R. W. Johnson.
Figure 11.13 The discriminant means ȳ' = [ȳ1, ȳ2] for each sport. (Group means plotted in the plane of the first and second discriminants; the sports shown are Swimming, Fencing, Wrestling, Tennis, Hockey, Gymnastics, Crew, Baseball, Golf, Track, Football, and Basketball.)
In general, plots should also be made of other pairs of the first few discriminants. In addition, scatter plots of the discriminant scores for pairs of discriminants can be made for each sport. Under the assumption of multivariate normality, the unit ellipse (circle) centered at the discriminant mean vector ȳ should contain approximately a proportion

P[(Y - μY)'(Y - μY) ≤ 1] = P[χ²₂ ≤ 1] = .39

of the points when two discriminants are plotted. ■

Using Fisher's Discriminants to Classify Objects

Fisher's discriminants were derived for the purpose of obtaining a low-dimensional representation of the data that separates the populations as much as possible. Although they were derived from considerations of separation, the discriminants also provide the basis for a classification rule. We first explain the connection in terms of the population discriminants ak'X. Setting

Yk = ak'X = kth discriminant,   k ≤ s        (11-64)
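The proportion .39 quoted above needs no tables: a chi-square variable with 2 degrees of freedom has the closed-form distribution function F(x) = 1 - e^(-x/2), so the unit-circle probability is 1 - e^(-1/2). A one-line check in Python:

```python
import math

def chi2_cdf_2df(x):
    # For 2 degrees of freedom the chi-square CDF is F(x) = 1 - exp(-x/2)
    return 1.0 - math.exp(-x / 2.0)

print(round(chi2_cdf_2df(1.0), 2))   # 0.39
```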
we conclude that Y = [Y1, Y2, ..., Ys]' has mean vector μiY = [μiY1, ..., μiYs]' under population πi and covariance matrix I for all populations. (See Exercise 11.21.)

Because the components of Y have unit variances and zero covariances, the appropriate measure of squared distance from Y = y to μiY is

(y - μiY)'(y - μiY) = Σ_{j=1}^{s} (yj - μiYj)²

A reasonable classification rule is one that assigns y to population πk if the square of the distance from y to μkY is smaller than the square of the distance from y to μiY for i ≠ k.

If only r of the discriminants are used for allocation, the rule is

Allocate x to πk if

    Σ_{j=1}^{r} (yj - μkYj)² = Σ_{j=1}^{r} [aj'(x - μk)]²
                            ≤ Σ_{j=1}^{r} [aj'(x - μi)]²   for all i ≠ k        (11-65)

Before relating this classification procedure to those of Section 11.6, we look more closely at the restriction on the number of discriminants. From Exercise 11.21, s = number of nonzero eigenvalues of Σ⁻¹Bμ (or of Σ^(-1/2) Bμ Σ^(-1/2)). Now, Σ⁻¹Bμ is p × p, so s ≤ p. Further, the g vectors

μ1 - μ̄, μ2 - μ̄, ..., μg - μ̄        (11-66)

satisfy (μ1 - μ̄) + (μ2 - μ̄) + ... + (μg - μ̄) = gμ̄ - gμ̄ = 0. That is, the first difference μ1 - μ̄ can be written as a linear combination of the last g - 1 differences. Linear combinations of the g vectors in (11-66) determine a hyperplane of dimension q ≤ g - 1. Taking any vector e perpendicular to every μi - μ̄, and hence the hyperplane, gives

Bμ e = Σ_{i=1}^{g} (μi - μ̄)(μi - μ̄)'e = Σ_{i=1}^{g} (μi - μ̄)0 = 0

so Σ⁻¹Bμ e = 0e. There are p - q orthogonal eigenvectors corresponding to the zero eigenvalue and, hence, q or fewer nonzero eigenvalues. Since it is always true that q ≤ g - 1, the number of nonzero eigenvalues s must satisfy s ≤ min(p, g - 1).
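The bound s ≤ min(p, g - 1) is easy to see numerically: because the g centered mean vectors sum to zero, Bμ has rank at most g - 1. A small NumPy sketch (the mean vectors are arbitrary made-up values) with g = 3 populations in p = 5 dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
g, p = 3, 5
mus = rng.normal(size=(g, p))      # hypothetical population mean vectors
mu_bar = mus.mean(axis=0)

# B_mu = sum over i of (mu_i - mu_bar)(mu_i - mu_bar)'
B = sum(np.outer(m - mu_bar, m - mu_bar) for m in mus)

# The g centered means sum to the zero vector, so rank(B) <= g - 1 = 2,
# even though B is a 5 x 5 matrix.
print(np.linalg.matrix_rank(B))
```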
Thus, there is no loss of information for discrimination by plotting in two dimensions if the following conditions hold:

Number of variables    Number of populations    Maximum number of discriminants
Any p                  g = 2                    1
Any p                  g = 3                    2
p = 2                  Any g                    2

We now present an important relation between the classification rule (11-65) and the "normal theory" discriminant scores [see (11-49)],

di(x) = μi'Σ⁻¹x - (1/2)μi'Σ⁻¹μi + ln pi

or, equivalently,

di(x) - (1/2)x'Σ⁻¹x = -(1/2)(x - μi)'Σ⁻¹(x - μi) + ln pi

obtained by adding the same constant -(1/2)x'Σ⁻¹x to each di(x).

Result 11.6. Let yj = aj'x, where aj = Σ^(-1/2) ej and ej is an eigenvector of Σ^(-1/2) Bμ Σ^(-1/2). Then

Σ_{j=1}^{p} (yj - μiYj)² = Σ_{j=1}^{p} [aj'(x - μi)]² = (x - μi)'Σ⁻¹(x - μi)
                         = -2di(x) + x'Σ⁻¹x + 2 ln pi

If λ1 ≥ ... ≥ λs > 0 = λs+1 = ... = λp, Σ_{j=s+1}^{p} (yj - μiYj)² is constant for all populations i = 1, 2, ..., g, so only the first s discriminants yj, or Σ_{j=1}^{s} (yj - μiYj)², contribute to the classification.

Also, if the prior probabilities are such that p1 = p2 = ... = pg = 1/g, the rule (11-65) with r = s is equivalent to the population version of the minimum TPM rule (11-52).

Proof. The squared distance

(x - μi)'Σ⁻¹(x - μi) = (x - μi)'Σ^(-1/2) Σ^(-1/2)(x - μi) = (x - μi)'Σ^(-1/2) E E' Σ^(-1/2)(x - μi)

where E = [e1, e2, ..., ep] is the orthogonal matrix whose columns are eigenvectors of Σ^(-1/2) Bμ Σ^(-1/2). (See Exercise 11.21.) Since Σ^(-1/2) ei = ai, or ei'Σ^(-1/2) = ai',

E'Σ^(-1/2)(x - μi) = [a1'(x - μi), a2'(x - μi), ..., ap'(x - μi)]'
and

(x - μi)'Σ^(-1/2) E E' Σ^(-1/2)(x - μi) = Σ_{j=1}^{p} [aj'(x - μi)]²

Next, each aj = Σ^(-1/2) ej, j > s, is an (unscaled) eigenvector of Σ⁻¹Bμ with eigenvalue zero. As shown in the discussion following (11-66), aj is perpendicular to every μi - μ̄ and hence to (μk - μ̄) - (μi - μ̄) = μk - μi for i, k = 1, 2, ..., g. The condition 0 = aj'(μk - μi) = μkYj - μiYj implies that yj - μkYj = yj - μiYj, so Σ_{j=s+1}^{p} (yj - μiYj)² is constant for all i = 1, 2, ..., g. Therefore, only the first s discriminants yj need to be used for classification. ■

We now state the classification rule based on the first r ≤ s sample discriminants.
Allocate x to πk if

    Σ_{j=1}^{r} (ŷj - ȳkj)² = Σ_{j=1}^{r} [âj'(x - x̄k)]² ≤ Σ_{j=1}^{r} [âj'(x - x̄i)]²   for all i ≠ k        (11-67)

where ȳkj = âj'x̄k and r ≤ s.

When the prior probabilities are such that p1 = p2 = ... = pg = 1/g and r = s, rule (11-67) is equivalent to rule (11-52), which is based on the largest linear discriminant score. In addition, if r < s discriminants are used for classification, there is a loss of squared distance, or score, of Σ_{j=r+1}^{p} [âj'(x - x̄i)]² for each population πi, where Σ_{j=r+1}^{s} [âj'(x - x̄i)]² is the part useful for classification.

Example 11.16 (Classifying a new observation with Fisher's discriminants)

Let us use the Fisher discriminants

ŷ1 = â1'x = .386x1 + .495x2
ŷ2 = â2'x = .938x1 - .112x2

from Example 11.13 to classify the new observation x0' = [1  3] in accordance with (11-67).
Inserting x0' = [x01, x02] = [1, 3], we have

ŷ1 = .386x01 + .495x02 = .386(1) + .495(3) = 1.87
ŷ2 = .938x01 - .112x02 = .938(1) - .112(3) = .60

Moreover, ȳkj = âj'x̄k, so that (see Example 11.13)

ȳ11 = â1'x̄1 = [.386  .495] [-1  3]' = 1.10
ȳ12 = â2'x̄1 = [.938  -.112] [-1  3]' = -1.27

Similarly,

ȳ21 = â1'x̄2 = 2.37
ȳ22 = â2'x̄2 = .49
ȳ31 = â1'x̄3 = -.99
ȳ32 = â2'x̄3 = .22

Finally, the smallest value of

Σ_{j=1}^{2} (ŷj - ȳkj)² = Σ_{j=1}^{2} [âj'(x - x̄k)]²

for k = 1, 2, 3, must be identified. Using the preceding numbers gives

Σ_{j=1}^{2} (ŷj - ȳ1j)² = (1.87 - 1.10)² + (.60 + 1.27)² = 4.09
Σ_{j=1}^{2} (ŷj - ȳ2j)² = (1.87 - 2.37)² + (.60 - .49)² = .26
Σ_{j=1}^{2} (ŷj - ȳ3j)² = (1.87 + .99)² + (.60 - .22)² = 8.32

Since the minimum of Σ_{j=1}^{2} (ŷj - ȳkj)² occurs when k = 2, we allocate x0 to population π2. The situation, in terms of the classifiers ŷj, is illustrated schematically in Figure 11.14. ■

Comment. When two linear discriminant functions are used for classification, observations are assigned to populations based on Euclidean distances in the two-dimensional discriminant space.

Up to this point, we have not shown why the first few discriminants are more important than the last few. Their relative importance becomes apparent from their contribution to a numerical measure of spread of the populations. Consider the separatory measure

Δ² = Σ_{i=1}^{g} (μi - μ̄)'Σ⁻¹(μi - μ̄)        (11-68)
Figure 11.14 The points ŷ' = [ŷ1, ŷ2], ȳ1' = [ȳ11, ȳ12], ȳ2' = [ȳ21, ȳ22], and ȳ3' = [ȳ31, ȳ32] in the classification plane. (The smallest distance is from ŷ to ȳ2.)
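The arithmetic in Example 11.16 is easy to script. The sketch below (plain Python; the names are ours) applies rule (11-67) with r = s = 2. Because it keeps full precision rather than the rounded scores 1.87 and .60 used in the text, the distances differ from 4.09, .26, and 8.32 in the second decimal, but the allocation to π2 is the same:

```python
# Sample discriminant coefficients and group means from Examples 11.13/11.16
a = [[.386, .495], [.938, -.112]]      # a_1-hat, a_2-hat
xbars = [[-1, 3], [1, 4], [0, -2]]     # x-bar_1, x-bar_2, x-bar_3
x0 = [1, 3]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sq_dist(x, xbar):
    # Rule (11-67): sum_j [a_j'(x - xbar_k)]^2 over the two discriminants
    return sum(dot(aj, [xi - mi for xi, mi in zip(x, xbar)]) ** 2 for aj in a)

dists = [sq_dist(x0, xb) for xb in xbars]
k = min(range(3), key=lambda i: dists[i])
print([round(d, 1) for d in dists], "-> allocate to population", k + 1)
# [4.1, 0.3, 8.3] -> allocate to population 2
```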
In (11-68), μ̄ = (1/g) Σ_{i=1}^{g} μi is the centroid, and (μi - μ̄)'Σ⁻¹(μi - μ̄) is the squared statistical distance from the ith population mean μi to the centroid μ̄. It can be shown (see Exercise 11.22) that Δ² = λ1 + λ2 + ... + λp, where the λ1 ≥ λ2 ≥ ... ≥ λs are the nonzero eigenvalues of Σ⁻¹Bμ (or of Σ^(-1/2) Bμ Σ^(-1/2)) and λs+1, ..., λp are the zero eigenvalues.

The separation can be reproduced in terms of discriminant means. The first discriminant, Y1 = e1'Σ^(-1/2)X, has means μiY1 = e1'Σ^(-1/2)μi, and the squared distance Σ_{i=1}^{g} (μiY1 - μ̄Y1)² of the μiY1's from the central value μ̄Y1 = e1'Σ^(-1/2)μ̄ is λ1. (See Exercise 11.22.) Since Δ² can also be written as

Δ² = λ1 + λ2 + ... + λp = Σ_{i=1}^{g} (μiY - μ̄Y)'(μiY - μ̄Y)
   = Σ_{i=1}^{g} (μiY1 - μ̄Y1)² + Σ_{i=1}^{g} (μiY2 - μ̄Y2)² + ... + Σ_{i=1}^{g} (μiYp - μ̄Yp)²

it follows that the first discriminant makes the largest single contribution, λ1, to the separatory measure Δ². In general, the rth discriminant, Yr = er'Σ^(-1/2)X, contributes λr to Δ². If the next s - r eigenvalues (recall that λs+1 = ... = λp = 0) are such that λr+1 + λr+2 + ... + λs is small compared to λ1 + λ2 + ... + λr, then the last discriminants Yr+1, Yr+2, ..., Ys can be neglected without appreciably decreasing the amount of separation.¹²

Not much is known about the efficacy of the allocation rule (11-67). Some insight is provided by computer-generated sampling experiments, and Lachenbruch [21] summarizes its performance in particular cases. The development of the population result in (11-65) required a common covariance matrix Σ. If this is essentially true and the sample sizes are reasonably large, rule (11-67) should perform fairly well. In any event, its performance can be checked by computing estimated error rates. Specifically, Lachenbruch's estimate of the expected actual error rate given by (11-57) should be calculated.

¹² See [17] for further optimal dimension-reducing properties.
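For example, with the two eigenvalues λ̂1 = 4.354 and λ̂2 = .559 reported for the crude-oil data in Example 11.14, the first sample discriminant alone carries almost all of the (sample analog of the) separation:

```python
lambdas = [4.354, 0.559]          # eigenvalues from Example 11.14
total = sum(lambdas)
shares = [round(lam / total, 3) for lam in lambdas]
print(shares)   # [0.886, 0.114]: the first discriminant carries ~89%
```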
11.8 FINAL COMMENTS

Including Qualitative Variables
Our discussion in this chapter assumes that the discriminatory or classificatory variables, X1, X2, ..., Xp, have natural units of measurement. That is, each variable can, in principle, assume any real number, and these numbers can be recorded. Often, a qualitative or categorical variable may be a useful discriminator (classifier). For example, the presence or absence of a characteristic such as the color red may be a worthwhile classifier. This situation is frequently handled by creating a variable X whose numerical value is 1 if the object possesses the characteristic and zero if the object does not possess the characteristic. The variable is then treated like the measured variables in the usual discrimination and classification procedures.

There is very little theory available to handle the case in which some variables are continuous and some qualitative. Computer simulation experiments (see [20]) indicate that Fisher's linear discriminant function can perform poorly or satisfactorily, depending upon the correlations between the qualitative and continuous variables. As Krzanowski [20] notes, "A low correlation in one population but a high correlation in the other, or a change in the sign of the correlations between the two populations could indicate conditions unfavorable to Fisher's linear discriminant function." This is a troublesome area and one that needs further study.

When a number of variables are of the 0-1 type, it may be better to consider an alternative approach, called the logistic regression approach to classification. (See [15].) The probability of membership in the first group, p1(x), is modeled directly as

p1(x) = e^(α + β'x) / (1 + e^(α + β'x))

for the two-population problem. The model needs to be adjusted to accommodate a prior distribution, and it may not be easy to include costs. If the populations are nearly normal with equal covariance matrices, the linear classification approach is best.

Classification Trees

An approach to classification completely different from the methods discussed in the previous sections of this chapter has been developed. (See [5].) It is very computer intensive, and its implementation is only now becoming widespread. The new approach, called classification and regression trees (CART), is closely related to divisive clustering techniques. (See Chapter 12.)

Initially, all objects are considered as a single group. The group is split into two subgroups using, say, high values of a variable for one group and low values for the other. The two subgroups are then each split using the values of a second variable. The splitting process continues until a suitable stopping point is reached. The values of the splitting variables can be ordered or unordered categories. It is this feature that makes the CART procedure so general.
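Returning to the logistic rule above: once α and β have been estimated, classification is just a matter of evaluating p1(x). A minimal sketch (the coefficient values below are hypothetical, purely for illustration):

```python
import math

def p1(x, alpha, beta):
    # p1(x) = exp(alpha + beta'x) / (1 + exp(alpha + beta'x))
    z = alpha + sum(b * xi for b, xi in zip(beta, x))
    return math.exp(z) / (1.0 + math.exp(z))

# Hypothetical fitted coefficients for two measured variables
alpha, beta = -1.0, [0.8, 0.5]

x = [2.0, 1.0]
prob = p1(x, alpha, beta)
# With equal priors and costs, classify to group 1 when p1(x) > 1/2
print(round(prob, 3), "-> population", 1 if prob > 0.5 else 2)
```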
Figure 11.15 A classification tree. (Successive yes/no splits on "Over 45?", "Overweight?", and "Exercise regularly?" lead to terminal groups assigned to π1: heart-attack prone or π2: not heart-attack prone.)
For example, suppose subjects are to be classified as

π1: heart-attack prone
π2: not heart-attack prone

on the basis of age, weight, and exercise activity. In this case, the CART procedure can be diagrammed as the tree shown in Figure 11.15. The branches of the tree actually correspond to divisions in the sample space. The region R1, defined as being over 45, being overweight, and undertaking no regular exercise, could be used to classify a subject as π1: heart-attack prone. The CART procedure would try splitting on age, or on weight, or on the amount of exercise.

The classification tree that results from using the CART methodology with the Iris data (see Table 11.5), and the variables X3 = petal length (PetLength) and X4 = petal width (PetWidth), is shown in Figure 11.16. The binary splitting rules are indicated in the figure. For example, the first split occurs at petal length = 2.45. Flowers with petal lengths ≤ 2.45 form one group (left), and those with petal lengths > 2.45 form the other group (right).

The second split occurs with the right-hand-side group (petal length > 2.45) at petal width = 1.75. Flowers with petal widths ≤ 1.75 are put in one group (left), and those with petal widths > 1.75 in the other group (right). The process continues until there is no gain with additional splitting. In this case, the process stops with four terminal nodes (TN).

Figure 11.16 A classification tree for the Iris data. (Node 1, N = 150, splits on PetLength ≤ 2.45, giving Terminal Node 1 with N = 50; Node 2, N = 100, splits on PetWidth ≤ 1.75, leading to Terminal Node 2 with N = 48, Terminal Node 3 with N = 6, and Terminal Node 4 with N = 46.)

The binary splits form terminal node rectangles (regions) in the positive quadrant of the X3, X4 sample space as shown in Figure 11.17. For example, TN #2 contains those flowers with 2.45 < petal lengths < 4.95 and petal widths < 1.75, essentially the Iris Versicolor group.

Figure 11.17 Classification tree terminal nodes (regions) in the petal width, petal length sample space. (The Setosa, Versicolor, and Virginica flowers are plotted with different symbols in regions TN #1 through TN #4.)

Since the majority of the flowers in, for example, TN #3 are species Virginica, a new item in this group would be classified as Virginica. That is, TN #3 and TN #4 are both assigned to the Virginica population. We see that CART has correctly classified 50 of 50 of the Setosa flowers, 47 of 50 of the Versicolor flowers, and 49 of 50 of the Virginica flowers. The APER = 4/150 = .027. This result is comparable to the result obtained for the linear discriminant analysis using variables X3 and X4 discussed in Example 11.12.

The CART methodology is not tied to an underlying population probability distribution of characteristics. Nor is it tied to a particular optimality criterion. In practice, the procedure requires hundreds of objects and, often, many variables. The resulting tree is very complicated. Subjective judgments must be used to prune the tree so that it ends with groups of several objects rather than all single objects. Each terminal group is then assigned to the population holding the majority membership. A new object can then be classified according to its ultimate group.
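The fitted Iris tree of Figure 11.16 amounts to three threshold comparisons, so the resulting classifier can be written out directly. A sketch using the split points 2.45, 4.95, and 1.75 quoted above (measurements in the same units as the data):

```python
def classify_iris(pet_length, pet_width):
    # Classification tree from Figures 11.16 and 11.17
    if pet_length <= 2.45:
        return "Setosa"        # Terminal Node 1
    if pet_width > 1.75:
        return "Virginica"     # Terminal Node 4
    if pet_length <= 4.95:
        return "Versicolor"    # Terminal Node 2
    return "Virginica"         # Terminal Node 3 (majority Virginica)

print(classify_iris(1.4, 0.2))   # Setosa
print(classify_iris(4.5, 1.4))   # Versicolor
print(classify_iris(6.0, 2.2))   # Virginica
```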
Breiman, Friedman, Olshen, and Stone [5] have developed special-purpose software for implementing a CART analysis. Their program uses several intelligent rules for splitting and usually produces a tree that often separates groups well. CART has been very successful in data mining applications (see Supplement 12A).

Neural Networks
A neural network (NN) is a computer-intensive, algorithmic procedure for transforming inputs into desired outputs using highly connected networks of relatively simple processing units (neurons or nodes). Neural networks are modeled after the neural activity in the human brain. The three essential features, then, of an NN are the basic computing units (neurons or nodes), the network architecture describing the connections between the computing units, and the training algorithm used to find values of the network parameters (weights) for performing a particular task.

The computing units are connected to one another in the sense that the output from one unit can serve as part of the input to another unit. Each computing unit transforms an input to an output using some prespecified function that is typically monotone, but otherwise arbitrary. This function depends on constants (parameters) whose values must be determined with a training set of inputs and outputs.

Network architecture is the organization of computing units and the types of connections permitted. In statistical applications, the computing units are arranged in a series of layers with connections between nodes in different layers, but not between nodes in the same layer. The layer receiving the initial inputs is called the input layer. The final layer is called the output layer. Any layers between the input and output layers are called hidden layers. A simple schematic representation of a multilayer NN is shown in Figure 11.18.
Figure 11.18 A neural network with one hidden layer. (Inputs feed a middle, hidden, layer of nodes, which in turn feeds the output layer.)
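The forward pass through a network like the one in Figure 11.18 is just "weighted sum, then squash," repeated layer by layer. A toy sketch with one hidden layer (all weights and biases are made-up numbers; in practice a training algorithm chooses them):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(x, weights, biases):
    # each output unit computes squash(w'x + b)
    return [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
            for ws, b in zip(weights, biases)]

# Made-up parameters: 2 inputs -> 3 hidden units -> 1 output
W1 = [[0.5, -0.3], [0.8, 0.1], [-0.4, 0.9]]
b1 = [0.0, -0.2, 0.1]
W2 = [[1.2, -0.7, 0.5]]
b2 = [-0.3]

x = [1.0, 2.0]                  # measured characteristics of one item
hidden = layer(x, W1, b1)
output = layer(hidden, W2, b2)  # interpreted as P(group 1 | x)
print(round(output[0], 3))
```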
Neural networks can be used for discrimination and classification. When they are so used, the input variables are the measured group characteristics X1, X2, ..., Xp, and the output variables are categorical variables indicating group membership. Current practical experience indicates that properly constructed neural networks perform about as well as logistic regression and the discriminant functions we have discussed in this chapter. Reference [26] contains a good discussion of the use of neural networks in applied statistics.
Selection of Variables
In some applications of discriminant analysis, data are available on a large number of variables. Mucciardi and Gose [23] discuss a discriminant analysis based on 157 variables.¹³ In this case, it would obviously be desirable to select a relatively small subset of variables that would contain almost as much information as the original collection. This is the objective of stepwise discriminant analysis, and several popular commercial computer programs have such a capability.

If a stepwise discriminant analysis (or any variable selection method) is employed, the results should be interpreted with caution. (See [24].) There is no guarantee that the subset selected is "best," regardless of the criterion used to make the selection. For example, subsets selected on the basis of minimizing the apparent error rate or maximizing "discriminatory power" may perform poorly in future samples. Problems associated with variable-selection procedures are magnified if there are large correlations among the variables or between linear combinations of the variables.

Choosing a subset of variables that seems to be optimal for a given data set is especially disturbing if classification is the objective. At the very least, the derived classification function should be evaluated with a validation sample. As Murray [24] suggests, a better idea might be to split the sample into a number of batches and determine the "best" subset for each batch. The number of times a given variable appears in the best subsets provides a measure of the worth of that variable for future classification.

Testing for Group Differences
We have pointed out, in connection with two-group classification, that effective allocation is probably not possible unless the populations are well separated. The same is true for the many-group situation. Classification is ordinarily not attempted unless the population mean vectors differ significantly from one another. Assuming that the data are nearly multivariate normal, with a common covariance matrix, MANOVA can be performed to test for differences in the population mean vectors. Although apparent significant differences do not automatically imply effective classification, testing is a necessary first step. If no significant differences are found, constructing classification rules will probably be a waste of time.

¹³ Imagine the problems of verifying the assumption of 157-variate normality and simultaneously estimating, for example, the 12,403 parameters of the 157 × 157 presumed common covariance matrix!
Graphics
Sophisticated computer graphics now allow one visually to examine multivariate data in two and three dimensions. Thus, groupings in the variable space for any choice of two or three variables can often be discerned by eye. In this way, potentially important classifying variables are often identified and outlying, or "atypical," observations revealed. Visual displays are important aids in discrimination and classification, and their use is likely to increase as the hardware and associated computer programs become readily available. Frequently, as much can be learned from a visual examination as by a complex numerical analysis.
The interplay between the choice of tentative assumptions and the form of the resulting classifier is important. Consider Figure 11.19, which shows the kidney-shaped density contours from two very nonnormal densities. In this case, the normal theory linear (or even quadratic) classification rule will be inadequate compared to another choice. That is, linear discrimination here is inappropriate.

Often discrimination is attempted with a large number of variables, some of which are of the presence-absence, or 0-1, type. In these situations and in others with restricted ranges for the variables, multivariate normality may not be a sensible assumption. As we have seen, classification based on Fisher's linear discriminants can be optimal from a minimum ECM or minimum TPM point of view only when multivariate normality holds. How are we to interpret these quantities when normality is clearly not viable?

In the absence of multivariate normality, Fisher's linear discriminants can be viewed as providing an approximation to the total sample information. The values of the first few discriminants themselves can be checked for normality and rule (11-67) employed. Since the discriminants are linear combinations of a large number of variables, they will often be nearly normal. Of course, one must keep in mind that the first few discriminants are an incomplete summary of the original sample information. Classification rules based on this restricted set may perform poorly, while optimal rules derived from all of the sample information may perform well.
Figure 11.19 Two nonnormal populations for which linear discrimination is inappropriate. (Kidney-shaped contours of f1(x) and f2(x) are shown, together with a "linear classification" boundary and a "good classification" boundary.)
EXERCISES
11.1. Consider the two data sets

X1 = [3  7        X2 = [6  9
      2  4    and       5  7
      4  7]             4  8]

for which x̄1 = [3, 6]', x̄2 = [5, 8]', and

Spooled = [1  1
           1  2]
(a) Calculate the linear discriminant function in (11-19).
(b) Classify the observation x0' = [2  7] as population π1 or population π2, using Rule (11-18) with equal priors and equal costs.

11.2. (a) Develop a linear classification function for the data in Example 11.1 using (11-19).
(b) Using the function in (a) and (11-20), construct the "confusion matrix" by classifying the given observations. Compare your classification results with those of Figure 11.1, where the classification regions were determined "by eye." (See Example 11.5.)
(c) Given the results in (b), calculate the apparent error rate (APER).
(d) State any assumptions you make to justify the use of the method in Parts a and b.

11.3. Prove Result 11.1.
Hint: Substituting the integral expressions for P(2|1) and P(1|2) given by (11-1) and (11-2), respectively, into (11-5) yields

ECM = c(2|1)p1 ∫_{R2} f1(x) dx + c(1|2)p2 ∫_{R1} f2(x) dx

Noting that Ω = R1 ∪ R2, so that the total probability

1 = ∫_{Ω} f1(x) dx = ∫_{R1} f1(x) dx + ∫_{R2} f1(x) dx

we can write

ECM = c(2|1)p1 [1 - ∫_{R1} f1(x) dx] + c(1|2)p2 ∫_{R1} f2(x) dx

By the additive property of integrals (volumes),

ECM = ∫_{R1} [c(1|2)p2 f2(x) - c(2|1)p1 f1(x)] dx + c(2|1)p1

Now, p1, p2, c(1|2), and c(2|1) are nonnegative. In addition, f1(x) and f2(x) are nonnegative for all x and are the only quantities in ECM that depend on x.
Thus, ECM is minimized if R1 includes those values x for which the integrand

[c(1|2)p2 f2(x) - c(2|1)p1 f1(x)] < 0

and excludes those x for which this quantity is positive.

11.4. A researcher wants to determine a procedure for discriminating between two multivariate populations. The researcher has enough data available to estimate the density functions f1(x) and f2(x) associated with populations π1 and π2, respectively. Let c(2|1) = 50 (this is the cost of assigning items as π2, given that π1 is true) and c(1|2) = 100. In addition, it is known that about 20% of all possible items (for which the measurements x can be recorded) belong to π2.
(a) Give the minimum ECM rule (in general form) for assigning a new item to one of the two populations.
(b) Measurements recorded on a new item yield the density values f1(x) = .3 and f2(x) = .5. Given the preceding information, assign this item to population π1 or population π2.

11.5. Show that

-(1/2)(x - μ1)'Σ⁻¹(x - μ1) + (1/2)(x - μ2)'Σ⁻¹(x - μ2)
    = (μ1 - μ2)'Σ⁻¹x - (1/2)(μ1 - μ2)'Σ⁻¹(μ1 + μ2)

[see Equation (11-13)].

11.6. Consider the linear function Y = a'X. Let E(X) = μ1 and Cov(X) = Σ if X belongs to population π1. Let E(X) = μ2 and Cov(X) = Σ if X belongs to population π2. Let m = (1/2)(μ1Y + μ2Y) = (1/2)(a'μ1 + a'μ2). Given that a' = (μ1 - μ2)'Σ⁻¹, show each of the following:
(a) E(a'X | π1) - m = a'μ1 - m > 0
(b) E(a'X | π2) - m = a'μ2 - m < 0
Hint: Recall that Σ is of full rank and is positive definite, so Σ⁻¹ exists and is positive definite.

11.7. Let f1(x) = (1 - |x|) for |x| ≤ 1 and f2(x) = (1 - |x - .5|) for -.5 ≤ x ≤ 1.5.
(a) Sketch the two densities.
(b) Identify the classification regions when p1 = p2 and c(1|2) = c(2|1).
(c) Identify the classification regions when p1 = .2 and c(1|2) = c(2|1).

11.8. Refer to Exercise 11.7. Let f1(x) be the same as in that exercise, but take f2(x) = (1/4)(2 - |x - .5|) for -1.5 ≤ x ≤ 2.5.
(a) Sketch the two densities.
(b) Determine the classification regions when p1 = p2 and c(1|2) = c(2|1).

11.9. For g = 2 groups, show that the ratio in (11-59) is proportional to the ratio

(Squared distance between means of Y) / (Variance of Y)
    = (μ1Y - μ2Y)² / σY² = (a'μ1 - a'μ2)² / a'Σa
    = a'(μ1 - μ2)(μ1 - μ2)'a / a'Σa = (a'δ)² / a'Σa
where δ = (μ1 - μ2) is the difference in mean vectors. This ratio is the population counterpart of (11-33). Show that the ratio is maximized by the linear combination

a = cΣ⁻¹δ

for any c ≠ 0.
Hint: Note that (μi - μ̄)(μi - μ̄)' = (1/4)(μ1 - μ2)(μ1 - μ2)' for i = 1, 2, where μ̄ = (1/2)(μ1 + μ2).

11.10. Suppose that n1 = 11 and n2 = 12 observations are made on two random variables X1 and X2, where X1 and X2 are assumed to have a bivariate normal distribution with a common covariance matrix Σ, but possibly different mean vectors μ1 and μ2. The sample mean vectors and pooled covariance matrix are
ables xl and x2 ' where xl and x2 are assumed to have a bivariate normal distribution with a common covariance matrix I, but possibly different mean vectors IL l and IL 2 • The sample mean vectors and pooled covariance matrix are xl
=
[ = � } [�J [ ] X2
=
7.3 -1.1 -1.1 4.8 (a) Test for the difference in population mean vectors using Hotelling ' s two sample T 2-statistic. Let a = .10. (b) Construct Fisher ' s (sample) linear discriminant function. [See (11-19) and (11-35) .] (c) Assign the observation x0 = [0 1 J to either population 1r 1 or 1r2 . Assume equal costs and equal prior probabilities. 11.11. Suppose a univariate random variable X has a normal distribution with vari ance 4. If X is from population 1r 1 , its mean is 10; if it is from population 1r2 , its mean is 14. Assume equal prior probabilities for the events A1 = X is from population 1r 1 and A2 = X is from population 1r2 , and assume that the misclassification costs c(2 1 1 ) and c ( 1 1 2) are equal (for instance, $10). We de cide that we shall allocate (classify) X to population 1r 1 if X < c, for some c to be determined, and to population 1r2 if X > c. Let B1 be the event X is clas sified into population 1r 1 and B2 be the event X is classified into population 1r2 . Make a table showing the following: P(B1 1 A2), P (B2 1 A1 ) , P(A1 and B2) , P(A2 and B1 ), P(misclassification), and expected cost for various values of c. For what choice of c is expected cost minimized? The table should take the following form:
s pooled
c
P(B1 \ A2)
P(B2 \ A1 )
=
P(A1 and B2)
P(A2 and B1 )
P( error)
Expected cost
10 14
What is the value of the minimum expected cost? 11.12. Repeat Exercise 11.11 if the prior probabilities of A1 and A2 are equal, but c(2 1 1 ) = $5 and c ( 1 1 2) = $15.
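The table in Exercise 11.11 can be filled in numerically. The sketch below (an illustration, with the problem's values hard-coded) evaluates the misclassification probabilities and expected cost over a grid of cutoffs c using Python's `statistics.NormalDist`; with equal priors and equal costs the minimizing cutoff is the midpoint of the two means.

```python
from statistics import NormalDist

# Sketch for Exercise 11.11: X ~ N(10, 2^2) under A1 and N(14, 2^2) under A2
# (variance 4), equal priors and equal $10 misclassification costs.
d1, d2 = NormalDist(10, 2), NormalDist(14, 2)

def expected_cost(c, p1=0.5, cost21=10.0, cost12=10.0):
    p_b2_a1 = 1 - d1.cdf(c)   # P(B2 | A1): a pi_1 observation classified as pi_2
    p_b1_a2 = d2.cdf(c)       # P(B1 | A2): a pi_2 observation classified as pi_1
    return cost21 * p_b2_a1 * p1 + cost12 * p_b1_a2 * (1 - p1)

grid = [10 + 0.01 * k for k in range(401)]   # cutoffs c in [10, 14]
best = min(grid, key=expected_cost)
print(round(best, 2))                        # -> 12.0, the midpoint of the means
```

Exercises 11.12 and 11.13 amount to rerunning the same search with different `p1`, `cost21`, and `cost12` arguments.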
Chapter 11 Discrimination and Classification
11.13. Repeat Exercise 11.11 if the prior probabilities of A1 and A2 are P(A1) = .25 and P(A2) = .75 and the misclassification costs are as in Exercise 11.12.

11.14. Consider the discriminant functions derived in Example 11.3. Normalize â using (11-21) and (11-22). Compute the two midpoints m̂1* and m̂2* corresponding to the two choices of normalized vectors, say, â1* and â2*. Classify x0' = [-.210, -.044] with the function ŷ0 = â*'x0 for the two cases. Are the results consistent with the classification obtained for the case of equal prior probabilities in Example 11.3? Should they be?

11.15. Derive the expressions in (11-23) from (11-6) when f1(x) and f2(x) are multivariate normal densities with means μ1, μ2 and covariances Σ1, Σ2, respectively.

11.16. Suppose x comes from one of two populations:

π1: Normal with mean μ1 and covariance matrix Σ1
π2: Normal with mean μ2 and covariance matrix Σ2

If the respective density functions are denoted by f1(x) and f2(x), find the expression for the quadratic discriminator

Q = ln[f1(x)/f2(x)]

If Σ1 = Σ2 = Σ, for instance, verify that Q becomes

(μ1 - μ2)'Σ⁻¹x - (1/2)(μ1 - μ2)'Σ⁻¹(μ1 + μ2)

11.17. Suppose populations π1 and π2 are as follows:

                          Population π1           Population π2
Distribution              Normal                  Normal
Mean μ                    [10, 15]'               [10, 25]'
Variance-Covariance Σ     [[18, 12], [12, 32]]    [[20, -7], [-7, 5]]

Assume equal prior probabilities and misclassification costs of c(2|1) = $10 and c(1|2) = $73.89. Find the posterior probabilities of populations π1 and π2, P(π1|x) and P(π2|x), the value of the quadratic discriminator Q in Exercise 11.16, and the classification for each value of x in the following table:

x            Q     Classification
[10, 15]'
[12, 17]'
⋮
[30, 35]'

(Note: Use an increment of 2 in each coordinate; 11 points in all.)
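The quadratic discriminator of Exercises 11.16 and 11.17 can be evaluated with a short sketch. The 2x2 algebra is written out in pure Python; the means and the common covariance used at the bottom are illustrative values in the spirit of Exercise 11.17, and the check confirms that with Σ1 = Σ2 the quadratic form Q collapses to the linear expression displayed above.

```python
import math

# Q(x) = ln[f1(x)/f2(x)]
#      = -(1/2) ln(|S1|/|S2|)
#        - (1/2)[(x - m1)' S1^{-1} (x - m1) - (x - m2)' S2^{-1} (x - m2)]

def inv2(S):
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]], det

def qform(x, m, Sinv):
    u = [x[0] - m[0], x[1] - m[1]]
    return (u[0] * (Sinv[0][0] * u[0] + Sinv[0][1] * u[1])
            + u[1] * (Sinv[1][0] * u[0] + Sinv[1][1] * u[1]))

def Q(x, m1, S1, m2, S2):
    S1i, det1 = inv2(S1)
    S2i, det2 = inv2(S2)
    return -0.5 * math.log(det1 / det2) - 0.5 * (qform(x, m1, S1i) - qform(x, m2, S2i))

# When S1 = S2 = S, Q reduces to (m1-m2)' S^{-1} x - (1/2)(m1-m2)' S^{-1} (m1+m2)
m1, m2, S = [10.0, 15.0], [10.0, 25.0], [[18.0, 12.0], [12.0, 32.0]]
Si, _ = inv2(S)
d = [m1[0] - m2[0], m1[1] - m2[1]]
w = [d[0] * Si[0][0] + d[1] * Si[1][0], d[0] * Si[0][1] + d[1] * Si[1][1]]  # d' S^{-1}
x = [12.0, 17.0]
linear = w[0] * x[0] + w[1] * x[1] - 0.5 * (w[0] * (m1[0] + m2[0]) + w[1] * (m1[1] + m2[1]))
print(abs(Q(x, m1, S, m2, S) - linear) < 1e-9)   # -> True: the two forms agree
```

Classifying the 11 grid points then amounts to comparing Q(x) against ln[(c(1|2)p2)/(c(2|1)p1)].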
Show each of the following on a graph of the x1, x2 plane.
(a) The mean of each population
(b) The ellipse of minimal area with probability .95 of containing x for each population
(c) The region R1 (for population π1) and the region Ω - R1 = R2 (for population π2)
(d) The 11 points classified in the table

11.18. If B is defined as c(μ1 - μ2)(μ1 - μ2)' for some constant c, verify that e = cΣ⁻¹(μ1 - μ2) is in fact an (unscaled) eigenvector of Σ⁻¹B, where Σ is a covariance matrix.

11.19. (a) Using the original data sets X1 and X2 given in Example 11.6, calculate x̄_i, S_i, i = 1, 2, and S_pooled, verifying the results provided for these quantities in the example.
(b) Using the calculations in Part a, compute Fisher's linear discriminant function, and use it to classify the sample observations according to Rule (11-35). Verify that the confusion matrix given in Example 11.6 is correct.
(c) Classify the sample observations on the basis of the smallest squared distance D_i²(x) of the observations from the group means x̄1 and x̄2. [See (11-54).] Compare the results with those in Part b. Comment.

11.20. The matrix identity (see Bartlett [3])

S⁻¹_H,pooled = ((n - 3)/(n - 2)) [ S⁻¹_pooled + c_k S⁻¹_pooled (x_H - x̄_k)(x_H - x̄_k)' S⁻¹_pooled / (1 - c_k (x_H - x̄_k)' S⁻¹_pooled (x_H - x̄_k)) ]

where

c_k = n_k / ((n_k - 1)(n - 2))

allows the calculation of S⁻¹_H,pooled from S⁻¹_pooled. Verify this identity using the data from Example 11.6. Specifically, set n = n1 + n2, k = 1, and x_H' = [2, 12]. Calculate S⁻¹_H,pooled using the full-data S⁻¹_pooled and x̄1, and compare the result with S⁻¹_H,pooled in Example 11.6.

11.21. Let λ1 ≥ λ2 ≥ ⋯ ≥ λs > 0 denote the s ≤ min(g - 1, p) nonzero eigenvalues of Σ⁻¹B_μ and e1, e2, …, es the corresponding eigenvectors (scaled so that e'Σe = 1). Show that the vector of coefficients a that maximizes the ratio

a'B_μa / (a'Σa) = a'[ Σ_{i=1}^{g} (μ_i - μ̄)(μ_i - μ̄)' ]a / (a'Σa)

is given by a1 = e1. The linear combination a1'X is called the first discriminant. Show that the value a2 = e2 maximizes the ratio subject to Cov(a1'X, a2'X) = 0. The linear combination a2'X is called the second discriminant. Continuing, ak = ek maximizes the ratio subject to
0 = Cov(ak'X, ai'X), i < k, and ak'X is called the kth discriminant. Also, Var(ai'X) = 1, i = 1, …, s. [See (11-62) for the sample equivalent.]

Hint: We first convert the maximization problem to one already solved. By the spectral decomposition in (2-20), Σ = P'ΛP, where Λ is a diagonal matrix with positive elements λ_i. Let Λ^{1/2} denote the diagonal matrix with elements √λ_i. By (2-22), the symmetric square-root matrix Σ^{1/2} = P'Λ^{1/2}P and its inverse Σ^{-1/2} = P'Λ^{-1/2}P satisfy Σ^{1/2}Σ^{1/2} = Σ, Σ^{1/2}Σ^{-1/2} = I = Σ^{-1/2}Σ^{1/2}, and Σ^{-1/2}Σ^{-1/2} = Σ⁻¹. Next, set u = Σ^{1/2}a, so u'u = a'Σ^{1/2}Σ^{1/2}a = a'Σa and u'Σ^{-1/2}B_μΣ^{-1/2}u = a'Σ^{1/2}Σ^{-1/2}B_μΣ^{-1/2}Σ^{1/2}a = a'B_μa. Consequently, the problem reduces to maximizing

u'Σ^{-1/2}B_μΣ^{-1/2}u / (u'u)

over u. From (2-51), the maximum of this ratio is λ1, the largest eigenvalue of Σ^{-1/2}B_μΣ^{-1/2}. This maximum occurs when u = e1, the normalized eigenvector associated with λ1. Because e1 = u = Σ^{1/2}a1, or a1 = Σ^{-1/2}e1,

Var(a1'X) = a1'Σa1 = e1'Σ^{-1/2}ΣΣ^{-1/2}e1 = e1'Σ^{-1/2}Σ^{1/2}Σ^{1/2}Σ^{-1/2}e1 = e1'e1 = 1

By (2-52), u ⊥ e1 maximizes the preceding ratio when u = e2, the normalized eigenvector corresponding to λ2. For this choice, a2 = Σ^{-1/2}e2, and Cov(a2'X, a1'X) = a2'Σa1 = e2'Σ^{-1/2}ΣΣ^{-1/2}e1 = e2'e1 = 0, since e2 ⊥ e1. Similarly, Var(a2'X) = a2'Σa2 = e2'e2 = 1. Continue in this fashion for the remaining discriminants. Note that if λ and e are an eigenvalue-eigenvector pair of Σ^{-1/2}B_μΣ^{-1/2}, then Σ^{-1/2}B_μΣ^{-1/2}e = λe, and multiplication on the left by Σ^{-1/2} gives

Σ^{-1/2}Σ^{-1/2}B_μΣ^{-1/2}e = λΣ^{-1/2}e   or   Σ⁻¹B_μ(Σ^{-1/2}e) = λ(Σ^{-1/2}e)

Thus, Σ⁻¹B_μ has the same eigenvalues as Σ^{-1/2}B_μΣ^{-1/2}, but the corresponding eigenvector is proportional to Σ^{-1/2}e = a, as asserted.

11.22. Show that Δ²_S = λ1 + λ2 + ⋯ + λp = λ1 + λ2 + ⋯ + λs, where λ1, λ2, …, λs are the nonzero eigenvalues of Σ⁻¹B_μ (or Σ^{-1/2}B_μΣ^{-1/2}) and Δ²_S is given by (11-68). Also, show that λ1 + λ2 + ⋯ + λr is the resulting separation when only the first r discriminants, Y1, Y2, …, Yr, are used.

Hint: Let P be the orthogonal matrix whose ith row e_i' is the eigenvector of Σ^{-1/2}B_μΣ^{-1/2} corresponding to the ith largest eigenvalue, i = 1, 2, …, p. Consider

Y (p×1) = [Y1, …, Ys, …, Yp]' = [e1'Σ^{-1/2}X, …, es'Σ^{-1/2}X, …, ep'Σ^{-1/2}X]' = PΣ^{-1/2}X
Now, μ_iY = E(Y | π_i) = PΣ^{-1/2}μ_i and μ̄_Y = PΣ^{-1/2}μ̄, so

(μ_iY - μ̄_Y)'(μ_iY - μ̄_Y) = (μ_i - μ̄)'Σ^{-1/2}P'PΣ^{-1/2}(μ_i - μ̄) = (μ_i - μ̄)'Σ⁻¹(μ_i - μ̄)

Therefore, Δ²_S = Σ_{i=1}^{g} (μ_iY - μ̄_Y)'(μ_iY - μ̄_Y). Using Y1, we have

Σ_{i=1}^{g} (μ_iY1 - μ̄_Y1)² = Σ_{i=1}^{g} e1'Σ^{-1/2}(μ_i - μ̄)(μ_i - μ̄)'Σ^{-1/2}e1 = e1'Σ^{-1/2}B_μΣ^{-1/2}e1 = λ1

because e1 has eigenvalue λ1. Similarly, Y2 produces

Σ_{i=1}^{g} (μ_iY2 - μ̄_Y2)² = e2'Σ^{-1/2}B_μΣ^{-1/2}e2 = λ2

and Yp produces

Σ_{i=1}^{g} (μ_iYp - μ̄_Yp)² = ep'Σ^{-1/2}B_μΣ^{-1/2}ep = λp

Thus,

Δ²_S = Σ_{i=1}^{g} (μ_iY - μ̄_Y)'(μ_iY - μ̄_Y)
     = Σ_{i=1}^{g} (μ_iY1 - μ̄_Y1)² + Σ_{i=1}^{g} (μ_iY2 - μ̄_Y2)² + ⋯ + Σ_{i=1}^{g} (μ_iYp - μ̄_Yp)²
     = λ1 + λ2 + ⋯ + λp = λ1 + λ2 + ⋯ + λs

since λ_{s+1} = ⋯ = λp = 0. If only the first r discriminants are used, their contribution to Δ²_S is λ1 + λ2 + ⋯ + λr.

The following exercises require the use of a computer.

11.23. Consider the data given in Exercise 1.14.
(a) Check the marginal distributions of the x_i's in both the multiple-sclerosis (MS) group and non-multiple-sclerosis (NMS) group for normality by graphing the corresponding observations as normal probability plots. Suggest appropriate data transformations if the normality assumption is suspect.
(b) Assume that Σ1 = Σ2 = Σ. Construct Fisher's linear discriminant function. Do all the variables in the discriminant function appear to be important? Discuss your answer. Develop a classification rule assuming equal prior probabilities and equal costs of misclassification.
(c) Using the results in (b), calculate the apparent error rate. If computing resources allow, calculate an estimate of the expected actual error rate using Lachenbruch's holdout procedure. Compare the two error rates.

11.24. Annual financial data are collected for bankrupt firms approximately 2 years prior to their bankruptcy and for financially sound firms at about the same
time. The data on four variables, X1 = CF/TD = (cash flow)/(total debt), X2 = NI/TA = (net income)/(total assets), X3 = CA/CL = (current assets)/(current liabilities), and X4 = CA/NS = (current assets)/(net sales), are given in Table 11.4.
(a) Using a different symbol for each group, plot the data for the pairs of observations (x1, x2), (x1, x3) and (x1, x4). Does it appear as if the data are approximately bivariate normal for any of these pairs of variables?
(b) Using the n1 = 21 pairs of observations (x1, x2) for bankrupt firms and the n2 = 25 pairs of observations (x1, x2) for nonbankrupt firms, calculate the sample mean vectors x̄1 and x̄2 and the sample covariance matrices S1 and S2.
(c) Using the results in (b) and assuming that both random samples are from bivariate normal populations, construct the classification rule (11-25) with p1 = p2 and c(1|2) = c(2|1).
(d) Evaluate the performance of the classification rule developed in (c) by computing the apparent error rate (APER) from (11-30) and the estimated expected actual error rate Ê(AER) from (11-32).
(e) Repeat Parts c and d, assuming that p1 = .05, p2 = .95, and c(1|2) = c(2|1). Is this choice of prior probabilities reasonable? Explain.
(f) Using the results in (b), form the pooled covariance matrix S_pooled, and construct Fisher's sample linear discriminant function in (11-19). Use this function to classify the sample observations and evaluate the APER. Is Fisher's linear discriminant function a sensible choice for a classifier in this case? Explain.
(g) Repeat Parts b-e using the observation pairs (x1, x3) and (x1, x4). Do some variables appear to be better classifiers than others? Explain.
(h) Repeat Parts b-e using observations on all four variables (X1, X2, X3, X4).

11.25. The annual financial data listed in Table 11.4 have been analyzed by Johnson [18] with a view toward detecting influential observations in a discriminant analysis. Consider variables X1 = CF/TD and X3 = CA/CL.
(a) Using the data on variables X1 and X3, construct Fisher's linear discriminant function. Use this function to classify the sample observations and evaluate the APER. [See (11-35) and (11-30).] Plot the data and the discriminant line in the (x1, x3) coordinate system.
(b) Johnson [18] has argued that the multivariate observations in rows 16 for bankrupt firms and 13 for sound firms are influential. Using the X1, X3 data, calculate Fisher's linear discriminant function with only data point 16 for bankrupt firms deleted. Repeat this procedure with only data point 13 for sound firms deleted. Plot the respective discriminant lines on the scatter in Part a, and calculate the APERs, ignoring the deleted point in each case. Does deleting either of these multivariate observations make a difference? (Note that neither of the potentially influential data points is particularly "distant" from the center of its respective scatter.)

11.26. Using the data in Table 11.4, define a binary response variable Z that assumes the value 0 if a firm is bankrupt and 1 if a firm is not bankrupt. Let X = CA/CL, and consider the straight-line regression of Z on X. Although a binary response variable does not meet the standard regression assumptions,
TABLE 11.4 BANKRUPTCY DATA

Row   x1 = CF/TD   x2 = NI/TA   x3 = CA/CL   x4 = CA/NS   Population πi, i = 1, 2
 1      -.45         -.41         1.09          .45          0
 2      -.56         -.31         1.51          .16          0
 3       .06          .02         1.01          .40          0
 4      -.07         -.09         1.45          .26          0
 5      -.10         -.09         1.56          .67          0
 6      -.14         -.07          .71          .28          0
 7       .04          .01         1.50          .71          0
 8      -.06         -.06         1.37          .40          0
 9       .07         -.01         1.37          .34          0
10      -.13         -.14         1.42          .44          0
11      -.23         -.30          .33          .18          0
12       .07          .02         1.31          .25          0
13       .01          .00         2.15          .70          0
14      -.28         -.23         1.19          .66          0
15       .15          .05         1.88          .27          0
16       .37          .11         1.99          .38          0
17      -.08         -.08         1.51          .42          0
18       .05          .03         1.68          .95          0
19       .01         -.00         1.26          .60          0
20       .12          .11         1.14          .17          0
21      -.28         -.27         1.27          .51          0
 1       .51          .10         2.49          .54          1
 2       .08          .02         2.01          .53          1
 3       .38          .11         3.27          .35          1
 4       .19          .05         2.25          .33          1
 5       .32          .07         4.24          .63          1
 6       .31          .05         4.45          .69          1
 7       .12          .05         2.52          .69          1
 8      -.02          .02         2.05          .35          1
 9       .22          .08         2.35          .40          1
10       .17          .07         1.80          .52          1
11       .15          .05         2.17          .55          1
12      -.10         -.01         2.50          .58          1
13       .14         -.03          .46          .26          1
14       .14          .07         2.61          .52          1
15       .15          .06         2.23          .56          1
16       .16          .05         2.31          .20          1
17       .29          .06         1.84          .38          1
18       .54          .11         2.33          .48          1
19      -.33         -.09         3.01          .47          1
20       .48          .09         1.24          .18          1
21       .56          .11         4.29          .45          1
22       .20          .08         1.99          .30          1
23       .47          .14         2.92          .45          1
24       .17          .04         2.45          .14          1
25       .58          .04         5.06          .13          1

Legend: π1 = 0: bankrupt firms; π2 = 1: nonbankrupt firms.
Source: 1968, 1969, 1970, 1971, 1972 Moody's Industrial Manuals.
it is still reasonable to consider predicting the response with a linear function of the predictor(s).
(a) Use least squares to determine the fitted straight line for the X, Z data. Plot the fitted values for bankrupt firms as a dot diagram on the interval [0, 1]. Repeat this procedure for nonbankrupt firms and overlay the two dot diagrams. A reasonable discrimination rule is to predict that a firm will go bankrupt if its fitted value is closer to 0 than to 1; that is, the fitted value is less than .5. Similarly, a firm is predicted to be sound if its fitted value is greater than .5. Use this decision rule to classify the sample firms. Calculate the APER.
(b) Repeat the analysis in Part a using all four variables, X1, …, X4. Is there any change in the APER? Do data points 16 for bankrupt firms and 13 for nonbankrupt firms stand out as influential?

11.27. The data in Table 11.5 contain observations on x2 = sepal width and x4 = petal width for samples from three species of iris. There are n1 = n2 = n3 = 50 observations in each sample.
(a) Plot the data in the (x2, x4) variable space. Do the observations for the three groups appear to be bivariate normal?
(b) Assume that the samples are from bivariate normal populations with a common covariance matrix. Test the hypothesis H0: μ1 = μ2 = μ3 versus H1: at least one μ_i is different from the others at the α = .05 significance level. Is the assumption of a common covariance matrix reasonable in this case? Explain.
(c) Assuming that the populations are bivariate normal, construct the quadratic discriminant scores d̂_i^Q(x) given by (11-47) with p1 = p2 = p3 = 1/3. Using Rule (11-48), classify the new observation x0' = [3.5, 1.75] into population π1, π2, or π3.
(d) Assume that the covariance matrices Σ_i are the same for all three bivariate normal populations. Construct the linear discriminant score d̂_i(x) given by (11-51), and use it to assign x0' = [3.5, 1.75] to one of the populations π_i, i = 1, 2, 3 according to (11-52). Take p1 = p2 = p3 = 1/3. Compare the results in Parts c and d. Which approach do you prefer? Explain.
(e) Assuming equal covariance matrices and bivariate normal populations, and supposing that p1 = p2 = p3 = 1/3, allocate x0' = [3.5, 1.75] to π1, π2, or π3 using Rule (11-56). Compare the result with that in Part d. Delineate the classification regions R̂1, R̂2, and R̂3 on your graph from Part a determined by the linear functions d̂_ki(x0) in (11-56).
(f) Using the linear discriminant scores from Part d, classify the sample observations. Calculate the APER and Ê(AER). (To calculate the latter, you should use Lachenbruch's holdout procedure. [See (11-57).])

11.28. Darroch and Mosimann [6] have argued that the three species of iris indicated in Table 11.5 can be discriminated on the basis of "shape" or scale-free information alone. Let Y1 = X1/X2 be sepal shape and Y2 = X3/X4 be petal shape.
(a) Plot the data in the (log Y1, log Y2) variable space. Do the observations for the three groups appear to be bivariate normal?
TABLE 11.5 DATA ON IRISES

π1: Iris setosa; π2: Iris versicolor; π3: Iris virginica. For each of the n = 50 plants per species, the table records x1 = sepal length, x2 = sepal width, x3 = petal length, and x4 = petal width (150 observations in all).

Source: Anderson [1].
(b) Assuming equal covariance matrices and bivariate normal populations, and supposing that p1 = p2 = p3 = 1/3, construct the linear discriminant scores d̂_i(x) given by (11-51) using both variables log Y1, log Y2 and each variable individually. Calculate the APERs.
(c) Using the linear discriminant functions from Part b, calculate the holdout estimates of the expected AERs, and fill in the following summary table:

Variable(s)        Misclassification rate
log Y1
log Y2
log Y1, log Y2

Compare the preceding misclassification rates with those in the summary tables in Example 11.12. Does it appear as if information on shape alone is an effective discriminator for these species of iris?
(d) Compare the corresponding error rates in Parts b and c. Given the scatter plot in Part a, would you expect these rates to differ much? Explain.

11.29. The GPA and GMAT data alluded to in Example 11.11 are listed in Table 11.6.
(a) Using these data, calculate x̄1, x̄2, x̄3, x̄, and S_pooled, and thus verify the results for these quantities given in Example 11.11.
(b) Calculate W⁻¹ and B and the eigenvalues and eigenvectors of W⁻¹B. Use the linear discriminants derived from these eigenvectors to classify the new observation x0' = [3.21, 497] into one of the populations π1: admit; π2: not admit; and π3: borderline. Does the classification agree with that in Example 11.11? Should it? Explain.
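The eigen-analysis requested in Exercise 11.29(b) can be sketched for the 2x2 case (two variables). The W and B below are illustrative placeholders, not the matrices computed from Table 11.6; the point is the mechanics of extracting an eigenvalue-eigenvector pair of W⁻¹B by hand.

```python
# Eigenvalues of a 2x2 matrix from its characteristic polynomial,
# and an (unnormalized) eigenvector for each root.

def matmul2(A, B):
    return [[A[i][0]*B[0][j] + A[i][1]*B[1][j] for j in range(2)] for i in range(2)]

def inv2(S):
    (a, b), (c, d) = S
    det = a*d - b*c
    return [[d/det, -b/det], [-c/det, a/det]]

def eig2(M):
    # roots of lam^2 - tr*lam + det = 0
    tr = M[0][0] + M[1][1]
    det = M[0][0]*M[1][1] - M[0][1]*M[1][0]
    disc = (tr*tr/4.0 - det) ** 0.5
    return tr/2.0 + disc, tr/2.0 - disc

def eigvec2(M, lam):
    # solve (M - lam*I) v = 0 up to scale
    if abs(M[0][1]) > 1e-12:
        return [M[0][1], lam - M[0][0]]
    return [lam - M[1][1], M[1][0]] if abs(M[1][0]) > 1e-12 else [1.0, 0.0]

W = [[2.0, 0.3], [0.3, 1.0]]    # placeholder within-groups matrix
B = [[1.0, 0.5], [0.5, 2.0]]    # placeholder between-groups matrix
M = matmul2(inv2(W), B)
l1, l2 = eig2(M)
v1 = eigvec2(M, l1)

# verify M v1 = l1 v1
resid = [M[0][0]*v1[0] + M[0][1]*v1[1] - l1*v1[0],
         M[1][0]*v1[0] + M[1][1]*v1[1] - l1*v1[1]]
print(max(abs(r) for r in resid) < 1e-9)   # -> True
```

With the real W and B, the eigenvectors for the largest two eigenvalues give the two discriminant directions used to allocate the new applicant.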
TABLE 11.6 ADMISSION DATA FOR GRADUATE SCHOOL OF BUSINESS

For each of the 85 applicants, GPA (x1) and GMAT score (x2) are recorded. The applicants are grouped as π1: admit; π2: do not admit; π3: borderline.

11.30. Gerrild and Lantz [12] chemically analyzed crude-oil samples from three zones of sandstone:

π1: Wilhelm
π2: Sub-Mulinia
π3: Upper (Mulinia, second subscales, first subscales)

The values of the trace elements

X1 = vanadium (in percent ash)
X2 = iron (in percent ash)
X3 = beryllium (in percent ash)
and two measures of hydrocarbons,

X4 = saturated hydrocarbons (in percent area)
X5 = aromatic hydrocarbons (in percent area)

are presented for 56 cases in Table 11.7. The last two measurements are determined from areas under a gas-liquid chromatography curve.
(a) Obtain the estimated minimum TPM rule, assuming normality. Comment on the adequacy of the assumption of normality.
(b) Determine the estimate of E(AER) using Lachenbruch's holdout procedure. Also, give the confusion matrix.
(c) Consider various transformations of the data to normality (see Example 11.14), and repeat Parts a and b.

11.31. Refer to the data on salmon in Table 11.2.
(a) Plot the bivariate data for the two groups of salmon. Are the sizes and orientation of the scatters roughly the same? Do bivariate normal distributions with a common covariance matrix appear to be viable population models for the Alaskan and Canadian salmon?
(b) Using a linear discriminant function for two normal populations with equal priors and equal costs [see (11-19)], construct dot diagrams of the discriminant scores for the two groups. Does it appear as if the growth ring diameters separate the two groups reasonably well? Explain.
(c) Repeat the analysis in Example 11.7 for the male and female salmon separately. Is it easier to discriminate Alaskan male salmon from Canadian male salmon than it is to discriminate the females in the two groups? Is gender (male or female) likely to be a useful discriminatory variable?

11.32. Data on hemophilia A carriers, similar to those used in Example 11.3, are listed in Table 11.8. (See [14].) Using these data,
(a) Investigate the assumption of bivariate normality for the two groups.
(b) Obtain the sample linear discriminant function, assuming equal prior probabilities, and estimate the error rate using the holdout procedure.
(c) Classify the following 10 new cases using the discriminant function in Part b.
NEW CASES REQUIRING CLASSIFICATION

Case   log10(AHF activity)   log10(AHF antigen)
 1          -.112                 -.279
 2          -.059                 -.068
 3           .064                  .012
 4          -.043                 -.052
 5          -.050                 -.098
 6          -.094                 -.113
 7          -.123                 -.143
 8          -.011                 -.037
 9          -.210                 -.090
10          -.126                 -.019
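In the spirit of Parts (b) and (c) of Exercise 11.32, the sketch below classifies new cases with Fisher's linear discriminant and the midpoint rule. The group means and pooled covariance matrix here are hypothetical placeholders, not the estimates computed from Table 11.8.

```python
# Fisher's rule for two groups: allocate x to pi_1 when a'x >= m,
# where a = S^{-1}(xbar1 - xbar2) and m = (1/2) a'(xbar1 + xbar2).

xbar1 = [-0.0065, -0.0390]   # hypothetical noncarrier mean (activity, antigen)
xbar2 = [-0.2483,  0.0262]   # hypothetical obligatory-carrier mean
S = [[0.0150, 0.0100], [0.0100, 0.0200]]   # hypothetical pooled covariance

def inv2(S):
    (a, b), (c, d) = S
    det = a*d - b*c
    return [[d/det, -b/det], [-c/det, a/det]]

Sinv = inv2(S)
d = [xbar1[0] - xbar2[0], xbar1[1] - xbar2[1]]
a = [Sinv[0][0]*d[0] + Sinv[0][1]*d[1], Sinv[1][0]*d[0] + Sinv[1][1]*d[1]]
m = 0.5 * (a[0]*(xbar1[0] + xbar2[0]) + a[1]*(xbar1[1] + xbar2[1]))

def classify(x):
    # 1 = pi_1 (noncarrier), 2 = pi_2 (obligatory carrier)
    return 1 if a[0]*x[0] + a[1]*x[1] >= m else 2

new_cases = [(-0.112, -0.279), (-0.059, -0.068), (0.064, 0.012)]
print([classify(x) for x in new_cases])   # -> [1, 1, 1] for these placeholder estimates
```

Part (d)'s unequal priors enter the same rule by shifting the cutoff m by ln(p2/p1).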
TABLE 11.7 CRUDE-OIL DATA

Zone    x1      x2     x3     x4      x5
π1      3.9    51.0   0.20   7.06   12.19
π1      2.7    49.0   0.07   7.14   12.23
π1      2.8    36.0   0.30   7.00   11.30
π1      3.1    45.0   0.08   7.20   13.01
π1      3.5    46.0   0.10   7.81   12.63
π1      3.9    43.0   0.07   6.25   10.42
π1      2.7    35.0   0.00   5.11    9.00
π2      5.0    47.0   0.07   7.06    6.10
π2      3.4    32.0   0.20   5.82    4.69
π2      1.2    12.0   0.00   5.54    3.15
π2      8.4    17.0   0.07   6.31    4.55
π2      4.2    36.0   0.50   9.25    4.95
π2      4.2    35.0   0.50   5.69    2.22
π2      3.9    41.0   0.10   5.63    2.94
π2      3.9    36.0   0.07   6.19    2.27
π2      7.3    32.0   0.30   8.02   12.92
π2      4.4    46.0   0.07   7.54    5.76
π2      3.0    30.0   0.00   5.12   10.77
π3      6.3    13.0   0.50   4.24    8.27
π3      1.7     5.6   1.00   5.69    4.64
π3      7.3    24.0   0.00   4.34    2.99
π3      7.8    18.0   0.50   3.92    6.09
π3      7.8    25.0   0.70   5.39    6.20
π3      7.8    26.0   1.00   5.02    2.50
π3      9.5    17.0   0.05   3.52    5.71
π3      7.7    14.0   0.30   4.65    8.63
π3     11.0    20.0   0.50   4.27    8.40
π3      8.0    14.0   0.30   4.32    7.87
π3      8.4    18.0   0.20   4.38    7.98
π3     10.0    18.0   0.10   3.06    7.67
π3      7.3    15.0   0.05   3.76    6.84
π3      9.5    22.0   0.30   3.98    5.02
π3      8.4    15.0   0.20   5.02   10.12
π3      8.4    17.0   0.20   4.42    8.25
π3      9.5    25.0   0.50   4.44    5.95
π3      7.2    22.0   1.00   4.70    3.49
π3      4.0    12.0   0.50   5.71    6.32
π3      6.7    52.0   0.50   4.80    3.20
π3      9.0    27.0   0.30   3.69    3.30
π3      7.8    29.0   1.50   6.72    5.75
π3      4.5    41.0   0.50   3.33    2.27
π3      6.2    34.0   0.70   7.56    6.93
π3      5.6    20.0   0.50   5.07    6.70
π3      9.0    17.0   0.20   4.39    8.33
π3      8.4    20.0   0.10   3.74    3.77
π3      9.5    19.0   0.50   3.72    7.37
π3      9.0    20.0   0.50   5.97   11.17
π3      6.2    16.0   0.05   4.23    4.18
π3      7.3    20.0   0.50   4.39    3.50
π3      3.6    15.0   0.70   7.00    4.82
π3      6.2    34.0   0.07   4.84    2.37
π3      7.3    22.0   0.00   4.13    2.70
π3      4.1    29.0   0.70   5.78    7.76
π3      5.4    29.0   0.20   4.64    2.65
π3      5.0    34.0   0.70   4.21    6.50
π3      6.2    27.0   0.30   3.97    2.97
(d) Repeat Parts a-c, assuming that the prior probability of obligatory carriers (group 2) is � and that of noncarriers (group 1) is �.

11.33. Consider the data on bulls in Table 1.10.
(a) Using the variables YrHgt, FtFrBody, PrctFFB, Frame, BkFat, SaleHt, and SaleWt, calculate Fisher's linear discriminants, and classify the bulls as Angus, Hereford, or Simental. Calculate an estimate of E(AER) using the holdout procedure. Classify a bull with characteristics YrHgt = 50, FtFrBody = 1000, PrctFFB = 73, Frame = 7, BkFat = .17, SaleHt = 54, and SaleWt = 1525 as one of the three breeds. Plot the discriminant scores for the bulls in the two-dimensional discriminant space, using different plotting symbols to identify the three groups.
(b) Is there a subset of the original seven variables that is almost as good for discriminating among the three breeds? Explore this possibility by computing the estimated E(AER) for various subsets.

11.34. Table 11.9 contains data on breakfast cereals produced by three different American manufacturers: General Mills (G), Kellogg (K), and Quaker (Q). Assuming multivariate normal data with a common covariance matrix, equal costs, and equal priors, classify the cereal brands according to manufacturer. Compute the estimated E(AER) using the holdout procedure. Interpret the coefficients of the discriminant functions. Does it appear as if some manufacturers are associated with more "nutritional" cereals (high protein, low fat, high fiber, low sugar, and so forth) than others? Plot the cereals in the two-dimensional discriminant space, using different plotting symbols to identify the three manufacturers.
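Exercises 11.33 and 11.34 both call for Lachenbruch's holdout estimate of the expected actual error rate. A minimal leave-one-out sketch is given below with a nearest-group-mean allocation rule and tiny made-up data; in the exercises themselves the allocation step would use the linear discriminant scores of (11-51)-(11-52).

```python
# Lachenbruch's holdout: refit on all observations but one, classify the
# held-out case, and tally the misclassifications.

data = [([1.0, 2.0], 0), ([1.2, 1.8], 0), ([0.8, 2.2], 0),
        ([3.0, 0.5], 1), ([3.2, 0.7], 1), ([2.8, 0.4], 1)]   # made-up (x, group)

def mean(rows):
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(2)]

def classify(x, means):
    # allocate to the group whose mean is closest (Euclidean distance)
    d2 = [sum((xi - mi) ** 2 for xi, mi in zip(x, m)) for m in means]
    return d2.index(min(d2))

errors = 0
for holdout in range(len(data)):
    train = [data[i] for i in range(len(data)) if i != holdout]
    means = [mean([r for r, g in train if g == k]) for k in (0, 1)]
    x, g = data[holdout]
    errors += classify(x, means) != g

print(errors / len(data))   # -> 0.0: the holdout estimate of the actual error rate
```

Because each case is classified by a rule fit without it, this estimate is less optimistic than the APER computed by resubstitution.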
TABLE 11.8 HEMOPHILIA DATA

For each woman, log10(AHF activity) and log10(AHF antigen) are recorded. Group 1 comprises the noncarriers (π1) and group 2 the obligatory carriers (π2).

Source: See [14].
TABLE 11.9 DATA ON BRANDS OF CEREAL

Brand                           Manufacturer  Calories  Protein  Fat  Sodium  Fiber  Carbohydrates  Sugar  Potassium  Group
 1  Apple_Cinnamon_Cheerios          G           110       2      2    180     1.5       10.5        10       70        1
 2  Cheerios                         G           110       6      2    290     2.0       17.0         1      105        1
 3  Cocoa_Puffs                      G           110       1      1    180     0.0       12.0        13       55        1
 4  Count_Chocula                    G           110       1      1    180     0.0       12.0        13       65        1
 5  Golden_Grahams                   G           110       1      1    280     0.0       15.0         9       45        1
 6  Honey_Nut_Cheerios               G           110       3      1    250     1.5       11.5        10       90        1
 7  Kix                              G           110       2      1    260     0.0       21.0         3       40        1
 8  Lucky_Charms                     G           110       2      1    180     0.0       12.0        12       55        1
 9  Multi_Grain_Cheerios             G           100       2      1    220     2.0       15.0         6       90        1
10  Oatmeal_Raisin_Crisp             G           130       3      2    170     1.5       13.5        10      120        1
11  Raisin_Nut_Bran                  G           100       3      2    140     2.5       10.5         8      140        1
12  Total_Corn_Flakes                G           110       2      1    200     0.0       21.0         3       35        1
13  Total_Raisin_Bran                G           140       3      1    190     4.0       15.0        14      230        1
14  Total_Whole_Grain                G           100       3      1    200     3.0       16.0         3      110        1
15  Trix                             G           110       1      1    140     0.0       13.0        12       25        1
16  Wheaties                         G           100       3      1    200     3.0       17.0         3      110        1
17  Wheaties_Honey_Gold              G           110       2      1    200     1.0       16.0         8       60        1
18  All_Bran                         K            70       4      1    260     9.0        7.0         5      320        2
19  Apple_Jacks                      K           110       2      0    125     1.0       11.0        14       30        2
20  Corn_Flakes                      K           100       2      0    290     1.0       21.0         2       35        2
21  Corn_Pops                        K           110       1      0     90     1.0       13.0        12       20        2
22  Cracklin'_Oat_Bran               K           110       3      3    140     4.0       10.0         7      160        2
23  Crispix                          K           110       2      0    220     1.0       21.0         3       30        2
24  Froot_Loops                      K           110       2      1    125     1.0       11.0        13       30        2
25  Frosted_Flakes                   K           110       1      0    200     1.0       14.0        11       25        2
26  Frosted_Mini_Wheats              K           100       3      0      0     3.0       14.0         7      100        2
27  Fruitful_Bran                    K           120       3      0    240     5.0       14.0        12      190        2
28  Just_Right_Crunchy_Nuggets       K           110       2      1    170     1.0       17.0         6       60        2
29  Mueslix_Crispy_Blend             K           160       3      2    150     3.0       17.0        13      160        2
30  Nut&Honey_Crunch                 K           120       2      1    190     0.0       15.0         9       40        2
31  Nutri-grain_Almond-Raisin        K           140       3      2    220     3.0       21.0         7      130        2
32  Nutri-grain_Wheat                K            90       3      0    170     3.0       18.0         2       90        2
33  Product_19                       K           100       3      0    320     1.0       20.0         3       45        2
34  Raisin_Bran                      K           120       3      1    210     5.0       14.0        12      240        2
35  Rice_Krispies                    K           110       2      0    290     0.0       22.0         3       35        2
36  Smacks                           K           110       2      1     70     1.0        9.0        15       40        2
37  Special_K                        K           110       6      0    230     1.0       16.0         3       55        2
38  Cap'n'Crunch                     Q           120       1      2    220     0.0       12.0        12       35        3
39  Honey_Graham_Ohs                 Q           120       1      2    220     1.0       12.0        11       45        3
40  Life                             Q           100       4      2    150     2.0       12.0         6       95        3
41  Puffed_Rice                      Q            50       1      0      0     0.0       13.0         0       15        3
42  Puffed_Wheat                     Q            50       2      0      0     1.0       10.0         0       50        3
43  Quaker_Oatmeal                   Q           100       5      2      0     2.7        1.0         1      110        3

Source: Data courtesy of Chad Dacus.
Chapter 11 Discrimination and Classification
REFERENCES
1. Anderson, E. "The Irises of the Gaspe Peninsula." Bulletin of the American Iris Society, 59 (1939), 2-5.
2. Anderson, T. W. An Introduction to Multivariate Statistical Methods (2d ed.). New York: John Wiley, 1984.
3. Bartlett, M. S. "An Inverse Matrix Adjustment Arising in Discriminant Analysis." Annals of Mathematical Statistics, 22 (1951), 107-111.
4. Bouma, B. N., et al. "Evaluation of the Detection Rate of Hemophilia Carriers." Statistical Methods for Clinical Decision Making, 7, no. 2 (1975), 339-350.
5. Breiman, L., J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Belmont, CA: Wadsworth, Inc., 1984.
6. Darroch, J. N., and J. E. Mosimann. "Canonical and Principal Components of Shape." Biometrika, 72, no. 1 (1985), 241-252.
7. Eisenbeis, R. A. "Pitfalls in the Application of Discriminant Analysis in Business, Finance and Economics." Journal of Finance, 32, no. 3 (1977), 875-900.
8. Fisher, R. A. "The Use of Multiple Measurements in Taxonomic Problems." Annals of Eugenics, 7 (1936), 179-188.
9. Fisher, R. A. "The Statistical Utilization of Multiple Measurements." Annals of Eugenics, 8 (1938), 376-386.
10. Ganesalingam, S. "Classification and Mixture Approaches to Clustering via Maximum Likelihood." Applied Statistics, 38, no. 3 (1989), 455-466.
11. Geisser, S. "Discrimination, Allocatory and Separatory, Linear Aspects." In Classification and Clustering, edited by J. Van Ryzin, pp. 301-330. New York: Academic Press, 1977.
12. Gerrild, P. M., and R. J. Lantz. "Chemical Analysis of 75 Crude Oil Samples from Pliocene Sand Units, Elk Hills Oil Field, California." U.S. Geological Survey Open-File Report,
1969.
13. Gnanadesikan, R. Methods for Statistical Data Analysis of Multivariate Observations. New York: John Wiley, 1977.
14. Habbema, J. D. F., J. Hermans, and K. Van Den Broek. "A Stepwise Discriminant Analysis Program Using Density Estimation." In Compstat 1974, Proc. Computational Statistics, pp. 101-110. Vienna: Physica, 1974.
15. Hand, D. J. Discrimination and Classification. New York: John Wiley, 1981.
16. Hills, M. "Allocation Rules and Their Error Rates." Journal of the Royal Statistical Society (B), 28 (1966), 1-31.
17. Hudlet, R., and R. A. Johnson. "Linear Discrimination and Some Further Results on Best Lower Dimensional Representations." In Classification and Clustering, edited by J. Van Ryzin, pp. 371-394. New York: Academic Press, 1977.
18. Johnson, W. "The Detection of Influential Observations for Allocation, Separation, and the Determination of Probabilities in a Bayesian Framework." Journal of Business and Economic Statistics, 5, no. 3 (1987), 369-381.
19. Kendall, M. G. Multivariate Analysis. New York: Hafner Press, 1975.
20. Krzanowski, W. J. "The Performance of Fisher's Linear Discriminant Function under Non-Optimal Conditions." Technometrics, 19, no. 2 (1977), 191-200.
21. Lachenbruch, P. A. Discriminant Analysis. New York: Hafner Press, 1975.
22. Lachenbruch, P. A., and M. R. Mickey. "Estimation of Error Rates in Discriminant Analysis." Technometrics, 10, no. 1 (1968), 1-11.
23. Mucciardi, A. N., and E. E. Gose. "A Comparison of Seven Techniques for Choosing Subsets of Pattern Recognition Properties." IEEE Trans. Computers, C20 (1971), 1023-1031.
24. Murray, G. D. "A Cautionary Note on Selection of Variables in Discriminant Analysis." Applied Statistics, 26, no. 3 (1977), 246-250.
25. Rencher, A. C. "Interpretation of Canonical Discriminant Functions, Canonical Variates and Principal Components." The American Statistician, 46 (1992), 217-225.
26. Stern, H. S. "Neural Networks in Applied Statistics." Technometrics, 38 (1996), 205-214.
27. Wald, A. "On a Statistical Problem Arising in the Classification of an Individual into One of Two Groups." Annals of Mathematical Statistics, 15 (1944), 145-162.
28. Welch, B. L. "Note on Discriminant Functions." Biometrika, 31 (1939), 218-220.
CHAPTER 12

Clustering, Distance Methods, and Ordination

12.1 INTRODUCTION
Rudimentary, exploratory procedures are often quite helpful in understanding the complex nature of multivariate relationships. For example, throughout this book, we have emphasized the value of data plots. In this chapter, we shall discuss some additional displays based on certain measures of distance and suggested step-by-step rules (algorithms) for grouping objects (variables or items).

Searching the data for a structure of "natural" groupings is an important exploratory technique. Groupings can provide an informal means for assessing dimensionality, identifying outliers, and suggesting interesting hypotheses concerning relationships.

Grouping, or clustering, is distinct from the classification methods discussed in the previous chapter. Classification pertains to a known number of groups, and the operational objective is to assign new observations to one of these groups. Cluster analysis is a more primitive technique in that no assumptions are made concerning the number of groups or the group structure. Grouping is done on the basis of similarities or distances (dissimilarities). The inputs required are similarity measures or data from which similarities can be computed.

To illustrate the nature of the difficulty in defining a natural grouping, consider sorting the 16 face cards in an ordinary deck of playing cards into clusters of similar objects. Some groupings are illustrated in Figure 12.1. It is immediately clear that meaningful partitions depend on the definition of similar.

In most practical applications of cluster analysis, the investigator knows enough about the problem to distinguish "good" groupings from "bad" groupings. Why not enumerate all possible groupings and select the "best" ones for further study? For the playing-card example, there is one way to form a single group of 16 face cards, there are 32,767 ways to partition the face cards into two groups (of varying
Figure 12.1 Grouping face cards: (a) individual cards; (b) individual suits; (c) black and red suits; (d) major and minor suits (bridge); (e) hearts plus queen of spades and other suits (hearts); (f) like face cards.
sizes), there are 7,141,686 ways to sort the face cards into three groups (of varying sizes), and so on.¹ Obviously, time constraints make it impossible to determine the best groupings of similar objects from a list of all possible structures. Even large computers are easily overwhelmed by the typically large number of cases, so one must settle for algorithms that search for good, but not necessarily the best, groupings.

To summarize, the basic objective in cluster analysis is to discover natural groupings of the items (or variables). In turn, we must first develop a quantitative scale on which to measure the association (similarity) between objects. Section 12.2 is devoted to a discussion of similarity measures. After that section, we describe a few of the more common algorithms for sorting objects into groups.

Even without the precise notion of a natural grouping, we are often able to group objects in two- or three-dimensional plots by eye. Stars and Chernoff faces,
¹ The number of ways of sorting n objects into k nonempty groups is a Stirling number of the second kind, given by $(1/k!) \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} j^n$. (See [1].) Adding these numbers for k = 1, 2, ..., n groups, we obtain the total number of possible ways to sort n objects into groups.
discussed in Section 1.4, have been used for this purpose. (See Examples 1.11 and 1.12.) Additional procedures for depicting high-dimensional observations in two dimensions such that similar objects are, in some sense, close to one another are considered in Sections 12.5-12.7.

12.2 SIMILARITY MEASURES
Most efforts to produce a rather simple group structure from a complex data set require a measure of "closeness," or "similarity." There is often a great deal of subjectivity involved in the choice of a similarity measure. Important considerations include the nature of the variables (discrete, continuous, binary), scales of measurement (nominal, ordinal, interval, ratio), and subject matter knowledge.

When items (units or cases) are clustered, proximity is usually indicated by some sort of distance. On the other hand, variables are usually grouped on the basis of correlation coefficients or like measures of association.

Distances and Similarity Coefficients for Pairs of Items
We discussed the notion of distance in Chapter 1, Section 1.5. Recall that the Euclidean (straight-line) distance between two p-dimensional observations (items) x' = [x_1, x_2, ..., x_p] and y' = [y_1, y_2, ..., y_p] is, from (1-12),

d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2} = \sqrt{(x - y)'(x - y)}     (12-1)

The statistical distance between the same two observations is of the form [see (1-23)]

d(x, y) = \sqrt{(x - y)' A (x - y)}     (12-2)

Ordinarily, A = S^{-1}, where S contains the sample variances and covariances. However, without prior knowledge of the distinct groups, these sample quantities cannot be computed. For this reason, Euclidean distance is often preferred for clustering.

Another distance measure is the Minkowski metric

d(x, y) = \left[ \sum_{i=1}^{p} |x_i - y_i|^m \right]^{1/m}     (12-3)

For m = 1, d(x, y) measures the "city-block" distance between two points in p dimensions. For m = 2, d(x, y) becomes the Euclidean distance. In general, varying m changes the weight given to larger and smaller differences.

Two additional popular measures of "distance" or dissimilarity are given by the Canberra metric and the Czekanowski coefficient. Both of these measures are defined for nonnegative variables only. We have
Canberra metric:     d(x, y) = \sum_{i=1}^{p} \frac{|x_i - y_i|}{x_i + y_i}     (12-4)

Czekanowski coefficient:     d(x, y) = 1 - \frac{2 \sum_{i=1}^{p} \min(x_i, y_i)}{\sum_{i=1}^{p} (x_i + y_i)}     (12-5)
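The distance measures (12-1) through (12-5) translate directly into code. The following is an illustrative sketch, not from the text (the function names are my own); the Canberra sum skips coordinates where x_i + y_i = 0, which is an assumed convention for handling that undefined ratio:

```python
import math

def euclidean(x, y):
    # Euclidean distance (12-1)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def minkowski(x, y, m):
    # Minkowski metric (12-3): m = 1 is city-block, m = 2 is Euclidean
    return sum(abs(xi - yi) ** m for xi, yi in zip(x, y)) ** (1.0 / m)

def canberra(x, y):
    # Canberra metric (12-4), for nonnegative variables; terms with
    # x_i + y_i = 0 are skipped (assumed convention, not from the text)
    return sum(abs(xi - yi) / (xi + yi) for xi, yi in zip(x, y) if xi + yi > 0)

def czekanowski(x, y):
    # Czekanowski dissimilarity coefficient (12-5), for nonnegative variables
    return 1 - 2 * sum(min(xi, yi) for xi, yi in zip(x, y)) / sum(xi + yi for xi, yi in zip(x, y))
```

For example, `minkowski([0, 0], [3, 4], 2)` agrees with `euclidean([0, 0], [3, 4])`, both giving 5, while the city-block version `minkowski([0, 0], [3, 4], 1)` gives 7.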
Whenever possible, it is advisable to use "true" distances, that is, distances satisfying the distance properties of (1-25), for clustering objects. On the other hand, most clustering algorithms will accept subjectively assigned distance numbers that may not satisfy, for example, the triangle inequality. The following example illustrates how rudimentary groupings can be formed by simply reorganizing the elements of the distance matrix.

Example 12.1 (Clustering by shading a distance matrix)
Table 12.1 gives the Euclidean distances between pairs of 22 U.S. public utility companies, based on the data listed in Table 12.5 after they have been standardized. Because the distance matrix is large, it is difficult to visually select firms that are close together (similar). However, the graphical method of shading allows us to pick out clusters of similar firms quite easily.

First, distances are arranged into several classes (for instance, 15 or fewer), based on their magnitudes. Next, all distances within a given class are replaced by a common symbol with a certain shade of gray. Darker symbols correspond to smaller distances. Finally, the distance matrix is reorganized so that items with common symbols appear in contiguous locations along the main diagonal. Groups of similar items correspond to patches of dark shadings.

From Figure 12.2 on page 673, we see that firms 1, 18, 19, and 14 form a group, firms 22, 10, 13, 20, and 4 form a group, firms 9 and 3 form a group, firms 3 and 6 form a group, and so forth. The groups (9, 3) and (3, 6) overlap, as do other groups in the diagram. Firms 11, 5, and 17 appear to stand alone. •

When items cannot be represented by meaningful p-dimensional measurements, pairs of items are often compared on the basis of the presence or absence of certain characteristics. Similar items have more characteristics in common than do dissimilar items. The presence or absence of a characteristic can be described mathematically by introducing a binary variable, which assumes the value 1 if the characteristic is present and the value 0 if the characteristic is absent. For p = 5 binary variables, for instance, the "scores" for two items i and k might be arranged as follows:
              Variables
            1    2    3    4    5
Item i      1    0    0    1    1
Item k      1    1    0    1    0
TABLE 12.1 DISTANCES BETWEEN 22 UTILITIES

[22 × 22 symmetric matrix of pairwise Euclidean distances between the utility firms, computed from the standardized data of Table 12.5; for example, the distance between firms 1 and 2 is 3.10. The full matrix is not reproduced here.]
Figure 12.2 Shaded distance matrix for 22 utilities. [In the shaded matrix the firms appear in the order 1, 18, 19, 14, 9, 3, 6, 22, 10, 13, 20, 4, 7, 12, 21, 15, 2, 11, 16, 8, 5, 17, and the distances are coded into eight classes with boundaries 2.563, 2.994, 3.424, 3.661, 3.963, 4.135, and 4.458; darker symbols correspond to smaller distances.]
In this case, there are two 1-1 matches, one 0-0 match, and two mismatches. Let x_{ij} be the score (1 or 0) of the jth binary variable on the ith item and x_{kj} be the score (again, 1 or 0) of the jth variable on the kth item, j = 1, 2, ..., p. Consequently,

(x_{ij} - x_{kj})^2 = 0 if x_{ij} = x_{kj} = 1 or x_{ij} = x_{kj} = 0, and 1 if x_{ij} ≠ x_{kj}     (12-6)

and the squared Euclidean distance, \sum_{j=1}^{p} (x_{ij} - x_{kj})^2, provides a count of the number of mismatches. A large distance corresponds to many mismatches, that is, dissimilar items. From the preceding display, the square of the distance between items i and k would be

\sum_{j=1}^{5} (x_{ij} - x_{kj})^2 = (1 - 1)^2 + (0 - 1)^2 + (0 - 0)^2 + (1 - 1)^2 + (1 - 0)^2 = 2

Although a distance based on (12-6) might be used to measure similarity, it suffers from weighting the 1-1 and 0-0 matches equally. In some cases, a 1-1 match is a stronger indication of similarity than a 0-0 match. For instance, in grouping people, the evidence that two persons both read ancient Greek is stronger evidence of
similarity than the absence of this ability. Thus, it might be reasonable to discount the 0-0 matches or even disregard them completely. To allow for differential treatment of the 1-1 matches and the 0-0 matches, several schemes for defining similarity coefficients have been suggested.

To introduce these schemes, let us arrange the frequencies of matches and mismatches for items i and k in the form of a contingency table:

                          Item k
                        1       0       Totals
Item i        1         a       b       a + b
              0         c       d       c + d
Totals              a + c   b + d       p = a + b + c + d        (12-7)

In this table, a represents the frequency of 1-1 matches, b is the frequency of 1-0 matches, and so forth. Given the foregoing five pairs of binary outcomes, a = 2 and b = c = d = 1.

Table 12.2 lists common similarity coefficients defined in terms of the frequencies in (12-7). A short rationale follows each definition.
TABLE 12.2 SIMILARITY COEFFICIENTS FOR CLUSTERING ITEMS*

Coefficient                               Rationale
1. (a + d)/p                              Equal weights for 1-1 matches and 0-0 matches.
2. 2(a + d)/[2(a + d) + b + c]            Double weight for 1-1 matches and 0-0 matches.
3. (a + d)/[a + d + 2(b + c)]             Double weight for unmatched pairs.
4. a/p                                    No 0-0 matches in numerator.
5. a/(a + b + c)                          No 0-0 matches in numerator or denominator. (The 0-0 matches are treated as irrelevant.)
6. 2a/(2a + b + c)                        No 0-0 matches in numerator or denominator. Double weight for 1-1 matches.
7. a/[a + 2(b + c)]                       No 0-0 matches in numerator or denominator. Double weight for unmatched pairs.
8. a/(b + c)                              Ratio of matches to mismatches with 0-0 matches excluded.

*[p binary variables; see (12-7).]
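The frequencies a, b, c, d of (12-7) and the coefficients of Table 12.2 are easy to compute from two binary score vectors. A minimal sketch (the function names are mine, not the text's):

```python
def match_counts(x_i, x_k):
    # Frequencies of (12-7): a = 1-1 matches, b = 1-0, c = 0-1, d = 0-0
    a = sum(1 for u, v in zip(x_i, x_k) if (u, v) == (1, 1))
    b = sum(1 for u, v in zip(x_i, x_k) if (u, v) == (1, 0))
    c = sum(1 for u, v in zip(x_i, x_k) if (u, v) == (0, 1))
    d = sum(1 for u, v in zip(x_i, x_k) if (u, v) == (0, 0))
    return a, b, c, d

def coefficient_1(x_i, x_k):
    # Coefficient 1 of Table 12.2: (a + d)/p
    a, b, c, d = match_counts(x_i, x_k)
    return (a + d) / (a + b + c + d)
```

For the five binary scores displayed earlier, item i = (1, 0, 0, 1, 1) and item k = (1, 1, 0, 1, 0), `match_counts` returns a = 2 and b = c = d = 1, matching the text.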
Coefficients 1, 2, and 3 in the table are monotonically related. Suppose coefficient 1 is calculated for two contingency tables, Table I and Table II. Then if (a_I + d_I)/p ≥ (a_II + d_II)/p, we also have 2(a_I + d_I)/[2(a_I + d_I) + b_I + c_I] ≥ 2(a_II + d_II)/[2(a_II + d_II) + b_II + c_II], and coefficient 3 will be at least as large for Table I as it is for Table II. (See Exercise 12.4.) Coefficients 5, 6, and 7 also retain their relative orders.

Monotonicity is important, because some clustering procedures are not affected if the definition of similarity is changed in a manner that leaves the relative orderings of similarities unchanged. The single linkage and complete linkage hierarchical procedures discussed in Section 12.3 are not affected. For these methods, any choice of the coefficients 1, 2, and 3 in Table 12.2 will produce the same groupings. Similarly, any choice of the coefficients 5, 6, and 7 will yield identical groupings.
Example 12.2 (Calculating the values of a similarity coefficient)
Suppose five individuals possess the following characteristics:

               Height   Weight   Eye color   Hair color   Handedness   Gender
Individual 1   68 in    140 lb   green       blond        right        female
Individual 2   73 in    185 lb   brown       brown        right        male
Individual 3   67 in    165 lb   blue        blond        right        male
Individual 4   64 in    120 lb   brown       brown        right        female
Individual 5   76 in    210 lb   brown       brown        left         male

Define six binary variables X_1, X_2, X_3, X_4, X_5, X_6 as

X_1 = 1 if height ≥ 72 in, 0 if height < 72 in
X_2 = 1 if weight ≥ 150 lb, 0 if weight < 150 lb
X_3 = 1 if brown eyes, 0 otherwise
X_4 = 1 if blond hair, 0 if not blond hair
X_5 = 1 if right handed, 0 if left handed
X_6 = 1 if female, 0 if male

The scores for individuals 1 and 2 on the p = 6 binary variables are

               X_1   X_2   X_3   X_4   X_5   X_6
Individual 1    0     0     0     1     1     1
Individual 2    1     1     1     0     1     0

and the number of matches and mismatches are indicated in the two-way array

                          Individual 2
                         1     0     Totals
Individual 1      1      1     2     3
                  0      3     0     3
Totals                   4     2     6
676
Chapter 1 2
Cluster i n g, D i stance Methods, and Ord i n ation
Employing similarity coefficient 1, which gives equal weight to matches, we compute

(a + d)/p = (1 + 0)/6 = 1/6

Continuing with similarity coefficient 1, we calculate the remaining similarity numbers for pairs of individuals. These are displayed in the 5 × 5 symmetric matrix

                          Individual
                   1      2      3      4      5
Individual  1      1
            2     1/6     1
            3     4/6    3/6     1
            4     4/6    3/6    2/6     1
            5      0     5/6    2/6    2/6     1

Based on the magnitudes of the similarity coefficient, we should conclude that individuals 2 and 5 are most similar and individuals 1 and 5 are least similar. Other pairs fall between these extremes. If we were to divide the individuals into two relatively homogeneous subgroups on the basis of the similarity numbers, we might form the subgroups (1 3 4) and (2 5).

Note that X_3 = 0 implies an absence of brown eyes, so that two people, one with blue eyes and one with green eyes, will yield a 0-0 match. Consequently, it may be inappropriate to use similarity coefficient 1, 2, or 3 because these coefficients give the same weights to 1-1 and 0-0 matches. •
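The 5 × 5 matrix of coefficient-1 similarities can be reproduced by machine. This is an illustrative sketch (the `scores` dictionary encodes X_1 through X_6 for each individual, derived from the definitions in the example):

```python
from fractions import Fraction

# Binary scores (X1, ..., X6) for the five individuals of Example 12.2
scores = {
    1: (0, 0, 0, 1, 1, 1),
    2: (1, 1, 1, 0, 1, 0),
    3: (0, 1, 0, 1, 1, 0),
    4: (0, 0, 1, 0, 1, 1),
    5: (1, 1, 1, 0, 0, 0),
}

def similarity_1(x, y):
    # Coefficient 1 of Table 12.2: proportion of matching positions, (a + d)/p
    return Fraction(sum(u == v for u, v in zip(x, y)), len(x))

sim = {(i, k): similarity_1(scores[i], scores[k])
       for i in scores for k in scores if i < k}
```

Consistent with the example, `sim[(2, 5)]` is 5/6 (the most similar pair) and `sim[(1, 5)]` is 0 (the least similar pair).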
We have described the construction of distances and similarities. It is always possible to construct similarities from distances. For example, we might set

s_{ik} = 1/(1 + d_{ik})     (12-8)

where 0 < s_{ik} < 1 is the similarity between items i and k and d_{ik} is the corresponding distance.

However, distances that must satisfy (1-25) cannot always be constructed from similarities. As Gower [9, 10] has shown, this can be done only if the matrix of similarities is nonnegative definite. With the nonnegative definite condition, and with the maximum similarity scaled so that s_{ii} = 1,

d_{ik} = \sqrt{2(1 - s_{ik})}     (12-9)

has the properties of a distance.

Similarities and Association Measures for Pairs of Variables
Thus far, we have discussed similarity measures for items. In some applications, it is the variables, rather than the items, that must be grouped. Similarity measures for variables often take the form of sample correlation coefficients. Moreover, in some clustering applications, negative correlations are replaced by their absolute values.
When the variables are binary, the data can again be arranged in the form of a contingency table. This time, however, the variables, rather than the items, delineate the categories. For each pair of variables, there are n items categorized in the table. With the usual 0 and 1 coding, the table becomes as follows:

                          Variable k
                        1       0       Totals
Variable i      1       a       b       a + b
                0       c       d       c + d
Totals              a + c   b + d       n = a + b + c + d        (12-10)

For instance, variable i equals 1 and variable k equals 0 for b of the n items. The usual product moment correlation formula applied to the binary variables in the contingency table of (12-10) gives (see Exercise 12.3)

r = (ad - bc) / [(a + b)(c + d)(a + c)(b + d)]^{1/2}     (12-11)
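As a quick sketch of (12-11) (an illustrative helper, not from the text):

```python
import math

def binary_correlation(a, b, c, d):
    # Product moment correlation (12-11) for two binary variables,
    # computed from the cell frequencies of the contingency table (12-10).
    # Assumes no row or column total is zero (otherwise r is undefined).
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
```

Perfect agreement (b = c = 0) gives r = 1, and perfect disagreement (a = d = 0) gives r = -1.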
This number can be taken as a measure of the similarity between the two variables.

The correlation coefficient in (12-11) is related to the chi-square statistic (r² = χ²/n) for testing the independence of two categorical variables. For n fixed, a large similarity (or correlation) is consistent with the absence of independence.

Given the table in (12-10), measures of association (or similarity) exactly analogous to the ones listed in Table 12.2 can be developed. The only change required is the substitution of n (the number of items) for p (the number of variables).

Concluding Comments on Similarity
To summarize this section, we note that there are many ways to measure the similarity between pairs of objects. It appears that most practitioners use distances [see (12-1) through (12-5)] or the coefficients in Table 12.2 to cluster items and correlations to cluster variables. However, at times, inputs to clustering algorithms may be simple frequencies.

Example 12.3 (Measuring the similarities of 11 languages)
The meanings of words change with the course of history. However, the meaning of the numbers 1, 2, 3, ... represents one conspicuous exception. Thus, a first comparison of languages might be based on the numerals alone. Table 12.3 gives the first 10 numbers in English, Polish, Hungarian, and eight other modern European languages. (Only languages that use the Roman alphabet are considered, and accent marks, cedillas, diereses, etc., are omitted.) A cursory examination of the spelling of the numerals in the table suggests that the first five languages (English, Norwegian, Danish, Dutch, and German) are very much alike. French, Spanish, and Italian are in even closer agreement. Hungarian and Finnish seem to stand by themselves, and Polish has some of the characteristics of the languages in each of the larger subgroups.
TABLE 12.3 NUMERALS IN 11 LANGUAGES

     English  Norwegian  Danish  Dutch  German  French  Spanish  Italian  Polish    Hungarian  Finnish
     (E)      (N)        (Da)    (Du)   (G)     (Fr)    (Sp)     (I)      (P)       (H)        (Fi)
1    one      en         en      een    eins    un      uno      uno      jeden     egy        yksi
2    two      to         to      twee   zwei    deux    dos      due      dwa       ketto      kaksi
3    three    tre        tre     drie   drei    trois   tres     tre      trzy      harom      kolme
4    four     fire       fire    vier   vier    quatre  cuatro   quattro  cztery    negy       nelja
5    five     fem        fem     vijf   funf    cinq    cinco    cinque   piec      ot         viisi
6    six      seks       seks    zes    sechs   six     seis     sei      szesc     hat        kuusi
7    seven    sju        syv     zeven  sieben  sept    siete    sette    siedem    het        seitseman
8    eight    atte       otte    acht   acht    huit    ocho     otto     osiem     nyolc      kahdeksan
9    nine     ni         ni      negen  neun    neuf    nueve    nove     dziewiec  kilenc     yhdeksan
10   ten      ti         ti      tien   zehn    dix     diez     dieci    dziesiec  tiz        kymmenen
The words for 1 in French, Spanish, and Italian all begin with u. For illustrative purposes, we might compare languages by looking at the first letters of the numbers. We call the words for the same number in two different languages concordant if they have the same first letter and discordant if they do not. From Table 12.3, the table of concordances (frequencies of matching first initials) for the numbers 1-10 is given in Table 12.4. We see that English and Norwegian have the same first letter for 8 of the 10 word pairs. The remaining frequencies were calculated in the same manner.

TABLE 12.4 CONCORDANT FIRST LETTERS FOR NUMBERS IN 11 LANGUAGES
       E    N    Da   Du   G    Fr   Sp   I    P    H    Fi
E     10
N      8   10
Da     8    9   10
Du     3    5    4   10
G      4    6    5    5   10
Fr     4    4    4    1    3   10
Sp     4    4    5    1    3    8   10
I      4    4    5    1    3    9    9   10
P      3    3    4    0    2    5    7    6   10
H      1    2    2    2    1    0    0    0    0   10
Fi     1    1    1    1    1    1    1    1    1    2   10
The results in Table 12.4 confirm our initial visual impression of Table 12.3. That is, English, Norwegian, Danish, Dutch, and German seem to form a group. French, Spanish, Italian, and Polish might be grouped together, whereas Hungarian and Finnish appear to stand alone. •

In our examples so far, we have used our visual impression of similarity or distance measures to form groups. We now discuss less subjective schemes for creating clusters.

12.3 HIERARCHICAL CLUSTERING METHODS
We can rarely examine all grouping possibilities, even with the largest and fastest computers. Because of this problem, a wide variety of clustering algorithms have emerged that find "reasonable" clusters without having to look at all configurations.

Hierarchical clustering techniques proceed by either a series of successive mergers or a series of successive divisions. Agglomerative hierarchical methods start with the individual objects. Thus, there are initially as many clusters as objects. The most similar objects are first grouped, and these initial groups are merged according to their similarities. Eventually, as the similarity decreases, all subgroups are fused into a single cluster.
Divisive hierarchical methods work in the opposite direction. An initial single group of objects is divided into two subgroups such that the objects in one subgroup are "far from" the objects in the other. These subgroups are then further divided into dissimilar subgroups; the process continues until there are as many subgroups as objects, that is, until each object forms a group.

The results of both agglomerative and divisive methods may be displayed in the form of a two-dimensional diagram known as a dendrogram. As we shall see, the dendrogram illustrates the mergers or divisions that have been made at successive levels.

In this section we shall concentrate on agglomerative hierarchical procedures and, in particular, linkage methods. Excellent elementary discussions of divisive hierarchical procedures and other agglomerative techniques are available in [3] and [8].

Linkage methods are suitable for clustering items, as well as variables. This is not true for all hierarchical agglomerative procedures. We shall discuss, in turn, single linkage (minimum distance or nearest neighbor), complete linkage (maximum distance or farthest neighbor), and average linkage (average distance). The merging of clusters under the three linkage criteria is illustrated schematically in Figure 12.3.

From the figure, we see that single linkage results when groups are fused according to the distance between their nearest members. Complete linkage occurs when groups are fused according to the distance between their farthest members. For average linkage, groups are fused according to the average distance between pairs of members in the respective sets.
Figure 12.3 Intercluster distance (dissimilarity) for (a) single linkage, (b) complete linkage, and (c) average linkage. [For the average linkage panel, the distance between clusters {1, 2} and {3, 4, 5} is (d_{13} + d_{14} + d_{15} + d_{23} + d_{24} + d_{25})/6.]
The following are the steps in the agglomerative hierarchical clustering algorithm for grouping N objects (items or variables):

1. Start with N clusters, each containing a single entity, and an N × N symmetric matrix of distances (or similarities) D = {d_{ik}}.
2. Search the distance matrix for the nearest (most similar) pair of clusters. Let the distance between "most similar" clusters U and V be d_{UV}.
3. Merge clusters U and V. Label the newly formed cluster (UV). Update the entries in the distance matrix by (a) deleting the rows and columns corresponding to clusters U and V and (b) adding a row and column giving the distances between cluster (UV) and the remaining clusters.
4. Repeat Steps 2 and 3 a total of N − 1 times. (All objects will be in a single cluster after the algorithm terminates.) Record the identity of clusters that are merged and the levels (distances or similarities) at which the mergers take place.     (12-12)

The ideas behind any clustering procedure are probably best conveyed through examples, which we shall present after brief discussions of the input and algorithmic components of the linkage methods.
Single Linkage
The inputs to a single linkage algorithm can be distances or similarities between pairs of objects. Groups are formed from the individual entities by merging nearest neighbors, where the term nearest neighbor connotes the smallest distance or largest similarity.

Initially, we must find the smallest distance in D = {d_{ik}} and merge the corresponding objects, say, U and V, to get the cluster (UV). For Step 3 of the general algorithm of (12-12), the distances between (UV) and any other cluster W are computed by

d_{(UV)W} = min{d_{UW}, d_{VW}}     (12-13)

Here the quantities d_{UW} and d_{VW} are the distances between the nearest neighbors of clusters U and W and clusters V and W, respectively.

The results of single linkage clustering can be graphically displayed in the form of a dendrogram, or tree diagram. The branches in the tree represent clusters. The branches come together (merge) at nodes whose positions along a distance (or similarity) axis indicate the level at which the fusions occur. Dendrograms for some specific cases are considered in the following examples.

Example 12.4 (Clustering using single linkage)

To illustrate the single linkage algorithm, we consider the hypothetical distances between pairs of five objects as follows:

       1    2    3    4    5
1      0
2      9    0
3      3    7    0
4      6    5    9    0
5     11   10    2    8    0
Treating each object as a cluster, we commence clustering by merging the two closest items. Since

min_{i,k}(d_{ik}) = d_{53} = 2

objects 5 and 3 are merged to form the cluster (35). To implement the next level of clustering, we need the distances between the cluster (35) and the remaining objects, 1, 2, and 4. The nearest neighbor distances are

d_{(35)1} = min{d_{31}, d_{51}} = min{3, 11} = 3
d_{(35)2} = min{d_{32}, d_{52}} = min{7, 10} = 7
d_{(35)4} = min{d_{34}, d_{54}} = min{9, 8} = 8

Deleting the rows and columns of D corresponding to objects 3 and 5, and adding a row and column for the cluster (35), we obtain the new distance matrix

        (35)    1    2    4
(35)     0
1        3      0
2        7      9    0
4        8      6    5    0

The smallest distance between pairs of clusters is now d_{(35)1} = 3, and we merge cluster (1) with cluster (35) to get the next cluster, (135). Calculating

d_{(135)2} = min{d_{(35)2}, d_{12}} = min{7, 9} = 7
d_{(135)4} = min{d_{(35)4}, d_{14}} = min{8, 6} = 6

we find that the distance matrix for the next level of clustering is

         (135)   2    4
(135)      0
2          7     0
4          6     5    0

The minimum nearest neighbor distance between pairs of clusters is d_{42} = 5, and we merge objects 4 and 2 to get the cluster (24). At this point we have two distinct clusters, (135) and (24). Their nearest neighbor distance is

d_{(135)(24)} = min{d_{(135)2}, d_{(135)4}} = min{7, 6} = 6

The final distance matrix becomes

         (135)  (24)
(135)      0
(24)       6     0

Consequently, clusters (135) and (24) are merged to form a single cluster of all five objects, (12345), when the nearest neighbor distance reaches 6. The dendrogram picturing the hierarchical clustering just concluded is shown in Figure 12.4. The groupings and the distance levels at which they occur are clearly illustrated by the dendrogram. •
Figure 12.4 Single linkage dendrogram for distances between five objects.
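The merging steps just traced can be checked with a short program. The following Python sketch is our own illustrative code (the function and variable names are not from the text); it applies the single linkage rule to the hypothetical 5 x 5 distance matrix:

```python
# Single linkage on the hypothetical distances between five objects.
# The dictionary stores the lower triangle of the symmetric matrix D.
D = {(1, 2): 9, (1, 3): 3, (1, 4): 6, (1, 5): 11,
     (2, 3): 7, (2, 4): 5, (2, 5): 10,
     (3, 4): 9, (3, 5): 2, (4, 5): 8}

def dist(a, b):
    """Distance between objects a and b, regardless of index order."""
    return D[(a, b)] if (a, b) in D else D[(b, a)]

def single_linkage(items):
    """Repeatedly merge the two clusters with the smallest
    nearest-neighbor (minimum) distance; return the merge history."""
    clusters = [frozenset([i]) for i in items]
    history = []
    while len(clusters) > 1:
        u, v, h = min(
            ((a, b, min(dist(i, k) for i in a for k in b))
             for n, a in enumerate(clusters) for b in clusters[n + 1:]),
            key=lambda t: t[2])
        clusters = [c for c in clusters if c not in (u, v)] + [u | v]
        history.append((sorted(u), sorted(v), h))
    return history

history = single_linkage([1, 2, 3, 4, 5])
for u, v, h in history:
    print(u, "+", v, "merged at distance", h)
```

The merge levels agree with the hand computation: (35) at level 2, then (135) at 3, then (24) at 5, and the single cluster (12345) at 6, as in Figure 12.4.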
In typical applications of hierarchical clustering, the intermediate results, where the objects are sorted into a moderate number of clusters, are of chief interest.

Example 12.5 (Single linkage clustering of 11 languages)
Consider the array of concordances in Table 12.4 representing the closeness between the numbers 1-10 in 11 languages. To develop a matrix of distances, we subtract the concordances from the perfect agreement figure of 10 that each language has with itself. The subsequent assignments of distances are

          E    N   Da   Du    G   Fr   Sp    I    P    H   Fi
    E     0
    N     2    0
    Da    2    1    0
    Du    7    5    6    0
    G     6    4    5    5    0
    Fr    6    6    6    9    7    0
    Sp    6    6    5    9    7    2    0
    I     6    6    5    9    7    1    1    0
    P     7    7    6   10    8    5    3    4    0
    H     9    8    8    8    9   10   10   10   10    0
    Fi    9    9    9    9    9    9    9    9    9    8    0

We first search for the minimum distance between pairs of languages (clusters). The minimum distance, 1, occurs between Danish and Norwegian, Italian and French, and Italian and Spanish. Numbering the languages in the order in which they appear across the top of the array, we have
d_32 = 1, d_86 = 1, and d_87 = 1. Since d_76 = 2, we can merge only clusters 8 and 6 or clusters 8 and 7. We cannot merge clusters 6, 7, and 8 at level 1. We choose first to merge 6 and 8, and then to update the distance matrix and merge 2 and 3 to obtain the clusters (68) and (23). Subsequent computer calculations produce the dendrogram in Figure 12.5.
Figure 12.5 Single linkage dendrogram for distances between numbers in 11 languages.
From the dendrogram, we see that Norwegian and Danish, and also French and Italian, cluster at the minimum distance (maximum similarity) level. When the allowable distance is increased, English is added to the Norwegian-Danish group, and Spanish merges with the French-Italian group. Notice that Hungarian and Finnish are more similar to each other than to the other clusters of languages. However, these two clusters (languages) do not merge until the distance between nearest neighbors has increased substantially. Finally, all the clusters of languages are merged into a single cluster at the largest nearest neighbor distance, 9. •

Since single linkage joins clusters by the shortest link between them, the technique cannot discern poorly separated clusters. [See Figure 12.6(a).] On the other hand, single linkage is one of the few clustering methods that can delineate nonellipsoidal clusters. The tendency of single linkage to pick out long stringlike clusters is known as chaining. [See Figure 12.6(b).] Chaining can be misleading if items at opposite ends of the chain are, in fact, quite dissimilar.

The clusters formed by the single linkage method will be unchanged by any assignment of distance (similarity) that gives the same relative orderings as the initial distances (similarities). In particular, any one of a set of similarity coefficients from Table 12.2 that are monotonic to one another will produce the same clustering.

Figure 12.6 Single linkage clusters: (a) single linkage confused by near overlap of elliptical configurations; (b) chaining effect.
Complete Linkage

Complete linkage clustering proceeds in much the same manner as single linkage clustering, with one important exception: At each stage, the distance (similarity) between clusters is determined by the distance (similarity) between the two elements, one from each cluster, that are most distant. Thus, complete linkage ensures that all items in a cluster are within some maximum distance (or minimum similarity) of each other.

The general agglomerative algorithm again starts by finding the minimum entry in D = {d_ik} and merging the corresponding objects, such as U and V, to get cluster (UV). For Step 3 of the general algorithm in (12-12), the distances between (UV) and any other cluster W are computed by

    d_(UV)W = max{d_UW, d_VW}        (12-14)

Here d_UW and d_VW are the distances between the most distant members of clusters U and W and clusters V and W, respectively.

Example 12.6 (Clustering using complete linkage)
Let us return to the distance matrix introduced in Example 12.4:

                      1    2    3    4    5
                 1    0
    D = {d_ik} = 2    9    0
                 3    3    7    0
                 4    6    5    9    0
                 5   11   10    2    8    0
At the first stage, objects 3 and 5 are merged, since they are most similar. This gives the cluster (35). At stage 2, we compute

    d_(35)1 = max{d_31, d_51} = max{3, 11} = 11
    d_(35)2 = max{d_32, d_52} = max{7, 10} = 10
    d_(35)4 = max{d_34, d_54} = max{9, 8} = 9

and the modified distance matrix becomes

          (35)   1    2    4
    (35)    0
    1      11    0
    2      10    9    0
    4       9    6    5    0
The next merger occurs between the most similar groups, 2 and 4, to give the cluster (24). At stage 3, we have

    d_(24)(35) = max{d_2(35), d_4(35)} = max{10, 9} = 10
    d_(24)1    = max{d_21, d_41} = max{9, 6} = 9
and the distance matrix

           (35)  (24)   1
    (35)     0
    (24)    10     0
    1       11     9    0

The next merger produces the cluster (124). At the final stage, the groups (35) and (124) are merged as the single cluster (12345) at level

    d_(124)(35) = max{d_1(35), d_(24)(35)} = max{11, 10} = 11

The dendrogram is given in Figure 12.7. •
Figure 12.7 Complete linkage dendrogram for distances between five objects.
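The complete linkage computations of Example 12.6 can be checked with the same kind of sketch, replacing the nearest-neighbor rule by the farthest-neighbor rule (12-14). This is illustrative code of our own, not from the text; `agglomerate` is simply the general agglomerative loop with the inter-cluster distance passed in as a function:

```python
# Complete linkage on the 5-object distance matrix of Example 12.4.
# Rule (12-14): the distance between clusters is that of their two
# MOST DISTANT members, one taken from each cluster.
D = {(1, 2): 9, (1, 3): 3, (1, 4): 6, (1, 5): 11,
     (2, 3): 7, (2, 4): 5, (2, 5): 10,
     (3, 4): 9, (3, 5): 2, (4, 5): 8}

def dist(a, b):
    return D[(a, b)] if (a, b) in D else D[(b, a)]

def agglomerate(items, between):
    """General agglomerative loop of (12-12): repeatedly merge the two
    clusters with the smallest inter-cluster distance `between`."""
    clusters = [frozenset([i]) for i in items]
    merges = []
    while len(clusters) > 1:
        u, v, h = min(
            ((a, b, between(a, b))
             for n, a in enumerate(clusters) for b in clusters[n + 1:]),
            key=lambda t: t[2])
        clusters = [c for c in clusters if c not in (u, v)] + [u | v]
        merges.append((sorted(u), sorted(v), h))
    return merges

complete = agglomerate(
    [1, 2, 3, 4, 5],
    lambda x, y: max(dist(i, k) for i in x for k in y))
print([h for *_, h in complete])   # → [2, 5, 9, 11]
```

The merge levels 2, 5, 9, 11 match the hand computation: (35), then (24), then (124), and finally (12345) at level 11, as in Figure 12.7.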
Comparing Figures 12.4 and 12.7, we see that the dendrograms for single linkage and complete linkage differ in the allocation of object 1 to previous groups.

Example 12.7 (Complete linkage clustering of 11 languages)

In Example 12.5, we presented a distance matrix for numbers in 11 languages. The complete linkage clustering algorithm applied to this distance matrix produces the dendrogram shown in Figure 12.8.

Comparing Figures 12.8 and 12.5, we see that both hierarchical methods yield the English-Norwegian-Danish and the French-Italian-Spanish language groups. Polish is merged with French-Italian-Spanish at an intermediate level. In addition, both methods merge Hungarian and Finnish only at the penultimate stage.
Figure 12.8 Complete linkage dendrogram for distances between numbers in 11 languages.
However, the two methods handle German and Dutch differently. Single linkage merges German and Dutch at an intermediate distance, and these two languages remain a cluster until the final merger. Complete linkage merges German with the English-Norwegian-Danish group at an intermediate level. Dutch remains a cluster by itself until it is merged with the English-Norwegian-Danish-German and French-Italian-Spanish-Polish groups at a higher distance level. The final complete linkage merger involves two clusters. The final merger in single linkage involves three clusters. •

Example 12.8 (Clustering variables using complete linkage)
Data collected on 22 U.S. public utility companies for the year 1975 are listed in Table 12.5. Although it is more interesting to group companies, we shall see

TABLE 12.5 PUBLIC UTILITY DATA (1975)

    Company                                 X1     X2    X3    X4    X5     X6    X7     X8
     1. Arizona Public Service             1.06    9.2  151   54.4   1.6   9077   0.    .628
     2. Boston Edison Co.                   .89   10.3  202   57.9   2.2   5088  25.3  1.555
     3. Central Louisiana Electric Co.     1.43   15.4  113   53.0   3.4   9212   0.   1.058
     4. Commonwealth Edison Co.            1.02   11.2  168   56.0    .3   6423  34.3   .700
     5. Consolidated Edison Co. (N.Y.)     1.49    8.8  192   51.2   1.0   3300  15.6  2.044
     6. Florida Power & Light Co.          1.32   13.5  111   60.0  -2.2  11127  22.5  1.241
     7. Hawaiian Electric Co.              1.22   12.2  175   67.6   2.2   7642   0.   1.652
     8. Idaho Power Co.                    1.10    9.2  245   57.0   3.3  13082   0.    .309
     9. Kentucky Utilities Co.             1.34   13.0  168   60.4   7.2   8406   0.    .862
    10. Madison Gas & Electric Co.         1.12   12.4  197   53.0   2.7   6455  39.2   .623
    11. Nevada Power Co.                    .75    7.5  173   51.5   6.5  17441   0.    .768
    12. New England Electric Co.           1.13   10.9  178   62.0   3.7   6154   0.   1.897
    13. Northern States Power Co.          1.15   12.7  199   53.7   6.4   7179  50.2   .527
    14. Oklahoma Gas & Electric Co.        1.09   12.0   96   49.8   1.4   9673   0.    .588
    15. Pacific Gas & Electric Co.          .96    7.6  164   62.2  -0.1   6468    .9  1.400
    16. Puget Sound Power & Light Co.      1.16    9.9  252   56.0   9.2  15991   0.    .620
    17. San Diego Gas & Electric Co.        .76    6.4  136   61.9   9.0   5714   8.3  1.920
    18. The Southern Co.                   1.05   12.6  150   56.7   2.7  10140   0.   1.108
    19. Texas Utilities Co.                1.16   11.7  104   54.0  -2.1  13507   0.    .636
    20. Wisconsin Electric Power Co.       1.20   11.8  148   59.9   3.5   7287  41.1   .702
    21. United Illuminating Co.            1.04    8.6  204   61.0   3.5   6650   0.   2.116
    22. Virginia Electric & Power Co.      1.07    9.3  174   54.3   5.9  10093  26.6  1.306

KEY: X1: Fixed-charge coverage ratio (income/debt). X2: Rate of return on capital. X3: Cost per KW capacity in place. X4: Annual load factor. X5: Peak kWh demand growth from 1974 to 1975. X6: Sales (kWh use per year). X7: Percent nuclear. X8: Total fuel costs (cents per kWh). Source: Data courtesy of H. E. Thompson.
here how the complete linkage algorithm can be used to cluster variables. We measure the similarity between pairs of variables by the product-moment correlation coefficient. The correlation matrix is given in Table 12.6.

TABLE 12.6 CORRELATIONS BETWEEN PAIRS OF VARIABLES (PUBLIC UTILITY DATA)

           X1      X2      X3      X4      X5      X6      X7      X8
    X1   1.000
    X2    .643   1.000
    X3   -.103   -.348   1.000
    X4   -.082   -.086    .100   1.000
    X5   -.259   -.260    .435    .034   1.000
    X6   -.152   -.010    .028   -.288    .176   1.000
    X7    .045    .211    .115   -.164   -.019   -.374   1.000
    X8   -.013   -.328    .005    .486   -.007   -.561   -.185   1.000
When the sample correlations are used as similarity measures, variables with large negative correlations are regarded as very dissimilar; variables with large positive correlations are regarded as very similar. In this case, the "distance" between clusters is measured as the smallest similarity between members of the corresponding clusters. The complete linkage algorithm, applied to the foregoing similarity matrix, yields the dendrogram in Figure 12.9.

Figure 12.9 Complete linkage dendrogram for similarities among eight utility company variables.
We see that variables 1 and 2 (fixed-charge coverage ratio and rate of return on capital), variables 4 and 8 (annual load factor and total fuel costs), and variables 3 and 5 (cost per kilowatt capacity in place and peak kilowatt-hour demand growth) cluster at intermediate "similarity" levels. Variables 7 (percent nuclear) and 6 (sales) remain by themselves until the final stages. The final merger brings together the (12478) group and the (356) group. •
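The variable-clustering computation can be sketched in Python (our own illustrative code, not from the text). With similarities instead of distances, complete linkage defines the similarity between two clusters as the smallest member-to-member correlation, and at each stage merges the pair of clusters whose similarity is largest:

```python
# Correlations from Table 12.6 (lower triangle), used as similarities.
r = {(1, 2): .643, (1, 3): -.103, (1, 4): -.082, (1, 5): -.259,
     (1, 6): -.152, (1, 7): .045, (1, 8): -.013,
     (2, 3): -.348, (2, 4): -.086, (2, 5): -.260, (2, 6): -.010,
     (2, 7): .211, (2, 8): -.328,
     (3, 4): .100, (3, 5): .435, (3, 6): .028, (3, 7): .115, (3, 8): .005,
     (4, 5): .034, (4, 6): -.288, (4, 7): -.164, (4, 8): .486,
     (5, 6): .176, (5, 7): -.019, (5, 8): -.007,
     (6, 7): -.374, (6, 8): -.561, (7, 8): -.185}

def sim(a, b):
    return r[(a, b)] if (a, b) in r else r[(b, a)]

def complete_linkage_sim(items):
    """Complete linkage with similarities: cluster similarity is the
    SMALLEST pairwise correlation; merge the pair with the LARGEST."""
    clusters = [frozenset([i]) for i in items]
    merges = []
    while len(clusters) > 1:
        u, v, s = max(
            ((a, b, min(sim(i, j) for i in a for j in b))
             for n, a in enumerate(clusters) for b in clusters[n + 1:]),
            key=lambda t: t[2])
        clusters = [c for c in clusters if c not in (u, v)] + [u | v]
        merges.append((sorted(u), sorted(v), round(s, 3)))
    return merges

merges = complete_linkage_sim(range(1, 9))
for u, v, s in merges:
    print(u, "+", v, "at similarity", s)
```

The first mergers are variables 1-2 (at .643), then 4-8 (at .486), then 3-5 (at .435), and the final merger of (12478) with (356) occurs at the smallest similarity, -.561, reproducing Figure 12.9.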
As in single linkage, a "new" assignment of distances (similarities) that have the same relative orderings as the initial distances will not change the configuration of the complete linkage clusters.

Average Linkage

Average linkage treats the distance between two clusters as the average distance between all pairs of items where one member of a pair belongs to each cluster.

Again, the input to the average linkage algorithm may be distances or similarities, and the method can be used to group objects or variables. The average linkage algorithm proceeds in the manner of the general algorithm of (12-12). We begin by searching the distance matrix D = {d_ik} to find the nearest (most similar) objects, for example, U and V. These objects are merged to form the cluster (UV). For Step 3 of the general agglomerative algorithm, the distances between (UV) and the other cluster W are determined by

    d_(UV)W = (sum_i sum_k d_ik) / (N_(UV) N_W)        (12-15)

where d_ik is the distance between object i in the cluster (UV) and object k in the cluster W, and N_(UV) and N_W are the number of items in clusters (UV) and W, respectively.

Example 12.9 (Average linkage clustering of 11 languages)
The average linkage algorithm was applied to the "distances" between 11 languages given in Example 12.5. The resulting dendrogram is displayed in Figure 12.10.

Figure 12.10 Average linkage dendrogram for distances between numbers in 11 languages.
A comparison of the dendrogram in Figure 12.10 with the corresponding single linkage dendrogram (Figure 12.5) and complete linkage dendrogram (Figure 12.8) indicates that average linkage yields a configuration very much like the complete linkage configuration. However, because distance is defined differently for each case, it is not surprising that mergers take place at different levels. •
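For comparison, the average-distance rule (12-15) can be sketched on the five-object matrix of Example 12.4 (illustrative code of our own, not from the text). Note that with this matrix average linkage merges (24) before attaching object 1, unlike single linkage, and its final merge level is 49/6, roughly 8.17:

```python
# Average linkage, rule (12-15), on the 5-object matrix of Example 12.4:
# the distance between clusters U and W is the average of the
# N_U * N_W pairwise distances between their members.
D = {(1, 2): 9, (1, 3): 3, (1, 4): 6, (1, 5): 11,
     (2, 3): 7, (2, 4): 5, (2, 5): 10,
     (3, 4): 9, (3, 5): 2, (4, 5): 8}

def dist(a, b):
    return D[(a, b)] if (a, b) in D else D[(b, a)]

def average_linkage(items):
    """Merge, at every stage, the two clusters with the smallest
    average inter-cluster distance; return the merge levels."""
    clusters = [frozenset([i]) for i in items]
    levels = []
    while len(clusters) > 1:
        u, v, h = min(
            ((a, b, sum(dist(i, k) for i in a for k in b) / (len(a) * len(b)))
             for n, a in enumerate(clusters) for b in clusters[n + 1:]),
            key=lambda t: t[2])
        clusters = [c for c in clusters if c not in (u, v)] + [u | v]
        levels.append(h)
    return levels

levels = average_linkage([1, 2, 3, 4, 5])
print(levels)   # merge levels: 2, 5, 7, then 49/6
```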
Example 12.10 (Average linkage clustering of public utilities)

An average linkage algorithm applied to the Euclidean distances between 22 public utilities (see Table 12.1) produced the dendrogram in Figure 12.11.

Figure 12.11 Average linkage dendrogram for distances between 22 public utility companies.
Concentrating on the intermediate clusters, we see that the utility companies tend to group according to geographical location. For example, one intermediate cluster contains the firms 1 (Arizona Public Service), 18 (The Southern Company, primarily Georgia and Alabama), 19 (Texas Utilities Company), and 14 (Oklahoma Gas and Electric Company). There are some exceptions. The cluster (7, 12, 21, 15, 2) contains firms on the eastern seaboard and in the far west. On the other hand, all these firms are located near the coasts. Notice that Consolidated Edison Company of New York and San Diego Gas and Electric Company stand by themselves until the final amalgamation stages.

It is, perhaps, not surprising that utility firms with similar locations (or types of locations) cluster. One would expect regulated firms in the same area to use, basically, the same type of fuel(s) for power plants and face common markets. Consequently, types of generation, costs, growth rates, and so forth should be relatively homogeneous among these firms. This is apparently reflected in the hierarchical clustering. •

For average linkage clustering, changes in the assignment of distances (similarities) can affect the arrangement of the final configuration of clusters, even though the changes preserve relative orderings.

Ward's Hierarchical Clustering Method

Ward [29] considered hierarchical clustering procedures based on minimizing the 'loss of information' from joining two groups. This method is usually implemented with loss of information taken to be an increase in an error sum of squares criterion, ESS.
First, for a given cluster k, let ESS_k be the sum of the squared deviations of every item in the cluster from the cluster mean (centroid). If there are currently K clusters, define ESS as the sum of the ESS_k, or ESS = ESS_1 + ESS_2 + ... + ESS_K. At each step in the analysis, the union of every possible pair of clusters is considered, and the two clusters whose combination results in the smallest increase in ESS (minimum loss of information) are joined. Initially, each cluster consists of a single item, and, if there are N items, ESS_k = 0, k = 1, 2, ..., N, so ESS = 0. At the other extreme, when all the clusters are combined in a single group of N items, the value of ESS is given by

    ESS = sum_{j=1}^{N} (x_j - x̄)'(x_j - x̄)

where x_j is the multivariate measurement associated with the jth item and x̄ is the mean of all the items.

The results of Ward's method can be displayed as a dendrogram. The vertical axis gives the values of ESS at which the mergers occur.

Ward's method is based on the notion that the clusters of multivariate observations are expected to be roughly elliptically shaped. It is a hierarchical precursor to nonhierarchical clustering methods that optimize some criterion for dividing data into a given number of elliptical groups. We discuss nonhierarchical clustering procedures in the next section. Additional discussion of optimization methods of cluster analysis is contained in [8].

Example 12.11 (Clustering pure malt Scotch whiskies)
Virtually all the world's pure malt Scotch whiskies are produced in Scotland. In one study (see [20]), 68 binary variables were created measuring characteristics of Scotch whiskey that can be broadly classified as color, nose, body, palate, and finish. For example, there were 14 color characteristics (descriptions), including white wine, yellow, very pale, pale, bronze, full amber, red, and so forth. Lapointe and Legendre clustered 109 pure malt Scotch whiskies, each from a different distillery. The investigators were interested in determining the major types of single-malt whiskies, their chief characteristics, and the best representative. In addition, they wanted to know whether the groups produced by the hierarchical clustering procedure corresponded to different geographical regions, since it is known that whiskies are affected by local soil, temperature, and water conditions.

Weighted similarity coefficients {s_ik} were created from binary variables representing the presence or absence of characteristics. The resulting "distances," defined as {d_ik = 1 - s_ik}, were used with Ward's method to group the 109 pure (single-) malt Scotch whiskies. The resulting dendrogram is shown in Figure 12.12. (An average linkage procedure applied to a similarity matrix produced almost exactly the same classification.)

The groups labelled A-L in the figure are the 12 groups of similar Scotches identified by the investigators. A follow-up analysis suggested that these 12 groups have a large geographic component in the sense that Scotches with similar characteristics tend to be produced by distilleries that are located reasonably
Figure 12.12 A dendrogram for similarities between 109 pure malt Scotch whiskies. (The 12 groups of similar Scotches are labelled A-L.)
close to one another. Consequently, the investigators concluded, "The relationship with geographic features was demonstrated, supporting the hypothesis that whiskies are affected not only by distillery secrets and traditions but also by factors dependent on region such as water, soil, microclimate, temperature and even air quality." •
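Ward's merge criterion itself is easy to sketch. The following illustrative Python is our own (the four 2-D points are hypothetical, not the whisky data); it performs one Ward step, joining the pair of clusters whose union yields the smallest increase in ESS:

```python
def ess(cluster):
    """Sum of squared deviations of a cluster's points from its centroid."""
    n, p = len(cluster), len(cluster[0])
    centroid = [sum(x[j] for x in cluster) / n for j in range(p)]
    return sum((x[j] - centroid[j]) ** 2 for x in cluster for j in range(p))

def ward_step(clusters):
    """Merge the two clusters whose union gives the smallest ESS increase;
    return the new clustering and that increase."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            increase = (ess(clusters[i] + clusters[j])
                        - ess(clusters[i]) - ess(clusters[j]))
            if best is None or increase < best[0]:
                best = (increase, i, j)
    inc, i, j = best
    merged = [c for k, c in enumerate(clusters) if k not in (i, j)]
    merged.append(clusters[i] + clusters[j])
    return merged, inc

# Four hypothetical 2-D items, each starting as its own cluster (ESS = 0).
clusters = [[(0.0, 0.0)], [(1.0, 0.0)], [(5.0, 5.0)], [(6.0, 5.0)]]
clusters, inc = ward_step(clusters)
print(len(clusters), "clusters; ESS increased by", inc)
```

Repeating `ward_step` until one cluster remains traces out the full Ward dendrogram, with the cumulative ESS values as the merge heights.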
Final Comments-Hierarchical Procedures

There are many agglomerative hierarchical clustering procedures besides single linkage, complete linkage, and average linkage. However, all the agglomerative procedures follow the basic algorithm of (12-12).

As with most clustering methods, sources of error and variation are not formally considered in hierarchical procedures. This means that a clustering method will be sensitive to outliers, or "noise points."

In hierarchical clustering, there is no provision for a reallocation of objects that may have been "incorrectly" grouped at an early stage. Consequently, the final configuration of clusters should always be carefully examined to see whether it is sensible.

For a particular problem, it is a good idea to try several clustering methods and, within a given method, a couple of different ways of assigning distances (similarities). If the outcomes from the several methods are (roughly) consistent with one another, perhaps a case for "natural" groupings can be advanced.

The stability of a hierarchical solution can sometimes be checked by applying the clustering algorithm before and after small errors (perturbations) have been added to the data units. If the groups are fairly well distinguished, the clusterings before perturbation and after perturbation should agree.

Common values (ties) in the similarity or distance matrix can produce multiple solutions to a hierarchical clustering problem. That is, the dendrograms corresponding to different treatments of the tied similarities (distances) can be different, particularly at the lower levels. This is not an inherent problem of any method; rather, multiple solutions occur for certain kinds of data. Multiple solutions are not necessarily bad, but the user needs to know of their existence so that the groupings (dendrograms) can be properly interpreted and different groupings (dendrograms) compared to assess their overlap. A further discussion of this issue appears in [24].

Some data sets and hierarchical clustering methods can produce inversions. (See [24].) An inversion occurs when an object joins an existing cluster at a smaller distance (greater similarity) than that of a previous consolidation. An inversion is represented two different ways in the following diagram:
[Diagram: the same inversion drawn two ways: (i) a dendrogram with crossover; (ii) a dendrogram with a nonmonotonic distance scale. Merges involving objects A, B, C, and D occur at distances 20, 32, and 30.]
In this example, the clustering method joins A and B at distance 20. At the next step, C is added to the group (AB) at distance 32. Because of the nature of the clustering algorithm, D is added to group (ABC) at distance 30, a smaller distance than the distance at which C joined (AB). In (i) the inversion is indicated by a dendrogram with crossover. In (ii), the inversion is indicated by a dendrogram with a nonmonotonic scale.

Inversions can occur when there is no clear cluster structure and are generally associated with two hierarchical clustering algorithms known as the centroid method and the median method. The hierarchical procedures discussed in this book are not prone to inversions.
12.4 NONHIERARCHICAL CLUSTERING METHODS

Nonhierarchical clustering techniques are designed to group items, rather than variables, into a collection of K clusters. The number of clusters, K, may either be specified in advance or determined as part of the clustering procedure. Because a matrix of distances (similarities) does not have to be determined, and the basic data do not have to be stored during the computer run, nonhierarchical methods can be applied to much larger data sets than can hierarchical techniques.

Nonhierarchical methods start from either (1) an initial partition of items into groups or (2) an initial set of seed points, which will form the nuclei of clusters. Good choices for starting configurations should be free of overt biases. One way to start is to randomly select seed points from among the items or to randomly partition the items into initial groups.

In this section, we discuss one of the more popular nonhierarchical procedures, the K-means method.

K-means Method

MacQueen [22] suggests the term K-means for describing an algorithm of his that assigns each item to the cluster having the nearest centroid (mean). In its simplest version, the process is composed of these three steps:

1. Partition the items into K initial clusters.
2. Proceed through the list of items, assigning an item to the cluster whose centroid (mean) is nearest. (Distance is usually computed using Euclidean distance with either standardized or unstandardized observations.) Recalculate the centroid for the cluster receiving the new item and for the cluster losing the item.
3. Repeat Step 2 until no more reassignments take place.        (12-16)

Rather than starting with a partition of all items into K preliminary groups in Step 1, we could specify K initial centroids (seed points) and then proceed to Step 2.

The final assignment of items to clusters will be, to some extent, dependent upon the initial partition or the initial selection of seed points.
Experience suggests that most maj or changes in assignment occur with the first reallocation step.
Example 12.12 (Clustering using the K-means method)
Suppose we measure two variables X1 and X2 for each of four items A, B, C, and D. The data are given in the following table:

           Observations
    Item     x1     x2
    A         5      3
    B        -1      1
    C         1     -2
    D        -3     -2
The objective is to divide these items into K = 2 clusters such that the items within a cluster are closer to one another than they are to the items in different clusters. To implement the K = 2-means method, we arbitrarily partition the items into two clusters, such as (AB) and (CD), and compute the coordinates (x̄1, x̄2) of the cluster centroid (mean). Thus, at Step 1, we have

                     Coordinates of centroid
    Cluster          x̄1                     x̄2
    (AB)     (5 + (-1))/2 =  2      (3 + 1)/2 =  2
    (CD)     (1 + (-3))/2 = -1    (-2 + (-2))/2 = -2

At Step 2, we compute the Euclidean distance of each item from the group centroids and reassign each item to the nearest group. If an item is moved from the initial configuration, the cluster centroids (means) must be updated before proceeding. We compute the squared distances

    d²(A, (AB)) = (5 - 2)² + (3 - 2)² = 10
    d²(A, (CD)) = (5 + 1)² + (3 + 2)² = 61

Since A is closer to cluster (AB) than to cluster (CD), it is not reassigned. Continuing, we get

    d²(B, (AB)) = (-1 - 2)² + (1 - 2)² = 10
    d²(B, (CD)) = (-1 + 1)² + (1 + 2)² = 9
and, consequently, B is reassigned to cluster (CD), giving cluster (BCD) and the following updated coordinates of the centroid:

              Coordinates of centroid
    Cluster       x̄1       x̄2
    A              5        3
    (BCD)         -1       -1

Again, each item is checked for reassignment. Computing the squared distances gives the following:

        Squared distances to group centroids
                          Item
    Cluster       A      B      C      D
    A             0     40     41     89
    (BCD)        52      4      5      5

We see that each item is currently assigned to the cluster with the nearest centroid (mean), and the process stops. The final K = 2 clusters are A and (BCD). •

To check the stability of the clustering, it is desirable to rerun the algorithm with a new initial partition. Once clusters are determined, intuitions concerning their interpretations are aided by rearranging the list of items so that those in the first cluster appear first, those in the second cluster appear next, and so forth. A table of the cluster centroids (means) and within-cluster variances also helps to delineate group differences.
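The three steps of (12-16) can be sketched in a few lines of Python. This is our own illustrative code, reproducing the K = 2 example above; centroids are recomputed each time an item is considered, so a move immediately updates both affected clusters:

```python
# The four items of Example 12.12.
points = {"A": (5, 3), "B": (-1, 1), "C": (1, -2), "D": (-3, -2)}

def centroid(cluster):
    """Mean vector of the items in a cluster."""
    xs = [points[i] for i in cluster]
    return tuple(sum(coord) / len(xs) for coord in zip(*xs))

def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(partition):
    """Cycle through the items, moving each to the cluster with the
    nearest centroid, until a full pass makes no reassignment."""
    moved = True
    while moved:
        moved = False
        for item in sorted(points):
            best = min(range(len(partition)),
                       key=lambda k: sqdist(points[item],
                                            centroid(partition[k])))
            current = next(k for k, c in enumerate(partition) if item in c)
            if best != current:
                partition[current].remove(item)
                partition[best].add(item)
                moved = True
    return partition

# Start from the arbitrary initial partition (AB), (CD) used in the text.
final = k_means([{"A", "B"}, {"C", "D"}])
print([sorted(c) for c in final])   # → [['A'], ['B', 'C', 'D']]
```

As in the hand computation, only B is reassigned, and the algorithm stops with the clusters A and (BCD).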
Example 12.13 (K-means clustering of public utilities)
Let us return to the problem of clustering public utilities using the data in Table 12.5. The K-means algorithm was run for several choices of K. We present a summary of the results for K = 4 and K = 5. In general, the choice of a particular K is not clear cut and depends upon subject-matter knowledge, as well as data-based appraisals. (Data-based appraisals might include choosing K so as to maximize the between-cluster variability relative to the within-cluster variability. Relevant measures might include |W|/|B + W| [see (6-38)] and tr(W⁻¹B).) The summary is as follows:
K = 4

    Cluster   Number of firms   Firms
    1         5                 Idaho Power Co. (8), Nevada Power Co. (11), Puget Sound Power & Light Co. (16), Virginia Electric & Power Co. (22), Kentucky Utilities Co. (9).
    2         6                 Central Louisiana Electric Co. (3), Oklahoma Gas & Electric Co. (14), The Southern Co. (18), Texas Utilities Co. (19), Arizona Public Service (1), Florida Power & Light Co. (6).
    3         5                 New England Electric Co. (12), Pacific Gas & Electric Co. (15), San Diego Gas & Electric Co. (17), United Illuminating Co. (21), Hawaiian Electric Co. (7).
    4         6                 Consolidated Edison Co. (N.Y.) (5), Boston Edison Co. (2), Madison Gas & Electric Co. (10), Northern States Power Co. (13), Wisconsin Electric Power Co. (20), Commonwealth Edison Co. (4).

    Distances between Cluster Centers
          1      2      3      4
    1     0
    2    3.08    0
    3    3.29   3.56    0
    4    3.05   2.84   3.18    0

K = 5

    Cluster   Number of firms   Firms
    1         5                 Nevada Power Co. (11), Puget Sound Power & Light Co. (16), Idaho Power Co. (8), Virginia Electric & Power Co. (22), Kentucky Utilities Co. (9).
    2         6                 Central Louisiana Electric Co. (3), Texas Utilities Co. (19), Oklahoma Gas & Electric Co. (14), The Southern Co. (18), Arizona Public Service (1), Florida Power & Light Co. (6).
    3         5                 New England Electric Co. (12), Pacific Gas & Electric Co. (15), San Diego Gas & Electric Co. (17), United Illuminating Co. (21), Hawaiian Electric Co. (7).
    4         2                 Consolidated Edison Co. (N.Y.) (5), Boston Edison Co. (2).
    5         4                 Commonwealth Edison Co. (4), Madison Gas & Electric Co. (10), Northern States Power Co. (13), Wisconsin Electric Power Co. (20).
    Distances between Cluster Centers
          1      2      3      4      5
    1     0
    2    3.08    0
    3    3.29   3.56    0
    4    3.63   3.46   2.63    0
    5    3.18   2.99   3.81   2.89    0
The cluster profiles (K = 5) shown in Figure 12.13 order the eight variables according to the ratios of their between-cluster variability to their within-cluster variability. [For univariate F-ratios, see Section 6.4.] We have

    F_nuc = (mean square percent nuclear between clusters) / (mean square percent nuclear within clusters) = 3.335/.255 = 13.1
so firms within different clusters are widely separated with respect to percent nuclear, but firms within the same cluster show little percent nuclear variation. Fuel costs (FUELC) and annual sales (SALES) also seem to be of some importance in distinguishing the clusters.

Reviewing the firms in the five clusters, it is apparent that the K-means method gives results generally consistent with the average linkage hierarchical method. (See Example 12.10.) Firms with common or compatible geographical locations cluster. Also, the firms in a given cluster seem to be roughly the same in terms of percent nuclear. •

We must caution, as we have throughout the book, that the importance of individual variables in clustering must be judged from a multivariate perspective. All of the variables (multivariate observations) determine the cluster means and the reassignment of items. In addition, the values of the descriptive statistics measuring the importance of individual variables are functions of the number of clusters and the final configuration of the clusters. On the other hand, descriptive measures can be helpful, after the fact, in assessing the "success" of the clustering procedure.

Final Comments-Nonhierarchical Procedures

There are strong arguments for not fixing the number of clusters, K, in advance, including the following:

1. If two or more seed points inadvertently lie within a single cluster, their resulting clusters will be poorly differentiated.
2. The existence of an outlier might produce at least one group with very disperse items.
3. Even if the population is known to consist of K groups, the sampling method may be such that data from the rarest group do not appear in the sample. Forcing the data into K groups would lead to nonsensical clusters.
Figure 12.13 Cluster profiles (K = 5) for public utility data. Each column of the figure describes a cluster; the cluster number is printed at the mean of each variable, and dashes indicate one standard deviation above and below the mean. The variables are ordered by F-ratio size:

Variable                         F-ratio
Percent nuclear                   13.1
Total fuel costs                  12.4
Sales                              8.4
Cost per KW capacity in place      5.1
Annual load factor                 4.4
Peak kWh demand growth             2.7
Rate of return on capital          2.4
Fixed-charge coverage ratio        0.5
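The between- to within-cluster mean-square ratio used to order the variables in Figure 12.13 can be computed directly from one variable's values and the final cluster labels. A minimal sketch in Python (the test data below are placeholders, not the utility data of the example):

```python
import numpy as np

def f_ratio(x, labels):
    # Between- over within-cluster mean square for a single variable,
    # as used to order the profiles in Figure 12.13.
    x, labels = np.asarray(x, dtype=float), np.asarray(labels)
    groups = [x[labels == g] for g in np.unique(labels)]
    k, n = len(groups), len(x)
    grand_mean = x.mean()
    between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
    within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - k)
    return between / within
```

This is the one-way ANOVA F statistic; as the text cautions, it is only a descriptive measure here, since the clusters were themselves chosen to separate the observations.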
Chapter 12 Clustering, Distance Methods, and Ordination
In cases where a single run of the algorithm requires the user to specify K, it is always a good idea to rerun the algorithm for several choices. Discussions of other nonhierarchical clustering procedures are available in [3], [8], and [15].
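Rerunning for several choices of K is straightforward with any standard K-means implementation; a sketch using scikit-learn (the data matrix here is random placeholder data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))        # placeholder data matrix

inertias = {}
for k in range(2, 7):                   # several choices of K
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_           # within-cluster sum of squares
```

Comparing the within-cluster sum of squares (and the resulting memberships) across K helps reveal poorly differentiated or nonsensical solutions of the kind listed above.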
12.5 MULTIDIMENSIONAL SCALING

This section begins a discussion of methods for displaying (transformed) multivariate data in low-dimensional space. We have already considered this issue when we discussed plotting scores on, say, the first two principal components or the scores on the first two linear discriminants. The methods we are about to discuss differ from these procedures in the sense that their primary objective is to "fit" the original data into a low-dimensional coordinate system such that any distortion caused by a reduction in dimensionality is minimized. Distortion generally refers to the similarities or dissimilarities (distances) among the original data points. Although Euclidean distance may be used to measure the closeness of points in the final low-dimensional configuration, the notion of similarity or dissimilarity depends upon the underlying technique for its definition. A low-dimensional plot of the kind we are alluding to is called an ordination of the data.

Multidimensional scaling techniques deal with the following problem: For a set of observed similarities (or distances) between every pair of N items, find a representation of the items in few dimensions such that the interitem proximities "nearly match" the original similarities (or distances). It may not be possible to match exactly the ordering of the original similarities (distances). Consequently, scaling techniques attempt to find configurations in q < N - 1 dimensions such that the match is as close as possible. The numerical measure of closeness is called the stress.

It is possible to arrange the N items in a low-dimensional coordinate system using only the rank orders of the N(N - 1)/2 original similarities (distances), and not their magnitudes. When only this ordinal information is used to obtain a geometric representation, the process is called nonmetric multidimensional scaling.
If the actual magnitudes of the original similarities (distances) are used to obtain a geometric representation in q dimensions, the process is called metric multidimensional scaling. Metric multidimensional scaling is also known as principal coordinate analysis.

Scaling techniques were developed by Shepard (see [26] for a review of early work), Kruskal [17, 18, 19], and others. A good summary of the history, theory, and applications of multidimensional scaling is contained in [32]. Multidimensional scaling invariably requires the use of a computer, and several good computer programs are now available for the purpose.

The Basic Algorithm

For N items, there are M = N(N - 1)/2 similarities (distances) between pairs of different items. These similarities constitute the basic data. (In cases where the similarities cannot be easily quantified as, for example, the similarity between two colors, the rank orders of the similarities are the basic data.)
Assuming no ties, the similarities can be arranged in a strictly ascending order as

s_{i1 k1} < s_{i2 k2} < ... < s_{iM kM}    (12-17)

Here s_{i1 k1} is the smallest of the M similarities. The subscript i1 k1 indicates the pair of items that are least similar, that is, the items with rank 1 in the similarity ordering. Other subscripts are interpreted in the same manner. We want to find a q-dimensional configuration of the N items such that the distances, d_{ik}^(q), between pairs of items match the ordering in (12-17). If the distances are laid out in a manner corresponding to that ordering, a perfect match occurs when

d_{i1 k1}^(q) > d_{i2 k2}^(q) > ... > d_{iM kM}^(q)    (12-18)

That is, the descending ordering of the distances in q dimensions is exactly analogous to the ascending ordering of the initial similarities. As long as the order in (12-18) is preserved, the magnitudes of the distances are unimportant.

For a given value of q, it may not be possible to find a configuration of points whose pairwise distances are monotonically related to the original similarities. Kruskal [17] proposed a measure of the extent to which a geometrical representation falls short of a perfect match. This measure, the stress, is defined as

Stress(q) = { Σ_{i<k} [d_{ik}^(q) - d̂_{ik}^(q)]² / Σ_{i<k} [d_{ik}^(q)]² }^{1/2}    (12-19)

The d̂_{ik}^(q)'s in the stress formula are numbers known to satisfy (12-18); that is, they are monotonically related to the similarities. The d̂_{ik}^(q)'s are not distances in the sense that they satisfy the usual distance properties of (1-25). They are merely reference numbers used to judge the nonmonotonicity of the observed d_{ik}^(q)'s.

The idea is to find a representation of the items as points in q dimensions such that the stress is as small as possible. Kruskal [17] suggests the stress be informally interpreted according to the following guidelines:
Stress    Goodness of fit
20%       Poor
10%       Fair
5%        Good
2.5%      Excellent
0%        Perfect        (12-20)
Goodness of fit refers to the monotonic relationship between the similarities and the final distances. A second measure of discrepancy, introduced by Takane et al. [28], is becoming the preferred criterion. For a given dimension q, this measure, denoted by SStress, replaces the d_{ik}'s and d̂_{ik}'s in (12-19) by their squares and is given by

SStress = { Σ_{i<k} (d_{ik}² - d̂_{ik}²)² / Σ_{i<k} d_{ik}⁴ }^{1/2}    (12-21)
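Given observed distances and monotone reference numbers, (12-21) is a one-line computation. A sketch, with d and dhat supplied as flat arrays of the M pairwise values:

```python
import numpy as np

def sstress(d, dhat):
    # SStress (12-21): the stress criterion computed from squared distances.
    d2 = np.asarray(d, dtype=float) ** 2
    dhat2 = np.asarray(dhat, dtype=float) ** 2
    return np.sqrt(np.sum((d2 - dhat2) ** 2) / np.sum(d2 ** 2))
```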
The value of SStress is always between 0 and 1. Any value less than .1 is typically taken to mean that there is a good representation of the objects by the points in the given configuration.

Once items are located in q dimensions, their q × 1 vectors of coordinates can be treated as multivariate observations. For display purposes, it is convenient to represent this q-dimensional scatter plot in terms of its principal component axes. (See Chapter 8.)

We have written the stress measure as a function of q, the number of dimensions for the geometrical representation. For each q, the configuration leading to the minimum stress can be obtained. As q increases, minimum stress will, within rounding error, decrease and will be zero for q = N - 1. Beginning with q = 1, a plot of these stress(q) numbers versus q can be constructed. The value of q for which this plot begins to level off may be selected as the "best" choice of the dimensionality. That is, we look for an "elbow" in the stress-dimensionality plot.

The entire multidimensional scaling algorithm is summarized in these steps:
1. For N items, obtain the M = N(N - 1)/2 similarities (distances) between distinct pairs of items. Order the similarities as in (12-17). (Distances are ordered from largest to smallest.) If similarities (distances) cannot be computed, the rank orders must be specified.
2. Using a trial configuration in q dimensions, determine the interitem distances d_{ik}^(q) and numbers d̂_{ik}^(q), where the latter satisfy (12-18) and minimize the stress (12-19) or SStress (12-21). (The d̂_{ik}^(q) are frequently determined within scaling computer programs using regression methods designed to produce monotonic "fitted" distances.)
3. Using the d̂_{ik}^(q)'s, move the points around to obtain an improved configuration. (For q fixed, an improved configuration is determined by a general function minimization procedure applied to the stress. In this context, the stress is regarded as a function of the N × q coordinates of the N items.) A new configuration will have new d_{ik}^(q)'s, new d̂_{ik}^(q)'s, and smaller stress. The process is repeated until the best (minimum stress) representation is obtained.
4. Plot minimum stress(q) versus q and choose the best number of dimensions, q*, from an examination of this plot.    (12-22)
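Step 2 of (12-22), finding monotone reference numbers and evaluating (12-19), can be sketched in a few lines. Isotonic (monotone) regression plays the role of the "regression methods designed to produce monotonic fitted distances" mentioned above; the function below is an illustration under that choice, not the algorithm of any particular scaling program:

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.isotonic import IsotonicRegression

def kruskal_stress(similarities, config):
    # similarities: the M = N(N-1)/2 pairwise similarities, in the same
    #               (condensed) pair order that pdist uses
    # config:       N x q matrix of trial coordinates
    d = pdist(config)                              # interitem distances d_ik^(q)
    # dhat: best fit to d that decreases as similarity increases,
    # i.e., reference numbers respecting the ordering requirement (12-18)
    dhat = IsotonicRegression(increasing=False).fit_transform(similarities, d)
    return np.sqrt(np.sum((d - dhat) ** 2) / np.sum(d ** 2))
```

When the configuration's distances are already perfectly anti-monotone with the similarities, dhat equals d and the stress is zero; step 3 would then move the points to reduce a nonzero stress.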
We have assumed that the initial similarity values are symmetric (s_{ik} = s_{ki}), that there are no ties, and that there are no missing observations. Kruskal [17, 18] has suggested methods for handling asymmetries, ties, and missing observations. In addition, there are now multidimensional scaling computer programs that will handle not only Euclidean distance, but any distance of the Minkowski type. [See (12-3).]

The next three examples illustrate multidimensional scaling with distances as the initial (dis)similarity measures.

Example 12.14 (Multidimensional scaling of U.S. cities)

Table 12.7 displays the airline distances between pairs of selected U.S. cities. Since the cities naturally lie in a two-dimensional space (a nearly level part of the curved surface of the earth), it is not surprising that multidimensional scaling with q = 2 will locate these items about as they occur on a map. Note that
TABLE 12.7 AIRLINE-DISTANCE DATA

                    (1)   (2)   (3)   (4)   (5)   (6)   (7)   (8)   (9)  (10)  (11)  (12)
(1)  Atlanta          0
(2)  Boston        1068     0
(3)  Cincinnati     461   867     0
(4)  Columbus       549   769   107     0
(5)  Dallas         805  1819   943  1050     0
(6)  Indianapolis   508   941   108   172   882     0
(7)  Little Rock    505  1494   618   725   325   562     0
(8)  Los Angeles   2197  3052  2186  2245  1403  2080  1701     0
(9)  Memphis        366  1355   502   586   464   436   137  1831     0
(10) St. Louis      558  1178   338   409   645   234   353  1848   294     0
(11) Spokane       2467  2747  2067  2131  1891  1959  1988  1227  2042  1820     0
(12) Tampa          467  1379   928   985  1077   975   912  2480   779  1016  2821     0
Figure 12.14 A geometrical representation of cities produced by multidimensional scaling.
if the distances in the table are ordered from largest to smallest, that is, from least similar to most similar, the first position is occupied by d_{Boston, L.A.} = 3052.

A multidimensional scaling plot for q = 2 dimensions is shown in Figure 12.14. The axes lie along the sample principal components of the scatter plot.

A plot of stress(q) versus q is shown in Figure 12.15. Since stress(1) × 100% = 12%, a representation of the cities in one dimension (along a single axis) is not unreasonable. The "elbow" of the stress function occurs at q = 2. Here stress(2) × 100% = 0.8%, and the "fit" is almost perfect.

The plot in Figure 12.15 indicates that q = 2 is the best choice for the dimension of the final configuration. Note that the stress actually increases for q = 3. This anomaly can occur for extremely small values of stress because of difficulties with the numerical search procedure used to locate the minimum stress. •

Example 12.15
(Multidimensional scaling of public utilities)

Let us try to represent the 22 public utility firms discussed in Example 12.8 as points in a low-dimensional space. The measures of (dis)similarities between pairs of firms are the Euclidean distances listed in Table 12.1. Multidimensional scaling in q = 1, 2, ..., 6 dimensions produced the stress function shown in Figure 12.16.

The stress function in Figure 12.16 has no sharp elbow. The plot appears to level out at "good" values of stress (less than or equal to 5%) in the neighborhood of q = 4. A good four-dimensional representation of the utilities is achievable, but difficult to display. We show a plot of the utility configuration
Figure 12.15 Stress function for airline distances between cities.
Figure 12.16 Stress function for distances between utilities.
Figure 12.17 A geometrical representation of utilities produced by multidimensional scaling.
obtained in q = 2 dimensions in Figure 12.17. The axes lie along the sample principal components of the final scatter.

Although the stress for two dimensions is rather high (stress(2) × 100% = 19%), the distances between firms in Figure 12.17 are not wildly inconsistent with the clustering results presented earlier in this chapter. For example, the midwest utilities (Commonwealth Edison, Wisconsin Electric Power (WEPCO), Madison Gas and Electric (MG&E), and Northern States Power (NSP)) are close together (similar). Texas Utilities and Oklahoma Gas and Electric (Ok. G&E) are also very close together (similar). Other utilities tend to group according to geographical locations or similar environments.

The utilities cannot be positioned in two dimensions such that the interutility distances d_{ik}^(2) are entirely consistent with the original distances in Table 12.1. More flexibility for positioning the points is required, and this can only be obtained by introducing additional dimensions. •

Example 12.16
(Multidimensional scaling of universities)

Data related to 25 U.S. universities are given in Table 12.9. (See Example 12.19.) These data give the average SAT score of entering freshmen, percent of freshmen in top 10% of high school class, percent of applicants
Figure 12.18 A two-dimensional representation of universities produced by metric multidimensional scaling.
accepted, student-faculty ratio, estimated annual expense, and graduation rate (%).

A metric multidimensional scaling algorithm applied to the standardized university data gives the two-dimensional representation shown in Figure 12.18. Notice how the private universities cluster on the right of the plot while the large public universities are, generally, on the left. A nonmetric multidimensional scaling two-dimensional configuration is shown in Figure 12.19. For this example, the metric and nonmetric scaling representations are very similar, with the two-dimensional stress value being approximately 10% for both scalings. •

Classical metric scaling, or principal coordinate analysis, is equivalent to plotting the principal components. Different software programs choose the signs of the appropriate eigenvectors differently, so at first sight, two solutions may appear to be different. However, the solutions will coincide with a reflection of one or more of the axes. (See [23].)
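Classical metric scaling can be computed directly by double-centering the squared distances and extracting eigenvectors, which also makes the sign (reflection) ambiguity just mentioned explicit. A minimal sketch:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def classical_mds(D, q=2):
    # Principal coordinate analysis of an N x N Euclidean distance matrix D.
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # doubly centered squared distances
    eigval, eigvec = np.linalg.eigh(B)       # eigenvalues in ascending order
    idx = np.argsort(eigval)[::-1][:q]       # keep the q largest
    scale = np.sqrt(np.clip(eigval[idx], 0.0, None))
    return eigvec[:, idx] * scale            # N x q principal coordinates

# A planar configuration is recovered exactly, up to rotation/reflection
X = np.array([[0., 0.], [1., 0.], [0., 2.], [3., 1.]])
Y = classical_mds(squareform(pdist(X)), q=2)
```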
Figure 12.19 A two-dimensional representation of universities produced by nonmetric multidimensional scaling.
To summarize, the key objective of multidimensional scaling procedures is a low-dimensional picture. Whenever multivariate data can be presented graphically in two or three dimensions, visual inspection can greatly aid interpretations.

When the multivariate observations are naturally numerical, and Euclidean distances in p dimensions, d_{ik}^(p), can be computed, we can seek a q < p dimensional representation by minimizing

E = [ Σ_{i<k} d_{ik}^(p) ]^{-1} Σ_{i<k} [d_{ik}^(p) - d_{ik}^(q)]² / d_{ik}^(p)    (12-23)

In this alternative approach, the Euclidean distances in p and q dimensions are compared directly. Techniques for obtaining low-dimensional representations by minimizing E are called nonlinear mappings.

The final goodness of fit of any low-dimensional representation can be depicted graphically by minimal spanning trees. (See [15] for a further discussion of these topics.)
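The criterion E can be minimized numerically with a general-purpose optimizer, exactly as described for the stress. A sketch (the use of L-BFGS and the random starting configuration are implementation choices, not prescribed by the text):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist

def mapping_error(dp, Y):
    # E of (12-23): compare the p-dimensional distances dp with the
    # distances of the q-dimensional configuration Y.
    dq = pdist(Y)
    return np.sum((dp - dq) ** 2 / dp) / np.sum(dp)

def nonlinear_map(X, q=2, seed=0):
    # Seek a q-dimensional configuration minimizing E.
    X = np.asarray(X, dtype=float)
    dp = pdist(X)
    y0 = np.random.default_rng(seed).standard_normal(X.shape[0] * q)
    res = minimize(lambda y: mapping_error(dp, y.reshape(-1, q)), y0,
                   method="L-BFGS-B")
    return res.x.reshape(-1, q), res.fun
```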
12.6 CORRESPONDENCE ANALYSIS

Developed by the French, correspondence analysis is a graphical procedure for representing associations in a table of frequencies or counts. We will concentrate on a two-way table of frequencies or contingency table. If the contingency table has I rows and J columns, the plot produced by correspondence analysis contains two sets of points: a set of I points corresponding to the rows and a set of J points corresponding to the columns. The positions of the points reflect associations.

Row points that are close together indicate rows that have similar profiles (conditional distributions) across the columns. Column points that are close together indicate columns with similar profiles (conditional distributions) down the rows. Finally, row points that are close to column points represent combinations that occur more frequently than would be expected from an independence model, that is, a model in which the row categories are unrelated to the column categories.

The usual output from a correspondence analysis includes the "best" two-dimensional representation of the data, along with the coordinates of the plotted points, and a measure (called the inertia) of the amount of information retained in each dimension.

Before briefly discussing the algebraic development of contingency analysis, it is helpful to illustrate the ideas we have introduced with an example.

Example 12.17
(Correspondence analysis of archaeological data)

Table 12.8 contains the frequencies (counts) of J = 4 different types of pottery (called potsherds) found at I = 7 archaeological sites in an area of the American Southwest. If we divide the frequencies in each row (archaeological site) by the corresponding row total, we obtain a profile of types of pottery. The profiles for the different sites (rows) are shown in a bar graph in Figure 12.20(a). The widths of the bars are proportional to the total row frequencies. In general, the profiles are different; however, the profiles for sites P1 and P2 are similar, as are the profiles for sites P4 and P5.

TABLE 12.8 FREQUENCIES OF TYPES OF POTTERY

                 Type
Site      A     B     C     D    Total
P0       30    10    10    39      89
P1       53     4    16     2      75
P2       73     1    41     1     116
P3       20     6     1     4      31
P4       46    36    37    13     132
P5       45     6    59    10     120
P6       16    28   169     5     218
Total   283    91   333    74     781

Source: Data courtesy of M. J. Tretter.
Figure 12.20 Site and pottery type profiles for the data in Table 12.8.
The archaeological site profiles for different types of pottery (columns) are shown in a bar graph in Figure 12.20(b). The site profiles are constructed using the column totals. The bars in the figure appear to be quite different from one another. This suggests that the various types of pottery are not distributed over the archaeological sites in the same way.

The two-dimensional plot from a correspondence analysis² of the pottery type-site data is shown in Figure 12.21.

The plot in Figure 12.21 indicates, for example, that sites P1 and P2 have similar pottery type profiles (the two points are close together), and sites P0 and P6 have very different profiles (the points are far apart). The individual points representing the types of pottery are spread out, indicating that their archaeological site profiles are quite different. These findings are consistent with the profiles pictured in Figure 12.20.

Notice that the points P0 and D are quite close together and separated from the remaining points. This indicates that pottery type D tends to be associated, almost exclusively, with site P0. Similarly, pottery type A tends to be associated with site P1 and, to lesser degrees, with sites P2 and P3. Pottery type B is associated with sites P4 and P5, and pottery type C tends to be associated, again, almost exclusively, with site P6. Since the archaeological sites represent different periods, these associations are of considerable interest to archaeologists.

The number λ₁² = .28 at the end of the first coordinate axis in the two-dimensional plot is the inertia associated with the first dimension. This inertia is 55% of the total inertia. The inertia associated with the second dimension is λ₂² = .17, and the second dimension accounts for 33% of the total inertia. Together, the two dimensions account for 55% + 33% = 88% of the total

² The JMP software was used for a correspondence analysis of the data in Table 12.8.
Figure 12.21 A correspondence analysis plot of the pottery type-site data. (First axis: λ₁² = .28, 55% of the inertia; second axis: λ₂² = .17, 33%.)
inertia. Since, in this case, the data could be exactly represented in three dimensions, relatively little information (variation) is lost by representing the data in the two-dimensional plot of Figure 12.21. Equivalently, we may regard this plot as the best two-dimensional representation of the multidimensional scatter of row points and the multidimensional scatter of column points. The combined inertia of 88% suggests that the representation "fits" the data well.

In this example, the graphical output from a correspondence analysis shows the nature of the associations in the contingency table quite clearly. •

Algebraic Development of Correspondence Analysis

To begin, let X, with elements x_ij, be an I × J two-way table of unscaled frequencies or counts. In our discussion we take I > J and assume that X is of full column rank J. The rows and columns of the contingency table X correspond to different categories of two different characteristics. As an example, the array of frequencies of different pottery types at different archaeological sites shown in Table 12.8 is a contingency table with I = 7 archaeological sites and J = 4 pottery types.

If n is the total of the frequencies in the data matrix X, we first construct a matrix of proportions P = {p_ij} by dividing each element of X by n. Hence

p_ij = x_ij / n,  i = 1, 2, ..., I,  j = 1, 2, ..., J,   or   P = (1/n) X    (12-24)

The I × J matrix P is called the correspondence matrix.
Next define the vectors of row and column sums r and c, respectively, and the diagonal matrices D_r and D_c with the elements of r and c on the diagonals. Thus

r_i = Σ_{j=1}^{J} p_ij = Σ_{j=1}^{J} x_ij / n,  i = 1, 2, ..., I,   or   r = P 1_J
c_j = Σ_{i=1}^{I} p_ij = Σ_{i=1}^{I} x_ij / n,  j = 1, 2, ..., J,   or   c = P' 1_I    (12-25)

where 1_J is a J × 1 and 1_I is an I × 1 vector of 1's, and

D_r = diag(r_1, r_2, ..., r_I)   and   D_c = diag(c_1, c_2, ..., c_J)    (12-26)

We define the square-root matrices

D_r^{1/2} = diag(√r_1, ..., √r_I)          D_c^{1/2} = diag(√c_1, ..., √c_J)
D_r^{-1/2} = diag(1/√r_1, ..., 1/√r_I)     D_c^{-1/2} = diag(1/√c_1, ..., 1/√c_J)    (12-27)
for scaling purposes.

Correspondence analysis can be formulated as the weighted least squares problem to select P̂ = {p̂_ij}, a matrix of specified reduced rank, to minimize

Σ_{i=1}^{I} Σ_{j=1}^{J} (p_ij - p̂_ij)² / (r_i c_j) = tr[ (D_r^{-1/2}(P - P̂)D_c^{-1/2}) (D_r^{-1/2}(P - P̂)D_c^{-1/2})' ]    (12-28)

since (p_ij - p̂_ij)/√(r_i c_j) is the (i, j) element of D_r^{-1/2}(P - P̂)D_c^{-1/2}.
R esult 12.1. The term rc' is common to the approximation P whatever the J I X correspondence matrix P. The reduced rank s approximation to P , which minimizes the sum of squares (12-28), is given by
s s :L Ak (D;I2 uk ) (D�I2 vk ) ' = rc' + :L Ak (D;I2uk ) (D�I2 vk ) ' k= 1 k =2 where the Ak are the singular values and the I X 1 vectors uk and the J X 1 vectors vk are the corresponding singular vectors of the I X J matrix D� 112 PD� 112 . The I minimum value of (12-28) is :L Ak . k =s+ 1 p
.
The reduced rank K
>
1 approximation to
P - rc '
K
.
P
-
rc' is
:L Ak (D;I2 uk ) (D�I2 vk ) ' k= 1
(12-29)
Section 1 2 .6
Correspondence Ana lysis
713
where the A k are the singular values and the I X 1 vectors u k and the J X 1 vectors vk are the corresE._9 nding singular vectors of the I X J matrix D � 1 12 (P - rc ' ) D�112 • Here Ak = A k + 1 , u k = uk + 1 , and vk = vk + 1 for k = 1, . . . , 1 - 1 .
Proof. We first consider a scaled version B = D_r^{-1/2} P D_c^{-1/2} of the correspondence matrix P. According to Result 2A.16, the best low rank = s approximation B̂ to D_r^{-1/2} P D_c^{-1/2} is given by the first s terms in the singular-value decomposition

D_r^{-1/2} P D_c^{-1/2} = Σ_{k=1}^{J} λ̃_k ũ_k ṽ_k'    (12-30)

where

λ̃_k ũ_k = (D_r^{-1/2} P D_c^{-1/2}) ṽ_k

and

|(D_r^{-1/2} P D_c^{-1/2})(D_r^{-1/2} P D_c^{-1/2})' - λ̃_k² I| = 0   for   k = 1, ..., J    (12-31)

The approximation to P is then given by

P̂ = Σ_{k=1}^{s} λ̃_k (D_r^{1/2} ũ_k)(D_c^{1/2} ṽ_k)'

and, by Result 2A.16, the error of approximation is Σ_{k=s+1}^{J} λ̃_k².

Whatever the correspondence matrix P, the term rc' always provides a (the best) rank one approximation. This corresponds to the assumption of independence of the rows and columns. To see this, let ũ_1 = D_r^{1/2} 1_I and ṽ_1 = D_c^{1/2} 1_J, where 1_I is an I × 1 and 1_J a J × 1 vector of 1's. We verify that (12-31) holds for these choices. We have

ũ_1'(D_r^{-1/2} P D_c^{-1/2}) = (D_r^{1/2} 1_I)'(D_r^{-1/2} P D_c^{-1/2}) = 1_I' P D_c^{-1/2} = c' D_c^{-1/2} = ṽ_1'

and

(D_r^{-1/2} P D_c^{-1/2}) ṽ_1 = (D_r^{-1/2} P D_c^{-1/2})(D_c^{1/2} 1_J) = D_r^{-1/2} P 1_J = D_r^{-1/2} r = ũ_1

That is,

(ũ_1, ṽ_1) = (D_r^{1/2} 1_I, D_c^{1/2} 1_J)    (12-32)

are singular vectors associated with singular value λ̃_1 = 1. For any correspondence matrix P, the common term in every expansion is

λ̃_1 (D_r^{1/2} ũ_1)(D_c^{1/2} ṽ_1)' = D_r 1_I 1_J' D_c = rc'

Therefore, we have established the first approximation, and (12-30) can always be expressed as

P = rc' + Σ_{k=2}^{J} λ̃_k (D_r^{1/2} ũ_k)(D_c^{1/2} ṽ_k)'

Because of the common term, the problem can be rephrased in terms of P - rc' and its scaled version D_r^{-1/2}(P - rc')D_c^{-1/2}. By the orthogonality of the singular vectors of D_r^{-1/2} P D_c^{-1/2}, we have ũ_k'(D_r^{1/2} 1_I) = 0 and ṽ_k'(D_c^{1/2} 1_J) = 0 for k > 1, so

D_r^{-1/2}(P - rc')D_c^{-1/2} = Σ_{k=2}^{J} λ̃_k ũ_k ṽ_k'

is the singular-value decomposition of D_r^{-1/2}(P - rc')D_c^{-1/2} in terms of the singular values and vectors obtained from D_r^{-1/2} P D_c^{-1/2}. Converting to singular values and vectors λ_k, u_k, and v_k from D_r^{-1/2}(P - rc')D_c^{-1/2} only amounts to changing k to k - 1, so λ_k = λ̃_{k+1}, u_k = ũ_{k+1}, and v_k = ṽ_{k+1} for k = 1, ..., J - 1. In terms of the singular-value decomposition for D_r^{-1/2}(P - rc')D_c^{-1/2}, the expansion for P - rc' takes the form

P - rc' = Σ_{k=1}^{J-1} λ_k (D_r^{1/2} u_k)(D_c^{1/2} v_k)'    (12-33)

The best rank K approximation to D_r^{-1/2}(P - rc')D_c^{-1/2} is given by Σ_{k=1}^{K} λ_k u_k v_k'. Then, the best approximation to P - rc' is

P - rc' ≈ Σ_{k=1}^{K} λ_k (D_r^{1/2} u_k)(D_c^{1/2} v_k)'    (12-34)  •
Remark. Note that the vectors D_r^{1/2} u_k and D_c^{1/2} v_k in the expansion (12-34) of P - rc' need not have length 1 but satisfy the scaling

(D_r^{1/2} u_k)' D_r^{-1} (D_r^{1/2} u_k) = u_k' u_k = 1
(D_c^{1/2} v_k)' D_c^{-1} (D_c^{1/2} v_k) = v_k' v_k = 1

Because of this scaling, the expansions in Result 12.1 have been called a generalized singular-value decomposition.

Let Λ, U = [u_1, ..., u_J] and V = [v_1, ..., v_J] be the matrices of singular values and vectors obtained from D_r^{-1/2}(P - rc')D_c^{-1/2}. It is usual in correspondence analysis to plot the first two or three columns of F = D_r^{-1}(D_r^{1/2} U)Λ and G = D_c^{-1}(D_c^{1/2} V)Λ, or λ_k D_r^{-1/2} u_k and λ_k D_c^{-1/2} v_k, for k = 1, 2, and maybe 3.

The joint plot of the coordinates in F and G is called a symmetric map (see Greenacre [13]) since the points representing the rows and columns have the same normalization, or scaling, along the dimensions of the solution. That is, the geometry for the row points is identical to the geometry for the column points.
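The coordinates F and G are easy to obtain from the SVD of D_r^{-1/2}(P - rc')D_c^{-1/2}; a sketch, using the pottery counts of Table 12.8 as input:

```python
import numpy as np

def correspondence(X):
    # Row coordinates F, column coordinates G, and singular values lam
    # from the SVD of D_r^{-1/2}(P - rc')D_c^{-1/2}.
    P = X / X.sum()                                   # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)               # row and column sums
    A = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, lam, Vt = np.linalg.svd(A, full_matrices=False)
    F = (U / np.sqrt(r)[:, None]) * lam               # lam_k * D_r^{-1/2} u_k
    G = (Vt.T / np.sqrt(c)[:, None]) * lam            # lam_k * D_c^{-1/2} v_k
    return F, G, lam

# Table 12.8: sites P0-P6 (rows) by pottery types A-D (columns)
pottery = np.array([[30., 10., 10., 39.],
                    [53.,  4., 16.,  2.],
                    [73.,  1., 41.,  1.],
                    [20.,  6.,  1.,  4.],
                    [46., 36., 37., 13.],
                    [45.,  6., 59., 10.],
                    [16., 28., 169., 5.]])
F, G, lam = correspondence(pottery)
```

The total inertia Σ λ_k² equals the usual chi-square statistic divided by n, which gives a quick numerical check of the decomposition.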
Example 12.18 (Calculations for correspondence analysis)

Consider the 3 × 2 contingency table

         B1    B2   Total
A1       24    12     36
A2       16    48     64
A3       60    40    100
Total   100   100    200

The correspondence matrix is

P = [ .12  .06 ]
    [ .08  .24 ]
    [ .30  .20 ]

with marginal totals c' = [.5, .5] and r' = [.18, .32, .50]. The negative square-root matrices are

D_r^{-1/2} = √2 diag(1/.6, 1/.8, 1)   and   D_c^{-1/2} = √2 diag(1, 1)

Then

P - rc' = [ .12  .06 ]   [ .09  .09 ]   [  .03  -.03 ]
          [ .08  .24 ] - [ .16  .16 ] = [ -.08   .08 ]
          [ .30  .20 ]   [ .25  .25 ]   [  .05  -.05 ]

The scaled version of this matrix is

A = D_r^{-1/2}(P - rc')D_c^{-1/2} = [  .1  -.1 ]
                                    [ -.2   .2 ]
                                    [  .1  -.1 ]

Since I > J, the squared singular values and the v_k are determined from

A'A = [  .06  -.06 ]
      [ -.06   .06 ]

It is easily checked that λ₁² = .12 and λ₂² = 0, since J - 1 = 1, and that

v₁ = [ 1/√2, -1/√2 ]'

Further,

AA' = [  .02  -.04   .02 ]
      [ -.04   .08  -.04 ]
      [  .02  -.04   .02 ]

A computer calculation confirms that the single nonzero eigenvalue is λ₁² = .12, so that the singular value has absolute value λ₁ = √.12 and, as you can easily check,

u₁ = [ 1/√6, -2/√6, 1/√6 ]'

The expansion of P - rc' is then the single term

λ₁(D_r^{1/2}u₁)(D_c^{1/2}v₁)' = √.12 diag(.6/√2, .8/√2, 1/√2) [ 1/√6, -2/√6, 1/√6 ]' [ 1/2, -1/2 ]

                              = [  .03  -.03 ]
                                [ -.08   .08 ]    check
                                [  .05  -.05 ]

since D_c^{1/2}v₁ = diag(1/√2, 1/√2)[ 1/√2, -1/√2 ]' = [ 1/2, -1/2 ]'.

There is only one vector to plot:

λ₁ D_r^{-1/2} u₁ = √.12 · √2 diag(1/.6, 1/.8, 1) [ 1/√6, -2/√6, 1/√6 ]' = [ 1/3, -1/2, 1/5 ]'

and

λ₁ D_c^{-1/2} v₁ = √.12 · √2 diag(1, 1) [ 1/√2, -1/√2 ]' = √.12 [ 1, -1 ]'  •
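The calculations in this example are easily verified numerically. A sketch (the SVD may flip the sign of a singular-vector pair, so absolute values are the meaningful quantities):

```python
import numpy as np

# Contingency table of the example (rows A1, A2, A3; columns B1, B2)
X = np.array([[24., 12.], [16., 48.], [60., 40.]])
P = X / X.sum()                                     # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)                 # r' = (.18,.32,.50), c' = (.5,.5)
A = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # D_r^{-1/2}(P - rc')D_c^{-1/2}
U, lam, Vt = np.linalg.svd(A, full_matrices=False)

row_coords = lam[0] * U[:, 0] / np.sqrt(r)          # lam_1 D_r^{-1/2} u_1
col_coords = lam[0] * Vt[0] / np.sqrt(c)            # lam_1 D_c^{-1/2} v_1
```

The single nonzero squared singular value is λ₁² = .12, and the plotted row coordinates are (1/3, -1/2, 1/5), up to a common sign.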
There is a second way to define contingency analysis. Following Greenacre [13], we call the preceding approach the matrix approximation method and the approach to follow the profile approximation method. We illustrate the profile approximation method using the row profiles; however, an analogous solution results if we were to begin with the column profiles.

Algebraically, the row profiles are the rows of the matrix D_r^{-1} P, and contingency analysis can be defined as the approximation of the row profiles by points in a low-dimensional space. Consider approximating the row profiles by the matrix P*. Using the square-root matrices D_r^{1/2} and D_c^{1/2} defined in (12-27), we can write

(D_r^{-1} P - P*) D_c^{-1/2} = D_r^{-1/2} (D_r^{-1/2} P - D_r^{1/2} P*) D_c^{-1/2}

and the least squares criterion (12-28) can be written, with p̃_ij = p_ij / r_i, as

Σ_i Σ_j r_i (p̃_ij - p*_ij)² / c_j = tr[ D_r^{1/2} (D_r^{-1/2} P - D_r^{1/2} P*) D_c^{-1/2} D_c^{-1/2} (D_r^{-1/2} P - D_r^{1/2} P*)' D_r^{1/2} ]    (12-35)

Minimizing the last expression for the trace in (12-35) is precisely the first minimization problem treated in the proof of Result 12.1. By (12-30), D_r^{-1/2} P D_c^{-1/2} has the singular-value decomposition

D_r^{-1/2} P D_c^{-1/2} = Σ_{k=1}^{J} λ̃_k ũ_k ṽ_k'    (12-36)
The best rank K approximation is obtained by using the first K terms of this expansion. Since, by (12-35), we have D_r^{-1/2}PD_c^{-1/2} approximated by D_r^{1/2}P*D_c^{-1/2}, we left-multiply by D_r^{-1/2} and right-multiply by D_c^{1/2} to obtain the generalized singular-value decomposition

D_r^{-1}P = Σ_{k=1}^{J} λ_k D_r^{-1/2}u_k(D_c^{1/2}v_k)'    (12-37)

where, from (12-32), (u₁, v₁) = (D_r^{1/2}1, D_c^{1/2}1) are singular vectors associated with singular value λ₁ = 1. Since D_r^{-1/2}(D_r^{1/2}1) = 1 and (D_c^{1/2}1)'D_c^{1/2} = c', the leading term in the decomposition (12-37) is 1c'. Consequently, in terms of the singular values and vectors from D_r^{-1/2}PD_c^{-1/2}, the reduced rank K < J approximation to the row profiles D_r^{-1}P is

P* ≈ 1c' + Σ_{k=2}^{K} λ_k D_r^{-1/2}u_k(D_c^{1/2}v_k)'    (12-38)

In terms of the singular values and vectors λ̃_k, ũ_k, and ṽ_k obtained from D_r^{-1/2}(P - rc')D_c^{-1/2}, we can write

P* - 1c' ≈ Σ_{k=1}^{K-1} λ̃_k D_r^{-1/2}ũ_k(D_c^{1/2}ṽ_k)'

(Row profiles for the archaeological data in Table 12.8 are shown in Figure 12.20.)
Inertia

Total inertia is a measure of the variation in the count data and is defined as the weighted sum of squares

Σ_i Σ_j (p_ij - r_i c_j)²/(r_i c_j) = Σ_{k=1}^{J-1} λ̃_k²    (12-39)

where the λ̃_k are the singular values obtained from the singular-value decomposition3 of D_r^{-1/2}(P - rc')D_c^{-1/2} (see the proof of Result 12.1).

The best reduced rank K < J approximation to the centered matrix P - rc' (the K-dimensional solution) has inertia Σ_{k=1}^{K} λ̃_k². The residual inertia (variation) not accounted for by the rank K solution is equal to the sum of squares of the remaining singular values: λ̃²_{K+1} + λ̃²_{K+2} + ... + λ̃²_{J-1}. For plots, the inertia associated with dimension k, λ̃_k², is ordinarily displayed along the kth coordinate axis, as in Figure 12.21 for k = 1, 2.

3 Total inertia is related to the chi-square measure of association in a two-way contingency table,

χ² = Σ_{i,j} (O_ij - E_ij)²/E_ij

Here O_ij = x_ij is the observed frequency and E_ij is the expected frequency for the ijth cell. In our context, if the row variable is independent of (unrelated to) the column variable, E_ij = n r_i c_j, and the total inertia is χ²/n.
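The relation between total inertia and the chi-square statistic in the footnote can be illustrated with a small table of counts. The numbers below are hypothetical, chosen only so that the marginals match the worked example earlier in this section; the code is our sketch, not part of the original text.

```python
import numpy as np

# A small two-way table of counts (hypothetical; marginals give
# r = (.18, .32, .50) and c = (.5, .5), as in the example above)
X = np.array([[24, 12],
              [16, 48],
              [60, 40]], dtype=float)
n = X.sum()
P = X / n
r = P.sum(axis=1)
c = P.sum(axis=0)

# Total inertia as the weighted sum of squares (12-39)
inertia = (((P - np.outer(r, c)) ** 2) / np.outer(r, c)).sum()

# Pearson chi-square statistic for the same table
E = n * np.outer(r, c)
chi2 = ((X - E) ** 2 / E).sum()

# Inertia equals chi2 / n, and also the sum of squared singular values
B = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
s = np.linalg.svd(B, compute_uv=False)
```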
Interpretation in Two Dimensions

Since the inertia is a measure of the data table's total variation, how do we interpret a large value for the proportion

(λ̃₁² + λ̃₂²) / Σ_{k=1}^{J-1} λ̃_k² ?

Geometrically, we say that the associations in the centered data are well represented by points in a plane, and this best approximating plane accounts for nearly all the variation in the data beyond that accounted for by the rank 1 solution (independence model). Algebraically, we say that the approximation

P - rc' ≈ λ̃₁(D_r^{1/2}ũ₁)(D_c^{1/2}ṽ₁)' + λ̃₂(D_r^{1/2}ũ₂)(D_c^{1/2}ṽ₂)'

is very good or, equivalently, that

total inertia ≈ λ̃₁² + λ̃₂²
Final Comments

Correspondence analysis is primarily a graphical technique designed to represent associations in a low-dimensional space. It can be regarded as a scaling method, and can be viewed as a complement to other methods such as multidimensional scaling (Section 12.5) and biplots (Section 12.7). Correspondence analysis also has links to principal component analysis (Chapter 8) and canonical correlation analysis (Chapter 10). The book by Greenacre [14] is one choice for learning more about correspondence analysis.
12.7 BIPLOTS FOR VIEWING SAMPLING UNITS AND VARIABLES

A biplot is a graphical representation of the information in an n × p data matrix. The bi- refers to the two kinds of information contained in a data matrix. The information in the rows pertains to samples or sampling units and that in the columns pertains to variables.

When there are only two variables, scatter plots can represent the information on both the sampling units and the variables in a single diagram. This permits the visual inspection of the position of one sampling unit relative to another and the relative importance of each of the two variables to the position of any unit.

With several variables, one can construct a matrix array of scatter plots, but there is no one single plot of the sampling units. On the other hand, a two-dimensional plot of the sampling units can be obtained by graphing the first two principal components, as in Section 8.4. The idea behind biplots is to add the information about the variables to the principal component graph.

Figure 12.22 gives an example of a biplot for the public utilities data in Table 12.5. You can see how the companies group together and which variables contribute to their positioning within this representation. For instance, X4 = annual load factor and X8 = total fuel costs are primarily responsible for the grouping of the mostly coastal companies in the lower right. The two variables X1 = fixed-charge ratio and X2 = rate of return on capital put the Florida and Louisiana companies together.
[Figure 12.22  A biplot of the data on public utilities.]
Constructing Biplots

The construction of a biplot proceeds from the sample principal components. According to Result 8A.1, the best two-dimensional approximation to the data matrix X approximates the jth observation x_j in terms of the sample values of the first two principal components. In particular,

x_j ≈ x̄ + ŷ_j1 ê₁ + ŷ_j2 ê₂    (12-40)

where ê₁ and ê₂ are the first two eigenvectors of S or, equivalently, of X_c'X_c = (n - 1)S. Here X_c denotes the mean corrected data matrix with rows (x_j - x̄)'. The eigenvectors determine a plane, and the coordinates of the jth unit (row) are the pair of values of the principal components, (ŷ_j1, ŷ_j2).

To include the information on the variables in this plot, we consider the pair of eigenvectors (ê₁, ê₂). These eigenvectors are the coefficient vectors for the first two sample principal components. Consequently, each row of the matrix Ê = [ê₁, ê₂] positions a variable in the graph, and the magnitudes of the coefficients (the coordinates of the variable) show the weightings that variable has in each principal component. The positions of the variables in the plot are indicated by a vector.
Usually, statistical computer programs include a multiplier so that the lengths of all of the vectors can be suitably adjusted and plotted on the same axes as the sampling units. Units that are close to a variable likely have high values on that variable. To interpret a new point x₀, we plot its principal components Ê'(x₀ - x̄).

A direct approach to obtaining a biplot starts from the singular value decomposition (see Result 2A.15), which first expresses the n × p mean corrected matrix X_c as

X_c  =  U    Λ    V'    (12-41)
(n×p)  (n×p)(p×p)(p×p)

where Λ = diag(λ₁, λ₂, ..., λ_p) and V is an orthogonal matrix whose columns are the eigenvectors of X_c'X_c = (n - 1)S. That is, V = Ê = [ê₁, ê₂, ..., ê_p]. Multiplying (12-41) on the right by Ê, we find

X_c Ê = U Λ    (12-42)

where the jth row of the left-hand side,

[ŷ_j1, ŷ_j2, ..., ŷ_jp] = [ê₁'(x_j - x̄), ê₂'(x_j - x̄), ..., ê_p'(x_j - x̄)]

is just the value of the principal components for the jth item. That is, UΛ contains all of the values of the principal components, while V = Ê contains the coefficients that define the principal components.

The best rank 2 approximation to X_c is obtained by replacing Λ by Λ* = diag(λ₁, λ₂, 0, ..., 0). This result, called the Eckart-Young theorem, was established in Result 8A.1. The approximation is then

X_c ≈ U Λ* V' = [ŷ₁, ŷ₂] [ ê₁' ]
                         [ ê₂' ]    (12-43)
where ŷ₁ is the n × 1 vector of values of the first principal component and ŷ₂ is the n × 1 vector of values of the second principal component.

In the biplot, each row of the data matrix, or item, is represented by the point located by the pair of values of the principal components. The ith column of the data matrix, or variable, is represented as an arrow from the origin to the point with coordinates (ê_{1i}, ê_{2i}), the entries in the ith column of the second matrix [ê₁, ê₂]' in the approximation (12-43). This scale may not be compatible with that of the principal components, so an arbitrary multiplier can be introduced that adjusts all of the vectors by the same amount.

The idea of a biplot, to represent both units and variables in the same plot, extends to canonical correlation analysis, multidimensional scaling, and even more complicated nonlinear techniques. (See [11].)

Example 12.19 (A biplot of universities and their characteristics)

Table 12.9 gives the data on some universities for certain variables used to compare or rank major universities. These variables include X1 = average SAT score of new freshmen, X2 = percentage of new freshmen in top 10% of high
TABLE 12.9  DATA ON UNIVERSITIES

University        SAT    Top10  Accept  SFRatio  Expenses  Grad
Harvard          14.00     91     14      11      39.525    97
Princeton        13.75     91     14       8      30.220    95
Yale             13.75     95     19      11      43.514    96
Stanford         13.60     90     20      12      36.450    93
MIT              13.80     94     30      10      34.870    91
Duke             13.15     90     30      12      31.585    95
CalTech          14.15    100     25       6      63.575    81
Dartmouth        13.40     89     23      10      32.162    95
Brown            13.10     89     22      13      22.704    94
JohnsHopkins     13.05     75     44       7      58.691    87
UChicago         12.90     75     50      13      38.380    87
UPenn            12.85     80     36      11      27.553    90
Cornell          12.80     83     33      13      21.864    90
Northwestern     12.60     85     39      11      28.052    89
Columbia         13.10     76     24      12      31.510    88
NotreDame        12.55     81     42      13      15.122    94
UVirginia        12.25     77     44      14      13.349    92
Georgetown       12.55     74     24      12      20.126    92
CarnegieMellon   12.60     62     59       9      25.026    72
UMichigan        11.80     65     68      16      15.470    85
UCBerkeley       12.40     95     40      17      15.140    78
UWisconsin       10.85     40     69      15      11.857    71
PennState        10.81     38     54      18      10.185    80
Purdue           10.05     28     90      19       9.066    69
TexasA&M         10.75     49     67      25       8.704    67

Source: U.S. News & World Report, September 18, 1995, p. 126.
school class, X3 = percentage of applicants accepted, X4 = student-faculty ratio, X5 = estimated annual expenses, and X6 = graduation rate (%).

Because two of the variables, SAT and Expenses, are on a much different scale from that of the other variables, we standardize the data and base our biplot on the matrix of standardized observations z_j. The biplot is given in Figure 12.23.

Notice how Cal Tech and Johns Hopkins are off by themselves; the variable Expense is mostly responsible for this positioning. The large state universities in our sample are to the left in the biplot, and most of the private schools are on the right. Large values for the variables SAT, Top10, and Grad are associated with the private school group. Northwestern lies in the middle of the biplot.    •
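The biplot construction in (12-41)-(12-43) amounts to a few lines of linear algebra. The sketch below is our own illustration; it uses a randomly generated data matrix in place of the university data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))          # hypothetical n x p data matrix
Xc = X - X.mean(axis=0)               # mean-corrected data

# Singular-value decomposition (12-41): Xc = U Lambda V'
U, lam, Vt = np.linalg.svd(Xc, full_matrices=False)

# Items (rows): first two principal component scores, U Lambda in (12-42)
item_coords = (U * lam)[:, :2]        # identical to Xc @ Vt[:2].T

# Variables (columns): entries of the first two eigenvectors, V = E-hat
var_coords = Vt[:2].T                 # row i locates variable i in the plot
```

In a plot, each row of `item_coords` would be drawn as a point and each row of `var_coords` as an arrow from the origin, usually after multiplying the arrows by a common scale factor so that both fit on the same axes.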
[Figure 12.23  A biplot of the data on universities.]

12.8 PROCRUSTES ANALYSIS: A METHOD FOR COMPARING CONFIGURATIONS

Starting with a given n × n matrix of distances D, or similarities S, that relate n objects, two or more configurations can be obtained using different techniques. The possible methods include both metric and nonmetric multidimensional scaling. The question naturally arises as to how well the solutions coincide. Figures 12.18 and 12.19 in Example 12.16 respectively give the metric multidimensional scaling (principal coordinate analysis) and nonmetric multidimensional scaling solutions for the data on universities. The two configurations appear to be quite similar, but a quantitative measure would be useful. A numerical comparison of two configurations, obtained by moving one configuration so that it aligns best with the other, is called Procrustes analysis, after the innkeeper Procrustes, in Greek mythology, who would either stretch or lop off customers' limbs so they would fit his bed.
Constructing the Procrustes Measure of Agreement

Suppose the n × p matrix X* contains the coordinates of the n points obtained for plotting with technique 1 and the n × q matrix Y* contains the coordinates from technique 2, where q < p. By adding columns of zeros to Y*, if necessary, we can assume that X* and Y* both have the same dimension n × p. To determine how compatible the two configurations are, we move, say, the second configuration to match the first by shifting each point by the same amount and rotating or reflecting the configuration about the coordinate axes.4

Mathematically, we translate by a vector b and multiply by an orthogonal matrix Q so that the coordinates of the jth point y_j are transformed to

Q y_j + b

The vector b and orthogonal matrix Q are then varied in order to minimize the sum, over all n points, of squared distances

(x_j - Q y_j - b)'(x_j - Q y_j - b)    (12-44)

between x_j and the transformed coordinates Q y_j + b obtained for the second technique. We take, as a measure of fit, or agreement, between the two configurations, the residual sum of squares

PR² = min_{Q,b} Σ_{j=1}^{n} (x_j - Q y_j - b)'(x_j - Q y_j - b)    (12-45)

The next result shows how to evaluate this Procrustes residual sum of squares measure of agreement and determines the Procrustes rotation of Y* relative to X*.
Result 12.2. Let the n × p configurations X* and Y* both be centered so that all rows have mean zero. Then

PR² = Σ_{j=1}^{n} x_j'x_j + Σ_{j=1}^{n} y_j'y_j - 2 Σ_{i=1}^{p} λ_i
    = tr[X*X*'] + tr[Y*Y*'] - 2 tr[Λ]    (12-46)

where Λ = diag(λ₁, λ₂, ..., λ_p) and the minimizing transformation is

Q̂ = Σ_{i=1}^{p} v_i u_i' = VU',    b̂ = 0    (12-47)

4 Sibson [27] has proposed a numerical measure of the agreement between two configurations, given by the coefficient

γ = 1 - [tr(Y*'X*X*'Y*)^{1/2}]² / (tr(X*'X*) tr(Y*'Y*))

For identical configurations, γ = 0. If necessary, γ can be computed after a Procrustes analysis has been completed.
Here Λ, U, and V are obtained from the singular-value decomposition

Σ_{j=1}^{n} y_j x_j'  =  Y*' X*   =   U    Λ    V'
                        (p×n)(n×p)  (p×p)(p×p)(p×p)

Proof. Because the configurations are centered to have zero means, Σ_{j=1}^{n} x_j = 0 and Σ_{j=1}^{n} y_j = 0, we have

Σ_{j=1}^{n} (x_j - Qy_j - b)'(x_j - Qy_j - b) = Σ_{j=1}^{n} (x_j - Qy_j)'(x_j - Qy_j) + n b'b

The last term is nonnegative, so the best fit occurs for b̂ = 0. Consequently, we need only consider

PR² = min_Q Σ_{j=1}^{n} (x_j - Qy_j)'(x_j - Qy_j)
    = Σ_{j=1}^{n} x_j'x_j + Σ_{j=1}^{n} y_j'y_j - 2 max_Q Σ_{j=1}^{n} x_j'Qy_j

Using x_j'Qy_j = tr[Qy_j x_j'], we find that the expression being maximized becomes

Σ_{j=1}^{n} x_j'Qy_j = Σ_{j=1}^{n} tr[Qy_j x_j'] = tr[Q Σ_{j=1}^{n} y_j x_j']

By the singular-value decomposition,

Σ_{j=1}^{n} y_j x_j' = Y*'X* = UΛV' = Σ_{i=1}^{p} λ_i u_i v_i'

where U = [u₁, u₂, ..., u_p] and V = [v₁, v₂, ..., v_p] are p × p orthogonal matrices. Consequently,

Σ_{j=1}^{n} x_j'Qy_j = tr[Q(Σ_{i=1}^{p} λ_i u_i v_i')] = Σ_{i=1}^{p} λ_i tr[Qu_i v_i']

The variable quantity in the ith term,

tr[Qu_i v_i'] = v_i'Qu_i

has an upper bound of 1, as can be seen by applying the Cauchy-Schwarz inequality (2-48) with b = Q'v_i and d = u_i. That is, since Q is orthogonal,

v_i'Qu_i ≤ √(v_i'QQ'v_i) √(u_i'u_i) = √(v_i'v_i) × 1 = 1
Each of these p terms can be maximized by the same choice Q̂ = VU'. With this choice,

v_i'Q̂u_i = v_i'VU'u_i = [0, ..., 0, 1, 0, ..., 0] [ 0 ]
                                                  [ ⋮ ]
                                                  [ 1 ]  =  1
                                                  [ ⋮ ]
                                                  [ 0 ]

Therefore,

-2 max_Q Σ_{j=1}^{n} x_j'Qy_j = -2(λ₁ + λ₂ + ... + λ_p)

Finally, we verify that Q̂Q̂' = VU'UV' = VI_pV' = I_p, so Q̂ is a p × p orthogonal matrix, as required.    ■

Example 12.20
(Procrustes analysis of the data on universities)

Two configurations, produced by metric and nonmetric multidimensional scaling, of the data on universities are given in Example 12.16. The two configurations appear to be quite close. There is a two-dimensional array of coordinates for each of the two scaling methods. Initially, the sum of squared distances is

Σ_{j=1}^{25} (x_j - y_j)'(x_j - y_j) = 3.862

A computer calculation gives

U = [ -.9990   .0448 ]      V = [ -1.0000    .0076 ]
    [  .0448   .9990 ]          [   .0076   1.0000 ]

Λ = [ 114.9439    0.000  ]
    [   0.000    21.3673 ]

According to Result 12.2, to better align these two solutions, we multiply the nonmetric scaling solution by the orthogonal matrix

Q̂ = Σ_{i=1}^{2} v_i u_i' = VU' = [  .9993   .0372 ]
                                 [ -.0372   .9993 ]

This corresponds to a clockwise rotation of the nonmetric solution by about 2 degrees. After rotation, the sum of squared distances, 3.862, is reduced to the Procrustes measure of fit

PR² = Σ_{j=1}^{25} x_j'x_j + Σ_{j=1}^{25} y_j'y_j - 2 Σ_{i=1}^{2} λ_i = 3.673    •
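Result 12.2 translates directly into code. The sketch below is ours; the small configuration and the 2-degree rotation are made up for illustration. It computes the minimizing rotation Q-hat = VU' and the Procrustes measure PR² for two centered configurations.

```python
import numpy as np

def procrustes_fit(X, Y):
    """Best orthogonal Q aligning centered configuration Y to X (Result 12.2)."""
    U, lam, Vt = np.linalg.svd(Y.T @ X)       # Y*'X* = U Lambda V'
    Q = Vt.T @ U.T                            # minimizing Q-hat = V U'
    PR2 = np.trace(X @ X.T) + np.trace(Y @ Y.T) - 2 * lam.sum()  # (12-46)
    return Q, PR2

X = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, -2.0]])  # already centered
theta = np.deg2rad(2.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Y = X @ R                       # the same configuration, rotated 2 degrees

Q, PR2 = procrustes_fit(X, Y)   # Y @ Q.T recovers X, and PR2 is near 0
```

Because Y here is an exact rotation of X, the fit is perfect; for two genuinely different ordinations, PR² would be the positive residual reported in the examples.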
Example 12.21 (Procrustes analysis and additional ordinations of data on forests)

Data were collected on the populations of eight species of trees growing on ten upland sites in southern Wisconsin. These data are shown in Table 12.10. The metric, or principal coordinate, solution and nonmetric multidimensional scaling solution are shown in Figures 12.24 and 12.25.

TABLE 12.10  WISCONSIN FOREST DATA

                            Site
Tree           1   2   3   4   5   6   7   8   9   10
BurOak         9   8   3   5   6   0   5   0   0    0
BlackOak       8   9   8   7   0   0   0   0   0    0
WhiteOak       5   4   9   9   7   7   4   6   0    2
RedOak         3   4   0   6   9   8   7   6   4    3
AmericanElm    2   2   4   5   6   0   5   0   2    5
Basswood       0   0   0   0   2   7   6   6   7    6
Ironwood       0   0   0   0   0   0   7   4   6    5
SugarMaple     0   0   0   0   0   5   4   8   8    9

Source: See [21].
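The metric (principal coordinate) solution mentioned above can be computed directly from a distance matrix. The following sketch is ours, using the site counts of Table 12.10 with ordinary Euclidean distances between sites (the text does not specify which distance was actually used, so treat the choice as illustrative).

```python
import numpy as np

# Species-by-site counts from Table 12.10, transposed so sites are rows
counts = np.array([
    [9, 8, 3, 5, 6, 0, 5, 0, 0, 0],   # BurOak
    [8, 9, 8, 7, 0, 0, 0, 0, 0, 0],   # BlackOak
    [5, 4, 9, 9, 7, 7, 4, 6, 0, 2],   # WhiteOak
    [3, 4, 0, 6, 9, 8, 7, 6, 4, 3],   # RedOak
    [2, 2, 4, 5, 6, 0, 5, 0, 2, 5],   # AmericanElm
    [0, 0, 0, 0, 2, 7, 6, 6, 7, 6],   # Basswood
    [0, 0, 0, 0, 0, 0, 7, 4, 6, 5],   # Ironwood
    [0, 0, 0, 0, 0, 5, 4, 8, 8, 9],   # SugarMaple
], dtype=float)
X = counts.T                                    # 10 sites x 8 species

# Euclidean distances between sites (illustrative choice)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# Principal coordinate analysis (metric MDS): double-center -D^2/2
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
eigval, eigvec = np.linalg.eigh(B)
order = np.argsort(eigval)[::-1]                # largest eigenvalues first
coords = eigvec[:, order[:2]] * np.sqrt(eigval[order[:2]])
```

The two columns of `coords` give the site positions for a plot analogous to Figure 12.24 (up to rotation and reflection, which the method leaves arbitrary).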
[Figure 12.24  Metric multidimensional scaling of the data on forests.]
[Figure 12.25  Nonmetric multidimensional scaling of the data on forests.]
Using the coordinates of the points in Figures 12.24 and 12.25, we obtain the initial sum of squared distances for fit:

Σ_{j=1}^{10} (x_j - y_j)'(x_j - y_j) = 8.547

A computer calculation gives

U = [ -.9833  -.1821 ]      V = [ -1.0000   -.0001 ]
    [ -.1821   .9833 ]          [  -.0001   1.0000 ]

Λ = [ 43.3748   0.0000 ]
    [  0.0000  14.9103 ]

According to Result 12.2, to better align these two solutions, we multiply the nonmetric scaling solution by the orthogonal matrix

Q̂ = Σ_{i=1}^{2} v_i u_i' = VU' = [  .9833   .1821 ]
                                 [ -.1821   .9833 ]

This corresponds to a clockwise rotation of the nonmetric solution by about 10 degrees. After rotation, the sum of squared distances, 8.547, is reduced to the Procrustes measure of fit

PR² = Σ_{j=1}^{10} x_j'x_j + Σ_{j=1}^{10} y_j'y_j - 2 Σ_{i=1}^{2} λ_i = 6.599
We note that the sampling sites seem to fall along a curve in both pictures. This could lead to a one-dimensional nonlinear ordination of the data. A quadratic or other curve could be fit to the points. By adding a scale to the curve, we would obtain a one-dimensional ordination.

It is informative to view the Wisconsin forest data when both sampling units and variables are shown. A correspondence analysis applied to the data produces the plot in Figure 12.26. The biplot is shown in Figure 12.27. All of the plots tell similar stories. Sites 1-5 tend to be associated with species of oak trees, while sites 7-10 tend to be associated with basswood, ironwood, and sugar maples. American elm trees are distributed over most sites, but are more closely associated with the lower numbered sites. There is almost a continuum of sites distinguished by the different species of trees.    •
[Figure 12.26  The correspondence analysis plot of the data on forests.]
[Figure 12.27  The biplot of the data on forests.]
SUPPLEMENT 12A

Data Mining

INTRODUCTION

A very large sample in applications of traditional statistical methodology may mean 10,000 observations on, perhaps, 50 variables. Today, computer-based repositories known as data warehouses may contain many terabytes of data. For some organizations, corporate data have grown by a factor of 100,000 or more over the last few decades. The telecommunications, banking, pharmaceutical, and (package) shipping industries provide several examples of companies with huge databases. Consider the following illustration. If each of the approximately 17 million books in the Library of Congress contained a megabyte of text (roughly 450 pages) in MS Word format, then typing this collection of printed material into a computer database would consume about 17 terabytes of disk space. United Parcel Service (UPS) has a package-level detail database of about 17 terabytes to track its shipments.

For our purposes, data mining refers to the process associated with discovering patterns and relationships in extremely large data sets. That is, data mining is concerned with extracting a few nuggets of knowledge from a relative mountain of numerical information. From a business perspective, the nuggets of knowledge represent actionable information that can be exploited for a competitive advantage. Data mining is not possible without appropriate software and fast computers. Not surprisingly, many of the techniques discussed in this book, along with algorithms developed in the machine learning and artificial intelligence fields, play important roles in data mining. Companies with well-known statistical software packages now offer comprehensive data mining programs.5 In addition, special purpose programs such as CART have been used successfully in data mining applications.

Data mining has helped to identify new chemical compounds for prescription drugs, detect fraudulent claims and purchases, create and maintain individual customer relationships, design better engines and build appropriate inventories, create better medical procedures, improve process control, and develop effective credit scoring rules.

In traditional statistical applications, sample sizes are relatively small, data are carefully collected, sample results provide a basis for inference, anomalies are treated but are often not of immediate interest, and models are frequently highly

5 SAS Institute's data mining program is currently called Enterprise Miner. SPSS's data mining program is Clementine.
structured. In data mining, sample sizes can be huge; data are scattered and historical (routinely recorded); samples are used for training, validation, and testing (no formal inference); anomalies are of interest; and models are often unstructured. Moreover, data preparation (including data collection, assessment and cleaning, and variable definition and selection) is typically an arduous task and represents 60 to 80% of the data mining effort.

Data mining problems can be roughly classified into the following categories:

• Classification (discrete outcomes): Who is likely to move to another cellular phone service?
• Prediction (continuous outcomes): What is the appropriate appraised value for this house?
• Association/market basket analysis: Is skim milk typically purchased with low-fat cottage cheese?
• Clustering: Are there groups with similar buying habits?
• Description: On Thursdays, grocery store consumers often purchase corn chips and soft drinks together.

Given the nature of data mining problems, it should not be surprising that many of the statistical methods discussed in this book are part of comprehensive data mining software packages. Specifically, regression, discrimination and classification procedures (linear rules, logistic regression, decision trees such as those produced by CART), and clustering algorithms are important data mining tools. Other tools, whose discussion is beyond the scope of this book, include association rules, multivariate adaptive regression splines (MARS), the K-nearest neighbor algorithm, neural networks, genetic algorithms, and visualization.6

The Data Mining Process

Data mining is a process requiring a sequence of steps. The steps form a strategy that is not unlike the strategy associated with any model building effort. Specifically, data miners must

1. Define the problem and identify objectives.
2. Gather and prepare the appropriate data.
3. Explore the data for suspected associations, unanticipated characteristics, and obvious anomalies to gain understanding.
4. Clean the data and perform any variable transformation that seems appropriate.
5. Divide the data into training, validation, and, perhaps, test data sets.
6. Build the model on the training set.

6 For more information on data mining in general and data mining tools in particular, see the references at the end of this chapter.
7. Modify the model (if necessary) based on its performance with the validation data.
8. Assess the model by checking its performance on validation or test data. Compare the model outcomes with the initial objectives. Is the model likely to be useful?
9. Use the model.
10. Monitor the model performance. Are the results reliable, cost effective?

In practice, it is typically necessary to repeat one or more of these steps several times until a satisfactory solution is achieved. Data mining software suites such as Enterprise Miner and Clementine are typically organized so that the user can work sequentially through the steps listed and, in fact, can picture them on the screen as a process flow diagram.

Data mining requires a rich collection of tools and algorithms used by a skilled analyst with sound subject matter knowledge (or working with someone with sound subject matter knowledge) to produce acceptable results. Once established, any successful data mining effort is an ongoing exercise. New data must be collected and processed, the model must be updated or a new model developed, and, in general, adjustments made in light of new experience. The cost of a poor data mining effort is high, so careful model construction and evaluation is imperative.

Model Assessment

In the model development stage of data mining, several models may be examined simultaneously. In the example to follow, we briefly discuss the results of applying logistic regression, decision tree methodology, and a neural network to the problem of credit scoring (determining good credit risks) using a publicly available data set known as the German Credit data. Although the data miner can control the model inputs and certain parameters that govern the development of individual models, in most data mining applications there is little formal statistical inference.

Models are ordinarily assessed (and compared) by domain experts using descriptive devices such as confusion matrices, summary profit or loss numbers, lift charts, threshold charts, and other, mostly graphical, procedures.

The split of the very large initial data set into training, validation, and testing subsets allows potential models to be assessed with data that were not involved in model development. Thus, the training set is used to build models that are assessed on the validation (holdout) data set. If a model does not perform satisfactorily in the validation phase, it is retrained. Iteration between training and validation continues until satisfactory performance with validation data is achieved. At this point, a trained and validated model is assessed with test data. The test data set is ordinarily used once at the end of the modeling process to ensure an unbiased assessment of model performance. On occasion, the test data step is omitted and the final assessment is done with the validation sample, or by cross-validation.

An important assessment tool is the lift chart. Lift charts may be formatted in various ways, but all indicate improvement of the selected procedures (models) over what can be achieved by a baseline activity. The baseline activity often represents a prior conviction or a random assignment. Lift charts are particularly useful for comparing the performance of different models.
Lift is defined as

Lift = P(result | condition) / P(result)

If the result is independent of the condition, then Lift = 1. A value of Lift > 1 implies that the condition (generally a model or algorithm) leads to a greater probability of the desired result and, hence, that the condition is useful and potentially profitable. Different conditions can be compared by comparing their lift charts.
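As a small numeric illustration (our own sketch; only the 419-case validation total and its 299 Good risks come from the example that follows, while the top-decile counts are invented):

```python
# Lift of a scoring model for identifying Good credit risks, computed on a
# validation set of 419 applicants, 299 of them Good. Suppose the model's
# top decile of 42 applicants contains 38 Good risks (hypothetical count).
n_total, n_good = 419, 299
n_decile, n_good_decile = 42, 38

p_result = n_good / n_total                      # P(result)
p_result_given_cond = n_good_decile / n_decile   # P(result | condition)

lift = p_result_given_cond / p_result            # about 1.27 here
```

A lift of about 1.27 in the top decile says the model concentrates Good risks there noticeably better than random assignment would.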
Example 12.21 (A small-scale data mining exercise)

A publicly available data set known as the German Credit data7 contains observations on 20 variables for 1000 past applicants for credit. In addition, the resulting credit rating ("Good" or "Bad") for each applicant was recorded. The objective is to develop a credit scoring rule that can be used to determine if a new applicant is a good credit risk or a bad credit risk, based on values for one or more of the 20 explanatory variables. The 20 explanatory variables include CHECKING (checking account status), DURATION (duration of credit in months), HISTORY (credit history), AMOUNT (credit amount), EMPLOYED (present employment since), RESIDENT (present resident since), AGE (age in years), OTHER (other installment debts), INSTALLP (installment rate as % of disposable income), and so forth. Essentially, then, we must develop a function of several variables that allows us to classify a new applicant into one of two categories: Good or Bad.

We will develop a classification procedure using three approaches discussed in Section 11.8: logistic regression, classification trees, and neural networks. An abbreviated assessment will allow us to compare the performance of the three approaches on a validation data set. This data mining exercise is implemented using the general data mining process described earlier and SAS Enterprise Miner software.

In the full credit data set, 70% of the applicants were Good credit risks and 30% of the applicants were Bad credit risks. The initial data were divided into two sets for our purposes, a training set and a validation set. About 60% of the data (581 cases) were allocated to the training set and about 40% of the data (419 cases) were allocated to the validation set. The random sampling scheme employed ensured that each of the training and validation sets contained about 70% Good applicants and about 30% Bad applicants. The applicant credit risk profiles for the data sets follow.
          Credit Data   Training Data   Validation Data
Good:         700            401              299
Bad:          300            180              120
Total:       1000            581              419

7 At the time this supplement was written, the German Credit data were available in a sample data file accompanying SAS Enterprise Miner. Many other publicly available data sets can be downloaded from the following Web site: www.kdnuggets.com.
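The stratified allocation described above can be sketched as follows (our illustration; with exact 60/40 proportions per class the resulting counts are 600/400 rather than the 581/419 actually used in the example):

```python
import random

# Hypothetical stratified split: 1000 applicants (700 Good, 300 Bad)
# divided ~60/40 into training and validation sets while preserving
# the 70/30 class mix in each subset.
random.seed(12345)

applicants = [("Good", i) for i in range(700)] + [("Bad", i) for i in range(300)]

train, valid = [], []
for label in ("Good", "Bad"):
    group = [a for a in applicants if a[0] == label]
    random.shuffle(group)                 # randomize within each class
    cut = int(round(0.6 * len(group)))    # 60% of each class to training
    train += group[:cut]
    valid += group[cut:]
```

Splitting within each class (rather than from the pooled list) is what guarantees that both subsets keep roughly the 70% Good / 30% Bad profile.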
[Figure 12.28  The process flow diagram.]
Figure 12.28 shows the process flow diagram from the Enterprise Miner screen. The icons in the figure represent various activities in the data mining process. As examples, SAMPS10.DMAGECR contains the data; Data Partition allows the data to be split into training, validation, and testing subsets; Transform Variables, as the name implies, allows one to make variable transformations; the Regression, Tree, and Neural Network icons can each be opened to develop the individual models; and Assessment allows an evaluation of each predictive model in terms of predictive power, lift, profit or loss, and so on, and a comparison of all models.

The best model (with the training set parameters) can be used to score a new selection of applicants without a credit designation (SAMPS10.DMAGESCR). The results of this scoring can be displayed, in various ways, with Distribution Explorer.

For this example, the prior probabilities were set proportional to the data; consequently, P(Good) = .7 and P(Bad) = .3. The cost matrix was initially specified as follows:

                       Predicted (Decision)
                 Good (Accept)    Bad (Reject)
Actual   Good          0               $1
         Bad          $5                0
so that it is 5 times as costly to classify a Bad applicant as Good (Accept) as it is to classify a Good applicant as Bad (Reject). In practice, accepting a Good credit risk should result in a profit or, equivalently, a negative cost. To match this formulation more closely, we subtract $1 from the entries in the first row of the cost matrix to obtain the "realistic" cost matrix:

736 Chapter 12 Clustering, Distance Methods, and Ordination

                         Predicted (Decision)
                    Good (Accept)   Bad (Reject)
Actual   Good           -$1              0
         Bad             $5              0
This matrix yields the same decisions as the original cost matrix, but the results are easier to interpret relative to the expected cost objective function. For example, after further adjustments, a negative expected cost score may indicate a potential profit, so the applicant would be a Good credit risk.
Next, input variables need to be processed (perhaps transformed), models (or algorithms) must be specified, and required parameters must be set in all of the icons in the process flow diagram. Then the process can be executed up to any point in the diagram by clicking on an icon. All previous connected icons are run. For example, clicking on Score executes the process up to and including the Score icon. Results associated with individual icons can then be examined by clicking on the appropriate icon. We illustrate model assessment using lift charts. These lift charts, available in the Assessment icon, result from one execution of the process flow diagram in Figure 12.28.
Consider the logistic regression classifier. Using the logistic regression function determined with the training data, an expected cost can be computed for each case in the validation set. These expected cost "scores" can then be ordered from smallest to largest and partitioned into groups by the 10th, 20th, . . . , and 90th percentiles. The first percentile group then contains the 42 applicants (10% of 419) with the smallest negative expected costs (largest potential profits), the second percentile group contains the next 42 applicants (next 10%), and so on. (From a classification viewpoint, those applicants with negative expected costs might be classified as Good risks and those with nonnegative expected costs as Bad risks.) If the model has no predictive power, we would expect, approximately, a uniform distribution of, say, Good credit risks over the percentile groups.
That is, we would expect 10%, or .10(299) ≈ 30, Good credit risks among the 42 applicants in each of the percentile groups.
Once the validation data have been scored, we can count the number of Good credit risks (of the 42 applicants) actually falling in each percentile group. For example, of the 42 applicants in the first percentile group, 40 were actually Good risks, for a "captured response rate" of 40/299 = .133 or 13.3%. In this case, lift for the first percentile group can be calculated as the ratio of the number of Good predicted by the model to the number of Good from a random assignment, or

    Lift = 40/30 = 1.33

The lift value indicates the model assigns 10/299 = .033, or 3.3%, more Good
Figure 12.29 Cumulative lift chart for the logistic regression classifier.
risks to the first percentile group (largest negative expected cost) than would be assigned by chance.8
Lift statistics can be displayed as individual (noncumulative) values or as cumulative values. For example, 40 Good risks also occur in the second percentile group for the logistic regression classifier, and the cumulative lift for the first two percentile groups is

    Lift = (40 + 40) / (30 + 30) = 1.33
The cumulative lift chart for the logistic regression model is displayed in Figure 12.29. Lift and cumulative lift statistics can be determined for the classification tree tool and for the neural network tool. For each classifier, the entire data set is scored (expected costs computed), applicants are ordered from smallest score to largest score, and percentile groups are created. At this point, the lift calculations follow those outlined for the logistic regression method. The cumulative lift charts for all three classifiers are shown in Figure 12.30.
We see from Figure 12.30 that the neural network and the logistic regression have very similar predictive powers, and they both do better, in this case, than the classification tree. The classification tree, in turn, outperforms a random assignment. If this represented the end of the model building and assessment effort, one model would be picked (say, the neural network) to score a new set of applicants (without a credit risk designation) as Good (accept) or Bad (reject).

8 The lift numbers calculated here differ a bit from the numbers displayed in the lift diagrams to follow because of rounding.
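The cumulative lift computation described above can be sketched in a few lines of code. This is a minimal illustration (the function name and the synthetic data are our own, not Enterprise Miner's implementation):

```python
def cumulative_lift(scores, labels, n_groups=10):
    """Cumulative lift for the 'Good' class (labels: 1 = Good, 0 = Bad).

    Cases are ordered by score (smallest expected cost first) and split
    into equal-sized percentile groups; lift compares the Good count
    captured so far with the count expected under random assignment."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n, total_good = len(scores), sum(labels)
    size = n // n_groups
    lifts = []
    for g in range(1, n_groups + 1):
        top = order[:g * size]
        captured = sum(labels[i] for i in top)
        expected = total_good * len(top) / n  # chance allocation
        lifts.append(captured / expected)
    return lifts

# A perfectly separating classifier: the 10 Good cases get the lowest scores.
scores = list(range(20))
labels = [1] * 10 + [0] * 10
print(cumulative_lift(scores, labels))  # starts at 2.0, ends at 1.0
```

For a model with no predictive power, every entry would be near 1, which is exactly the uniform-distribution baseline discussed in the text.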
Figure 12.30 Cumulative lift charts for neural network, classification tree, and logistic regression tools.
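Under the "realistic" cost matrix, the accept/reject rule used to score applicants reduces to comparing the expected cost of accepting with the zero cost of rejecting. A minimal sketch (the probability values are hypothetical):

```python
def expected_cost_accept(p_good):
    """Expected cost of accepting an applicant with P(Good) = p_good,
    using the realistic cost matrix: -$1 for a Good accept, $5 for a Bad accept."""
    return -1.0 * p_good + 5.0 * (1.0 - p_good)

def classify(p_good):
    # Rejecting costs 0 in either state, so accept exactly when the
    # expected cost of accepting is negative, i.e., when P(Good) > 5/6.
    return "Good" if expected_cost_accept(p_good) < 0 else "Bad"

print(classify(0.90))  # "Good": expected cost -0.40
print(classify(0.70))  # "Bad":  expected cost  0.80
```

This makes explicit why a negative expected cost score flags a potential profit: the accept threshold P(Good) > 5/6 follows directly from the $5 : $1 cost ratio.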
In the decision flow diagram in Figure 12.28, the SAMPS10.DMAGESCR file contains 75 new applicants. Expected cost scores for these applicants were created using the neural network model. Of the 75 applicants, 33 were classified as Good credit risks (with negative expected costs). •
Data mining procedures and software continue to evolve, and it is difficult to predict what the future might bring. Database packages with embedded data mining capabilities, such as SQL Server 2000, represent one evolutionary direction.

EXERCISES

12.1. Certain characteristics associated with a few recent U.S. presidents are listed in Table 12.11.

TABLE 12.11

                Birthplace                               Prior U.S.     Served as
                (region of      Elected                  congressional  vice
President       United States)  first term?  Party       experience?    president?
1. R. Reagan    Midwest         Yes          Republican      No            No
2. J. Carter    South           Yes          Democrat        No            No
3. G. Ford      Midwest         No           Republican      Yes           Yes
4. R. Nixon     West            Yes          Republican      Yes           Yes
5. L. Johnson   South           No           Democrat        Yes           Yes
6. J. Kennedy   East            Yes          Democrat        Yes           No
Chapter 12 Exercises 739
(a) Introducing appropriate binary variables, calculate similarity coefficient 1 in Table 12.2 for pairs of presidents.
Hint: You may use birthplace as South, non-South.
(b) Proceeding as in Part a, calculate similarity coefficients 2 and 3 in Table 12.2. Verify the monotonicity relation of coefficients 1, 2, and 3 by displaying the order of the 15 similarities for each coefficient.
12.2. Repeat Exercise 12.1 using similarity coefficients 5, 6, and 7 in Table 12.2.
12.3. Show that the sample correlation coefficient [see (12-11)] can be written as

    r = (ad - bc) / [(a + b)(a + c)(b + d)(c + d)]^(1/2)

for two 0-1 binary variables with the following frequencies:

                         Variable 2
                         1       0
    Variable 1    1      a       b
                  0      c       d
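The identity in Exercise 12.3 is easy to check numerically: compute r directly from the 0-1 data and from the cell counts a, b, c, d. A small sketch (the sample vectors are arbitrary):

```python
import math

def pearson(x, y):
    # Ordinary sample correlation coefficient.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def r_from_counts(a, b, c, d):
    # a = #(1,1), b = #(1,0), c = #(0,1), d = #(0,0)
    return (a * d - b * c) / math.sqrt((a + b) * (a + c) * (b + d) * (c + d))

x = [1, 1, 1, 0, 0]
y = [1, 1, 0, 0, 0]               # a = 2, b = 1, c = 0, d = 2
print(pearson(x, y))              # 0.666...
print(r_from_counts(2, 1, 0, 2))  # same value, as the identity asserts
```

A numerical check like this does not prove the identity, but it is a useful sanity check on the algebra.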
12.4. Show that the monotonicity property holds for the similarity coefficients 1, 2, and 3 in Table 12.2.
Hint: (b + c) = p - (a + d). So, for instance,

    (a + d) / [a + d + 2(b + c)] = 1 / (1 + 2[p/(a + d) - 1])

This equation relates coefficients 3 and 1. Find analogous representations for the other pairs.
12.5. Consider the matrix of distances

         1    2    3    4
    1    0
    2    1    0
    3   11    2    0
    4    5    3    4    0

Cluster the four items using each of the following procedures.
(a) Single linkage hierarchical procedure.
(b) Complete linkage hierarchical procedure.
(c) Average linkage hierarchical procedure.
Draw the dendrograms and compare the results in (a), (b), and (c).
12.6. The distances between pairs of five items are as follows:
         1    2    3    4    5
    1    0
    2    4    0
    3    6    9    0
    4    1    7   10    0
    5    6    3    5    8    0
Cluster the five items using the single linkage, complete linkage, and average linkage hierarchical methods. Draw the dendrograms and compare the results.
12.7. Sample correlations for five stocks were given in Example 8.5. These correlations, rounded to two decimal places, are reproduced as follows:
                     Allied                Union
                     Chemical   Du Pont   Carbide   Exxon   Texaco
Allied Chemical         1
Du Pont               .58          1
Union Carbide         .51         .60        1
Exxon                 .39         .39       .44        1
Texaco                .46         .32       .43       .52       1
Treating the sample correlations as similarity measures, cluster the stocks using the single linkage and complete linkage hierarchical procedures. Draw the dendrograms and compare the results.
12.8. Using the distances in Example 12.4, cluster the items using the average linkage hierarchical procedure. Draw the dendrogram. Compare the results with those in Examples 12.4 and 12.6.
12.9. The vocabulary "richness" of a text can be quantitatively described by counting the words used once, the words used twice, and so forth. Based on these counts, a linguist proposed the following distances between chapters of the Old Testament book Lamentations (data courtesy of Y. T. Radday and M. A. Pollatschek):
                      Lamentations chapter
                      1      2      3      4      5
Lamentations  1       0
chapter       2      .76     0
              3     2.97    .80     0
              4     4.88   4.17    .21     0
              5     3.86   1.92   1.51    .51     0
Cluster the chapters of Lamentations using the three linkage hierarchical methods we have discussed. Draw the dendrograms and compare the results.
12.10. Use Ward's method to cluster the 4 items whose measurements on a single variable X are given in the following table.

        Measurements
    Item      X
      1       2
      2       1
      3       5
      4       8
(a) Initially, each item is a cluster and we have the clusters

    {1}  {2}  {3}  {4}

Show that ESS = 0, as it must.
(b) If we join clusters {1} and {2}, the new cluster {12} has

    ESS_1 = Σ (x_j - x̄)² = (2 - 1.5)² + (1 - 1.5)² = .5

and the ESS associated with the grouping {12}, {3}, {4} is ESS = .5 + 0 + 0 = .5. The increase in ESS (loss of information) from the first step to the current step is .5 - 0 = .5. Complete the following table by determining the increase in ESS for all the possibilities at step 2.

    Clusters               Increase in ESS
    {12}  {3}   {4}              .5
    {13}  {2}   {4}
    {14}  {2}   {3}
    {1}   {23}  {4}
    {1}   {24}  {3}
    {1}   {2}   {34}
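The ESS increases asked for in part (b) can be checked with a few lines of code; this is a minimal sketch of Ward's criterion for a single variable (the function names are our own):

```python
def ess(cluster):
    # Error sum of squares of one cluster: squared deviations from its mean.
    m = sum(cluster) / len(cluster)
    return sum((x - m) ** 2 for x in cluster)

def total_ess(partition):
    # Ward's objective: sum of ESS over all clusters in the partition.
    return sum(ess(c) for c in partition)

x = {1: 2, 2: 1, 3: 5, 4: 8}                      # item -> measurement
print(total_ess([[x[1], x[2]], [x[3]], [x[4]]]))  # 0.5 for {12}, {3}, {4}
print(total_ess([[x[1], x[3]], [x[2]], [x[4]]]))  # 4.5 for {13}, {2}, {4}
```

Evaluating total_ess for each candidate grouping and subtracting the previous step's value reproduces the "Increase in ESS" column.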
(c) Complete the last two steps, and construct the dendrogram showing the values of ESS at which the mergers take place.
12.11. Suppose we measure two variables X1 and X2 for four items A, B, C, and D. The data are as follows:

          Observations
    Item     x1     x2
     A        5      4
     B        1     -2
     C       -1      1
     D        3      1

Use the K-means clustering technique to divide the items into K = 2 clusters. Start with the initial groups (AB) and (CD).
12.12. Repeat Example 12.12, starting with the initial groups (AC) and (BD). Compare your solution with the solution in the example. Are they the same? Graph the items in terms of their (x1, x2) coordinates, and comment on the solutions.
12.13. Repeat Example 12.12, but start at the bottom of the list of items, and proceed up in the order D, C, B, A. Begin with the initial groups (AB) and (CD). [The first potential reassignment will be based on the distances d²(D, (AB)) and d²(D, (CD)).] Compare your solution with the solution in the example. Are they the same? Should they be the same?
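Exercise 12.11 can be checked against a few lines of batch K-means. Note one caveat: the text's Example 12.12 reassigns items one at a time, whereas this sketch reassigns all items on each pass (Lloyd's algorithm); for these data both arrive at the same partition:

```python
def kmeans(points, groups, n_iter=20):
    """Minimal batch K-means; 'groups' holds the initial partition as lists
    of point indices and is refined until assignments stop changing."""
    for _ in range(n_iter):
        # Centroid of each current cluster (points are 2-dimensional here).
        cents = [tuple(sum(points[i][k] for i in g) / len(g) for k in range(2))
                 for g in groups]
        # Reassign every point to the nearest centroid (squared distance).
        new = [[] for _ in groups]
        for i, p in enumerate(points):
            j = min(range(len(cents)),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, cents[c])))
            new[j].append(i)
        if new == groups:
            break
        groups = new
    return groups

# Items A, B, C, D with (x1, x2) from the table; start with (AB) and (CD).
pts = [(5, 4), (1, -2), (-1, 1), (3, 1)]
print(kmeans(pts, [[0, 1], [2, 3]]))  # [[0, 3], [1, 2]], i.e., (AD) and (BC)
```

Rerunning with the starting partitions of Exercises 12.12 and 12.13 shows how the final clusters depend (or not) on the initial groups.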
The following exercises require the use of a computer.
12.14. Table 11.9 lists measurements on 8 variables for 43 breakfast cereals.
(a) Using the data in the table, calculate the Euclidean distances between pairs of cereal brands.
(b) Treating the distances calculated in (a) as measures of (dis)similarity, cluster the cereals using the single linkage and complete linkage hierarchical procedures. Construct dendrograms and compare the results.
12.15. Input the data in Table 11.9 into a K-means clustering program. Cluster the cereals into K = 2, 3, and 4 groups. Compare the results with those in Exercise 12.14.
12.16. The national track records data for women are given in Table 1.9.
(a) Using the data in Table 1.9, calculate the Euclidean distances between pairs of countries.
(b) Treating the distances in (a) as measures of (dis)similarity, cluster the countries using the single linkage and complete linkage hierarchical procedures. Construct dendrograms and compare the results.
(c) Input the data in Table 1.9 into a K-means clustering program. Cluster the countries into groups using several values of K. Compare the results with those in Part b.
12.17. Repeat Exercise 12.16 using the national track records data for men given in Table 8.6. Compare the results with those of Exercise 12.16. Explain any differences.
12.18. Table 12.12 gives the road distances between 12 Wisconsin cities and cities in neighboring states. Locate the cities in q = 1, 2, and 3 dimensions using multidimensional scaling. Plot the minimum stress(q) versus q and interpret the graph. Compare the two-dimensional multidimensional scaling configuration with the locations of the cities on a map from an atlas.
12.19. Table 12.13 on page 744 gives the "distances" between certain archaeological sites from different periods, based upon the frequencies of different types of potsherds found at the sites.
Given these distances, determine the coordinates of the sites in q = 3, 4, and 5 dimensions using multidimensional scaling. Plot the minimum stress(q) versus q and interpret the graph. If possible, locate the sites in two dimensions (the first two principal components) using the coordinates for the q = 5-dimensional solution. (Treat the sites as variables.) Noting the periods associated with the sites, interpret the two-dimensional configuration.
12.20. A sample of n = 1660 people is cross-classified according to mental health status and socioeconomic status in Table 12.14 on page 745. Perform a correspondence analysis of these data. Interpret the results. Can the associations in the data be well represented in one dimension?
12.21. A sample of 901 individuals was cross-classified according to three categories of income and four categories of job satisfaction. The results are given in Table 12.15 on page 745. Perform a correspondence analysis of these data. Interpret the results.
12.22. Perform a correspondence analysis of the data on forests listed in Table 12.10, and verify Figure 12.26 given in Example 12.21.
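For Exercises 12.18 and 12.19, the quantity minimized at each dimension q is the stress of the fitted configuration. One common normalization (Kruskal's) can be sketched as follows, with d_obs and d_fit running over the same item pairs; the lists below are illustrative values only:

```python
import math

def stress(d_obs, d_fit):
    """Kruskal-type stress: sqrt( sum (d - dhat)^2 / sum d^2 ), where d are
    the observed distances and dhat the distances in the fitted configuration."""
    num = sum((d - dh) ** 2 for d, dh in zip(d_obs, d_fit))
    den = sum(d ** 2 for d in d_obs)
    return math.sqrt(num / den)

print(stress([1.0, 2.0], [1.0, 2.0]))  # 0.0 for a perfect fit
print(stress([1.0, 2.0], [1.0, 1.0]))  # sqrt(1/5), about 0.447
```

Plotting the minimum stress against q, as the exercises ask, shows how much fit is gained by each added dimension.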
TABLE 12.12  DISTANCES BETWEEN CITIES IN WISCONSIN AND CITIES IN NEIGHBORING STATES

                      (1)   (2)   (3)   (4)   (5)   (6)   (7)   (8)   (9)  (10)  (11)  (12)
 (1) Appleton           0
 (2) Beloit           130     0
 (3) Fort Atkinson     98    33     0
 (4) Madison          102    50    36     0
 (5) Marshfield       103   185   164   138     0
 (6) Milwaukee        100    73    54    77   184     0
 (7) Monroe           149    33    58    47   170   107     0
 (8) Superior         315   377   359   330   219   394   362     0
 (9) Wausau            91   186   166   139    45   181   186   223     0
(10) Dubuque          196    94   119    95   186   168    61   351   215     0
(11) St. Paul         257   304   287   258   161   322   289   162   175   274     0
(12) Chicago          186    97   113   146   276    93   130   467   275   184   395     0
TABLE 12.13  DISTANCES BETWEEN ARCHAEOLOGICAL SITES

        P1980918 P1931131 P1550960 P1530987 P1361024 P1351005 P1340945 P1311137 P1301062
           (1)      (2)      (3)      (4)      (5)      (6)      (7)      (8)      (9)
(1)        0
(2)      2.202      0
(3)      1.004    2.025      0
(4)      1.108    1.943    0.233      0
(5)      1.122    1.870    0.719    0.541      0
(6)      0.914    2.070    0.719    0.679    0.539      0
(7)      0.914    2.186    0.452    0.681    1.102    0.916      0
(8)      2.056    2.055    1.986    1.990    1.963    2.056    2.027      0
(9)      1.608    1.722    1.358    1.168    0.681    1.005    1.719    1.991      0
KEY: P1980918 refers to site P198 dated A.D. 0918, P1931131 refers to site P193 dated A.D. 1131, and so forth.
Source: Data courtesy of M. J. Tretter.
Chapter 12 References 745

TABLE 12.14  MENTAL HEALTH STATUS AND SOCIOECONOMIC STATUS DATA

                                  Parental Socioeconomic Status
Mental Health Status          A (High)    B      C      D     E (Low)
Well                            121      57     72     36       21
Mild symptom formation          188     105    141     97       71
Moderate symptom formation      112      65     77     54       54
Impaired                         86      60     94     78       71
Source: Adapted from data in Srole, L., T. S. Langner, S. T. Michael, P. Kirkpatrick, M. K. Opler, and T. A. C. Rennie, Mental Health in the Metropolis: The Midtown Manhattan Study, rev. ed. (New York: NYU Press, 1978).
TABLE 12.15  INCOME AND JOB SATISFACTION DATA

                                 Job Satisfaction
                    Very          Somewhat       Moderately    Very
Income              dissatisfied  dissatisfied   satisfied     satisfied
< $25,000               42            62            184           207
$25,000-$50,000         13            28             81           113
> $50,000                7            18             54            92
Source: Adapted from data in Table 8.2 in Agresti, A., Categorical Data Analysis (New York: John Wiley, 1990).
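Correspondence analysis (Exercises 12.20 and 12.21) starts from the row profiles and the chi-square statistic of the contingency table; both are easy to compute by hand or with a short sketch like the following (using the income data above):

```python
def row_profiles(table):
    # Each row of the contingency table rescaled to sum to 1.
    return [[x / sum(row) for x in row] for row in table]

def chi2_stat(table):
    # Pearson chi-square statistic against the independence model.
    n = sum(sum(row) for row in table)
    r = [sum(row) for row in table]
    c = [sum(col) for col in zip(*table)]
    return sum((table[i][j] - r[i] * c[j] / n) ** 2 / (r[i] * c[j] / n)
               for i in range(len(r)) for j in range(len(c)))

income_jobsat = [[42, 62, 184, 207],
                 [13, 28, 81, 113],
                 [7, 18, 54, 92]]
print(row_profiles(income_jobsat)[0])  # profile of the "< $25,000" row
print(chi2_stat(income_jobsat))        # > 0 signals some association
```

A chi-square statistic of exactly zero corresponds to perfectly independent rows and columns, in which case all row profiles coincide and the correspondence analysis plot collapses to a single point.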
12.23. Construct a biplot of the pottery data in Table 12.8. Interpret the biplot. Is the biplot consistent with the correspondence analysis plot in Figure 12.21? Discuss your answer. (Use the row proportions as a vector of observations at a site.)
12.24. Construct a biplot of the mental health and socioeconomic data in Table 12.14. Interpret the biplot. Is the biplot consistent with the correspondence analysis plot in Exercise 12.20? Discuss your answer. (Use the column proportions as the vector of observations for each status.)
12.25. Using the archaeological data in Table 12.13, determine the two-dimensional metric and nonmetric multidimensional scaling plots. (See Exercise 12.19.) Given the coordinates of the points in each of these plots, perform a Procrustes analysis. Interpret the results.
REFERENCES

1. Abramowitz, M., and I. A. Stegun, eds. Handbook of Mathematical Functions. U.S. Department of Commerce, National Bureau of Standards Applied Mathematical Series, 55, 1964.
2. Adriaans, P., and D. Zantinge. Data Mining. Harlow, England: Addison-Wesley, 1996.
3. Anderberg, M. R. Cluster Analysis for Applications. New York: Academic Press, 1973.
4. Berry, M. J. A., and G. Linoff. Data Mining Techniques. New York: John Wiley, 1997.
5. Berry, M. J. A., and G. Linoff. Mastering Data Mining. New York: John Wiley, 2000.
6. Berthold, M., and D. J. Hand. Intelligent Data Analysis. Berlin, Germany: Springer-Verlag, 1999.
7. Cormack, R. M. "A Review of Classification (with discussion)." Journal of the Royal Statistical Society (A), 134, no. 3 (1971), 321-367.
8. Everitt, B. S. Cluster Analysis (3d ed.). London: Edward Arnold, 1993.
9. Gower, J. C. "Some Distance Properties of Latent Root and Vector Methods Used in Multivariate Analysis." Biometrika, 53 (1966), 325-338.
10. Gower, J. C. "Multivariate Analysis and Multidimensional Geometry." The Statistician, 17 (1967), 13-25.
11. Gower, J. C., and D. J. Hand. Biplots. London: Chapman and Hall, 1996.
12. Gnanadesikan, R. Methods for Statistical Data Analysis of Multivariate Observations. New York: John Wiley, 1977.
13. Greenacre, M. J. "Correspondence Analysis of Square Asymmetric Matrices." Applied Statistics, 49 (2000), 297-310.
14. Greenacre, M. J. Theory and Applications of Correspondence Analysis. London: Academic Press, 1984.
15. Hartigan, J. A. Clustering Algorithms. New York: John Wiley, 1975.
16. Kennedy, R. L., L. Lee, B. Van Roy, and C. D. Reed. Solving Data Mining Problems Through Pattern Recognition. Upper Saddle River, NJ: Prentice Hall, 1997.
17. Kruskal, J. B. "Multidimensional Scaling by Optimizing Goodness of Fit to a Nonmetric Hypothesis." Psychometrika, 29, no. 1 (1964), 1-27.
18. Kruskal, J. B. "Non-metric Multidimensional Scaling: A Numerical Method." Psychometrika, 29, no. 1 (1964), 115-129.
19. Kruskal, J. B., and M. Wish. "Multidimensional Scaling." Sage University Paper series on Quantitative Applications in the Social Sciences, 07-011. Beverly Hills and London: Sage Publications, 1978.
20. LaPointe, F-J, and P. Legendre. "A Classification of Pure Malt Scotch Whiskies." Applied Statistics, 43, no. 1 (1994), 237-257.
21. Ludwig, J. A., and J. F. Reynolds. Statistical Ecology: A Primer on Methods and Computing. New York: John Wiley, 1988.
22. MacQueen, J. B. "Some Methods for Classification and Analysis of Multivariate Observations." Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, 1, Berkeley, CA: University of California Press (1967), 281-297.
23. Mardia, K. V., J. T. Kent, and J. M. Bibby. Multivariate Analysis. New York: Academic Press, 1979.
24. Morgan, B. J. T., and A. P. G. Ray. "Non-uniqueness and Inversions in Cluster Analysis." Applied Statistics, 44, no. 1 (1995), 117-134.
25. Pyle, D. Data Preparation for Data Mining. San Francisco: Morgan Kaufmann, 1999.
26. Shepard, R. N. "Multidimensional Scaling, Tree-Fitting, and Clustering." Science, 210, no. 4468 (1980), 390-398.
27. Sibson, R. "Studies in the Robustness of Multidimensional Scaling." Journal of the Royal Statistical Society (B), 40 (1978), 234-238.
28. Takane, Y., F. W. Young, and J. De Leeuw. "Non-metric Individual Differences Multidimensional Scaling: Alternating Least Squares with Optimal Scaling Features." Psychometrika, 42 (1977), 7-67.
29. Ward, Jr., J. H. "Hierarchical Grouping to Optimize an Objective Function." Journal of the American Statistical Association, 58 (1963), 236-244.
30. Westphal, C., and T. Blaxton. Data Mining Solutions. New York: John Wiley, 1998.
31. Witten, I. H., and E. Frank. Data Mining. San Francisco: Morgan Kaufmann, 2000.
32. Young, F. W., and R. M. Hamer. Multidimensional Scaling: History, Theory, and Applications. Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers, 1987.
Appendix

Table 1  Standard Normal Probabilities
Table 2  Student's t-Distribution Percentage Points
Table 3  χ² Distribution Percentage Points
Table 4  F-Distribution Percentage Points (α = .10)
Table 5  F-Distribution Percentage Points (α = .05)
Table 6  F-Distribution Percentage Points (α = .01)
TABLE 1  STANDARD NORMAL PROBABILITIES   P[Z ≤ z]

  z    .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
 .0   .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
 .1   .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
 .2   .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
 .3   .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
 .4   .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879
 .5   .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
 .6   .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549
 .7   .7580  .7611  .7642  .7673  .7703  .7734  .7764  .7794  .7823  .7852
 .8   .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
 .9   .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389
1.0   .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
1.1   .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830
1.2   .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
1.3   .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177
1.4   .9192  .9207  .9222  .9236  .9251  .9265  .9279  .9292  .9306  .9319
1.5   .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441
1.6   .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
1.7   .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633
1.8   .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
1.9   .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767
2.0   .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
2.1   .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857
2.2   .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890
2.3   .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916
2.4   .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936
2.5   .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
2.6   .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964
2.7   .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974
2.8   .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981
2.9   .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986
3.0   .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990
3.1   .9990  .9991  .9991  .9991  .9992  .9992  .9992  .9992  .9993  .9993
3.2   .9993  .9993  .9994  .9994  .9994  .9994  .9994  .9995  .9995  .9995
3.3   .9995  .9995  .9995  .9996  .9996  .9996  .9996  .9996  .9996  .9997
3.4   .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9998
3.5   .9998  .9998  .9998  .9998  .9998  .9998  .9998  .9998  .9998  .9998
TABLE 2  STUDENT'S t-DISTRIBUTION PERCENTAGE POINTS   t_v(α)

 d.f.                                       α
  v     .250   .100   .050    .025    .010   .00833  .00625   .005    .0025
  1    1.000  3.078  6.314  12.706  31.821  38.190  50.923  63.657  127.321
  2     .816  1.886  2.920   4.303   6.965   7.649   8.860   9.925   14.089
  3     .765  1.638  2.353   3.182   4.541   4.857   5.392   5.841    7.453
  4     .741  1.533  2.132   2.776   3.747   3.961   4.315   4.604    5.598
  5     .727  1.476  2.015   2.571   3.365   3.534   3.810   4.032    4.773
  6     .718  1.440  1.943   2.447   3.143   3.287   3.521   3.707    4.317
  7     .711  1.415  1.895   2.365   2.998   3.128   3.335   3.499    4.029
  8     .706  1.397  1.860   2.306   2.896   3.016   3.206   3.355    3.833
  9     .703  1.383  1.833   2.262   2.821   2.933   3.111   3.250    3.690
 10     .700  1.372  1.812   2.228   2.764   2.870   3.038   3.169    3.581
 11     .697  1.363  1.796   2.201   2.718   2.820   2.981   3.106    3.497
 12     .695  1.356  1.782   2.179   2.681   2.779   2.934   3.055    3.428
 13     .694  1.350  1.771   2.160   2.650   2.746   2.896   3.012    3.372
 14     .692  1.345  1.761   2.145   2.624   2.718   2.864   2.977    3.326
 15     .691  1.341  1.753   2.131   2.602   2.694   2.837   2.947    3.286
 16     .690  1.337  1.746   2.120   2.583   2.673   2.813   2.921    3.252
 17     .689  1.333  1.740   2.110   2.567   2.655   2.793   2.898    3.222
 18     .688  1.330  1.734   2.101   2.552   2.639   2.775   2.878    3.197
 19     .688  1.328  1.729   2.093   2.539   2.625   2.759   2.861    3.174
 20     .687  1.325  1.725   2.086   2.528   2.613   2.744   2.845    3.153
 21     .686  1.323  1.721   2.080   2.518   2.601   2.732   2.831    3.135
 22     .686  1.321  1.717   2.074   2.508   2.591   2.720   2.819    3.119
 23     .685  1.319  1.714   2.069   2.500   2.582   2.710   2.807    3.104
 24     .685  1.318  1.711   2.064   2.492   2.574   2.700   2.797    3.091
 25     .684  1.316  1.708   2.060   2.485   2.566   2.692   2.787    3.078
 26     .684  1.315  1.706   2.056   2.479   2.559   2.684   2.779    3.067
 27     .684  1.314  1.703   2.052   2.473   2.552   2.676   2.771    3.057
 28     .683  1.313  1.701   2.048   2.467   2.546   2.669   2.763    3.047
 29     .683  1.311  1.699   2.045   2.462   2.541   2.663   2.756    3.038
 30     .683  1.310  1.697   2.042   2.457   2.536   2.657   2.750    3.030
 40     .681  1.303  1.684   2.021   2.423   2.499   2.616   2.704    2.971
 60     .679  1.296  1.671   2.000   2.390   2.463   2.575   2.660    2.915
120     .677  1.289  1.658   1.980   2.358   2.428   2.536   2.617    2.860
  ∞     .674  1.282  1.645   1.960   2.326   2.394   2.498   2.576    2.813
TABLE 3  χ² DISTRIBUTION PERCENTAGE POINTS   χ²_v(α)

 d.f.                                       α
  v     .990    .950    .900    .500    .100    .050    .025    .010    .005
  1    .0002    .004     .02     .45    2.71    3.84    5.02    6.63    7.88
  2      .02     .10     .21    1.39    4.61    5.99    7.38    9.21   10.60
  3      .11     .35     .58    2.37    6.25    7.81    9.35   11.34   12.84
  4      .30     .71    1.06    3.36    7.78    9.49   11.14   13.28   14.86
  5      .55    1.15    1.61    4.35    9.24   11.07   12.83   15.09   16.75
  6      .87    1.64    2.20    5.35   10.64   12.59   14.45   16.81   18.55
  7     1.24    2.17    2.83    6.35   12.02   14.07   16.01   18.48   20.28
  8     1.65    2.73    3.49    7.34   13.36   15.51   17.53   20.09   21.95
  9     2.09    3.33    4.17    8.34   14.68   16.92   19.02   21.67   23.59
 10     2.56    3.94    4.87    9.34   15.99   18.31   20.48   23.21   25.19
 11     3.05    4.57    5.58   10.34   17.28   19.68   21.92   24.72   26.76
 12     3.57    5.23    6.30   11.34   18.55   21.03   23.34   26.22   28.30
 13     4.11    5.89    7.04   12.34   19.81   22.36   24.74   27.69   29.82
 14     4.66    6.57    7.79   13.34   21.06   23.68   26.12   29.14   31.32
 15     5.23    7.26    8.55   14.34   22.31   25.00   27.49   30.58   32.80
 16     5.81    7.96    9.31   15.34   23.54   26.30   28.85   32.00   34.27
 17     6.41    8.67   10.09   16.34   24.77   27.59   30.19   33.41   35.72
 18     7.01    9.39   10.86   17.34   25.99   28.87   31.53   34.81   37.16
 19     7.63   10.12   11.65   18.34   27.20   30.14   32.85   36.19   38.58
 20     8.26   10.85   12.44   19.34   28.41   31.41   34.17   37.57   40.00
 21     8.90   11.59   13.24   20.34   29.62   32.67   35.48   38.93   41.40
 22     9.54   12.34   14.04   21.34   30.81   33.92   36.78   40.29   42.80
 23    10.20   13.09   14.85   22.34   32.01   35.17   38.08   41.64   44.18
 24    10.86   13.85   15.66   23.34   33.20   36.42   39.36   42.98   45.56
 25    11.52   14.61   16.47   24.34   34.38   37.65   40.65   44.31   46.93
 26    12.20   15.38   17.29   25.34   35.56   38.89   41.92   45.64   48.29
 27    12.88   16.15   18.11   26.34   36.74   40.11   43.19   46.96   49.64
 28    13.56   16.93   18.94   27.34   37.92   41.34   44.46   48.28   50.99
 29    14.26   17.71   19.77   28.34   39.09   42.56   45.72   49.59   52.34
 30    14.95   18.49   20.60   29.34   40.26   43.77   46.98   50.89   53.67
 40    22.16   26.51   29.05   39.34   51.81   55.76   59.34   63.69   66.77
 50    29.71   34.76   37.69   49.33   63.17   67.50   71.42   76.15   79.49
 60    37.48   43.19   46.46   59.33   74.40   79.08   83.30   88.38   91.95
 70    45.44   51.74   55.33   69.33   85.53   90.53   95.02  100.43  104.21
 80    53.54   60.39   64.28   79.33   96.58  101.88  106.63  112.33  116.32
 90    61.75   69.13   73.29   89.33  107.57  113.15  118.14  124.12  128.30
100    70.06   77.93   82.36   99.33  118.50  124.34  129.56  135.81  140.17
TABLE 4  F-DISTRIBUTION PERCENTAGE POINTS (α = .10)   F_v1,v2(.10)

 v2\v1    1     2     3     4     5     6     7     8     9    10    12    15    20    25    30    40    60
  1    39.86 49.50 53.59 55.83 57.24 58.20 58.91 59.44 59.86 60.19 60.71 61.22 61.74 62.05 62.26 62.53 62.79
  2     8.53  9.00  9.16  9.24  9.29  9.33  9.35  9.37  9.38  9.39  9.41  9.42  9.44  9.45  9.46  9.47  9.47
  3     5.54  5.46  5.39  5.34  5.31  5.28  5.27  5.25  5.24  5.23  5.22  5.20  5.18  5.17  5.17  5.16  5.15
  4     4.54  4.32  4.19  4.11  4.05  4.01  3.98  3.95  3.94  3.92  3.90  3.87  3.84  3.83  3.82  3.80  3.79
  5     4.06  3.78  3.62  3.52  3.45  3.40  3.37  3.34  3.32  3.30  3.27  3.24  3.21  3.19  3.17  3.16  3.14
  6     3.78  3.46  3.29  3.18  3.11  3.05  3.01  2.98  2.96  2.94  2.90  2.87  2.84  2.81  2.80  2.78  2.76
  7     3.59  3.26  3.07  2.96  2.88  2.83  2.78  2.75  2.72  2.70  2.67  2.63  2.59  2.57  2.56  2.54  2.51
  8     3.46  3.11  2.92  2.81  2.73  2.67  2.62  2.59  2.56  2.54  2.50  2.46  2.42  2.40  2.38  2.36  2.34
  9     3.36  3.01  2.81  2.69  2.61  2.55  2.51  2.47  2.44  2.42  2.38  2.34  2.30  2.27  2.25  2.23  2.21
 10     3.29  2.92  2.73  2.61  2.52  2.46  2.41  2.38  2.35  2.32  2.28  2.24  2.20  2.17  2.16  2.13  2.11
 11     3.23  2.86  2.66  2.54  2.45  2.39  2.34  2.30  2.27  2.25  2.21  2.17  2.12  2.10  2.08  2.05  2.03
 12     3.18  2.81  2.61  2.48  2.39  2.33  2.28  2.24  2.21  2.19  2.15  2.10  2.06  2.03  2.01  1.99  1.96
 13     3.14  2.76  2.56  2.43  2.35  2.28  2.23  2.20  2.16  2.14  2.10  2.05  2.01  1.98  1.96  1.93  1.90
 14     3.10  2.73  2.52  2.39  2.31  2.24  2.19  2.15  2.12  2.10  2.05  2.01  1.96  1.93  1.91  1.89  1.86
 15     3.07  2.70  2.49  2.36  2.27  2.21  2.16  2.12  2.09  2.06  2.02  1.97  1.92  1.89  1.87  1.85  1.82
 16     3.05  2.67  2.46  2.33  2.24  2.18  2.13  2.09  2.06  2.03  1.99  1.94  1.89  1.86  1.84  1.81  1.78
 17     3.03  2.64  2.44  2.31  2.22  2.15  2.10  2.06  2.03  2.00  1.96  1.91  1.86  1.83  1.81  1.78  1.75
 18     3.01  2.62  2.42  2.29  2.20  2.13  2.08  2.04  2.00  1.98  1.93  1.89  1.84  1.80  1.78  1.75  1.72
 19     2.99  2.61  2.40  2.27  2.18  2.11  2.06  2.02  1.98  1.96  1.91  1.86  1.81  1.78  1.76  1.73  1.70
 20     2.97  2.59  2.38  2.25  2.16  2.09  2.04  2.00  1.96  1.94  1.89  1.84  1.79  1.76  1.74  1.71  1.68
 21     2.96  2.57  2.36  2.23  2.14  2.08  2.02  1.98  1.95  1.92  1.87  1.83  1.78  1.74  1.72  1.69  1.66
 22     2.95  2.56  2.35  2.22  2.13  2.06  2.01  1.97  1.93  1.90  1.86  1.81  1.76  1.73  1.70  1.67  1.64
 23     2.94  2.55  2.34  2.21  2.11  2.05  1.99  1.95  1.92  1.89  1.84  1.80  1.74  1.71  1.69  1.66  1.62
 24     2.93  2.54  2.33  2.19  2.10  2.04  1.98  1.94  1.91  1.88  1.83  1.78  1.73  1.70  1.67  1.64  1.61
 25     2.92  2.53  2.32  2.18  2.09  2.02  1.97  1.93  1.89  1.87  1.82  1.77  1.72  1.68  1.66  1.63  1.59
 26     2.91  2.52  2.31  2.17  2.08  2.01  1.96  1.92  1.88  1.86  1.81  1.76  1.71  1.67  1.65  1.61  1.58
 27     2.90  2.51  2.30  2.17  2.07  2.00  1.95  1.91  1.87  1.85  1.80  1.75  1.70  1.66  1.64  1.60  1.57
 28     2.89  2.50  2.29  2.16  2.06  2.00  1.94  1.90  1.87  1.84  1.79  1.74  1.69  1.65  1.63  1.59  1.56
 29     2.89  2.50  2.28  2.15  2.06  1.99  1.93  1.89  1.86  1.83  1.78  1.73  1.68  1.64  1.62  1.58  1.55
 30     2.88  2.49  2.28  2.14  2.05  1.98  1.93  1.88  1.85  1.82  1.77  1.72  1.67  1.63  1.61  1.57  1.54
 40     2.84  2.44  2.23  2.09  2.00  1.93  1.87  1.83  1.79  1.76  1.71  1.66  1.61  1.57  1.54  1.51  1.47
 60     2.79  2.39  2.18  2.04  1.95  1.87  1.82  1.77  1.74  1.71  1.66  1.60  1.54  1.50  1.48  1.44  1.40
120     2.75  2.35  2.13  1.99  1.90  1.82  1.77  1.72  1.68  1.65  1.60  1.55  1.48  1.45  1.41  1.37  1.32
  ∞     2.71  2.30  2.08  1.94  1.85  1.77  1.72  1.67  1.63  1.60  1.55  1.49  1.42  1.38  1.34  1.30  1.24
._,... U'1 �
"'0 "'0 (1) :::::l a..
)>
x·
TABLE 5  F-DISTRIBUTION PERCENTAGE POINTS (α = .05)

Entries are the upper 5% critical values F_{ν₁,ν₂}(.05) of the F distribution, tabulated for numerator degrees of freedom ν₁ = 1, 2, ..., 10, 12, 15, 20, 25, 30, 40, 60 (columns) and denominator degrees of freedom ν₂ = 1, 2, ..., 30, 40, 60, 120, ∞ (rows); for example, F₁,₁(.05) = 161.5 and F₅,₁₀(.05) = 3.33. [The body of the table was scrambled in text extraction and is omitted.]
TABLE 6  F-DISTRIBUTION PERCENTAGE POINTS (α = .01)

Entries are the upper 1% critical values F_{ν₁,ν₂}(.01) of the F distribution, tabulated for numerator degrees of freedom ν₁ = 1, 2, ..., 10, 12, 15, 20, 25, 30, 40, 60 (columns) and denominator degrees of freedom ν₂ = 1, 2, ..., 30, 40, 60, 120, ∞ (rows); for example, F₁,₁(.01) = 4052 and F₅,₁₀(.01) = 5.64. [The body of the table was scrambled in text extraction and is omitted.]
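The percentage points in these F tables can be regenerated numerically. As a sketch (not part of the original text), the following pure-Python routine evaluates the F cumulative distribution through the regularized incomplete beta function — using the fact that if X ~ F(ν₁, ν₂) then ν₁X/(ν₁X + ν₂) ~ Beta(ν₁/2, ν₂/2) — and inverts it by bisection; the function names `f_cdf` and `f_ppf` are illustrative choices, not from the text.

```python
import math

def _betacf(a, b, x):
    """Continued fraction for the incomplete beta function (modified Lentz's method)."""
    TINY = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c, d = 1.0, 1.0 - qab * x / qap
    if abs(d) < TINY:
        d = TINY
    d = 1.0 / d
    h = d
    for m in range(1, 300):
        m2 = 2 * m
        # Even step of the continued fraction
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        if abs(d) < TINY:
            d = TINY
        c = 1.0 + aa / c
        if abs(c) < TINY:
            c = TINY
        d = 1.0 / d
        h *= d * c
        # Odd step of the continued fraction
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        if abs(d) < TINY:
            d = TINY
        c = 1.0 + aa / c
        if abs(c) < TINY:
            c = TINY
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-12:
            break
    return h

def _reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    ln_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(ln_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b

def f_cdf(x, v1, v2):
    """P(F <= x) for an F random variable with (v1, v2) degrees of freedom."""
    return _reg_inc_beta(v1 / 2.0, v2 / 2.0, v1 * x / (v1 * x + v2))

def f_ppf(p, v1, v2):
    """The x with P(F <= x) = p, e.g. p = .95 gives the alpha = .05 critical value."""
    lo, hi = 0.0, 1.0
    while f_cdf(hi, v1, v2) < p:   # bracket the root
        hi *= 2.0
    for _ in range(200):           # bisection to machine-level accuracy
        mid = 0.5 * (lo + hi)
        if f_cdf(mid, v1, v2) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For instance, `f_ppf(0.95, 5, 10)` reproduces the Table 5 entry 3.33, and `f_ppf(0.99, 5, 10)` the Table 6 entry 5.64, to two decimals.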
Data Index
Admission, 659
  examples, 620, 658
Airline distances, 703
  examples, 702
Air pollution, 40
  examples, 39, 208, 421, 470, 538
Amitriptyline, 423
  examples, 423
Archeological site distances, 744
  examples, 743, 745
Bankruptcy, 655
  examples, 44, 653, 654
Battery failure, 422
  examples, 421
Bird, 268
  examples, 268
Biting fly, 348
  examples, 347
Bonds, 342
  examples, 341
Bones (mineral content), 44, 349
  examples, 43, 209, 269, 347, 422, 472
Breakfast cereal, 664
  examples, 46, 662, 742
Bull, 47
  examples, 47, 209, 422, 423, 474, 541, 662
Calcium (bones), 323, 324
  examples, 326
Carapace (painted turtles), 339, 536
  examples, 338, 441-42, 450, 536
Census tract, 470
  examples, 439, 470, 540
College test scores, 228
  examples, 226, 267, 420
Computer data, 376, 397
  examples, 376, 379, 397, 402, 405, 407, 409
Crime, 575
  examples, 575-76
Crude oil, 661
  examples, 345, 632, 659
Diabetic, 578
  examples, 578
Effluent, 275
  examples, 275, 332-33
Egyptian skull, 344
  examples, 270, 344
Electrical consumption, 288
  examples, 288-89, 292, 333
Electrical time-of-use pricing, 345
  examples, 345
Examination scores, 502
  examples, 502
Female bear, 25
  examples, 24, 262
Financial, 39
  examples, 38-39, 184, 207, 421, 467
Forest, 727
  examples, 727, 742
Fowl, 518
  examples, 517, 535, 558, 565-67
Grizzly bear, 261-62
  examples, 261-62, 474-75
Hair (Peruvian), 262
  examples, 262-63
Hemophilia, 594, 660, 663
  examples, 593, 610, 660
Hook-billed kite, 341
  examples, 341
Iris, 657
  examples, 344, 626, 642, 656
Job satisfaction/characteristics, 561, 745
  examples, 559, 569, 571, 742
Lamentations, 740
  examples, 740
Lizard data-two genera, 330
  examples, 329
Lizard size, 17
  examples, 17, 18
Love and marriage, 321
  examples, 320-22
Lumber, 267
  examples, 267-68
Mental health, 745
  examples, 742, 745
Mice, 449, 471
  examples, 449, 454, 471, 540
Milk transportation cost, 269, 340
  examples, 46, 269, 339
Multiple sclerosis, 42
  examples, 42, 209, 653
Musical aptitude, 236
  examples, 236
National track records, 45, 473
  examples, 43, 44, 209, 472, 541, 742
Natural gas, 411
  examples, 410-12
Number parity, 338
  examples, 337
Numerals, 678
  examples, 677, 683, 686, 689
Nursing home, 303
  examples, 303, 306
Olympic decathlon, 495
  examples, 495, 508
Overtime (police), 240, 475
  examples, 239, 241, 243, 248, 270, 456, 459, 460, 475
Oxygen consumption, 343
  examples, 46, 342
Paper quality, 15
  examples, 14, 20, 209
Peanut, 349
  examples, 347
Plastic film, 313
  examples, 312-17, 337
Pottery, 709
  examples, 709, 745
Profitability, 537
  examples, 536-37, 578
Public utility, 687
  examples, 27, 28, 47, 671, 687, 690, 696, 704
Radiation, 181, 200
  examples, 180, 196, 200, 208, 221-23, 226, 233, 261
Radiotherapy, 43
  examples, 43, 209, 471
Reading/arithmetic test scores, 575
  examples, 575
Real estate, 368
  examples, 368, 420-21
Road distances, 743
  examples, 742
Salmon, 607
  examples, 606, 660
Sleeping dog, 281
  examples, 280
Smoking, 579
  examples, 578-80
Spectral reflectance, 351
  examples, 350
Spouse, 346
  examples, 345-47
Stiffness (lumber), 187, 192
  examples, 22, 187, 191, 337, 540, 578
Stock price, 469
  examples, 447, 453, 469, 489, 493-95, 499-501, 507, 514, 576, 740
Sweat, 215
  examples, 214, 261, 471
University, 722
  examples, 706, 721, 726
Welder, 245
  examples, 244
Wheat, 577
  examples, 577
Subject Index
Analysis of variance, multivariate:
  one-way, 298
  two-way, 309, 335
Analysis of variance, univariate:
  one-way, 293, 357
  two-way, 307
ANOVA (see Analysis of variance, univariate)
Autocorrelation, 411-12
Autoregressive model, 412
Average linkage (see Cluster analysis)
Biplot, 719
Bonferroni intervals:
  comparison with T² intervals, 234
  definition, 232
  for means, 232, 275, 290
  for treatment effects, 305, 312
Canonical correlation analysis:
  canonical correlations, 543, 545, 553, 557
  canonical variables, 543, 545-46, 557
  correlation coefficients in, 552, 558
  definition of, 545, 556
  errors of approximation, 564
  geometry of, 555
  interpretation of, 551
  population, 545-46
  sample, 556-57
  tests of hypothesis in, 569-70
  variance explained, 567
CART, 641
Central-limit theorem, 176
Characteristic equation, 98
Characteristic roots (see Eigenvalues)
Characteristic vectors (see Eigenvectors)
Chernoff faces, 28
Chi-square plots, 185
Classification:
  Anderson statistic, 612
  Bayes's rule (ECM), 587, 589-90, 614
  confusion matrix, 601
  error rates, 599, 601, 603
  expected cost, 587, 613
  interpretations in, 617-18, 624-25
  Lachenbruch holdout procedure, 602, 626
  linear discriminant functions, 591, 592, 610, 611, 617, 630
  misclassification probabilities, 585-86, 589
  with normal populations, 590, 596, 616
  quadratic discriminant function, 597, 617
  qualitative variables, 641
  selection of variables, 645
  for several groups, 612, 635
  for two groups, 582, 590, 611
Classification trees, 641
Cluster analysis:
  algorithm, 681, 694
  average linkage, 689
  complete linkage, 685
  dendrogram, 680
  hierarchical, 679
  inversions in, 693
Cluster analysis (continued):
  K-means, 694
  similarity and distance, 676
  similarity coefficients, 674, 677
  single linkage, 681
  Ward's method, 690
Coefficient of determination, 361, 400
Communality, 480
Complete linkage (see Cluster analysis)
Confidence intervals:
  mean of normal population, 211
  simultaneous, 225, 232, 235, 265, 275, 305, 312
Confidence regions:
  for contrasts, 280
  definition, 220
  for difference of mean vectors, 285, 291
  for mean vectors, 221, 235
  for paired comparisons, 275
Contingency table, 709
Contrast matrix, 279
Contrast vector, 278
Control chart:
  definition, 239
  ellipse format, 241, 250, 456
  for subsample means, 249, 251
  multivariate, 241, 457-58, 460-61
  T² chart, 243, 248, 250, 251, 459
Control regions:
  definition, 247
  for future observations, 247, 251, 460
Correlation:
  autocorrelation, 411-12
  coefficient of, 8, 72
  geometrical interpretation of sample, 119
  multiple, 361, 400, 554
  partial, 406
  sample, 8, 118
Correlation matrix:
  population, 73
  sample, 9
  tests of hypotheses for equicorrelation, 453-54
Correspondence analysis:
  algebraic development, 711
  correspondence matrix, 711
  inertia, 710, 718
  matrix approximation method, 717
  profile approximation method, 717
Correspondence matrix, 711
Covariance:
  definitions of, 70
  of linear combinations, 76, 77
  sample, 8
Covariance matrix:
  definitions of, 70
  distribution of, 175
  factor analysis models for, 480
  geometrical interpretation of sample, 119, 125-27
  large sample behavior, 175
  as matrix operation, 140
  partitioning, 74, 78
  population, 72
  sample, 124
Data mining:
  lift chart, 733
  model assessment, 733
  process, 732
Dendrogram, 680
Descriptive statistics:
  correlation coefficient, 8
  covariance, 8
  mean, 7
  variance, 7
Design matrix, 356, 384, 408
Determinant:
  computation of, 94
  product of eigenvalues, 105
Discriminant function (see Classification)
Distance:
  Canberra, 671
  Czekanowski, 671
  development of, 30-37, 65
  Euclidean, 30
  Minkowski, 670
  properties, 37
  statistical, 31, 36
Distributions:
  chi-square (table), 751
  F (table), 752-57
  multinomial, 264
  normal (table), 749
  Q-Q plot correlation coefficient (table), 182
  t (table), 750
  Wishart, 174
Eigenvalues, 98
Eigenvectors, 99
EM algorithm, 252
Estimation:
  generalized least squares, 417, 419
  least squares, 358
  maximum likelihood, 168
  minimum variance, 364-65
  unbiased, 122, 124, 364-65
Estimator (see Estimation)
Expected value, 67-68
Experimental unit, 5
Factor analysis:
  bipolar factor, 503
  common factors, 478, 479
  communalities, 480
  computational details, 530
  of correlation matrix, 486, 490, 532
  Heywood cases, 532, 534
  least squares (Bartlett) computation of factor scores, 511, 512
  loadings, 478, 479
  maximum likelihood estimation in, 492
  nonuniqueness of loadings, 483
  oblique rotation, 503, 509
  orthogonal factor model, 479
  principal component estimation in, 484
  principal factor estimation in, 490
  regression computation of factor scores, 513, 514
  residual matrix, 486
  rotation of factors, 501
  specific factors, 478, 479
  specific variance, 480
  strategy for, 517
  testing for the number of factors, 498
  varimax criterion, 504
Factor loading matrix, 478
Factor scores, 512, 514
Fisher's linear discriminants:
  population, 651-52
  sample, 611, 630
  scaling, 595
Gamma plot, 185
Gauss (Markov) theorem, 364
Generalized inverse, 363, 418
Generalized least squares (see Estimation)
Generalized variance:
  geometric interpretation of sample, 125, 136-37
  sample, 124, 136
  situations where zero, 130
General linear model:
  design matrix for, 356, 384
  multivariate, 384
  univariate, 356
Geometry:
  of classification, 624-25
  generalized variance, 125, 136-37
  of least squares, 361
  of principal components, 462
  of sample, 119
Gram-Schmidt process, 88
Graphical techniques:
  biplot, 719
  Chernoff faces, 28
  marginal dot diagrams, 12
  n points in p dimensions, 17
  p points in n dimensions, 19
  scatter diagram (plot), 11, 20
  stars, 25
Growth curve, 24, 323
Hat matrix, 358, 419
Heywood cases (see Factor analysis)
Hotelling's T² (see T²-statistic)
Independence:
  definition, 70
  of multivariate normal variables, 159-60
  of sample mean and covariance matrix, 174
  tests of hypotheses for, 468
Inequalities:
  Cauchy-Schwarz, 80
  extended Cauchy-Schwarz, 80
Inertia, 710, 718
Influential observations, 380
Invariance of maximum likelihood estimators, 172
Item (individual), 5
K-means (see Cluster analysis)
Lawley-Hotelling trace statistic, 331, 395
Leverage, 377, 380
Lift chart, 733
Likelihood function, 168
Likelihood ratio tests:
  definition, 219
  limiting distribution, 220
  in regression, 370, 392
  and T², 218
Linear combinations of vectors, 85, 165
Linear combinations of variables:
  mean of, 77
  normal populations, 156, 157
  sample covariances of, 142, 145
  sample means of, 142, 145
  variances and covariances of, 77
Linear structural relationships, 524
LISREL, 525
Logistic regression, 641
MANOVA (see Analysis of variance, multivariate)
Matrices:
  addition of, 90
  characteristic equation of, 98
  correspondence, 711
  definition of, 55, 89
  determinant of, 94, 105
  dimension of, 89
  eigenvalues of, 60, 98
  eigenvectors of, 60, 99
  generalized inverses of, 363, 418
  identity, 59, 92
  inverses of, 59, 96
  multiplication of, 92, 110
  orthogonal, 60, 98
  partitioned, 74, 75, 79
  positive definite, 61, 63
  products of, 57, 92, 93
  random, 67
  rank of, 96
  scalar multiplication in, 90
  singular and nonsingular, 96
  singular-value decomposition, 101, 714, 721
  spectral decomposition, 61-62, 100
  square root, 66
  symmetric, 58, 91
  trace of, 98
  transpose of, 56, 91
Maxima and minima (with matrices), 80, 81
Maximum likelihood estimation:
  development, 170-72
  invariance property of, 172
  in regression, 365, 390, 401-02
Mean, 67
Mean vector:
  definition, 70
  distribution of, 174
  large sample behavior, 175
  as matrix operation, 139
  partitioning, 74, 79
  sample, 9, 79
Minimal spanning tree, 708
Missing observations, 252
Multicollinearity, 382
Multidimensional scaling:
  algorithm, 700, 702
  development, 700-08
  sstress, 701
  stress, 701
Multiple comparisons (see Simultaneous confidence intervals)
Multiple correlation coefficient:
  population, 400, 554
  sample, 361
Multiple regression (see Regression and General linear model)
Multivariate analysis of variance (see Analysis of variance, multivariate)
Multivariate control chart (see Control chart)
Multivariate normal distribution (see Normal distribution, multivariate)
Neural network, 644
Nonlinear mapping, 708
Nonlinear ordination, 729
Normal distribution:
  bivariate, 151
  checking for normality, 177
  conditional, 160-61
  constant density contours, 153, 431
  marginal, 156, 158
  maximum likelihood estimation in, 171
  multivariate, 149-67
  properties of, 156-67
  transformations to, 194
Normal equations, 418
Normal probability plots (see Q-Q plots)
Outliers:
  definition, 189
  detection of, 190
Paired comparisons, 272-78
Partial correlation, 406
Partitioned matrix:
  definition, 74, 75, 79
  determinant of, 204-05
  inverse of, 205
Path diagram, 525-26
Pillai's trace statistic, 331, 395
Plots:
  biplot, 719
  Cp, 381
  factor scores, 515, 517
  gamma (or chi-square), 185
  principal components, 450-51
  Q-Q, 178, 378
  residual, 378-79
  scree, 441
Positive definite (see Quadratic forms)
Posterior probabilities, 589-90, 614
Principal component analysis:
  correlation coefficients in, 429, 438, 447
  for correlation matrix, 433, 447
  definition of, 427-28, 438
  equicorrelation matrix, 435-37, 453-54
  geometry of, 462-66
  interpretation of, 431-32
  large-sample theory of, 452-55
  monitoring quality with, 455-61
  plots, 450-51
  population, 426-37
  reduction of dimensionality by, 462-64
  sample, 437-49
  tests of hypotheses in, 453-55, 468
  variance explained, 429, 433, 447
Procrustes analysis:
  development, 723-29
  measure of agreement, 724
  rotation, 724
Profile analysis, 318-23
Proportions:
  large-sample inferences, 264-65
  multinomial distribution, 264
Q-Q plots:
  correlation coefficient, 182
  critical values, 182
  description, 178-83
Quadratic forms:
  definition, 63, 100
  extrema of, 81
  nonnegative definite, 63
  positive definite, 61, 63
Random matrix, 67
Random sample, 120
Regression (see also General linear model):
  autoregressive model, 412
  assumptions, 355-56, 365, 384, 390
  coefficient of determination, 361, 400
  confidence regions in, 367, 374, 396, 418
  Cp plot, 381
  decomposition of sum of squares, 360-61, 385
  extra sum of squares and cross products, 370-71, 393
  fitted values, 358, 384
  forecast errors in, 375
  Gauss theorem in, 364
  geometric interpretation of, 361
  least squares estimates, 358, 384
  likelihood ratio tests in, 370, 392
  maximum likelihood estimation in, 365, 390, 401, 404
  multivariate, 383-97
  regression coefficients, 358, 403
  regression function, 365, 401
  residual analysis in, 377-79
  residuals, 358, 378, 384
  residual sum of squares and cross products, 358, 385
  sampling properties of estimators, 363-65, 388
  selection of variables, 380-82
  univariate, 354-57
  weighted least squares, 417
  with time-dependent errors, 410-14
Regression coefficients (see Regression)
Repeated measures designs, 278-82, 323-27
Residuals, 358, 378, 384, 451
Roy's largest root, 331, 395
Sample:
  geometry, 112-19
Sample splitting, 517, 529, 602, 732, 733
Scree plot, 441
Simultaneous confidence ellipses:
  as projections, 259-60
Simultaneous confidence intervals:
  comparisons of, 229-31, 232-34
  for components of mean vectors, 225, 232, 235
  for contrasts, 280
  development, 223-26
  for differences in mean vectors, 287, 290, 291
  for paired comparisons, 275
  as projections, 258
  for regression coefficients, 367
  for treatment effects, 305-06, 312
Single linkage (see Cluster analysis)
Singular matrix, 96
Singular-value decomposition, 101, 714, 721
Special causes (of variation), 239
Specific variance, 480
Spectral decomposition, 61-62, 100
SStress, 701
Standard deviation:
  population, 73
  sample, 7
Standard deviation matrix:
  population, 73
  sample, 140
Standardized observations, 8, 445
Standardized variables, 432
Stars, 25
Strategy for multivariate comparisons, 332
Stress, 701
Structural equation models, 524-29
Studentized residuals, 377
Sufficient statistics, 173
Sums of squares and cross products matrices:
  between, 298
  total, 298
  within, 298, 299
Time dependence (in multivariate observations), 256-57, 410-14
T²-statistic:
  definition of, 211-12
  distribution of, 212
  invariance property of, 215-16
  in quality control, 243, 248, 250, 251, 459
  in profile analysis, 319
  for repeated measures designs, 279
  single-sample, 212
  two-sample, 285
Trace of a matrix, 98
Transformations of data, 194-202
Variables:
  canonical, 545-46, 556-57
  dummy, 357
  endogenous, 525
  exogenous, 525
  latent, 525
  predictor, 354
  response, 354
  standardized, 432
Variance:
  definition, 69
  generalized, 124, 136
  geometrical interpretation of, 119
  total sample, 138, 438, 447, 567
Varimax rotation criterion, 504
Vectors:
  addition, 52, 84
  angle between, 53, 86
  basis, 86
  definition of, 50, 84
  inner product, 53, 87
  length of, 52, 86
  linearly dependent, 54, 85
  linearly independent, 54, 85
  linear span, 85
  perpendicular (orthogonal), 54, 87
  projection of, 55, 88
  random, 67
  scalar multiplication, 84
  unit, 52
  vector space, 85
Wilks's lambda, 217, 299, 395
Wishart distribution, 174