Ulf Olsson
Generalized Linear Models An Applied Approach
Copying prohibited All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. The papers and inks used in this product are environment-friendly.
Art. No 31023
eISBN10 91-44-03141-6
eISBN13 978-91-44-03141-5
© Ulf Olsson and Studentlitteratur 2002
Cover design: Henrik Hast
Printed in Sweden
Studentlitteratur, Lund
Web-address: www.studentlitteratur.se
Contents

Preface

1 General Linear Models
  1.1 The role of models
  1.2 General Linear Models
  1.3 Estimation
  1.4 Assessing the fit of the model
    1.4.1 Predicted values and residuals
    1.4.2 Sums of squares decomposition
  1.5 Inference on single parameters
  1.6 Tests on subsets of the parameters
  1.7 Different types of tests
  1.8 Some applications
    1.8.1 Simple linear regression
    1.8.2 Multiple regression
    1.8.3 t tests and dummy variables
    1.8.4 One-way ANOVA
    1.8.5 ANOVA: Factorial experiments
    1.8.6 Analysis of covariance
    1.8.7 Non-linear models
  1.9 Estimability
  1.10 Assumptions in General linear models
  1.11 Model building
    1.11.1 Computer software for GLM:s
    1.11.2 Model building strategy
    1.11.3 A few SAS examples
  1.12 Exercises

2 Generalized Linear Models
  2.1 Introduction
    2.1.1 Types of response variables
    2.1.2 Continuous response
    2.1.3 Response as a binary variable
    2.1.4 Response as a proportion
    2.1.5 Response as a count
    2.1.6 Response as a rate
    2.1.7 Ordinal response
  2.2 Generalized linear models
  2.3 The exponential family of distributions
    2.3.1 The Poisson distribution
    2.3.2 The binomial distribution
    2.3.3 The Normal distribution
    2.3.4 The function b(·)
  2.4 The link function
    2.4.1 Canonical links
  2.5 The linear predictor
  2.6 Maximum likelihood estimation
  2.7 Numerical procedures
  2.8 Assessing the fit of the model
    2.8.1 The deviance
    2.8.2 The generalized Pearson χ² statistic
    2.8.3 Akaike's information criterion
    2.8.4 The choice of measure of fit
  2.9 Different types of tests
    2.9.1 Wald tests
    2.9.2 Likelihood ratio tests
    2.9.3 Score tests
    2.9.4 Tests of Type 1 or 3
  2.10 Descriptive measures of fit
  2.11 An application
  2.12 Exercises

3 Model diagnostics
  3.1 Introduction
  3.2 The Hat matrix
  3.3 Residuals in generalized linear models
    3.3.1 Pearson residuals
    3.3.2 Deviance residuals
    3.3.3 Score residuals
    3.3.4 Likelihood residuals
    3.3.5 Anscombe residuals
    3.3.6 The choice of residuals
  3.4 Influential observations and outliers
    3.4.1 Leverage
    3.4.2 Cook's distance and Dfbeta
    3.4.3 Goodness of fit measures
    3.4.4 Effect on data analysis
  3.5 Partial leverage
  3.6 Overdispersion
    3.6.1 Models for overdispersion
  3.7 Non-convergence
  3.8 Applications
    3.8.1 Residual plots
    3.8.2 Variance function diagnostics
    3.8.3 Link function diagnostics
    3.8.4 Transformation of covariates
  3.9 Exercises

4 Models for continuous data
  4.1 GLM:s as GLIM:s
    4.1.1 Simple linear regression
    4.1.2 Simple ANOVA
  4.2 The choice of distribution
  4.3 The Gamma distribution
    4.3.1 The Chi-square distribution
    4.3.2 The Exponential distribution
    4.3.3 An application with a gamma distribution
  4.4 The inverse Gaussian distribution
  4.5 Model diagnostics
    4.5.1 Plot of residuals against predicted values
    4.5.2 Normal probability plot
    4.5.3 Plots of residuals against covariates
    4.5.4 Influence diagnostics
  4.6 Exercises

5 Binary and binomial response variables
  5.1 Link functions
    5.1.1 The probit link
    5.1.2 The logit link
    5.1.3 The complementary log-log link
  5.2 Distributions for binary and binomial data
    5.2.1 The Bernoulli distribution
    5.2.2 The Binomial distribution
  5.3 Probit analysis
  5.4 Logit (logistic) regression
  5.5 Multiple logistic regression
    5.5.1 Model building
    5.5.2 Model building tools
    5.5.3 Model diagnostics
  5.6 Odds ratios
  5.7 Overdispersion in binary/binomial models
    5.7.1 Estimation of the dispersion parameter
    5.7.2 Modeling as a beta-binomial distribution
    5.7.3 An example of over-dispersed data
  5.8 Exercises

6 Response variables as counts
  6.1 Log-linear models: introductory example
    6.1.1 A log-linear model for independence
    6.1.2 When independence does not hold
  6.2 Distributions for count data
    6.2.1 The multinomial distribution
    6.2.2 The product multinomial distribution
    6.2.3 The Poisson distribution
    6.2.4 Relation to contingency tables
  6.3 Analysis of the example data
  6.4 Testing independence in an r×c crosstable
  6.5 Higher-order tables
    6.5.1 A three-way table
    6.5.2 Types of independence
    6.5.3 Genmod analysis of the drug use data
    6.5.4 Interpretation through Odds ratios
  6.6 Relation to logistic regression
    6.6.1 Binary response
    6.6.2 Nominal logistic regression
  6.7 Capture-recapture data
  6.8 Poisson regression models
  6.9 A designed experiment with a Poisson distribution
  6.10 Rate data
  6.11 Overdispersion in Poisson models
    6.11.1 Modeling the scale parameter
    6.11.2 Modeling as a Negative binomial distribution
  6.12 Diagnostics
  6.13 Exercises

7 Ordinal response
  7.1 Arbitrary scoring
  7.2 RC models
  7.3 Proportional odds
  7.4 Latent variables
  7.5 A Genmod example
  7.6 Exercises

8 Additional topics
  8.1 Variance heterogeneity
  8.2 Survival models
    8.2.1 An example
  8.3 Quasi-likelihood
  8.4 Quasi-likelihood for modeling overdispersion
  8.5 Repeated measures: the GEE approach
  8.6 Mixed Generalized Linear Models
  8.7 Exercises

Appendix A: Introduction to matrix algebra
  Some basic definitions
  The dimension of a matrix
  The transpose of a matrix
  Some special types of matrices
  Calculations on matrices
  Matrix multiplication
    Multiplication by a scalar
    Multiplication by a matrix
    Calculation rules of multiplication
    Idempotent matrices
  The inverse of a matrix
  Generalized inverses
  The rank of a matrix
  Determinants
  Eigenvalues and eigenvectors
  Some statistical formulas on matrix form
  Further reading

Appendix B: Inference using likelihood methods
  The likelihood function
  The Cramér-Rao inequality
  Properties of Maximum Likelihood estimators
  Distributions with many parameters
  Numerical procedures
    The Newton-Raphson method
    Fisher's scoring

Bibliography

Solutions to the exercises

Preface
Generalized Linear Models (GLIM:s) form a very general class of statistical models that includes many commonly used models as special cases. For example, the class of General Linear Models (GLM:s), which includes linear regression, analysis of variance and analysis of covariance, is a special case of GLIM:s. GLIM:s also include log-linear models for analysis of contingency tables, probit/logit regression, Poisson regression, and much more.
[Figure: Generalized linear models contain general linear models (regression analysis, analysis of variance, covariance analysis) as a special case, together with models for counts, proportions etc. (probit/logit regression, Poisson regression, log-linear models, generalized estimating equations, ...).]
In this book we give an overview of generalized linear models and present examples of their use. We assume that the reader has a basic understanding of statistical principles. Particularly important is a knowledge of statistical model building, regression analysis and analysis of variance. Some knowledge of matrix algebra (which is summarized in Appendix A), and knowledge of basic calculus, are mathematical prerequisites. Since many of the examples are based on analyses using SAS, some knowledge of the SAS system is recommended.

In Chapter 1 we summarize some results on general linear models, assuming equal variances and normal distributions. The models are formulated in matrix terms. Generalized linear models are introduced in Chapter 2. The exponential family of distributions is discussed, and we discuss Maximum Likelihood estimation and ways of assessing the fit of the model. This chapter provides the basic theory of generalized linear models. Chapter 3 covers model checking, which includes ways of assessing whether the data deviate from the model in some systematic way. In Chapters 4 to 7 we consider applications for different types of response variables. Response variables as continuous variables, as binary/binomial variables, as counts and as ordinal response variables are discussed, and practical examples using the Genmod software of the SAS package are given. Finally, in Chapter 8 we discuss theory and applications of a more complex nature, like quasi-likelihood procedures, repeated measures models, mixed models and analysis of survival data.

Terminology in this area of statistics is a bit confused. In this book we will let the acronym GLM denote "General Linear Models", while we will let GLIM denote "Generalized Linear Models". This is also a way of paying homage to two useful computer procedures, the GLM procedure of the SAS package, and the pioneering GLIM software.

Several students and colleagues have read and commented on earlier versions of the book. In particular, I would like to thank Gunnar Ekbohm, Jan-Eric Englund, Carolyn Glynn, Anna Gunsjö, Esbjörn Ohlsson, Tomas Pettersson and Birgitta Vegerfors for giving many useful comments.

Most of the data sets for the examples and exercises are available on the Internet. They can be downloaded from the publisher's home page at http://www.studentlitteratur.se.
1. General Linear Models
1.1 The role of models
Many of the methods taught during elementary statistics courses can be collected under the heading general linear models, GLM. Statistical packages like SAS, Minitab and others have standard procedures for general linear models. GLM:s include regression analysis, analysis of variance, and analysis of covariance. Some applied researchers are not aware that even their simplest analyses are, in fact, model based.
But ... I'm not using any model. I'm only doing a few t tests.
Models play an important role in statistical inference. A model is a mathematical way of describing the relationships between a response variable and a set of independent variables. Some models can be seen as a theory about how the data were generated. Other models are only intended to provide a convenient summary of the data. Statistical models, as opposed to deterministic models, account for the possibility that the relationship is not perfect. This is done by allowing for unexplained variation, in the form of residuals.
A way of describing a frequently used class of statistical models is

$$\text{Response} = \text{Systematic component} + \text{Residual component} \tag{1.1}$$
Models of type (1.1) are, at best, approximations of the actual conditions. A model is seldom “true” in any real sense. The best we can look for may be a model that can provide a reasonable approximation to reality. However, some models are certainly better than others. The role of the statistician is to find a model that is reasonable, while at the same time it is simple enough to be interpretable.
1.2 General Linear Models

In a general linear model (GLM), the observed value of the dependent variable $y$ for observation number $i$ ($i = 1, 2, \ldots, n$) is modeled as a linear function of $(p-1)$ so-called independent variables $x_1, x_2, \ldots, x_{p-1}$ as

$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_{p-1} x_{i(p-1)} + e_i \tag{1.2}$$

or in matrix terms

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}. \tag{1.3}$$

In (1.3),

$$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$

is a vector of observations on the dependent variable;

$$\mathbf{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1(p-1)} \\ 1 & x_{21} & \cdots & x_{2(p-1)} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{n(p-1)} \end{pmatrix}$$

is a known matrix of dimension $n \times p$, called a design matrix, that contains the values of the independent variables and one column of 1:s corresponding to the intercept;

$$\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix}$$

is a vector containing $p$ parameters to be estimated (including the intercept); and

$$\mathbf{e} = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}$$

is a vector of residuals. It is common to assume that the residuals in $\mathbf{e}$ are independent, normally distributed and that the variances are the same for all $e_i$. Some models do not contain any intercept term $\beta_0$. In such models, the leftmost column of the design matrix $\mathbf{X}$ is omitted.

The purpose of the analysis may be model building, estimation, prediction, hypothesis testing, or a combination of these. We will briefly summarize some results on estimation and hypothesis testing in general linear models. For a more complete description reference is made to standard textbooks in regression analysis, such as Draper and Smith (1998) or Sen and Srivastava (1990); and textbooks in analysis of variance, such as Montgomery (1984) or Christensen (1996).
1.3 Estimation

Estimation of parameters in general linear models is often done using the method of least squares. For normal theory models this is equivalent to Maximum Likelihood estimation. The parameters are estimated with those values for which the sum of the squared residuals, $\sum_i e_i^2$, is minimal. In matrix terms, this sum of squares is

$$\mathbf{e}'\mathbf{e} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}). \tag{1.4}$$

Minimizing (1.4) with respect to the parameters in $\boldsymbol{\beta}$ gives the normal equations

$$\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{X}'\mathbf{y}. \tag{1.5}$$

If the matrix $\mathbf{X}'\mathbf{X}$ is nonsingular, this yields, as estimators of the parameters of the model,

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}. \tag{1.6}$$

Throughout this text we will use a "hat", $\hat{\ }$, to symbolize an estimator. If the inverse of $\mathbf{X}'\mathbf{X}$ does not exist, we can still find a solution, although the solution may not be unique. We can use generalized inverses (see Appendix A) and find a solution as

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}. \tag{1.7}$$

Alternatively we can restrict the number of parameters in the model by introducing constraints that lead to a nonsingular $\mathbf{X}'\mathbf{X}$.
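The estimators (1.6) and (1.7) can be sketched numerically as follows. This is a minimal illustration using NumPy rather than SAS; the data values and the variable names are hypothetical, and the Moore-Penrose inverse is used as one convenient choice of generalized inverse, not the only possible one.

```python
import numpy as np

# Illustrative data (hypothetical): n = 5 observations, one covariate.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# Nonsingular case: solve the normal equations X'X b = X'y, cf. (1.5)-(1.6).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Singular case: an over-parameterised one-way layout with an intercept
# plus one dummy per group, so the columns are linearly dependent.
g = np.array([0, 0, 1, 1, 1])
X_sing = np.column_stack([np.ones(5), g == 0, g == 1]).astype(float)

# The Moore-Penrose inverse is one particular generalized inverse, cf. (1.7).
# rcond is set explicitly so the zero singular value is treated as zero.
beta_ginv = np.linalg.pinv(X_sing.T @ X_sing, rcond=1e-10) @ X_sing.T @ y

# The parameter vector is not unique, but the fitted values are:
# they equal the group means whichever generalized inverse is used.
fitted = X_sing @ beta_ginv
group_means = np.array([y[g == k].mean() for k in g])
```

In practice a routine such as `np.linalg.lstsq` is usually preferred to forming $\mathbf{X}'\mathbf{X}$ explicitly, since it is numerically more stable; the explicit form is kept here only to mirror the formulas.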
1.4 Assessing the fit of the model

1.4.1 Predicted values and residuals

When the parameters of a general linear model have been estimated you may want to assess how well the model fits the data. This is done by subdividing the variation in the data into two parts: systematic variation and unexplained variation. Formally, this is done as follows. We define the predicted value (or fitted value) of the response variable as

$$\hat{y}_i = \sum_{j=0}^{p-1} \hat{\beta}_j x_{ij} \tag{1.8}$$

or in matrix terms

$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}. \tag{1.9}$$

The predicted values are the values that we would get on the dependent variable if the model had been perfect, i.e. if all residuals had been zero. The difference between the observed value and the predicted value is the observed residual:

$$\hat{e}_i = y_i - \hat{y}_i. \tag{1.10}$$
1.4.2 Sums of squares decomposition

The total variation in the data can be measured as the total sum of squares,

$$SS_T = \sum_i (y_i - \bar{y})^2.$$

This can be subdivided as

$$\sum_i (y_i - \bar{y})^2 = \sum_i (y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2 = \sum_i (y_i - \hat{y}_i)^2 + \sum_i (\hat{y}_i - \bar{y})^2 + 2\sum_i (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}). \tag{1.11}$$

The last term can be shown to be zero. Thus, the total sum of squares $SS_T$ can be subdivided into two parts:

$$SS_{Model} = \sum_i (\hat{y}_i - \bar{y})^2$$

and

$$SS_e = \sum_i (y_i - \hat{y}_i)^2.$$

$SS_e$, called the residual (or error) sum of squares, will be small if the model fits the data well. The sums of squares can also be written in matrix terms. It holds that

$$\begin{aligned}
SS_T &= \sum_i (y_i - \bar{y})^2 = \mathbf{y}'\mathbf{y} - n\bar{y}^2 && \text{with } n - 1 \text{ degrees of freedom (df)}, \\
SS_{Model} &= \sum_i (\hat{y}_i - \bar{y})^2 = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - n\bar{y}^2 && \text{with } p - 1 \text{ df}, \\
SS_e &= \sum_i (y_i - \hat{y}_i)^2 = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} && \text{with } n - p \text{ df}.
\end{aligned}$$
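The claim that the cross-product term in (1.11) is zero can be sketched as follows, under the assumption that the model contains an intercept, so that a column of ones (written $\mathbf{1}$ here, a symbol not used elsewhere in the text) is one of the columns of $\mathbf{X}$. Writing $\hat{\mathbf{e}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ for the vector of observed residuals,

$$\sum_i (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = \hat{\mathbf{e}}'(\mathbf{X}\hat{\boldsymbol{\beta}} - \bar{y}\mathbf{1}) = (\mathbf{X}'\hat{\mathbf{e}})'\hat{\boldsymbol{\beta}} - \bar{y}\,\mathbf{1}'\hat{\mathbf{e}}.$$

The normal equations (1.5) give $\mathbf{X}'\hat{\mathbf{e}} = \mathbf{X}'\mathbf{y} - \mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{0}$, so the first term vanishes. Since $\mathbf{1}$ is a column of $\mathbf{X}$, the same equations give $\mathbf{1}'\hat{\mathbf{e}} = 0$, so the second term vanishes as well. For a model without an intercept this argument does not go through, and the decomposition around $\bar{y}$ need not hold.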
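The subdivision of the total variation (the total sum of squares) into parts is often summarized as an analysis of variance table:

Source     Sum of squares (SS)                                                          df      MS = SS/df
Model      $SS_{Model} = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - n\bar{y}^2$   p − 1   $MS_{Model}$
Residual   $SS_e = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$   n − p   $MS_e = \hat{\sigma}^2$
Total      $SS_T = \mathbf{y}'\mathbf{y} - n\bar{y}^2$                                  n − 1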
These results can be used in several ways. $MS_e$ provides an estimator of $\sigma^2$, which is the variance of the residuals. A descriptive measure of the fit of the model to data can be calculated as

$$R^2 = \frac{SS_{Model}}{SS_T} = 1 - \frac{SS_e}{SS_T}. \tag{1.12}$$

$R^2$ is called the coefficient of determination. It holds that $0 \le R^2 \le 1$. For data where the predicted values $\hat{y}_i$ all are equal to the corresponding observed values $y_i$, $R^2$ would be 1. It is not possible to judge a model based on $R^2$ alone. In some applications, for example econometric model building, models often have values of $R^2$ very close to 1. In other applications models can be valuable and interpretable although $R^2$ is rather small. When several models have been fitted to the same data, $R^2$ can be used to judge which model to prefer. However, since $R^2$ increases (or is unchanged) when new terms are added to the model, model comparisons are often based on the adjusted $R^2$. The adjusted $R^2$ decreases when irrelevant terms are added to the model. It is defined as

$$R^2_{adj} = 1 - \frac{n-1}{n-p}\left(1 - R^2\right) = 1 - \frac{MS_e}{SS_T/(n-1)}. \tag{1.13}$$

This can be interpreted as

$$R^2_{adj} = 1 - \frac{\text{Variance estimated from the model}}{\text{Variance estimated without any model}}.$$

A formal test of the full model (i.e. a test of the hypothesis that $\beta_1, \ldots, \beta_{p-1}$ are all zero) can be obtained as

$$F = \frac{MS_{Model}}{MS_e}. \tag{1.14}$$

This is compared to appropriate percentage points of the F distribution with $(p-1, n-p)$ degrees of freedom.
1.5 Inference on single parameters

Parameter estimators in general linear models are linear functions of the observed data. Thus, the estimator of any parameter $\beta_j$ can be written as

$$\hat{\beta}_j = \sum_i w_{ij} y_i \tag{1.15}$$

where $w_{ij}$ are known weights. If we assume that the $y_i$:s are independent and all have the same variance $\sigma^2$, this makes it possible to obtain the variance of any parameter estimator as

$$Var\left(\hat{\beta}_j\right) = \sum_i w_{ij}^2 \sigma^2. \tag{1.16}$$

The variance $\sigma^2$ can be estimated from data as

$$\hat{\sigma}^2 = \frac{\sum_i \hat{e}_i^2}{n-p} = MS_e. \tag{1.17}$$

The variance of a parameter estimator $\hat{\beta}_j$ can now be estimated as

$$\widehat{Var}\left(\hat{\beta}_j\right) = \sum_i w_{ij}^2 \hat{\sigma}^2. \tag{1.18}$$

This makes it possible to calculate confidence intervals and to test hypotheses about single parameters. A test of the hypothesis that the parameter $\beta_j$ is zero can be made by comparing

$$t = \frac{\hat{\beta}_j}{\sqrt{\widehat{Var}\left(\hat{\beta}_j\right)}} \tag{1.19}$$

with the appropriate percentage point of the t distribution with $n - p$ degrees of freedom. Similarly,

$$\hat{\beta}_j \pm t_{(1-\alpha/2,\, n-p)} \sqrt{\widehat{Var}\left(\hat{\beta}_j\right)} \tag{1.20}$$

would provide a $(1-\alpha)\cdot 100\%$ confidence interval for the parameter $\beta_j$.
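The weights in (1.15) can be read off from the least squares estimator (1.6). The following is a brief sketch, assuming that $\mathbf{X}'\mathbf{X}$ is nonsingular and that the parameters are indexed $j = 0, \ldots, p-1$; the notation $[\,\cdot\,]_{ji}$ for matrix elements is introduced here for illustration. Since $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$, the weight $w_{ij}$ is the element in row $j$ and column $i$ of $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$, and

$$\sum_i w_{ij}^2 = \left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\right]_{jj} = \left[(\mathbf{X}'\mathbf{X})^{-1}\right]_{jj}.$$

Under the stated assumptions, (1.16) is therefore $\sigma^2$ times the $j$th diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$, and (1.18) is obtained by replacing $\sigma^2$ with $MS_e$.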
1.6 Tests on subsets of the parameters

In some cases it is of interest to make simultaneous inference about several parameters. For example, in a model with $p$ parameters one may wish to simultaneously test if $q$ of the parameters are zero. This can be done in the following way: Estimate the parameters of the full model. This will give an error sum of squares, $SS_{e1}$, with $(n - p)$ degrees of freedom. Now estimate the parameters of the smaller model, i.e. the model with fewer parameters. This will give an error sum of squares, $SS_{e2}$, with $(n - p + q)$ degrees of freedom, where $q$ is the number of parameters that are included in model 1, but not in model 2. Under the null hypothesis, the difference $SS_{e2} - SS_{e1}$, divided by $\sigma^2$, follows a $\chi^2$ distribution with $q$ degrees of freedom. We can now test hypotheses of type $H_0: \beta_1 = \beta_2 = \ldots = \beta_q = 0$ by the F test

$$F = \frac{(SS_{e2} - SS_{e1})/q}{SS_{e1}/(n-p)} \tag{1.21}$$

with $(q, n - p)$ degrees of freedom.
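The comparison of a full and a reduced model in (1.21) can be sketched in a few lines of NumPy. The helper names `sse` and `nested_f` and the simulated data are assumptions made for this illustration only; they do not come from the text or from SAS.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def nested_f(X_full, X_reduced, y):
    """F statistic (1.21) for the q parameters dropped from the full model."""
    n, p = X_full.shape
    q = p - X_reduced.shape[1]
    sse1, sse2 = sse(X_full, y), sse(X_reduced, y)
    return ((sse2 - sse1) / q) / (sse1 / (n - p)), q, n - p

# Simulated data (illustrative only): y depends on x1 but not on x2 or x3.
rng = np.random.default_rng(0)
n = 30
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X_full = np.column_stack([np.ones(n), x1, x2, x3])   # p = 4 parameters
X_reduced = X_full[:, :2]                            # drops x2 and x3, so q = 2

F, df1, df2 = nested_f(X_full, X_reduced, y)
print(f"F = {F:.3f} on ({df1}, {df2}) degrees of freedom")
```

When only one parameter is dropped (q = 1), the F statistic equals the square of the t statistic (1.19) for that parameter, which gives a simple consistency check. A p-value could be obtained from the F distribution, for example with `scipy.stats.f.sf(F, df1, df2)` if SciPy is available.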
1.7 Different types of tests

Tests of single parameters in general linear models depend on the order in which the hypotheses are tested. Tests in balanced analysis of variance designs are exceptions; in such models the different parameter estimates are independent. In other cases there are several ways to test hypotheses. SAS handles this problem by allowing the user to select among four different types of tests.

Type 1 means that the test for each parameter is calculated as the change in $SS_e$ when the parameter is added to the model, in the order given in the MODEL statement. If we have the model Y = A B A*B, $SS_A$ is calculated first as if the experiment had been a one-factor experiment (model: Y = A). Then $SS_{B|A}$ is calculated as the reduction in $SS_e$ when we run the model Y = A B, and finally the interaction $SS_{AB|A,B}$ is obtained as the reduction in $SS_e$ when we also add the interaction to the model. This can be written as SS(A), SS(B|A) and SS(AB|A, B). Type 1 SS are sometimes called sequential sums of squares.

Type 2 means that the SS for each parameter is calculated as if the factor had been added last to the model except that, for interactions, all main effects that are part of the interaction should also be included. For the model Y = A B A*B this gives the SS as SS(A|B), SS(B|A) and SS(AB|A, B).

Type 3 is, loosely speaking, an attempt to calculate what the SS would have been if the experiment had been balanced. These are often called partial sums of squares. These SS cannot in general be computed by comparing model SS from several models. The Type 3 SS are generally preferred when experiments are unbalanced. One problem with them is that the sum of the SS for all factors and interactions is generally not the same as the Total SS. Minitab gives the Type 3 SS as "Adjusted Sum of Squares".

Type 4 differs from Type 3 in the method of handling empty cells, i.e. incomplete experiments.

If the experiment is balanced, all these SS will be equal. In practice, tests in unbalanced situations are often done using Type 3 SS (or "Adjusted Sum of Squares" in Minitab). Unfortunately, this is not an infallible method.
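The order dependence of sequential sums of squares can be illustrated on a small unbalanced layout. The sketch below uses NumPy rather than SAS, a hypothetical 2 x 2 data set with cell sizes 3, 1, 1 and 3, and a main-effects model only (the interaction is left out to keep the example short); the helper `sse` is likewise an assumption of this illustration.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

# Hypothetical unbalanced 2 x 2 layout (cell sizes 3, 1, 1, 3).
a = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)   # dummy for factor A
b = np.array([0, 0, 0, 1, 0, 1, 1, 1], dtype=float)   # dummy for factor B
y = np.array([10, 11, 12, 15, 14, 20, 21, 22], dtype=float)
one = np.ones_like(y)

sse_0 = sse(one[:, None], y)                    # intercept only
sse_a = sse(np.column_stack([one, a]), y)       # model Y = A
sse_b = sse(np.column_stack([one, b]), y)       # model Y = B
sse_ab = sse(np.column_stack([one, a, b]), y)   # model Y = A B

ss_a = sse_0 - sse_a             # SS(A):   A entered first
ss_b_given_a = sse_a - sse_ab    # SS(B|A): B entered after A
ss_b = sse_0 - sse_b             # SS(B):   B entered first
ss_a_given_b = sse_b - sse_ab    # SS(A|B): A entered after B

print(f"Order A, B: SS(A) = {ss_a:.3f}, SS(B|A) = {ss_b_given_a:.3f}")
print(f"Order B, A: SS(B) = {ss_b:.3f}, SS(A|B) = {ss_a_given_b:.3f}")
```

Both orderings add up to the same model sum of squares, but the SS attributed to each factor changes with the order because the two dummy variables are correlated in the unbalanced design. For this main-effects model, SS(A|B) and SS(B|A) correspond to the Type 2 description above.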
1.8 Some applications

1.8.1 Simple linear regression

In regression analysis, the design matrix $\mathbf{X}$ often contains one column that only contains 1:s (corresponding to the intercept), while the remaining columns contain the values of the independent variables. Thus, the small regression model $y_i = \beta_0 + \beta_1 x_i + e_i$ with $n = 4$ observations can be written in matrix form as

$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ 1 & x_4 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \begin{pmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \end{pmatrix}. \tag{1.22}$$

Example 1.1 An experiment has been made to study the emission of CO2 from the root zone of barley (Zagal et al., 1993). The emission of CO2 was measured on a number of plants at different times after planting. A small part of the data is given in the following table and graph:

Time    Emission
24      11.069
24      15.255
30      26.765
30      28.200
35      34.730
35      35.830
38      41.677
38      45.351

[Figure: Emission of CO2 as a function of time, with the fitted line Y = -36.7443 + 2.09776X, R-Sq = 97.5 %. Horizontal axis: Time; vertical axis: Emission.]

One purpose of the experiment was to describe how y = CO2 emission develops over time. The graph suggests that a linear trend may provide a reasonable approximation to the data, over the time span covered by the experiment. The linear function fitted to these data is $\hat{y} = -36.7 + 2.1x$. A SAS regression output, including ANOVA table, is given below. It can be concluded that the emission of CO2 increases significantly with time, the rate of increase being about 2.1 units per time unit.
Dependent Variable: EMISSION

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1       992.3361798    992.3361798     234.63    0.0001
Error               6        25.3765201      4.2294200
Corrected Total     7      1017.7126999

R-Square    C.V.        Root MSE    EMISSION Mean
0.975065    6.887412    2.056555    29.85963

Parameter    Estimate        T for H0: Parameter=0    Pr > |T|    Std Error of Estimate
INTERCEPT    -36.74430710     -8.33                   0.0002      4.40858691
TIME           2.09776164     15.32                   0.0001      0.13695161

□
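The figures in the SAS output of Example 1.1 can be reproduced directly from the matrix formulas (1.6), (1.12), (1.14) and (1.17) to (1.19). The following NumPy sketch is an illustration only, not the way SAS performs the computation; the variable names are chosen here for readability.

```python
import numpy as np

# Data of Example 1.1.
time = np.array([24, 24, 30, 30, 35, 35, 38, 38], dtype=float)
emission = np.array([11.069, 15.255, 26.765, 28.200,
                     34.730, 35.830, 41.677, 45.351])

n = len(emission)
X = np.column_stack([np.ones(n), time])      # design matrix as in (1.22)
p = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ emission          # least squares estimates (1.6)
fitted = X @ beta_hat                        # predicted values (1.9)
resid = emission - fitted                    # observed residuals (1.10)

ss_e = float(resid @ resid)
ss_t = float(((emission - emission.mean()) ** 2).sum())
ss_model = ss_t - ss_e
ms_e = ss_e / (n - p)                        # estimate of sigma^2 (1.17)
r2 = ss_model / ss_t                         # coefficient of determination (1.12)
f_value = (ss_model / (p - 1)) / ms_e        # F test of the full model (1.14)
std_err = np.sqrt(ms_e * np.diag(XtX_inv))   # square roots of (1.18)
t_values = beta_hat / std_err                # t statistics (1.19)

print("estimates:", beta_hat)
print("std errors:", std_err)
print("t values:", t_values)
print(f"SS_Model = {ss_model:.4f}, SS_e = {ss_e:.4f}, R2 = {r2:.6f}, F = {f_value:.2f}")
```

The printed values can be compared line by line with the ANOVA table and the parameter estimates above.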
1.8.2 Multiple regression

Generalization of simple linear regression models of type (1.1) to include more than one independent variable is rather straightforward. For example, suppose that $y$ may depend on two variables, and that we have made $n = 6$ observations. The regression model is then $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + e_i$, $i = 1, \ldots, 6$. In matrix terms this model is

$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ 1 & x_{31} & x_{32} \\ 1 & x_{41} & x_{42} \\ 1 & x_{51} & x_{52} \\ 1 & x_{61} & x_{62} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix} + \begin{pmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \\ e_5 \\ e_6 \end{pmatrix}. \tag{1.23}$$

Example 1.2 Professor Orley Ashenfelter issues a wine magazine, "Liquid assets", giving advice about good years. He bases his advice on multiple regression of y = Price of the wine at wine auctions with meteorological data as predictors. The New York Times used the headline "Wine Equation Puts Some Noses Out of Joint" on an article about Prof. Ashenfelter. Base material was taken from "Departures" magazine, September/October 1990, but the data are invented. The variables in the data set below are:

• Rain_W = Amount of rain during the winter.
• Av_temp = Average temperature.
• Rain_H = Rain in the harvest season.
• y = Quality, which is an index based on auction prices.

A set of data of this type is reproduced in Table 1.1.

Table 1.1: Data for prediction of the quality of wine.

Year    Rain_W    Av_temp    Rain_H    Quality
1975    123       23         23        89
1976    66        21         100       70
1977    58        20         27        77
1978    109       26         33        87
1979    46        22         102       73
1980    40        19         77        70
1981    42        18         85        60
1982    167       25         14        92
1983    99        28         17        87
1984    48        24         47        79
1985    85        24         28        84
1986    177       27         11        93
1987    80        22         45        75
1988    64        25         40        82
1989    75        25         16        88

A multiple regression output from Minitab based on these data is as follows:

Regression Analysis

The regression equation is
Quality = 48.9 + 0.0594 Rain_W + 1.36 Av_temp - 0.118 Rain_H

Predictor    Coef        StDev      T        P
Constant     48.91       10.41      4.70     0.001
Rain_W       0.05937     0.02767    2.15     0.055
Av_temp      1.3603      0.4187     3.25     0.008
Rain_H       -0.11773    0.04010    -2.94    0.014

S = 3.092    R-Sq = 91.6%    R-Sq(adj) = 89.4%

Analysis of Variance

Source            DF    SS         MS        F        P
Regression         3    1152.43    384.14    40.18    0.000
Residual Error    11     105.17      9.56
Total             14    1257.60
The output indicates that the three predictor variables do indeed have a relationship to the wine quality, as measured by the price. The variable Rain_W is not quite significant at the 5% level (p = 0.055) but would be included in a predictive model. The size and direction of these relationships are given by the estimated coefficients of the regression equation. It appears that years with much winter rain, a high average temperature, and only a small amount of rain at harvest time would produce good wine. ¤
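To make the link between the matrix formulation and the Minitab listing concrete, the sketch below fits the same model to the data of Table 1.1 by solving the normal equations (X'X)b = X'y. It is an illustration in plain Python, not the routine Minitab uses; the helper solve and all variable names are our own. The printed coefficients and sums of squares can be compared with the output above.

```python
# Data from Table 1.1 (Rain_W, Av_temp, Rain_H, Quality).
rain_w = [123, 66, 58, 109, 46, 40, 42, 167, 99, 48, 85, 177, 80, 64, 75]
av_temp = [23, 21, 20, 26, 22, 19, 18, 25, 28, 24, 24, 27, 22, 25, 25]
rain_h = [23, 100, 27, 33, 102, 77, 85, 14, 17, 47, 28, 11, 45, 40, 16]
quality = [89, 70, 77, 87, 73, 70, 60, 92, 87, 79, 84, 93, 75, 82, 88]

# Design matrix X: a column of ones followed by the three predictors.
X = [[1.0, w, t, h] for w, t, h in zip(rain_w, av_temp, rain_h)]
y = [float(q) for q in quality]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]    # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * pv for v, pv in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


p = len(X[0])
# Normal equations: (X'X) b = X'y
xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
b = solve(xtx, xty)

fitted = [sum(bj * xj for bj, xj in zip(b, row)) for row in X]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
y_bar = sum(y) / len(y)
ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_error = sum(r ** 2 for r in residuals)

print("coefficients:", [round(v, 5) for v in b])
print("SS total:", round(ss_total, 2), " SS error:", round(ss_error, 2))
print("R-square:", round(1 - ss_error / ss_total, 3))
```

Solving the normal equations directly is adequate for a small, well-behaved example like this one; statistical packages generally rely on numerically more stable decompositions.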
1.8.3 t tests and dummy variables
Classification variables (non-numeric variables), such as treatments, groups or blocks, can be included in the model as so-called dummy variables, i.e. as variables that only take on the values 0 or 1. For example, a simple t test on data with two groups and three observations per group can be formulated as

\[
y_{ij} = \mu + \beta d_i + e_{ij}, \quad i = 1, 2; \; j = 1, 2, 3.
\]

Here, $\mu$ is a general mean value, $d_i$ is a dummy variable that has value $d_i = 1$ if the observation belongs to group 1 and $d_i = 0$ if it belongs to group 2, and $e_{ij}$ is a residual. According to this model, the population mean value for group 1 is $\mu_1 = \mu + \beta$ and the population mean value for group 2 is simply $\mu_2 = \mu$. In the t test situation we want to examine whether $\mu_1$ is different from $\mu_2$, i.e. whether $\beta$ is different from 0. This model can be written in matrix terms as

\[
\begin{pmatrix} y_{11} \\ y_{12} \\ y_{13} \\ y_{21} \\ y_{22} \\ y_{23} \end{pmatrix}
=
\begin{pmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & 0 \\ 1 & 0 \\ 1 & 0 \end{pmatrix}
\begin{pmatrix} \mu \\ \beta \end{pmatrix}
+
\begin{pmatrix} e_{11} \\ e_{12} \\ e_{13} \\ e_{21} \\ e_{22} \\ e_{23} \end{pmatrix}.
\tag{1.24}
\]
Example 1.3 In a pharmacological study (Rea et al, 1984), researchers measured the concentration of Dopamine in the brains of six control rats and of six rats that had been exposed to toluene. The concentrations in the striatum region of the brain are given in Table 1.2.

Table 1.2: Dopamine levels in the brains of rats under two treatments.

Dopamine, ng/kg
Toluene group    Control group
        3.420            1.820
        2.314            1.843
        1.911            1.397
        2.464            1.803
        2.781            2.539
        2.803            1.990

The interest lies in comparing the two groups with respect to average Dopamine level. This is often done as a two sample t test. To illustrate that the t test is actually a special case of a general linear model, we analyzed these data with Minitab using regression analysis with Group as a dummy variable. Rats in the toluene group were given the value 1 on the dummy variable, while rats in the control group were coded as 0. The Minitab output of the regression analysis is:

Regression Analysis

The regression equation is
Dopamine level = 1.90 + 0.717 Group

Predictor      Coef     StDev        T        P
Constant     1.8987    0.1830    10.38    0.000
Group        0.7168    0.2587     2.77    0.020

S = 0.4482    R-Sq = 43.4%    R-Sq(adj) = 37.8%

Analysis of Variance
Source            DF        SS        MS       F        P
Regression         1    1.5416    1.5416    7.68    0.020
Residual Error    10    2.0084    0.2008
Total             11    3.5500
The output indicates a significant Group effect (t = 2.77, p = 0.020). The size of this group effect is estimated as the coefficient $\hat{\beta}_1 = 0.7168$. This means that the toluene group has an estimated mean value that is 0.7168 units higher than the mean value in the control group. The reader might wish to check that this calculation is correct, and that the t test given by the regression routine does actually give the same results as a t test performed according to textbook formulas. Also note that the F test in the output is related to the t test through $t^2 = F$: $2.77^2 = 7.68$. These two tests are equivalent. ¤
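The check suggested above can be sketched in a few lines. The code below, an illustration in plain Python with function and variable names of our own choosing, computes the two-sample t statistic from the textbook formula with a pooled variance, and the regression quantities (intercept, slope, F) that correspond to a 0/1 dummy variable, using the data of Table 1.2.

```python
import math

# Dopamine data from Table 1.2 (ng/kg).
toluene = [3.420, 2.314, 1.911, 2.464, 2.781, 2.803]
control = [1.820, 1.843, 1.397, 1.803, 2.539, 1.990]


def mean(v):
    return sum(v) / len(v)


def ss(v):
    """Sum of squared deviations from the mean."""
    m = mean(v)
    return sum((x - m) ** 2 for x in v)


n1, n2 = len(toluene), len(control)

# Regression on a 0/1 dummy: the intercept is the mean of the group coded 0,
# the slope is the difference between the two group means.
intercept = mean(control)
slope = mean(toluene) - mean(control)

# Pooled residual variance (MSe) with n1 + n2 - 2 degrees of freedom.
df_error = n1 + n2 - 2
ms_error = (ss(toluene) + ss(control)) / df_error
se_slope = math.sqrt(ms_error * (1 / n1 + 1 / n2))

t = slope / se_slope                                  # two-sample t statistic
ss_regression = ss(toluene + control) - ms_error * df_error
f = ss_regression / ms_error                          # F with (1, df_error) df

print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
print(f"t = {t:.2f}, t^2 = {t * t:.2f}, F = {f:.2f}")
```

The intercept and slope should agree with the Constant and Group rows of the Minitab output, and the squared t statistic with the F value in the ANOVA table.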
1.8.4 One-way ANOVA
The generalization of models of type (1.24) to more than two groups is rather straightforward; we would need one more column in X (one new dummy variable) for each new group. This leads to a simple one-way analysis of variance (ANOVA) model. Thus, a one-way ANOVA model with three treatments, each with two observations per treatment, can be written as

\[
y_{ij} = \mu + \beta_i + e_{ij}, \quad i = 1, 2, 3, \; j = 1, 2.
\tag{1.25}
\]

We can introduce three dummy variables $d_1$, $d_2$ and $d_3$ such that

\[
d_i = \begin{cases} 1 & \text{for group } i \\ 0 & \text{otherwise.} \end{cases}
\]

The model can now be written as

\[
y_{ij} = \mu + \beta_1 d_1 + \beta_2 d_2 + \beta_3 d_3 + e_{ij}
       = \mu + \beta_i d_i + e_{ij}, \quad i = 1, 2, 3, \; j = 1, 2.
\tag{1.26}
\]
Note that the third dummy variable $d_3$ is not needed. If we know the values of $d_1$ and $d_2$, the group membership is known, so $d_3$ is redundant and can be removed from the model. In fact, any combination of two of the dummy variables is sufficient for identifying group membership, so the choice of which one to delete is to some extent arbitrary. After removing $d_3$, the model can be written in matrix terms as

\[
\begin{pmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \\ y_{31} \\ y_{32} \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 0 \\
1 & 1 & 0 \\
1 & 0 & 1 \\
1 & 0 & 1 \\
1 & 0 & 0 \\
1 & 0 & 0
\end{pmatrix}
\begin{pmatrix} \mu \\ \beta_1 \\ \beta_2 \end{pmatrix}
+
\begin{pmatrix} e_{11} \\ e_{12} \\ e_{21} \\ e_{22} \\ e_{31} \\ e_{32} \end{pmatrix}.
\tag{1.27}
\]

Although there are three treatments we have only included two dummy variables for the treatments, i.e. we have chosen the restriction $\beta_3 = 0$.

Follow-up analyses

One of the results from a one-way ANOVA is an over-all F test of the hypothesis that all group (treatment) means are equal. If this test is significant, it can be followed up by various types of comparisons between the groups. Since the ANOVA provides an estimator $\hat{\sigma}_e^2 = MS_e$ of the residual variance $\sigma_e^2$, this estimator should be used in such group comparisons if the assumption of equal variance seems tenable. A pairwise comparison between two group means, i.e. a test of the hypothesis that two groups have equal mean values, can be obtained as

\[
t = \frac{\bar{y}_i - \bar{y}_{i'}}{\sqrt{MS_e \left( \dfrac{1}{n_i} + \dfrac{1}{n_{i'}} \right)}}
\]

with degrees of freedom taken from $MS_e$. A confidence interval for the difference between the mean values can be obtained analogously. In some cases it may be of interest to do comparisons which are not simple pairwise comparisons. For example, we may want to compare treatment 1 with the average of treatments 2, 3 and 4. We can then define a contrast in the treatment means as $L = \mu_1 - \frac{\mu_2 + \mu_3 + \mu_4}{3}$. A general way to write a contrast is

\[
L = \sum_i h_i \mu_i,
\tag{1.28}
\]
where we define the weights $h_i$ such that $\sum_i h_i = 0$. The contrast can be estimated as

\[
\hat{L} = \sum_i h_i \bar{y}_i,
\tag{1.29}
\]

and the estimated variance of $\hat{L}$ is

\[
\widehat{\mathrm{Var}}(\hat{L}) = MS_e \sum_i \frac{h_i^2}{n_i}.
\tag{1.30}
\]
This can be used for tests and confidence intervals on contrasts.
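As a worked illustration of (1.28) to (1.30), consider the contrast mentioned above between treatment 1 and the average of treatments 2, 3 and 4. The group sizes $n_i$ are left symbolic, and the last line is the usual t ratio, which we assume is referred to a t distribution with the degrees of freedom of $MS_e$, as for the pairwise comparison.

```latex
\begin{align*}
L &= \mu_1 - \tfrac{1}{3}(\mu_2 + \mu_3 + \mu_4),
  \qquad (h_1, h_2, h_3, h_4) = \left(1, -\tfrac{1}{3}, -\tfrac{1}{3}, -\tfrac{1}{3}\right),
  \qquad \sum_i h_i = 0, \\
\hat{L} &= \bar{y}_1 - \tfrac{1}{3}(\bar{y}_2 + \bar{y}_3 + \bar{y}_4), \\
\widehat{\mathrm{Var}}(\hat{L}) &= MS_e \left( \frac{1}{n_1} + \frac{1}{9 n_2} + \frac{1}{9 n_3} + \frac{1}{9 n_4} \right), \\
t &= \frac{\hat{L}}{\sqrt{\widehat{\mathrm{Var}}(\hat{L})}} .
\end{align*}
```

A pairwise comparison is the special case with weights $1$ and $-1$ for the two groups and $0$ for all others, which reproduces the denominator of the pairwise t statistic.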
Problems when the number of comparisons is large

After you have obtained a significant F test, there may be many pairwise comparisons or other contrasts to examine. For example, in a one-way ANOVA with seven treatments you can make 21 pairwise comparisons. If you make many tests at, say, the 5% level you may end up with a number of significant results even if all the null hypotheses are true. If you make 100 such tests you would expect, on the average, 5 significant results. Thus, even if the significance level of each individual test is 5% (the so-called comparisonwise error rate), the over-all significance level of all tests (the experimentwise error rate), i.e. the probability of getting at least one significant result given that all null hypotheses are true, is larger. This is the problem of mass significance.

There is some controversy whether mass significance is a real problem. For example, Nelder (1971) states “In my view, multiple comparison methods have no place at all in the interpretation of data”. However, other authors have suggested various methods to protect against mass significance. The general solution is to apply a stricter limit on what we should declare “significant”. If a single t test would be significant for |t| > 2.0, we could use the limit 2.5 or 3.0 instead. The SAS procedure GLM includes 16 different methods for deciding which limit to use. A simple but reasonably powerful method is to use Bonferroni adjustment. This means that each individual test is made at the significance level α/c, where α is the desired over-all level and c is the number of comparisons you want to make.

Example 1.4 Liss et al (1996) studied the effects of seven contrast media (used in X-ray investigations) on different physiological functions of 57 rats. One variable that was studied was the urine production. Table 1.3 shows the change in urine production of each rat before and after treatment with each medium. It is of interest to compare the contrast media with respect to the change in urine production. This analysis is a one-way ANOVA situation.

Table 1.3: Change in urine production following treatment with different contrast media (n = 57).

Medium          Diff    Medium          Diff    Medium          Diff
Diatrizoate    32.92    Isovist         2.44    Ringer          0.10
Diatrizoate    25.85    Isovist         0.87    Ringer          0.40
Diatrizoate    20.75    Isovist        −0.22    Mannitol        9.19
Diatrizoate    20.38    Isovist         1.52    Mannitol        0.79
Diatrizoate     7.06    Omnipaque       8.51    Mannitol       10.22
Hexabrix        6.47    Omnipaque      16.11    Mannitol        4.78
Hexabrix        5.63    Omnipaque       7.22    Mannitol       14.64
Hexabrix        3.08    Omnipaque       9.03    Mannitol        6.98
Hexabrix        0.96    Omnipaque      10.11    Mannitol        7.51
Hexabrix        2.37    Omnipaque       6.77    Mannitol        9.55
Hexabrix        7.00    Omnipaque       1.16    Mannitol        5.53
Hexabrix        4.88    Omnipaque      16.11    Ultravist      12.94
Hexabrix        1.11    Omnipaque       3.99    Ultravist       7.30
Hexabrix        4.14    Omnipaque       4.90    Ultravist      15.35
Isovist         2.10    Ringer          0.07    Ultravist       6.58
Isovist         0.77    Ringer         −0.03    Ultravist      15.68
Isovist        −0.04    Ringer          0.34    Ultravist       3.48
Isovist         4.80    Ringer          0.08    Ultravist       5.75
Isovist         2.74    Ringer          0.51    Ultravist      12.18

The procedure GLM in SAS produced the following result:
General Linear Models Procedure

Dependent Variable: DIFF

                            Sum of            Mean
Source            DF       Squares          Square    F Value    Pr > F
Model              6  1787.9722541     297.9953757      16.46    0.0001
Error             50   905.1155428      18.1023109
Corrected Total   56  2693.0877969

R-Square        C.V.     Root MSE    DIFF Mean
0.663912    61.95963    4.2546811    6.8668596

Source    DF     Type III SS     Mean Square    F Value    Pr > F
MEDIUM     6    1787.9722541     297.9953757      16.46    0.0001
There are clearly significant differences between the media (p < 0.0001). To find out more about the nature of these differences we requested Proc GLM to print estimates of the parameters, i.e. estimates of the coefficients $\beta_i$ for each of the dummy variables. The following results were obtained:

                                        T for H0:                 Std Error of
Parameter                Estimate     Parameter=0    Pr > |T|       Estimate
INTERCEPT              9.90787500 B          6.59      0.0001     1.50425691
MEDIUM  Diatrizoate   11.48412500 B          4.73      0.0001     2.42554139
        Hexabrix      -5.94731944 B         -2.88      0.0059     2.06740338
        Isovist       -8.24365278 B         -3.99      0.0002     2.06740338
        Mannitol      -2.21920833 B         -1.07      0.2882     2.06740338
        Omnipaque     -1.51817500 B         -0.75      0.4554     2.01817243
        Ringer        -9.69787500 B         -4.40      0.0001     2.20200665
        Ultravist      0.00000000 B           .         .          .

NOTE: The X'X matrix has been found to be singular and a generalized inverse was used to solve the normal equations. Estimates followed by the letter 'B' are biased, and are not unique estimators of the parameters.
Note that Proc GLM reports the X'X matrix to be singular. This is as expected for an ANOVA model: not all dummy variables can be included in the model. The procedure excludes the last dummy variable, setting the parameter for Ultravist to 0. All other estimates are comparisons of the estimated mean value for that medium with the mean value for Ultravist. Least squares estimates of the mean values for the media can be calculated and compared. Since this can result in a large number of pairwise comparisons (in this case, 7 · 6/2 = 21 comparisons), some method for protection against mass significance might be considered. The least squares means are given in Table 1.4 along with indications of significant pairwise differences using Bonferroni adjustment.

Table 1.4: Least squares means, and pairwise comparisons between treatments, for the contrast media experiment.

                Mean   Diatrizoate  Ultravist  Omnipaque  Mannitol  Hexabrix  Isovist  Ringer
Diatrizoate    21.39        −
Ultravist       9.91        *           −
Omnipaque       8.39        *         n.s.         −
Mannitol        7.69        *         n.s.       n.s.        −
Hexabrix        3.96        *         n.s.       n.s.      n.s.        −
Isovist         1.66        *           *          *       n.s.      n.s.       −
Ringer          0.21        *           *          *         *       n.s.     n.s.      −

(* = significant pairwise difference; n.s. = not significant.)

Before we close this example, we should take a look at how the data behave. For example, we can prepare a boxplot of the distributions for the different media. This boxplot is given in Figure 1.1. The plot indicates that the variation is quite different for the different media, with a large variation for Diatrizoate and a small variation for Ringer (which is actually a placebo). This suggests that one assumption underlying the analysis, the assumption of equal variance, may be violated. We will return to these data later to see if we can make a better analysis. ¤
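The main numbers of Example 1.4 can be retraced from Table 1.3. The sketch below is an illustration in plain Python, not the SAS computation; all names are our own. It computes the group means, the sums of squares decomposition for the one-way layout, and the Bonferroni-adjusted level for the 21 pairwise comparisons. Because Table 1.3 is rounded to two decimals, small deviations from the SAS listing are to be expected.

```python
# Change in urine production from Table 1.3, grouped by contrast medium.
data = {
    "Diatrizoate": [32.92, 25.85, 20.75, 20.38, 7.06],
    "Hexabrix": [6.47, 5.63, 3.08, 0.96, 2.37, 7.00, 4.88, 1.11, 4.14],
    "Isovist": [2.10, 0.77, -0.04, 4.80, 2.74, 2.44, 0.87, -0.22, 1.52],
    "Omnipaque": [8.51, 16.11, 7.22, 9.03, 10.11, 6.77, 1.16, 16.11, 3.99, 4.90],
    "Ringer": [0.07, -0.03, 0.34, 0.08, 0.51, 0.10, 0.40],
    "Mannitol": [9.19, 0.79, 10.22, 4.78, 14.64, 6.98, 7.51, 9.55, 5.53],
    "Ultravist": [12.94, 7.30, 15.35, 6.58, 15.68, 3.48, 5.75, 12.18],
}

n_total = sum(len(v) for v in data.values())
grand_mean = sum(sum(v) for v in data.values()) / n_total
means = {g: sum(v) / len(v) for g, v in data.items()}

# Sums of squares decomposition for the one-way layout.
ss_model = sum(len(v) * (means[g] - grand_mean) ** 2 for g, v in data.items())
ss_error = sum((x - means[g]) ** 2 for g, v in data.items() for x in v)
df_model = len(data) - 1
df_error = n_total - len(data)
ms_error = ss_error / df_error
f_value = (ss_model / df_model) / ms_error

# Bonferroni: each of the c pairwise tests is made at level alpha / c.
alpha = 0.05
c = len(data) * (len(data) - 1) // 2
alpha_per_test = alpha / c

for g in sorted(means, key=means.get, reverse=True):
    print(f"{g:12s} n = {len(data[g]):2d}  mean = {means[g]:6.2f}")
print(f"F({df_model}, {df_error}) = {f_value:.2f}, MSe = {ms_error:.2f}")
print(f"{c} comparisons, per-test level = {alpha_per_test:.5f}")
```

With α = 0.05 and c = 21, a pairwise difference is declared significant only if its p value is below about 0.0024. This is why, for example, the comparison between Hexabrix and Ultravist (p = 0.0059 in the parameter table) is marked n.s. in Table 1.4.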
1.8.5 ANOVA: Factorial experiments
The ideas used above can be extended to factorial experiments that include more than one factor and possible interactions. The dummy variables that correspond to the interaction terms would then be constructed by multiplying the corresponding main effect dummy variables with each other. This feature can be illustrated by considering a factorial experiment with factor A (two levels) and factor B (three levels), and where we have two observations for each factor combination. The model is

\[
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + e_{ijk}, \quad i = 1, 2, \; j = 1, 2, 3, \; k = 1, 2.
\tag{1.31}
\]
The number of dummy variables that we have included for each factor is equal to the number of factor levels minus one, i.e. the last dummy variable for each factor has been excluded. The number of non-redundant dummy variables equals the number of degrees of freedom for the effect. In matrix terms,
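\[
\begin{pmatrix}
y_{111} \\ y_{112} \\ y_{121} \\ y_{122} \\ y_{131} \\ y_{132} \\
y_{211} \\ y_{212} \\ y_{221} \\ y_{222} \\ y_{231} \\ y_{232}
\end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 1 & 0 & 1 & 0 \\
1 & 1 & 1 & 0 & 1 & 0 \\
1 & 1 & 0 & 1 & 0 & 1 \\
1 & 1 & 0 & 1 & 0 & 1 \\
1 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
\begin{pmatrix}
\mu \\ \alpha_1 \\ \beta_1 \\ \beta_2 \\ (\alpha\beta)_{11} \\ (\alpha\beta)_{12}
\end{pmatrix}
+
\begin{pmatrix}
e_{111} \\ e_{112} \\ e_{121} \\ e_{122} \\ e_{131} \\ e_{132} \\
e_{211} \\ e_{212} \\ e_{221} \\ e_{222} \\ e_{231} \\ e_{232}
\end{pmatrix}.
\tag{1.32}
\]

Figure 1.1: Boxplot of change in urine production for different contrast media. (Vertical axis: Diff; horizontal axis: Medium.)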
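The construction of the design matrix in (1.32) can be sketched as follows. The code is only an illustration of the rule that interaction dummies are products of main effect dummies; the variable names (a1, b1, b2) are our own.

```python
from itertools import product

levels_a, levels_b, replicates = 2, 3, 2

rows = []
for i, j, k in product(range(1, levels_a + 1),
                       range(1, levels_b + 1),
                       range(1, replicates + 1)):
    a1 = 1 if i == 1 else 0          # dummy for factor A, level 1
    b1 = 1 if j == 1 else 0          # dummy for factor B, level 1
    b2 = 1 if j == 2 else 0          # dummy for factor B, level 2
    # Interaction dummies are products of the main effect dummies.
    rows.append((f"y{i}{j}{k}", [1, a1, b1, b2, a1 * b1, a1 * b2]))

print("obs    mu a1 b1 b2 ab11 ab12")
for label, row in rows:
    print(label, "  ", "  ".join(str(v) for v in row))
```

The last level of each factor has no dummy variable of its own, so the six columns correspond to 1 + 1 + 2 + 2 degrees of freedom for the mean, A, B and the interaction.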
Example 1.5 Lindahl et al (1999) studied certain reactions of fungus myceliae on pieces of wood by using radioactively labeled $^{32}$P. In one of the experiments, two species of fungus (Paxillus involutus and Suillus variegatus) were used, along with two sizes of wood pieces (Large and Small); the response was a certain chemical measurement denoted by C. The data are reproduced in Table 1.5.

Table 1.5: Data for a two-factor experiment.

Species    Size          C    Species    Size          C
H          Large    0.0010    S          Large    0.0021
H          Large    0.0011    S          Large    0.0001
H          Large    0.0017    S          Large    0.0016
H          Large    0.0008    S          Large    0.0046
H          Large    0.0010    S          Large    0.0035
H          Large    0.0028    S          Large    0.0065
H          Large    0.0003    S          Large    0.0073
H          Large    0.0013    S          Large    0.0039
H          Small    0.0061    S          Small    0.0007
H          Small    0.0010    S          Small    0.0011
H          Small    0.0020    S          Small    0.0019
H          Small    0.0018    S          Small    0.0022
H          Small    0.0033    S          Small    0.0011
H          Small    0.0015    S          Small    0.0012
H          Small    0.0040    S          Small    0.0009
H          Small    0.0041    S          Small    0.0040

These data were analyzed as a factorial experiment with two factors. Part of the Minitab output was:

General Linear Model: C versus Species; Size

Analysis of Variance for C, using Adjusted SS for Tests

Source          DF       Seq SS       Adj SS       Adj MS        F        P
Species          1    0.0000025    0.0000025    0.0000025     0.93    0.342
Size             1    0.0000002    0.0000002    0.0000002     0.09    0.772
Species*Size     1    0.0000287    0.0000287    0.0000287    10.82    0.003
Error           28    0.0000742    0.0000742    0.0000027
Total           31    0.0001056
The main conclusion from this analysis is that the interaction Species × Size is highly significant. This means that the effect of Size is different for different species. In such cases, interpretation of the main effects is not very meaningful. As a tool for interpreting the interaction effect, a so-called interaction plot can be prepared. Such a plot for these data is given in Figure 1.2. The mean value of the response for species S is higher for large wood pieces than for small wood pieces. For species H the opposite is true: the mean value is larger for small wood pieces. This is an example of an interaction. ¤
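The pattern described above can be read directly from the cell means. The following sketch (plain Python, names our own) computes the four cell means from Table 1.5, the effect of Size within each species, and the interaction sum of squares, using the fact that in a balanced 2 × 2 design with n observations per cell the interaction has a single degree of freedom.

```python
# Data from Table 1.5: response C by Species (H, S) and Size (Large, Small).
cells = {
    ("H", "Large"): [0.0010, 0.0011, 0.0017, 0.0008, 0.0010, 0.0028, 0.0003, 0.0013],
    ("H", "Small"): [0.0061, 0.0010, 0.0020, 0.0018, 0.0033, 0.0015, 0.0040, 0.0041],
    ("S", "Large"): [0.0021, 0.0001, 0.0016, 0.0046, 0.0035, 0.0065, 0.0073, 0.0039],
    ("S", "Small"): [0.0007, 0.0011, 0.0019, 0.0022, 0.0011, 0.0012, 0.0009, 0.0040],
}
means = {cell: sum(v) / len(v) for cell, v in cells.items()}

# Effect of Size (Large minus Small) within each species.
size_effect = {s: means[(s, "Large")] - means[(s, "Small")] for s in ("H", "S")}

# Interaction contrast: difference between the two Size effects.
interaction = size_effect["H"] - size_effect["S"]

# In a balanced 2 x 2 design with n observations per cell, the interaction
# sum of squares is n * contrast^2 / 4.
n = len(cells[("H", "Large")])
ss_interaction = n * interaction ** 2 / 4

for cell, m in means.items():
    print(cell, f"{m:.6f}")
print("Size effect per species:", {s: round(e, 6) for s, e in size_effect.items()})
print(f"Interaction SS = {ss_interaction:.7f}")
```

The two Size effects have opposite signs, which is what the crossing lines of an interaction plot show, and the interaction sum of squares can be compared with the Species*Size row of the Minitab output.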
Figure 1.2: Interaction plot for the 2-factor experiment. (Data means for C plotted against Size, Large and Small, with one line for each Species, H and S.)
1.8.6 Analysis of covariance
In regression analysis models the design matrix X contains quantitative variables. In ANOVA models, the design matrix only contains dummy variables corresponding to treatments, design structure and possible interactions. It is quite possible to include a mixture of quantitative variables and dummy variables in the design matrix. Such models are called covariance analysis, or ANCOVA, models. Let us look at a simple case where there are two groups and one covariate. Several different models can be considered for the analysis of such data even in the simple case where we assume that all relationships are linear:
1. There is no relationship between x and y in any of the groups and the groups have the same mean value.
2. There is a relationship between x and y; the relationship is the same in the groups.
3. There is no relationship between x and y but the groups have different levels.
4. There is a relationship between x and y; the lines are parallel but at different levels.
5. There is a relationship between x and y; the lines are different in the groups.
These five cases correspond to different models that can be represented in graphs or in formulas:
Model 1: $y_{ij} = \mu + e_{ij}$
Model 2: $y_{ij} = \mu + \beta x + e_{ij}$
Model 3: $y_{ij} = \mu + \alpha_i + e_{ij}$
Model 4: $y_{ij} = \mu + \alpha_i + \beta x + e_{ij}$
Model 5: $y_{ij} = \mu + \alpha_i + \beta x + \gamma \cdot d_i \cdot x + e_{ij}$
[Graphs: one panel per model.]
Model 5 is the most general of the models, allowing for different intercepts ($\mu + \alpha_i$) and different slopes ($\beta + \gamma d_i$), where $d_i$ is a dummy variable indicating group membership. If it can be assumed that the term $\gamma d_i$ is zero for all i, then we are back at model 4. If, in addition, all $\alpha_i$ are zero, then model 2 is correct. If, on the other hand, $\beta$ is zero in model 4, we would use model 3. If finally $\beta$ is zero in model 2, then model 1 describes the situation. This is an example of a set of models where some of the models are nested within other models. The model choice can be made by comparing any model to a simpler model which only differs in terms of one factor.
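A comparison between two nested models is usually based on the reduction in the residual sum of squares. The sketch below shows the arithmetic of such an F test; the function nested_f is our own helper, not part of any package. Since no covariance analysis data have been analyzed so far, the illustration re-uses the sums of squares of Example 1.3, where the comparison is between model 1 (common mean) and model 3 (separate group means).

```python
def nested_f(sse_reduced, df_reduced, sse_full, df_full):
    """F statistic for comparing a reduced model with a fuller, nesting model.

    sse_* are residual sums of squares, df_* the residual degrees of freedom.
    """
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator, denominator, numerator / denominator


# Illustration with the Dopamine data of Example 1.3, which has no covariate:
# model 1 (common mean) leaves the total SS 3.5500 on 11 df as residual,
# model 3 (separate group means) leaves 2.0084 on 10 df.
ms_extra, ms_error, f = nested_f(3.5500, 11, 2.0084, 10)
print(f"F(1, 10) = {f:.2f}")
```

The same function applies to any pair of the five models where one is nested within the other, for example model 4 against model 5 to test whether the slopes differ.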
1.8.7 Non-linear models
Models can be non-linear in different ways. A model can contain non-linear functions of the parameters, like $y = \beta_0 + \beta_1 e^{\beta_2 x} + e$. We will not consider such models, which are called intrinsically nonlinear, or nonlinear in the parameters. Some models can be transformed into a linear form by a suitable choice of transformation. For example, the model $y = e^{\beta_0 + \beta_1 x}$ can be made linear by using a log transformation: $\log(y) = \beta_0 + \beta_1 x$. Other models can be linear in the parameters, but nonlinear in the variables, like

\[
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 e^{x_i} + e_i.
\tag{1.33}
\]

Such models are simple to analyze using general linear models. Formally, each transformation of x is treated as a new variable. Thus, if we denote $u_i = x_i^2$ and $v_i = e^{x_i}$, then the model (1.33) can be written as

\[
y_i = \beta_0 + \beta_1 x_i + \beta_2 u_i + \beta_3 v_i + e_i
\tag{1.34}
\]
which is a standard multiple regression model. Models of this type can be handled using standard GLM software.
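In practice, treating each transformation as a new variable just means adding columns to the data set before the model is fitted. A minimal sketch, using hypothetical x values chosen only for illustration:

```python
import math

# Hypothetical x values, chosen only for illustration.
x = [0.0, 0.5, 1.0, 1.5, 2.0]

# Each transformation of x becomes a new column: u = x^2 and v = exp(x).
u = [xi ** 2 for xi in x]
v = [math.exp(xi) for xi in x]

# Rows of the design matrix for model (1.34): intercept, x, u, v.
design = [[1.0, xi, ui, vi] for xi, ui, vi in zip(x, u, v)]

for row in design:
    print("  ".join(f"{value:7.3f}" for value in row))
```

The model remains linear in the parameters, so the ordinary least squares machinery applies unchanged to the columns x, u and v.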
1.9 Estimability
In some types of general linear models it is impossible to estimate all model parameters. It is then necessary to restrict some parameters to be zero, or to use some other restriction on the parameters. As an example, a two-factor ANOVA model with two levels of factor A, three levels of factor B and two replications can be written as

\[
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + e_{ijk},
\tag{1.35}
\]
\[
i = 1, 2, \quad j = 1, 2, 3, \quad k = 1, 2.
\tag{1.36}
\]
In this model it would be possible to replace $\mu$ with $\mu + c$ and to replace each $\alpha_i$ with $\alpha_i - c$, where c is some constant, without changing the expected value of any observation. The same kind of ambiguity holds also for other parameters of the model. This model contains a total of 12 parameters: $\mu$, $\alpha_1$, $\alpha_2$, $\beta_1$, $\beta_2$, $\beta_3$, $(\alpha\beta)_{11}$, $(\alpha\beta)_{12}$, $(\alpha\beta)_{13}$, $(\alpha\beta)_{21}$, $(\alpha\beta)_{22}$, and $(\alpha\beta)_{23}$, but only 6 of the parameters can be estimated. As noted above, computer programs often solve this problem by restricting some parameters to be zero. However, it may be possible to estimate certain functions of the parameters in a unique way. Such functions, if they exist, are called estimable functions. A linear combination of model parameters is estimable if it can be written as a linear combination of expected values of the observations.

Let us denote with $\mu_{ij\cdot}$ the mean value for the treatment combination that has factor A at level i and factor B at level j. It holds that

\[
\mu_{ij\cdot} = E(y_{ijk}) = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}
\tag{1.37}
\]

which is a linear function of the parameters. This function is estimable. In addition, any linear function of the $\mu_{ij\cdot}$:s is also estimable. For example, the expected value of all observations with factor A at level i can be written as

\[
\mu_{i\cdot\cdot} = \frac{\mu_{i1\cdot} + \mu_{i2\cdot} + \mu_{i3\cdot}}{3}.
\]

This is a linear function of cell means. Since the cell means are estimable, $\mu_{i\cdot\cdot}$ is also estimable.
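The statement that only 6 of the 12 parameters can be estimated corresponds to the rank of the design matrix. The sketch below builds the full design matrix of model (1.35), with one column per parameter and no dummy variable removed, and computes its rank by Gaussian elimination in exact arithmetic. The function rank is our own helper, written for illustration only.

```python
from fractions import Fraction
from itertools import product


def rank(matrix):
    """Rank by Gaussian elimination in exact rational arithmetic."""
    m = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


# One column per parameter of model (1.35): mu, alpha_1..2, beta_1..3 and the
# six (alpha beta)_ij, with no dummy variable removed.
X = []
for i, j, k in product(range(2), range(3), range(2)):
    alpha = [int(i == a) for a in range(2)]
    beta = [int(j == b) for b in range(3)]
    ab = [int((i, j) == (a, b)) for a in range(2) for b in range(3)]
    X.append([1] + alpha + beta + ab)

print("columns (parameters):", len(X[0]))
print("rank (estimable dimensions):", rank(X))
```

The rank equals the number of treatment combinations, which agrees with the observation that the six cell means $\mu_{ij\cdot}$ are estimable while the twelve individual parameters are not.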
1.10 Assumptions in General linear models
The classical application of general linear models rests on the following set of assumptions:
• The model used for the analysis is assumed to be correct.
• The residuals are assumed to be independent.
• The residuals are assumed to follow a Normal distribution.
• The residuals are assumed to have the same variance $\sigma_e^2$, independent of X, i.e. the residuals are homoscedastic.
Different diagnostic tools have been developed to detect departures from these assumptions. Since similar tools are used for generalized linear models, reference is made to Chapter 3 for details.
1.11 Model building

1.11.1 Computer software for GLM:s
There are many options for fitting general linear models to data. One option is to use a regression package and leave it to the user to construct appropriate dummy variables for class variables. However, most statistical packages have routines for general linear models that automatically construct the appropriate set of dummy variables.
Let us use letters at the end of the alphabet (X, Y, Z) to denote numeric variables. Y will be used for the dependent variable. Letters at the beginning of the alphabet (A, B) will symbolize class variables (groups, treatments, blocks, etc.). Computer software requires the user to state the model in symbolic terms. The model statement contains operators that specify different aspects of the model. In the following table we list the operators used by SAS. Examples of the use of the operators are given below.

Operator    Explanation, SAS example
*           Interaction: A*B. Also used for polynomials: X*X
(none)      Both effects present: A B
|           All main effects and interactions: A|B = A B A*B
()          Nested factor: A(B). “A nested within B”
@           Order operator: A|B|C @ 2 means that all main effects and all interactions up to and including second order interactions are included.
The kinds of models that we have discussed in this chapter can symbolically be written in SAS language as indicated in the following table.
Model                                Computer model (SAS)
Simple linear regression             Y = X
Multiple regression                  Y = X Z
t tests, oneway ANOVA                Y = A
Two-way ANOVA with interaction       Y = A B A*B  or  Y = A|B
Covariance analysis model 1          Y =
Covariance analysis model 2          Y = X
Covariance analysis model 3          Y = A
Covariance analysis model 4          Y = A X
Covariance analysis model 5          Y = A X A*X

1.11.2 Model building strategy
Statistical model building is an art as much as it is a science. There are many requirements on models: they should make sense from a subject-matter point of view, they should be simple, and at the same time they should capture most of the information in the data. A good model is a compromise between parsimony and completeness. This means that it is impossible to state simple rules for model building: there will certainly be cases where the rules are not relevant. However, the following suggestions, partially based on McCullagh and Nelder (1989, p. 89), are useful in many cases:
• Include all relevant main effects in the model, even those that are not significant.
• If an interaction is included, the model should also include all main effects and interactions it comprises. For example, if the interaction A*B*C is included, the model should also include A, B, C, A*B, A*C and B*C.
• A model that contains polynomial terms of type $x^a$ should also contain the lower-degree terms $x, x^2, \ldots, x^{a-1}$.
• Covariates that do not have any detectable effect should be excluded.
• The conventional 5% significance level is often too strict for model building purposes. A significance level in the range 15-25% may be used instead.
• Alternatively, criteria like the Akaike information criterion can be used. This is discussed on page 46 in connection with generalized linear models.
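The second suggestion in the list above is mechanical enough to be automated. As a small illustration, the following sketch lists the lower-order terms that an interaction written in SAS notation comprises; the function name implied_terms is our own.

```python
from itertools import combinations


def implied_terms(interaction):
    """Lower-order terms that should accompany an interaction such as 'A*B*C'."""
    factors = interaction.split("*")
    terms = []
    for order in range(1, len(factors)):
        for combo in combinations(factors, order):
            terms.append("*".join(combo))
    return terms


print(implied_terms("A*B*C"))   # ['A', 'B', 'C', 'A*B', 'A*C', 'B*C']
```

In SAS the same set of terms, together with the interaction itself, is what the bar operator A|B|C expands to.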
1.11.3 A few SAS examples
In SAS terms, grouping variables (classification variables) are called CLASS variables. As examples of SAS programs for a few of the models discussed above we can consider the regression model (1.1) using Proc GLM. The analysis could be done with a program that does not include any CLASS variables:

PROC GLM DATA=Regression;
   MODEL y = x;        /* x is numeric, so no CLASS statement is needed */
RUN;

The t test (or the oneway ANOVA) can be modelled as

PROC GLM DATA=Anova;
   CLASS group;        /* group is a classification variable */
   MODEL y = group;
RUN;

The difference between the two programs is that in the t test, the independent variable (“group”) is given as a CLASS variable. This asks SAS to build appropriate dummy variables.
1.12 Exercises
Exercise 1.1 Cicirelli et al (1983) studied protein synthesis in developing egg cells of the frog Xenopus laevis. Radioactively labeled leucine was injected into egg cells. At various times after injection, radioactivity measurements were made. From these measurements it was possible to calculate how much of the leucine had been incorporated into protein. The following data, quoted from Samuels and Witmer (1999), are mean values of two egg cells. All egg cells were taken from the same female.

Time    Leucine (ng)
   0            0.02
  10            0.25
  20            0.54
  30            0.69
  40            1.07
  50            1.50
  60            1.74

A. Use linear regression to estimate the rate of incorporation of the labeled leucine.
B. Plot the data and the regression line.
C. Prepare an ANOVA table.

Exercise 1.2 The level of cortisol has been measured for three groups of patients with different syndromes: a) adenoma b) bilateral hyperplasia c) carcinoma. The results are summarized in the following table:

Group    Cortisol level
a        3.1  3.0  1.9  3.8  4.1  1.9
b        8.3  3.8  3.9  7.8  9.1  15.4  7.7  6.5  5.7  13.6
c        10.2  9.2  9.6  53.8  15.8

A. Make an analysis of these data that can answer the question whether there are any differences in cortisol level between the groups. A complete solution should contain hypotheses, calculations, test statistic, and a conclusion. A graphical display (for example a boxplot) may help in the interpretation of the results.
B. There are some indications that the assumptions underlying the analysis in A. are not fulfilled. Examine this, indicate what the problems are, and suggest what can be done to improve the analysis. No new ANOVA is needed.

Exercise 1.3 Below are some data on the emission of carbon dioxide from the root system of plants (Zagal et al, 1993). Two levels of nitrogen were used, and samples of plants were analyzed 24, 30, 35 and 38 days after germination. The data were as follows:

Level of         Days from germination
Nitrogen         24        30        35        38
High          8.220    19.296    25.479    31.186
             12.594    31.115    34.951    39.237
             11.301    18.891    20.688    21.403
Low          15.255    28.200    32.862    41.677
             11.069    26.765    34.730    43.448
             10.481    28.414    35.830    45.351

A. Analyze the data in a way that treats Days from germination as a quantitative factor. Treat level of nitrogen as a dummy variable, and assume that all regressions are linear.
i) Fit a model that assumes that the two regression lines are parallel.
ii) Fit a model that does not assume that the regression lines are parallel.
iii) Test the hypothesis that the regressions are parallel.
B. What is the expected rate of CO2 emission for a plant with a high level of nitrogen, 35 days after germination? The same question for a plant with a low level of nitrogen? Use the model you consider the best of the models you have fitted under A. and B. above. Make the calculation by hand, using the computer printouts of model equations.
C. Graph the data. Include both the observed data and the fitted Y values in your graph.
D. According to your best analysis above, is there any significant effect of:
i) Interaction
ii) Level of nitrogen
iii) Days from germination
Exercise 1.4 Gowen and Price, quoted from Snedecor and Cochran (1980), counted the number of lesions of Aucuba mosaic virus after exposure to X-rays for various times. The results were:

Exposure    Count
       0      271
      15      108
      30       59
      45       29
      60       12

It was assumed that the Count (y) depends on the exposure time (x) through an exponential relation of type $y = Ae^{-Bx}$. A convenient way to estimate the parameters of such a function is to make a linear regression of log(y) on x.
A. Perform a linear regression of log(y) on x.
B. What assumptions are made regarding the residuals in your analysis in A.?
C. Plot the data and the fitted function in the same graph.
2. Generalized Linear Models

2.1 Introduction
In Chapter 1 we briefly summarized the theory of general linear models (GLM:s). GLM:s are very useful for data analysis. However, GLM:s are limited in many ways. Formally, the classical applications of GLM:s rest on the assumptions of normality, linearity and homoscedasticity. The generalization of GLM:s that we will present in this chapter will allow us to model our data using other distributions than the Normal. The choice of distribution affects the assumptions we make regarding variances, since the relation between the variance and the mean is known for many distributions. For example, the Poisson distribution has the property that $\mu = \sigma^2$.

This chapter is the most theoretical chapter in the book. It builds on the theory of Maximum Likelihood estimation (see Appendix B), and on the class of distributions called the exponential family. In later chapters we will apply the theory in different situations.
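The mean-variance relation of the Poisson distribution mentioned above is easy to verify numerically. The sketch below sums over the probability function for one hypothetical value of the mean; the value 3.5 and the truncation point are our own choices for illustration.

```python
import math


def poisson_pmf(k, mu):
    """P(Y = k) for a Poisson distribution with mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


mu = 3.5                       # hypothetical mean, chosen for illustration
support = range(0, 100)        # the tail beyond 100 is negligible for this mu

mean = sum(k * poisson_pmf(k, mu) for k in support)
variance = sum((k - mean) ** 2 * poisson_pmf(k, mu) for k in support)

print(f"mean = {mean:.6f}, variance = {variance:.6f}")
```

Choosing a Poisson model therefore implies an assumption that the variance increases with the mean, in contrast to the constant variance assumed in general linear models.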
2.1.1 Types of response variables
This book is concerned with statistical models for data. In these models, the concept of a response variable is crucial. In general linear models, the response variable Y is often assumed to be quantitative and normally distributed. But this is by no means the only type of response variables that we might meet in practice. Some examples of different types of response variables are:
• Continuous response variables.
• Binary response variables.
• Response variables in the form of proportions.
• Response variables in the form of counts.
• Response in the form of rates.
• Ordinal response.
We will here give a few examples of these types of response variables.
2.1.2 Continuous response
Models where the response variable is considered to be continuous are common in many application areas. In fact, since measurements cannot be made to infinite precision, few response variables are truly continuous, but continuous models are still often used as approximations. Many response variables of this type are modeled as general linear models, often assuming normality and homoscedasticity. It is common for response variables to be restricted to positive values. Physical measurements in cm or kg are examples of this. Since the Normal distribution is defined on (−∞, ∞), the normality assumption cannot hold exactly for such data, and one has to resort to approximations. We may illustrate the concept of continuous response using data of a type often used in general linear models; other examples will be discussed in later chapters.

Example 2.1 In the pharmacological study discussed in Example 1.3 the concentration of Dopamine was measured in the brains of six control rats and of six rats that had been exposed to toluene. The results were given on page 13. In this example the response variable may be regarded as essentially continuous. ¤
2.1.3 Response as a binary variable
Binary response, often called quantal response in earlier literature, is the result of measurements where it has only been recorded whether an event has occurred (Y = 1) or not (Y = 0). A common approach to modeling this type of data is to model the probability that the event will occur. Since a probability p is limited by 0 ≤ p ≤ 1, models for the data should use this restriction. Binary data are often modeled using the Bernoulli distribution, which is a special case of the Binomial distribution where n = 1. The binomial distribution is further discussed on page 88.

Example 2.2 Collett (1991), quoting data from Brown (1980), reports some data on the treatment of prostatic cancer. The issue of concern was to find indicators of whether the cancer had spread to the surrounding lymph nodes. Surgery is needed to ascertain the extent of nodal involvement. Some variables that can be measured without surgery may be indicators of nodal involvement. Thus, one purpose of the modeling is to formulate a model that can predict whether or not the lymph nodes have been affected. The data are of the type given in the following table. Only a portion of the data is listed; the actual data set contained 53 patients.

    Age   Acid level   X-ray result   Tumour size   Tumour grade   Nodal involvement
     66      0.48           0              0              0                0
     65      0.46           1              0              0                0
     61      0.50           0              1              0                0
     58      0.48           1              1              0                1
     65      0.84           1              1              1                1
    ...      ...           ...            ...            ...              ...
In this type of data, the response Y has value 1 if nodal involvement has occurred and 0 otherwise. This is called a binary response. Some of the independent variables (X-ray result, Tumour size and Tumour grade) are also binary variables, taking on only the values 0 or 1. These data will be analyzed in Chapter 5. ¤
2.1.4 Response as a proportion
Response in the form of proportions (binomial response) is obtained when a group of n individuals is exposed to the same conditions. f out of the n individuals respond in one way (Y = 1) while the remaining n − f individuals respond in some other way (Y = 0). The response is the proportion $\hat{p} = f/n$. The response of the individuals might be to improve from a certain medical treatment; to die from a specified dose of an insecticide; or for a piece of equipment to fail. A proportion corresponds to a probability, and modeling of the response probability is an important part of the data analysis. In such models the fact that 0 ≤ p ≤ 1 should be allowed to influence the choice of model. Binary response is a special case of binomial response with n = 1.

Example 2.3 Finney (1947) reported on an experiment on the effect of Rotenone, in different concentrations, when sprayed on the insect Macrosiphoniella sanborni, in batches of about fifty. The results are given in the following table.

    Conc   Log(Conc)   No. of insects   No. affected   % affected
    10.2     1.01            50              44             88
     7.7     0.89            49              42             86
     5.1     0.71            46              24             52
     3.8     0.58            48              16             33
     2.6     0.41            50               6             12
One aim with this experiment was to find a model for the relation between the probability p that an insect is affected and the dose, i.e. the concentration. Such a model can be written, in general terms, as g(p) = f(Concentration). The functions g and f should be chosen such that the model cannot produce a predicted probability that is smaller than 0 or larger than 1. These data will be discussed later on page 89. ¤
2.1.5 Response as a count
Counts are measurements where the response indicates how many times a specific event has occurred. Counts are often recorded in the form of frequency tables or crosstabulations. Count data are restricted to integers ≥ 0. Models for counts should take this limitation into account.

Example 2.4 Sokal and Rohlf (1973) reported some data on the color of Tiger beetles (Cicindela fulgida) collected during different seasons. The results are:

    Season          Red   Other   Total
    Early spring     29      11      40
    Late spring     273     191     464
    Early summer      8      31      39
    Late summer      64      64     128
    Total           374     297     671

The data may be used to study how the color of the beetle depends on season. A common approach is to test whether there is independence between season and color through a χ² test. We will return to the analysis of these data later (page 117). ¤
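The χ² test of independence mentioned in Example 2.4 can be sketched as follows. This is only an illustration of the classical Pearson statistic, with expected counts (row total × column total)/(grand total); the function name `chi_square_independence` is hypothetical, and the book's own analysis of these data comes later.

```python
# Tiger beetle counts from Example 2.4: (Red, Other) for each season.
observed = {
    "Early spring": (29, 11),
    "Late spring": (273, 191),
    "Early summer": (8, 31),
    "Late summer": (64, 64),
}


def chi_square_independence(table):
    """Return (Pearson chi-square, degrees of freedom) for an r x c table."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for r, row in enumerate(rows):
        for c, obs in enumerate(row):
            # Expected count under independence of season and color.
            expected = row_totals[r] * col_totals[c] / grand_total
            statistic += (obs - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return statistic, df


chi2, df = chi_square_independence(observed)
print(f"chi-square = {chi2:.2f} on {df} df")
```

The statistic is referred to a χ² distribution with (rows − 1)(columns − 1) degrees of freedom.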
2.1.6 Response as a rate
In some cases, the response can be assumed to be proportional to the size of the object being measured. For example, the number of birds of a certain species that have been sighted may depend on the area of the habitat that has been surveyed. In this case the response may be measured as “number of sightings per km²”, which we will call a rate. In the analysis of data of this type, one has to account for differences in size between objects.

Example 2.5 The data below, quoted from Agresti (1996), are accident rates for elderly drivers, subdivided by sex. For each sex the number of person years (in thousands) is also given. The data refer to 16262 Medicaid enrollees.

                                   Females   Males
    No. of accidents                 175      320
    No. of person years (’000)      17.3     21.4
Accident data can often be modeled using the Poisson distribution. In this case, we have to account for the fact that males and females have different observation periods, in terms of number of person years. Accident rate can be measured as (no. of accidents)/(no. of person years). In a later chapter (page 131), we will discuss how this type of data can be modelled. ¤
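The rate defined above, (no. of accidents)/(no. of person years), can be computed directly from the table in Example 2.5. The sketch below only does this arithmetic; the variable names are illustrative, and it is not the Poisson model the text refers to.

```python
# Accident counts and person years (in thousands) from Example 2.5.
accidents = {"Females": 175, "Males": 320}
person_years = {"Females": 17.3, "Males": 21.4}

# Crude rate = events / exposure, here accidents per 1000 person years.
rates = {sex: accidents[sex] / person_years[sex] for sex in accidents}
rate_ratio = rates["Males"] / rates["Females"]

for sex, rate in rates.items():
    print(f"{sex}: {rate:.2f} accidents per 1000 person years")
print(f"Male/female rate ratio: {rate_ratio:.2f}")
```

Dividing by the exposure is what makes the two groups comparable despite their different observation periods.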
2.1.7 Ordinal response
Response variables are sometimes measured on an ordinal scale, i.e. on a scale where the categories are ordered but where the distance between scale steps is not constant. Examples of such variables are ratings of patients; answers to attitude items; and school marks.

Example 2.6 Norton and Dunn (1985) studied the relation between snoring and heart problems for a sample of 2484 patients. The data were obtained through interviews with the patients. The amount of snoring was assessed on a scale ranging from “Never” to “Always”, which is an ordinal variable. An interesting question is whether there is any relation between snoring and heart problems. The data are:

                                   Snoring
    Heart problems   Never   Sometimes   Often   Always   Total
    Yes                 24          35      21       30     110
    No                1355         603     192      224    2374
    Total             1379         638     213      254    2484
The main interest lies in studying possible dependence between snoring and heart problems. Analysis of ordinal data is discussed in Chapter 7. ¤
2.2 Generalized linear models
Generalized linear models provide a unified approach to modelling of all the types of response variables we have met in the examples above. In this section we will summarize the theory of generalized linear models. In later sections we will return to the examples and see how the theory can be applied in specific cases.

Let us return to the general linear model (1.3):

$$ y = X\beta + e \tag{2.1} $$

Let us denote

$$ \eta = X\beta \tag{2.2} $$

as the linear predictor part of the model (1.3). Generalized linear models are a generalization of general linear models in the following ways:

1. An assumption often made in a GLM is that the components of y are independently normally distributed with constant variance. We can relax this assumption to permit the distribution to be any distribution that belongs to the exponential family of distributions. This includes distributions such as Normal, Poisson, gamma and binomial distributions.

2. Instead of modeling µ = E(y) directly as a function of the linear predictor Xβ, we model some function g(µ) of µ. Thus, the model becomes

$$ g(\mu) = \eta = X\beta. \tag{2.3} $$

The function g(·) in (2.3) is called a link function.

The specification of a generalized linear model thus involves:

1. specification of the distribution
2. specification of the link function g(·)
3. specification of the linear predictor Xβ.

We will discuss these issues, starting with the distribution.
2.3 The exponential family of distributions
The exponential family is a general class of distributions that includes many well known distributions as special cases. It can be written in the form

$$ f(y; \theta, \phi) = \exp\left[ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right] \tag{2.4} $$

where a(·), b(·) and c(·) are some functions. The so called canonical parameter θ is some function of the location parameter of the distribution. Some authors distinguish between the exponential family, which is (2.4) assuming that a(φ) is unity, and the exponential dispersion family, which includes the function a(φ) while assuming that the so called dispersion parameter φ is a constant; see Jørgensen (1987); Lindsey (1997, p. 10f). As examples of the usefulness of the exponential family, we will demonstrate that some well-known distributions are, in fact, special cases of the exponential family.
2.3.1 The Poisson distribution
The Poisson distribution can be written as a special case of an exponential family distribution. It has probability function

$$ f(y; \mu) = \frac{\mu^y e^{-\mu}}{y!} = \exp\left[ y \log(\mu) - \mu - \log(y!) \right]. \tag{2.5} $$

We can compare this expression with (2.4). We note that θ = log(µ), which means that µ = exp(θ). We insert this into (2.5) and get

$$ f(y; \mu) = \exp\left[ y\theta - \exp(\theta) - \log(y!) \right]. $$

Thus, (2.5) is a special case of (2.4) with θ = log(µ), b(θ) = exp(θ), c(y, φ) = −log(y!) and a(φ) = 1.
2.3.2 The binomial distribution
The binomial distribution can be written as

$$ f(y; p) = \binom{n}{y} p^y (1 - p)^{n - y} = \exp\left[ y \log\left( \frac{p}{1 - p} \right) + n \log(1 - p) + \log\binom{n}{y} \right]. \tag{2.6} $$

We use θ = log[p/(1 − p)], i.e. p = exp(θ)/[1 + exp(θ)]. This can be inserted into (2.6) to give

$$ f(y; p) = \exp\left[ y\theta + n \log\left( \frac{1}{1 + \exp(\theta)} \right) + \log\binom{n}{y} \right]. $$

It follows that the binomial distribution is an exponential family distribution with θ = log[p/(1 − p)], b(θ) = n log[1 + exp(θ)], $c(y, \phi) = \log\binom{n}{y}$ and a(φ) = 1.
2.3.3 The Normal distribution
The Normal distribution can be written as

$$ f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-(y - \mu)^2 / (2\sigma^2)} = \exp\left[ \frac{y\mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2} \log\left( 2\pi\sigma^2 \right) \right]. \tag{2.7} $$

This is an exponential family distribution with θ = µ, φ = σ², a(φ) = φ, b(θ) = θ²/2, and c(y, φ) = −[y²/φ + log(2πφ)]/2. (In fact, it is an exponential dispersion family distribution; see above.)
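The three identifications above can be checked numerically. The following sketch evaluates the general form (2.4) with the stated θ, b(θ), c(y, φ) and a(φ) and compares the result with the usual textbook formulas. The function `expfam_density` and the parameter values are illustrative assumptions, not part of the original text.

```python
import math


def expfam_density(y, theta, b, c, a_phi=1.0):
    """Evaluate exp[(y*theta - b(theta)) / a(phi) + c(y, phi)], as in (2.4)."""
    return math.exp((y * theta - b(theta)) / a_phi + c(y))


# Poisson: theta = log(mu), b = exp, c = -log(y!), a(phi) = 1.
mu = 3.5
pois_form = expfam_density(4, math.log(mu), math.exp,
                           lambda y: -math.lgamma(y + 1))
pois_direct = mu ** 4 * math.exp(-mu) / math.factorial(4)

# Binomial: theta = log(p/(1-p)), b = n log(1 + exp(theta)), c = log C(n, y).
n, p = 10, 0.3
binom_form = expfam_density(4, math.log(p / (1 - p)),
                            lambda t: n * math.log(1 + math.exp(t)),
                            lambda y: math.log(math.comb(n, y)))
binom_direct = math.comb(n, 4) * p ** 4 * (1 - p) ** (n - 4)

# Normal: theta = mu, b = theta^2 / 2, a(phi) = phi = sigma^2.
m, s2, y0 = 1.0, 2.0, 0.4
norm_form = expfam_density(y0, m, lambda t: t ** 2 / 2,
                           lambda y: -(y ** 2 / s2 + math.log(2 * math.pi * s2)) / 2,
                           a_phi=s2)
norm_direct = math.exp(-(y0 - m) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

for name, form, direct in [("Poisson", pois_form, pois_direct),
                           ("Binomial", binom_form, binom_direct),
                           ("Normal", norm_form, norm_direct)]:
    print(f"{name}: exponential family form {form:.6f}, direct formula {direct:.6f}")
```

In each case the two columns agree up to floating-point rounding.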
2.3.4 The function b(·)
The function b(·) is of special importance in generalized linear models because b(·) describes the relationship between the mean value and the variance in the distribution. To show how this works we consider Maximum Likelihood estimation of the parameters of the model. For a brief introduction to Maximum Likelihood estimation, see Appendix B.

The first derivative: b′

We denote the log likelihood function with l(θ, φ; y) = log f(y; θ, φ). According to likelihood theory it holds that

$$ E\left( \frac{\partial l}{\partial \theta} \right) = 0 \tag{2.8} $$

and that

$$ E\left( \frac{\partial^2 l}{\partial \theta^2} \right) + E\left[ \left( \frac{\partial l}{\partial \theta} \right)^2 \right] = 0. \tag{2.9} $$

From (2.4) we obtain that l(θ, φ; y) = (yθ − b(θ))/a(φ) + c(y, φ). Therefore,

$$ \frac{\partial l}{\partial \theta} = \frac{y - b'(\theta)}{a(\phi)} \tag{2.10} $$

and

$$ \frac{\partial^2 l}{\partial \theta^2} = -\frac{b''(\theta)}{a(\phi)} \tag{2.11} $$

where b′ and b″ denote the first and second derivative, respectively, of b with respect to θ. From (2.8) and (2.10) we get

$$ E\left( \frac{\partial l}{\partial \theta} \right) = E\left\{ \frac{y - b'(\theta)}{a(\phi)} \right\} = 0 \tag{2.12} $$

so that

$$ E(y) = \mu = b'(\theta). \tag{2.13} $$

Thus the mean value of the distribution is equal to the first derivative of b with respect to θ. For the distributions we have discussed so far, these derivatives are:

Poisson: b(θ) = exp(θ) gives b′(θ) = exp(θ) = µ
Binomial: b(θ) = n log(1 + exp(θ)) gives b′(θ) = n exp(θ)/(1 + exp(θ)) = np
Normal: b(θ) = θ²/2 gives b′(θ) = θ = µ

For each of the distributions the mean value is equal to b′(θ).

The second derivative: b″

From (2.9) and (2.11) we get

$$ -\frac{b''(\theta)}{a(\phi)} + \frac{Var(y)}{a(\phi)^2} = 0 \tag{2.14} $$

so that

$$ Var(y) = a(\phi) \cdot b''(\theta). \tag{2.15} $$
We see that the variance of y is a product of two terms: the second derivative of b(·), and the function a(φ) which is independent of θ. The parameter φ is called the dispersion parameter and b″(θ) is called the variance function.

For the distributions that we have discussed so far the variance functions are as follows:

Poisson: b″(θ) = exp(θ) = µ

Binomial:

$$ b''(\theta) = n \, \frac{\exp(\theta) \left( 1 + \exp(\theta) \right) - \left( \exp(\theta) \right)^2}{\left( 1 + \exp(\theta) \right)^2} = n \, \frac{\exp(\theta)}{\left( 1 + \exp(\theta) \right)^2} = np(1 - p) $$

Normal: a(φ) b″(θ) = φ · 1 = σ²
The variance function is often written as V (µ) = b00 (θ). The notation V (µ) does not mean “the variance of µ”; rather, V (µ) indicates how the variance depends on the mean value µ in the distribution, where µ in turn is a function of θ. In the table on page 41 we summarize some characteristics of a few distributions in the exponential family; see also McCullagh and Nelder (1989).
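The relations E(y) = b′(θ) and Var(y) = a(φ) b″(θ) can be illustrated numerically by differentiating b(·) with finite differences. This is only a sketch; the helper names, the step length h and the parameter values are assumptions chosen for illustration.

```python
import math


def first_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2


# Poisson with mu = 3.5: b(theta) = exp(theta), a(phi) = 1.
mu = 3.5
theta_pois = math.log(mu)
pois_mean = first_derivative(math.exp, theta_pois)   # should be close to mu
pois_var = second_derivative(math.exp, theta_pois)   # should be close to mu

# Binomial with n = 10, p = 0.3: b(theta) = n log(1 + exp(theta)), a(phi) = 1.
n, p = 10, 0.3
theta_bin = math.log(p / (1 - p))


def b_bin(t):
    return n * math.log(1 + math.exp(t))


bin_mean = first_derivative(b_bin, theta_bin)        # should be close to n p
bin_var = second_derivative(b_bin, theta_bin)        # should be close to n p (1 - p)

# Normal with mu = 1, sigma^2 = 2: b(theta) = theta^2 / 2, a(phi) = sigma^2.
sigma2 = 2.0


def b_norm(t):
    return t ** 2 / 2


norm_mean = first_derivative(b_norm, 1.0)            # should be close to mu
norm_var = sigma2 * second_derivative(b_norm, 1.0)   # should be close to sigma^2

print(pois_mean, pois_var)
print(bin_mean, bin_var)
print(norm_mean, norm_var)
```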
2.4 The link function
The link function g(·) is a function relating the expected value of the response Y to the predictors X1, ..., Xp. It has the general form g(µ) = η = Xβ. The function g(·) must be monotone and differentiable. For a monotone function we can define the inverse function g⁻¹(·) by the relation g⁻¹(g(µ)) = µ. The choice of link function depends on the type of data. For continuous normal-theory data an identity link may be appropriate. For data in the form of counts, the link function should restrict µ to be positive, while data in the form of proportions should use a link that restricts µ to the interval [0, 1].

Some commonly used link functions and their inverses are:

The identity link: η = µ. The inverse is simply µ = η.

The logit link: η = log[µ/(1 − µ)]. The inverse µ = exp(η)/(1 + exp(η)) is restricted to the interval [0, 1].

The probit link: η = Φ⁻¹(µ), where Φ is the standard Normal distribution function. The inverse µ = Φ(η) is restricted to the interval [0, 1].

The complementary log-log link: η = log[−log(1 − µ)]. The inverse µ = 1 − exp(−exp(η)) is restricted to the interval [0, 1].

Power links: η = (µ^λ − 1)/λ, where we take η = log(µ) for λ = 0. Examples of power links are η = µ²; η = 1/µ; η = √µ; and η = log(µ). These all belong to the Box-Cox family of transformations. For λ ≠ 0, the inverse link is µ = exp[ln(λη + 1)/λ]. For the log link with λ = 0, the inverse link is µ = exp(η), which is restricted to the interval (0, ∞).

Figure 2.1: [not reproduced]
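The link functions listed above and their inverses can be written down directly. The sketch below stores each link as a pair (g, g⁻¹) and checks that g⁻¹(g(µ)) = µ. The dictionary `LINKS` and the function `power_link` are illustrative names, and the probit link uses `statistics.NormalDist` from the Python standard library for Φ and Φ⁻¹.

```python
import math
from statistics import NormalDist

_std_normal = NormalDist()  # standard Normal distribution, mean 0 and sd 1

# Each entry is (link g, inverse link g^-1).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: math.exp(eta) / (1 + math.exp(eta))),
    "probit": (_std_normal.inv_cdf, _std_normal.cdf),
    "cloglog": (lambda mu: math.log(-math.log(1 - mu)),
                lambda eta: 1 - math.exp(-math.exp(eta))),
    "log": (math.log, math.exp),
}


def power_link(lam):
    """Box-Cox type power link for lam != 0, with its inverse."""
    return (lambda mu: (mu ** lam - 1) / lam,
            lambda eta: (lam * eta + 1) ** (1 / lam))


# Round trip: the inverse link applied to the link returns mu.
for name, (g, g_inv) in LINKS.items():
    for mu in (0.2, 0.5, 0.9):
        print(f"{name:9s} mu={mu}: eta={g(mu): .4f}, back-transformed={g_inv(g(mu)):.4f}")

sqrt_link, sqrt_inverse = power_link(0.5)
print(sqrt_link(4.0), sqrt_inverse(sqrt_link(4.0)))
```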
2.4.1 Canonical links
Certain link functions are, in a sense, “natural” for certain distributions. These are called canonical links. The canonical link is that function which transforms the mean to a canonical location parameter of the exponential dispersion family member (Lindsey, 1997). This means that the canonical link is that function g(·) for which g(µ) = θ. It holds that:

Poisson: θ = log(µ) so the canonical link is log.
Binomial: θ = log[p/(1 − p)] which is the logit link.
Normal: θ = µ so the canonical link is the identity link.

The canonical links for a few distributions are listed in the table on page 41. Computer procedures such as Proc Genmod in SAS use the canonical link by default once the distribution has been specified. It should be noted, however, that there is no guarantee that the canonical links will always provide the “best” model for a given set of data. In any particular application the data may exhibit peculiar behavior, or there may be theoretical justification for choosing links other than the canonical links.
2.5 The linear predictor
The linear predictor Xβ plays the same role in generalized linear models as in general linear models. In regression settings, X contains values of independent variables. In ANOVA settings, X contains dummy variables corresponding to qualitative predictors (treatments, blocks etc). In general, the model states that some function of the mean of y is a linear function of the predictors: η = Xβ. As noted in Chapter 1, X is called a design matrix.
2.6 Maximum likelihood estimation
Estimation of the parameters of generalized linear models is often done using the Maximum Likelihood method. The estimates are those parameter values that maximize the log likelihood, which for a single observation can be written

$$ l = \log\left[ L(\theta, \phi; y) \right] = \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi). \tag{2.16} $$
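The parameters of the model form a p × 1 vector of regression coefficients β; the canonical parameter θ is, in turn, a function of β through µ and η. Differentiation of l with respect to the elements of β, using the chain rule, yields

$$ \frac{\partial l}{\partial \beta_j} = \frac{\partial l}{\partial \theta} \frac{d\theta}{d\mu} \frac{d\mu}{d\eta} \frac{\partial \eta}{\partial \beta_j}. \tag{2.17} $$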
We have shown earlier that b′(θ) = µ, and that b″(θ) = V, the variance function. Thus, ∂µ/∂θ = V. From the expression for the linear predictor $\eta = \sum_j x_j \beta_j$ we obtain ∂η/∂β_j = x_j. Putting things together,

$$ \frac{\partial l}{\partial \beta_j} = \frac{(y - \mu)}{a(\phi)} \frac{1}{V} \frac{d\mu}{d\eta} x_j = \frac{W}{a(\phi)} (y - \mu) \frac{d\eta}{d\mu} x_j. \tag{2.18} $$
In (2.18), W is defined from

$$ W^{-1} = \left( \frac{d\eta}{d\mu} \right)^2 V. \tag{2.19} $$
So far, we have written the likelihood for one single observation. By summing over the observations, the likelihood equation for one parameter β_j is given by

$$ \sum_i \frac{W_i (y_i - \mu_i)}{a(\phi)} \frac{d\eta_i}{d\mu_i} x_{ij} = 0. \tag{2.20} $$
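As a worked special case (a sketch added for illustration, using only (2.19), (2.20) and the definition of a canonical link, η = θ): with a canonical link, dη/dµ = dθ/dµ = 1/V, so the weight and the derivative cancel and the likelihood equations simplify.

```latex
% Canonical link: \eta = \theta, hence d\eta/d\mu = d\theta/d\mu = 1/V.
\begin{align*}
W^{-1} &= \left( \frac{d\eta}{d\mu} \right)^{2} V = \frac{1}{V^{2}}\, V = \frac{1}{V}
  \quad\Longrightarrow\quad W = V, \\
W \frac{d\eta}{d\mu} &= V \cdot \frac{1}{V} = 1, \\
\sum_i \frac{W_i (y_i - \mu_i)}{a(\phi)} \frac{d\eta_i}{d\mu_i}\, x_{ij} = 0
  \quad&\Longrightarrow\quad \sum_i (y_i - \mu_i)\, x_{ij} = 0 .
\end{align*}
% Example: Poisson distribution with log link, \mu_i = \exp(\sum_k x_{ik} \beta_k):
% \sum_i \bigl( y_i - \exp(\textstyle\sum_k x_{ik} \beta_k) \bigr) x_{ij} = 0 .
```

The last step uses that a(φ) does not depend on i and can be divided out. For the Poisson distribution with log link the equations therefore state that the residuals y_i − µ_i are orthogonal to each column of X.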
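We can solve (2.20) with respect to β_j since the µ_i:s are functions of the parameters β_j. Asymptotic variances and covariances of the parameter estimates are obtained through the inverse of the Fisher information matrix (see Appendix B). Thus,

$$
\begin{pmatrix}
Var(\hat{\beta}_0) & Cov(\hat{\beta}_0, \hat{\beta}_1) & \cdots & Cov(\hat{\beta}_0, \hat{\beta}_{p-1}) \\
Cov(\hat{\beta}_1, \hat{\beta}_0) & Var(\hat{\beta}_1) & & \vdots \\
\vdots & & \ddots & \\
Cov(\hat{\beta}_{p-1}, \hat{\beta}_0) & \cdots & & Var(\hat{\beta}_{p-1})
\end{pmatrix}
=
\left\{ -E
\begin{pmatrix}
\dfrac{\partial^2 l}{\partial \beta_0^2} & \dfrac{\partial^2 l}{\partial \beta_0 \partial \beta_1} & \cdots & \dfrac{\partial^2 l}{\partial \beta_0 \partial \beta_{p-1}} \\
\dfrac{\partial^2 l}{\partial \beta_1 \partial \beta_0} & \dfrac{\partial^2 l}{\partial \beta_1^2} & & \vdots \\
\vdots & & \ddots & \\
\dfrac{\partial^2 l}{\partial \beta_{p-1} \partial \beta_0} & \cdots & & \dfrac{\partial^2 l}{\partial \beta_{p-1}^2}
\end{pmatrix}
\right\}^{-1}.
\tag{2.21}
$$

2.7 Numerical procedures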
Maximization of the log likelihood (2.16), which is equivalent to solving the likelihood equations (2.20), is done using numerical methods. A commonly used procedure is the iteratively reweighted least squares approach; see McCullagh and Nelder (1989). Briefly, this algorithm works as follows:

1. Linearize the link function g(·) by using the first order Taylor series approximation g(y) ≈ g(µ) + (y − µ) g′(µ) = z.

2. Let $\hat{\eta}_0$ be the current estimate of the linear predictor, and let $\hat{\mu}_0$ be the corresponding fitted value derived from the link function η = g(µ). Form the adjusted dependent variate $z_0 = \hat{\eta}_0 + (y - \hat{\mu}_0) \left( \frac{d\eta}{d\mu} \right)_0$ where the derivative of the link is evaluated at $\hat{\mu}_0$.

3. Define the weight matrix W from $W_0^{-1} = \left( \frac{d\eta}{d\mu} \right)_0^2 V_0$, where V is the variance function.

4. Perform a weighted regression of dependent variable z on predictors x1, x2, ..., xp using weights W0. This gives new estimates $\hat{\beta}_1$ of the parameters, from which a new estimate $\hat{\eta}_1$ of the linear predictor is calculated.

5. Repeat steps 1 to 4 until the changes are sufficiently small.
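The steps above can be sketched for one special case: a Poisson distribution with log link, for which dη/dµ = 1/µ and V = µ, so that W = µ. The data are the somatic cell counts of Example 2.7 (Section 2.11), with a dummy variable for manual milking. The function names `solve` and `irls_poisson_log`, the starting values and the stopping rule are illustrative assumptions, not the algorithm as implemented in any particular package.

```python
import math

# Somatic cell counts from Example 2.7.
mechanical = [2966, 269, 59, 1887, 3452, 189, 93, 618, 130, 2493]
manual = [186, 107, 65, 126, 123, 164, 408, 324, 548, 139]
y = mechanical + manual
# Design matrix: intercept and a dummy for manual milking (mechanical = reference).
X = [[1.0, 0.0]] * len(mechanical) + [[1.0, 1.0]] * len(manual)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x


def irls_poisson_log(X, y, tol=1e-10, max_iter=50):
    """Fit a Poisson model with log link by iteratively reweighted least squares."""
    p = len(X[0])
    mu = [max(yi, 0.5) for yi in y]      # starting fitted values (kept positive)
    eta = [math.log(m) for m in mu]
    beta = [0.0] * p
    for _ in range(max_iter):
        # Step 2: adjusted dependent variate z = eta + (y - mu) * d(eta)/d(mu).
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        # Step 3: weights W = 1 / [(d eta / d mu)^2 V] = mu for this model.
        w = mu
        # Step 4: weighted regression via the normal equations (X'WX) beta = X'Wz.
        XtWX = [[sum(wi * xi[j] * xi[k] for wi, xi in zip(w, X)) for k in range(p)]
                for j in range(p)]
        XtWz = [sum(wi * xi[j] * zi for wi, xi, zi in zip(w, X, z)) for j in range(p)]
        new_beta = solve(XtWX, XtWz)
        change = max(abs(nb - ob) for nb, ob in zip(new_beta, beta))
        beta = new_beta
        eta = [sum(xj * bj for xj, bj in zip(xi, beta)) for xi in X]
        mu = [math.exp(e) for e in eta]
        # Step 5: stop when the estimates no longer change.
        if change < tol:
            break
    return beta


beta = irls_poisson_log(X, y)
print([round(b, 4) for b in beta])
```

The two estimates can be compared with the INTERCEPT and MILKING Man rows of the Genmod output shown in Section 2.11.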
2.8 Assessing the fit of the model
2.8.1 The deviance
The fit of a generalized linear model to data may be assessed through the deviance. The deviance is also used to compare nested models. Different models can have different degrees of complexity. The null model has only one parameter that represents a common mean value µ for all observations. In contrast, the full (or saturated) model has n parameters, one for each observation. For the saturated model, each observation fits the model perfectly, i.e. $y = \hat{y}$. The full model is used as a benchmark for assessing the fit of any model to the data. This is done by calculating the deviance. The deviance is defined as follows: Let $l(\hat{\mu}, \phi; y)$ be the log likelihood of the current model at the Maximum Likelihood estimate, and let $l(y, \phi; y)$ be the log likelihood of the full model. The deviance D is defined as

$$ D = 2 \left( l(y, \phi; y) - l(\hat{\mu}, \phi; y) \right). \tag{2.22} $$

It can be noted that for a Normal distribution, the deviance is just the residual sum of squares. The Genmod procedure in SAS presents two deviance statistics: the deviance and the scaled deviance. For distributions that have a scale parameter φ, the scaled deviance is D* = D/φ. It is actually the scaled deviance that is used for inference. For distributions such as Binomial and Poisson, the deviance and the scaled deviance are identical.

If the model is true, the deviance will asymptotically tend towards a χ² distribution as n increases. This can be used as an over-all test of the adequacy of the model. The degree of approximation to a χ² distribution is different for different types of data.

A second, and perhaps more important use of the deviance is in comparing competing models. Suppose that a certain model gives a deviance D1 on df1 degrees of freedom (df), and that a simpler model produces deviance D2 on df2 degrees of freedom. The simpler model would then have a larger deviance and more df. To compare the two models we can calculate the difference in deviance, (D2 − D1), and relate this to the χ² distribution with (df2 − df1) degrees of freedom. This would give a large-sample test of the significance of the parameters that are included in model 1 but not in model 2. This, of course, requires that the parameters included in model 2 are a subset of the parameters of model 1, i.e. that the models are nested.
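As an illustration of definition (2.22), consider the Poisson distribution. Its log likelihood is Σ[y log(µ) − µ − log(y!)], and the saturated model has fitted values equal to y, so the deviance becomes D = 2 Σ[y log(y/µ̂) − (y − µ̂)]. The sketch below evaluates this for the sheep milk data of Example 2.7, assuming a model with milking method as the only factor, whose fitted values are the two group means. The function names are illustrative.

```python
import math

# Somatic cell counts from Example 2.7.
mechanical = [2966, 269, 59, 1887, 3452, 189, 93, 618, 130, 2493]
manual = [186, 107, 65, 126, 123, 164, 408, 324, 548, 139]


def poisson_deviance(y, mu):
    """D = 2 * sum[y log(y/mu) - (y - mu)], taking y log(y/mu) = 0 when y = 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2 * total


def group_mean_fits(*groups):
    """Fitted values for a one-factor model: each observation gets its group mean."""
    fitted = []
    for group in groups:
        fitted += [sum(group) / len(group)] * len(group)
    return fitted


y = mechanical + manual
mu_hat = group_mean_fits(mechanical, manual)
deviance = poisson_deviance(y, mu_hat)
df = len(y) - 2   # 20 observations, 2 estimated parameters

print(f"Deviance = {deviance:.4f} on {df} df")
```

The result can be compared with the Deviance row of the Genmod output in Section 2.11.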
2.8.2 The generalized Pearson χ² statistic
An alternative to the deviance for testing and comparing models is the Pearson χ², which can be defined as

$$ \chi^2 = \sum_i \frac{(y_i - \hat{\mu}_i)^2}{\hat{V}(\hat{\mu}_i)}. \tag{2.23} $$

Here, $\hat{V}(\hat{\mu})$ is the estimated variance function. For the Normal distribution, this is again the residual sum of squares of the model, so in this case, the deviance and Pearson’s χ² coincide. In other cases, the deviance and Pearson’s χ² have different asymptotic properties and may produce different results. Maximum likelihood estimation of the parameters in generalized linear models seeks to minimize the deviance, which may be one reason to prefer the deviance over the Pearson χ². Another reason is that the Pearson χ² does not have the same additive properties as the deviance for comparing nested models. Computer packages for generalized linear models often produce both the deviance and the Pearson χ². Large differences between these may be an indication that the χ² approximation is bad.
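A minimal sketch of (2.23), assuming a Poisson model (variance function V(µ) = µ) with milking method as the only factor, fitted to the sheep milk data of Example 2.7. The fitted values of that model are the group means. The function name `pearson_chi2` is illustrative.

```python
# Somatic cell counts from Example 2.7.
mechanical = [2966, 269, 59, 1887, 3452, 189, 93, 618, 130, 2493]
manual = [186, 107, 65, 126, 123, 164, 408, 324, 548, 139]


def pearson_chi2(y, mu, variance_function):
    """Generalized Pearson statistic: sum of (y - mu)^2 / V(mu)."""
    return sum((yi - mi) ** 2 / variance_function(mi) for yi, mi in zip(y, mu))


y = mechanical + manual
mean_mech = sum(mechanical) / len(mechanical)
mean_man = sum(manual) / len(manual)
mu_hat = [mean_mech] * len(mechanical) + [mean_man] * len(manual)

# Poisson variance function: V(mu) = mu.
chi2_poisson = pearson_chi2(y, mu_hat, lambda m: m)
# Normal variance function V(mu) = 1 gives the residual sum of squares.
rss = pearson_chi2(y, mu_hat, lambda m: 1.0)

print(f"Pearson chi-square (Poisson) = {chi2_poisson:.4f}")
print(f"Residual sum of squares      = {rss:.1f}")
```

The Poisson value can be compared with the Pearson Chi-Square row of the Genmod output in Section 2.11.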
2.8.3 Akaike’s information criterion
An idea that has been put forward by several authors is to “penalize” the likelihood function such that simpler models are preferred. A general expression of this idea is to measure the fit of a model to data by a measure such as

$$ D_C = D + \alpha q \phi. \tag{2.24} $$

Here, D is the deviance, q is the number of parameters in the model, and φ is the dispersion parameter. The penalty is added because the deviance itself can only decrease when parameters are added. If φ is constant, it can be shown that α ≈ 4 is roughly equivalent to testing one parameter at the 5% level. It can be shown that α ≈ 2 would lead to prediction errors near the minimum. This is the information criterion (AIC) suggested by Akaike (1973). Akaike’s information criterion is often used for model selection: the model with the smallest value of D_C would then be the preferred model. Note that some computer programs report the AIC with the opposite sign; large values of AIC would then indicate a good model. The AIC is not very useful in itself since the scale is somewhat arbitrary. The main use is to compare the AIC of competing models in order to decide which model to prefer.
2.8.4 The choice of measure of fit
The deviance, and the Pearson χ², can provide large-sample tests of the fit of the model. The usefulness of these tests depends on the kind of data being analyzed. For example, Collett (1991) concludes that for binary data with all n_i = 1, the deviance cannot be used to assess the over-all fit of the model (p. 65). For Normal models the deviance is equal to the residual sum of squares which is not a model test by itself. The advantage of the deviance, as compared to the Pearson χ², is that it is a likelihood-based test that is useful for comparing nested models. Akaike’s information criterion is often used as a way of comparing several competing models, without necessarily making any formal inference.
2.9 Different types of tests
Hypotheses on single parameters or groups of parameters can be tested in different ways in generalized linear models.
2.9.1 Wald tests
Maximum likelihood estimation of the parameters of some model results in estimates of the parameters and estimates of the standard errors of the estimators. The estimates of standard errors are often asymptotic results that are valid for large samples. Let us denote the asymptotic standard error of the estimator $\hat{\beta}$ with $\hat{\sigma}_{\hat{\beta}}$. If the large-sample conditions are valid, we can test hypotheses about single parameters by using

$$ z = \frac{\hat{\beta}}{\hat{\sigma}_{\hat{\beta}}} \tag{2.25} $$

where z is a standard Normal variate. This is called a Wald test. In normal theory models, tests based on (2.25), but with z replaced by t, are exact. In other cases the Wald tests are asymptotic tests that are valid only in large samples. In some cases the Wald tests are presented as χ² tests. This is based on the fact that if z is standard Normal, then z² follows a χ² distribution on 1 degree of freedom. Note that the Wald tests for single parameters are related to the Pearson χ².
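A Wald test according to (2.25) needs only an estimate and its asymptotic standard error. The sketch below uses hypothetical values (estimate 0.50, standard error 0.20) and obtains the two-sided p-value from the standard Normal distribution through the complementary error function. The function name `wald_test` is illustrative.

```python
import math


def wald_test(estimate, std_error):
    """Return (z, chi-square on 1 df, two-sided p-value) for H0: beta = 0."""
    z = estimate / std_error
    # P(|Z| > |z|) for a standard Normal Z equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, z ** 2, p_value


# Hypothetical estimate and standard error, for illustration only.
z, chi2, p_value = wald_test(0.50, 0.20)
print(f"z = {z:.2f}, chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

The χ² form and the z form give the same p-value, since z² is referred to a χ² distribution on 1 degree of freedom.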
2.9.2 Likelihood ratio tests
Likelihood ratio tests are based on the following principle. Denote with L1 the likelihood function maximized over the full parameter space, and denote with L0 the likelihood function maximized over parameter values that correspond to the null hypothesis being tested. The likelihood ratio statistic is

$$ -2 \log(L_0 / L_1) = -2 \left[ \log(L_0) - \log(L_1) \right] = -2 (l_0 - l_1). \tag{2.26} $$

Under rather general conditions, it can be shown that the distribution of the likelihood ratio statistic approaches a χ² distribution as the sample size grows. Generally, the number of degrees of freedom of this statistic is equal to the number of parameters in model 1 minus the number of parameters in model 0. Exceptions occur if there are linear constraints on the parameters. In the same way as the Wald tests are related to the Pearson χ², the likelihood ratio tests are related to the deviance.
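The statistic (2.26) can be sketched for a simple case: a Poisson model for the sheep milk data of Example 2.7, where model 0 has a common mean and model 1 has one mean per milking method. The ML fitted values are the overall mean and the group means, respectively. This only illustrates the mechanics of the calculation; the function name is illustrative, and the Poisson assumption itself is questioned for these data in Section 2.11.

```python
import math

# Somatic cell counts from Example 2.7.
mechanical = [2966, 269, 59, 1887, 3452, 189, 93, 618, 130, 2493]
manual = [186, 107, 65, 126, 123, 164, 408, 324, 548, 139]
y = mechanical + manual


def poisson_loglik(y, mu):
    """Poisson log likelihood: sum of y log(mu) - mu - log(y!)."""
    return sum(yi * math.log(mi) - mi - math.lgamma(yi + 1) for yi, mi in zip(y, mu))


# Model 0 (null hypothesis): one common mean.
overall_mean = sum(y) / len(y)
l0 = poisson_loglik(y, [overall_mean] * len(y))

# Model 1: separate means for the two milking methods.
mean_mech = sum(mechanical) / len(mechanical)
mean_man = sum(manual) / len(manual)
l1 = poisson_loglik(y, [mean_mech] * len(mechanical) + [mean_man] * len(manual))

lr_statistic = -2 * (l0 - l1)
df = 2 - 1   # parameters in model 1 minus parameters in model 0
# Upper tail probability of a chi-square variate on 1 df.
p_value = math.erfc(math.sqrt(lr_statistic / 2))

print(f"LR statistic = {lr_statistic:.1f} on {df} df, p = {p_value:.3g}")
```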
2.9.3 Score tests
We will illustrate score tests (also called efficient score tests) based on arguments taken from Agresti (1996). In Figure 2.2, we illustrate a hypothetical likelihood function. $\hat{\beta}$ is the Maximum Likelihood estimator of some parameter β. We are testing a hypothesis H0: β = 0. L1 and L0 denote the likelihood under H1 and H0, respectively.

The Wald test uses the behavior of the likelihood function at the ML estimate $\hat{\beta}$. The asymptotic standard error of $\hat{\beta}$ depends on the curvature of the likelihood function close to $\hat{\beta}$.

The score test is based on the behavior of the likelihood function close to 0, the value stated in H0. If the derivative at H0 is “large”, this would be an indication that H0 is wrong, while a derivative close to 0 would be a sign that we are close to the maximum. The score test is calculated as the square of the ratio of this derivative to its asymptotic standard error. It can be treated as an asymptotic χ² variate on 1 df.

The likelihood ratio test uses information on the log likelihood both at $\hat{\beta}$ and at 0. It compares the likelihoods L1 and L0 using the asymptotic χ² distribution of −2(log L0 − log L1). Thus, in a sense, the LR statistic uses more information than the Wald and score statistics. For this reason, Agresti (1996) suggests that the likelihood ratio statistic may be the most reliable of the three.
Figure 2.2: A likelihood function indicating information used in Wald, LR and score tests. [Plot of L(β) against β, with L0 marked at β = 0 and L1 at the ML estimate.]
2.9.4 Tests of Type 1 or 3
Tests in generalized linear models have the same sequential property as tests in general linear models. Proc Genmod in SAS offers Type 1 or Type 3 tests. The interpretation of these tests is the same as in general linear models. In a Type 1 analysis the result of a test depends on the order in which terms are included in the model. A Type 3 analysis does not depend on the order in which the Model statement is written: it can be seen as an attempt to mimic the analysis that would be obtained if the data had been balanced. In general linear models the Type 1 and Type 3 tests are obtained through sums of squares. In generalized linear models the tests are likelihood ratio tests, but there is an option in Genmod to use Wald tests instead. See Chapter 1 for a discussion on Type 1 and Type 3 tests.
2.10 Descriptive measures of fit
In general linear models, the fit of the model to data can be summarized as $R^2 = SS_{Model} / SS_T$. It holds that 0 ≤ R² ≤ 1. A value of R² close to 1 would indicate a good fit. An adjusted version of R² has been proposed to account for the fact that R² increases even when irrelevant factors are added to the model; see Chapter 1. Similar measures of fit have been proposed also for generalized linear models.

Cox and Snell (1989) suggested to use

$$ R^2 = 1 - \left( \frac{L_0}{L_{Max}} \right)^{2/n} \tag{2.27} $$

where L is the likelihood. This measure equals the usual R² for Normal models, but has the disadvantage that it is always smaller than 1. In fact,

$$ R^2_{max} = 1 - (L_0)^{2/n}. \tag{2.28} $$

For example, in a binomial model with half of the observations in each category, this maximum equals 0.75 even if there is perfect agreement between the variables. Nagelkerke (1991) therefore suggested the modification

$$ \bar{R}^2 = R^2 / R^2_{max}. \tag{2.29} $$
Similar coefficients have been suggested by Ben-Akiva and Lerman (1985), and by Horowitz (1982). The coefficients by Cox and Snell, and by Nagelkerke, are available in the Logistic procedure in SAS.
2.11 An application
Example 2.7 Samuels and Witmer (1999) report on a study of methods for producing sheep’s milk for use in the manufacture of cheese. Ewes were randomly assigned to either mechanical or manual milking. It was suspected that the mechanical method might irritate the udder and thus produce a higher concentration of somatic cells in the milk. The data in the following table show the counts of somatic cells for each animal, together with the mean ȳ and the standard deviation s for each method.

          Mechanical   Manual
             2966        186
              269        107
               59         65
             1887        126
             3452        123
              189        164
               93        408
              618        324
              130        548
             2493        139
    ȳ        1216        219
    s        1343        156
This is close to a textbook example that may be used for illustrating two-sample t tests. However, closer scrutiny of the data reveals that the variation is quite different in the two groups. In fact, some kind of relationship between the mean and the variance may be at hand. ¤

We will illustrate the analysis of these data by attempting several different analyses. An analysis similar to a standard two-sample t test can be obtained by using a generalized linear model with a dummy variable for group membership, a Normal distribution and an identity link. Cell counts often conform to Poisson distributions. This means that a Poisson distribution with the canonical (log) link is another option. The SAS programs for analysis of these data were of the following type:

    PROC GENMOD DATA=sheep;
      CLASS milking;
      MODEL count = milking / dist = normal;
    RUN;
A model assuming a Normal distribution and with milking method as a factor was fitted. By default, the program then chooses the canonical link which, for the Normal distribution, is the identity link. In this model, the Wald test of group differences is significant (p = 0.014). If we use a standard t test assuming equal variances, this gives p = 0.044. The difference between these p-values is explained by the fact that the Wald test essentially approximates the t distribution with a Normal distribution.

The Poisson model gives a deviance of 14863 on 18 df. The group difference is highly significant: χ2 = 5451.12 on 1 df, p < 0.0001. The output for this model is as follows:

Model Information

Description           Value
Data Set              WORK.SHEEP
Distribution          POISSON
Link Function         LOG
Dependent Variable    COUNT
Observations Used     20

Criteria For Assessing Goodness Of Fit

Criterion             DF    Value         Value/DF
Deviance              18    14862.7182    825.7066
Scaled Deviance       18    14862.7182    825.7066
Pearson Chi-Square    18    14355.5077    797.5282
Scaled Pearson X2     18    14355.5077    797.5282
Log Likelihood         .    83800.0507    .
Analysis Of Parameter Estimates

Parameter          DF    Estimate    Std Err    ChiSquare     Pr>Chi
INTERCEPT           1      7.1030     0.0091    613300.717    0.0001
MILKING   Man       1     -1.7139     0.0232    5451.1201     0.0001
MILKING   Mech      0      0.0000     0.0000    .             .
SCALE               0      1.0000     0.0000    .             .

NOTE: The scale parameter was held fixed.
However, since the ratio deviance/df is so large, a second analysis was made where the program estimated the dispersion parameter φ. This approach, which is related to a phenomenon called overdispersion, is discussed in the next chapter. In the analysis where the scale parameter was estimated, the scaled deviance was 18.0 and the p value for milking was 0.010. Two other distributions were also tested: the gamma distribution and the inverse Gaussian distribution. These were used with their respective canonical links. In addition, a Wilcoxon-Mann-Whitney test was made. The results for all these models can be summarized as follows:
Model               p value
Normal, Glim         0.0140
Normal, t test       0.0440
Log normal           0.0405
Gamma                0.0086
Inverse Gaussian     0.0610
Poisson             <0.0001
Poisson with φ       0.0102
Wilcoxon             0.1620
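The "Normal, Glim" entry can be reproduced by hand, which also shows where the Wald test and the t test part ways. The sketch below is an illustration only: it assumes that the Wald test uses the maximum likelihood variance estimate (divisor n) with a Normal reference distribution, while the pooled t statistic uses the unbiased estimate (divisor n − 2). The variable names are hypothetical.

```python
import math

mechanical = [2966, 269, 59, 1887, 3452, 189, 93, 618, 130, 2493]
manual = [186, 107, 65, 126, 123, 164, 408, 324, 548, 139]

def mean(xs):
    return sum(xs) / len(xs)

def two_sided_normal_p(z):
    """P(|Z| > |z|) for a standard Normal variate Z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

n1, n2 = len(mechanical), len(manual)
diff = mean(mechanical) - mean(manual)
# Error sum of squares around the two group means.
sse = (sum((y - mean(mechanical)) ** 2 for y in mechanical)
       + sum((y - mean(manual)) ** 2 for y in manual))

# Wald test: ML variance estimate (divide by n), Normal reference distribution.
sigma2_ml = sse / (n1 + n2)
z_wald = diff / math.sqrt(sigma2_ml * (1 / n1 + 1 / n2))
p_wald = two_sided_normal_p(z_wald)

# Pooled t statistic: unbiased variance estimate (divide by n - 2).
sigma2_pooled = sse / (n1 + n2 - 2)
t_stat = diff / math.sqrt(sigma2_pooled * (1 / n1 + 1 / n2))

print(round(z_wald, 3), round(p_wald, 4), round(t_stat, 3))
```

The Wald statistic is larger than the t statistic and is referred to a distribution with lighter tails, so its p value is the smaller of the two.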
Although most models seem to indicate significant group differences, the p values are rather different. The first Poisson model gives a strongly significant result while the standard t test is only just below the magical 0.05 limit. The non-parametric Wilcoxon test is not significant. This illustrates the fact that significance testing is not a mechanical procedure. To decide which of the results to use we need to assess the different models, based both on statistical considerations and on subject-matter knowledge. Methods for statistical model diagnostics in generalized linear models are the topic of the next chapter.
2.12 Exercises
Exercise 2.1 The distribution of waiting times (for example the time you wait in line at a bank) can sometimes be approximated by an exponential distribution. The density of the exponential distribution is $f(x) = \lambda e^{-\lambda x}$ for $x > 0$. Does the exponential distribution belong to the exponential family? If so, what are the functions b(·) and c(·)? What is the variance function?

Exercise 2.2 Sometimes data are collected which are “essentially Poisson”, but where it is impossible to observe the value y = 0. For example, if data are collected by interviewing occupants in houses on “how many occupants are there in this house”, it would be impossible to get an answer from houses that are not occupied. The truncated Poisson distribution can sometimes be used to model such data. It has probability function
$$p(y_i \mid \lambda) = \frac{e^{-\lambda}\lambda^{y_i}}{\left(1 - e^{-\lambda}\right) y_i!}$$
for $y_i = 1, 2, 3, \ldots$
A. Investigate whether the truncated Poisson distribution is a member of the Exponential family.
B. Derive the variance function.

Exercise 2.3 Aanes (1961) studied the effect of a certain type of poisoning in sheep. The survival time and the weight were recorded for 13 sheep that had been poisoned. Results:

Weight   Survival      Weight   Survival
    46         44          59         44
    55         27          64        120
    61         24          67         29
    75         24          60         36
    64         36          63         36
    75         36          66         36
    71         44
Find a generalized linear model for the relationship between survival time and weight. Try several different distributions and link functions. Do not use only the canonical link for each distribution. Plot the data and the fitted models.
3. Model diagnostics
3.1 Introduction

In general linear models, the fit of the model to data can be explored by using residual plots and other diagnostic tools. For example, the normality assumption can be examined using normal probability plots, the assumption of homoscedasticity can be checked by plotting residuals against $\hat{y}$, and so on. Model diagnostics in Glim:s can be performed in similar ways.

In this chapter we will discuss different model diagnostic tools for generalized linear models. Our discussion will be fairly general. We will return to these issues in later chapters when we consider analysis of different types of response variables.

The purpose of model diagnostics is to examine whether the model provides a reasonable approximation to the data. If there are indications of systematic deviations between data and model, the model should be modified. The diagnostic tools that we consider are the following:

• Residual plots (similar to the residual plots used in GLM:s) can be used to detect various deviations between data and model: outliers, problems with distributions or variances, dependence, and so on.
• Some observations may have an unusually large impact on the results. We will discuss tools to identify such influential observations.
• Overdispersion means that the variance is larger than would be expected for the chosen distribution. We will discuss ways to detect over-dispersion and to modify the model to account for over-dispersion.
3.2 The Hat matrix

In general linear models, a residual is the difference between the observed value of y and the fitted value $\hat{y}$ that would be obtained if the model were perfectly true: $e = y - \hat{y}$. The concept of a “residual” is not quite as clear-cut in generalized linear models.
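The estimated expected value (“fitted value”) of the response in a general linear model is $\widehat{E(Y_i)} = \hat{\mu}_i = \hat{y}_i$. The fitted values are linear functions of the observed values. For linear predictors, it holds that
$$\hat{\mathbf{y}} = \mathbf{H}\mathbf{y} \tag{3.1}$$
where $\mathbf{H}$ is known as the “hat matrix”. $\mathbf{H}$ is idempotent (i.e. $\mathbf{H}\mathbf{H} = \mathbf{H}$) and symmetric.

Example: In simple linear regression, $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$. In this case, the hat matrix is $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$.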
The hat matrix may have a more complex form in other models than GLM:s. Except in degenerate cases, it is possible to compute the hat matrix. We will not give general formulae here; however, most computer software for Glim:s has options to print the hat matrix.
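The properties of the hat matrix in the simple linear regression case can be checked directly. The sketch below builds $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ for a design matrix with an intercept and one covariate, using plain Python lists. The data values and function names are hypothetical and serve only as an illustration.

```python
def hat_matrix_simple_regression(x):
    """H = X (X'X)^{-1} X' for the design matrix X = [1, x]."""
    n = len(x)
    sx = sum(x)
    sxx = sum(v * v for v in x)
    det = n * sxx - sx * sx
    # Inverse of the 2x2 matrix X'X = [[n, sx], [sx, sxx]].
    inv = [[sxx / det, -sx / det], [-sx / det, n / det]]
    rows = [[1.0, v] for v in x]
    return [[sum(rows[i][a] * inv[a][b] * rows[j][b]
                 for a in range(2) for b in range(2))
             for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

x = [1.0, 2.0, 3.0, 4.0, 6.0]    # hypothetical covariate values
y = [2.1, 3.9, 6.2, 7.8, 12.1]   # hypothetical responses

H = hat_matrix_simple_regression(x)
HH = matmul(H, H)                 # should reproduce H (idempotency)
y_hat = [sum(H[i][j] * y[j] for j in range(len(y))) for i in range(len(y))]
trace = sum(H[i][i] for i in range(len(x)))  # equals the number of parameters
print(round(trace, 6), [round(v, 3) for v in y_hat])
```

The trace equals 2, the number of parameters in the model, a property that is used again in the discussion of leverage.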
3.3 Residuals in generalized linear models

In general linear models, the observed residuals are simply the difference between the observed values of y and the values $\hat{y}$ that are predicted from the model: $\hat{e} = y - \hat{y}$. In generalized linear models, the variance of the residuals is often related to the size of $\hat{y}$. Therefore, some kind of scaling mechanism is needed if we want to use the residuals for plots or other model diagnostics. Several suggestions have been made on how to achieve this.
3.3.1 Pearson residuals

The raw residual for an observation $y_i$ can be defined as $\hat{e}_i = y_i - \hat{y}_i$. The Pearson residual is the raw residual standardized with the standard deviation of the fitted value:
$$e_{i,Pearson} = \frac{y_i - \hat{y}_i}{\sqrt{\widehat{Var}(\hat{y}_i)}}. \tag{3.2}$$
The Pearson residuals are related to the Pearson χ2 through $\chi^2 = \sum_i e_{i,Pearson}^2$.

Example: In a Poisson model, $e_{i,Pearson} = \dfrac{y_i - \hat{y}_i}{\sqrt{\hat{y}_i}}$.

Example: In a binomial model, $e_{i,Pearson} = \dfrac{y_i - n_i\hat{p}_i}{\sqrt{n_i\hat{p}_i(1 - \hat{p}_i)}}$.
If the model holds, Pearson residuals can often be considered to be approximately normally distributed with a constant variance, in large samples. However, even when they are standardized with the standard error of $\hat{y}$, the variance of Pearson residuals cannot be assumed to be 1. This is since we have standardized the residuals using estimated standard errors. Still, the standard errors of Pearson residuals can be estimated. It can be shown that adjusted Pearson residuals can be obtained as
$$e_{i,adj,P} = \frac{e_{i,Pearson}}{\sqrt{1 - h_{ii}}} \tag{3.3}$$
where $h_{ii}$ are diagonal elements from the hat matrix. The adjusted Pearson residuals can often be considered to be standard Normal, which means that e.g. residuals outside ±2 will occur in about 5% of the cases. This can be used to detect possible outliers in the data.
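As an illustration, the Pearson residuals for the Poisson model of the somatic cell counts in Example 2.7 can be computed by hand. The sketch assumes that the fitted value for each animal is its group mean (which holds for a model with milking method as the only factor) and that the leverage is $h_{ii} = 1/10$ for every observation, as is the case in a one-factor model with ten observations per group. The function name is hypothetical.

```python
import math

mechanical = [2966, 269, 59, 1887, 3452, 189, 93, 618, 130, 2493]
manual = [186, 107, 65, 126, 123, 164, 408, 324, 548, 139]

def pearson_residuals_poisson(ys):
    """(y - mu) / sqrt(mu), with mu taken as the group mean (the assumed fitted value)."""
    mu = sum(ys) / len(ys)
    return [(y - mu) / math.sqrt(mu) for y in ys]

pearson = pearson_residuals_poisson(mechanical) + pearson_residuals_poisson(manual)

# The sum of squared Pearson residuals is the Pearson chi-square statistic.
pearson_chi2 = sum(e * e for e in pearson)

# Adjusted residuals (3.3), assuming h_ii = 1/10 for all observations.
h = 1.0 / 10
adjusted = [e / math.sqrt(1.0 - h) for e in pearson]
n_outside = sum(1 for e in adjusted if abs(e) > 2)

print(round(pearson_chi2, 4), n_outside)
```

The sum of squares should agree with the Pearson Chi-Square reported in the Genmod output for the Poisson model. All adjusted residuals fall outside ±2 here, which is another sign that the Poisson variance assumption does not hold for these data.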
3.3.2 Deviance residuals

Observation number i contributes an amount $d_i$ to the deviance, as a measure of fit of the model: $D = \sum_i d_i$. We define the deviance residuals as
$$e_{i,Deviance} = \operatorname{sign}(y_i - \hat{y}_i)\sqrt{d_i}. \tag{3.4}$$
The deviance residuals can also be written in standardized form, i.e. such that their variance is close to unity. This is obtained as
$$e_{i,adj,D} = \frac{e_{i,Deviance}}{\sqrt{1 - h_{ii}}} \tag{3.5}$$
where $h_{ii}$ are again diagonal elements from the hat matrix.
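A corresponding sketch for deviance residuals, again for the Poisson model of the data in Example 2.7. It assumes the usual Poisson deviance contribution $d_i = 2\,[\,y_i \log(y_i/\hat{\mu}_i) - (y_i - \hat{\mu}_i)\,]$ and takes the group mean as the fitted value; the function name is hypothetical.

```python
import math

mechanical = [2966, 269, 59, 1887, 3452, 189, 93, 618, 130, 2493]
manual = [186, 107, 65, 126, 123, 164, 408, 324, 548, 139]

def deviance_residuals_poisson(ys):
    """sign(y - mu) * sqrt(d_i) with the assumed Poisson deviance contribution d_i."""
    mu = sum(ys) / len(ys)  # fitted value in a one-factor model
    residuals = []
    for y in ys:
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        d = 2.0 * (log_term - (y - mu))
        # max() guards against tiny negative values caused by rounding.
        residuals.append(math.copysign(math.sqrt(max(d, 0.0)), y - mu))
    return residuals

dev_res = deviance_residuals_poisson(mechanical) + deviance_residuals_poisson(manual)
deviance = sum(e * e for e in dev_res)  # D is the sum of the contributions d_i
print(round(deviance, 2))
```

The squared residuals add up to the deviance of the model, which can be compared with the value reported by Genmod for the Poisson fit.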
3.3.3 Score residuals

The Wald tests, Likelihood ratio tests and Score tests, presented in the previous chapter, provide different ways of testing hypotheses about parameters of the model. The score residuals are related to the score tests. In Maximum Likelihood estimation, the parameter estimates are obtained by solving the score equations, which are of type
$$U = \frac{\partial l}{\partial \theta} = 0 \tag{3.6}$$
where θ is some parameter. The score equations involve sums of terms $U_i$, one for each observation. These terms can, properly standardized, be interpreted as residuals, i.e. as the contribution from each observation to the score. The standardized score residuals are obtained from
$$e_{i,adj,S} = \frac{U_i}{\sqrt{(1 - h_{ii})\,v_i}} \tag{3.7}$$
where $h_{ii}$ are diagonal elements of the hat matrix, and $v_i$ are elements of a certain weight matrix.
3.3.4 Likelihood residuals

Theoretically it would be possible to compare the deviance of a model that comprises all the data with the deviance of a model with observation i excluded. However, this procedure would require heavy computations. An approximation to the residuals that would be obtained using this procedure is
$$e_{i,Likelihood} = \operatorname{sign}(y_i - \hat{y}_i)\sqrt{h_{ii}\,(e_{i,Score})^2 + (1 - h_{ii})\,(e_{i,Deviance})^2} \tag{3.8}$$
where $h_{ii}$ are diagonal elements of the hat matrix. This is a kind of weighted average of the deviance and score residuals.
3.3.5 Anscombe residuals

The types of residuals discussed so far have distributions that may not always be close to Normal, in samples of the sizes we often meet in practice. Anscombe (1953) suggested that the residuals may be defined based on some transformation of observed data and fitted values. The transformation would be chosen such that the calculated residuals are approximately standard Normal. Anscombe defined the residuals as
$$e_{i,Anscombe} = \frac{A(y_i) - A(\hat{y}_i)}{\sqrt{\widehat{Var}\left(A(y_i) - A(\hat{y}_i)\right)}} \tag{3.9}$$
The function A(·) is chosen depending on the type of data. For example, for Poisson data the Anscombe residuals take the form
$$e_{i,Anscombe} = \frac{\tfrac{3}{2}\left(y_i^{2/3} - \hat{y}_i^{2/3}\right)}{\hat{y}_i^{1/6}}.$$
In general, the Anscombe residuals are rather difficult to calculate, which may explain why they have not reached widespread use.
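For the Poisson case the Anscombe residual is easy to evaluate. The following sketch implements the Poisson form $\tfrac{3}{2}(y^{2/3} - \hat{y}^{2/3})/\hat{y}^{1/6}$ and compares it with the Pearson residual for one made-up observation. The function names and the numbers are hypothetical.

```python
import math

def anscombe_residual_poisson(y, mu):
    """Anscombe residual for Poisson data: 1.5 * (y^(2/3) - mu^(2/3)) / mu^(1/6)."""
    return 1.5 * (y ** (2.0 / 3.0) - mu ** (2.0 / 3.0)) / mu ** (1.0 / 6.0)

def pearson_residual_poisson(y, mu):
    """Pearson residual for Poisson data, for comparison."""
    return (y - mu) / math.sqrt(mu)

# Hypothetical observation far above a small fitted value.
y_obs, mu_hat = 8.0, 1.0
a_res = anscombe_residual_poisson(y_obs, mu_hat)
p_res = pearson_residual_poisson(y_obs, mu_hat)
print(round(a_res, 4), round(p_res, 4))
```

For a small fitted value the transformation pulls in the long right tail, so the Anscombe residual is considerably smaller than the Pearson residual for the same observation.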
3.3.6 The choice of residuals

The types of residuals discussed above are related to the types of tests and other model building tools that are used:

• The deviance residuals are related to the deviance as a measure of fit of the model and to Likelihood ratio tests.
• The Pearson residuals are related to the Pearson χ2 and to the Wald tests.
• The score residuals are related to score tests.
• The likelihood residuals are a compromise between score and deviance residuals.
• The Anscombe residuals, although theoretically appealing, are not often used in practice in programs for fitting generalized linear models.

In the previous chapter we suggested that the likelihood ratio tests may be preferred over Wald tests and score tests for hypothesis testing in Glim:s. By extending this argument, the deviance residuals may be the preferred type of residuals to use for model diagnostics. Collett (1991) suggested that either the deviance residuals or the likelihood residuals should be used.
3.4 Influential observations and outliers
Some of the observations in the data may have an unduly large impact on the parameter estimates. If such so called influential observations are changed by a small amount, or if they are deleted, the estimates may change drastically. An outlier is an observation for which the model does not give a good approximation. Outliers can often be detected using different types of plots. Note that influential observations are not necessarily outliers. An observation can be influential and still be close to the main bulk of the data. Diagnostic tools are needed to detect influential observations and outliers.
3.4.1 Leverage

The leverage of observation i on the fitted value $\hat{\mu}_i$ is the derivative of $\hat{\mu}_i$ with respect to $y_i$. This derivative is the corresponding diagonal element $h_{ii}$ of the hat matrix $\mathbf{H}$. Since $\mathbf{H}$ is idempotent it holds that $\operatorname{tr}(\mathbf{H}) = p$, i.e. the number of parameters. The average leverage of all observations is therefore p/n. Observations with a leverage of, say, twice this amount may need to be examined. Computer software like the Insight procedure in SAS (2000b), and the related JMP program (SAS, 2000a) have options to store the hat matrix in a file for further processing and plotting.
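The rule of thumb of twice the average leverage can be sketched as follows. The example uses the weights of the 13 sheep from Exercise 2.3 as the covariate and assumes a Normal model with identity link, an intercept and one slope, for which the leverage has the closed form $h_{ii} = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$. The function name is hypothetical.

```python
weights = [46, 55, 61, 75, 64, 75, 71, 59, 64, 67, 60, 63, 66]

def leverages_simple_regression(x):
    """Diagonal of the hat matrix for a model with an intercept and one covariate."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((v - x_bar) ** 2 for v in x)
    return [1.0 / n + (v - x_bar) ** 2 / sxx for v in x]

lev = leverages_simple_regression(weights)
p = 2                               # intercept and slope
cutoff = 2 * p / len(weights)       # twice the average leverage p/n
flagged = [i for i, h in enumerate(lev) if h > cutoff]

print(round(sum(lev), 6), round(cutoff, 4), flagged)
```

The leverages sum to p, and only the lightest animal (46 kg, index 0) exceeds the cutoff. Note that leverage depends on the covariate values only, not on the response.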
3.4.2 Cook’s distance and Dfbeta

Dfbeta is the change in the estimate of a parameter when observation i is deleted. The Dfbetas can be combined over all parameters as
$$D_i = \frac{1}{p}\left(\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(i)}\right)' \mathbf{X}'\mathbf{X} \left(\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(i)}\right).$$
It can be shown that this yields the so called Cook’s distance $C_i$. In principle, the calculation of $C_i$ (or $D_i$) requires extensive re-fitting of the model which may take time even on fast computers. However, an approximation to $C_i$ can be obtained as
$$C_i \approx \frac{h_{ii}\,(e_{i,Pearson})^2}{p\,(1 - h_{ii})} \tag{3.10}$$
where p is the number of parameters and $h_{ii}$ are elements of the hat matrix.
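The approximation (3.10) needs only the Pearson residuals and the leverages. The sketch below applies it to the Poisson model for the data in Example 2.7, assuming group means as fitted values, $h_{ii} = 1/10$ for all observations and p = 2 parameters. The function names are hypothetical.

```python
import math

mechanical = [2966, 269, 59, 1887, 3452, 189, 93, 618, 130, 2493]
manual = [186, 107, 65, 126, 123, 164, 408, 324, 548, 139]
counts = mechanical + manual

def pearson_residuals_poisson(ys):
    """(y - mu) / sqrt(mu), with mu taken as the group mean."""
    mu = sum(ys) / len(ys)
    return [(y - mu) / math.sqrt(mu) for y in ys]

def cooks_distance_approx(pearson, leverage, p):
    """Approximation (3.10): h_ii * e_i^2 / (p * (1 - h_ii))."""
    return [h * e * e / (p * (1.0 - h)) for e, h in zip(pearson, leverage)]

pearson = pearson_residuals_poisson(mechanical) + pearson_residuals_poisson(manual)
leverage = [0.1] * len(counts)   # assumed: one-factor model, 10 observations per group
cook = cooks_distance_approx(pearson, leverage, p=2)

# Index of the observation with the largest approximate Cook's distance.
most_influential = max(range(len(cook)), key=cook.__getitem__)
print(most_influential, counts[most_influential])
```

Because all leverages are equal in this design, the ranking is driven entirely by the squared Pearson residuals; the largest count in the mechanical group stands out.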
3.4.3 Goodness of fit measures
Another type of measure of the influence of an observation is to compute the change in deviance, or the change in Pearson’s χ2 , when the observation is deleted. A large change in the measure of fit may indicate an influential observation.
3.4.4 Effect on data analysis

Computer programs for generalized linear models often include options to calculate the measures of influence discussed above, and others. It is good data-analytic practice to use such program options to investigate influential observations and outliers. A statistical result that may be attributed to very few observations should, of course, be doubted. Thus, data analysis in generalized linear models should contain both an analysis of the residuals, discussed above, and an analysis of influential observations.
3.5 Partial leverage

In models with several explanatory variables it may be of interest to study the impact of a variable, say variable $x_j$, on the results. The partial leverage of variable j can be obtained in the following way. Let $\mathbf{X}_{[j]}$ be the design matrix with the column corresponding to variable $x_j$ removed. Fit the generalized linear model to this design matrix and calculate the residuals. Also, fit a model with variable $x_j$ as the response and the remaining variables $\mathbf{X}_{[j]}$ as regressors. Calculate the residuals from this model as well. A partial leverage plot is a plot of these two sets of residuals. It shows how much the residuals change between models with and without variable $x_j$. Partial leverage plots can be produced in procedure Insight (SAS, 2000b).
3.6 Overdispersion

A generalized linear model can sometimes give a good summary of the data, in the sense that both the linear predictor and the distribution are correctly chosen, and still the fit of the full model may be poor. One possible reason for this may be a phenomenon called over-dispersion.

Over-dispersion occurs when the variance of the response is larger than would be expected for the chosen distribution. For example, if we use a Poisson distribution to model the data we would expect the variance to be equal to the mean value: $\mu = \sigma^2$. Similarly, for data that are modelled using a binomial distribution, the variance is a function of the response probability: $\sigma^2 = np(1 - p)$. Thus, for many distributions it is possible to infer what the variance “should be”, given the mean value. In Chapter 2 we noted that for distributions in the exponential family, the variance is some function of the mean: $\sigma^2 = V(\mu)$.

Under-dispersion, i.e. a “too small” variance, is theoretically possible but rather unusual in practice. Interesting examples of under-dispersion can be found in the analysis of Mendel’s classical genetic data; these data are better than would be expected by chance.

In models that do not contain any scale parameter, over-dispersion can be detected as a poor model fit, as measured by deviance/df. Note, however, that a poor model fit can also be caused by the wrong choice of linear predictor or wrong choice of distribution or link. Thus, a poor fit does not necessarily mean that we have over-dispersion.

Over-dispersion may have many different reasons. However, the main reason is often some type of lack of homogeneity. This lack of homogeneity may occur between groups of individuals; between individuals; and within individuals.

As an example, consider a dose-response experiment where the same dose of an insecticide is given to two batches of insects. In one of the batches, 50 out of 100 insects die, while in the other batch 65 out of 100 insects die. Formally, this means that the response probabilities in the two batches are significantly different (the reader may wish to confirm that a “textbook” Chi-square test gives χ2 = 4.6, p = 0.032). This may indicate that the batches of insects are not homogeneous with respect to tolerance to the insecticide. If these data are part of some larger dose-response experiment, using more batches of animals and more doses, this type of inhomogeneity would result in a poor model fit because of overdispersion.
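The “textbook” Chi-square test for the two batches can be confirmed with a few lines of code. This is a sketch of the ordinary Pearson test of homogeneity in a 2×2 table without continuity correction; the tail probability uses the identity $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$, which is valid for one degree of freedom only. The function name is hypothetical.

```python
import math

def chi2_two_batches(dead1, n1, dead2, n2):
    """Pearson chi-square test (1 df, no continuity correction) for a 2x2 table."""
    observed = [[dead1, n1 - dead1], [dead2, n2 - dead2]]
    row_totals = [n1, n2]
    col_totals = [dead1 + dead2, (n1 - dead1) + (n2 - dead2)]
    total = n1 + n2
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square with 1 df
    return chi2, p_value

chi2, p_value = chi2_two_batches(50, 100, 65, 100)
print(round(chi2, 2), round(p_value, 3))
```

The result agrees with the values quoted in the text, χ2 = 4.6 and p = 0.032.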
3.6.1 Models for overdispersion

Before any attempts are made to model the over-dispersion, you have to examine all other possible reasons for poor model fit. These include:

• Wrong choice of linear predictor. For example, you may have to add terms to the predictor, such as new covariates, interaction terms or nonlinear terms.
• Wrong choice of link function.
• There may be outliers in the data.
• When the data are sparse, the assumptions underlying the large-sample theory may not be fulfilled, thus causing a poor model fit.

A common effect of over-dispersion is that estimates of standard errors are under-estimates. This leads to test statistics which are too large: it becomes too easy to get a significant result.

A simple way to model over-dispersion is to introduce a scale parameter φ into the variance function. Thus, we would assume that $Var(Y) = \phi\sigma^2$. For binomial data this means that we would use the variance $np(1 - p)\phi$, and for Poisson data we would use $\phi\mu$ as variance. The parameter φ is often called the over-dispersion parameter. A simple, but somewhat rough, way to estimate φ is to fit a “maximal model”¹ to the data, and to use the mean deviance (i.e. Deviance/df), or Pearson χ2/df, from that model as an estimator of φ. We can then re-fit the model, using the obtained value of the over-dispersion parameter. Williams (1982) suggested a more sophisticated iterative procedure for estimating φ; see Collett (1991) for details.

¹ Note that this “maximal model” is not the same as the saturated model, which has φ = 0. Instead, the “maximal model” is a somewhat subjectively chosen “large” model which includes all effects that can reasonably be included.

A more satisfactory approach would be to model the over-dispersion based on some specific model. One possible model is to assume that the mean parameter has a separate value for each individual. Thus, the mean parameter would be assumed to follow some random distribution over individuals while the response follows a second distribution, given the mean value. This would lead to compound distributions. A few examples of compound distributions are discussed in Chapters 5 and 6. See also Lindsey (1997) for details.

We will return to the topic of over-dispersion as we discuss fitting of generalized linear models to different types of data. Another approach to overdispersion, based on so-called Quasi-likelihood estimation, is discussed in Chapter 8.
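The effect of the simple scale-parameter approach can be sketched with the numbers from the Poisson output in Section 2.11. The sketch assumes that φ is estimated by Deviance/df and that the adjustment amounts to dividing the deviance and the Wald chi-square by the estimate of φ (equivalently, multiplying standard errors by its square root). The variable names are hypothetical; the input numbers are copied from the Genmod output.

```python
import math

# Values copied from the Poisson output for the somatic cell counts.
deviance, df = 14862.7182, 18
wald_chi2_unscaled = 5451.1201      # test of milking method with phi fixed at 1
std_err_unscaled = 0.0232

phi_hat = deviance / df                       # rough estimate of the over-dispersion parameter
scaled_deviance = deviance / phi_hat          # equals df by construction
std_err_scaled = std_err_unscaled * math.sqrt(phi_hat)
wald_chi2_scaled = wald_chi2_unscaled / phi_hat

# Upper tail probability of a chi-square distribution with 1 df.
p_scaled = math.erfc(math.sqrt(wald_chi2_scaled / 2.0))

print(round(phi_hat, 4), round(std_err_scaled, 4),
      round(wald_chi2_scaled, 3), round(p_scaled, 4))
```

Under these assumptions the scaled deviance becomes 18 and the p value for milking method is about 0.010, in line with the “Poisson with φ” result reported in Section 2.11.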
3.7 Non-convergence

When using packages like Genmod for fitting generalized linear models, it may happen that the program reports that the procedure has not converged. Sometimes the convergence is slow and the procedure reports estimates of standard errors that are very large. Typical error messages might be

WARNING: The negative of the Hessian is not positive definite. The convergence is questionable.
WARNING: The procedure is continuing but the validity of the model fit is questionable.
WARNING: The specified model did not converge.
Note that in SAS, the error messages are given in the program log. You can get some output even if these warnings have been given.

Non-convergence occurs because of the structure of the data in relation to the model that is being fitted. A common problem is that the number of observed data values is small relative to the number of parameters in the model. The model is then under-identified. This can easily happen in the analysis of multidimensional crosstables. For example, a crosstable of dimension 4·3·3·3 contains 108 cells. If the sample size is moderate, say n = 100, the average number of observations per cell will be less than 1. It is then easy to imagine that many of the cells will be empty. Convergence problems are likely in such cases.

When the data are binomial, the procedure may fail to converge when it tries to fit estimated proportions close to 0 or 1. This may happen when many observed proportions are 0 or 1.

As general advice: when the procedure does not converge, try to simplify the model as much as possible by removing, in particular, interaction terms. Make tables and other summaries of the data to find out the reasons for the failure to converge.
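The crosstable arithmetic can be made concrete. The sketch below counts the cells and, under the purely illustrative assumption that all cells are equally likely, computes the expected share of empty cells as $(1 - 1/K)^n$ for K cells and n observations.

```python
dimensions = [4, 3, 3, 3]
n_obs = 100

cells = 1
for d in dimensions:
    cells *= d

average_per_cell = n_obs / cells

# Assumption for illustration only: every observation falls in each cell
# with the same probability 1/cells, independently of the others.
expected_empty_share = (1.0 - 1.0 / cells) ** n_obs

print(cells, round(average_per_cell, 3), round(expected_empty_share, 3))
```

Even in this most favourable case of evenly spread data, roughly four cells in ten would be expected to be empty; with uneven cell probabilities the share is larger.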
3.8 Applications

3.8.1 Residual plots
In this section we will discuss a number of useful ways to check models, using the statistics we have discussed in this chapter. As illustrations of the various plots we use the example on somatic cells in the milk of sheep, discussed in the previous chapter (Example 2.7). For the illustrations we use a model with a Normal distribution and an identity link, and a model with a Poisson distribution and a log link. The following types of residual plots are often useful:

1. A plot of residuals against the fitted values $\hat{\eta}$ should show a pattern where the residuals have a constant mean value of 0 and a constant range. Deviations from this “random” pattern may arise because of incorrect link function; wrong choice of scale of the covariates; or omission of non-linear terms in the linear predictor.
2. A plot of residuals against covariates should show the same pattern as the previous plot. Deviations from this pattern may indicate the wrong link function, incorrect choice of scale or omission of non-linear terms.
3. Plotting the residuals in the order the observations are given in the data may help to detect possible dependence between observations.
4. A normal probability plot of the residuals plots the sorted residuals against their expected values. These are given by $\Phi^{-1}\left[(i - 3/8)/(n + 1/4)\right]$ where $\Phi^{-1}$ is the inverse of the standard Normal distribution function, i is the order of the observation, and n is the sample size. This plot should yield a straight line, as long as we can assume that the residuals are approximately Normal.
5. The residuals can also be plotted to detect an omitted covariate u. This is done as follows: fit a model with u as response, using the same model as for y. Obtain unstandardized residuals from both these models, and plot these against each other. Any systematic pattern in this plot may indicate that u should be used as a covariate.
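The expected values used in a normal probability plot (item 4 above) can be computed with the Python standard library. The sketch below pairs the sorted residuals with the scores $\Phi^{-1}[(i - 3/8)/(n + 1/4)]$; the function names and the residual values are hypothetical.

```python
from statistics import NormalDist

def normal_scores(n):
    """Expected Normal order statistics, approximated by inv_cdf((i - 3/8) / (n + 1/4))."""
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((i - 3 / 8) / (n + 1 / 4)) for i in range(1, n + 1)]

def probability_plot_points(residuals):
    """Pairs (expected value, sorted residual) to be plotted against each other."""
    return list(zip(normal_scores(len(residuals)), sorted(residuals)))

scores = normal_scores(5)
# Hypothetical residuals, for illustration only.
points = probability_plot_points([0.8, -1.3, 0.1, 2.2, -0.4])
print([round(s, 3) for s in scores])
```

The scores are symmetric around zero. If the residuals are approximately Normal, the plotted points fall close to a straight line.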
Plots of residuals against predicted values for the data in Example 2.7 are given in Figure 3.1 and Figure 3.2 for Normal and Poisson distributions, respectively. The plots of residuals against predicted values indicate that the variation is larger for larger predicted values. This tendency is strongest for the Normal model.
Figure 3.1: Plot of residuals against predicted values for example data. Normal distribution and identity link.
Figure 3.2: Plot of residuals against predicted values for example data. Poisson distribution and log link.
Figure 3.3: Normal probability plot for the example data. Normal distribution with an identity link.
A Normal probability plot is a plot of the residuals against their normal quantiles. Normal probability plots can be produced, among others, by Proc Univariate in SAS. SAS code for the normal probability plots presented here was as follows. The deviance residuals were stored in the file ut under the name resdev.

PROC UNIVARIATE normal data=ut;
  VAR resdev;
  PROBPLOT resdev / NORMAL (MU=est SIGMA=est color=black w=2 ) height=4;
  LABEL resdev="Deviance residual";
RUN;
Normal probability plots for these data are given in Figures 3.3 and 3.4, for the Normal and Poisson models, respectively. The distribution of the residuals is closer to Normal for the Poisson model, but the fit is not perfect.
3.8.2 Variance function diagnostics

McCullagh and Nelder (1989) suggest the following procedure for checking the variance function. Assume that the variance is proportional to $\mu^{\zeta}$, where ζ is some constant. Fit the model for different values of ζ, and plot the deviance against ζ. The value of ζ for which the deviance is as small as possible is suggested by the data.
Figure 3.4: Normal probability plot for the example data. Poisson distribution with a log link.
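3.8.3 Link function diagnostics

To check the link function we need the so called adjusted dependent variable z. This is defined as $z_i = g(\hat{\mu}_i) + (y_i - \hat{\mu}_i)\,g'(\hat{\mu}_i)$, i.e. the fitted linear predictor $\hat{\eta}_i = g(\hat{\mu}_i)$ plus the raw residual carried over to the scale of the link. This can be plotted against $\hat{\eta}$. If the link is correct this should result in an essentially linear plot.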
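A small sketch of the adjusted dependent variable for a log link follows. It uses the working-response form $z = g(\hat{\mu}) + (y - \hat{\mu})\,g'(\hat{\mu})$ from the iterative fitting algorithm described by McCullagh and Nelder (1989); treating this as the intended definition is an assumption, and the function names and numbers are hypothetical.

```python
import math

def adjusted_dependent_variable(y, mu, link, link_derivative):
    """z = g(mu) + (y - mu) * g'(mu), evaluated at the fitted mean mu."""
    return link(mu) + (y - mu) * link_derivative(mu)

def log_link(mu):
    return math.log(mu)

def log_link_derivative(mu):
    return 1.0 / mu

# Hypothetical observations and fitted means.
ys = [12.0, 7.0, 30.0]
mus = [10.0, 8.0, 25.0]
eta_hat = [log_link(m) for m in mus]
z = [adjusted_dependent_variable(y, m, log_link, log_link_derivative)
     for y, m in zip(ys, mus)]
print([round(v, 4) for v in z], [round(v, 4) for v in eta_hat])
```

An observation equal to its fitted mean gives $z = \hat{\eta}$ exactly, so the plot of z against $\hat{\eta}$ scatters around the identity line when the link is adequate.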
3.8.4 Transformation of covariates

So called partial residual plots can be used to detect whether any of the covariates need to be transformed. The partial residual is defined as $u = z - \hat{\eta} + \hat{\gamma}x$, where z is the adjusted dependent variable, $\hat{\eta}$ is the fitted linear predictor, x is a covariate and $\hat{\gamma}$ is the parameter estimate for the covariate. The partial residuals can be plotted against x. The plot should be approximately linear if no transformation is needed. Curvature in the plot is an indication that x may need to be transformed.
3.9 Exercises

Exercise 3.1 For the data in Exercise 1.1:
A. Calculate predicted values and residuals
B. Plot the residuals against the predicted values
C. Prepare a Normal probability plot
D. Calculate the leverage of all observations
Comment on the results.

Exercise 3.2 For the data in Exercise 1.3:
A. Calculate predicted values and residuals
B. Plot the residuals against the predicted values
C. Prepare a Normal probability plot
D. Calculate the leverage of all observations
Comment on the results.

Exercise 3.3 Use one or two of your “best” models for the data in Exercise 2.3 to:
A. Calculate predicted values and residuals
B. Plot the residuals against the predicted values
C. Prepare a Normal probability plot
D. Calculate the leverage of all observations
Comment on the results.
4. Models for continuous data
4.1 GLM:s as GLIM:s
General linear models, such as regression models, ANOVA, t tests etc. can be stated as generalized linear models by using a Normal distribution and an identity link. We will illustrate this on some of the GLM examples we discussed in Chapter 1. Throughout this chapter, we will use the SAS (2000b) procedure Genmod for data analysis.
4.1.1 Simple linear regression

A simple linear regression model can be written in Genmod as

PROC GENMOD;
  MODEL y = x / DIST=Normal LINK=Identity;
RUN;
The identity link is the default link for the Normal distribution. We used this program on the regression data given on page 9. The results are:
The GENMOD Procedure

Model Information

Description           Value
Data Set              WORK.EMISSION
Distribution          NORMAL
Link Function         IDENTITY
Dependent Variable    EMISSION
Observations Used     8

Criteria For Assessing Goodness Of Fit

Criterion             DF    Value       Value/DF
Deviance               6     25.3765    4.2294
Scaled Deviance        6      8.0000    1.3333
Pearson Chi-Square     6     25.3765    4.2294
Scaled Pearson X2      6      8.0000    1.3333
Log Likelihood         .    -15.9690    .

Analysis Of Parameter Estimates

Parameter    DF    Estimate    Std Err    ChiSquare    Pr>Chi
INTERCEPT     1    -36.7443     3.8179      92.6233    0.0001
TIME          1      2.0978     0.1186     312.8360    0.0001
SCALE         1      1.7810     0.4453     .           .

NOTE: The scale parameter was estimated by maximum likelihood.
The regression model is estimated as $\hat{y} = -36.7443 + 2.0978 \cdot Time$. This is the same estimate as given by a standard regression routine. Note that the deviance reported by the Genmod procedure is equal to the error sum of squares in the output on page 10. Also, the scaled deviance is 8, which is equal to the sample size. The tests, however, are not the same as in a standard regression analysis: the Genmod tests are Wald tests while the tests in the regression output are t tests. These tests are equivalent only in large samples. The Wald test of the hypothesis that the parameter $\beta_j$ is zero is given by
$$z = \frac{\hat{\beta}_j - 0}{\sqrt{\widehat{Var}(\hat{\beta}_j)}},$$
where z is a standard Normal variate. The t test in the regression output was obtained as
$$t = \frac{\hat{\beta}_j - 0}{\sqrt{\widehat{Var}(\hat{\beta}_j)}}$$
with n − p degrees of freedom.

SCALE gives the estimated scale parameter as 1.7810. This is the ML estimate of σ. Note that the ML estimator of $\sigma^2$ is biased. An unbiased estimate of $\sigma^2$ is given by $\hat{\sigma}^2 = Deviance/df = 4.2294$, giving $\hat{\sigma} = \sqrt{4.2294} = 2.0566$. The relation between these two estimates is that the ML estimate does not account for the degrees of freedom: $\hat{\sigma}^2_{ML} = \frac{n-p}{n}\hat{\sigma}^2$. For these data we get $\hat{\sigma}^2_{ML} = \frac{6}{8} \cdot 4.2294 = 3.1721$, so $\hat{\sigma}_{ML} = \sqrt{3.1721} = 1.781$.
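The relations between the quantities in the Genmod output can be verified with a few lines of arithmetic. The sketch below uses only numbers printed in the output above; the variable names are hypothetical.

```python
import math

deviance = 25.3765   # equals the error sum of squares in a Normal model
n, p = 8, 2          # observations and regression parameters

sigma2_unbiased = deviance / (n - p)   # Deviance/df
sigma2_ml = deviance / n               # ML estimate; equals (n - p)/n times the unbiased one
sigma_ml = math.sqrt(sigma2_ml)        # reported as SCALE
scaled_deviance = deviance / sigma2_ml # equals n in Normal theory models

# Wald chi-square for TIME: squared ratio of the estimate to its standard error.
wald_time = (2.0978 / 0.1186) ** 2

print(round(sigma2_unbiased, 4), round(sigma_ml, 4),
      round(scaled_deviance, 4), round(wald_time, 1))
```

The Wald chi-square differs slightly from the printed value 312.8360 because the estimate and standard error are rounded to four decimals in the output.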
4.1.2 Simple ANOVA

A Genmod program for an ANOVA model (of which the simple t test is a special case) can be written as

PROC GENMOD DATA=lissdata;
  CLASS medium;
  MODEL diff = medium / DIST=normal LINK=identity;
RUN;
The output from this program, using the data on page 16, contains the following information:

Criteria For Assessing Goodness Of Fit

Criterion             DF    Value        Value/DF
Deviance              50     905.1155    18.1023
Scaled Deviance       50      57.0000     1.1400
Pearson Chi-Square    50     905.1155    18.1023
Scaled Pearson X2     50      57.0000     1.1400
Log Likelihood         .    -159.6823     .

Analysis Of Parameter Estimates

Parameter               DF    Estimate    Std Err    ChiSquare    Pr>Chi
INTERCEPT                1      9.9079     1.4089      49.4563    0.0001
MEDIUM   Diatrizoate     1     11.4841     2.2717      25.5554    0.0001
MEDIUM   Hexabrix        1     -5.9473     1.9363       9.4340    0.0021
MEDIUM   Isovist         1     -8.2437     1.9363      18.1257    0.0001
MEDIUM   Mannitol        1     -2.2192     1.9363       1.3136    0.2518
MEDIUM   Omnipaque       1     -1.5182     1.8902       0.6451    0.4219
MEDIUM   Ringer          1     -9.6979     2.0624      22.1116    0.0001
MEDIUM   Ultravist       0      0.0000     0.0000      .          .
SCALE                    1      3.9849     0.3732      .          .

NOTE: The scale parameter was estimated by maximum likelihood.

LR Statistics For Type 3 Analysis

Source    DF    ChiSquare    Pr>Chi
MEDIUM     6      62.1517    0.0001
The scale parameter, which is the ML estimator of σ, is estimated as 3.98, while an unbiased estimate of $\sigma^2$ is given by Deviance/df = 18.1023. The scaled deviance is 57. As noted above, the scaled deviance is equal to n and the deviance is equal to the residual sum of squares in Normal theory models. The effect of Medium is significant (p < 0.0001). The parameter estimates for the different media are given, which makes it possible to make e.g. pairwise comparisons between the media. Note that the parameter estimates are the same as for the ANOVA output given on page 17. However, the ANOVA gives the tests as an over-all F test and as t tests for single parameters, while the Genmod analysis gives the type 3 test as a $\chi^2$ approximation to the likelihood ratio test, and the tests of single parameters as Wald tests.

The examples given in this section show that many analyses that can be run as general linear models, using e.g. Proc GLM in SAS, can alternatively be run using Proc Genmod. In fact, the JMP program (SAS Institute, 2000a), as well as the related procedure Insight in SAS, take a generalized linear model approach to all model fitting. This also holds for the pioneering Glim software (Francis et al, 1993).
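The Wald tests in the parameter table can be reproduced from the estimates and their standard errors. The following Python sketch (an illustration, not part of the SAS output) squares the ratio estimate/standard error and refers it to a $\chi^2$ distribution with 1 df; the function name is an assumption made here, and small discrepancies against the listing are expected because the inputs are rounded.

```python
import math

def wald_chi_square(estimate, std_err):
    """Wald statistic z**2 and its p-value from a chi-square with 1 df."""
    z = estimate / std_err
    chi2 = z * z
    # For 1 df, P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# (estimate, standard error) pairs copied from the Genmod output
rows = {
    "INTERCEPT": (9.9079, 1.4089),
    "MEDIUM Isovist": (-8.2437, 1.9363),
    "MEDIUM Mannitol": (-2.2192, 1.9363),
    "MEDIUM Omnipaque": (-1.5182, 1.8902),
}
for name, (est, se) in rows.items():
    chi2, p = wald_chi_square(est, se)
    print(f"{name:18s} chi2 = {chi2:8.4f}  p = {p:.4f}")
```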
4.2
The choice of distribution
One advantage of the generalized linear model approach is that it is not necessary to limit the models to Normal distributions. In many cases there are theoretical justifications for assuming other distributions than the Normal. Experience with the type of data at hand can often suggest a suitable distribution. Figure 4.1 (based on Leemis, 1986) summarizes the relationships between some common distributions. Note, however, that not all these distributions are members of the exponential family.
4.3
The Gamma distribution
Among all distributions in the exponential family, a particularly useful class of distributions is the gamma distribution. If α is positive, then the integral

$$\Gamma(\alpha) = \int_0^{\infty} t^{\alpha-1} e^{-t}\, dt \tag{4.1}$$

is called a gamma function. For the gamma function, it holds that

$$\Gamma(\alpha + 1) = \alpha\,\Gamma(\alpha) \quad \text{for } \alpha > 0 \tag{4.2}$$

and that, for positive integers n,

$$\Gamma(n) = (n-1)! \tag{4.3}$$

where $(n-1)! = (n-1)\cdot(n-2)\cdot\ldots\cdot 2\cdot 1$. The gamma distribution is defined, using the gamma function, as

$$f(y; \alpha, \beta) = \frac{1}{\Gamma(\alpha)\,\beta^{\alpha}}\, y^{\alpha-1} e^{-y/\beta}. \tag{4.4}$$

The gamma distribution has two parameters, α and β. The parameter α describes the shape of the distribution, mainly the peakedness, and is often called the shape parameter. The parameter β mostly influences the spread of the distribution and is called the scale parameter. For the gamma distribution it holds that $E(y) = \alpha\beta$ and $Var(y) = \alpha\beta^2$. Note that the gamma distribution is a member of the exponential family. It has a reciprocal canonical link; in fact, $g(\mu) = -1/\mu$. The variance function of the gamma distribution is $V(\mu) = \mu^2$. These relations also hold for the special cases of the gamma distribution that are described below. A few examples of gamma distributions are illustrated in Figure 4.2.
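The moment formulas and the factorial property of the gamma function are easy to check numerically. The sketch below is a Python illustration under stated assumptions: the values α = 2.5 and β = 3 are arbitrary, and the integrals are approximated with a simple midpoint rule rather than any special-purpose routine.

```python
import math

def gamma_pdf(y, alpha, beta):
    """Gamma density (4.4) with shape alpha and scale beta, for y > 0."""
    return y ** (alpha - 1) * math.exp(-y / beta) / (math.gamma(alpha) * beta ** alpha)

def moments(alpha, beta, upper=200.0, steps=200_000):
    """Midpoint-rule approximations of the mean and variance."""
    h = upper / steps
    m1 = m2 = 0.0
    for i in range(steps):
        y = (i + 0.5) * h
        w = gamma_pdf(y, alpha, beta) * h
        m1 += y * w
        m2 += y * y * w
    return m1, m2 - m1 * m1

alpha, beta = 2.5, 3.0          # arbitrary illustrative values
mean, var = moments(alpha, beta)
print(mean, alpha * beta)       # numerical mean next to alpha * beta
print(var, alpha * beta ** 2)   # numerical variance next to alpha * beta**2
print(math.gamma(5), math.factorial(4))  # Gamma(n) compared with (n - 1)!
```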
4.3.1
The Chi-square distribution
The $\chi^2$ distribution is a special case of the gamma distribution. A $\chi^2$ distribution with p degrees of freedom can be obtained as a gamma distribution with parameters α = p/2 and β = 2. The $\chi^2$ distribution has mean value $E(\chi^2) = p$ and variance $Var(\chi^2) = 2p$. Note that, for data from a Normal distribution, $\frac{(n-1)s^2}{\sigma^2} \sim \chi^2$ with (n − 1) degrees of freedom. For this reason, the gamma distribution is sometimes used for modelling of variances. An example of this is given on page 157.

Figure 4.1: Relationships among common distributions. Adapted from Leemis (1986).

Figure 4.2: Gamma distributions with parameters α = 1 and β = 1, 2, 3 and 5, respectively.
4.3.2
The Exponential distribution
The exponential distribution has density

$$f(y; \beta) = \frac{1}{\beta}\, e^{-y/\beta} \tag{4.5}$$

It can be obtained as a gamma distribution with α = 1. The exponential distribution is sometimes used as a simple model for lifetime data.
4.3.3
An application with a gamma distribution
Example 4.1 Hurn et al (1945), quoted from McCullagh and Nelder (1989), studied the clotting time of blood. Two different clotting agents were compared for different concentrations of plasma. The data are:

            Clotting time
    Conc    Agent 1   Agent 2
       5       118        69
      10        58        35
      15        42        26
      20        35        21
      30        27        18
      40        25        16
      60        21        13
      80        19        12
     100        18        12
Duration data can often be modeled using the gamma distribution. The canonical link of the gamma distribution is minus the inverse link, −1/µ. Preliminary analysis of the data suggested that the relation between clotting time and concentration was better approximated by a linear function if the concentrations were log-transformed. Thus, the models that were fitted to the data were of type

$$\frac{1}{\mu} = \beta_0 + \beta_1 d + \beta_2 x + \beta_3 d x$$

where x = log(conc) and d is a dummy variable with d = 1 for lot 1 and d = 0 for lot 2. This is a kind of covariance analysis model (see Chapter 1). A Genmod analysis of the full model gave the following output:

    Criteria For Assessing Goodness Of Fit

    Criterion             DF       Value     Value/DF
    Deviance              14      0.0294       0.0021
    Scaled Deviance       14     17.9674       1.2834
    Pearson Chi-Square    14      0.0298       0.0021
    Scaled Pearson X2     14     18.2205       1.3015
    Log Likelihood         .    -26.5976            .

    Analysis Of Parameter Estimates

    Parameter        DF    Estimate    Std Err    ChiSquare   Pr>Chi
    INTERCEPT         1     -0.0239     0.0013     359.9825   0.0001
    AGENT      1      1      0.0074     0.0015      24.9927   0.0001
    AGENT      2      0      0.0000     0.0000            .        .
    LC                1      0.0236     0.0005    1855.0452   0.0001
    LC*AGENT   1      1     -0.0083     0.0006     164.0704   0.0001
    LC*AGENT   2      0      0.0000     0.0000            .        .
    SCALE             1    611.1058   203.6464            .        .

    NOTE: The scale parameter was estimated by maximum likelihood.
Figure 4.3: Relation between clotting time and log(concentration)
We can see that all parameters are significantly different from zero, which means that we cannot simplify the model any further. The scaled deviance is 17.97 on 14 df. A plot of the fitted model, along with the data, is given in Figure 4.3. The fit is good, but McCullagh and Nelder note that the lowest concentration value might have been misrecorded. □
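To see what the fitted reciprocal-link model implies on the original scale, the parameter estimates can be turned into fitted mean clotting times. The Python sketch below is an illustration only: it assumes that LC is the natural logarithm of the concentration and that agent 2 is the reference level (as the zero rows in the output indicate); the function name is hypothetical.

```python
import math

# Estimates copied from the Genmod output (agent 2 is the reference level)
b_intercept, b_agent1 = -0.0239, 0.0074
b_lc, b_lc_agent1 = 0.0236, -0.0083

def fitted_time(conc, agent):
    """Fitted mean clotting time under the link 1/mu = linear predictor."""
    d = 1 if agent == 1 else 0
    x = math.log(conc)  # natural logarithm assumed for LC
    eta = b_intercept + b_agent1 * d + (b_lc + b_lc_agent1 * d) * x
    return 1.0 / eta

conc = [5, 10, 15, 20, 30, 40, 60, 80, 100]
observed = {1: [118, 58, 42, 35, 27, 25, 21, 19, 18],
            2: [69, 35, 26, 21, 18, 16, 13, 12, 12]}
for agent in (1, 2):
    for c, y in zip(conc, observed[agent]):
        print(f"agent {agent}  conc {c:3d}  observed {y:3d}  fitted {fitted_time(c, agent):6.1f}")
```

Because the printed estimates are rounded to four decimals, the fitted values are approximate; the largest gap is expected at the lowest concentration, in line with the remark by McCullagh and Nelder.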
4.4
The inverse Gaussian distribution
The inverse Gaussian distribution, also called the Wald distribution, has its roots in models for random movement of particles, called Brownian motion after the British botanist Robert Brown. The density function is

$$f(x; \mu, \lambda) = \left(\frac{\lambda}{2\pi x^3}\right)^{1/2} \exp\left[\frac{-\lambda (x-\mu)^2}{2\mu^2 x}\right] \tag{4.6}$$

The distribution has two parameters, µ and λ. It has mean value µ and variance $\mu^3/\lambda$. It belongs to the exponential family and is available in procedure Genmod. The distribution is skewed to the right, and resembles the lognormal and gamma distributions. A graph of the shape of inverse Gaussian distributions is given in Figure 4.4.

Figure 4.4: Inverse Gaussian distributions with λ = 1 and mean values µ = 1 (lowest curve), µ = 2, µ = 3 and µ = 4, respectively.

In a so-called Wiener process for a particle, the time T it takes for the particle to reach a barrier for the first time has an inverse Gaussian distribution. The distribution has also been used to model the length of time a particle remains in the blood; maternity data; crop field size; and length of stay in hospitals. See Armitage and Colton (1998) for references.
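The stated mean µ and variance µ³/λ can be checked against the density (4.6) by numerical integration. This Python sketch is an illustration under simple assumptions: µ = 2 and λ = 1 are chosen to match one of the curves in Figure 4.4, and a midpoint rule on (0, 200) stands in for the exact integral.

```python
import math

def inverse_gaussian_pdf(x, mu, lam):
    """Inverse Gaussian density (4.6) for x > 0."""
    return math.sqrt(lam / (2.0 * math.pi * x ** 3)) * math.exp(
        -lam * (x - mu) ** 2 / (2.0 * mu ** 2 * x))

def moments(mu, lam, upper=200.0, steps=400_000):
    """Midpoint-rule approximations of total probability, mean and variance."""
    h = upper / steps
    total = m1 = m2 = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        w = inverse_gaussian_pdf(x, mu, lam) * h
        total += w
        m1 += x * w
        m2 += x * x * w
    return total, m1, m2 - m1 * m1

mu, lam = 2.0, 1.0
total, mean, var = moments(mu, lam)
print(total)                 # should be close to 1
print(mean, mu)              # numerical mean next to mu
print(var, mu ** 3 / lam)    # numerical variance next to mu**3 / lambda
```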
4.5
Model diagnostics
For the example data on page 76, the deviance residuals and the predicted values were stored in a file for further analysis. In this section we will present some examples of model diagnostics based on these data.
4.5.1
Plot of residuals against predicted values
The residuals can be plotted against the predicted values. In Normal theory models this kind of plot can be used to detect heteroscedasticity. Such a plot for our example data is given in Figure 4.5. The plot does not show the even, random pattern that would be expected. Two observations in the lower right corner are possible outliers.
Figure 4.5: Plot of residuals against predicted values for the Gamma regression data
4.5.2
Normal probability plot
The Normal probability plot can be used to assess the distributional properties of the residuals. For most generalized linear models the residuals can be regarded as asymptotically Normal. However, the distributional properties of the residuals in finite samples depend upon the type of model. Still, the normal probability plot is a useful tool for detecting anomalies in the data. A Normal probability plot for the gamma regression data is given in Figure 4.6.
4.5.3
Plots of residuals against covariates
A plot of residuals against quantitative covariates can be used to detect whether the assumed model is too simple. In simple linear models, systematic patterns in this kind of plot may indicate that non-linear terms are needed, or that some observations may be outliers. A plot of deviance residuals against log(concentration) for the gamma regression data is given in Figure 4.7. The figure may indicate that the two observations in the lower left corner are outliers. These are the same two observations that stand out in Figure 4.5.
Figure 4.6: Normal probability plot for the gamma regression data
Figure 4.7: Plot of deviance residuals against Log(conc) for the gamma regression data
4.5.4
Influence diagnostics
The value of Dfbeta with respect to log(conc) was calculated for all observations and plotted against log(conc). The resulting plot is given in Figure 4.8. The figure shows that observations with the lowest value of log(conc) have the largest influence on the results. These are the same observations that were noted in other diagnostic plots above; the two possible outliers noted earlier are actually placed on top of each other in this plot.

The diagonal elements of the Hat matrix were computed using Proc Insight (SAS, 2000b). These values are plotted against the sequence numbers of the observations in Figure 4.9. Since there are four parameters and n = 18, the average leverage is 4/18 = 0.222. As noted in Chapter 3, observations with a leverage above twice that amount, i.e. here above 2 · 0.22 = 0.44, should be examined. For these data the first two observations have a high leverage; these are the observations that have been noted in the other diagnostic plots.
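The leverage rule of thumb amounts to a one-line calculation. The Python sketch below is illustrative only; the function names are assumptions, and the hat values themselves would have to come from the fitted model (they are not reproduced in the text).

```python
def leverage_cutoff(n_params, n_obs, multiple=2.0):
    """Return the average leverage p/n and the cutoff multiple * p/n."""
    average = n_params / n_obs
    return average, multiple * average

def high_leverage(hat_values, n_params, multiple=2.0):
    """1-based sequence numbers of observations whose leverage exceeds the cutoff."""
    _, cutoff = leverage_cutoff(n_params, len(hat_values), multiple)
    return [i for i, h in enumerate(hat_values, start=1) if h > cutoff]

average, cutoff = leverage_cutoff(n_params=4, n_obs=18)
print(f"average leverage = {average:.3f}, cutoff = {cutoff:.3f}")
```

Without intermediate rounding the cutoff is 8/18, slightly above the 0.44 quoted in the text.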
Figure 4.8: Dfbeta plotted against log(conc) for the gamma regression data.
Figure 4.9: Leverage plot, Gamma regression data.
4.6
Exercises
Exercise 4.1 The following data, taken from Box and Cox (1964), show the survival times (in 10 hour units) of a certain variety of animals. The experiment is a two-way factorial experiment with factors Poison (three levels) and Treatment (four levels).

                      Treatment
    Poison       A       B       C       D
    I         0.31    0.82    0.43    0.45
              0.45    1.10    0.45    0.71
              0.46    0.88    0.63    0.66
              0.43    0.72    0.76    0.62
    II        0.36    0.92    0.44    0.56
              0.29    0.61    0.35    1.02
              0.40    0.49    0.31    0.71
              0.23    1.24    0.40    0.38
    III       0.22    0.30    0.23    0.30
              0.21    0.37    0.25    0.36
              0.18    0.38    0.24    0.31
              0.23    0.29    0.22    0.33

Analyze these data to find possible effects of poison, treatment, and interactions. The analysis suggested by Box and Cox was a standard two-way ANOVA on the data transformed as z = 1/y. Make this analysis, and also make a generalized linear model analysis assuming that the data can be approximated with a gamma distribution. In both cases, make residual diagnostics and influence diagnostics.

Exercise 4.2 The data given below are the time intervals (in seconds) between successive pulses along a nerve fibre. Data were extracted from Cox and Lewis (1966), who gave credit to Drs. P. Fatt and B. Katz. The original data set consists of 799 observations; we use the first 200 observations only. If pulses arrive in a completely random fashion one would expect the distribution of waiting times between pulses to follow an exponential distribution. Fit an exponential distribution to these data by applying a generalized linear model with an appropriate distribution and link, and where the linear predictor only contains an intercept. Compare the observed data with the fitted distribution using different kinds of plots. The data are as follows:
0.21 0.18 0.02 0.15 0.15 0.24 0.02 0.06 0.55 0.05 0.38 0.01 0.06 0.09 0.08 0.38 0.74 0.17 0.05 0.30 0.49 0.01 0.96 0.23 0.74 0.01 0.09 0.05 0.26 0.05 0.24 0.26 0.16 0.15
0.03 0.55 0.14 0.08 0.09 0.29 0.15 0.51 0.28 0.07 0.38 0.16 0.06 0.04 0.01 0.08 0.15 0.64 0.34 0.07 0.07 0.35 0.14 0.31 0.30 0.51 0.20 0.08 0.07 0.03 0.08 0.06 0.78 0.29
0.05 0.37 0.09 0.24 0.03 0.16 0.12 0.11 0.04 0.11 0.01 0.05 0.06 0.27 0.70 0.32 0.07 0.61 0.07 0.12 0.11 0.45 1.38 0.05 0.09 0.12 0.03 0.04 0.68 0.40 0.23 0.40 0.04
0.11 0.09 0.05 0.16 0.21 0.07 0.26 0.28 0.01 0.38 0.06 0.10 0.11 0.50 0.04 0.39 0.26 0.15 0.10 0.01 0.35 0.07 0.15 0.05 0.02 0.12 0.05 0.09 0.15 0.04 0.10 0.51 0.27
0.59 0.14 0.15 0.06 0.02 0.07 0.15 0.36 0.94 0.21 0.13 0.16 0.44 0.25 0.08 0.58 0.25 0.26 0.09 0.16 1.21 0.93 0.01 0.29 0.19 0.43 0.13 0.10 0.01 0.21 0.19 0.15 0.35
0.06 0.19 0.23 0.11 0.14 0.04 0.33 0.14 0.73 0.49 0.06 0.06 0.05 0.25 0.16 0.56 0.01 0.03 0.02 0.14 0.17 0.04 0.05 0.01 0.47 0.32 0.15 0.10 0.27 0.29 0.20 1.10 0.71
5. Binary and binomial response variables
In binary and binomial models, we model the response probabilities as functions of the predictors. A probability has range 0 ≤ p ≤ 1. Since the linear predictor Xβ can take on any value on the real line, we would like the model to use a link g (p) that transforms a probability to the range (−∞, ∞). Three different functions are often used for this purpose: the probit link; the logit link; and the complementary log-log link. We will briefly discuss some arguments related to the choice of link for binary and binomial data.
5.1
Link functions
5.1.1
The probit link
The probit link transforms a probability by applying the function $\Phi^{-1}(p)$, where Φ is the standard Normal distribution function. One way to justify the probit link is as follows. Suppose that underlying the observed binary response Y is a continuous variable ξ that is normally distributed. If the value of ξ is larger than some threshold τ, then we observe Y = 0, else we observe Y = 1. The Normal distribution, used in this context, is called a tolerance distribution. This situation is illustrated in Figure 5.1. In mathematical terms, the probit is that value of τ for which

$$p = \Phi(\tau) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\tau} e^{-u^2/2}\, du. \tag{5.1}$$

This is the integral of the standard Normal density up to τ. Thus, $\tau = \Phi^{-1}(p)$. In the original work leading to the probit (see Finney, 1947), the probit was defined as probit(p) = 5 + $\Phi^{-1}(p)$, to avoid working with negative numbers. However, most current computer programs define the probit without addition of the constant 5.

Figure 5.1: The relation between y and ξ for a probit model
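The probit transformation is available in Python's standard library through the inverse Normal distribution function. The sketch below is illustrative; the helper names `probit` and `finney_probit` are assumptions made here, the latter following the historical definition with the added constant 5.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard Normal: mean 0, standard deviation 1

def probit(p):
    """Probit as defined by most current software: the inverse of Phi."""
    return std_normal.inv_cdf(p)

def finney_probit(p):
    """Historical definition with the constant 5 added."""
    return 5.0 + probit(p)

for p in (0.025, 0.25, 0.5, 0.75, 0.975):
    print(f"p = {p:5.3f}  probit = {probit(p):7.4f}  5 + probit = {finney_probit(p):7.4f}")
```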
5.1.2
The logit link
The logit link, or logistic transformation, transforms a probability as

$$\mathrm{logit}(p) = \log\frac{p}{1-p}. \tag{5.2}$$

The ratio $\frac{p}{1-p}$ is the odds of success, so the logit is often called the log odds. The logit function is symmetric around p = 0.5, in the sense that logit(1 − p) = −logit(p), and its inverse is a sigmoid function. The logistic link is rather close to the probit link, and since it is easier to handle mathematically, some authors prefer it to the probit link. The logit link is the canonical link for the binomial distribution so it is often the natural choice of link for binary and binomial data. The logit link corresponds to a tolerance distribution that is called the logistic distribution. This distribution has density

$$f(y) = \frac{\beta e^{\alpha+\beta y}}{\left[1 + e^{\alpha+\beta y}\right]^2}.$$

5.1.3
The complementary log-log link
The complementary log-log link is based on arguments originating from a method called dilution assay. This is a method for estimating the number of active organisms in a solution. The method works as follows. The solution containing the organisms is progressively diluted. Samples from each dilution are applied to plates that contain some growth medium. After some time it is possible to record, for each plate, whether it has been infected by the organism or not. Suppose that the original solution contained N individuals per unit volume. This means that dilution by a factor of two gives a solution with $N/2$ individuals per unit volume. After i dilutions the concentration is $N/2^i$. If the organisms are randomly distributed one would expect the number of individuals per unit volume to follow a Poisson distribution with mean $\mu_i$. Thus, $\mu_i = N/2^i$ or, by taking logarithms, $\log \mu_i = \log N - i \log 2$. The probability that a plate will contain no organisms, assuming a Poisson distribution, is $e^{-\mu_i}$. Thus, if $p_i$ is the probability that growth occurs under dilution i, then $p_i = 1 - e^{-\mu_i}$. Therefore, $\mu_i = -\log(1 - p_i)$, which gives

$$\log \mu_i = \log\left[-\log(1 - p_i)\right]. \tag{5.3}$$

This is the complementary log-log link: $\log[-\log(1 - p_i)]$. As opposed to the probit and logit links, this function is asymmetric around p = 0.5. The tolerance distribution that corresponds to the complementary log-log link is called the extreme value distribution, or a Gumbel distribution, and has density

$$f(y) = \beta \exp\left[(\alpha + \beta y) - e^{(\alpha+\beta y)}\right].$$

The probit, logit and complementary log-log links are compared in Figure 5.2.
5.2
Distributions for binary and binomial data
5.2.1
The Bernoulli distribution
A binary random variable that takes on the values 1 and 0 with probabilities p and 1 − p, respectively, is said to follow a Bernoulli distribution. The probability function of a Bernoulli random variable y is

$$f(y) = p^y (1-p)^{1-y} = \begin{cases} 1-p & \text{if } y = 0 \\ p & \text{if } y = 1 \end{cases} \tag{5.4}$$

The Bernoulli distribution has mean value E(y) = p and variance Var(y) = p(1 − p).

Figure 5.2: The probit, logit and complementary log-log links (transformed value plotted against probability)
5.2.2
The Binomial distribution
If a Bernoulli trial is repeated n times such that the trials are independent, then y = the number of successes (1:s) among the n trials follows a binomial distribution with parameters n and p. The probability function of the binomial distribution is

$$f(y) = \binom{n}{y} p^y (1-p)^{n-y}. \tag{5.5}$$

The binomial distribution has mean E(y) = np and variance Var(y) = np(1 − p). The proportion of successes, $\hat{p} = y/n$, follows the same distribution, except for a scale factor: the probability that $\hat{p}$ equals y/n is f(y). It holds that $E(\hat{p}) = p$ and $Var(\hat{p}) = \frac{p(1-p)}{n}$. As was demonstrated in formula (2.6) on page 37, the binomial distribution is a member of the exponential family. Since the Bernoulli distribution is a special case of the binomial distribution with n = 1, the Bernoulli distribution is also an exponential family distribution.
When the binomial distribution is applied for modeling real data, the crucial assumption is the assumption of independence. If independence does not hold, this can often be diagnosed as over-dispersion.
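The mean and variance formulas for the binomial distribution and for the proportion of successes can be confirmed directly from the probability function. This Python sketch is illustrative; n = 50 and p = 0.3 are arbitrary values chosen here.

```python
import math

def binomial_pmf(y, n, p):
    """Binomial probability function (5.5)."""
    return math.comb(n, y) * p ** y * (1 - p) ** (n - y)

n, p = 50, 0.3  # arbitrary illustrative values
probs = [binomial_pmf(y, n, p) for y in range(n + 1)]

mean = sum(y * f for y, f in enumerate(probs))
var = sum((y - mean) ** 2 * f for y, f in enumerate(probs))

print(mean, n * p)                       # E(y) next to np
print(var, n * p * (1 - p))              # Var(y) next to np(1 - p)
print(var / n ** 2, p * (1 - p) / n)     # Var(y/n) next to p(1 - p)/n
```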
5.3
Probit analysis
Example 5.1 Finney (1947) reported on an experiment on the effect of Rotenone, in different concentrations, when sprayed on the insect Macrosiphoniella sanborni, in batches of about fifty. The results were:

    Conc   Log(Conc)   No. of insects   No. affected   % affected
    10.2      1.01           50              44            88
     7.7      0.89           49              42            86
     5.1      0.71           46              24            52
     3.8      0.58           48              16            33
     2.6      0.41           50               6            12
A plot of the relation between the proportion of affected insects and Log(Conc) is given below. A fitted distribution is also included.

Figure: Relation between log(Conc) and proportion affected

This situation is an example of a “probit analysis” setting. The dependent variable is a proportion. The probit analysis approach is to assume a linear relation between $\Phi^{-1}(p)$ and log(dose), where $\Phi^{-1}$ is the inverse of the cumulative Normal distribution (the so-called probit), and p is the proportion affected in the population. This can be achieved as a generalized linear model by specifying a binomial distribution for the response and using a probit link. The following SAS program was used to analyze these data using Proc Genmod:
    DATA probit;
      INPUT conc n x;
      logconc=log10(conc);
      CARDS;
    10.2 50 44
    7.7 49 42
    5.1 46 24
    3.8 48 16
    2.6 50 6
    ;
    PROC GENMOD DATA=probit;
      MODEL x/n = logconc / LINK = probit DIST = bin ;
    RUN;
Part of the output is as follows:

    The GENMOD Procedure

    Model Information

    Description            Value
    Data Set               WORK.PROBIT
    Distribution           BINOMIAL
    Link Function          PROBIT
    Dependent Variable     X
    Dependent Variable     N
    Observations Used      5
    Number Of Events       132
    Number Of Trials       243
This part simply gives us confirmation that we are using a Binomial distribution and a Probit link.

    Criteria For Assessing Goodness Of Fit

    Criterion             DF        Value     Value/DF
    Deviance               3       1.7390       0.5797
    Scaled Deviance        3       1.7390       0.5797
    Pearson Chi-Square     3       1.7289       0.5763
    Scaled Pearson X2      3       1.7289       0.5763
    Log Likelihood         .    -120.0516            .
This section gives information about the fit of the model to the data. The deviance can be interpreted as a $\chi^2$ variate on 3 degrees of freedom, if the sample is large. In this case, the value is 1.74 which is clearly non-significant, indicating a good fit. Collett (1991) states that “a useful rule of thumb is that when the deviance on fitting a linear logistic model is approximately equal to its degrees of freedom, the model is satisfactory” (p. 66).

    Analysis Of Parameter Estimates

    Parameter     DF   Estimate   Std Err   ChiSquare   Pr>Chi
    INTERCEPT      1    -2.8875    0.3501     68.0085   0.0001
    LOGCONC        1     4.2132    0.4783     77.5919   0.0001
    SCALE          0     1.0000    0.0000           .        .
The output finally contains an Analysis of Parameter estimates. This gives estimates of model parameters, their standard errors, and a Wald test of each parameter in the form of a $\chi^2$ test. In this case, the estimated model is

$$\Phi^{-1}(p) = -2.8875 + 4.2132 \cdot \log(conc).$$

The dose that affects 50% of the animals ($ED_{50}$) can be calculated: if p = 0.5 then $\Phi^{-1}(p) = 0$, from which

$$\log(conc) = -\frac{\hat{\beta}_0}{\hat{\beta}_1} = \frac{2.8875}{4.2132} = 0.68535 \quad \text{giving} \quad conc = 10^{0.68535} = 4.8456.$$
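The ED50 calculation and the goodness-of-fit assessment can be reproduced with a short Python sketch (an illustration, not SAS output). It assumes the base-10 logarithm used in the data step; the helper `chi2_sf_3df` is a name introduced here and uses the closed form of the $\chi^2$ tail probability that holds for 3 degrees of freedom.

```python
import math

b0, b1 = -2.8875, 4.2132   # probit estimates from the Genmod output

log_ed50 = -b0 / b1        # value of log10(conc) at which the probit is 0
ed50 = 10 ** log_ed50      # back-transform, since logconc = log10(conc)

def chi2_sf_3df(x):
    """Upper tail probability of a chi-square variate with 3 df."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

p_fit = chi2_sf_3df(1.7390)  # deviance 1.7390 on 3 df

print(f"log10(ED50) = {log_ed50:.5f}, ED50 = {ed50:.4f}")
print(f"p-value for the deviance = {p_fit:.3f}")
```

A large p-value for the deviance is consistent with the statement that the fit is good.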
□
5.4
Logit (logistic) regression
Example 5.2 Since the logit and probit links are very similar, we can alternatively analyze the data in Example 5.1 using a binomial distribution with a logit link function. The program and part of the output are similar to the probit analysis. The fit of the model is excellent, as for the probit analysis case:

    Criteria For Assessing Goodness Of Fit

    Criterion             DF        Value     Value/DF
    Deviance               3       1.4241       0.4747
    Scaled Deviance        3       1.4241       0.4747
    Pearson Chi-Square     3       1.4218       0.4739
    Scaled Pearson X2      3       1.4218       0.4739
    Log Likelihood         .    -119.8942            .
The parameter estimates are given by the last part of the output:

    Analysis Of Parameter Estimates

    Parameter     DF   Estimate   Std Err   ChiSquare   Pr>Chi
    INTERCEPT      1    -4.8869    0.6429     57.7757   0.0001
    LOGCONC        1     7.1462    0.8928     64.0744   0.0001
    SCALE          0     1.0000    0.0000           .        .

The resulting estimated model is

$$\mathrm{logit}(\hat{p}) = \log\frac{\hat{p}}{1-\hat{p}} = -4.8869 + 7.1462 \log(conc).$$

The estimated model parameters permit us to estimate e.g. the dose that gives a 50% effect ($ED_{50}$) as the value of log(conc) for which p = 0.5. Since $\log\frac{0.5}{1-0.5} = 0$, this value is

$$ED_{50} = -\frac{\hat{\beta}_0}{\hat{\beta}_1} = -\frac{-4.8869}{7.1462} = 0.6839$$

which, on the dose scale, is $10^{0.6839} = 4.83$. This is similar to the estimate provided by the probit analysis. Note that the estimated proportion affected at a given concentration can be obtained from

$$\hat{p} = \frac{\exp(-4.8869 + 7.1462 \cdot \log(conc))}{1 + \exp(-4.8869 + 7.1462 \cdot \log(conc))}.$$

□
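The fitted logistic model can be evaluated at the observed concentrations to compare estimated and observed proportions. The Python sketch below is illustrative: it assumes the base-10 logarithm from the data step, and the function name is hypothetical.

```python
import math

b0, b1 = -4.8869, 7.1462   # logit estimates from the Genmod output

def predicted_proportion(conc):
    """Estimated proportion affected at a given concentration."""
    eta = b0 + b1 * math.log10(conc)
    return math.exp(eta) / (1.0 + math.exp(eta))

# (concentration, number of insects, number affected) from Example 5.1
data = [(10.2, 50, 44), (7.7, 49, 42), (5.1, 46, 24), (3.8, 48, 16), (2.6, 50, 6)]
for conc, n, x in data:
    print(f"conc {conc:5.1f}  observed {x / n:.3f}  fitted {predicted_proportion(conc):.3f}")

ed50 = 10 ** (-b0 / b1)    # dose at which the fitted proportion is 0.5
print(f"ED50 = {ed50:.2f}")
```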
It can be mentioned that data in the form of proportions have previously often been analyzed as general linear models by using the so-called Arc sine transformation $y = \arcsin\left(\sqrt{\hat{p}}\right)$ (see e.g. Snedecor and Cochran, 1980).
5.5
Multiple logistic regression
5.5.1
Model building
Model building in multiple logistic regression models can be done in essentially the same way as in standard multiple regression.

Example 5.3 The data in Table 5.1, taken from Collett (1991), were collected to explore whether it was possible to diagnose nodal involvement in prostatic cancer based on non-invasive methods. The variables are:

    Age           Age of the patient
    Acid          Level of serum acid phosphatase
    X-ray         Result of x-ray examination (0=negative, 1=positive)
    Size          Tumour size (0=small, 1=large)
    Grade         Tumour grade (0=less serious, 1=more serious)
    Involvement   Nodal involvement (0=no, 1=yes)
The data analytic task is to explore whether the independent variables can be used to predict the probability of nodal involvement. We have two continuous covariates and three covariates coded as dummy variables. Initial analysis of the data suggests that the value of Acid should be log-transformed prior to the analysis.

There are 32 possible linear logistic models, excluding interactions. As a first step in the analysis, all these models were fitted to the data. A summary of the results is given in Table 5.2. A useful rule-of-thumb in model building is to keep in the model all terms that are significant at, say, the 20% level. In this case, a kind of backward elimination process would start with the full model. We would then delete Grade from the model (p = 0.29). In the model with Age, log(acid), x-ray and size, age is not significant (p = 0.26). This suggests a model that includes log(acid), x-ray and size; in this model, all terms are significant (p < 0.05). There are no indications of non-linear relations between log(acid) and the probability of nodal involvement.

It remains to investigate whether any interactions between the terms in the model would improve the fit. To check this, interaction terms were added to the full model. Since there are five variables, the model was tested with all 10 possible pairwise interactions. The interactions size*grade (p = 0.01) and logacid*grade (p = 0.10) were judged to be large enough for further consideration. Note that grade was not suggested by the analysis until the interactions were included. We then tried a model with both these interactions. Age could be deleted. The resulting model includes logacid (p = 0.06), x-ray (p = 0.03), size (p = 0.21), grade (p = 0.19), logacid*grade (p = 0.11), and size*grade (p = 0.02). Part of the output is:

    Criteria For Assessing Goodness Of Fit

    Criterion             DF       Value     Value/DF
    Deviance              46     36.2871       0.7889
    Scaled Deviance       46     36.2871       0.7889
    Pearson Chi-Square    46     42.7826       0.9301
    Scaled Pearson X2     46     42.7826       0.9301
    Log Likelihood         .    -18.1436            .
    Age   Acid   X-ray   Size   Grade   Involv      Age   Acid   X-ray   Size   Grade   Involv
     66   0.48     0      0       0       0          64   0.40     0      1       1       0
     68   0.56     0      0       0       0          61   0.50     0      1       0       0
     66   0.50     0      0       0       0          64   0.50     0      1       1       0
     56   0.52     0      0       0       0          63   0.40     0      1       0       0
     58   0.50     0      0       0       0          52   0.55     0      1       1       0
     60   0.49     0      0       0       0          66   0.59     0      1       1       0
     65   0.46     1      0       0       0          58   0.48     1      1       0       1
     60   0.62     1      0       0       0          57   0.51     1      1       1       1
     50   0.56     0      0       1       1          65   0.49     0      1       0       1
     49   0.55     1      0       0       0          65   0.48     0      1       1       0
     61   0.62     0      0       0       0          59   0.63     1      1       1       0
     58   0.71     0      0       0       0          61   1.02     0      1       0       0
     51   0.65     0      0       0       0          53   0.76     0      1       0       0
     67   0.67     1      0       1       1          67   0.95     0      1       0       0
     67   0.47     0      0       1       0          53   0.66     0      1       1       0
     51   0.49     0      0       0       0          65   0.84     1      1       1       1
     56   0.50     0      0       1       0          50   0.81     1      1       1       1
     60   0.78     0      0       0       0          60   0.76     1      1       1       1
     52   0.83     0      0       0       0          45   0.70     0      1       1       1
     56   0.98     0      0       0       0          56   0.78     1      1       1       1
     67   0.52     0      0       0       0          46   0.70     0      1       0       1
     63   0.75     0      0       0       0          67   0.67     0      1       0       1
     59   0.99     0      0       1       1          63   0.82     0      1       0       1
     64   1.87     0      0       0       0          57   0.67     0      1       1       1
     61   1.36     1      0       0       1          51   0.72     1      1       0       1
     56   0.82     0      0       0       1          64   0.89     1      1       0       1
     68   1.26     1      1       1       1

Table 5.1: Predictors of nodal involvement on prostate cancer patients
    Terms                                   Deviance   df
    (Intercept only)                           70.25   52
    Age                                        69.16   51
    log(acid)                                  64.81   51
    Xray                                       59.00   51
    Size                                       62.55   51
    Grade                                      66.20   51
    Age, log(acid)                             63.65   50
    Age, x-ray                                 57.66   50
    Age, size                                  61.43   50
    Age, grade                                 65.24   50
    log(acid), x-ray                           55.27   50
    log(acid), size                            56.48   50
    log(acid), grade                           59.55   50
    x-ray, size                                53.35   50
    x-ray, grade                               56.70   50
    size, grade                                61.30   50
    Age, log(acid), x-ray                      53.78   49
    Age, log(acid), size                       55.22   49
    Age, log(acid), grade                      58.52   49
    Age, x-ray, size                           52.09   49
    Age, x-ray, grade                          55.49   49
    Age, size, grade                           60.28   49
    log(acid), x-ray, size                     48.99   49
    log(acid), x-ray, grade                    52.03   49
    log(acid), size, grade                     54.51   49
    x-ray, size, grade                         52.78   49
    age, log(acid), x-ray, size                47.68   48
    age, log(acid), x-ray, grade               50.79   48
    age, log(acid), size, grade                53.38   48
    log(acid), x-ray, size, grade              47.78   48
    age, x-ray, size, grade                    51.57   48
    age, log(acid), x-ray, size, grade         46.56   47

Table 5.2: Deviances for the nodal involvement data
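Nested models in Table 5.2 can be compared through differences in deviance, each referred to a $\chi^2$ distribution with 1 df when the models differ by one term. The Python sketch below illustrates this for two steps of the backward elimination; the function name is an assumption, and the p-values quoted in the text may come from Wald tests, so they need not agree exactly with these likelihood ratio values.

```python
import math

def lr_test_1df(deviance_reduced, deviance_full):
    """Deviance difference and its p-value from a chi-square with 1 df."""
    diff = deviance_reduced - deviance_full
    p_value = math.erfc(math.sqrt(diff / 2.0))
    return diff, p_value

# Deleting grade from the model with all five terms (deviances from Table 5.2)
diff_grade, p_grade = lr_test_1df(47.68, 46.56)

# Deleting age from the model with age, log(acid), x-ray and size
diff_age, p_age = lr_test_1df(48.99, 47.68)

print(f"grade: difference = {diff_grade:.2f}, p = {p_grade:.3f}")
print(f"age:   difference = {diff_age:.2f}, p = {p_age:.3f}")
```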
    Analysis Of Parameter Estimates

    Parameter                  DF   Estimate   Std Err   ChiSquare   Pr>Chi
    INTERCEPT                   1     7.2391    3.4133      4.4980   0.0339
    LOGACID                     1    12.1345    6.5154      3.4686   0.0625
    XRAY            0           1    -2.3404    1.0845      4.6571   0.0309
    XRAY            1           0     0.0000    0.0000           .        .
    SIZE            0           1     2.5098    2.0218      1.5410   0.2145
    SIZE            1           0     0.0000    0.0000           .        .
    GRADE           0           1    -4.3134    3.2696      1.7404   0.1871
    GRADE           1           0     0.0000    0.0000           .        .
    LOGACID*GRADE   0           1   -10.4260    6.6403      2.4652   0.1164
    LOGACID*GRADE   1           0     0.0000    0.0000           .        .
    SIZE*GRADE      0    0      1    -5.6477    2.4346      5.3814   0.0204
    SIZE*GRADE      0    1      0     0.0000    0.0000           .        .
    SIZE*GRADE      1    0      0     0.0000    0.0000           .        .
    SIZE*GRADE      1    1      0     0.0000    0.0000           .        .
    SCALE                       0     1.0000    0.0000           .        .
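The model fits well, with Deviance/df = 0.79. Since the size*grade interaction is included in the model, the main effects of size and of grade should also be included. The output suggests the following models for grade 0 and 1, respectively:

$$\text{Grade 0:} \quad \mathrm{logit}(\hat{p}) = 2.93 + 1.71 \cdot \log(acid) - 2.34 \cdot \text{x-ray} - 3.14 \cdot \text{size}$$

$$\text{Grade 1:} \quad \mathrm{logit}(\hat{p}) = 7.24 + 12.13 \cdot \log(acid) - 2.34 \cdot \text{x-ray} + 2.51 \cdot \text{size}$$

Here x-ray and size denote the indicator variables estimated in the output, which take the value 1 at level 0 of the factor (negative x-ray result, small tumour), because Genmod uses the last level of each class variable as the reference level. The probability of nodal involvement increases with increasing acid level. The increase is higher for patients with serious (grade 1) tumors. □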
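The way the class-variable estimates combine into a prediction can be made explicit in code. The following Python sketch is an illustration under the assumption, read off the zero rows of the output, that level 1 is the reference level for x-ray, size and grade; the function names are hypothetical, and log(acid) is passed in directly so that no assumption about the base of the logarithm is needed.

```python
import math

def linear_predictor(log_acid, xray, size, grade):
    """Logit of the probability of nodal involvement from the Genmod estimates."""
    eta = 7.2391 + 12.1345 * log_acid        # INTERCEPT and LOGACID
    if xray == 0:
        eta += -2.3404                        # XRAY 0
    if size == 0:
        eta += 2.5098                         # SIZE 0
    if grade == 0:
        eta += -4.3134                        # GRADE 0
        eta += -10.4260 * log_acid            # LOGACID*GRADE 0
        if size == 0:
            eta += -5.6477                    # SIZE*GRADE 0 0
    return eta

def probability(log_acid, xray, size, grade):
    """Inverse logit of the linear predictor."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(log_acid, xray, size, grade)))

# Intercept and slope implied for grade 0 with positive x-ray and a large tumour
print(round(linear_predictor(0.0, 1, 1, 0), 2))   # intercept for grade 0
print(round(linear_predictor(1.0, 1, 1, 0) - linear_predictor(0.0, 1, 1, 0), 2))  # slope
```

The intercept and slope printed for grade 0 can be compared with the coefficients 2.93 and 1.71 in the text.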
5.5.2
Model building tools
A set of tools for model building in logistic regression has been developed. These tools are similar to the tools used in multiple regression analysis. The Logistic procedure in the SAS package includes the following variable selection methods:

Forward selection: Starting with an empty model, the procedure adds, at each step, the variable that would give the lowest p-value of the remaining variables. The procedure stops when all variables have been added, or when no variables meet the pre-specified limit for the p-value.

Backward selection: Starting with a model containing all variables, variables are step by step deleted from the model until all variables remaining in the model meet a specified limit for their p-values. At each step, the variable with the largest p-value is deleted.

Stepwise selection: This is a modification of the forward selection model. Variables are added to the model step by step. In each step, the procedure also examines whether variables already in the model can be deleted.

Best subset selection: For k = 1, 2, ... up to a user-specified limit, the method identifies a specified number of best models containing k variables. Tests for this method are based on score statistics (see Chapter 2).

Although automatic variable selection methods may sometimes be useful for “quick and dirty” model building, they should be handled with caution. There is no guarantee that an automatic procedure will always come up with the correct answer; see Agresti (1990) for a further discussion.

Figure 5.3: Residual plots for nodal involvement data (panels: Normal plot of residuals, I chart of residuals, histogram of residuals, residuals vs. fits)
5.5.3 Model diagnostics
As an illustration of model diagnostics for logistic regression models, the predicted values and the residuals were stored as new variables for the multiple logistic regression data (Table 5.1). Based on these, a set of standard diagnostic plots was prepared using Minitab. These plots are reproduced in Figure 5.3.
It appears that the distribution of the (deviance) residuals is reasonably Normal. The “runs” that are shown in the I chart appear because of the way the data set was sorted. Note that the Residuals vs. Fits plot is not very informative for binary data. This is because the points scatter in two groups: one for observations with y = 1 and another group for observations with y = 0.
5.6 Odds ratios

If an event occurs with probability p, then the odds in favor of the event is

$$\text{Odds} = \frac{p}{1-p}. \tag{5.6}$$

For example, if an event occurs with probability p = 0.75, then the odds in favor of that event is 0.75/(1 − 0.75) = 3. This means that a “success” is three times as likely as a “failure”. If the odds are known, the probability p can be calculated as $p = \text{Odds}/(\text{Odds}+1)$. A comparison between two events, or a comparison between e.g. two groups of individuals with respect to some event, can be made by computing the odds ratio

$$OR = \frac{p_1/(1-p_1)}{p_2/(1-p_2)}. \tag{5.7}$$
If $p_1 = p_2$ then OR = 1. An odds ratio larger than 1 is an indication that the event is more likely in the first group than in the second group. The odds ratio can be estimated from sample data as long as the relevant probabilities can be estimated. Estimated odds ratios are, of course, subject to random variation. If the probabilities are small, the odds ratio can be used as an approximation to the relative risk, which is defined as

$$RR = \frac{p_1}{p_2}. \tag{5.8}$$

In some sampling schemes, it is not possible to estimate the relative risk but it may be possible to estimate the odds ratio. One example of this is in so-called case-control studies. In such studies a number of patients with a certain disease are studied. One (or more) healthy person is selected as a control for each patient in the study. The presence or absence of certain risk factors is assessed both for the patients and for the controls. Because of the way the sample was selected, the question whether the risk factor is related to disease occurrence cannot be answered by computing a risk ratio, but it may be possible to estimate the odds ratio.
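The relations (5.6) to (5.8) are easy to check numerically. The following Python sketch uses helper names of our own choosing; the probabilities in the last lines are invented values, chosen only to show when the odds ratio is close to the relative risk and when it is not.

```python
def odds(p: float) -> float:
    """Odds in favor of an event with probability p, as in (5.6)."""
    return p / (1.0 - p)


def prob_from_odds(o: float) -> float:
    """Inverse of (5.6): p = Odds / (Odds + 1)."""
    return o / (o + 1.0)


def odds_ratio(p1: float, p2: float) -> float:
    """Odds ratio (5.7) for two groups with probabilities p1 and p2."""
    return odds(p1) / odds(p2)


def relative_risk(p1: float, p2: float) -> float:
    """Relative risk (5.8)."""
    return p1 / p2


print(odds(0.75))            # the example in the text
print(prob_from_odds(3.0))   # back to the probability

# Invented probabilities: rare event versus common event
print(odds_ratio(0.002, 0.001), relative_risk(0.002, 0.001))
print(odds_ratio(0.8, 0.4), relative_risk(0.8, 0.4))
```

With the small probabilities the two measures nearly coincide, while with the large probabilities the odds ratio is three times the relative risk.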
Example 5.4 Freeman (1989) reports on a study designed to assess the relation between smoking and survival of newborn babies. 4915 babies of young mothers were followed during their first year. For each baby it was recorded whether the mother smoked and whether the baby survived the first year. Data are as follows:

              Survived
Smoker       Yes      No
Yes          499      15
No          4327      74
The probability of dying for babies of smoking mothers is estimated as 15/(499 + 15) = 0.02918 and for babies of non-smoking mothers it is 74/(4327 + 74) = 0.01681. The odds ratio is

$$\frac{0.02918/(1-0.02918)}{0.01681/(1-0.01681)} = 1.758.$$

The odds of death for the baby is higher for smoking mothers. □

Odds ratios can be estimated using logistic regression. Note that in logistic regression we use the model $\log\frac{p}{1-p} = \alpha + \beta x$ where, in this case, x is a dummy variable with value 1 for the smokers and 0 for the nonsmokers. This is the log of the odds, so the odds is $\exp(\alpha + \beta x)$. Using x = 1 in the numerator and x = 0 in the denominator gives the odds ratio as

$$OR = \frac{\exp(\alpha + \beta)}{\exp(\alpha)} = e^{\beta}.$$
Thus, the odds ratio can be obtained by exponentiating the regression coefficient β in a logistic regression. We will use the same data as in the previous example to illustrate this. A SAS program for this analysis can be written as follows:
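/* One record per cell of the 2x2 table; n is the cell frequency */
DATA babies;
  INPUT smoking $ survival $ n;
  CARDS;
Yes Yes  499
Yes No    15
No  Yes 4327
No  No    74
;
/* Logistic regression of survival on smoking, weighting each record by n */
PROC GENMOD DATA=babies order=data;
  CLASS smoking;
  FREQ n;
  MODEL survival = smoking / dist=bin link=logit;
RUN;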
Part of the output is as follows:

Criteria For Assessing Goodness Of Fit

Criterion               DF         Value     Value/DF
Deviance              4913      886.9891       0.1805
Scaled Deviance       4913      886.9891       0.1805
Pearson Chi-Square    4913     4914.9964       1.0004
Scaled Pearson X2     4913     4914.9964       1.0004
Log Likelihood                 -443.4945
The model fit, as judged by Deviance/df, is excellent.
Analysis Of Parameter Estimates

                                    Standard    Wald 95% Confidence
Parameter          DF   Estimate       Error          Limits           ChiSquare   Pr > ChiSq
Intercept           1    -4.0686      0.1172    -4.2983    -3.8388       1204.34       <.0001
smoking     Yes     1     0.5640      0.2871     0.0013     1.1267          3.86       0.0495
smoking     No      0     0.0000      0.0000     0.0000     0.0000           .            .
Scale               0     1.0000      0.0000     1.0000     1.0000
There is a significant relationship between smoking and the risk of dying for the babies (p = 0.0495). The odds ratio can be calculated as $e^{0.5640} = 1.7577$, which is the same result as we got above by hand calculation. But the Genmod procedure also gives a test and a confidence interval for the parameter.
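As a cross-check outside SAS, the estimate, standard error, Wald limits and Wald chi-square for the smoking parameter can be reproduced from the four cell counts alone. The sketch below relies on two standard facts for a 2×2 table: the logistic model with one dummy variable reproduces the sample log odds ratio, and the large-sample standard error of the log odds ratio is the square root of the sum of the reciprocal cell counts. The variable names are ours.

```python
import math

# Counts from Example 5.4: deaths and survivors by mother's smoking status
died_smoker, survived_smoker = 15, 499
died_nonsmoker, survived_nonsmoker = 74, 4327

# Sample odds of death in each group and the log odds ratio
odds_smoker = died_smoker / survived_smoker
odds_nonsmoker = died_nonsmoker / survived_nonsmoker
log_or = math.log(odds_smoker / odds_nonsmoker)

# Large-sample standard error of the log odds ratio in a 2x2 table
se = math.sqrt(
    1 / died_smoker + 1 / survived_smoker
    + 1 / died_nonsmoker + 1 / survived_nonsmoker
)

z = 1.959964  # 97.5 % point of the standard Normal distribution
lower, upper = log_or - z * se, log_or + z * se
wald_chi2 = (log_or / se) ** 2

print(f"log OR = {log_or:.4f}, SE = {se:.4f}")
print(f"95% limits for beta: ({lower:.4f}, {upper:.4f})")
print(f"Wald chi-square = {wald_chi2:.2f}, OR = {math.exp(log_or):.4f}")
```

Exponentiating the two limits gives an approximate 95 % confidence interval for the odds ratio itself.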
5.7 Overdispersion in binary/binomial models

Overdispersion means that the variance of the response is larger than would be expected for the chosen model. For binomial models, the variance of y = “number of successes” is $np(1-p)$, and the variance of $\hat{p} = y/n$ is $p(1-p)/n$. A simple way to illustrate over-dispersion is to consider a simple dose-response experiment where the same dose has been used on two batches of animals. Suppose that the chosen dose has effect on 10 out of 50 animals in one of the replications, and on 20 out of 50 animals in the other replication. This means that there is actually a significant difference between the two replications (p = 0.029). In other less extreme cases, there may be a tendency for the responses to differ, even if the results are not significantly different at any given dose. Still, when all replications are considered together, a value of the Deviance/df statistic appreciably above unity may indicate that overdispersion is present in the data.
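The text does not say which test gives p = 0.029 for the two replications. One test that reproduces this value is the ordinary Pearson chi-square test of homogeneity without continuity correction, sketched below in Python; treat the choice of test as our assumption. For 1 degree of freedom the upper tail probability of the chi-square distribution can be written with the complementary error function.

```python
import math


def chi2_sf_1df(x: float) -> float:
    """Upper tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


# (number responding, batch size) for the two replications in the text
batches = [(10, 50), (20, 50)]

total_y = sum(y for y, _ in batches)
total_n = sum(n for _, n in batches)
p_pooled = total_y / total_n  # common response probability under H0

pearson = 0.0
for y, n in batches:
    # responders and non-responders compared with their expected counts
    for observed, expected in ((y, n * p_pooled), (n - y, n * (1 - p_pooled))):
        pearson += (observed - expected) ** 2 / expected

p_value = chi2_sf_1df(pearson)
print(f"Pearson chi-square = {pearson:.4f}, p = {p_value:.3f}")
```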
A common source of over-dispersion is that the data display some form of clustering. This means that the observations are not independent. For example, different batches of animals may come from different parents, and thus be genetically different. One way to model such over-dispersion is to assume that the mean value is still $E(y) = np$ but that the variance takes the form $Var(y) = np(1-p)\sigma^2$, where in the clustered case it can be assumed that $\sigma^2 = 1 + (k-1)\tau^2$. Here, k is the cluster size.
5.7.1 Estimation of the dispersion parameter

One way to account for over-dispersion is to estimate the over-dispersion parameter from the data. If the data have a known cluster structure, this can be done via the between-cluster variance, which can be estimated from

$$\hat{\sigma}^2 = \frac{1}{r-1}\sum_{j=1}^{r}\frac{(y_j - n_j\hat{p})^2}{n_j\hat{p}(1-\hat{p})} \tag{5.9}$$

where r is the number of clusters.
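A minimal Python sketch of the estimator (5.9) follows. It reads the numerator as a squared deviation, which is what a variance estimate requires, and it assumes that p̂ is the pooled proportion over all clusters; in a model with covariates, p̂ would instead be the fitted probability. The function name is ours, and the data are the two batches from the illustration in Section 5.7.

```python
from typing import List, Tuple


def dispersion_from_clusters(clusters: List[Tuple[int, int]]) -> float:
    """Estimate sigma^2 as in (5.9) from (y_j, n_j) pairs, one per cluster."""
    r = len(clusters)
    # Assumption: a common p, estimated by the pooled proportion
    p_hat = sum(y for y, _ in clusters) / sum(n for _, n in clusters)
    total = sum(
        (y - n * p_hat) ** 2 / (n * p_hat * (1 - p_hat)) for y, n in clusters
    )
    return total / (r - 1)


sigma2_hat = dispersion_from_clusters([(10, 50), (20, 50)])
print(f"estimated dispersion: {sigma2_hat:.4f}")
```

A value near 1 would indicate that the variation between clusters is about what the binomial model predicts; here the estimate is well above 1.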
If the structure of the clustering is unknown, an alternative way of estimating the dispersion parameter is to use the observed value of Deviance/df (or Pearson χ²/df) as an estimate, and to re-run the analysis using this value. A useful recommendation is to run a “maximal model”, that contains all relevant factors, even if they are not significant. The dispersion parameter is estimated from this model. This value of the dispersion parameter is then kept constant in all later analyses of the data. For simple models, for example models in designed experiments, an alternative is to ask the software to use a Maximum Likelihood estimate of the dispersion parameter. This option is present in Proc Genmod.
5.7.2 Modeling as a beta-binomial distribution

If one can suspect some form of clustering, another approach to the modeling is to assume that y follows a binomial distribution within clusters but that the parameter p follows some random distribution over clusters. If the distribution of p is known, the distribution of y will be a so-called compound distribution which can be derived. A rather simple case is obtained when the distribution of p is a Beta distribution. Then, the distribution of y will follow a distribution called the Beta-binomial distribution. However, this distribution is not at present available in the Genmod procedure. Estimation using Quasi-likelihood methods is an alternative approach to modeling overdispersion. This is discussed in Chapter 8.
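The following Python sketch illustrates how a Beta distribution for p inflates the binomial variance. It uses the standard Beta-binomial probability function with Beta parameters a and b; the values n = 20, a = 2 and b = 6 are invented for the illustration. The variance comes out as np(1 − p)[1 + (n − 1)ρ] with p = a/(a + b) and ρ = 1/(a + b + 1), which has the same form as σ² = 1 + (k − 1)τ² above if the cluster size k is identified with n and τ² with ρ (our reading, not a statement from the text).

```python
import math


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(y: int, n: int, a: float, b: float) -> float:
    """P(Y = y) when Y | p ~ Binomial(n, p) and p ~ Beta(a, b)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return math.exp(log_choose + log_beta(y + a, n - y + b) - log_beta(a, b))


n, a, b = 20, 2.0, 6.0  # invented values for the illustration
pmf = [beta_binomial_pmf(y, n, a, b) for y in range(n + 1)]

mean = sum(y * q for y, q in enumerate(pmf))
var = sum((y - mean) ** 2 * q for y, q in enumerate(pmf))

p = a / (a + b)            # mean of the Beta distribution
rho = 1.0 / (a + b + 1.0)  # within-cluster correlation
inflation = var / (n * p * (1 - p))

print(f"mean = {mean:.4f}, variance = {var:.4f}")
print(f"variance inflation = {inflation:.4f}, 1 + (n-1)*rho = {1 + (n - 1) * rho:.4f}")
```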
5.7.3 An example of over-dispersed data

Example 5.5 Orobanche is a parasitic plant that grows on the roots of other plants. A number of batches of Orobanche seeds of two varieties were grown on extract from Bean or Cucumber roots, and the number of seeds germinating was recorded. The data, taken from Collett (1991), are:

      O. aegyptiaca 75             O. aegyptiaca 73
   Bean        Cucumber         Bean        Cucumber
  y    n       y    n          y    n       y    n
 10   39       5    6          8   16       3   12
 23   62      53   74         10   30      22   41
 23   81      55   72          8   28      15   30
 26   51      32   51         23   45      32   51
 17   39      46   79          0    4       3    7
              10   13
It was of interest to compare the two varieties, and also to compare the two types of host plants. An analysis of these data using a binomial distribution with a logit link revealed that an interaction term was needed. Part of the output for a model containing Variety, Host and Variety*Host is given below. The model does not fit well (Deviance = 33.2778 on 17 df, p = 0.01). The ratio Deviance/df is nearly 2, indicating that overdispersion may be present.

Criteria For Assessing Goodness Of Fit

Criterion               DF         Value     Value/DF
Deviance                17       33.2778       1.9575
Scaled Deviance         17       33.2778       1.9575
Pearson Chi-Square      17       31.6511       1.8618
Scaled Pearson X2       17       31.6511       1.8618
Log Likelihood                 -543.1106
Algorithm converged.

Analysis Of Parameter Estimates

                                                       Standard
Parameter                          DF    Estimate         Error
Intercept                           1      0.7600        0.1250
variety        73                   1     -0.6322        0.2100
variety        75                   0      0.0000        0.0000
host           Bean                 1     -1.3182        0.1775
host           Cucumber             0      0.0000        0.0000
variety*host   73   Bean            1      0.7781        0.3064
variety*host   73   Cucumber        0      0.0000        0.0000
variety*host   75   Bean            0      0.0000        0.0000
variety*host   75   Cucumber        0      0.0000        0.0000
Scale                               0      1.0000        0.0000

LR Statistics For Type 3 Analysis

Source            DF    ChiSquare    Pr > ChiSq
variety            1         2.53        0.1121
host               1        37.48        <.0001
variety*host       1         6.41        0.0114
As a second analysis, the data were analyzed using the automatic feature in Genmod to estimate the scale parameter from the data using the Maximum Likelihood method. Part of the output was as follows:

Criteria For Assessing Goodness Of Fit

Criterion               DF         Value     Value/DF
Deviance                17       33.2778       1.9575
Scaled Deviance         17       17.0000       1.0000
Pearson Chi-Square      17       31.6511       1.8618
Scaled Pearson X2       17       16.1690       0.9511
Log Likelihood                 -277.4487
The procedure now uses a scaled deviance of 1.00. The parameter estimates are identical to those of the previous analysis, but the estimated standard errors are larger when we include a scale parameter. This has the effect that the Variety*Host interaction is no longer significant.

Analysis Of Parameter Estimates

                                                       Standard
Parameter                          DF    Estimate         Error
Intercept                           1      0.7600        0.1748
variety        73                   1     -0.6322        0.2938
variety        75                   0      0.0000        0.0000
host           Bean                 1     -1.3182        0.2483
host           Cucumber             0      0.0000        0.0000
variety*host   73   Bean            1      0.7781        0.4287
variety*host   73   Cucumber        0      0.0000        0.0000
variety*host   75   Bean            0      0.0000        0.0000
variety*host   75   Cucumber        0      0.0000        0.0000
Scale                               0      1.3991        0.0000
LR Statistics For Type 3 Analysis

Source          Num DF   Den DF   F Value   Pr > F   ChiSquare   Pr > ChiSq
variety              1       17      1.29   0.2718        1.29       0.2561
host                 1       17     19.15   0.0004       19.15       <.0001
variety*host         1       17      3.27   0.0881        3.27       0.070
□
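The relation between the two analyses in Example 5.5 can be checked by simple arithmetic. In the Python sketch below, the dispersion is taken as Deviance/df from the first analysis; its square root coincides with the Scale value 1.3991 printed in the second output. Multiplying the unscaled standard errors by this scale, and dividing the likelihood ratio chi-squares by the dispersion, reproduces the second output to within rounding. All numbers are copied from the outputs above; the dictionary keys are our labels.

```python
import math

deviance, df = 33.2778, 17
dispersion = deviance / df     # Deviance/df from the first analysis
scale = math.sqrt(dispersion)  # compare with Scale in the second output

# Standard errors from the analysis with Scale fixed at 1
unscaled_se = {
    "Intercept": 0.1250,
    "variety 73": 0.2100,
    "host Bean": 0.1775,
    "variety*host 73 Bean": 0.3064,
}
scaled_se = {name: se * scale for name, se in unscaled_se.items()}

# Type 3 LR chi-squares from the first analysis (each on 1 df)
lr_chi2 = {"variety": 2.53, "host": 37.48, "variety*host": 6.41}
f_values = {name: x / dispersion for name, x in lr_chi2.items()}

print(f"dispersion = {dispersion:.4f}, scale = {scale:.4f}")
for name, se in scaled_se.items():
    print(f"{name:22s} SE = {se:.4f}")
for name, f in f_values.items():
    print(f"{name:22s} F  = {f:.2f}")
```

Because the estimates themselves do not change, the whole effect of the scale parameter is on the standard errors and test statistics.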
5.8 Exercises
Exercise 5.1

Species  Exposure  Rel. Hum  Temp  Deaths   N     Species  Exposure  Rel. Hum  Temp  Deaths   N
A            1       60.0     10      0    20     B            1       60.0     10      0    20
A            1       60.0     15      0    20     B            1       60.0     15      0    20
A            1       60.0     20      0    20     B            1       60.0     20      0    20
A            1       65.8     10      0    20     B            1       65.8     10      0    20
A            1       65.8     15      0    20     B            1       65.8     15      0    20
A            1       65.8     20      0    20     B            1       65.8     20      0    20
A            1       70.5     10      0    20     B            1       70.5     10      0    20
A            1       70.5     15      0    20     B            1       70.5     15      0    20
A            1       70.5     20      0    20     B            1       70.5     20      0    20
A            1       75.8     10      0    20     B            1       75.8     10      0    20
A            1       75.8     15      0    20     B            1       75.8     15      0    20
A            1       75.8     20      0    20     B            1       75.8     20      0    20
A            2       60.0     10      0    20     B            2       60.0     10      0    20
A            2       60.0     15      1    20     B            2       60.0     15      3    20
A            2       60.0     20      1    20     B            2       60.0     20      2    20
A            2       65.8     10      0    20     B            2       65.8     10      0    20
A            2       65.8     15      1    20     B            2       65.8     15      2    20
A            2       65.8     20      0    20     B            2       65.8     20      1    20
A            2       70.5     10      0    20     B            2       70.5     10      0    20
A            2       70.5     15      0    20     B            2       70.5     15      0    20
A            2       70.5     20      0    20     B            2       70.5     20      1    20
A            2       75.8     10      0    20     B            2       75.8     10      1    20
A            2       75.8     15      0    20     B            2       75.8     15      0    20
A            2       75.8     20      0    20     B            2       75.8     20      1    20
A            3       60.0     10      1    20     B            3       60.0     10      7    20
A            3       60.0     15      4    20     B            3       60.0     15     11    20
A            3       60.0     20      5    20     B            3       60.0     20     11    20
A            3       65.8     10      0    20     B            3       65.8     10      4    20
A            3       65.8     15      2    20     B            3       65.8     15      5    20
A            3       65.8     20      4    20     B            3       65.8     20      9    20
A            3       70.5     10      0    20     B            3       70.5     10      2    20
A            3       70.5     15      2    20     B            3       70.5     15      4    20
A            3       70.5     20      3    20     B            3       70.5     20      6    20
A            3       75.8     10      0    20     B            3       75.8     10      2    20
A            3       75.8     15      1    20     B            3       75.8     15      3    20
A            3       75.8     20      2    20     B            3       75.8     20      5    20
A            4       60.0     10      7    20     B            4       60.0     10     12    20
A            4       60.0     15      7    20     B            4       60.0     15     14    20
A            4       60.0     20      7    20     B            4       60.0     20     16    20
A            4       65.8     10      4    20     B            4       65.8     10     10    20
A            4       65.8     15      4    20     B            4       65.8     15     12    20
A            4       65.8     20      7    20     B            4       65.8     20     12    20
A            4       70.5     10      3    20     B            4       70.5     10      5    20
A            4       70.5     15      3    20     B            4       70.5     15      7    20
A            4       70.5     20      5    20     B            4       70.5     20      9    20
A            4       75.8     10      2    20     B            4       75.8     10      4    20
A            4       75.8     15      3    20     B            4       75.8     15      5    20
A            4       75.8     20      3    20     B            4       75.8     20      7    20
The data set given above contains data from an experiment studying the survival of snails. Groups of 20 snails were held for periods of 1, 2, 3 or 4 weeks under controlled conditions, where temperature and humidity were kept at assigned levels. The snails were of two species (A or B). The experiment was a completely randomized design. The variables are as follows:

Species     Snail species A or B
Exposure    Exposure in weeks (1, 2, 3 or 4)
Humidity    Relative humidity (four levels)
Temp        Temperature in degrees Celsius (3 levels)
Deaths      Number of deaths
N           Number of snails exposed
Analyze these data to find whether Exposure, Humidity, Temp, or interactions between these have any effects on survival probability. Also, make residual diagnostics and leverage diagnostics.

Exercise 5.2 The file Ex5_2.dat gives the following information about passengers travelling on the Titanic when it sank in 1912. Background material for the data can be found on http://www.encyclopedia-titanica.org.

Name        Name of the person
PClass      Passenger class: 1st, 2nd or 3rd
Age         Age of the person
Sex         male or female
Survived    1=survived, 0=died
Find a model that can predict probability of survival as functions of the given covariates, and possible interactions. Note that some age data are missing.

Exercise 5.3 Finney (1947) reported some data on the relative potencies of Rotenone, Deguelin, and a mixture of these. Batches of insects were subjected to these treatments, in different concentrations, and the number of dead insects was recorded. The raw data are:

Treatment    ln(dose)     n     x
Rotenone       1.01      50    44
               0.89      49    42
               0.71      46    24
               0.58      48    16
               0.41      50     6
Deguelin       1.70      48    48
               1.61      50    47
               1.48      49    47
               1.31      48    34
               1.00      48    18
               0.71      49    16
Mixture        1.40      50    48
               1.31      46    43
               1.18      48    38
               1.00      46    27
               0.71      46    22
               0.40      47     7
Analyze these data. In particular, examine whether the regression lines can be assumed to be parallel.

Exercise 5.4 Fahrmeir & Tutz (2001) report some data on the risk of infection from births by Caesarian section. The response variable of interest is the occurrence of infections following the operation. Three dichotomous covariates that might affect the risk of infection were studied:

planned     Was the Caesarian section planned (=1) or not (=0)
risk        Were risk factors such as diabetes, excessive weight or others present (=1) or absent (=0)
antibio     Were antibiotics given as a prophylactic (=1) or not (=0)
The data are included in the following SAS program that also gives the value of the variable infection (1=infection, 0=no infection). The variable wt is the number of observations with a given combination of the other variables. Thus, for example, there were 17 un-infected cases (infection=0) with planned=1, risk=1, and antibio=1.

data cesarian;
  INPUT planned antibio risk infection wt;
  CARDS;
1 1 1 1  1
1 1 1 0 17
1 1 0 1  0
1 1 0 0  2
1 0 1 1 28
1 0 1 0 30
1 0 0 1  8
1 0 0 0 32
0 1 1 1 11
0 1 1 0 87
0 1 0 1  0
0 1 0 0  0
0 0 1 1 23
0 0 1 0  3
0 0 0 1  0
0 0 0 0  9
;
The following analyses were run on these data:

1. A binomial GLIM model with a logit link, with only the main effects
2. Model 1 plus an interaction planned*antibio
3. Model 1 plus an interaction planned*risk
4. The same as model 3 but with some extra features, discussed below.

Some results:

Model 1

Criteria For Assessing Goodness Of Fit

Criterion               DF         Value     Value/DF
Deviance                 8      226.5177      28.3147
Scaled Deviance          8      226.5177      28.3147
Pearson Chi-Square       8      257.2508      32.1563
Scaled Pearson X2        8      257.2508      32.1563
Log Likelihood                 -113.2588

Model 2

Criteria For Assessing Goodness Of Fit

Criterion               DF         Value     Value/DF
Deviance                 7      226.4393      32.3485
Scaled Deviance          7      226.4393      32.3485
Pearson Chi-Square       7      254.7440      36.3920
Scaled Pearson X2        7      254.7440      36.3920
Log Likelihood                 -113.2196

Model 3

Criteria For Assessing Goodness Of Fit

Criterion               DF         Value     Value/DF
Deviance                 7      216.4759      30.9251
Scaled Deviance          7      216.4759      30.9251
Pearson Chi-Square       7      261.4010      37.3430
Scaled Pearson X2        7      261.4010      37.3430
Log Likelihood                 -108.2380
One problem with Model 3 is that no standard error, and no test, of the parameter for the planned*risk interaction is given by SAS. This is because the likelihood is rather flat which, in turn, depends on cells with observed count = 0. Therefore, Model 4 used the same model as Model 3, but with all values where Wt=0 replaced by Wt=0.5.

Model 4

Criteria For Assessing Goodness Of Fit

Criterion               DF         Value     Value/DF
Deviance                11      231.5512      21.0501
Scaled Deviance         11      231.5512      21.0501
Pearson Chi-Square      11      446.4616      40.5874
Scaled Pearson X2       11      446.4616      40.5874
Log Likelihood                 -115.7756
Algorithm converged.

Analysis Of Parameter Estimates

                                  Standard    Wald 95% Confidence
Parameter        DF   Estimate       Error          Limits           ChiSquare   Pr > ChiSq
Intercept         1     2.1440      1.0568     0.0728     4.2152          4.12       0.0425
planned           1    -0.8311      1.1251    -3.0363     1.3742          0.55       0.4601
antibio           1     3.4991      0.5536     2.4141     4.5840         39.95       <.0001
risk              1    -3.7172      1.1637    -5.9980    -1.4364         10.20       0.0014
planned*risk      1     2.4394      1.2477    -0.0060     4.8848          3.82       0.0506
Scale             0     1.0000      0.0000     1.0000     1.0000
Questions: Use these data to answer the following questions:

A. Compare models 1 and 2 to test whether the planned*antibio interaction is significantly different from zero.
B. Compare models 1 and 3 to test whether the planned*risk interaction is significantly different from zero.
C. Explain why the Deviance in Model 4 has more degrees of freedom than in Model 3.
D. Based on the results for Model 4, estimate the odds ratios for infection for the factors in the model. Note that the program has modeled the probability of not being infected. Calculate the odds ratios for not being infected and, from these, the odds ratios of being infected.
E. Calculate predicted values and raw residuals for the first four observations in the data.

Exercise 5.5 An experiment has been designed in the following way: Two groups of patients (A1 and A2) were used. The groups differed regarding the type of diagnosis for a certain disease. Each group consisted of nine patients. The patients in the two groups were randomly assigned to three different treatments: B1, B2 and B3, with three patients for each treatment in each group. The blood pressure (Z) was measured at the beginning of the experiment on each patient. A binary response variable (Y) was measured on each patient at the end of the experiment. It was modeled as g(µ) = XB, using some computer package. The final model included the main effects of A and B and their interaction, and the effect of the covariate Z. The slope for Z was different for different treatments, but not for the different patient groups or for the A*B interaction.

A. Write down the complete design matrix X. You should include all dummy variables, even those that are redundant. Covariate values should be represented by some symbol. Also write down the corresponding parameter vector B.
B. The link function used to analyze these data was the logit link $g(p) = \log\frac{p}{1-p}$. What is the inverse $g^{-1}$ of this link function?
6. Response variables as counts
6.1 Log-linear models: introductory example

Count data can be summarized in the form of frequency tables or as contingency tables. The data are then given as the number of observations with each combination of values of some categorical variables. We will first look at a simple example with a contingency table of dimension 2×2.

Example 6.1 Norton and Dunn (1985) studied possible relations between snoring and heart problems. For 2484 persons it was recorded whether the person had any heart problems and whether the person was a snorer. An interesting question is then whether there is any relation between snoring and heart problems. The data are as follows:

                       Snores
Heart problems     Seldom    Often    Total
Yes                    59       51      110
No                   1958      416     2374
Total                2017      467     2484

We assume that the persons in the sample constitute a random sample from some population. Denote with $p_{ij}$ the probability that a randomly selected person belongs to row category i and column category j of the table. This can be summarized as follows:

                       Snores
Heart problems     Seldom    Often    Total
Yes                  p11      p12      p1·
No                   p21      p22      p2·
Total                p·1      p·2       1

A dot in the subscript indicates a marginal probability. For example, $p_{\cdot 1}$ denotes the probability that a person snores seldom, i.e. $p_{\cdot 1} = p_{11} + p_{21}$. □
6.1.1 A log-linear model for independence

If snoring and heart problems were statistically independent, it would hold that $p_{ij} = p_{i\cdot}\,p_{\cdot j}$ for all i and j. This is a model that we would like to compare with the more general model that snoring and heart problems are dependent. Instead of modeling the probabilities, we can state the models in terms of expected frequencies $\mu_{ij} = n p_{ij}$, where n is the total sample size and $\mu_{ij}$ is the expected number in cell (i, j). Thus, the independence model states that $\mu_{ij} = n p_{i\cdot}\,p_{\cdot j}$. This is a multiplicative model. By taking the logs of both sides we get an additive model assuming independence:

$$\log(\mu_{ij}) = \log(n) + \log(p_{i\cdot}) + \log(p_{\cdot j}) = \mu + \alpha_i + \beta_j. \tag{6.1}$$

In (6.1), $\alpha_i$ denotes the row effect (i.e. the effect of variable A), and $\beta_j$ denotes the column effect (i.e. the effect of variable B). In log-linear model literature, effects are often denoted with symbols like $\lambda_i^X$, but we keep a notation that is in line with the notation of previous chapters. We can see that this model is a linear model (a linear predictor), and that the link function is log. Models of type (6.1) are called log-linear models.

Note that a model for a crosstable of dimension r × c can include at most (r − 1) parameters for the row effects and (c − 1) parameters for the column effect. This is analogous to ANOVA models. One way to constrain the parameters is to set the last parameter of each kind equal to zero. In our example, r = c = 2 so we need only one parameter $\alpha_i$ and one $\beta_j$, for example $\alpha_1$ and $\beta_1$. In GLIM terms, the model for our example data can then be written as

$$\begin{pmatrix} \log(\mu_{11}) \\ \log(\mu_{12}) \\ \log(\mu_{21}) \\ \log(\mu_{22}) \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} \mu \\ \alpha_1 \\ \beta_1 \end{pmatrix}. \tag{6.2}$$
6.1.2 When independence does not hold

If independence does not hold we need to include in the model terms of type $(\alpha\beta)_{ij}$ that account for the dependence. The terms $(\alpha\beta)_{ij}$ represent interaction between the factors A and B, i.e. the effect of one variable depends on the level of the other variable. Then the model becomes

$$\log(\mu_{ij}) = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}. \tag{6.3}$$
Any two-dimensional cross-table can be perfectly represented by the model (6.3); this model is called the saturated model. We can test the restrictions imposed by removing the parameters (αβ)ij by comparing the deviances: the saturated model will have deviance 0 on 0 degrees of freedom, so the deviance from fitting the model (6.1) can be used directly to test the hypothesis of independence.
6.2 Distributions for count data
So far, we have seen that a model for the expected frequencies in a crosstable can be formulated as a log-linear model. This model has the following properties: The predictor is a linear predictor of the same type as in ANOVA. The link function is a log function. It remains to discuss what distributional assumptions to use.
6.2.1 The multinomial distribution

Suppose that a nominal variable Y has k distinct values $y_1, y_2, \ldots, y_k$ such that no implicit ordering is imposed on the values. In fact, the values might be observations on a single nominal variable, or they might be observations on cell counts in a multidimensional contingency table. The probabilities associated with the different values of Y are $p_1, p_2, \ldots, p_k$. We make n observations on Y. If the observations are independent, the probability of getting $n_1$ observations with $Y = y_1$, $n_2$ observations with $Y = y_2$, and so on, is

$$P(n_1, n_2, \ldots, n_k \mid n) = \frac{n!}{n_1!\, n_2! \cdots n_k!}\; p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}. \tag{6.4}$$
Note that in (6.4), the total sample size n = n1 + n2 + . . . + nk is regarded as fixed. Also note that the expression simplifies to the binomial distribution for the case k = 2. The distribution given in (6.4) is called the multinomial distribution. The multinomial distribution is a multivariate distribution since it describes the joint distribution of y1 , y2 , ..., yk . It can be seen as a multivariate generalization of an exponential family distribution (see e.g. Agresti, 1990).
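The probability function (6.4) is simple to evaluate directly. The Python sketch below (function name ours, probabilities invented) computes it and checks numerically the remark that it reduces to the binomial distribution for k = 2.

```python
import math
from typing import Sequence


def multinomial_pmf(counts: Sequence[int], probs: Sequence[float]) -> float:
    """Probability (6.4) of the cell counts, given the total n = sum(counts)."""
    n = sum(counts)
    coefficient = math.factorial(n)
    for c in counts:
        coefficient //= math.factorial(c)  # exact integer division at each step
    prob = float(coefficient)
    for c, p in zip(counts, probs):
        prob *= p ** c
    return prob


# k = 2: the multinomial reduces to the binomial distribution
binomial = math.comb(10, 3) * 0.3 ** 3 * 0.7 ** 7
multinomial_k2 = multinomial_pmf([3, 7], [0.3, 0.7])
print(binomial, multinomial_k2)

# k = 3 with invented probabilities: the probabilities of all outcomes sum to 1
probs = [0.2, 0.5, 0.3]
n = 4
total = sum(
    multinomial_pmf([a, b, n - a - b], probs)
    for a in range(n + 1)
    for b in range(n + 1 - a)
)
print(total)
```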
6.2.2 The product multinomial distribution
A contingency table may have some of its totals fixed by the design of the data collection. For example, 500 males and 500 females might have been interviewed in a survey. In such cases it is not meaningful to talk about the random distribution of the “gender” variable. For such data each “slice” of the table subdivided by gender may be seen as one realization of a multinomial distribution. The joint distribution of all cells of the table is then the product of several multinomial distributions, one for each slice. This joint distribution is called the product multinomial distribution.
6.2.3 The Poisson distribution

Suppose, again, that a nominal variable Y has k distinct values $y_1, y_2, \ldots, y_k$. We observe counts $n_1, n_2, \ldots, n_k$. The expected number of observations in cell i is $\mu_i$. If the observations arrive randomly, the probability of observing $n_i$ observations in cell i is

$$p(n_i) = \frac{e^{-\mu_i}\mu_i^{n_i}}{n_i!} \tag{6.5}$$

which is the probability function of a Poisson distribution. Note that in this case, the total sample size n is not regarded as fixed. This is the main difference between this sampling scheme and the multinomial case. Since sums of Poisson variables follow a Poisson distribution, n in itself follows a Poisson distribution with mean value $\sum_{i=1}^{k}\mu_i$.
6.2.4 Relation to contingency tables

Contingency tables can be of many different types. In some cases, the total sample size is fixed; an example is when it has been decided that n = 1000 individuals will be interviewed about some political question. In some cases even some of the margins of the table may be fixed. An example is when 500 males and 500 females will participate in a survey. A table with a fixed total sample size would suggest a multinomial distribution; if in addition one or more of the margins are fixed we would assume a product multinomial distribution. However, as noted by Agresti (1996), “For most analyses, one need not worry about which sampling model makes the most sense. For the primary inferential methods in this text, the same results occur for the Poisson, multinomial and independent binomial/multinomial sampling models” (p. 19).
Suppose that we observe a contingency table of size i × j. The probability that an observation will fall into cell (i, j) is pij . If the observations are independent and arrive randomly, the number of observations falling into cell (i, j) follows a Poisson distribution with mean value µij , if the total sample size n is random. If the cell counts nij follow a Poisson distribution then the conditional distribution of nij |n is multinomial. The Poisson distribution is often used to model count data since it is rather easy to handle. Note, however, there is no guarantee that a given set of data will adhere to this assumption. Sometimes the data may show a tendency to “cluster” such that arrival of one observation in a specific cell may increase the probability that the next observation falls into the same cell. This would lead to overdispersion. We will discuss overdispersion for Poisson models in a later section; a distribution called the negative binomial distribution may be used in some such cases. For the moment, however, we will see what happens if we tentatively accept the Poisson assumption for the data on snoring and heart problems.
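The statement that independent Poisson counts become multinomial once we condition on the total n can be verified numerically. In the Python sketch below, the means and counts are invented; the conditional probability is the joint Poisson probability divided by the Poisson probability of the total, and it is compared with the multinomial probability with $p_i = \mu_i / \sum_i \mu_i$.

```python
import math


def poisson_pmf(k: int, mu: float) -> float:
    """Poisson probability function, as in (6.5)."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


mus = [2.0, 3.5, 0.5]  # invented cell means
counts = [1, 4, 0]     # invented observed counts
n = sum(counts)
total_mu = sum(mus)

# Joint probability of the counts under independent Poisson sampling
joint = math.prod(poisson_pmf(c, mu) for c, mu in zip(counts, mus))

# Conditional probability given the total n (n is Poisson with mean total_mu)
conditional = joint / poisson_pmf(n, total_mu)

# Multinomial probability with cell probabilities mu_i / total_mu
coefficient = math.factorial(n) / math.prod(math.factorial(c) for c in counts)
multinomial = coefficient * math.prod(
    (mu / total_mu) ** c for c, mu in zip(counts, mus)
)

print(conditional, multinomial)
```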
6.3 Analysis of the example data

Example 6.2 We analyzed the data from Example 6.1 using the Genmod procedure with Poisson distribution and a log link. The program was:

DATA snoring;
  INPUT snore heart count;
  CARDS;
1 1 51
1 0 416
0 1 59
0 0 1958
;
PROC GENMOD DATA=snoring;
  CLASS snore heart;
  MODEL count = snore heart / LINK = log DIST = poisson;
RUN;
The output contains the following information:

Criteria For Assessing Goodness Of Fit

Criterion               DF         Value     Value/DF
Deviance                 1       45.7191      45.7191
Scaled Deviance          1       45.7191      45.7191
Pearson Chi-Square       1       57.2805      57.2805
Scaled Pearson X2        1       57.2805      57.2805
Log Likelihood           .    15284.0145            .

Analysis Of Parameter Estimates

Parameter         DF   Estimate    Std Err   ChiSquare   Pr>Chi
INTERCEPT          1     3.0292     0.1041    847.2987   0.0001
SNORE       0      1     1.4630     0.0514    811.6746   0.0001
SNORE       1      0     0.0000     0.0000           .        .
HEART       0      1     3.0719     0.0975    992.0240   0.0001
HEART       1      0     0.0000     0.0000           .        .
SCALE              0     1.0000     0.0000           .        .

NOTE: The scale parameter was held fixed.
A similar analysis that includes an interaction term would produce a deviance of 0 on 0 df. Thus, the difference between our model and the saturated model can be tested; the difference in deviance is 45.7 on 1 degree of freedom, which is highly significant when compared with the corresponding χ² limit with 1 df. We conclude that snoring and heart problems do not seem to be independent. Note that the Pearson chi-square of 57.28 on 1 df presented in the output is based on the textbook formula

$$\chi^2 = \sum_{i,j}\frac{(n_{ij} - \hat{\mu}_{ij})^2}{\hat{\mu}_{ij}}.$$

The conclusion is the same, in this case, but the tests are not identical.

The output also gives us estimates of the three parameters of the model: $\hat{\mu} = 3.0292$, $\hat{\alpha}_1 = 1.4630$ and $\hat{\beta}_1 = 3.0719$. An analysis of the saturated model would give an estimate of the interaction parameter as $\widehat{(\alpha\beta)}_{11} = 1.4033$. From this we can calculate the odds ratio OR as OR = exp(1.4033) = 4.07. Patients who snore have a four times larger odds of having heart problems. Odds ratios in log-linear models are further discussed in a later section. □
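For the independence model the fitted values are (row total × column total)/n, so the three parameter estimates can be recovered from the margins of the table. The Python sketch below does this with the parametrization seen in the output, where the last level of each factor (snore = 1, heart = 1) is the reference; the variable names are ours. The interaction parameter of the saturated model is the log of the sample odds ratio.

```python
import math

# Observed counts from Example 6.1
heart_seldom, heart_often = 59, 51        # heart problems: yes
healthy_seldom, healthy_often = 1958, 416  # heart problems: no

heart_total = heart_seldom + heart_often        # 110
healthy_total = healthy_seldom + healthy_often  # 2374
seldom_total = heart_seldom + healthy_seldom    # 2017
often_total = heart_often + healthy_often       # 467
n = heart_total + healthy_total                 # 2484

# Fitted value in the reference cell (snores often, heart problems)
fitted_reference = heart_total * often_total / n

intercept = math.log(fitted_reference)
snore_effect = math.log(seldom_total / often_total)   # SNORE 0 versus SNORE 1
heart_effect = math.log(healthy_total / heart_total)  # HEART 0 versus HEART 1

# Interaction parameter of the saturated model = log sample odds ratio
log_or = math.log((heart_often * healthy_seldom) / (heart_seldom * healthy_often))

print(f"intercept = {intercept:.4f}")
print(f"snore     = {snore_effect:.4f}")
print(f"heart     = {heart_effect:.4f}")
print(f"log OR    = {log_or:.4f}, OR = {math.exp(log_or):.2f}")
```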
6.4 Testing independence in an r×c crosstable
The methods discussed so far can be extended to the analysis of cross-tables of dimension r × c.

Example 6.3 Sokal and Rohlf (1973) presented data on the color of the Tiger beetle (Cicindela fulgida) for beetles collected during different seasons. The results are given as:

Season            Red    Other    Total
Early spring       29       11       40
Late spring       273      191      464
Early summer        8       31       39
Late summer        64       64      128
Total             374      297      671
A standard analysis of these data would be to test whether there is independence between season and color through a χ2 test. The corresponding GLIM approach is to model the expected number of beetles as a function of season and color. The observed numbers in each cell are assumed to be generated from an underlying Poisson distribution. The canonical link for the Poisson distribution is the log link. Thus, a Genmod program for these data is

    DATA chisq;
    INPUT season $ color $ no;
    CARDS;
    Early_spring red 29
    Early_spring other 11
    Late_spring red 273
    Late_spring other 191
    Early_summer red 8
    Early_summer other 31
    Late_summer red 64
    Late_summer other 64
    ;
    PROC GENMOD DATA=chisq;
    CLASS season color;
    MODEL no = season color / DIST=poisson LINK=log ;
    run;
Part of the output is

    Criteria For Assessing Goodness Of Fit

    Criterion             DF        Value     Value/DF
    Deviance               3      28.5964       9.5321
    Scaled Deviance        3      28.5964       9.5321
    Pearson Chi-Square     3      27.6840       9.2280
    Scaled Pearson X2      3      27.6840       9.2280
    Log Likelihood         .    2628.7264            .
The deviance is 28.6 on 3 df, which is highly significant. The Pearson chi-square is again the same value as would be obtained from a standard chi-square test; it is also highly significant. Formally, independence is tested by comparing the deviance of this model with the deviance that would be obtained if the Season*Color interaction were included in the model. This saturated model has deviance 0.00 on 0 df. Thus, the deviance 28.6 is a large-sample test of independence between color and season. ¤
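As a check on Example 6.3, the sketch below recomputes the deviance and the Pearson chi-square for the 4 × 2 beetle table and attaches a p-value to the deviance. It is an illustration only: the function names are ours, and the p-value uses the closed-form upper tail of the χ2 distribution that holds for exactly 3 df.

```python
import math

def independence_test(table):
    """Deviance, Pearson chi-square and df for independence in an r x c table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    deviance = 0.0
    pearson = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            mu_ij = row_tot[i] * col_tot[j] / total
            if n_ij > 0:
                deviance += 2.0 * n_ij * math.log(n_ij / mu_ij)
            pearson += (n_ij - mu_ij) ** 2 / mu_ij
    return deviance, pearson, (len(row_tot) - 1) * (len(col_tot) - 1)

def chi2_sf_3df(x):
    """Upper tail probability of a chi-square variable with 3 df (closed form)."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

# Rows: early spring, late spring, early summer, late summer; columns: red, other.
beetles = [[29, 11], [273, 191], [8, 31], [64, 64]]
deviance, pearson, df = independence_test(beetles)
p_value = chi2_sf_3df(deviance)
print(round(deviance, 4), round(pearson, 4), df, p_value)
```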
6.5
Higher-order tables
6.5.1
A three-way table
The arguments used above for the analysis of two-dimensional contingency tables can be generalized to tables of higher order. A general (saturated) model for a three-way table can be written as

$$\log \mu_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk} \tag{6.6}$$
An important part of the analysis is to decide which terms to include in the model.
Example 6.4 The table below contains data from a survey from Wright State University in 1992 (1). 2276 high school seniors were asked whether they had ever used Alcohol (A), Cigarettes (C) and/or Marijuana (M). This is a three-way contingency table of dimension 2 × 2 × 2.

                                    Marijuana use
    Alcohol use   Cigarette use     Yes      No
    Yes           Yes               911     538
    Yes           No                 44     456
    No            Yes                 3      43
    No            No                  2     279

(1) Data quoted from Agresti (1996) who credited the data to Professor Harry Khamis.
¤
6.5.2
Types of independence
Models for data of the type given in the last example can include the main effects of A, C and M and different interactions containing these. The presence of an interaction, for example A*C, means that students who use alcohol have a higher (or lower) probability of also using cigarettes. One way of interpreting interactions is to calculate odds ratios; we will return to this topic soon. A model of type A C M A*C A*M would permit interaction between A and C, and between A and M, but not between C and M. C and M are then said to be conditionally independent, controlling for A. A model that only contains the main effects, i.e. the model A C M, is called a mutual independence model. In this example this would mean that use of one drug does not change the risk of using any other drug. A model that contains all interactions up to a certain level, but no higher-order interactions, is called a homogeneous association model.
6.5.3
Genmod analysis of the drug use data
The saturated model that contains all main effects and all two- and three-way interactions was fitted to the data as a baseline. The three-way interaction A*C*M was not significant (p = 0.53). The output for the homogeneous association model containing all two-way interactions was as follows:

    Criteria For Assessing Goodness Of Fit

    Criterion             DF        Value     Value/DF
    Deviance               1       0.3740       0.3740
    Scaled Deviance        1       0.3740       0.3740
    Pearson Chi-Square     1       0.4011       0.4011
    Scaled Pearson X2      1       0.4011       0.4011
    Log Likelihood         .   12010.6124            .

The fit of this model is good; a simple rule of thumb is that Value/df should not be too much larger than 1. The parameter estimates for this model are as follows:

    Analysis Of Parameter Estimates

    Parameter              DF   Estimate   Std Err    ChiSquare   Pr>Chi
    INTERCEPT               1     6.8139    0.0331   42312.0532   0.0001
    A          No           1    -5.5283    0.4522     149.4518   0.0001
    A          Yes          0     0.0000    0.0000            .        .
    C          No           1    -3.0158    0.1516     395.6463   0.0001
    C          Yes          0     0.0000    0.0000            .        .
    M          No           1    -0.5249    0.0543      93.4854   0.0001
    M          Yes          0     0.0000    0.0000            .        .
    A*C        No   No      1     2.0545    0.1741     139.3180   0.0001
    A*C        No   Yes     0     0.0000    0.0000            .        .
    A*C        Yes  No      0     0.0000    0.0000            .        .
    A*C        Yes  Yes     0     0.0000    0.0000            .        .
    A*M        No   No      1     2.9860    0.4647      41.2933   0.0001
    A*M        No   Yes     0     0.0000    0.0000            .        .
    A*M        Yes  No      0     0.0000    0.0000            .        .
    A*M        Yes  Yes     0     0.0000    0.0000            .        .
    C*M        No   No      1     2.8479    0.1638     302.1409   0.0001
    C*M        No   Yes     0     0.0000    0.0000            .        .
    C*M        Yes  No      0     0.0000    0.0000            .        .
    C*M        Yes  Yes     0     0.0000    0.0000            .        .
    SCALE                   0     1.0000    0.0000            .        .
All remaining interactions in the model are highly significant which means that no further simplification of the model is suggested by the data.
6.5.4
Interpretation through Odds ratios
Consider, for the moment, a 2 × 2 × k cross-table of variables X, Y and Z. Within a fixed level j of Z, the conditional odds ratio for describing the relationship between X and Y is

$$\theta_{XY(j)} = \frac{\mu_{11j}\,\mu_{22j}}{\mu_{12j}\,\mu_{21j}} \tag{6.7}$$

where µ denotes expected values. In contrast, in the marginal odds ratio the value of the variable Z is ignored and we calculate the odds ratio as

$$\theta_{XY} = \frac{\mu_{11\cdot}\,\mu_{22\cdot}}{\mu_{12\cdot}\,\mu_{21\cdot}} \tag{6.8}$$

where the dot indicates summation over all levels of Z. The odds ratios can be estimated from the parameter estimates; it holds that, for example,

$$\hat\theta_{XY} = \exp\left[\widehat{(\alpha\beta)}_{11} + \widehat{(\alpha\beta)}_{22} - \widehat{(\alpha\beta)}_{12} - \widehat{(\alpha\beta)}_{21}\right] \tag{6.9}$$
In our drug use example, the chosen model does not contain any three-way interaction, and only one parameter is estimable for each interaction. Thus, the partial odds ratios for the two-way interactions can be estimated as:
    A*C: exp(2.0545) = 7.80
    A*M: exp(2.9860) = 19.81
    C*M: exp(2.8479) = 17.25

As an example of an interpretation, the odds of having tried marijuana are 19.81 times higher for a student who has tried alcohol than for a student who has not, regardless of reported cigarette use.
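The fitted values of the homogeneous association model can also be obtained without Genmod. The sketch below uses iterative proportional fitting (a different algorithm from the one Genmod uses, but one that converges to the same maximum likelihood fit for this model) on the drug use data, and then evaluates the conditional odds ratios of (6.7) from the fitted values and the marginal A–M odds ratio of (6.8) from the observed margins. All function and variable names are ours.

```python
import math

# Observed counts n[a][c][m]; index 0 = Yes, 1 = No (alcohol, cigarettes, marijuana).
observed = [[[911, 538], [44, 456]],
            [[3, 43], [2, 279]]]

def fit_homogeneous_association(n, tol=1e-8, max_iter=10000):
    """Iterative proportional fitting of the model with all two-way interactions."""
    mu = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
    for _ in range(max_iter):
        previous = [[row[:] for row in layer] for layer in mu]
        for a in range(2):                      # match the A x C margin
            for c in range(2):
                scale = sum(n[a][c]) / sum(mu[a][c])
                for m in range(2):
                    mu[a][c][m] *= scale
        for a in range(2):                      # match the A x M margin
            for m in range(2):
                scale = (n[a][0][m] + n[a][1][m]) / (mu[a][0][m] + mu[a][1][m])
                for c in range(2):
                    mu[a][c][m] *= scale
        for c in range(2):                      # match the C x M margin
            for m in range(2):
                scale = (n[0][c][m] + n[1][c][m]) / (mu[0][c][m] + mu[1][c][m])
                for a in range(2):
                    mu[a][c][m] *= scale
        change = max(abs(mu[a][c][m] - previous[a][c][m])
                     for a in range(2) for c in range(2) for m in range(2))
        if change < tol:
            break
    return mu

mu = fit_homogeneous_association(observed)
deviance = 2.0 * sum(observed[a][c][m] * math.log(observed[a][c][m] / mu[a][c][m])
                     for a in range(2) for c in range(2) for m in range(2))

# Conditional odds ratios; without a three-way term they are the same at both
# levels of the third variable, so the "Yes" level is used here.
or_ac = mu[0][0][0] * mu[1][1][0] / (mu[0][1][0] * mu[1][0][0])
or_am = mu[0][0][0] * mu[1][0][1] / (mu[0][0][1] * mu[1][0][0])
or_cm = mu[0][0][0] * mu[0][1][1] / (mu[0][0][1] * mu[0][1][0])

# Marginal A-M odds ratio: cigarette use is ignored (summed over).
am = [[observed[a][0][m] + observed[a][1][m] for m in range(2)] for a in range(2)]
or_am_marginal = am[0][0] * am[1][1] / (am[0][1] * am[1][0])
print(round(deviance, 4), round(or_ac, 2), round(or_am, 2), round(or_cm, 2),
      round(or_am_marginal, 2))
```

The marginal A–M odds ratio is considerably larger than the conditional one, which illustrates why the two quantities in (6.7) and (6.8) should not be confused.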
6.6
Relation to logistic regression
6.6.1
Binary response
If one binary variable in a contingency table can be regarded as the response, an alternative to the log-linear model would be to model the probability of response as a function of the other variables in the table. This can be done using logistic regression methods as outlined in Chapter 5. As a comparison, we will analyze the data on page 111 as a logistic regression. A Genmod program for this analysis is:
    DATA snoring;
    INPUT x n snoring $;
    CARDS;
    51 467 Yes
    59 2017 No
    RUN;
    PROC GENMOD DATA=snoring ORDER=data;
    CLASS snoring;
    MODEL x/n = snoring / DIST=Binomial LINK=logit;
    RUN;
The corresponding output is

    Analysis Of Parameter Estimates

                                  Standard      Wald 95%
    Parameter       DF  Estimate     Error  Confidence Limits  ChiSquare  Pr > ChiSq
    Intercept        1   -3.5021    0.1321  -3.7611   -3.2432     702.47      <.0001
    snoring   Yes    1    1.4033    0.1987   1.0139    1.7927      49.89      <.0001
    snoring   No     0    0.0000    0.0000   0.0000    0.0000          .           .
    Scale            0    1.0000    0.0000   1.0000    1.0000
We note that the parameter estimate for snoring is 1.4033. This is the same as the estimate of the interaction parameter for the saturated log-linear model. The odds ratio is OR = exp(1.4033) = 4.07, which is also the same as in the log-linear model. This suggests that contingency tables where one of the variables may be regarded as a binary response can be analyzed either as a log-linear model or using logistic regression. Note, however, that the models are written in different ways. The saturated log-linear model regards the counts as functions of the row and column variables and their interaction: count = snoring heart snoring*heart. The saturated logistic model regards the proportion of persons with heart problems as a function of snoring status: x/n = snoring. Although the models are written in different ways, the results and the interpretations are identical.
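Because the logistic model above is saturated, its estimates can be written down directly from the four cell counts. The sketch below (variable names are ours) reproduces the estimates, standard errors, Wald confidence limits and Wald chi-square statistics in the output, using the usual large-sample standard error of a log odds, the square root of the sum of reciprocal cell counts.

```python
import math

# Persons with heart problems (x) out of n, by snoring status.
x_yes, n_yes = 51, 467
x_no, n_no = 59, 2017

def log_odds(x, n):
    return math.log(x / (n - x))

intercept = log_odds(x_no, n_no)                 # reference level: snoring = No
snoring_effect = log_odds(x_yes, n_yes) - intercept

se_intercept = math.sqrt(1 / x_no + 1 / (n_no - x_no))
se_snoring = math.sqrt(1 / x_yes + 1 / (n_yes - x_yes) + 1 / x_no + 1 / (n_no - x_no))

wald_intercept = (intercept / se_intercept) ** 2
wald_snoring = (snoring_effect / se_snoring) ** 2
ci_snoring = (snoring_effect - 1.96 * se_snoring, snoring_effect + 1.96 * se_snoring)
odds_ratio = math.exp(snoring_effect)
print(round(intercept, 4), round(snoring_effect, 4), round(se_snoring, 4),
      round(wald_snoring, 2), round(odds_ratio, 2))
```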
6.6.2
Nominal logistic regression
In log-linear models there is no variable that is regarded as the “dependent” variable. The treatment of the row and column variables is symmetric. In some cases it may be preferable to regard one nominal variable as the response. In such cases data can be modeled using nominal logistic regression. The idea is as follows: One category of the nominal response variable is selected as a baseline, or reference, category. If category 1 is the baseline, the logits for the other categories, compared with the first, are

$$\operatorname{logit}(p_j) = \log\left(\frac{p_j}{p_1}\right) = X\beta_j .$$

Thus, if the response has J categories, we write J − 1 logit equations, one for each category except for the baseline, each with its own parameter vector. These logit equations should be estimated simultaneously, which makes the problem multivariate. At present, nominal logit models are not available in Proc Genmod, except for the case of ordinal response which is discussed in Chapter 7.
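The baseline-category logits determine all category probabilities. As a sketch in notation introduced here for illustration, let J be the number of response categories and write η_j for the linear predictor of the logit equation for category j = 2, ..., J, with category 1 as baseline. Solving the logit equations together with the constraint that the probabilities sum to one gives:

```latex
p_1 = \frac{1}{1 + \sum_{k=2}^{J} \exp(\eta_k)}, \qquad
p_j = \frac{\exp(\eta_j)}{1 + \sum_{k=2}^{J} \exp(\eta_k)}, \quad j = 2, \dots, J .
```

With J = 2 this reduces to the ordinary logistic regression model of Chapter 5.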
6.7
Capture-recapture data
Capture-recapture data provide an interesting application of log-linear models. Suppose that there are M individuals in a population; M is unknown and we want to estimate M. We capture and mark n1 of the individuals. After some time we capture another n2 individuals. It turns out that s of these were marked. It is now relatively straightforward to estimate M as

$$\hat M = \frac{n_1}{\hat p} = \frac{n_1 n_2}{s} \tag{6.10}$$

where $\hat p = s/n_2$ is the proportion of marked individuals in the second sample.
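A small numerical illustration of (6.10), with hypothetical counts chosen only to make the arithmetic easy to follow (the function name is ours):

```python
def two_occasion_estimate(n1, n2, s):
    """Population size estimate from (6.10): n1 marked, n2 recaptured, s of them marked."""
    if s <= 0:
        raise ValueError("at least one marked individual must be recaptured")
    return n1 * n2 / s

# Hypothetical data: 200 marked, 150 in the second sample, 30 of which carry a mark.
m_hat = two_occasion_estimate(200, 150, 30)
print(m_hat)
```

Here 20% of the second sample is marked, so the 200 marked individuals are taken to be 20% of the population.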
If the individuals are captured on three occasions, the data can be written as a three-way contingency table. There are eight different “capture patterns”:

    Notation                           Captured at occasion
    $n_{123}$                          1, 2 and 3
    $n_{\bar{1}23}$                    2 and 3
    $n_{1\bar{2}3}$                    1 and 3
    $n_{\bar{1}\bar{2}3}$              3
    $n_{12\bar{3}}$                    1 and 2
    $n_{\bar{1}2\bar{3}}$              2
    $n_{1\bar{2}\bar{3}}$              1
    $n_{\bar{1}\bar{2}\bar{3}}$        None
If we assume independence between occasions, the probability that an individual is never captured is

$$\hat p_{\bar{1}\bar{2}\bar{3}} = \left(1 - \frac{n_1}{M}\right)\left(1 - \frac{n_2}{M}\right)\left(1 - \frac{n_3}{M}\right) \tag{6.11}$$

Thus, an estimator of the number of individuals that have never been captured is

$$\hat n_{\bar{1}\bar{2}\bar{3}} = M \cdot \hat p_{\bar{1}\bar{2}\bar{3}} = M \left(1 - \frac{n_1}{M}\right)\left(1 - \frac{n_2}{M}\right)\left(1 - \frac{n_3}{M}\right) \tag{6.12}$$

and an estimate of the unknown population size can be obtained by solving

$$M = n + M \left(1 - \frac{n_1}{M}\right)\left(1 - \frac{n_2}{M}\right)\left(1 - \frac{n_3}{M}\right) \tag{6.13}$$

for M, where n is the number of individuals that have been captured at least once. There are some drawbacks with the method outlined so far. We have to assume independence between occasions, and we only use the information in the margins of the table. A more flexible analysis of this kind of data can be obtained by using log-linear models. If the occasions are independent, it would hold that, for example, $p_{123} = p_1 p_2 p_3$. The expected number of individuals in this cell would then be

$$\mu_{123} = M p_1 p_2 p_3 \tag{6.14}$$

Taking logarithms,

$$\ln(\mu_{123}) = \ln M + \ln p_1 + \ln p_2 + \ln p_3 = \mu + \alpha_1 + \beta_1 + \gamma_1 \tag{6.15}$$

where the occasions correspond to α, β and γ, respectively. In a similar way, we could write the expected numbers in all cells as linear functions of parameters. This is a log-linear model. If the occasions are not independent, we can include parameters like $(\alpha\beta)_{ij}$ that account for the dependence. Thus, a general log-linear model can be written as

$$\ln(\mu_{ijk}) = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk} \tag{6.16}$$

Log-linear models are now often used to model capture-recapture data; see Olsson (2000). The model specification includes a Poisson distribution for the numbers in each cell of the table, a log link and a feature to account for the fact that it is impossible to observe $n_{\bar{1}\bar{2}\bar{3}}$, the number of individuals that have never been captured.

Example 6.5 Table 6.1 summarizes information about persons who were heavy misusers of drugs in Sweden in 1979. The individuals could appear in registers within the health care system, social authorities, penal system, police or customs, or others. These correspond to the “captures”, in a capture-recapture sense. It is reasonable to assume that some of these sources of information are related. For example, an individual who has been taken in charge by police is quite likely to appear also in the penal system at some stage. Thus, interactions between some or all of these sources are likely. The SAS programs that were used for analysis had the structure
    proc genmod;
    class x1 x2 x3 x4 x5;
    model count=x1 x2 x3 x4 x5 / dist=Poisson obstats residuals;
    weight w;
    run;

In this program, x1 to x5 refer to the following sources of information:

    Code   Source of information
    x1     Health care
    x2     Social authorities
    x3     Penal system
    x4     Police, customs
    x5     Others

Table 6.1: Swedish drug addicts with different capture patterns in 1979.

    Hospital   Social        Penal    Police,
    care       authorities   system   customs   Others   Count
    0          0             0        0         0            0
    0          0             0        0         1           45
    0          0             0        1         0         2080
    0          0             0        1         1           11
    0          0             1        0         0         1056
    0          0             1        0         1           11
    0          0             1        1         0          942
    0          0             1        1         1            9
    0          1             0        0         0         1011
    0          1             0        0         1           59
    0          1             0        1         0          381
    0          1             0        1         1           15
    0          1             1        0         0          245
    0          1             1        0         1           18
    0          1             1        1         0          345
    0          1             1        1         1           13
    1          0             0        0         0          828
    1          0             0        0         1            7
    1          0             0        1         0          179
    1          0             0        1         1            1
    1          0             1        0         0          191
    1          0             1        0         1            3
    1          0             1        1         0          137
    1          0             1        1         1            1
    1          1             0        0         0          264
    1          1             0        0         1           18
    1          1             0        1         0          132
    1          1             0        1         1            9
    1          1             1        0         0          144
    1          1             1        0         1           16
    1          1             1        1         0          133
    1          1             1        1         1           15
The weight w has been set to 1 for all combinations except the combination where all x1,...,x5 are 0. This combination cannot be observed and is a structural zero. In the program example it is implicitly assumed that the different sources of information are independent. However, interactions can be included in the model as e.g. x1*x2. In this rather large data set all two-way interactions were significant. In addition, the interaction x1*x3*x4 was also significant. Thus, the final model was count =x1|x2|x3|x4|x5 @2 x1*x3*x4; This model had a good fit to the data (χ2 = 17.5 on 14 df ). The model estimates the number of uncaptured individuals as 8878 individuals with confidence interval 7640 − 10317 individuals. This would mean that the number of drug addicts in Sweden in 1979 was 8319 + 8878 = 17197 individuals. This is more than 5000 individuals higher than the published result, which was 12000 (Socialdepartementet 1980). The published result was obtained through capture-recapture methods but assuming that the different sources of information are independent. ¤
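For comparison with the log-linear approach, the independence-based estimate defined by equation (6.13) has no closed form for three occasions but is easy to obtain numerically. The sketch below solves (6.13) by bisection; the capture counts are hypothetical (they are not the Swedish register data), and the function names are ours.

```python
def expected_captured(M, counts):
    """Expected number captured at least once if occasions are independent."""
    p_never = 1.0
    for n_i in counts:
        p_never *= 1.0 - n_i / M
    return M * (1.0 - p_never)

def solve_population_size(counts, n_distinct, upper=1e9, tol=1e-10):
    """Solve M = n + M * prod(1 - n_i / M) for M by bisection."""
    lo = float(max(max(counts), n_distinct))
    hi = float(upper)
    if expected_captured(lo, counts) > n_distinct or expected_captured(hi, counts) < n_distinct:
        raise ValueError("no solution bracketed; check the counts")
    while hi - lo > tol * lo:
        mid = 0.5 * (lo + hi)
        if expected_captured(mid, counts) < n_distinct:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical example: 100, 120 and 80 individuals seen on the three occasions,
# 250 distinct individuals seen in total.
counts = (100, 120, 80)
n_distinct = 250
m_hat = solve_population_size(counts, n_distinct)
print(round(m_hat, 1), round(m_hat - n_distinct, 1))
```

The second printed value is the estimated number of individuals never captured, corresponding to (6.12).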
6.8
Poisson regression models
We have seen that log-linear models for cross-tabulations can be handled as generalized linear models. The linear predictor then consists of a design matrix that contains dummy variables for the different margins of the table. It is quite possible to introduce quantitative variables into the model, in a similar way as for regression models.

Example 6.6 The table below, taken from Haberman (1978), shows the distribution of stressful events reported by 147 subjects who have experienced exactly one stressful event. The table gives the number of persons reporting a stressful event 1, 2, ..., 18 months prior to the interview. We want to model the occurrence of stressful events as a function of time.

    Months    1   2   3   4   5   6   7   8   9
    Number   15  11  14  17   5  11  10   4   8

    Months   10  11  12  13  14  15  16  17  18
    Number   10   7   9  11   3   6   1   1   4
One approach to modelling the occurrence of stressful events as a function of X = months is to assume that the number of persons responding for any month is a Poisson variate. The canonical link for the Poisson distribution is log, so a first attempt at modelling these data is to assume that

$$\log(\mu) = \beta_0 + \beta_1 x \tag{6.17}$$

This is a generalized linear model with a Poisson distribution, a log link and a simple linear predictor. A SAS program for this model can be written as

    DATA stress;
    INPUT months number @@;
    CARDS;
    1 15 2 11 3 14 4 17 5 5 6 11 7 10 8 4 9 8
    10 10 11 7 12 9 13 11 14 3 15 6 16 1 17 1 18 4
    ;
    PROC GENMOD DATA=stress;
    MODEL number = months / DIST=poisson LINK=log OBSTATS RESIDUALS;
    MAKE 'obstats' out=ut;
    RUN;
Part of the output is

    Criteria For Assessing Goodness Of Fit

    Criterion             DF        Value     Value/DF
    Deviance              16      24.5704       1.5356
    Scaled Deviance       16      24.5704       1.5356
    Pearson Chi-Square    16      22.7145       1.4197
    Scaled Pearson X2     16      22.7145       1.4197
    Log Likelihood         .     174.8451            .

The data have a less than perfect fit to the model, with Value/df = 1.54; the p-value is 0.078.

    Analysis Of Parameter Estimates

    Parameter   DF   Estimate   Std Err   ChiSquare   Pr>Chi
    INTERCEPT    1     2.8032    0.1482    357.9476   0.0001
    MONTHS       1    -0.0838    0.0168     24.8639   0.0001
    SCALE        0     1.0000    0.0000           .        .

We find that the memory of stressful events fades away as log(µ) = 2.80 − 0.084x. A plot of the data, along with the fitted regression line, is given as Figure 6.1. Figure 6.2 shows the data and regression line with a log scale for the y-axis. ¤
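To show what the fitting involves, the sketch below fits model (6.17) to the stress data by Newton-Raphson iteration on the Poisson log-likelihood. This is a hand-rolled illustration, not the Genmod implementation; all names are ours. It returns the estimates, their standard errors (from the inverse information matrix) and the deviance.

```python
import math

months = list(range(1, 19))
number = [15, 11, 14, 17, 5, 11, 10, 4, 8, 10, 7, 9, 11, 3, 6, 1, 1, 4]

def fit_poisson_log_link(x, y, max_iter=100, tol=1e-12):
    """Newton-Raphson fit of log(mu) = b0 + b1 * x with a Poisson response."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix (2 x 2)
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 ** 2
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    # Recompute fitted values and information at the final estimates
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    i00 = sum(mu)
    i01 = sum(xi * mi for xi, mi in zip(x, mu))
    i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
    det = i00 * i11 - i01 ** 2
    se0, se1 = math.sqrt(i11 / det), math.sqrt(i00 / det)
    deviance = 2.0 * sum((yi * math.log(yi / mi) if yi > 0 else 0.0) - (yi - mi)
                         for yi, mi in zip(y, mu))
    return b0, b1, se0, se1, deviance

b0, b1, se0, se1, deviance = fit_poisson_log_link(months, number)
print(round(b0, 4), round(b1, 4), round(se0, 4), round(se1, 4), round(deviance, 4))
```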
Figure 6.1: Distribution of persons remembering stressful events
Figure 6.2: Distribution of persons remembering stressful events; log scale
6.9
A designed experiment with a Poisson distribution
Example 6.7 The number of wireworms counted in the plots of a Latin square experiment following soil fumigations in the previous year is given in the following table (2). Each cell shows the treatment (K, M, N, O or P) followed by the count.

                        Column
    Row       1       2       3       4       5
    1       P  3    O  2    N  5    K  1    M  4
    2       M  6    K  0    O  6    N  4    P  4
    3       O  4    M  9    K  1    P  6    N  5
    4       N 17    P  8    M  8    O  9    K  0
    5       K  4    N  4    P  2    M  4    O  8

We may model the number of wireworms in a certain plot as a Poisson distribution. The design includes a Row effect, a Column effect and a Treatment effect. Thus, an “ANOVA-like” model for these data can be written as

$$g(\mu) = \beta_0 + \alpha_i + \beta_j + \tau_k \tag{6.18}$$

where β0 is a general mean, αi is a row effect, βj is a column effect and τk is the effect of treatment k. A SAS program for analysis of these data using Proc Genmod is:

    DATA Poisson;
    INPUT Row Col Treat $ Count;
    CARDS;
    1 1 P 3
    1 2 O 2
    1 3 N 5
    ... More data lines ...
    5 4 M 4
    5 5 O 8
    ;
    PROC GENMOD DATA=Poisson;
    CLASS row col treat;
    MODEL Count = row col treat / Dist=Poisson Link=Log Type3;
    RUN;

(2) Data from Snedecor and Cochran (1980). The original analysis was an Anova on data transformed as $y = \sqrt{x + 1}$.
The output is:

    The GENMOD Procedure

    Model Information

    Description           Value
    Data Set              WORK.POISSON
    Distribution          POISSON
    Link Function         LOG
    Dependent Variable    COUNT
    Observations Used     25

    Criteria For Assessing Goodness Of Fit

    Criterion             DF        Value     Value/DF
    Deviance              12      19.5080       1.6257
    Scaled Deviance       12      19.5080       1.6257
    Pearson Chi-Square    12      18.0096       1.5008
    Scaled Pearson X2     12      18.0096       1.5008
    Log Likelihood         .      97.0980            .

The fit of the model is reasonable but not perfect; the p value is 0.077. Ideally, Value/df should be closer to 1.

    Analysis Of Parameter Estimates

    Parameter        DF   Estimate   Std Err   ChiSquare   Pr>Chi
    INTERCEPT         1     1.4708    0.3519     17.4670   0.0001
    ROW        1      1    -0.4419    0.3404      1.6851   0.1942
    ROW        2      1    -0.1751    0.3175      0.3041   0.5813
    ROW        3      1     0.0451    0.2980      0.0229   0.8796
    ROW        4      1     0.5699    0.2729      4.3618   0.0368
    ROW        5      0     0.0000    0.0000           .        .
    COL        1      1     0.3045    0.2892      1.1087   0.2924
    COL        2      1    -0.0506    0.3099      0.0267   0.8703
    COL        3      1    -0.0936    0.3207      0.0852   0.7704
    COL        4      1    -0.0636    0.3093      0.0423   0.8370
    COL        5      0     0.0000    0.0000           .        .
    TREAT      K      1    -1.3797    0.4627      8.8906   0.0029
    TREAT      M      1     0.2910    0.2789      1.0888   0.2967
    TREAT      N      1     0.3324    0.2760      1.4502   0.2285
    TREAT      O      1     0.2003    0.2854      0.4928   0.4827
    TREAT      P      0     0.0000    0.0000           .        .
    SCALE             0     1.0000    0.0000           .        .
    LR Statistics For Type 3 Analysis

    Source   DF   ChiSquare   Pr>Chi
    ROW       4     14.3595   0.0062
    COL       4      2.8225   0.5880
    TREAT     4     25.1934   0.0001

We find a significant Row effect and a highly significant Treatment effect. It is interesting to note that the GLM analysis of square root transformed data, as suggested by Snedecor and Cochran (1980), results in a significant treatment effect (p = 0.02) but no significant row or column effects. This may be related to the fact that the model fit is not perfect. We will return to these data later. ¤
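The p-values attached to the deviance and to the Type 3 likelihood ratio statistics in Example 6.7 are upper tail probabilities of χ2 distributions. For an even number of degrees of freedom this tail probability has a simple closed form, which the sketch below uses (the function name is ours; for odd df a numerical routine would be needed instead).

```python
import math

def chi2_sf_even_df(x, df):
    """Upper tail probability P(X > x) for a chi-square variable with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j          # (x/2)^j / j!
        total += term
    return math.exp(-half) * total

p_deviance = chi2_sf_even_df(19.5080, 12)   # goodness of fit, wireworm model
p_row = chi2_sf_even_df(14.3595, 4)
p_col = chi2_sf_even_df(2.8225, 4)
p_treat = chi2_sf_even_df(25.1934, 4)
print(round(p_deviance, 3), round(p_row, 4), round(p_col, 4), p_treat)
```

The treatment p-value is far below 0.0001; the output simply reports 0.0001 as the smallest value it prints.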
6.10
Rate data
Events that may be assumed to be essentially Poisson are sometimes recorded on units of different size. For example, the number of crimes recorded in a number of cities depends on the size of the city, such that “crimes per 1000 inhabitants” is a meaningful measure of crime rate. Data of this type are called rate data. If we denote the measure of size with t, we can model this type of data as

$$\log\left(\frac{\mu}{t}\right) = X\beta \tag{6.19}$$

which means that

$$\log(\mu) = \log(t) + X\beta \tag{6.20}$$
The adjustment term log(t) is called an offset. The offset can easily be included in models analyzed with e.g. Proc Genmod.

Example 6.8 The data below, quoted from Agresti (1996), are accident rates for elderly drivers, subdivided by sex. For each sex the number of person years (in thousands) is also given. The data refer to 16262 Medicaid enrollees.

                                   Females    Males
    No. of accidents                   175      320
    No. of person years ('000)       17.30    21.40

From the raw data we can calculate accident rates as 175/17.30 = 10.1 per 1000 person years for females and 320/21.40 = 15.0 per 1000 person years for males. A simple way to model these data is to use a generalized linear model with a Poisson distribution, a log link, and to use the number of person years as an offset. This is done with the following program:

    DATA accident;
    INPUT sex $ accident persyear;
    logpy=log(persyear);
    CARDS;
    Male 320 21.400
    Female 175 17.300
    ;
    PROC GENMOD DATA=accident;
    CLASS sex;
    MODEL accident = sex / LINK = log DIST = poisson OFFSET = logpy ;
    RUN;
The output is

    Criterion             DF        Value     Value/DF
    Deviance               0       0.0000            .
    Scaled Deviance        0       0.0000            .
    Pearson Chi-Square     0       0.0000            .
    Scaled Pearson X2      0       0.0000            .
    Log Likelihood         .    2254.7003            .

The model is a saturated model so we can’t assess the over-all fit of the model by using the deviance.

    Analysis Of Parameter Estimates

    Parameter          DF   Estimate   Std Err   ChiSquare   Pr>Chi
    INTERCEPT           1     2.7049    0.0559   2341.3269   0.0001
    SEX       Female    1    -0.3909    0.0940     17.2824   0.0001
    SEX       Male      0     0.0000    0.0000           .        .
    SCALE               0     1.0000    0.0000           .        .
The parameter estimate for females is −0.39. The model can be written as log(µ) = log(t) + β0 + β1 x where x is a dummy variable taking the value 1 for females and 0 for males. Since the offset turns the model into a model for rates, the estimate can be interpreted as a rate ratio: exp(−0.3909) = 0.676. The risk of having an accident for a female is 68% of the risk for a male. This difference is significant; however, other factors that may affect the risk of accident, for example differences in driving distance, are not included in this model. ¤
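Because the model in Example 6.8 is saturated, the fitted rates equal the observed rates, and the parameter estimates can be reproduced directly from the raw data. A minimal sketch (variable names are ours; the standard error uses the usual large-sample result for the log of a ratio of two Poisson counts):

```python
import math

accidents = {"Female": 175, "Male": 320}
person_years = {"Female": 17.30, "Male": 21.40}   # in thousands

rate = {sex: accidents[sex] / person_years[sex] for sex in accidents}

intercept = math.log(rate["Male"])                  # reference level: Male
beta_female = math.log(rate["Female"] / rate["Male"])
rate_ratio = math.exp(beta_female)
se_beta_female = math.sqrt(1 / accidents["Female"] + 1 / accidents["Male"])
wald_chi_square = (beta_female / se_beta_female) ** 2
print(round(intercept, 4), round(beta_female, 4), round(rate_ratio, 3),
      round(se_beta_female, 4), round(wald_chi_square, 2))
```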
6.11
Overdispersion in Poisson models
Overdispersion means that the variance of the response variable is larger than would be expected for the chosen distribution. For Poisson data we would expect the variance to be equal to the mean. As noted earlier, the presence of overdispersion may be related to mistakes in the formulation of the generalized linear model: the distribution, the link function and/or the linear predictor. The effect of overdispersion is that p-values for tests are deflated: it becomes “too easy” to get significant results.
6.11.1
Modeling the scale parameter
If the model is correct, overdispersion may be caused by heterogeneity among the observations. One way to account for such heterogeneity is to introduce a scale parameter φ into the variance function. Thus, we would assume that Var(Y) = φσ², where the parameter φ can be estimated as (Deviance)/df or as χ2/df, where χ2 is the Pearson Chi-square.

Example 6.9 In the analysis of data on number of wireworms in a Latin square experiment (page 129), there were some indications of overdispersion. The ratio Deviance/df was 1.63. We re-analyze these data but ask the Genmod procedure to use 1.63 as an estimate of the dispersion parameter φ. The program is:

    PROC GENMOD data=poisson;
    CLASS row col treat;
    MODEL count = row col treat / dist=Poisson Link=log type3 scale=1.63 ;
    RUN;
Output:

    Criteria For Assessing Goodness Of Fit

    Criterion             DF        Value     Value/DF
    Deviance              12      19.5080       1.6257
    Scaled Deviance       12       7.3424       0.6119
    Pearson Chi-Square    12      18.0096       1.5008
    Scaled Pearson X2     12       6.7784       0.5649
    Log Likelihood         .      36.5456            .

    LR Statistics For Type 3 Analysis

    Source   DF   ChiSquare   Pr>Chi
    ROW       4      5.4046   0.2482
    COL       4      1.0623   0.9002
    TREAT     4      9.4823   0.0501
Fixing the scale parameter to 1.63 has a rather dramatic effect on the result. Note from the output that the scaled deviance is 19.5080/1.63² = 7.3424; the deviance and the likelihood ratio statistics have thus been divided by the square of the value given in the scale option. In our previous analysis of these data, the treatment effect was highly significant (p = 0.0001), and the row effect was significant (p = 0.0062). In our new analysis even the treatment effect is above the 0.05 limit. In the original analysis of these data (Snedecor and Cochran, 1980), only the treatment effect was significant (p = 0.021). Note that the Genmod procedure has an automatic feature to base the analysis on a scale parameter estimated by the Maximum Likelihood method; see the SAS manual for details. ¤
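The effect of the scale option in Example 6.9 can be reproduced from the unscaled output alone. The sketch below divides the original deviance and Type 3 statistics by the square of the scale value, which is what the Genmod output shown above implies, and recomputes the p-values on 4 df. The function name is ours, and the tail formula is the closed form valid for exactly 4 df.

```python
import math

def chi2_sf_4df(x):
    """Upper tail probability of a chi-square variable with 4 df (closed form)."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

scale = 1.63
unscaled_deviance = 19.5080
unscaled_lr = {"ROW": 14.3595, "COL": 2.8225, "TREAT": 25.1934}

scaled_deviance = unscaled_deviance / scale ** 2
scaled_lr = {source: stat / scale ** 2 for source, stat in unscaled_lr.items()}
p_values = {source: chi2_sf_4df(stat) for source, stat in scaled_lr.items()}

print(round(scaled_deviance, 4))
for source in unscaled_lr:
    print(source, round(scaled_lr[source], 4), round(p_values[source], 4))
```

If the dispersion itself, rather than its square root, were taken to be about 1.63, the statistics would be divided by 1.63 instead, giving a less drastic adjustment.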
6.11.2
Modeling as a Negative binomial distribution
If a Poisson model shows signs of overdispersion, an alternative approach is to replace the Poisson distribution with a Negative binomial distribution. This idea can be traced back to Student (1907), who studied counts of red blood cells. This distribution can be derived in two ways. For a series of Bernoulli trials, suppose that we are studying the number of trials (y) until we have recorded r successes. The probability of success is p. The distribution for y is

$$P(y) = \binom{y-1}{r-1} p^r (1-p)^{y-r} \tag{6.21}$$

for y = r, (r + 1), .... This is the binomial waiting time distribution. If r = 1, it is called a geometric distribution. Using the Gamma function, the distribution can be defined even for non-integer values of r. When r is an integer, it is called the Pascal distribution. The distribution has mean value $E(y) = \frac{r}{p}$ and variance $Var(y) = \frac{r(1-p)}{p^2}$.

A second way to derive the negative binomial distribution is as a so-called compound distribution. Suppose that the response for individual i can be modeled as a Poisson distribution with mean value µi. Suppose further that the distribution of the mean values µi over individuals follows a Gamma distribution. It can be shown that the resulting compound distribution for y is a negative binomial distribution. The negative binomial distribution has a higher probability for the zero count, and a longer tail, than a Poisson distribution with the same mean value. Because of this, and because of the relation to compound distributions, it is often used as an alternative to the Poisson distribution when over-dispersion can be suspected. The negative binomial distribution is available in Proc Genmod in SAS, version 8.
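The waiting time form (6.21) and its moments can be checked numerically. The sketch below evaluates the probability function for hypothetical values r = 3 and p = 0.4 (chosen only for illustration) and sums far enough into the tail that the remaining probability is negligible; the function name is ours.

```python
import math

def waiting_time_pmf(y, r, p):
    """P(y) from (6.21): probability that the r-th success occurs on trial y."""
    if y < r:
        return 0.0
    return math.comb(y - 1, r - 1) * p ** r * (1.0 - p) ** (y - r)

r, p = 3, 0.4                      # hypothetical values
ys = range(r, 400)                 # the tail beyond 400 trials is negligible here
total = sum(waiting_time_pmf(y, r, p) for y in ys)
mean = sum(y * waiting_time_pmf(y, r, p) for y in ys)
variance = sum((y - mean) ** 2 * waiting_time_pmf(y, r, p) for y in ys)

print(round(total, 6), round(mean, 6), round(variance, 6))
print(r / p, r * (1 - p) / p ** 2)   # theoretical mean and variance
```

The variance exceeds the mean here, which is the property that makes the distribution useful for overdispersed counts.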
6.12
Diagnostics
Model diagnostics for Poisson models follow the same lines as for other generalized linear models.

Example 6.10 We can illustrate some diagnostic plots using data from the Wireworm example on page 129. The residuals and predicted values were stored in a file. The standardized deviance residuals were plotted against the predicted values, and a normal probability plot of the residuals was prepared. The results are given in Figure 6.3 and Figure 6.4. Both plots indicate a reasonable behavior of the residuals. We cannot see any irregularities in the plot of residuals against fitted values, and the normal plot is rather linear. ¤
Figure 6.3: Plot of standardized deviance residuals against fitted values for the Wireworm example.
Figure 6.4: Normal probability plot for the wireworm data.
6.13
Exercises
Exercise 6.1 The data in Exercise 1.4 are of a kind that can often be approximated by a Poisson distribution. Re-analyze the data using Poisson regression. Prepare a graph of the relation and compare the results with the results from Exercise 1.4. The data are repeated here for your convenience: Gowen and Price counted the number of lesions of Aucuba mosaic virus after exposure to X-rays for various times. The results were:

    Minutes exposure   Count
     0                   271
    15                   108
    30                    59
    45                    29
    60                    12
Exercise 6.2 The following data consist of failures of pieces of electronic equipment operating in two modes. For each observation, Mode1 is the time spent in one mode and Mode2 is the time spent in the other. The total number of failures recorded in each period is also recorded.

    Mode1   Mode2   Failures
     33.3    25.3         15
     52.2    14.4          9
     64.7    32.5         14
    137.0    20.5         24
    125.9    97.6         27
    116.3    53.6         27
    131.7    56.6         23
     85      87.3         18
     91.9    47.8         22
Fit a Poisson regression model to these data, using Failures as a dependent variable and Mode1 and Mode2 as predictors. In the original analysis (Jørgensen 1961) an Identity link was used. Try this, but also try a log link. Which model seems to fit best?

Exercise 6.3 The following data were taken from the Statlib database (Internet address http://lib.stat.edu/datasets). They have also been used by McCullagh and Nelder (1989). The source of the data is the Lloyd Register of Shipping. The purpose of the analysis is to find variables that are related to the number of damage incidents for ships. The following variables are available:

    type        Ship type A, B, C, D or E
    yr_constr   Year of construction in 5-year intervals
    per_op      Period of operation: 1960-74, 1975-79
    mon_serv    Aggregate months service for ships in this cell
    incident    Number of damage incidents

    Type   yr_constr   per_op   mon_serv   incident
    A      60          60            127          0
    A      60          75             63          0
    A      65          60           1095          3
    A      65          75           1095          4
    A      70          60           1512          6
    A      70          75           3353         18
    A      75          60              0          *
    A      75          75           2244         11
    B      60          60          44882         39
    B      60          75          17176         29
    B      65          60          28609         58
    B      65          75          20370         53
    B      70          60           7064         12
    B      70          75          13099         44
    B      75          60              0          *
    B      75          75           7117         18
    C      60          60           1179          1
    C      60          75            552          1
    C      65          60            781          0
    C      65          75            676          1
    C      70          60            783          6
    C      70          75           1948          2
    C      75          60              0          *
    C      75          75            274          1
    D      60          60            251          0
    D      60          75            105          0
    D      65          60            288          0
    D      65          75            192          0
    D      70          60            349          2
    D      70          75           1208         11
    D      75          60              0          *
    D      75          75           2051          4
    E      60          60             45          0
    E      60          75              0          *
    E      65          60            789          7
    E      65          75            437          7
    E      70          60           1157          5
    E      70          75           2161         12
    E      75          60              0          *
    E      75          75            542          1

The number of damage incidents is reported as * if it is a “structural zero”. Fit a model predicting the number of damage incidents based on the other variables. Use the following instructions:
• Use a Poisson model with a log link.
• Use log (Aggregate months service) as an offset.
• Use all predictors as class variables. Include any necessary interactions.
• If necessary, try to model any overdispersion in the data.
Note: some of the observations are “structural zeros”. For example, a ship constructed in 1975-79 cannot operate during the period 1960-74.

Exercise 6.4 The data in table 6.8 (page 131) are rather simple. It is quite possible to calculate the parameter estimates by hand. Make that calculation.

Exercise 6.5 An experiment analyzes imperfection rates for two processes used to fabricate silicon wafers for computer chips. For treatment A applied to 10 wafers the numbers of imperfections are 8, 7, 6, 6, 3, 4, 7, 2, 3, 4. Treatment B applied to 10 wafers has 9, 9, 8, 14, 8, 13, 11, 5, 7, 6 imperfections. The counts were treated as independent Poisson variates in a generalized linear model with a log link. Parts of the results were:

    Criteria For Assessing Goodness Of Fit

    Criterion             DF        Value     Value/DF
    Deviance              18      16.2676       0.9038
    Scaled Deviance       18      16.2676       0.9038
    Pearson Chi-Square    18      16.0444       0.8914
    Scaled Pearson X2     18      16.0444       0.8914
    Log Likelihood               138.2221

    Analysis Of Parameter Estimates

                              Standard      Wald 95%
    Parameter   DF  Estimate     Error  Confidence Limits  ChiSquare  Pr > ChiSq
    Intercept    1    1.6094    0.1414
    treat        1    0.5878    0.1764
    Scale        0    1.0000    0.0000

A model with only an intercept term gave a deviance of 27.857 on 19 degrees of freedom.
A. Test the hypothesis H0 : µA = µB using (i) a Likelihood ratio test; (ii) a Wald test.
B. Construct a 95% confidence interval for µB /µA. Hint: What is the relationship between the parameter β and µB /µA?

Exercise 6.6 The following table (from Agresti 1996) gives the number of train miles (in millions) and the number of collisions involving British Rail passenger trains between 1970 and 1984. Is it plausible that the collision counts are independent Poisson variates? Respond by testing a model with only an intercept term. Also, examine whether inclusion of log(miles) as an offset would improve the fit.
Year   Collisions   Miles      Year   Collisions   Miles
1970        3        281       1977        4        264
1971        6        276       1978        1        267
1972        4        268       1979        7        265
1973        7        269       1980        3        267
1974        6        281       1981        5        260
1975        2        271       1982        6        231
1976        2        265       1983        1        249
Exercise 6.7 Rosenberg et al (1988) studied the relationship between coffee drinking, smoking and the risk for myocardial infarction in a case-control study for men under 55 years of age. The data are as follows:

                              Cigarettes per day
Coffee            0              1-24             25-34              35-
per day     Case  Control    Case  Control    Case  Control    Case  Control
0             66     123       30      52       15      12       36      13
1-2          141     179       59      45       53      22       69      25
3-4          113     106       63      65       55      16      119      30
5-           129      80      102      58      118      44      373      85
A. Analyze these data using smoking and coffee drinking as qualitative variables.
B. Assign scores to smoking and coffee drinking and re-analyze the data using these scores as quantitative variables.
C. Compare the analyses in A. and B. in terms of fit. Perform residual analyses.

Exercise 6.8 Even before the space shuttle Challenger exploded on January 28, 1986, NASA had collected data from 23 earlier launches. One part of these data was the number of O-rings that had been damaged at each launch. O-rings are a kind of gasket that prevents hot gas from leaking during takeoff. In total there were six such O-rings on the Challenger. The data included the number of damaged O-rings, and the temperature (in Fahrenheit) at the time of the launch. On the fateful day when the Challenger exploded, the temperature was 31 °F. One might ask whether the probability that an O-ring is damaged is related to the temperature. The following data are available:
No. of Defective   Temperature      No. of Defective   Temperature
O-rings            °F               O-rings            °F
       2               53                  0               70
       1               57                  1               70
       1               58                  1               70
       1               63                  0               72
       0               66                  0               73
       0               67                  0               75
       0               67                  2               75
       0               67                  0               76
       0               68                  0               76
       0               69                  0               78
       0               70                  0               79
                                           0               81
A statistician fitted two alternative generalized linear models to these data: one model with a Poisson distribution and a log link, and another model with a binomial distribution and a logit link. Part of the output from these two analyses is presented below. Deviances for “null models” that only include an intercept were 22.434 (22 df, Poisson model) and 24.2304 (22 df, binomial model).
Poisson model:

Criteria For Assessing Goodness Of Fit

Criterion             DF       Value    Value/DF
Deviance              21     16.8337      0.8016
Scaled Deviance       21     16.8337      0.8016
Pearson Chi-Square    21     28.1745      1.3416
Scaled Pearson X2     21     28.1745      1.3416
Log Likelihood                -14.6442

Analysis Of Parameter Estimates

                            Standard   Wald 95%
Parameter    DF  Estimate      Error   Confidence Limits   ChiSquare   Pr > ChiSq
Intercept     1    5.9691     2.7628
Temp          1   -0.1034     0.0430
Scale         0    1.0000     0.0000
Binomial model:

Criteria For Assessing Goodness Of Fit

Criterion             DF       Value    Value/DF
Deviance              21     18.0863      0.8613
Scaled Deviance       21     18.0863      0.8613
Pearson Chi-Square    21     29.9802      1.4276
Scaled Pearson X2     21     29.9802      1.4276
Log Likelihood                -30.1982

Analysis Of Parameter Estimates

                            Standard   Wald 95%
Parameter    DF  Estimate      Error   Confidence Limits   ChiSquare   Pr > ChiSq
Intercept     1    5.0850     3.0525
Temp          1   -0.1156     0.0470
Scale         0    1.0000     0.0000
A. Test whether temperature has any significant effect on the failure of O-rings using
   i) the Poisson model
   ii) the binomial model
B. Predict the outcome of the response variable if the temperature is 31 °F
   i) for the Poisson model
   ii) for the binomial model
C. Which of the two models do you prefer? Explain why!
D. Using your preferred model, calculate the probability that three or more of the O-rings fail if the temperature is 31 °F.

Exercise 6.9 Agresti (1996) discusses analysis of a set of accident data from Maine. Passengers in all traffic accidents during 1991 were classified by:

Gender     Gender of the person (F or M)
Location   Place of the accident: Urban or Rural
Belt       Whether the person used seat belt (Y or N)
Injury     Whether the person was injured in the accident (Y or N)
A total of 68694 passengers were included in the data. A log-linear model was fitted to these data using Proc Genmod. All main effects and two-way interactions were included. A model with a three-way interaction fitted slightly better but is not discussed here. Part of the results were:
Criteria For Assessing Goodness Of Fit

Criterion             DF          Value    Value/DF
Deviance               5        23.3510      4.6702
Scaled Deviance        5        23.3510      4.6702
Pearson Chi-Square     5        23.3752      4.6750
Scaled Pearson X2      5        23.3752      4.6750
Log Likelihood              536762.6081

Analysis Of Parameter Estimates

                                          Standard
Parameter               DF   Estimate        Error    Wald 95% Limits      ChiSquare   Pr > ChiSq
Intercept                1     5.9599       0.0314    5.8984    6.0213       36133.0       <.0001
gender           F       1     0.6212       0.0288    0.5647    0.6777        463.89       <.0001
gender           M       0     0.0000       0.0000    0.0000    0.0000             .            .
location         R       1     0.2906       0.0290    0.2337    0.3475        100.16       <.0001
location         U       0     0.0000       0.0000    0.0000    0.0000             .            .
belt             N       1     0.7796       0.0291    0.7225    0.8367        716.17       <.0001
belt             Y       0     0.0000       0.0000    0.0000    0.0000             .            .
injury           N       1     3.3309       0.0310    3.2702    3.3916       11563.8       <.0001
injury           Y       0     0.0000       0.0000    0.0000    0.0000             .            .
gender*location  F  R    1    -0.2099       0.0161   -0.2415   -0.1783        169.50       <.0001
gender*location  F  U    0     0.0000       0.0000    0.0000    0.0000             .            .
gender*location  M  R    0     0.0000       0.0000    0.0000    0.0000             .            .
gender*location  M  U    0     0.0000       0.0000    0.0000    0.0000             .            .
gender*belt      F  N    1    -0.4599       0.0157   -0.4907   -0.4292        860.14       <.0001
gender*belt      F  Y    0     0.0000       0.0000    0.0000    0.0000             .            .
gender*belt      M  N    0     0.0000       0.0000    0.0000    0.0000             .            .
gender*belt      M  Y    0     0.0000       0.0000    0.0000    0.0000             .            .
gender*injury    F  N    1    -0.5405       0.0272   -0.5939   -0.4872        394.36       <.0001
gender*injury    F  Y    0     0.0000       0.0000    0.0000    0.0000             .            .
gender*injury    M  N    0     0.0000       0.0000    0.0000    0.0000             .            .
gender*injury    M  Y    0     0.0000       0.0000    0.0000    0.0000             .            .
location*belt    R  N    1    -0.0849       0.0162   -0.1167   -0.0532         27.50       <.0001
location*belt    R  Y    0     0.0000       0.0000    0.0000    0.0000             .            .
location*belt    U  N    0     0.0000       0.0000    0.0000    0.0000             .            .
location*belt    U  Y    0     0.0000       0.0000    0.0000    0.0000             .            .
location*injury  R  N    1    -0.7550       0.0269   -0.8078   -0.7022        784.94       <.0001
location*injury  R  Y    0     0.0000       0.0000    0.0000    0.0000             .            .
location*injury  U  N    0     0.0000       0.0000    0.0000    0.0000             .            .
location*injury  U  Y    0     0.0000       0.0000    0.0000    0.0000             .            .
belt*injury      N  N    1    -0.8140       0.0276   -0.8681   -0.7599        868.65       <.0001
belt*injury      N  Y    0     0.0000       0.0000    0.0000    0.0000             .            .
belt*injury      Y  N    0     0.0000       0.0000    0.0000    0.0000             .            .
belt*injury      Y  Y    0     0.0000       0.0000    0.0000    0.0000             .            .
Scale                    0     1.0000       0.0000    1.0000    1.0000
NOTE: The scale parameter was held fixed.
Calculate and interpret estimated odds ratios for the different factors.
7. Ordinal response
Response variables in the form of judgements or other ordered classifications are called ordinal response variables. Examples of such variables are diagnostics of patients (improved, no change, worse); classification of potatoes (ordinary, high quality, extra high quality); answers to opinion items (agree completely; agree; undecided; disagree; disagree completely); and school marks. Ordinal response variables can be analyzed as nominal response using the methods outlined in Chapter 6. However, this kind of analysis would disregard an important part of the information in the data, namely the fact that the categories are ordered. Alternatively, ordinal data are sometimes analyzed as if the data had been numeric, using some scoring of the response. This approach is often unsatisfactory since the data are then assumed to be “better” than they actually are. Several suggestions on the modeling of ordinal response data have been put forward in the literature. We will briefly review some of these approaches from the point of view of generalized linear models.
7.1 Arbitrary scoring
Example 7.1 Norton and Dunn (1985) studied the relation between snoring and heart problems for a sample of 2484 patients. The data were obtained through interviews with the patients. The amount of snoring was assessed on a scale ranging from “Never” to “Always”, which is an ordinal variable. An interesting question is whether there is any relation between snoring and heart problems. The data are given in the following table:

                               Snoring
Heart problems   Never   Sometimes   Often   Always   Total
Yes                 24          35      21       30     110
No                1355         603     192      224    2374
Total             1379         638     213      254    2484
□
[Figure 7.1: Relation between heart problems and snoring. Vertical axis: proportion with heart problems; horizontal axis: snoring score, 1 (Never), 2 (Sometimes), 3 (Often), 4 (Always).]
The main interest lies in studying a possible dependence between snoring and heart problems. A simple approach to analyzing these data is to ignore the ordinal nature of the data and use a simple $\chi^2$ test of independence or, in this context, the corresponding log-linear model

$$\log(\mu_{ij}) = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} \tag{7.1}$$

This is a saturated model. The test of the hypothesis of independence corresponds to testing the hypothesis that the parameter $(\alpha\beta)_{ij}$ is zero. For the data on page 145, this gives a deviance of 21.97 on 3 df (p < 0.0001) which is, of course, highly significant. This analysis, however, does not use the fact that the snoring variable is ordinal. A plot (Figure 7.1) of the percentage of persons with heart problems in each snoring category suggests that this percentage increases nearly linearly, if we choose the arbitrary scores (1, 2, 3, 4) for the snoring categories. This suggests a simple way of accounting for the ordinal nature of the data. Instead of entering the dependence between the variables as the interaction term $(\alpha\beta)_{ij}$, as in the saturated model (7.1), we write the model as

$$\log(\mu_{ij}) = \mu + \alpha_i + \beta_j + \gamma \cdot u_i \cdot v_j \tag{7.2}$$

where $u_i$ are arbitrary scores for the row variable and $v_j$ are scores for the column variable. In this model, the term $\gamma \cdot u_i \cdot v_j$ captures the linear part of the dependence between the scores. This model is called a linear by linear association model (LL model; see Agresti, 1996).
In a Genmod analysis according to this model we need to arrange the data according to the following data step:

DATA snoring;
  INPUT heart $ snore $ freq u v;
  CARDS;
Yes Never        24 1 1
Yes Sometimes    35 1 2
Yes Often        21 1 3
Yes Always       30 1 4
No  Never      1355 0 1
No  Sometimes   603 0 2
No  Often       192 0 3
No  Always      224 0 4
;

The model request in Genmod can be written as

PROC GENMOD DATA=snoring;
  CLASS heart snore;
  MODEL freq = heart snore u*v / DIST=poisson LINK=log;
RUN;
Part of the output is

Criteria For Assessing Goodness Of Fit

Criterion             DF          Value    Value/DF
Deviance               2         6.2398      3.1199
Scaled Deviance        2         6.2398      3.1199
Pearson Chi-Square     2         6.3640      3.1820
Scaled Pearson X2      2         6.3640      3.1820
Log Likelihood         .     13733.2247           .

Analysis Of Parameter Estimates

Parameter               DF   Estimate   Std Err   ChiSquare   Pr>Chi
INTERCEPT                1     1.9833    0.2258     77.1306   0.0001
HEART      No            1     4.4319    0.2256    386.0188   0.0001
HEART      Yes           0     0.0000    0.0000           .        .
SNORE      Always        1    -1.0289    0.0773    177.1304   0.0001
SNORE      Never         1     0.7912    0.0479    272.4822   0.0001
SNORE      Often         1    -1.1353    0.0794    204.5545   0.0001
SNORE      Sometime      0     0.0000    0.0000           .        .
U*V                      1     0.6545    0.0825     62.8977   0.0001
SCALE                    0     1.0000    0.0000           .        .
NOTE: The scale parameter was held fixed.
Earlier we found that the independence model gives a deviance of 21.97 on 3 df. If we include the single parameter γ for the linear by linear association model we get a model with 2 df and a deviance of 6.24. The difference is 15.73 on 1 df, which is highly significant. The Wald test of the parameter for the linear by linear association is also highly significant ($\chi^2$ = 62.9 on 1 df; p < 0.0001). This indicates that most of the dependence between snoring and heart problems is captured by the linear interaction term. The original analysis using a simple $\chi^2$ test indicated “some form of relationship” between snoring and heart problems. The linear by linear association model suggests that snoring and heart problems may have a positive relationship.
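The comparison of the two deviances can be checked with a few lines of arithmetic. The sketch below is only an illustration, not part of the Genmod analysis: it takes the two deviances and the estimate of γ from the output above, uses the fact that a chi-square variable on 1 df is the square of a standard Normal variable to obtain the p-value, and assumes the scores used in the data step (u = 1 for "Yes", u = 0 for "No", v = 1, ..., 4), under which exp(γ) is the odds ratio for heart problems between two adjacent snoring categories. The function and variable names are invented for the example.

```python
import math

def chi2_sf_1df(x):
    """Upper tail probability P(X > x) for a chi-square variable on 1 df."""
    # For 1 df, X = Z**2 with Z standard Normal, so P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))

dev_independence = 21.97   # deviance of the independence model, 3 df
dev_ll = 6.24              # deviance of the linear by linear model, 2 df
gamma_hat = 0.6545         # estimate of gamma (row U*V in the output)

lr_stat = dev_independence - dev_ll
p_value = chi2_sf_1df(lr_stat)

# With u = 0/1 and v = 1..4, the log odds ratio between adjacent
# snoring categories is gamma * (1 - 0) * (v + 1 - v) = gamma.
odds_ratio_per_step = math.exp(gamma_hat)

print(f"LR statistic: {lr_stat:.2f} on 1 df, p = {p_value:.1e}")
print(f"Odds ratio per one-step increase in snoring score: {odds_ratio_per_step:.3f}")
```

Under the LL model the odds of heart problems are thus estimated to roughly double for each step on the snoring scale, which is one way of reading the "positive relationship" mentioned above.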
7.2 RC models
The method of arbitrary scoring is often useful, but it is subjective in the sense that different choices of scores for the ordinal variables may result in different conclusions. An approach that has been suggested (see e.g. Andersen, 1980) is to include the row and column scores as parameters of the model. Thus, the model can be written as

$$\log(\mu_{ij}) = \mu + \alpha_i + \beta_j + \gamma \cdot u_i \cdot v_j \tag{7.3}$$

where $u_i$ and $v_j$ are now parameters to be estimated from the data. This model, called an RC model, is nonlinear since it includes a product term in the row and column scores. Thus, the model is not formally a generalized linear model. However, Agresti (1985) suggested a method for fitting this model using standard software. The method is iterative. The row scores are kept fixed and the model is fitted for the column scores. These column scores are then kept fixed and the row scores are estimated. These two steps are repeated until convergence. The method seems to converge in most cases.
7.3 Proportional odds
The proportional odds model for an ordinal response variable is a model for cumulative probabilities of type $P(Y \le j) = p_1 + p_2 + \ldots + p_j$, where for simplicity we index the categories of the response variable with integers. The cumulative logits are defined as

$$\operatorname{logit}(P(Y \le j)) = \log\frac{P(Y \le j)}{1 - P(Y \le j)} \tag{7.4}$$

The cumulative logits are defined for each of the categories of the response except the last one, for which $P(Y \le j) = 1$. Thus, for a response variable with 5 categories we would get 5 − 1 = 4 different cumulative logits.
The proportional odds model for ordinal response suggests that all these cumulative logit functions can be modeled as

$$\operatorname{logit}(P(Y \le j)) = \alpha_j + \beta x \tag{7.5}$$

i.e. the functions have different intercepts $\alpha_j$ but a common slope $\beta$. This means that the odds ratio, for two different values $x_1$ and $x_2$ of the predictor $x$, has the form

$$\frac{P(Y \le j \mid x_2) / P(Y > j \mid x_2)}{P(Y \le j \mid x_1) / P(Y > j \mid x_1)}. \tag{7.6}$$

The log of this odds ratio equals $\beta(x_2 - x_1)$, i.e. the log odds ratio is proportional to the difference between $x_2$ and $x_1$. This is why the model is called the proportional odds model.

The proportional odds model is not formally a (univariate) generalized linear model, although it can be seen as a kind of multivariate Glim. The model states that the different cumulative logits, for the different ordinal values of the response, are all parallel but with different intercepts. Thus, the model gives, in a sense, a set of k − 1 related models if the response has k scale steps. The Genmod procedure in SAS version 8 (SAS 2000b), as well as the Logistic procedure in SAS, can handle this type of model.

Example 7.2 We continue with the analysis of the data on page 145. For the sake of illustration we use the ordinal snoring variable as the response, and analyze the data to explore whether the risk of snoring depends on whether the patient has heart problems. A simple way of analyzing this type of data is to use the Logistic procedure of the SAS package:

PROC LOGISTIC DATA=snoring;
  FREQ freq;
  MODEL v = u;
RUN;
The following output is obtained:

Score Test for the Proportional Odds Assumption
Chi-Square = 1.1127 with 2 DF (p=0.5733)

Model Fitting Information and Testing Global Null Hypothesis BETA=0

              Intercept   Intercept and
Criterion     Only        Covariates      Chi-Square for Covariates
AIC           5568.351    5505.632        .
SC            5585.804    5528.903        .
-2 LOG L      5562.351    5497.632        64.719 with 1 DF (p=0.0001)
Score         .           .               68.217 with 1 DF (p=0.0001)
The proportional odds assumption (i.e. the assumption of a common slope) cannot be rejected (p = 0.57). The hypothesis that β is zero is rejected (p < 0.0001). The estimates of the parameters α1, α2 and α3 are given in the next part of the output, along with an estimate of the common slope β. The intercept for the last category is set equal to zero.

Analysis of Maximum Likelihood Estimates

                 Parameter   Standard   Wald         Pr >         Standardized   Odds
Variable    DF   Estimate    Error      Chi-Square   Chi-Square   Estimate       Ratio
INTERCP1     1     0.2824     0.0414      46.5871      0.0001     .              .
INTERCP2     1     1.5545     0.0534     846.5227      0.0001     .              .
INTERCP3     1     2.2792     0.0687    1101.5470      0.0001     .              .
U            1    -1.4209     0.1774      64.1619      0.0001     -0.161188      0.242
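To see what these estimates imply, the fitted cumulative logits can be converted back to category probabilities. The following is a small sketch rather than anything produced by SAS: it assumes the parameterization logit(P(Y ≤ j)) = α_j + βx of (7.5), with x = u (1 for patients with heart problems, 0 otherwise) and the snoring categories ordered from "Never" to "Always"; the helper names are made up for the illustration.

```python
import math

def expit(z):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))

def category_probs(intercepts, beta, x):
    """Category probabilities implied by logit(P(Y <= j)) = alpha_j + beta * x."""
    cumulative = [expit(a + beta * x) for a in intercepts] + [1.0]
    probs = [cumulative[0]]
    # Each remaining probability is the difference of two cumulative probabilities.
    for lower, upper in zip(cumulative, cumulative[1:]):
        probs.append(upper - lower)
    return probs

alphas = [0.2824, 1.5545, 2.2792]   # INTERCP1, INTERCP2, INTERCP3
beta = -1.4209                      # common slope for U

probs_no = category_probs(alphas, beta, x=0)    # no heart problems
probs_yes = category_probs(alphas, beta, x=1)   # heart problems
cumulative_odds_ratio = math.exp(beta)          # reported as "Odds Ratio"

for label, probs in (("No", probs_no), ("Yes", probs_yes)):
    print(label, [round(p, 3) for p in probs])
print("cumulative odds ratio:", round(cumulative_odds_ratio, 3))
```

For x = 0 the first probability can be compared with the observed proportion of patients without heart problems who never snore, 1355/2374.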
A similar analysis can be done using Proc Genmod in SAS version 8 or later. The program can be written as

PROC GENMOD data=snoring order=data;
  FREQ freq;
  CLASS heart;
  MODEL v = u / dist=multinomial link=cumlogit;
RUN;
The information is essentially the same as in the Logistic procedure but the standard error estimates are slightly different. Also, Proc Genmod does not automatically test the common slope assumption.

Analysis Of Parameter Estimates

                            Standard   Wald 95%
Parameter     DF  Estimate     Error   Confidence Limits    ChiSquare   Pr > ChiSq
Intercept1     1    0.2824    0.0414    0.2012    0.3635        46.53       <.0001
Intercept2     1    1.5545    0.0535    1.4497    1.6594       844.88       <.0001
Intercept3     1    2.2792    0.0686    2.1446    2.4137      1102.40       <.0001
u              1   -1.4208    0.1742   -1.7624   -1.0793        66.49       <.0001
Scale          0    1.0000    0.0000    1.0000    1.0000
□
7.4 Latent variables
Another point of view when analyzing ordinal response variables is to assume that the observed ordinal variable Y is related to some underlying, latent, variable η through a relation of type

$$y = \begin{cases} 1 & \text{if } \eta < \tau_1 \\ 2 & \text{if } \tau_1 \le \eta < \tau_2 \\ \;\vdots \\ s & \text{if } \tau_{s-1} \le \eta \end{cases} \tag{7.7}$$

[Figure 7.2: Ordinal variable with three scale steps generated by cutting a continuous variable at two thresholds. The regions y = 1, y = 2 and y = 3 are separated by the thresholds t1 and t2.]
An example of this point of view is illustrated in Figure 7.2, where the latent variable is assumed to have a symmetric distribution, for example a logistic or a Normal distribution. Although (7.7) can be formally seen as a kind of link function, modelling the data by assuming a latent variable underlying the ordinal response is not formally a generalized linear model. However, it can be shown (see e.g. McCullagh and Nelder, 1989) that the latent variable approach gives a model that is identical to the proportional odds model with a logit link, for the case where the latent variable has a logistic distribution. The estimated intercepts would be the estimated thresholds for the latent variables. In a similar way, a proportional odds model using a complementary loglog link corresponds to a latent variable having a so called extreme value distribution. This is the well-known proportional hazards model used in survival analysis (Cox 1972).
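The link between relation (7.7) and the proportional odds model can be made concrete with a short sketch. The thresholds and the two values of the latent mean below are hypothetical, chosen only for illustration, and the function names are not taken from any package. The first function applies (7.7) literally; the remaining lines show that, when the latent variable is its mean plus standard logistic noise, every cumulative logit equals the threshold minus the mean, so a change in the mean shifts all cumulative logits by the same amount.

```python
import bisect
import math

def observed_category(eta, thresholds):
    """Return the category 1..s that relation (7.7) assigns to the latent value eta."""
    # bisect_right counts the thresholds that are <= eta, so eta == tau_j
    # falls in category j + 1, matching the condition "tau_j <= eta".
    return bisect.bisect_right(thresholds, eta) + 1

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

thresholds = [-1.0, 0.5, 2.0]   # hypothetical tau_1 < tau_2 < tau_3

# If eta = m + e with e standard logistic, then P(Y <= j) = F(tau_j - m).
cum_logits_by_mean = {
    m: [logit(logistic_cdf(tau - m)) for tau in thresholds] for m in (0.0, 0.8)
}
for m, logits in cum_logits_by_mean.items():
    print(m, [round(v, 3) for v in logits])
```

The two printed rows differ by the same constant in every position, which is the parallelism that the proportional odds model assumes.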
[Figure 7.3: An ordinal regression model. Regions labelled y = 1, y = 2 and y = 3; vertical axis from 0 to 20, horizontal axis from 0 to 4.]
A similar approach can also be used for the case where the latent variable is assumed to follow a Normal distribution. In the Genmod or Logistic procedures in SAS it is possible to specify the form of the link function to be logistic, complementary log-log, or Normal. This leads to a class of models called ordinal regression models, for example ordinal logit regression or ordinal probit regression. The concept of an ordinal regression model can be illustrated as in Figure 7.3. We observe the ordinal variable y that has values 1, 2 or 3. y = 1 is observed if the latent variable η is smaller than the lowest threshold which has a value close to 8. We observe y = 2 if, approximately, 8 ≤ η < 11.5 and we observe y = 3 if η > 11.5. In practice the scale of η cannot be determined. The scale of η can be chosen arbitrarily, for example such that the distribution of η is standard Normal for one of the values of x. Note that probit models and logistic regression models can also be derived as models with latent variables. In these cases it is assumed that the observations are generated by a latent variable: if this latent variable is smaller than a threshold τ we observe Y = 1, else Y = 0; see Figure 5.1 on page 86. As a comparison with the results given on page 149 we have analyzed the data on page 145 using the Logistic procedure and a Normal link. The following results were obtained:
Score Test for the Equal Slopes Assumption
Chi-Square = 2.6895 with 2 DF (p=0.2606)

Model Fitting Information and Testing Global Null Hypothesis BETA=0

              Intercept   Intercept and
Criterion     Only        Covariates      Chi-Square for Covariates
AIC           5568.351    5507.436        .
SC            5585.804    5530.706        .
-2 LOG L      5562.351    5499.436        62.916 with 1 DF (p=0.0001)
Score         .           .               70.693 with 1 DF (p=0.0001)

Analysis of Maximum Likelihood Estimates

                 Parameter   Standard   Wald         Pr >         Standardized
Variable    DF   Estimate    Error      Chi-Square   Chi-Square   Estimate
INTERCP1     1     0.1749     0.0258      46.0795      0.0001     .
INTERCP2     1     0.9355     0.0300     974.8408      0.0001     .
INTERCP3     1     1.3266     0.0352    1420.1155      0.0001     .
U            1    -0.8415     0.1071      61.6773      0.0001     -0.173150
The fit of the model, and the conclusions, are similar to the logistic model. The three thresholds are estimated to be 0.17; 0.94; and 1.33. For x = 0 this would give the probabilities as 0.5694, 0.2558, 0.0825 and 0.0923. For x = 1 the cumulative probabilities are Φ(τj − 0.8415), since the slope −0.8415 is added to each intercept; this corresponds to a mean value of η equal to 0.8415, so the probabilities are approximately 0.253, 0.285, 0.149 and 0.314.
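The probabilities for x = 0 follow directly from the standard Normal distribution function evaluated at the three estimated thresholds. The sketch below reproduces that calculation; it assumes a latent variable with unit variance, and the function name and its `latent_mean` argument are introduced here only for the illustration.

```python
from statistics import NormalDist

def probit_category_probs(thresholds, latent_mean=0.0):
    """Category probabilities when a Normal latent variable is cut at the thresholds."""
    cdf = NormalDist(mu=latent_mean, sigma=1.0).cdf
    cumulative = [cdf(t) for t in thresholds] + [1.0]
    # First category, then successive differences of the cumulative probabilities.
    return [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

thresholds = [0.1749, 0.9355, 1.3266]   # INTERCP1, INTERCP2, INTERCP3

probs_x0 = probit_category_probs(thresholds)   # x = 0: latent mean zero
print([round(p, 4) for p in probs_x0])
```

Calling the function with a non-zero `latent_mean` gives the corresponding probabilities for other values of x.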
7.5 A Genmod example
Example 7.3 Koch and Edwards (1988) considered analysis of data from a clinical trial on the response to treatment for arthritis pain. The data are as follows:

                             Response
Gender    Treatment    Marked   Some   None
Female    Active           16      5      6
Female    Placebo           6      7     19
Male      Active            5      2      7
Male      Placebo           1      0     10
The object is to model the response as a function of gender and treatment. We will attempt a proportional odds model for the cumulative logits and the cumulative probits, using the Genmod procedure of SAS (2000b). The data were input in a form where the data lines had the form

F A 3 16
F A 2 5
...

The program was written as follows:

PROC GENMOD data=Koch order=formatted;
  CLASS gender treat;
  FREQ count;
  MODEL response = gender treat gender*treat /
        LINK=cumlogit aggregate=response TYPE3;
RUN;
Part of the output was:

Analysis Of Parameter Estimates

                                          Standard   Wald 95%
Parameter              DF   Estimate         Error   Confidence Limits    ChiSquare
Intercept1              1     3.6746        1.0125    1.6901    5.6591        13.17
Intercept2              1     4.5251        1.0341    2.4983    6.5519        19.15
gender         F        1    -3.2358        1.0710   -5.3350   -1.1366         9.13
gender         M        0     0.0000        0.0000    0.0000    0.0000            .
treat          A        1    -3.7826        1.1390   -6.0150   -1.5503        11.03
treat          P        0     0.0000        0.0000    0.0000    0.0000            .
gender*treat   F  A     1     2.1110        1.2461   -0.3312    4.5533         2.87
gender*treat   F  P     0     0.0000        0.0000    0.0000    0.0000            .
gender*treat   M  A     0     0.0000        0.0000    0.0000    0.0000            .
gender*treat   M  P     0     0.0000        0.0000    0.0000    0.0000            .
Scale                   0     1.0000        0.0000    1.0000    1.0000

LR Statistics For Type 3 Analysis

Source          DF   ChiSquare   Pr > ChiSq
gender           1       18.01       <.0001
treat            1       28.15       <.0001
gender*treat     1        3.60       0.0579
We can note that there is a slight (but not significant) interaction; that there are significant gender differences and that the treatment has a significant effect. The signs of the parameters indicate that patients on active treatment experienced a higher degree of pain relief and that the females experienced better pain relief than the males. The cumulative probit model gave similar results except that the interaction term was further from being significant (p = 0.11). □
7.6 Exercises
Exercise 7.1 Ezdinli et al (1976) studied two treatments against lymphocytic lymphoma. After the experiment the tumour of each patient was graded on an ordinal scale from “Complete response” to “Progression”. Examine whether the treatments differ in their efficacy by fitting an appropriate ordinal regression model. You are also free to analyze the data using other methods that you may have met during your training.

                        Treatment
                        BP     CP    Total
Complete response       26     31       57
Partial response        51     59      110
No change               21     11       32
Progression             40     34       74
Total                  138    135      273
Exercise 7.2 The following data, from Hosmer and Lemeshow (1989), come from a survey on women’s attitudes towards mammography. The women were asked the question “How likely is it that mammography could find a new case of breast cancer”. They were also asked about recent experience of mammography. Results:

                                   Detection of breast cancer
Mammography experience     Not likely   Somewhat likely   Very likely
Never                              13                77           144
Over 1 year ago                     4                16            54
Within the past year                1                12            91
Analyze these data.
8. Additional topics
8.1 Variance heterogeneity
In general linear models, it is not uncommon that diagnostic tools indicate that the variance is not constant. This might indicate that the choice of distribution is wrong, such that some distribution where the variance depends on the mean should be chosen instead of the Normal distribution. An alternative approach, suggested by Aitkin (1987), works as follows. The response for observation i is modeled using the linear predictor

$$y_i = x_i\beta + e_i \tag{8.1}$$

where we assume that $e_i \sim N(0, \sigma_i^2)$. The variance $\sigma_i^2$ is modeled as

$$\sigma_i^2 = \exp(\lambda z_i). \tag{8.2}$$
Here, z is a vector that contains some or all of the predictors x, and λ is a vector of parameters to be estimated. Thus, the problem is to estimate the parameters of the linear predictor, as well as the parameters in the model for the variance. The estimation procedure suggested by Aitkin (1987) to estimate the parameter vector (β, λ) is to iterate between two generalized linear models. One model is a model with a Normal distribution and an identity link, (8.1), and the other model fits the squared residuals from this model to a Gamma distribution using a log link, corresponding to (8.2). Aitkin showed that this process produces the ML estimates, on convergence. A SAS macro for this process is given in the SAS (1997) manual. Example 8.1 In our analysis of the data on page 16 we found that the data strongly suggested variance heterogeneity, but that the distribution, for each contrast medium, was rather symmetric. This may indicate that the variance heterogeneity can be modeled using Aitkin’s method. The procedure produces one set of estimates for the mean model and a second set of estimates for the variance model. For these data we obtained the following results for the mean model:
Mean model

OBS   PARM        LEVEL1     DF   ESTIMATE   STDERR      CHISQ     PVAL
1     INTERCEPT               1     9.9075   0.3536   785.2685   0.0001
2     MEDIUM      Diatrizo    1    11.4845   0.5701   405.8269   0.0001
3     MEDIUM      Hexabrix    1    -5.9475   0.4859   149.8140   0.0001
4     MEDIUM      Isovist     1    -8.2431   0.4859   287.7796   0.0001
5     MEDIUM      Mannitol    1    -2.2197   0.4859    20.8680   0.0001
6     MEDIUM      Omnipaqu    1    -1.5175   0.4743    10.2347   0.0014
7     MEDIUM      Ringer      1    -9.6975   0.5175   351.0883   0.0001
8     MEDIUM      Ultravis    0     0.0000   0.0000          .        .
9     SCALE                   0     1.0000   0.0000          .        .

The estimates for the variance model were as follows:

Variance model

OBS   PARM        LEVEL1     DF   ESTIMATE   STDERR      CHISQ     PVAL
1     INTERCEPT               1     2.9560   0.5000    34.9524   0.0001
2     MEDIUM      Diatrizo    1     1.3196   0.8062     2.6788   0.1017
3     MEDIUM      Hexabrix    1    -1.4736   0.6872     4.5982   0.0320
4     MEDIUM      Isovist     1    -2.1732   0.6872    10.0016   0.0016
5     MEDIUM      Mannitol    1    -0.3517   0.6872     0.2620   0.6087
6     MEDIUM      Omnipaqu    1     0.0905   0.6708     0.0182   0.8927
7     MEDIUM      Ringer      1    -6.2914   0.7319    73.8867   0.0001
8     MEDIUM      Ultravis    0     0.0000   0.0000          .        .
9     SCALE                   0     0.5000   0.0000          .        .

The procedure converged after two iterations, producing an overall deviance of 257.73 on 43 df. □
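The alternation between the two models can be sketched for the special case where, as in Example 8.1, a single class variable is used in both the mean model and the variance model. The code below is only an illustration under that assumption and is not the SAS macro referred to above: the data are hypothetical, and both fits are written in closed form. Within a group all weights are equal, so the weighted Normal fit returns the group mean, and a Gamma model with log link and the group as class variable returns the group average of the squared residuals as fitted variance.

```python
from statistics import fmean

# Hypothetical measurements for three groups (not the data of Example 8.1).
groups = {
    "A": [10.1, 9.7, 10.4, 9.8],
    "B": [12.0, 15.5, 9.1, 13.4],
    "C": [7.9, 8.1, 8.0, 8.2],
}

def aitkin_one_way(groups, max_iter=20, tol=1e-10):
    """Alternate between the mean model (8.1) and the variance model (8.2)."""
    variances = {g: 1.0 for g in groups}   # starting values: constant variance
    for iteration in range(1, max_iter + 1):
        # Mean model: Normal, identity link, weights 1 / variance.
        # Equal weights within a group give the ordinary group mean.
        means = {g: fmean(y) for g, y in groups.items()}
        # Variance model: Gamma, log link, fitted to the squared residuals.
        # With the group as class variable the fitted value is the group average.
        new_variances = {
            g: fmean([(v - means[g]) ** 2 for v in y]) for g, y in groups.items()
        }
        change = max(abs(new_variances[g] - variances[g]) for g in groups)
        variances = new_variances
        if change < tol:
            break
    return means, variances, iteration

means, variances, n_iter = aitkin_one_way(groups)
print(means)
print(variances)
print("iterations:", n_iter)
```

In this special case the second pass only confirms the first, so the loop stops after two iterations. With continuous predictors in either model the two fits would have to be carried out by weighted least squares and iteratively reweighted least squares, and more iterations would normally be needed.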
8.2 Survival models
Survival data are data for which the response is the time a subject has survived a certain treatment or condition. Survival models are used in epidemiology, as well as in lifetime testing in industry. Censoring is a special feature of survival data. Censoring means that the survival time is not known for all individuals when the study is finished. For right censored observations we only know that the survival time is at least the time at which censoring occurred. Left censoring, i.e. observations for which we do not know e.g. the duration of disease when the study started, is also possible.

Denote the density function for the survival time with $f(t)$, and let the corresponding distribution function be $F(t) = \int_{-\infty}^{t} f(s)\,ds$. The survival function is defined as

$$S(t) = 1 - F(t) \tag{8.3}$$

and the hazard function is defined as

$$h(t) = \frac{f(t)}{S(t)} = -\frac{d \log(S(t))}{dt} \tag{8.4}$$

The hazard function measures the instantaneous risk of dying: for a subject who has survived up to time t, h(t) dt is approximately the probability of dying in the next small time interval of duration dt. The cumulative hazard function is

$$H(t) = \int_{-\infty}^{t} h(s)\,ds \tag{8.5}$$
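The relations (8.3) to (8.5) can be checked numerically for a specific distribution. The sketch below uses a Weibull distribution with hypothetical shape and scale parameters; the function names are chosen for this illustration only. It computes the hazard as f(t)/S(t) and compares it with a numerical derivative of the cumulative hazard H(t) = −log S(t).

```python
import math

def survival(t, shape, scale):
    """Weibull survival function S(t) = exp(-(t / scale) ** shape), t > 0."""
    return math.exp(-((t / scale) ** shape))

def density(t, shape, scale):
    """Weibull density f(t) = -dS/dt."""
    return (shape / scale) * (t / scale) ** (shape - 1.0) * survival(t, shape, scale)

def hazard(t, shape, scale):
    """h(t) = f(t) / S(t), as in (8.4)."""
    return density(t, shape, scale) / survival(t, shape, scale)

def cumulative_hazard(t, shape, scale):
    """H(t) = -log S(t), the integral of the hazard as in (8.5)."""
    return -math.log(survival(t, shape, scale))

shape, scale, t, eps = 1.5, 10.0, 4.0, 1e-6   # hypothetical values
slope_of_H = (cumulative_hazard(t + eps, shape, scale)
              - cumulative_hazard(t - eps, shape, scale)) / (2 * eps)
print(hazard(t, shape, scale), slope_of_H)
```

With shape equal to 1 the Weibull distribution reduces to the exponential distribution, for which the hazard is constant and equal to 1/scale.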
Modelling of survival data includes choosing a suitable distribution for the survival times or, which is equivalent, choosing a hazard function. This can be done in different ways:
1. In nonparametric modelling, the survival function is not specified, but is estimated nonparametrically through the observed survival distribution. This is the basis for the so called Kaplan-Meier estimates of the survival function.
2. In parametric models, the distribution of survival times is assumed to have some specified parametric form. The exponential distribution, Weibull distribution or extreme value distribution are often used to model survival times.
3. A semiparametric approach is to leave the distribution unspecified but to assume that the hazard function changes in steps which occur at the observed events.
We will here only give examples of the parametric approach. For a more thorough description of analysis of survival data, reference is made to standard textbooks such as Klein and Moeschberger (1997).
8.2.1 An example
Although survival models are often discussed in texts on generalized linear models, the treatment of censoring makes it more convenient to analyze general survival data using special programs. However, data where there is no censoring can be analyzed using standard GLIM software, as long as the desired survival distribution belongs to the exponential family.

Example 8.2 Feigl and Zelen (1965) analyzed the survival times for leukemia patients classified as AG positive or AG negative. The white cell count (WBC) for each patient is also given. The data are reproduced in Table 8.1.

Table 8.1: Survival of leukemia patients

       AG +                  AG −
    WBC    Surv.          WBC    Surv.
   2300      65          4400      56
    750     156          3000      65
   4300     100          4000      17
   2600     134          1500       7
   6000      16          9000      16
  10500     108          5300      22
  10000     121         10000       3
  17000       4         19000       4
   5400      39         27000       2
   7000     143         28000       3
   9400      56         31000       8
  32000      26         26000       4
  35000      22         21000       3
 100000       1         79000      30
 100000       1        100000       4
  52000       5        100000      43
 100000      65

As a first attempt, we model the data using a Gamma distribution. The log of the WBC was used. The interaction ag*logwbc was not significant. The program is

PROC GENMOD data=feigl;
  CLASS ag;
  MODEL survival = ag logwbc / DIST=gamma obstats residuals;
  MAKE obstats out=ut;
RUN;
[Figure 8.1: Residual plots for Leukemia data. Four panels: Normal Plot of Residuals (residual against Normal score); I Chart of Residuals (residual against observation number); Histogram of Residuals (frequency against residual); Residuals vs. Fits (residual against fit).]
Part of the output is

Criteria For Assessing Goodness Of Fit

Criterion             DF        Value    Value/DF
Deviance              30      40.0440      1.3348
Scaled Deviance       30      38.2985      1.2766
Pearson Chi-Square    30      29.6222      0.9874
Scaled Pearson X2     30      28.3310      0.9444
Log Likelihood         .    -146.3814           .

Analysis Of Parameter Estimates

Parameter         DF   Estimate   Std Err   ChiSquare   Pr>Chi
INTERCEPT          1    -0.0020    0.0262      0.0056   0.9403
AG          +      1    -0.0344    0.0150      5.2495   0.0220
AG          -      0     0.0000    0.0000           .        .
LOGWBC             1     0.0061    0.0024      6.5899   0.0103
SCALE              1     0.9564    0.2065           .        .
The fit of the model is reasonable with a scaled deviance of 38.3 on 30 df, Deviance/df = 1.28. The effects of WBC and AG are both significant. We asked the procedure to output predicted values and standardized Deviance residuals to a file. A residual plot based on these data is given in Figure 8.1. The distribution seems reasonable; one very large fitted value stands out. □
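The parameter estimates can be turned into fitted survival times. The sketch below rests on two assumptions that are not stated in the output: that the model was fitted with the reciprocal link, which is the default link for the Gamma distribution in Genmod since no LINK= option was given, and that logwbc is the natural logarithm of the white cell count. The function and constant names are chosen for this illustration.

```python
import math

# Estimates from the "Analysis Of Parameter Estimates" table.
INTERCEPT = -0.0020
AG_POSITIVE = -0.0344   # effect of AG +; AG - is the reference level
LOGWBC = 0.0061

def fitted_survival(wbc, ag_positive):
    """Fitted mean survival time, assuming 1/mu = linear predictor (reciprocal link)."""
    eta = INTERCEPT + (AG_POSITIVE if ag_positive else 0.0) + LOGWBC * math.log(wbc)
    return 1.0 / eta

fit_low_wbc = fitted_survival(750, ag_positive=True)
fit_high_wbc = fitted_survival(100000, ag_positive=False)
print(round(fit_low_wbc, 1), round(fit_high_wbc, 1))
```

Under these assumptions the AG positive patient with a WBC of 750 gets a fitted value of about 250, far above the other fitted values, which would explain the single large fitted value noted in the residual plot.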
8.3 Quasi-likelihood
In general linear models, the assumption that the observations come from a Normal distribution is not crucial. Estimation of parameters in GLM:s is often done using some variation of Least squares, for which certain optimality properties are valid even under non-normality. Thus, we can estimate parameters in, for example, regression models, without being too much worried by non-normality. Quasi-likelihood can give a similar peace of mind to users of generalized linear models. In principle, we need to specify a distribution (Poisson, binomial, Normal etc.) when we fit generalized linear models. However, Wedderburn (1974) noted the following property of generalized linear models. The score equations for the regression coefficients β have the form

$$\sum_i \frac{\partial \mu_i}{\partial \beta}\, v_i^{-1}\, (y_i - \mu_i(\beta)) = 0 \tag{8.6}$$
Note that this expression only contains the first two moments, the mean $\mu_i$ and the variance $v_i$. Wedderburn (1974) suggested that this can be used to define a class of estimators that do not require explicit expressions for the distributions. A type of generalized linear model can be constructed by specifying the linear predictor η and the way the variance v depends on µ. The integral of (8.6) can be seen as a kind of likelihood function. This integral is

$$Q(y_i, \mu_i) = \int_{-\infty}^{\mu_i} \frac{y_i - t}{v_i}\, dt + f(y_i) \tag{8.7}$$
where $f(y_i)$ is some arbitrary function of $y_i$. $Q(y_i, \mu_i)$ is called a quasi-likelihood. Maximizing (8.7) with respect to the parameters of the model yields quasi-likelihood (QL) estimators.

QL estimators can be shown to have nice asymptotic properties. First, they are consistent, regardless of whether the variance assumption $v_i = V(\mu_i)$ is true, as long as the linear predictor is correctly specified. Secondly, QL estimators are asymptotically unbiased and efficient among the class of estimating equations which are linear functions of the data (McCullagh, 1983).

Estimators of the variances of QL estimators can be obtained in different ways. The matrix $I_\theta$ of second order derivatives of (8.7) gives the QL equivalent of the Fisher information matrix. The inverse $I_\theta^{-1}$ is an estimator of the covariance matrix of the parameter estimates. This is called the model-based estimator of $\mathrm{Cov}(\hat{\beta})$. An alternative approach is to use the so called empirical, or robust, estimator, which is less sensitive to assumptions regarding variances and covariances. This is also called the sandwich estimator. It has general form

$$\widehat{\mathrm{Cov}}(\hat{\beta}) = I_\theta^{-1}\, I_1\, I_\theta^{-1}$$

where $I_\theta$ is the information matrix and

$$I_1 = \sum_{i=1}^{k} \frac{\partial \mu_i'}{\partial \beta}\, V_i^{-1}\, \frac{\partial \mu_i}{\partial \beta}.$$
Software supporting quasi-likelihood may have options to choose between model-based and robust variance estimators. The quasi-likelihood approach can be used in over-dispersed models. It is also used in the GEE method for analysis of repeated measures data, and in the method for analysis of mixed generalized linear models discussed below.
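To make the idea of a quasi-likelihood more concrete, the following Python sketch evaluates a difference of quasi-likelihoods numerically. It assumes, purely for illustration, the variance function V(t) = t, and it works with the difference Q(y, µ1) − Q(y, µ0), so that the lower integration limit and the arbitrary function f(y) cancel. The function names are hypothetical. Under this assumed variance function the result should agree with the corresponding difference of Poisson log likelihoods, y log(µ1/µ0) − (µ1 − µ0).

```python
import math


def quasi_loglik_diff(y, mu0, mu1, var_fn, steps=1000):
    """Q(y, mu1) - Q(y, mu0): integral of (y - t) / V(t) from mu0 to mu1.

    Composite Simpson's rule; var_fn is the assumed variance function V.
    """
    if steps % 2:          # Simpson's rule needs an even number of intervals
        steps += 1
    h = (mu1 - mu0) / steps

    def integrand(t):
        return (y - t) / var_fn(t)

    total = integrand(mu0) + integrand(mu1)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * integrand(mu0 + j * h)
    return total * h / 3


def poisson_loglik_diff(y, mu0, mu1):
    """Difference of Poisson log likelihoods (terms in y alone cancel)."""
    return y * math.log(mu1 / mu0) - (mu1 - mu0)


y, mu0, mu1 = 4.0, 2.0, 5.0
ql = quasi_loglik_diff(y, mu0, mu1, var_fn=lambda t: t)
pl = poisson_loglik_diff(y, mu0, mu1)
print(ql, pl)
```

Only the mean and the variance function enter the calculation; no distribution is specified, which is the point of the quasi-likelihood construction.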
8.4 Quasi-likelihood for modeling overdispersion
The quasi-likelihood approach is sometimes useful when the data show signs of over-dispersion. Since the empirical variance estimates obtained in QL estimation are rather robust against the variance assumption, QL estimation is a viable alternative to the methods for modeling over-dispersion presented in earlier chapters, at least if the sample is reasonably large. We will illustrate this idea based on a set of data from Liang and Hanfelt (1994).

Example 8.3 Two groups of rats, each consisting of sixteen pregnant females, were fed different diets during pregnancy and lactation. The control diet was a standard food whereas the treatment diet contained a certain chemical agent. After three weeks it was recorded how many of the live-born pups were still alive. The data are given as x/n where x is the number of surviving pups and n is the total litter size.

Control    13/13   12/12   9/9     9/9     8/8     8/8     12/13   11/12
           9/10    9/10    8/9     11/13   4/5     5/7     7/10    7/10
Treated    12/12   11/11   10/10   9/9     10/11   9/10    9/10    8/9
           8/9     4/5     7/9     4/7     5/10    3/6     3/10    0/7
A standard logistic model has a rather bad fit with a deviance of 86.19 on 30 df, p < 0.0001. In this model the treatment effect is significant, both when we use a Wald test (p = 0.0036) and when we use a likelihood ratio test (p = 0.0027).
The bad fit may be caused by heterogeneity among the females: different females may have different ability to take care of their pups. If it can be assumed that the dispersion parameter is the same in both groups, this can be modeled by including a dispersion parameter in the model, as discussed in Chapter 5. Such a model gives a non-significant treatment effect (Wald test: p = 0.0855; LR test: p = 0.0765).

The quasi-likelihood estimates can be obtained in Proc Genmod by using the following trick. Proc Genmod can use QL, but only in repeated-measures models. We can then request a repeated-measures analysis but with only one measurement per female. The program can be written as

    PROC GENMOD data=tera;
       CLASS treat litter;
       MODEL x/n=treat / DIST=bin LINK=logit type3;
       /* one "repeated" measurement per litter */
       REPEATED subject=litter;
    RUN;
The output is given in two parts. The first part uses the model-based estimates of variances and these results are identical to the first output. The second part presents the QL results:

Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

                         Standard    95% Confidence
Parameter    Estimate    Error       Limits              Z       Pr > |Z|
Intercept    1.2220      0.3813      0.4747    1.9693    3.20    0.0014
treat c      0.9612      0.4751      0.0300    1.8925    2.02    0.0431
treat t      0.0000      0.0000      0.0000    0.0000    .       .

Score Statistics For Type 3 GEE Analysis

Source    DF    ChiSquare    Pr > ChiSq
treat     1     2.89         0.0890
In these results the Wald test is significant (p = 0.043) but the Score test is not (p = 0.0890). □

In the paper by Liang and Hanfelt (1994), a simulation study compared the performances of different methods for allowing for overdispersion in this type of data. The methods included modeling as a beta-binomial distribution, and two QL approaches. In the simulations, the overdispersion parameter was different for different treatments. Nevertheless, the QL approach assuming constant overdispersion performed surprisingly well. It was also concluded that results based on the beta-binomial distribution “can lead to severe bias in the estimation of the dose-response relationship” (p. 878). Thus, the QL approach seems to be a useful and rather robust tool for modeling overdispersed data.
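The empirical standard errors in Example 8.3 can be retraced by hand, which may help to see what the sandwich estimator does. The Python sketch below is an illustration, not the Genmod algorithm. It assumes that the data table is read as the first eight and last eight litters of each group, that each litter is one independent cluster, and that the middle matrix of the sandwich is built from squared litter-level residuals x − nµ (the usual empirical form, taken here as an assumption). With a single two-level factor the logistic estimates are available in closed form from the pooled proportions, so no iteration is needed. All function and variable names are hypothetical.

```python
import math

# (x, n) = (pups alive after three weeks, litter size), from Example 8.3
control = [(13, 13), (12, 12), (9, 9), (9, 9), (8, 8), (8, 8), (12, 13), (11, 12),
           (9, 10), (9, 10), (8, 9), (11, 13), (4, 5), (5, 7), (7, 10), (7, 10)]
treated = [(12, 12), (11, 11), (10, 10), (9, 9), (10, 11), (9, 10), (9, 10), (8, 9),
           (8, 9), (4, 5), (7, 9), (4, 7), (5, 10), (3, 6), (3, 10), (0, 7)]


def logit(p):
    return math.log(p / (1 - p))


def group_pieces(litters):
    """Pooled proportion, model-based information and empirical 'meat' for one group."""
    x_tot = sum(x for x, _ in litters)
    n_tot = sum(n for _, n in litters)
    p = x_tot / n_tot
    info = n_tot * p * (1 - p)                        # sum of n_i * p * (1 - p)
    meat = sum((x - n * p) ** 2 for x, n in litters)  # squared litter residuals
    return p, info, meat


def inv2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mul2(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


p_t, a, A = group_pieces(treated)
p_c, b, B = group_pieces(control)

# Parameterisation of the output: intercept = treated logit, "treat c" = control - treated
beta = [logit(p_t), logit(p_c) - logit(p_t)]

# Design columns (1, c) with c = 1 for control litters
I0 = [[a + b, b], [b, b]]   # model-based information
I1 = [[A + B, B], [B, B]]   # empirical middle matrix

I0_inv = inv2(I0)
cov_model = I0_inv
cov_robust = mul2(mul2(I0_inv, I1), I0_inv)   # sandwich: I0^-1 I1 I0^-1

se_model = [math.sqrt(cov_model[j][j]) for j in range(2)]
se_robust = [math.sqrt(cov_robust[j][j]) for j in range(2)]
z = [beta[j] / se_robust[j] for j in range(2)]
p_values = [math.erfc(abs(v) / math.sqrt(2)) for v in z]  # two-sided Normal p-values

print(beta, se_model, se_robust, z, p_values)
```

Under these assumptions the robust standard errors are noticeably larger than the model-based ones, which reflects the extra variation between litters that the plain binomial model ignores.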
8.5 Repeated measures: the GEE approach
Suppose that measurements have been made on each of k individuals on several occasions. The responses can then be represented as $Y_{ij}$, $j = 1, \ldots, n_i$, $i = 1, \ldots, k$. Subject i has measurements on $n_i$ occasions, and we have a total of $\sum_{i=1}^{k} n_i$ measurements. This type of data is called repeated measures data.
The main problem with repeated measures data is that observations within one individual are correlated. There are several ways to model this correlation. We will here only consider the Generalized estimating equations approach of Liang and Zeger (1986); see also Diggle, Liang and Zeger (1994). This approach is available in the Genmod procedure in SAS (2000b). The GEE approach can be seen as an extension of the quasi-likelihood approach to a multivariate mean vector.

Models for repeated measures data have the same basic components as other generalized linear models. We need to specify a link function, a distribution and a linear predictor. But in addition we need to consider how observations within individuals are correlated.

Suppose that we store all data for individual i in the vector $Y_i$ that has elements $Y_i = [Y_{i1}, \ldots, Y_{in_i}]'$. The corresponding vector of mean values is $\mu_i = [\mu_{i1}, \ldots, \mu_{in_i}]'$. Let $V_i$ be the covariance matrix of $Y_i$. The values of the independent variables for individual i at measurement (occasion) j are collected in the vector $x_{ij} = [x_{ij1}, \ldots, x_{ijp}]'$. The vector β contains the parameters to be estimated. The GEE approach means that we estimate the parameters by solving the GEE equation

$$\sum_{i=1}^{k} \frac{\partial \mu_i'}{\partial \beta}\, V_i^{-1} \left( Y_i - \mu_i(\beta) \right) = 0 \tag{8.8}$$
This is similar to the quasi-likelihood equation (8.6), but it is here written in matrix form. It can be shown that the multivariate quasi-likelihood approach provides consistent estimates of the parameters even if the covariance structure is incorrectly specified. Estimates of the variances and covariances of $\hat{\beta}$ can be obtained in two ways. The model-based approach assumes that the model is correctly specified. The robust approach provides consistent estimates of variances and covariances even if $V_i$ is incorrectly specified. Both approaches are available in Proc Genmod.

The correlations between measurement occasions are modeled by a vector of parameters α. The following correlation structures are available in Proc Genmod:

Fixed (user-specified): $\mathrm{Corr}(Y_{ij}, Y_{ik}) = r_{jk}$.

m-dependent: $\mathrm{Corr}(Y_{ij}, Y_{ik}) = \begin{cases} 1 & \text{if } t = 0 \\ \alpha_t & \text{if } t = 1, \ldots, m \\ 0 & \text{if } t > m \end{cases}$ where t is the time span between the observations. The correlation is 0 for occasions more than m time units apart.

Exchangeable: $\mathrm{Corr}(Y_{ij}, Y_{ik}) = \begin{cases} 1 & \text{if } j = k \\ \alpha & \text{if } j \neq k \end{cases}$. All correlations are equal.

Unstructured: $\mathrm{Corr}(Y_{ij}, Y_{ik}) = \begin{cases} 1 & \text{if } j = k \\ \alpha_{jk} & \text{if } j \neq k \end{cases}$

Autoregressive, AR(1): $\mathrm{Corr}(Y_{ij}, Y_{i,j+t}) = \alpha^t$ for $t = 0, 1, \ldots, n_i - j$.

As usual, the choice of model for the covariance structure is a compromise between realism and parsimony. A model with more parameters is often more realistic, but may be more difficult to interpret and may give convergence problems. The fixed structure means that the user enters all correlations, so there are no parameters to estimate. The exchangeable structure includes only one parameter. The AR(1) structure also has only one parameter but it is often intuitively appealing since the correlations decrease with increasing distance. If we assume unstructured correlations we need to estimate n(n − 1)/2 correlations for n measurement occasions, while the m-dependent correlation structure includes fewer correlations.

Example 8.4 Sixteen children (taken from the data of Lipsitz et al, 1994) were followed from the age of 9 to the age of 12. The children were from two different cities. The binary response variable was the wheezing status of the child. The explanatory variables were city, age, and maternal smoking status.
The structure of the data is given in Table 8.2; the complete data set is available from the publisher's home page as the file Wheezing.dat.

Table 8.2: Structure of the data on wheezing status

Child    City        Age    Smoke    Wheeze
1        Portage     9      0        1
1        Portage     10     0        1
1        Portage     11     0        1
1        Portage     12     0        0
2        Kingston    9      1        1
2        Kingston    10     2        1
2        Kingston    11     2        0
A Genmod program for analysis of these data can be written as

    PROC GENMOD DATA=wheezing;
       CLASS child city;
       MODEL wheeze = city age smoke / dist=bin link=logit;
       /* exchangeable working correlation within each child */
       REPEATED subject=child / type = exch covb corrw;
    RUN;
This program models the probability of wheezing as a function of city, age and maternal smoking. The effects of age and smoking are assumed to be linear. A binomial distribution with a logit link is used. The Repeated statement indicates that there are several measurements for each child. These are correlated, with an exchangeable correlation structure as described above. Additional output is also requested. The following output is obtained:

Criteria For Assessing Goodness Of Fit

Criterion             DF    Value       Value/DF
Deviance              60    76.9380     1.2823
Scaled Deviance       60    76.9380     1.2823
Pearson Chi-Square    60    63.9651     1.0661
Scaled Pearson X2     60    63.9651     1.0661
Log Likelihood        .     -38.4690    .
The model fits the data reasonably well.

Covariance Matrix (Model-Based)
Covariances are Above the Diagonal and Correlations are Below

Parameter
Number    PRM1        PRM2        PRM4         PRM5
PRM1      5.71511     -0.22386    -0.53133     0.01658
PRM2      -0.13847    0.45733     -0.002411    0.01877
PRM4      -0.96838    -0.01553    0.05268      -0.01658
PRM5      0.01587     0.06353     -0.16530     0.19088

Covariance Matrix (Empirical)
Covariances are Above the Diagonal and Correlations are Below

Parameter
Number    PRM1        PRM2        PRM4        PRM5
PRM1      9.33891     -0.85121    -0.83232    -0.16667
PRM2      -0.40467    0.47378     0.05737     0.04007
PRM4      -0.97676    0.29893     0.07775     -0.002201
PRM5      -0.15108    0.16125     -0.02187    0.13032

The covariance matrix of $\hat{\beta}$ is estimated in two ways: assuming that the model for V is correct, and using the “robust” method.

Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

                                    Empirical    95% Confidence Limits
Parameter             Estimate      Std Err      Lower       Upper      Z         Pr>|Z|
INTERCEPT             1.2754        3.0560       -4.7141     7.2650     0.4174    0.6764
CITY      Kingston    0.1219        0.6883       -1.2272     1.4709     0.1771    0.8595
CITY      Portage     0.0000        0.0000       0.0000      0.0000     0.0000    0.0000
AGE                   -0.2036       0.2788       -0.7501     0.3429     -.7302    0.4652
SMOKE                 -0.0928       0.3610       -0.8003     0.6147     -.2571    0.7971
Scale                 0.9991        .            .           .          .         .
Estimates of the parameters of the model are given, along with their empirical standard error estimates. None of the parameters are significantly different from 0. This, of course, may be related to the rather small sample size. □
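The working correlation structures listed before Example 8.4 are easy to write down explicitly, which makes the trade-off in the number of parameters visible. The following Python sketch builds the matrices for a subject with n measurement occasions. The function names are hypothetical and the code is only meant as an illustration; it is not how Proc Genmod stores or estimates the correlations.

```python
def exchangeable(n, alpha):
    """All off-diagonal correlations equal to alpha (one parameter)."""
    return [[1.0 if j == k else alpha for k in range(n)] for j in range(n)]


def ar1(n, alpha):
    """Correlation alpha**t at time span t = |j - k| (one parameter)."""
    return [[alpha ** abs(j - k) for k in range(n)] for j in range(n)]


def m_dependent(n, alphas):
    """alphas[t - 1] is the correlation at span t = 1..m; zero beyond span m."""
    m = len(alphas)

    def corr(t):
        if t == 0:
            return 1.0
        return alphas[t - 1] if t <= m else 0.0

    return [[corr(abs(j - k)) for k in range(n)] for j in range(n)]


def n_unstructured_params(n):
    """Number of distinct correlations in an unstructured n x n matrix."""
    return n * (n - 1) // 2


# Four occasions, as for the children followed from age 9 to age 12
for row in ar1(4, 0.5):
    print(row)
print(n_unstructured_params(4))
```

With four occasions the unstructured form already needs six correlations, while the exchangeable and AR(1) forms need one each.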
8.6 Mixed Generalized Linear Models
Mixed models are models where some of the independent variables are assumed to be fixed, i.e. chosen beforehand, while others are seen as randomly sampled from some population or distribution. Mixed models have proven to be very useful in modeling different phenomena. An example of an application of mixed models is when several measurements have been taken on the same individual. In such cases the effect of individual can often be included in the model as a random effect. A mixed linear model for a continuous response variable y can be written, for each individual i, as

$$y_i = X_i \beta + Z_i u_i + e_i \tag{8.9}$$
In (8.9), $y_i$ is the $n_i \times 1$ response vector for individual i, $X_i$ is an $n_i \times p$ design matrix that contains values for the fixed effect variables, β is a $p \times 1$ parameter vector for the fixed effects, $Z_i$ is an $n_i \times q$ matrix that contains the random effects variables, and $u_i$ is a $q \times 1$ vector of random effects. In mixed models based on Normal theory it is often assumed that $u_i \sim N(0, D)$ and that $e_i \sim N(0, \Sigma_i)$. $\Sigma_i$ is often chosen to be equal to $\sigma^2 I_{n_i}$, where $I_{n_i}$ is the identity matrix of dimension $n_i$. D is a general covariance matrix of dimension $q \times q$. In general, the actual effects of the random factors are not of primary concern. The parameters of interest in a model such as (8.9) are often the regression parameters β and estimates of the variance components. A special SAS procedure, Proc Mixed, can be used for fitting mixed linear models in cases where the response variable is continuous and approximately normally distributed.
There are situations when the response is of a type amenable to GLIM estimation but where there would be a need to assume that some of the independent variables are random. Breslow and Clayton (1993), and Wolfinger and O’Connell (1993) have explored a pseudo-likelihood approach to fitting models such as (8.9) but where the distributions are free to be any member of the exponential family, and where a link function is used to model the expected response as a function of the linear predictor. A SAS macro Glimmix has been written to do the estimation. Essentially, the macro iterates between Proc Mixed and Proc Genmod. The method and the macro are described in Littell et al (1996).

Example 8.5 Thirty-three children between the ages of 6 and 16 years, all suffering from monosymptomatic nocturnal enuresis, were enrolled in a study. The study was carried out with a double-blind randomized three-period cross-over design. The children received 0.4 mg. Desmopressin, 0.8 mg. Desmopressin, or placebo tablets at bedtime for five consecutive nights with each dosage. A wash-out period of at least 48 hours without any medication was interspersed between treatment periods. Wet and dry nights were documented; for more details about the study and its analysis see Neveus et al (1999), and Olsson and Neveus (2000).

The data consisted of nightly recordings, where a dry night was recorded as 1 and a wet night as 0. The nights were grouped into sets of five nights where the same treatment had been given. The structure of the data is given in Table 8.3. Only one patient is listed; the original data set contained 33 patients.

Table 8.3: Raw data for one patient in the enuresis study

Patient    Period    Dose    Night    Dry
1          1         1       1        1
1          1         1       2        1
1          1         1       3        1
1          1         1       4        1
1          1         1       5        1
1          2         0       1        0
1          2         0       2        0
1          2         0       3        1
1          2         0       4        0
1          2         0       5        0
1          3         2       1        1
1          3         2       2        1
1          3         2       3        1
1          3         2       4        1
1          3         2       5        1
Following Jones and Kenward (1989), the linear predictor part of a general model for our data may be written as

$$\eta_{ijk} = \mu + s_{ik} + \pi_j + \tau_{d[i,j]} + \lambda_{d[i,j-1]} \tag{8.10}$$

In (8.10), µ is a general mean; $s_{ik}$ is the random effect of patient k in sequence i; $\pi_j$ is the effect of period j; $\tau_{d[i,j]}$ is the direct effect of the treatment administered in period j of group i; and $\lambda_{d[i,j-1]}$ is the carry-over effect of the treatment administered in period j − 1 of group i. The model further includes a logistic link function:

$$\eta_{ijk} = \log \frac{\mu_{ijk}}{1 - \mu_{ijk}} \tag{8.11}$$

Finally, the model assumes a binomial distribution of the observations. Models containing different combinations of model parameters were tested. The results are summarized in the following table. The numbers in the table are p-values to assess the significance of the different factors. Patient was included as a random factor in all models.
Effects included              Dose     Period    Sequence    After effect
Dose                          .0001
Dose, Seq                     .0001              .7442
Dose, Period                  .0001    .0938
Dose, After eff.              .0001                          .6272
Dose, After eff., Period      .0001    .0759                 .8713
Dose, After eff., Seq.        .0001              .7762       .6573
Dose, Period, Seq             .0001    .0898     .7577
Based on these results, it was concluded that a model containing a random Patient effect, and fixed effects of Dose and Period, provided an appropriate description of the data. Neither the sequence effect nor the after effect was anywhere close to being significant in any of the analyses. Further analyses using pairwise comparisons revealed that there were no significant differences between doses but that the drug had a significant effect at both doses. □

8.7 Exercises
Exercise 8.1 Survival times in weeks were recorded for patients with acute leukaemia. For each patient the white cell count (wbc, in thousands) and the AG factor were also recorded. Patients with positive AG factor had Auer rods and/or granulation of the leukemia cells in the bone marrow at diagnosis, while the AG negative patients did not.

Time    wbc      AG      Time    wbc      AG
65      2.3      1       65      3.0      0
108     10.5     1       3       10.0     0
56      9.4      1       4       26.0     0
5       52.0     1       22      5.3      0
143     7.0      1       17      4.0      0
156     0.8      1       4       19.0     0
121     10.0     1       3       21.0     0
26      32.0     1       8       31.0     0
65      100.0    1       7       1.5      0
1       100.0    1       2       27.0     0
100     4.3      1       30      79.0     0
4       17.0     1       43      100.0    0
22      35.0     1       16      9.0      0
56      4.4      1       3       28.0     0
134     2.6      1       4       100.0    0
39      5.4      1
1       100.0    1
16      6.0      1
The four largest wbc values are actually larger than 100. Construct a model that can predict survival time based on wbc and ag. Note that the wbc value may need to be transformed. Also note that there are no censored observations so a straight-forward application of generalized linear models is possible. Try a survival distribution based on the Gamma distribution.

Exercise 8.2 The data in Exercise 1.2 show some signs of being heteroscedastic. Re-analyze these data using the method discussed in Section 8. The data are as follows: The level of cortisol has been measured for three groups of patients with different syndromes: a) adenoma b) bilateral hyperplasia c) carcinoma. The results are summarized in the following table:

a      b       c
3.1    8.3     10.2
3.0    3.8     9.2
1.9    3.9     9.6
3.8    7.8     53.8
4.1    9.1     15.8
1.9    15.4
       7.7
       6.5
       5.7
       13.6

Figure 8.2: Experimental layout for lice experiment (figure labels: Air flow, Variety 1, Variety 2)
Exercise 8.3 An experiment on lice preferences for different varieties of plants (Nincovic et al, 2002) was performed in the following way: Plants of one variety (Variety 1) were placed in a box. An adjacent box contained plants of some other variety (Variety 2); see Figure 8.2. Air was allowed to flow through the boxes from Variety 1 to Variety 2. Tubes were placed on four leaves of the plant of Variety 2. 10 lice were placed in each tube. After about two hours it was recorded how many of the 10 lice were eating from the plant.

The experiment was designed to answer the following types of questions: Are the eating preferences of the lice different for different varieties? Are the eating preferences affected by the smell from Variety 1?

The structure of the raw data was as follows; only a few observations are listed. The complete dataset contains 320 observations and is listed at the end of the exercise.

Pot    Tube    x2    n2    Var1    Var2
1      1       9     10    F       K
1      2       7     10    F       K
1      3       7     10    F       K
1      4       8     10    F       K
2      5       9     10    F       K
2      6       10    10    F       K
2      7       10    10    F       K
2      8       10    10    F       K
3      9       10    10    F       K
3      10      10    10    F       K
3      11      9     10    F       K
Pot indicates pot number and Tube indicates tube number. n2 is the number of lice in the tube, and x2 is the number of lice eating after two hours. Var1 and Var2 are codes for Variety 1 and Variety 2, respectively. Formulate a model for these data that can answer the question whether the eating preferences of the lice depend on Variety 1, Variety 2 or a combination of these. Hint: Since repeated observations are made on the same plants it may be reasonable to include Pot as a random factor in the model.
Pot  Tube  x2   n2   Var1  Var2
1    1     9    10   F     K
1    2     7    10   F     K
1    3     7    10   F     K
1    4     8    10   F     K
2    5     9    10   F     K
2    6     10   10   F     K
2    7     10   10   F     K
2    8     10   10   F     K
3    9     10   10   F     K
3    10    10   10   F     K
3    11    9    10   F     K
3    12    7    10   F     K
4    13    8    10   F     K
4    14    9    10   F     K
4    15    8    10   F     K
4    16    9    10   F     K
5    17    9    10   F     K
5    18    9    10   F     K
5    19    10   10   F     K
5    20    10   10   F     K
1    1     8    10   K     F
1    2     7    10   K     F
1    3     10   10   K     F
1    4     7    10   K     F
2    5     9    10   K     F
2    6     8    10   K     F
2    7     10   10   K     F
2    8     8    10   K     F
3    9     10   10   K     F
3    10    10   10   K     F
3    11    10   10   K     F
3    12    10   10   K     F
4    13    7    10   K     F
4    14    10   10   K     F
4    15    7    10   K     F
4    16    10   10   K     F
5    17    8    10   K     F
5    18    7    10   K     F
5    19    10   10   K     F
5    20    7    10   K     F
1    1     9    10   H     K
1    2     7    10   H     K
1    3     7    10   H     K
1    4     8    10   H     K
2    5     7    10   H     K
2    6     10   10   H     K
2    7     9    10   H     K
2    8     8    10   H     K
3    9     8    10   H     K
3    10    9    10   H     K
3    11    7    10   H     K
3    12    9    10   H     K
4    13    10   10   H     K
4    14    8    10   H     K
4    15    6    10   H     K
4    16    9    10   H     K
5    17    9    10   H     K
5    18    8    10   H     K
5    19    8    10   H     K
5    20    9    10   H     K
1    1     8    10   K     H
1    2     8    10   K     H
1    3     8    10   K     H
1    4     8    10   K     H
2    5     7    10   K     H
2    6     8    10   K     H
2    7     8    10   K     H
2    8     7    10   K     H
3    9     9    10   K     H
3    10    8    10   K     H
3    11    9    10   K     H
3    12    6    10   K     H
4    13    7    10   K     H
4    14    8    10   K     H
4    15    8    10   K     H
4    16    9    10   K     H
5    17    6    10   K     H
5    18    9    10   K     H
5    19    7    10   K     H
5    20    8    10   K     H
1    1     9    10   F     H
1    2     7    10   F     H
1    3     9    10   F     H
1    4     8    10   F     H
2    5     7    10   F     H
2    6     9    10   F     H
2    7     8    10   F     H
2    8     8    10   F     H
3    9     7    10   F     H
3    10    7    10   F     H
3    11    7    10   F     H
3    12    6    10   F     H
4    13    8    10   F     H
4    14    8    10   F     H
4    15    8    10   F     H
4    16    7    10   F     H
5    17    7    10   F     H
5    18    7    10   F     H
5    19    2    10   F     H
5    20    7    10   F     H
1    1     9    10   H     F
1    2     10   10   H     F
1    3     9    10   H     F
1    4     9    10   H     F
2    5     7    10   H     F
2    6     9    10   H     F
2    7     8    10   H     F
2    8     7    10   H     F
3    9     7    10   H     F
3    10    7    10   H     F
3    11    10   10   H     F
3    12    7    10   H     F
4    13    8    10   H     F
4    14    9    10   H     F
4    15    10   10   H     F
4    16    8    10   H     F
5    17    10   10   H     F
5    18    9    10   H     F
5    19    8    10   H     F
5    20    7    10   H     F
1    1     10   10   A     K
1    2     8    10   A     K
1    3     9    10   A     K
1    4     7    10   A     K
2    5     8    10   A     K
2    6     6    10   A     K
2    7     8    10   A     K
2    8     7    10   A     K
3    9     8    10   A     K
3    10    8    10   A     K
3    11    10   10   A     K
3    12    9    10   A     K
4    13    7    10   A     K
4    14    8    10   A     K
4    15    10   10   A     K
4    16    8    10   A     K
5    17    7    10   A     K
5    18    10   10   A     K
5    19    8    10   A     K
5    20    7    10   A     K
1    1     7    10   K     A
1    2     9    10   K     A
1    3     10   10   K     A
1    4     8    10   K     A
2    5     8    10   K     A
2    6     8    10   K     A
2    7     8    10   K     A
2    8     7    10   K     A
3    9     6    10   K     A
3    10    3    10   K     A
3    11    7    10   K     A
3    12    8    10   K     A
4    13    4    10   K     A
4    14    8    10   K     A
4    15    9    10   K     A
4    16    8    10   K     A
5    17    6    10   K     A
5    18    5    10   K     A
5    19    10   10   K     A
5    20    7    10   K     A
1    1     8    10   A     F
1    2     5    10   A     F
1    3     5    10   A     F
1    4     6    10   A     F
2    5     8    10   A     F
2    6     10   10   A     F
2    7     3    10   A     F
2    8     7    10   A     F
3    9     6    10   A     F
3    10    9    10   A     F
3    11    6    10   A     F
3    12    8    10   A     F
4    13    8    10   A     F
4    14    5    10   A     F
4    15    8    10   A     F
4    16    6    10   A     F
5    17    5    10   A     F
5    18    6    10   A     F
5    19    9    10   A     F
5    20    7    10   A     F
1    1     8    10   F     A
1    2     10   10   F     A
1    3     9    10   F     A
1    4     9    10   F     A
2    5     8    10   F     A
2    6     8    10   F     A
2    7     7    10   F     A
2    8     7    10   F     A
3    9     7    10   F     A
3    10    8    10   F     A
3    11    10   10   F     A
3    12    6    10   F     A
4    13    10   10   F     A
4    14    9    10   F     A
4    15    8    10   F     A
4    16    9    10   F     A
5    17    9    10   F     A
5    18    8    10   F     A
5    19    8    10   F     A
5    20    9    10   F     A
1    1     7    10   A     H
1    2     7    10   A     H
1    3     6    10   A     H
1    4     4    10   A     H
2    5     8    10   A     H
2    6     7    10   A     H
2    7     5    10   A     H
2    8     7    10   A     H
3    9     7    10   A     H
3    10    5    10   A     H
3    11    6    10   A     H
3    12    6    10   A     H
4    13    10   10   A     H
4    14    9    10   A     H
4    15    8    10   A     H
4    16    8    10   A     H
5    17    6    10   A     H
5    18    7    10   A     H
5    19    5    10   A     H
5    20    8    10   A     H
1    1     7    10   H     A
1    2     6    10   H     A
1    3     9    10   H     A
1    4     8    10   H     A
2    5     9    10   H     A
2    6     10   10   H     A
2    7     8    10   H     A
2    8     5    10   H     A
3    9     5    10   H     A
3    10    9    10   H     A
3    11    8    10   H     A
3    12    8    10   H     A
4    13    10   10   H     A
4    14    9    10   H     A
4    15    9    10   H     A
4    16    9    10   H     A
5    17    8    10   H     A
5    18    7    10   H     A
5    19    9    10   H     A
5    20    5    10   H     A
1    1     9    10   K     K
1    2     9    10   K     K
1    3     9    10   K     K
1    4     5    8    K     K
2    5     10   10   K     K
2    6     8    10   K     K
2    7     9    10   K     K
2    8     7    10   K     K
3    9     8    10   K     K
3    10    10   10   K     K
3    11    7    10   K     K
3    12    10   10   K     K
4    13    9    10   K     K
4    14    7    10   K     K
4    15    8    10   K     K
4    16    8    10   K     K
5    17    8    10   K     K
5    18    7    10   K     K
5    19    10   10   K     K
5    20    9    10   K     K
1    1     9    10   A     A
1    2     10   10   A     A
1    3     10   10   A     A
1    4     10   10   A     A
2    5     10   10   A     A
2    6     6    10   A     A
2    7     9    10   A     A
2    8     8    10   A     A
3    9     8    10   A     A
3    10    10   10   A     A
3    11    10   10   A     A
3    12    8    10   A     A
4    13    4    10   A     A
4    14    8    10   A     A
4    15    8    10   A     A
4    16    8    10   A     A
5    17    7    10   A     A
5    18    9    10   A     A
5    19    9    10   A     A
5    20    8    10   A     A
1    1     9    10   F     F
1    2     8    10   F     F
1    3     9    10   F     F
1    4     8    10   F     F
2    5     9    10   F     F
2    6     9    10   F     F
2    7     8    10   F     F
2    8     9    10   F     F
3    9     8    10   F     F
3    10    8    10   F     F
3    11    8    10   F     F
3    12    7    10   F     F
4    13    8    10   F     F
4    14    8    10   F     F
4    15    8    10   F     F
4    16    9    10   F     F
5    17    10   10   F     F
5    18    9    10   F     F
5    19    10   10   F     F
5    20    8    10   F     F
1    1     9    10   H     H
1    2     7    10   H     H
1    3     8    10   H     H
1    4     7    10   H     H
2    5     8    10   H     H
2    6     6    10   H     H
2    7     8    10   H     H
2    8     7    10   H     H
3    9     10   10   H     H
3    10    7    10   H     H
3    11    9    10   H     H
3    12    8    10   H     H
4    13    8    10   H     H
4    14    10   10   H     H
4    15    8    10   H     H
4    16    9    10   H     H
5    17    10   10   H     H
5    18    10   10   H     H
5    19    8    10   H     H
5    20    9    10   H     H
Appendix A: Introduction to matrix algebra
Some basic definitions

Definition: A vector is an ordered set of numbers. Each number has a given position.

Example: $\mathbf{x} = \begin{pmatrix} 5 \\ 3 \\ 8 \end{pmatrix}$ is a column vector with 3 elements.

Example: $\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$ is a column vector with n elements.

Definition: A matrix is a two-dimensional (rectangular) ordered set of numbers.

Example: $\mathbf{A} = \begin{pmatrix} 1 & 2 & 4 \\ 1 & 6 & 3 \end{pmatrix}$ is a matrix with two rows and three columns.

Example: $\mathbf{B} = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1c} \\ b_{21} & b_{22} & \cdots & b_{2c} \\ \vdots & \vdots & & \vdots \\ b_{r1} & b_{r2} & \cdots & b_{rc} \end{pmatrix}$ is a matrix with r rows and c columns.

The general element of the matrix $\mathbf{B}$ is $b_{ij}$. The first index denotes row, the second index denotes column. Vectors are often written using lowercase symbols like $\mathbf{x}$, while matrices are often written using uppercase letters like $\mathbf{A}$. Both matrices and vectors are written in bold.

The dimension of a matrix

Definition: A matrix that has r rows and c columns is said to have dimension r × c.

Definition: A column matrix with n rows has dimension n × 1.

Definition: A row matrix with m columns has dimension 1 × m.

Definition: A scalar, i.e. a number, is a matrix that has dimension 1 × 1.
The transpose of a matrix

Transposing a matrix means to interchange rows and columns. If $\mathbf{A}$ is a matrix of dimension r × c then the transpose of $\mathbf{A}$ is a matrix of dimension c × r. The transpose operator is denoted with a prime, ′, so the transpose of $\mathbf{A}$ is denoted with $\mathbf{A}'$ (some textbooks indicate a transpose by using the letter T).

For the elements $a'_{ij}$ of $\mathbf{A}'$ it holds that

$$a'_{ij} = a_{ji}$$

If $\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$ is a column vector, then $\mathbf{x}' = \begin{pmatrix} x_1 & x_2 & \ldots & x_n \end{pmatrix}$ is a row vector with n elements.

Example: The transpose of the matrix $\mathbf{A} = \begin{pmatrix} 1 & 2 & 4 \\ 1 & 6 & 3 \end{pmatrix}$ is $\mathbf{A}' = \begin{pmatrix} 1 & 1 \\ 2 & 6 \\ 4 & 3 \end{pmatrix}$.
Some special types of matrices

Definition: A matrix where the number of rows = number of columns (i.e. r = c) is a square matrix.

Definition: A square matrix that is unchanged when transposed is symmetric.

Example: The matrix $\mathbf{A} = \begin{pmatrix} 3 & 0 & -1 \\ 0 & 1 & 2 \\ -1 & 2 & 4 \end{pmatrix}$ is square and symmetric.

Definition: The elements $a_{ii}$ in a square matrix are called the diagonal elements.

Definition: An identity matrix $\mathbf{I}$ is a symmetric matrix where all elements are 0, except that the diagonal elements are 1: $\mathbf{I} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}$

Definition: A diagonal matrix is a matrix where all elements are 0, except for the diagonal elements: $\mathbf{D}(a_i) = \begin{pmatrix} a_1 & 0 & \cdots & 0 \\ 0 & a_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_r \end{pmatrix}$

Definition: A unit vector is a vector where all elements are 1: $\mathbf{1} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}$. The transpose is $\mathbf{1}' = \begin{pmatrix} 1 & 1 & \cdots & 1 \end{pmatrix}$.
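Calculations on matrices

Addition, subtraction and multiplication can be defined for matrices.

Definition: Equality: Two matrices $\mathbf{A}$ and $\mathbf{B}$ with the same dimension r × c are equal if and only if $a_{ij} = b_{ij}$ for all i and j, i.e. if all elements are equal.

Definition: Addition: The sum of two matrices $\mathbf{A}$ and $\mathbf{B}$ that have the same dimension is the matrix that consists of the sum of the elements of $\mathbf{A}$ and $\mathbf{B}$.

Example: If $\mathbf{A} = \begin{pmatrix} 1 & 2 & 4 \\ 1 & 6 & 3 \end{pmatrix}$ and $\mathbf{B} = \begin{pmatrix} 3 & 9 & 6 \\ 4 & 2 & 1 \end{pmatrix}$ then $\mathbf{A} + \mathbf{B} = \begin{pmatrix} 4 & 11 & 10 \\ 5 & 8 & 4 \end{pmatrix}$.

Example: If $\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1c} \\ a_{21} & a_{22} & \cdots & a_{2c} \\ \vdots & \vdots & & \vdots \\ a_{r1} & a_{r2} & \cdots & a_{rc} \end{pmatrix}$ and $\mathbf{B} = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1c} \\ b_{21} & b_{22} & \cdots & b_{2c} \\ \vdots & \vdots & & \vdots \\ b_{r1} & b_{r2} & \cdots & b_{rc} \end{pmatrix}$ then

$$\mathbf{A} + \mathbf{B} = \begin{pmatrix} a_{11} + b_{11} & a_{12} + b_{12} & \cdots & a_{1c} + b_{1c} \\ a_{21} + b_{21} & a_{22} + b_{22} & \cdots & a_{2c} + b_{2c} \\ \vdots & \vdots & & \vdots \\ a_{r1} + b_{r1} & a_{r2} + b_{r2} & \cdots & a_{rc} + b_{rc} \end{pmatrix}.$$

Definition: Subtraction: Matrix subtraction is defined in an analogous way.

It holds that

$$\mathbf{A} + \mathbf{B} = \mathbf{B} + \mathbf{A}$$
$$\mathbf{A} + (\mathbf{B} + \mathbf{C}) = (\mathbf{A} + \mathbf{B}) + \mathbf{C}$$
$$\mathbf{A} - (\mathbf{B} - \mathbf{C}) = (\mathbf{A} - \mathbf{B}) + \mathbf{C}$$

For matrices that do not have the same dimensions, addition and subtraction are not defined.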
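Matrix multiplication

Multiplication by a scalar

To multiply a matrix $\mathbf{A}$ by a scalar (= a number) c means that all elements in $\mathbf{A}$ are multiplied by c.

Example: If $\mathbf{A} = \begin{pmatrix} 1 & 2 & 4 \\ 1 & 6 & 3 \end{pmatrix}$ then $4 \cdot \mathbf{A} = \begin{pmatrix} 4 & 8 & 16 \\ 4 & 24 & 12 \end{pmatrix}$.

Example: $k \cdot \mathbf{A} = \begin{pmatrix} k \cdot a_{11} & k \cdot a_{12} & \cdots & k \cdot a_{1c} \\ k \cdot a_{21} & k \cdot a_{22} & \cdots & k \cdot a_{2c} \\ \vdots & \vdots & & \vdots \\ k \cdot a_{r1} & k \cdot a_{r2} & \cdots & k \cdot a_{rc} \end{pmatrix}$.

Multiplication by a matrix

Matrix multiplication of type $\mathbf{C} = \mathbf{A} \cdot \mathbf{B}$ is defined only if the number of columns in $\mathbf{A}$ is equal to the number of rows in $\mathbf{B}$. If $\mathbf{A}$ has dimension p × r and $\mathbf{B}$ has dimension r × q then the product $\mathbf{A} \cdot \mathbf{B}$ will have dimension p × q. The elements of $\mathbf{C}$ are calculated as

$$c_{ij} = \sum_{k=1}^{r} a_{ik} b_{kj}.$$

Example: If $\mathbf{A} = \begin{pmatrix} 1 & 2 & 3 \\ -1 & 0 & 1 \end{pmatrix}$ and $\mathbf{B} = \begin{pmatrix} 6 & 5 & 4 \\ -1 & 1 & -1 \\ 0 & 2 & 0 \end{pmatrix}$ then

$$\mathbf{A}\mathbf{B} = \begin{pmatrix} 1 \cdot 6 + 2 \cdot (-1) + 3 \cdot 0 & 1 \cdot 5 + 2 \cdot 1 + 3 \cdot 2 & 1 \cdot 4 + 2 \cdot (-1) + 3 \cdot 0 \\ -1 \cdot 6 + 0 \cdot (-1) + 1 \cdot 0 & -1 \cdot 5 + 0 \cdot 1 + 1 \cdot 2 & -1 \cdot 4 + 0 \cdot (-1) + 1 \cdot 0 \end{pmatrix} = \begin{pmatrix} 4 & 13 & 2 \\ -6 & -3 & -4 \end{pmatrix}.$$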
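Calculation rules of multiplication

It holds that

$$\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A} \cdot \mathbf{B} + \mathbf{A} \cdot \mathbf{C}$$
$$\mathbf{A}(\mathbf{B} \cdot \mathbf{C}) = (\mathbf{A} \cdot \mathbf{B}) \cdot \mathbf{C}.$$

Note that in general, $\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}$. The order has importance for multiplication. In the expression $\mathbf{A}\mathbf{B}$ the matrix $\mathbf{A}$ has been post-multiplied with the matrix $\mathbf{B}$. In the expression $\mathbf{B}\mathbf{A}$ the matrix $\mathbf{A}$ has been pre-multiplied with the matrix $\mathbf{B}$. Note that $(\mathbf{A}\mathbf{B})' = \mathbf{B}'\mathbf{A}'$.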
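The multiplication rule and the calculation rules above can be checked with a few lines of Python. The sketch below uses plain nested lists and two small helper functions whose names are hypothetical; it reproduces the product AB from the example, checks that (AB)′ = B′A′, and shows with two small square matrices (chosen only for illustration) that the order of multiplication matters.

```python
def matmul(a, b):
    """Matrix product: c_ij = sum over k of a_ik * b_kj."""
    if len(a[0]) != len(b):
        raise ValueError("columns of the first matrix must equal rows of the second")
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    """Interchange rows and columns."""
    return [list(col) for col in zip(*a)]


A = [[1, 2, 3],
     [-1, 0, 1]]
B = [[6, 5, 4],
     [-1, 1, -1],
     [0, 2, 0]]

AB = matmul(A, B)
print(AB)

# Transpose of a product: (AB)' = B'A'
print(transpose(AB) == matmul(transpose(B), transpose(A)))

# Two square matrices for which PQ and QP differ
P = [[1, 2], [3, 4]]
Q = [[0, 1], [1, 0]]
PQ = matmul(P, Q)
QP = matmul(Q, P)
print(PQ, QP)
```

Note that BA is not even defined for the matrices A and B of the example, since B has three columns while A has two rows.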
Idempotent matrices

Definition: A matrix $\mathbf{A}$ is idempotent if $\mathbf{A} \cdot \mathbf{A} = \mathbf{A}$.
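The inverse of a matrix

Definition: The inverse of a square matrix $\mathbf{A}$ is the unique matrix $\mathbf{A}^{-1}$ for which it holds that $\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$. That is: the matrix multiplied with its inverse results in the identity matrix. (Note that the same rule holds for scalars: $3 \cdot 3^{-1} = 3 \cdot \frac{1}{3} = \frac{3}{3} = 1$.)

Example: The inverse of the matrix $\mathbf{A} = \begin{pmatrix} 5 & 10 \\ 3 & 2 \end{pmatrix}$ is $\mathbf{A}^{-1} = \begin{pmatrix} -0.1 & 0.5 \\ 0.15 & -0.25 \end{pmatrix}$.

To verify this we calculate

$$\mathbf{A} \cdot \mathbf{A}^{-1} = \begin{pmatrix} 5 & 10 \\ 3 & 2 \end{pmatrix} \begin{pmatrix} -0.1 & 0.5 \\ 0.15 & -0.25 \end{pmatrix} = \begin{pmatrix} 5 \cdot (-0.1) + 10 \cdot 0.15 & 5 \cdot 0.5 + 10 \cdot (-0.25) \\ 3 \cdot (-0.1) + 2 \cdot 0.15 & 3 \cdot 0.5 + 2 \cdot (-0.25) \end{pmatrix} = \begin{pmatrix} 1.0 & 0 \\ 0 & 1.0 \end{pmatrix} = \mathbf{I}.$$

It is possible that the inverse $\mathbf{A}^{-1}$ does not exist. $\mathbf{A}$ is then said to be singular.

The following relations hold for inverses:

The inverse of a symmetric matrix is symmetric.

$(\mathbf{A}')^{-1} = \left(\mathbf{A}^{-1}\right)'$.

The inverse of a product of several matrices is obtained by taking the product of the inverses, in opposite order: $(\mathbf{A}\mathbf{B}\mathbf{C})^{-1} = \mathbf{C}^{-1}\mathbf{B}^{-1}\mathbf{A}^{-1}$.

If c is a scalar different from zero, then $(c\mathbf{A})^{-1} = \frac{1}{c}\mathbf{A}^{-1}$.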
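As a small check of the example and of the scalar rule, the following Python sketch inverts a 2 × 2 matrix with the familiar closed-form expression and verifies that the product with the original matrix is the identity matrix. Exact fractions are used so that no rounding is involved. The closed-form 2 × 2 inverse and the helper names are not part of the text above; they are used here only for illustration.

```python
from fractions import Fraction


def inverse_2x2(m):
    """Inverse of a 2 x 2 matrix via its determinant; fails for singular matrices."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


A = [[Fraction(5), Fraction(10)],
     [Fraction(3), Fraction(2)]]
A_inv = inverse_2x2(A)
print(A_inv)                 # -1/10, 1/2, 3/20, -1/4
print(matmul(A, A_inv))      # identity matrix

# (cA)^-1 = (1/c) A^-1 for a non-zero scalar c
c = Fraction(4)
left = inverse_2x2([[c * v for v in row] for row in A])
right = [[v / c for v in row] for row in A_inv]
print(left == right)
```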
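Generalized inverses

A matrix $\mathbf{B}$ is said to be a generalized inverse of the matrix $\mathbf{A}$ if $\mathbf{A}\mathbf{B}\mathbf{A} = \mathbf{A}$. The generalized inverse of a matrix $\mathbf{A}$ is denoted with $\mathbf{A}^{-}$. If $\mathbf{A}$ is nonsingular then $\mathbf{A}^{-} = \mathbf{A}^{-1}$. When $\mathbf{A}$ is singular, $\mathbf{A}^{-}$ is not unique. When $\mathbf{A}'\mathbf{A}$ is nonsingular, a generalized inverse of a matrix $\mathbf{A}$ can be calculated as

$$\mathbf{A}^{-} = (\mathbf{A}'\mathbf{A})^{-1}\mathbf{A}'.$$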
The rank of a matrix

Definition: Two vectors are linearly dependent if the elements of one vector are proportional to the elements of the other vector.

Example: If $\mathbf{x}' = \begin{pmatrix} 1 & 0 & 1 \end{pmatrix}$ and $\mathbf{y}' = \begin{pmatrix} 4 & 0 & 4 \end{pmatrix}$ then the vectors $\mathbf{x}$ and $\mathbf{y}$ are linearly dependent.

Definition: A set of vectors are linearly independent if it is impossible to write any one of the vectors as a linear combination of the others.

Example: The vectors $\mathbf{t}' = \begin{pmatrix} 1 & 0 & 0 \end{pmatrix}$, $\mathbf{u}' = \begin{pmatrix} 0 & 1 & 0 \end{pmatrix}$ and $\mathbf{v}' = \begin{pmatrix} 0 & 0 & 1 \end{pmatrix}$ are linearly independent.

Definition: The degree of linear independence among a set of vectors is called the rank of the matrix that is composed by the vectors.

The following properties hold for the rank of a matrix:

The rank of $\mathbf{A}^{-1}$ is equal to the rank of $\mathbf{A}$.

The rank of $\mathbf{A}'\mathbf{A}$ is equal to the rank of $\mathbf{A}$ (it is also true that the rank of $\mathbf{A}\mathbf{A}'$ is equal to the rank of $\mathbf{A}$).

The rank of a matrix $\mathbf{A}$ does not change if $\mathbf{A}$ is pre- or postmultiplied with a nonsingular matrix.
Determinants
To each square matrix $A$ belongs a unique scalar that is called the determinant of $A$. The determinant of $A$ is written as $|A|$. The determinant of a matrix of dimension $n$ can be calculated as
$$|A| = \sum_{\pi(n)} (-1)^{\#(\pi(n))} \prod_{i=1}^{n} a_{\pi_i, i}.$$
Here, $\pi(n)$ denotes any permutation of the numbers $1, 2, \ldots, n$, and the sum is taken over all such permutations. $\#(\pi(n))$ denotes the number of inversions of a permutation $\pi(n)$. This is the number of exchanges of pairs of the numbers in $\pi(n)$ that are needed to bring them back into natural order. Determinants of small matrices can be calculated by hand, but for larger matrices we prefer to leave the work to computers. If $A$ is singular, then the determinant $|A| = 0$.
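The permutation formula above translates directly into a short program. This is a sketch for illustration only, since the number of permutations grows as $n!$ and the approach is impractical for large matrices; the function names are ours.

```python
from itertools import permutations


def count_inversions(perm):
    """Number of pairs (i, j) with i < j but perm[i] > perm[j]."""
    return sum(1 for i in range(len(perm))
               for j in range(i + 1, len(perm)) if perm[i] > perm[j])


def determinant(a):
    """Determinant as a signed sum of products over all permutations."""
    n = len(a)
    total = 0
    for perm in permutations(range(n)):
        term = (-1) ** count_inversions(perm)
        for i in range(n):
            term *= a[perm[i]][i]      # element a_{pi_i, i}
        total += term
    return total


det_A = determinant([[5, 10], [3, 2]])          # 5*2 - 3*10 = -20
det_singular = determinant([[1, 2], [2, 4]])    # proportional rows give 0
```

The value $-20$ for the matrix from the inverse example is nonzero, in agreement with the fact that its inverse exists.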
Eigenvalues and eigenvectors
To each symmetric square matrix $A$ of dimension $n \times n$ belong $n$ scalars that are called the eigenvalues of $A$. These are solutions to the equation $|A - \lambda I| = 0$. The eigenvalues have the following properties:
The product of all eigenvalues of $A$ is equal to $|A|$.
The sum of all eigenvalues of $A$ is equal to $\mathrm{tr}(A)$, which is the sum of the diagonal elements of $A$. The symbol $\mathrm{tr}(A)$ can be read as "the trace of $A$".
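For a symmetric $2 \times 2$ matrix the equation $|A - \lambda I| = 0$ is a quadratic in $\lambda$, so both properties can be checked by hand or with a few lines of code. The matrix below is a hypothetical example chosen for easy arithmetic.

```python
import math


def eigenvalues_sym_2x2(a):
    """Eigenvalues of a symmetric 2x2 matrix [[p, q], [q, r]].

    |A - lambda*I| = lambda^2 - (p + r)*lambda + (p*r - q^2) = 0.
    """
    p, q, r = a[0][0], a[0][1], a[1][1]
    trace = p + r
    det = p * r - q * q
    root = math.sqrt(trace * trace - 4 * det)   # never negative when A is symmetric
    return (trace - root) / 2, (trace + root) / 2


A = [[2, 1], [1, 2]]
lam1, lam2 = eigenvalues_sym_2x2(A)   # 1.0 and 3.0
product = lam1 * lam2                 # equals |A| = 3
total = lam1 + lam2                   # equals tr(A) = 4
```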
Some statistical formulas on matrix form
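$$x'x = \begin{pmatrix} x_1 & x_2 & \ldots & x_n \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \sum_{i=1}^{n} x_i^2$$
$$x'y = \begin{pmatrix} x_1 & x_2 & \ldots & x_n \end{pmatrix}\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \sum_{i=1}^{n} x_i y_i$$
$$1'y = \sum_{i=1}^{n} y_i$$
$$1'1 = n$$
$$1'y\,n^{-1} = (1'1)^{-1}\,1'y = \bar{y}$$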
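The formulas above are all inner products, which makes them easy to check on a small example. The data values below are made up for illustration.

```python
def dot(u, v):
    """Inner product u'v of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


x = [1.0, 2.0, 3.0, 4.0]      # hypothetical data
y = [2.0, 4.0, 5.0, 9.0]      # hypothetical data
ones = [1.0] * len(y)         # the vector 1

sum_sq = dot(x, x)            # x'x = sum of x_i^2
cross = dot(x, y)             # x'y = sum of x_i * y_i
total = dot(ones, y)          # 1'y = sum of y_i
n = dot(ones, ones)           # 1'1 = n
mean_y = total / n            # (1'1)^{-1} 1'y = mean of y
```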
Further reading This chapter has only given a very brief and sketchy introduction to matrix algebra. A more complete treatment can be found in textbooks such as Searle (1982).
Appendix B: Inference using likelihood methods
The likelihood function
Suppose that we want to estimate the (single) parameter $\theta$ in some distribution. We assume that the distribution has some density function $f(x; \theta)$; we use the term "density function" whether $x$ is continuous or discrete. We take a random sample of size $n$ from the distribution and end up with the observation vector $x' = \begin{pmatrix} x_1 & x_2 & \ldots & x_n \end{pmatrix}$. The likelihood function of our sample is defined as
$$L = \prod_{i=1}^{n} f(x_i; \theta) \qquad \text{(B.1)}$$
For discrete distributions, $L$ is the probability of obtaining our sample. For continuous distributions we use the term "likelihood" rather than probability since the probability of obtaining any specified value of $x$ is zero. In either case, $L$ indicates how likely our sample is, given the value of $\theta$. The Maximum Likelihood estimator of $\theta$ is the value $\hat{\theta}$ which maximizes the likelihood function $L$. This seems intuitively sensible: we choose as our estimator the value of $\theta$ for which our sample of observations is most likely.
In many cases it is more convenient to work with the log of the likelihood function. There are three reasons for this. First, the log function is monotone, which means that $L$ and $l = \log(L)$ have their maxima for the same parameter values. Secondly, the behavior of $L$ can often be such that it is difficult numerically to find the maximum. Thirdly, if we take logs, we will replace the product sign with a summation sign, which makes derivations somewhat easier. Thus, maximizing the likelihood (B.1) is equivalent to maximizing the log likelihood
$$l = \log(L) = \sum_{i=1}^{n} \log\left(f(x_i; \theta)\right) \qquad \text{(B.2)}$$
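with respect to $\theta$. This is done by differentiating (B.2) with respect to $\theta$. This gives the so-called score equation
$$\frac{dl}{d\theta} = \frac{d\left(\sum_{i=1}^{n} \log\left(f(x_i; \theta)\right)\right)}{d\theta} = \sum_{i=1}^{n} \frac{f'(x_i; \theta)}{f(x_i; \theta)} = 0. \qquad \text{(B.3)}$$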
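As a worked illustration of (B.1) to (B.3), consider a sample from the exponential distribution with density $f(x; \lambda) = \lambda e^{-\lambda x}$, the density that also appears in Exercise 2.1. The choice of distribution is ours and is made only for illustration; $\lambda$ plays the role of $\theta$.

```latex
\begin{align*}
L &= \prod_{i=1}^{n} \lambda e^{-\lambda x_i}
   = \lambda^{n} e^{-\lambda \sum_{i=1}^{n} x_i}, \\
l &= \log(L) = n \log\lambda - \lambda \sum_{i=1}^{n} x_i, \\
\frac{dl}{d\lambda} &= \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0
   \quad\Longrightarrow\quad
   \hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}.
\end{align*}
```

The second derivative, $-n/\lambda^2$, is negative for all $\lambda > 0$, so the solution of the score equation is indeed a maximum.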
The Cramér-Rao inequality
We state without proof the following theorem: The variance of any unbiased estimator $\hat{\theta}$ of $\theta$ must follow the Cramér-Rao inequality
$$Var\left(\hat{\theta}\right) \geq I_\theta^{-1} \qquad \text{(B.4)}$$
where
$$I_\theta = E\left[\left(\frac{d \log\left(L(\theta; x)\right)}{d\theta}\right)^2\right] = E\left[\left(\frac{dl}{d\theta}\right)^2\right] \qquad \text{(B.5)}$$
$I_\theta^{-1}$ is called the Cramér-Rao lower bound. $I_\theta$ is called the Fisher information about $\theta$. The connection between variance and information is that an estimator that has small variance gives us more information about the parameter.
Properties of Maximum Likelihood estimators
The following properties of Maximum Likelihood estimators hold under fairly weak regularity conditions:
Maximum Likelihood estimators can be biased or unbiased.
Maximum Likelihood estimators are consistent.
Maximum Likelihood estimators are asymptotically efficient.
Maximum Likelihood estimators are asymptotically normally distributed.
The asymptotic efficiency means that the variance of ML estimators approaches the Cramér-Rao lower bound as $n$ increases. Together with the asymptotic normality, this means that, in large samples, we can regard $\hat{\theta}$ as normally distributed with mean $\theta$ and variance $I_\theta^{-1}$: $\hat{\theta} \sim N\left(\theta, I_\theta^{-1}\right)$.
Distributions with many parameters
So far, we have discussed Maximum Likelihood estimation of a single parameter $\theta$. In the case where the distribution has, say, $p$ parameters, the expressions we have given so far must be written as vectors and matrices. If we have an observation vector $x$ of dimension $n \times 1$ and a parameter vector $\theta$ of dimension $p \times 1$, the log likelihood can be written as
$$l = \log(L) = \sum_{i=1}^{n} \log\left(f(x_i; \theta)\right). \qquad \text{(B.6)}$$
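$l$ should be maximized with respect to all elements of $\theta$. The set of $p$ score equations is
$$\frac{\partial l}{\partial\theta_j} = \frac{\partial\left(\sum_{i=1}^{n} \log\left(f(x_i; \theta)\right)\right)}{\partial\theta_j} = 0, \quad j = 1, \ldots, p. \qquad \text{(B.7)}$$
The asymptotic covariance matrix of $\hat{\theta}$ is the inverse of the Fisher information matrix $I_\theta$ that has as its $(j, k)$:th element
$$I_{j,k} = E\left[\left(\frac{\partial l}{\partial\theta_j}\right)\left(\frac{\partial l}{\partial\theta_k}\right)\right] \qquad \text{(B.8)}$$
The Maximum Likelihood estimator $\hat{\theta}$ of the parameter vector $\theta$ is asymptotically multivariate Normal with mean $\theta$ and covariance matrix $I_\theta^{-1}$.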
Numerical procedures For complex distributions the score equations may be difficult to solve analytically. Numerical procedures have been developed that mostly, but not always, converge to the solution. Two commonly used procedures are the Newton-Raphson method and Fisher’s method of scoring.
The Newton-Raphson method
We wish to maximize the log likelihood $l(\theta; x)$. Denote the vector of first derivatives of the log likelihood with respect to the elements of $\theta$ by $g(\theta)$, and denote the matrix of second derivatives by $H(\theta)$. Thus, the $(j, k)$:th element of $H$ is $\partial^2 l / \partial\theta_j \partial\theta_k$. The matrix $H$ is known as the Hessian matrix.
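Suppose that we guess an initial estimate $\theta_0$ of $\hat{\theta}$. The method works by a first-order Taylor series expansion of $g(\theta)$ around $\theta_0$, evaluated at $\hat{\theta}$:
$$g\left(\hat{\theta}\right) = g(\theta_0) + \left(\hat{\theta} - \theta_0\right) H(\theta_0).$$
Since $g\left(\hat{\theta}\right) = 0$, this leads to a new approximation
$$\theta_1 = \theta_0 - g(\theta_0)\, H^{-1}(\theta_0). \qquad \text{(B.9)}$$
We can now substitute $\theta_1$ for $\theta_0$ in (B.9). We get a series of estimates $\theta_1$, $\theta_2$, and so on until the process has converged.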
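The iteration (B.9) can be sketched for a one-parameter case. The example below assumes an exponential density $\lambda e^{-\lambda x}$, for which $g(\lambda) = n/\lambda - \sum x_i$ and $H(\lambda) = -n/\lambda^2$; this choice, the data values and the function name are ours. The closed-form estimate $n/\sum x_i$ is available for comparison.

```python
def newton_raphson_exponential(x, lam0, tol=1e-10, max_iter=50):
    """Maximize the exponential log likelihood n*log(lam) - lam*sum(x)."""
    n = len(x)
    s = sum(x)
    lam = lam0
    for _ in range(max_iter):
        g = n / lam - s            # first derivative of the log likelihood
        h = -n / lam ** 2          # second derivative (the Hessian, a scalar here)
        lam_new = lam - g / h      # update (B.9) in the scalar case
        if abs(lam_new - lam) < tol:
            return lam_new
        lam = lam_new
    raise RuntimeError("Newton-Raphson did not converge")


data = [1.0, 2.0, 3.0]                                   # hypothetical sample
lam_hat = newton_raphson_exponential(data, lam0=0.1)     # iterates towards 0.5
closed_form = len(data) / sum(data)                      # n / sum(x) = 0.5
```

With these data the starting value must lie between 0 and 1 for the iterates to stay positive, which illustrates the remark that such procedures converge mostly, but not always.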
Fisher's scoring
Fisher's scoring method is a variation of the Newton-Raphson method. The basic idea is to replace the Hessian matrix $H$ with its expected value. It holds that $E[H(\theta)] = -I_\theta$, where $I_\theta$ is the Fisher information matrix. There are two advantages to using the expected Hessian rather than the Hessian itself. First, it can be shown that
$$E\left(\frac{\partial^2 l}{\partial\theta_j \partial\theta_k}\right) = -E\left[\left(\frac{\partial l}{\partial\theta_j}\right)\left(\frac{\partial l}{\partial\theta_k}\right)\right] \qquad \text{(B.10)}$$
Thus, to calculate the expected Hessian we do not need to evaluate the second-order derivatives; it suffices to calculate the first-order derivatives of type $\partial l / \partial\theta_j$. A second advantage is that the information matrix, which is the negative of the expected Hessian, is guaranteed to be non-negative definite, so some non-convergence problems with the Newton-Raphson method do not occur. On the other hand, Fisher's scoring method often converges more slowly than the Newton-Raphson method. However, for distributions in the exponential family, the Newton-Raphson method and Fisher's scoring method are equivalent.
Fisher's scoring method can be regarded, at each step, as a kind of weighted least squares procedure. In the generalized linear model context, the method is also called Iteratively reweighted least squares.
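The weighted least squares interpretation can be sketched for a Poisson model with log link, $\log \mu_i = \beta_0 + \beta_1 x_i$. This is an illustration under our own assumptions: the data are made up, the function name is ours, and the working response $z_i = \eta_i + (y_i - \mu_i)/\mu_i$ with weights $\mu_i$ is the standard form for this particular link. At convergence the score equations $\sum (y_i - \mu_i) = 0$ and $\sum x_i (y_i - \mu_i) = 0$ should hold.

```python
import math


def irls_poisson(x, y, tol=1e-10, max_iter=100):
    """Fit log(mu_i) = b0 + b1*x_i for Poisson counts by Fisher scoring."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0     # start: common mean, zero slope
    for _ in range(max_iter):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        # Working response and weights for the log link
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = mu
        # Weighted least squares of z on (1, x): solve the 2x2 normal equations
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        new_b0 = (swxx * swz - swx * swxz) / det
        new_b1 = (sw * swxz - swx * swz) / det
        converged = abs(new_b0 - b0) < tol and abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            return b0, b1
    raise RuntimeError("Fisher scoring did not converge")


x = [0.0, 1.0, 2.0, 3.0, 4.0]       # hypothetical covariate
y = [1.0, 2.0, 4.0, 7.0, 12.0]      # hypothetical counts
b0, b1 = irls_poisson(x, y)
mu_hat = [math.exp(b0 + b1 * xi) for xi in x]
score0 = sum(yi - m for yi, m in zip(y, mu_hat))
score1 = sum(xi * (yi - m) for xi, yi, m in zip(x, y, mu_hat))
```

Because the log link is the canonical link for the Poisson distribution, this is a case where Fisher scoring and Newton-Raphson give the same iterations.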
Bibliography
[1] Aanes, W. A. (1961): Pingue (Hymenoxys richardsonii) poisoning in sheep. American J. of Veterinary Research, 22, 47-52.
[2] Agresti, A. (1984): Analysis of ordered categorical data. New York, Wiley.
[3] Agresti, A. (1990): Categorical data analysis. New York, Wiley.
[4] Agresti, A. (1996): An introduction to categorical data analysis. New York, Wiley.
[5] Aitkin, M. (1987): Modelling variance heterogeneity in normal regression using GLIM. Applied Statistics, 36, 332-339.
[6] Akaike, H. (1973): Information theory and an extension of the maximum likelihood principle. In: Petrov, B. N. and Csàki, F. (eds): Second international symposium on inference theory, Budapest, Akadèmiai Kiadó, pp. 267-281.
[7] Andersen, E. B. (1980): Discrete statistical models with social science applications. Amsterdam, North-Holland.
[8] Anscombe, F. J. (1953): Contribution to the discussion of H. Hotelling's paper. J. Roy. Stat. Soc., B, 15, 229-30.
[9] Armitage, P. and Colton, T. (1998): Encyclopedia of Biostatistics. Chichester, Wiley.
[10] Ben-Akiva, M. and Lerman, S. R. (1985): Discrete choice analysis: Theory and application to travel demand. Cambridge, MIT Press.
[11] Box, G. E. P. and Cox, D. R. (1964): An analysis of transformations. J. Roy. Stat. Soc., A, 143, 383-430.
[12] Breslow, N. R. and Clayton, D. G. (1993): Approximate inference in generalized linear mixed models. JASA, 88, 9-25.
[13] Brown, B. W. (1980): Prediction analysis for binary data. In: Biostatistics Casebook, Eds. R. J. Miller, B. Efron, B. Brown and L. E. Moses. New York, Wiley.
[14] Christensen, R. (1996): Analysis of variance, design and regression. London, Chapman & Hall.
[15] Cicirelli, M. F., Robinson, K. R. and Smith, L. D. (1983): Internal pH of Xenopus oocytes: a study of the mechanism and role of pH changes during meiotic maturation. Developmental Biology, 100, 133-146.
[16] Collett, D. (1991): Modelling binary data. London, Chapman and Hall.
[17] Cox, D. R. (1972): Regression models and life tables. J. Roy. Stat. Soc., B, 34, 187-220.
[18] Cox, D. R. and Lewis, P. A. W. (1966): The statistical analysis of series of events. London, Chapman & Hall.
[19] Cox, D. R. and Snell, E. J. (1989): The analysis of binary data, 2nd ed. London, Chapman and Hall.
[20] Diggle, P. J., Liang, K. Y. and Zeger, S. L. (1994): Analysis of longitudinal data. Oxford, Clarendon Press.
[21] Dobson, A. J. (2002): An introduction to generalized linear models, second edition. London, Chapman & Hall/CRC Press.
[22] Draper, N. R. and Smith, H. (1998): Applied regression analysis, 3rd ed. New York, Wiley.
[23] Ezdinli, E., Pocock, S., Berard, C. W. et al (1976): Comparison of intensive versus moderate chemotherapy of lymphocytic lymphomas: a progress report. Cancer, 38, 1060-1068.
[24] Fahrmeir, L. and Tutz, G. (1994; 2001): Multivariate statistical modeling based on generalized linear models. New York, Springer.
[25] Feigl, P. and Zelen, M. (1965): Estimation of exponential survival probabilities with concomitant information. Biometrics, 21, 826-838.
[26] Finney, D. J. (1947, 1952): Probit analysis. A statistical treatment of the sigmoid response curve. Cambridge, Cambridge University Press.
[27] Freeman, D. H. (1987): Applied categorical data analysis. New York, Marcel Dekker.
[28] Gill, J. and Laughton, C. D. (2000): Generalized linear models: a unified approach. New York, Sage Publications.
[29] Francis, B., Green, M. and Payne, C. (Eds.) (1993): The GLIM system manual, Release 4. London, Clarendon Press.
[30] Haberman, S. (1978): Analysis of qualitative data. Vol. 1: Introductory topics. New York, Academic Press.
[31] Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J. and Ostrowski, E. (1994): A handbook of small data sets. London, Chapman & Hall.
[32] Horowitz, J. (1982): An evaluation of the usefulness of two standard goodness-of-fit indicators for comparing non-nested random utility models. Trans. Research Record, 874, 19-25.
[33] Hosmer, D. W. and Lemeshow, S. (1989): Applied logistic regression. New York, Wiley.
[34] Hurn, M. W., Barker, N. W. and Magath, T. D. (1945): The determination of prothrombin time following the administration of dicumarol with specific reference to thromboplastin. J. Lab. Clin. Med., 30, 432-447.
[35] Hutcheson, G. D. (1999): Introductory statistics using Generalized Linear Models. New York, Sage Publications.
[36] Jones, B. and Kenward, M. G.: Design and analysis of cross-over trials. London, Chapman and Hall.
[37] Jørgensen, B. (1987): Exponential dispersion models. Journal of the Royal Statistical Society, B49, 127-162.
[38] Klein, J. and Moeschberger, M. (1997): Survival analysis: techniques for censored and truncated data. New York, Springer.
[39] Koch, G. G. and Edwards, S. (1988): Clinical efficacy trials with categorical data. In: Biopharmaceutical statistics for drug development, K. E. Peace, ed. New York, Marcel Dekker, pp. 403-451.
[40] Leemis, L. M. (1986): Relationships among common univariate distributions. American Statistician, 40, 134-146.
[41] Liang, K-Y. and Zeger, S. L. (1986): Longitudinal data analysis using generalized linear models. Biometrika, 73, 13-22.
[42] Liang, K-Y. and Hanfelt, J. (1994): On the use of the quasi-likelihood method in teratological experiments. Biometrics, 50, 872-880.
[43] Lindahl, B., Stenlid, J., Olsson, S. and Finlay, R. (1999): Translocation of 32P between interacting mycelia of a wood-decomposing fungus and ectomycorrhizal fungi in microcosm systems. New Phytol., 144, 183-193.
[44] Lindsey, J. K. (1997): Applying generalized linear models. New York, Springer.
[45] Lipsitz, S. H., Fitzmaurice, G. M., Orav, E. J. and Laird, N. M. (1994): Performance of generalized estimating equations in practical situations. Biometrics, 50, 270-278.
[46] Liss, P., Nygren, A., Olsson, U., Ulfendahl, H. R. and Eriksson, U. (1996): Effects of contrast media and Mannitol on renal medullary blood flow and renal blood cell aggregation in the rat kidney. Kidney International, 49, 1268-1275.
[47] Littell, R. C., Milliken, G. A., Stroup, W. W. and Wolfinger, R. D. (1996): SAS system for mixed models. Cary, NC, SAS Institute Inc.
[48] McCullagh, P. (1983): Quasi-likelihood functions. Annals of Statistics, 11, 59-67.
[49] McCullagh, P. and Nelder, J. A. (1989): Generalized Linear Models. London, Chapman and Hall.
[50] Minitab Inc. (1998): Minitab User's Guide, Release 12. State College, Minitab Inc.
[51] Montgomery, D. C. (1984): Design and analysis of experiments. New York, Wiley.
[52] Nagelkerke, N. J. D. (1991): A note on a general definition of the coefficient of determination. Biometrika, 78, 691-692.
[53] Nelder, J. (1971): Discussion on papers by Wynn, Bloomfield, O'Neill and Wetherill. JRSS (B), 33, 244-246.
[54] Neveus, T., Läckgren, G., Tuvemo, T., Olsson, U. and Stenberg, A. (1999): Desmopressin-resistant Enuresis: Pathogenetic and Therapeutic Considerations. Journal of Urology, 162, 2136.
[55] Ninkovic, V., Olsson, U. and Pettersson, J. (2002): Mixing barley cultivars affect aphid host plant acceptance in field experiments. Entomologia Experimentalis et Applicata, in press.
[56] Norton, P. G. and Dunn, E. V. (1985): Snoring as a risk factor for disease. British Medical Journal, 291, 630-632.
[57] Olsson, U. (2000): Estimation of the number of drug addicts in Sweden - an application of capture-recapture methodology. Swedish University of Agricultural Sciences, Department of Statistics, Report 55.
[58] Olsson, U. and Neveus, T. (2000): Generalized Linear Mixed Models used for Evaluating Enuresis Therapy. Swedish University of Agricultural Sciences, Department of Statistics, Report 54.
[59] Rea, T. M., Nash, J. F., Zabik, J. E., Born, G. S. and Kessler, W. V. (1984): Effects of Toluene inhalation on brain biogenic amines in the rat. Toxicology, 31, 143-150.
[60] Rosenberg, L., Palmer, J. R., Kelly, J. P., Kaufman, D. W. and Shapiro, S. (1988): Coffee drinking and nonfatal myocardial infarction in men under 55 years of age. Am. J. Epidemiol., 128, 570-578.
[61] Samuels, M. and Witmer, J. A. (1999): Statistics for the life sciences. Upper Saddle River, NJ, Prentice-Hall.
[62] SAS Institute Inc. (1997): SAS/Stat software: Changes and enhancements through release 6.12. Cary, NC, SAS Institute Inc.
[63] SAS Institute Inc. (2000a): JMP software, version 4. Cary, NC, SAS Institute Inc.
[64] SAS Institute Inc. (2000b): SAS/Stat user's guide, Version 8. Cary, NC, SAS Institute Inc.
[65] Searle, S. R. (1982): Matrix algebra useful for statistics. New York, Wiley.
[66] Sen, A. and Srivastava, M. (1990): Regression analysis. Theory, methods and applications. New York, Springer.
[67] Snedecor, G. W. and Cochran, W. G. (1980): Statistical methods, 7th ed. Ames, Iowa, The Iowa State University Press.
[68] Socialdepartementet: Tungt narkotikamissbruk - en totalundersökning 1979. Rapport från utredningen om narkotikamissbrukets omfattning (UNO). Stockholm, Socialdepartementet (Ds S 1980:5). (Heavy drug use - a comprehensive survey; in Swedish).
[69] Sokal, R. R. and Rohlf, F. J. (1973): Introduction to biostatistics. San Francisco, Freeman.
[70] Student (W. S. Gosset) (1907): On error of counting with an haemocytometer. Biometrika, 5, 351-360.
[71] Wedderburn, R. W. M. (1974): Quasi-likelihood function, generalized linear models and the Gauss-Newton method. Biometrika, 61, 439-477.
[72] Williams, D. A. (1982): Extra-binomial variation in linear logistic models. Applied Statistics, 31, 144-148.
[73] Wolfinger, R. and O'Connell, M. (1993): Generalized linear models: a pseudo-likelihood approach. J. Statist. Comput. Simul., 48, 233-243.
[74] Zagal, E., Bjarnason, S. and Olsson, U. (1993): Carbon and nitrogen in the root-zone of Barley supplied with nitrogen fertilizer at two rates. Plant and Soil, 157, 51-63.
Solutions to the exercises
Exercise 1.1
A. The model can be written as $y_i = \alpha + \beta t_i + e_i$, $e_i \sim N(0; \sigma^2)$. A regression analysis, using the GLM procedure of the SAS package, gives the following results:

Parameter        Estimate     Standard Error    t Value    Pr > |t|
Intercept    -.0475000000         0.05719172      -0.83      0.4441
time         0.0292500000         0.00158621      18.44      <.0001

B. The estimated regression equation is $\hat{y} = -0.0475 + 0.02925t$. A plot of the data and the regression line is as follows.
C. The Anova table is

Dependent Variable: leucine

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1        2.39557500     2.39557500     340.04    <.0001
Error               5        0.03522500     0.00704500
Corrected Total     6        2.43080000
It can be concluded that the leucine level increases with time. The increase is nearly linear in the studied time range.
Exercise 1.2
A. This is a one-way Anova model: $y_{ij} = \mu + \alpha_i + e_{ij}$; $e_{ij} \sim N(0; \sigma^2)$. We wish to test the null hypothesis that there are no group differences, i.e. $H_0: \sum n_i \alpha_i^2 = 0$. The Anova table produced by Proc GLM is as follows:

Dependent Variable: cortisol

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        795.692190     397.846095       4.44    0.0271
Error              18       1614.017333      89.667630
Corrected Total    20       2409.709524

R-Square    Coeff Var    Root MSE    cortisol Mean
0.330203     100.3306    9.469299         9.438095

Source    DF    Type III SS    Mean Square    F Value    Pr > F
group      2    795.6921905    397.8460952       4.44    0.0271
The results suggest that there are significant differences between the groups ($p = 0.0271$). To study these differences we prepare a table of mean values, and a box plot:

The GLM Procedure

Level of           -----------cortisol----------
group       N            Mean          Std Dev
a           6       2.9666667        0.9244818
b          10       8.1800000        3.7891072
c           5      19.7200000       19.2388149
B. The sample standard deviations are rather different in the different groups. Since the model assumes that the population variances are equal, the analysis presented above may not be the optimal one. The box plot suggests that one or two observations may be outliers.
Exercise 1.3
A. We want to compare two competing models:
i) $y_{ijk} = \mu + \alpha_i + \beta x_j + e_{ijk}$. Equal slopes (no interaction).
ii) $y_{ijk} = \mu + \alpha_i + \beta x_j + (\alpha\beta)_{ij} x_j + e_{ijk}$. Different slopes (interaction exists).
The corresponding GLM outputs are presented below.

Model i)
Dependent Variable: co2

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2       2350.424299    1175.212150      44.93    <.0001
Error              21        549.234956      26.154046
Corrected Total    23       2899.659255

R-Square    Coeff Var    Root MSE    co2 Mean
0.810586     19.53056    5.114103    26.18513

Source    DF    Type III SS    Mean Square    F Value    Pr > F
level      1     264.809910     264.809910      10.13    0.0045
days       1    2085.614389    2085.614389      79.74    <.0001

Model ii)
Dependent Variable: co2

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3       2445.093241     815.031080      35.86    <.0001
Error              20        454.566014      22.728301
Corrected Total    23       2899.659255

R-Square    Coeff Var    Root MSE    co2 Mean
0.843235     18.20660    4.767421    26.18513

Source        DF    Type III SS    Mean Square    F Value    Pr > F
level          1      47.785031      47.785031       2.10    0.1626
days           1    2085.614389    2085.614389      91.76    <.0001
days*level     1      94.668942      94.668942       4.17    0.0547
The test of parallelism is not significant ($p = 0.0547$). Still, for model building purposes, I would prefer to retain the interaction term in the model; see the discussion on model building strategy in the text. Thus, I would use Model ii) for interpretation and plotting.
B. Estimates of model parameters for model ii) are as follows:

Parameter               Estimate        Standard Error    t Value    Pr > |t|
Intercept           -38.11803843 B          8.34443339      -4.57      0.0002
level       High     17.11095713 B         11.80081087       1.45      0.1626
level       Low       0.00000000 B           .                .         .
days                  2.12991722 B          0.25921766       8.22      <.0001
days*level  High     -0.74816925 B          0.36658913      -2.04      0.0547
days*level  Low       0.00000000 B           .                .         .
Thus, the predicted value for High nitrogen level is −38.1180+17.1110+ 2.1299 · 35 − 0.7482 · 35 = 27.353. The predicted value for Low level is similarly −38.1180 + 2.1299 · 35 = 36.429. C. A graph of the model that does not assume parallel regression lines is:
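The two predictions can be reproduced from the parameter estimates of model ii). The sketch below uses the full-precision estimates from the SAS output and evaluates the model at 35 days; the dictionary keys and the function name are ours. With unrounded coefficients the value for the High level comes out as about 27.354 rather than 27.353, the small difference being due to rounding of the coefficients in the hand calculation.

```python
# Parameter estimates for model ii), copied from the SAS output above
coef = {
    "intercept": -38.11803843,
    "level_high": 17.11095713,     # Low is the reference level (estimate 0)
    "days": 2.12991722,
    "days_x_high": -0.74816925,    # change in slope for the High level
}


def predict_co2(days, high):
    """Predicted CO2 emission for a given day and nitrogen level."""
    value = coef["intercept"] + coef["days"] * days
    if high:
        value += coef["level_high"] + coef["days_x_high"] * days
    return value


pred_high = predict_co2(35, high=True)
pred_low = predict_co2(35, high=False)
```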
D. There are strongly significant effects of time and of nitrogen level. The interaction, although not formally significant, indicates that the increase of CO2 emission may be somewhat faster for the low nitrogen treatment. Exercise 1.4 A. After taking logs of the count variable, a regression output is as follows:

Dependent Variable: logcount

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1        5.69913226     5.69913226     667.04    0.0001
Error               3        0.02563161     0.00854387
Corrected Total     4        5.72476387

R-Square    Coeff Var    Root MSE    logcount Mean
0.995523     2.286363    0.092433         4.042798

Source     DF    Type III SS    Mean Square    F Value    Pr > F
minutes     1     5.69913226     5.69913226     667.04    0.0001

Parameter        Estimate     Standard Error    t Value    Pr > |t|
Intercept     5.552649942         0.07159833      77.55      <.0001
minutes      -0.050328398         0.00194866     -25.83      0.0001
B. If we assume that there is a multiplicative residual in the original model, we get: $y = Ae^{-Bx} \cdot \epsilon$. This gives, after taking logs, $\log(y) = \log(A) - Bx + e$ (where $e = \log\epsilon$), which is a linear model. There is a strong relationship between log(count) and time.
C. We prefer to make the graph on the original scale. Thus, we calculate predicted values and take the anti-logs of these. The corresponding graph is:
Exercise 2.1
We write the density as $f(x) = \lambda e^{-\lambda x} = e^{\log\lambda - \lambda x}$, which is an exponential family distribution. If we use $\theta = -\lambda$, then $b(\theta) = \log(-\theta)$, $a(\phi) = 1$, and $c(y, \phi) = 0$. For the variance function, we find that $b' = \frac{d}{d\theta}\left(\log(-\theta)\right) = \frac{1}{\theta}$ and $b'' = \frac{d}{d\theta}\left(\frac{1}{\theta}\right) = -\frac{1}{\theta^2}$.
Exercise 2.2
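A. We write the distribution as
$$\frac{e^{-\lambda}\lambda^{y_i}}{\left(1 - e^{-\lambda}\right) y_i!} = \frac{e^{-\lambda} e^{y_i \ln\lambda}}{e^{\ln\left(1 - e^{-\lambda}\right)} e^{\ln(y_i!)}} = e^{\left[-\lambda + y_i \ln\lambda - \ln\left(1 - e^{-\lambda}\right) - \ln(y_i!)\right]}$$
which is an Exponential family with $\theta = \ln\lambda$, $a(\phi) = 1$, $b(\theta) = -\lambda - \ln\left(1 - e^{-\lambda}\right)$ and $c(y, \phi) = -\ln(y_i!)$.
B. If we insert $\lambda = e^{\theta}$ into the expression for $b(\cdot)$, we get $b(\theta) = -\exp(\theta) - \ln\left(1 - e^{-e^{\theta}}\right)$. The derivatives of $b$ with respect to $\theta$ are:
$$b' = \frac{d}{d\theta}\left(-\exp(\theta) - \ln\left(1 - e^{-e^{\theta}}\right)\right) = \frac{e^{\theta}}{-1 + e^{-e^{\theta}}}$$
$$b'' = \frac{d}{d\theta}\left(\frac{e^{\theta}}{-1 + e^{-e^{\theta}}}\right) = \frac{-e^{\theta} + e^{\theta - e^{\theta}} + e^{2\theta - e^{\theta}}}{\left(-1 + e^{-e^{\theta}}\right)^2}$$
which is the required variance function.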
Exercise 2.3
It is not easy to find a well-fitting model for these data. One of the best models is probably the one with a Gamma distribution and an inverse link, but other models might also be considered. However, most models we have tried do not indicate any significant relation between weight and survival:

Criterion               DF       Value    Value/DF
Deviance                11      2.5315      0.2301
Scaled Deviance         11     13.4077      1.2189
Pearson Chi-Square      11      4.3154      0.3923
Scaled Pearson X2       11     22.8557      2.0778
Log Likelihood                -55.0956

Analysis Of Parameter Estimates

                               Standard       Wald 95%             Chi-
Parameter    DF    Estimate       Error    Confidence Limits     Square    Pr > ChiSq
Intercept     1      0.0178      0.0239    -0.0291     0.0647      0.56        0.4560
weight        1      0.0001      0.0004    -0.0006     0.0008      0.07        0.7878
Scale         1      5.2964      2.0154     2.5123    11.1655
A graph of the data and the fitted line may explain why: one observation has an unusually long survival time. Since we have no other information about the data, deletion of this observation cannot be justified.
Exercise 3.1
A. Data, predicted values and residuals are:

Obs    time    leucine       pred        res
  1       0       0.02    -0.0475     0.0675
  2      10       0.25     0.2450     0.0050
  3      20       0.54     0.5375     0.0025
  4      30       0.69     0.8300    -0.1400
  5      40       1.07     1.1225    -0.0525
  6      50       1.50     1.4150     0.0850
  7      60       1.74     1.7075     0.0325
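The predicted values and residuals in the table can be reproduced from the data with the usual least squares formulas for a simple linear regression. The sketch below uses plain Python rather than SAS; the variable names are ours.

```python
time = [0, 10, 20, 30, 40, 50, 60]
leucine = [0.02, 0.25, 0.54, 0.69, 1.07, 1.50, 1.74]

n = len(time)
t_bar = sum(time) / n
y_bar = sum(leucine) / n

# Sums of squares and cross products around the means
s_tt = sum((t - t_bar) ** 2 for t in time)
s_ty = sum((t - t_bar) * (y - y_bar) for t, y in zip(time, leucine))

slope = s_ty / s_tt                    # estimate of beta
intercept = y_bar - slope * t_bar      # estimate of alpha

pred = [intercept + slope * t for t in time]
res = [y - p for y, p in zip(leucine, pred)]
```

The estimates agree with the SAS output in the solution to Exercise 1.1 (intercept $-0.0475$, slope $0.02925$).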
B. A plot of residuals against fitted values indicates no serious deviations from homoscedasticity, but this is difficult to see in such a small data set.
C. The Normal probability plot was obtained using Proc Univariate in SAS:
D. The influence diagnostics can be obtained from Proc Reg:

                              Hat Diag       Cov                  -------DFBETAS-------
Obs    Residual    RStudent          H     Ratio      DFFITS     Intercept        time
  1      0.0675      1.1284     0.4643    1.6783      1.0504        1.0504     -0.8740
  2    0.005000      0.0631     0.2857    2.1832      0.0399        0.0391     -0.0282
  3    0.002500      0.0294     0.1786    1.9014      0.0137        0.0119     -0.0061
  4     -0.1400     -2.7205     0.1429    0.2244     -1.1106       -0.6161      0.0000
  5     -0.0525     -0.6490     0.1786    1.5570     -0.3026       -0.0375     -0.1353
  6      0.0850      1.2694     0.2857    1.1116      0.8028       -0.1574      0.5677
  7      0.0325      0.4870     0.4643    2.5993      0.4534       -0.1744      0.3772
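The "Hat Diag H" column above can be reproduced by hand. For a simple linear regression the leverage of observation $i$ is $h_i = 1/n + (t_i - \bar{t})^2 / \sum_j (t_j - \bar{t})^2$, a standard result that is stated here as background rather than taken from the SAS output. The sketch also applies the $2p/n$ rule of thumb used in this solution.

```python
time = [0, 10, 20, 30, 40, 50, 60]

n = len(time)
p = 2                                   # parameters: intercept and slope
t_bar = sum(time) / n
s_tt = sum((t - t_bar) ** 2 for t in time)

# Diagonal elements of the hat matrix for simple linear regression
hat = [1 / n + (t - t_bar) ** 2 / s_tt for t in time]

limit = 2 * p / n                       # rule-of-thumb limit, about 0.571
influential = [i + 1 for i, h in enumerate(hat) if h > limit]
```

The leverages sum to $p = 2$, so their average is $2/7$, and the list `influential` is empty for these data.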
Since there are n = 7 observations and p = 2 parameters the average leverage is 2/7 = 0.286. The rule of thumb that an observation is influential if h > 2 · p/n would suggest that observations with h > 0.571 are influential. None of the observations have a “Hat Diag” value above this limit. Exercise 3.2 A. and D. Data, predicted values, and leverage values (diagonal elements from the Hat matrix) are as follows:

Obs    LEVEL    days       Co2       pred         res        hat
  1    H          24     8.220    12.1549     -3.9349    0.26090
  2    H          24    12.594    12.1549      0.4391    0.26090
  3    H          24    11.301    12.1549     -0.8539    0.26090
  4    L          24    15.255    13.0000      2.2550    0.26090
  5    L          24    11.069    13.0000     -1.9310    0.26090
  6    L          24    10.481    13.0000     -2.5190    0.26090
  7    H          30    19.296    20.4454     -1.1494    0.09239
  8    H          30    31.115    20.4454     10.6696    0.09239
  9    H          30    18.891    20.4454     -1.5544    0.09239
 10    L          30    28.200    25.7795      2.4205    0.09239
 11    L          30    26.765    25.7795      0.9855    0.09239
 12    L          30    28.414    25.7795      2.6345    0.09239
 13    H          35    25.479    27.3541     -1.8751    0.11456
 14    H          35    34.951    27.3541      7.5969    0.11456
 15    H          35    20.688    27.3541     -6.6661    0.11456
 16    L          35    32.862    36.4291     -3.5671    0.11456
 17    L          35    34.730    36.4291     -1.6991    0.11456
 18    L          35    35.830    36.4291     -0.5991    0.11456
 19    H          38    31.186    31.4993     -0.3133    0.19882
 20    H          38    39.237    31.4993      7.7377    0.19882
 21    H          38    21.403    31.4993    -10.0963    0.19882
 22    L          38    41.677    42.8188     -1.1418    0.19882
 23    L          38    43.448    42.8188      0.6292    0.19882
 24    L          38    45.351    42.8188      2.5322    0.19882
B. The plot of residuals against fitted values shows no large differences in variance:
C. The Normal probability plot has a slight “bend”:
The limit for influential observations is $2 \cdot p/n = 2 \cdot 4/24 = 0.333$. The Hat values of all observations are below this limit.
Exercise 3.3
A. Predicted values and deviance residuals are as follows:

Obs    weight    survival       pred         res
  1        46          44    44.4434    -0.01001
  2        55          27    42.7112    -0.42609
  3        61          24    41.6295    -0.50452
  4        75          24    39.3067    -0.45590
  5        64          36    41.1089    -0.12984
  6        75          36    39.3067    -0.08661
  7        71          44    39.9435     0.09831
  8        59          44    41.9839     0.04727
  9        64         120    41.1089     1.30216
 10        67          29    40.6013    -0.31864
 11        60          36    41.8060    -0.14589
 12        63          36    41.2810    -0.13383
 13        66          36    40.7691    -0.12188
B. In the plot of residuals against fitted values, one observation stands out as a possible outlier:
C. The long-living sheep is an outlier in the Normal probability plot as well:
D. The influence of each observation can be obtained via the Insight procedure. A plot of hat diagonal values against observation number is as follows:
The "influence limit" is $2 \cdot p/n = 2 \cdot 2/13 = 0.308$. The first observation is influential according to this criterion.
Exercise 4.1
An analysis on the transformed data using a two-factor model with interaction gives the following edited output:

Source                DF    Sum of Squares    Mean Square    F Value    Pr > F
treatment              3       20.41428935     6.80476312      28.34    <.0001
poison                 2       34.87711982    17.43855991      72.63    <.0001
treatment*poison       6        1.57077226     0.26179538       1.09    0.3867
Error                 36        8.64308307     0.24008564
Corrected Total       47       65.50526450

R-Square    Coeff Var    Root MSE      z Mean
0.868055     18.68478    0.489985    2.622376
The effects of treatment and of poison are highly significant; there is no significant interaction. The residual plots ($\hat{e}$ against $\hat{y}$; Normal probability plot) seem to indicate that the data agree fairly well with the assumptions:
The same model analyzed as a generalized linear model with a Gamma distribution gives the following results:

Criterion               DF      Value    Value/DF
Deviance                36     1.9205      0.0533
Scaled Deviance         36    48.3179      1.3422
Pearson Chi-Square      36     1.8755      0.0521
Scaled Pearson X2       36    47.1866      1.3107
Log Likelihood                50.0573

LR Statistics For Type 3 Analysis

                              Chi-
Source                DF    Square    Pr > ChiSq
treatment              3     43.76        <.0001
poison                 2     59.31        <.0001
treatment*poison       6     10.04        0.1232
The conclusions are the same: significant effects of treatment and poison, no significant interaction. The residual plots for this model are:
The distribution of the deviance residuals is close to normal, but the Gamma model seems to produce residuals for which the variance increases slightly with increasing $\hat{y}$.
Exercise 4.2
The exponential distribution is a special case of the gamma distribution, with scale parameter equal to 1. Such a model fits these data reasonably well, according to the fit statistics:

Criteria For Assessing Goodness Of Fit

Criterion               DF       Value    Value/DF
Deviance               199    210.6205      1.0584
Scaled Deviance        199    210.6205      1.0584
Pearson Chi-Square     199    218.3507      1.0972
Scaled Pearson X2      199    218.3507      1.0972
Log Likelihood                 98.8650
An easy way of judging the fit to an exponential distribution is to ask Proc Univariate to produce an exponential probability plot:
It seems that the data deviate somewhat from an exponential distribution in the upper tail of the distribution.
Exercise 5.1
One question with these data is whether we should include the factors Exposure, Temperature and Humidity as "class" variables or as numeric variables. One approach is to compare deviances for the different approaches, for a main effects model:

Types of terms             Deviance    df      D/df
All "class"                  30.865    86    0.3582
Temperature numeric         31.2108    87    0.3587
Humidity also numeric       32.1509    89    0.3612
Also Exposure numeric       55.0698    91    0.6052
Treating temperature as numeric costs $31.2108 - 30.865 = 0.3458$ on 1 df, clearly a nonsignificant loss. Similarly, adding Humidity as a numeric factor gives $32.1509 - 31.2108 = 0.9401$ on 2 df, which is clearly nonsignificant. On the other hand, when we treat Exposure as numeric and linear, we lose $55.0698 - 32.1509 = 22.919$ on 2 df, so this approximation is not worthwhile. We could use a quadratic term for exposure, but we might as well keep it as a class variable.
Many models for these data that include interactions lead to a Hessian matrix that is not positive definite. However, when some factors are included as numeric variables, most interactions can indeed be estimated. p-values for two-way interactions are Exposure*Humidity ($p = 0.9834$); Species*exposure ($p = 0.9676$); Temp*exposure ($p = 0.3279$); Temp*humidity ($p = 0.6625$); Temp*species ($p = 0.9091$); and Humidity*species ($p = 0.3006$). There does not seem to be any need to include interactions.
It is interesting to note that an "old-fashioned" Anova on $\arcsin\left(\sqrt{\hat{p}}\right)$ suggests that the interactions species*exposure, temp*exposure and humidity*exposure are indeed significant. The generalized linear model approach may suffer from the fact that 41 of the 96 observations have $\hat{p} = 0$. The model with only main effects, and with humidity and temperature used as numeric variables, gives the following results:

Criteria For Assessing Goodness Of Fit

Criterion               DF        Value    Value/DF
Deviance                89      32.1509      0.3612
Scaled Deviance         89      32.1509      0.3612
Pearson Chi-Square      89      27.9761      0.3143
Scaled Pearson X2       89      27.9761      0.3143
Log Likelihood                -534.9172
Analysis Of Parameter Estimates

                                  Standard        Wald 95%
Parameter        DF   Estimate       Error    Confidence Limits    ChiSquare   Pr > ChiSq
Intercept         1     5.6733      0.9636     3.7847     7.5618       34.67       <.0001
Species    A      1    -1.2871      0.1616    -1.6038    -0.9703       63.44       <.0001
Species    B      0     0.0000      0.0000     0.0000     0.0000         .          .
Exposure   1      1   -26.6441    32311.10   -63355.2   63301.95        0.00       0.9993
Exposure   2      1    -3.1793      0.2988    -3.7649    -2.5937      113.24       <.0001
Exposure   3      1    -0.9434      0.1633    -1.2635    -0.6232       33.35       <.0001
Exposure   4      0     0.0000      0.0000     0.0000     0.0000         .          .
Humidity          1    -0.1054      0.0138    -0.1324    -0.0784       58.56       <.0001
Temp              1     0.0930      0.0192     0.0555     0.1305       23.59       <.0001
Scale             0     1.0000      0.0000     1.0000     1.0000

LR Statistics For Type 3 Analysis

Source       DF   ChiSquare   Pr > ChiSq
Species       1       69.34       <.0001
Exposure      3      385.00       <.0001
Humidity      1       63.52       <.0001
Temp          1       24.38       <.0001
The survival is highly related to all four factors. Residual plots for this model are as follows:
Leverage diagnostics, in terms of diagonal elements of the Hat matrix, can be obtained e.g. from the Insight procedure but are not listed here in order to save space.

Exercise 5.2 The inferential aspects of this exercise are interesting: to which population could we generalize the results? However, we set this question aside. One question in this data set is how to model Age. The relation between age and survival can be explored by plotting the proportion surviving against sex and age (in 10-year intervals). The resulting plot is as follows:
It seems that survival probability for women is high, and increases with age, whereas only the young boys were rescued (“women and children first”). One modelling possibility is to use a dummy variable that separates children under 10 from older passengers, and to use a linear age relation for ages above 10 years. If this dummy variable is d (with d = 1 for persons aged 10 or more, as the interpretation of the d terms below implies), a model for these data can be written as

$$\operatorname{logit}(\hat p) = \beta_0 + \beta_1 \cdot \text{sex} + \beta_2 \cdot \text{pclass} + d\,(\beta_3 + \beta_4 \cdot \text{age} + \beta_5 \cdot \text{sex} + \beta_6 \cdot \text{age} \cdot \text{sex} + \beta_7 \cdot \text{pclass} \cdot \text{sex}).$$

This model assumes a separate survival probability for boys and girls below 10, and a linear change in survival probability (different for males and females) for persons above 10 years. This model fits the data fairly well, as judged by Deviance/df:

Criteria For Assessing Goodness Of Fit

Criterion              DF       Value    Value/DF
Deviance              744    619.9224      0.8332
Scaled Deviance       744    619.9224      0.8332
Pearson Chi-Square    744    732.8209      0.9850
Scaled Pearson X2     744    732.8209      0.9850
Log Likelihood              -309.9612

LR Statistics For Type 3 Analysis

Source           DF   ChiSquare   Pr > ChiSq
Pclass            2       20.00       <.0001
Sex               1        0.50       0.4777
d                 1        3.27       0.0705
d*Age             1        4.94       0.0262
d*Sex             1        7.72       0.0055
d*Age*Sex         1        2.47       0.1158
d*Pclass*Sex      4       33.19       <.0001
As an interpretation of the parameter estimates: there is a highly significant effect of passenger class, as well as an interaction between class and sex for persons above 10 years. Sex (which actually should be interpreted as sex of a child) is not significant: young boys and girls have similar survival probabilities. The fact that d*Sex is significant means that there are differences in survival for males and females above 10 years.

In this analysis, passengers with missing age data have been excluded. However, there seems to be a relation between missing age and passenger class: age data are missing for 30% of first class passengers, 24% of second class passengers but 55% of third class passengers, so the analysis should be interpreted with care.

Exercise 5.3 A binomial model with treatment, ln(dose) and their interaction as factors produces the following results:

Criteria For Assessing Goodness Of Fit

Criterion              DF       Value    Value/DF
Deviance               11     22.7228      2.0657
Scaled Deviance        11     22.7228      2.0657
Pearson Chi-Square     11     20.5940      1.8722
Scaled Pearson X2      11     20.5940      1.8722
Log Likelihood              -368.7886

LR Statistics For Type 3 Analysis

Source               DF   ChiSquare   Pr > ChiSq
treatment             2        3.66       0.1601
lnDose                1      287.20       <.0001
lnDose*treatment      2        9.25       0.0098
This analysis suggests that there is a significant interaction between treatment and ln(dose), i.e. that the slopes may be different. However, the fit of the model is not perfect: Deviance/df = 2.07. If we fit the same model, but this time allowing the program to estimate the scale parameter, we get:

LR Statistics For Type 3 Analysis

Source               Num DF   Den DF   F Value   Pr > F   ChiSquare   Pr > ChiSq
treatment                 2       11      0.89   0.4395        1.77       0.4120
lnDose                    1       11    139.03   <.0001      139.03       <.0001
lnDose*treatment          2       11      2.24   0.1529        4.48       0.1066
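The chi-square and F columns of the scaled analysis can be checked by hand from the unscaled one. The sketch below assumes that the scale parameter is estimated as deviance/df (this reproduces the printed values; the Pearson-based estimate does not), that each likelihood ratio chi-square is divided by this estimate, and that the F statistic is the scaled chi-square divided by its numerator degrees of freedom. All variable and function names are illustrative and do not come from the SAS output.

```python
# Likelihood ratio chi-squares and degrees of freedom from the unscaled fit.
lr_chisq = {
    "treatment": (3.66, 2),
    "lnDose": (287.20, 1),
    "lnDose*treatment": (9.25, 2),
}

# Assumed scale estimate: deviance divided by residual degrees of freedom.
deviance, resid_df = 22.7228, 11
phi = deviance / resid_df


def scaled_tests(chisq, num_df, scale):
    """Return (scaled chi-square, F statistic) for one model term."""
    scaled = chisq / scale
    return scaled, scaled / num_df


results = {term: scaled_tests(c, df, phi) for term, (c, df) in lr_chisq.items()}
for term, (scaled, f_value) in results.items():
    print(f"{term:18s} chi2 = {scaled:7.2f}   F = {f_value:7.2f}")
```

The p-values in the F column additionally require the F distribution on (Num DF, 11) degrees of freedom, which is not part of the Python standard library and is therefore left out of the sketch.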
The p-values are rather sensitive to overdispersion. This analysis suggests that the interaction is not significant, i.e. that the slopes may be equal. The observed proportions (on a logit scale) plotted against ln(dose) are:
Exercise 5.4 A. $H_0: \beta_{p*a} = 0$ against $H_1: \beta_{p*a} \neq 0$ can be tested using the deviances. The test statistic is $(D_1 - D_2)/(df_1 - df_2)$ which, under $H_0$, is asymptotically distributed as $\chi^2$ on $(df_1 - df_2)$ degrees of freedom. The condition that model 1 is nested within model 2 is fulfilled. Assumptions: Independent observations; large sample. Result: $(226.5177 - 226.4393)/(8 - 7) = 0.0784$, which should be compared with $\chi^2$ on 1 d.f. The 5% limit is 3.841; the 1% limit is 6.635 and the 0.1% limit is 10.828. Our result is clearly non-significant; $H_0$ cannot be rejected.

B. $H_0: \beta_{p*r} = 0$ against $H_1: \beta_{p*r} \neq 0$ can similarly be tested using the deviances. The test statistic is $(D_1 - D_3)/(df_1 - df_3)$ which, under $H_0$, is asymptotically distributed as $\chi^2$ on $(df_1 - df_3)$ degrees of freedom. The condition that model 1 is nested within model 3 is fulfilled. Assumptions: Independent observations; large sample. Result: $(226.5177 - 216.4759)/(8 - 7) = 10.042$, which should be compared with $\chi^2$ on 1 d.f. The 5% limit is 3.841; the 1% limit is 6.635 and the 0.1% limit is 10.828. Our result is significant at the 1% level but not at the 0.1% level. $H_0$ is rejected.

C. The logit link is $\log\frac{p}{1-p}$. When p is zero (or one) this is not defined. The four cells with observed count = 0 do not contribute to the likelihood. When we replace 0 with 0.5 in these cells they are included, so we get an extra four d.f. compared with model 3.
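The two deviance tests in parts A and B can be checked numerically. The following Python sketch uses only the deviances quoted above and the standard identity that the upper tail of a $\chi^2$ variate on 1 d.f. equals erfc(sqrt(x/2)); the function and variable names are illustrative.

```python
import math


def chi2_sf_1df(x):
    """Upper tail probability P(X > x) for a chi-square variate on 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))


d1 = 226.5177                 # deviance of model 1 (8 d.f.)
other = {"A": 226.4393,       # deviance of model 2 (7 d.f.)
         "B": 216.4759}       # deviance of model 3 (7 d.f.)

# Difference in deviance; both comparisons differ by exactly 1 d.f.
lr = {part: d1 - dev for part, dev in other.items()}
p_value = {part: chi2_sf_1df(stat) for part, stat in lr.items()}

for part in lr:
    print(f"Part {part}: chi2 = {lr[part]:.4f}, p = {p_value[part]:.4f}")
```

The p-value for part B falls between 0.001 and 0.01, in agreement with the comparison against the tabulated limits.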
D. The odds ratios of not being infected are calculated as $e^{\beta}$. The corresponding odds ratios of being infected are the inverses of these. This gives:

Planned: OR = $e^{-0.8311}$ = 0.436; OR of infection = 1/0.436 = 2.294
Antibio: OR = $e^{3.4991}$ = 33.086; OR of infection = 1/33.086 = 0.030
Risk: OR = $e^{-3.7172}$ = 0.024; OR of infection = 1/0.024 = 41.667
Planned*Risk: OR = $e^{2.4394}$ = 11.466; OR of infection = 1/11.466 = 0.087.

In the presence of interactions, raw odds ratios are not very informative. One might consider calculating odds ratios separately for each cell of the 2 × 2 × 2 cross-table. All odds ratios take one cell as the baseline, with OR = 1. We might use the cell Planned = 0, Risk = 0, Antibio = 0 as a baseline. The remaining cell odds ratios (of no infection) compared to this baseline are:

                 Planned = 1             Planned = 0
Antibio      Risk = 1   Risk = 0     Risk = 1   Risk = 0
1              4.02      14.41         0.80      33.09
0              0.12       0.43         0.02       1.00
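The cell odds ratios in the table can be reproduced by adding the relevant estimates on the logit scale and exponentiating. This is a minimal Python sketch that uses the four estimates quoted in part D; the dictionary keys and the function name are illustrative.

```python
import math
from itertools import product

# Estimates quoted in part D (logit scale, odds of NOT being infected).
beta = {
    "planned": -0.8311,
    "antibio": 3.4991,
    "risk": -3.7172,
    "planned*risk": 2.4394,
}


def cell_odds_ratio(planned, risk, antibio):
    """Odds ratio of no infection relative to the cell with all factors 0."""
    eta = (beta["planned"] * planned
           + beta["antibio"] * antibio
           + beta["risk"] * risk
           + beta["planned*risk"] * planned * risk)
    return math.exp(eta)


# One entry per cell, keyed by (planned, risk, antibio).
table = {cell: cell_odds_ratio(*cell) for cell in product((1, 0), repeat=3)}
for (planned, risk, antibio), value in table.items():
    print(f"Planned={planned} Risk={risk} Antibio={antibio}  OR={value:6.2f}")
```

The intercept is not needed, since it cancels when each cell is compared with the baseline cell.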
E. Remember that the observations, in this example, are binary, i.e. y = 1 and y = 0 are the only possible values of y. The first data line has Planned = 1, Antibio = 1, Risk = 1 and Infection = 1. Using the parameter estimates, including the Planned*Risk interaction, we get for this observation $\operatorname{logit}(\mu) = 2.1440 - 0.8311 + 3.4991 - 3.7172 + 2.4394 = 3.5342$. Using the inverse logit transformation $\frac{e^{x}}{1+e^{x}}$ this corresponds to $\hat y = \frac{e^{3.5342}}{1+e^{3.5342}} = 0.9716$, which is the predicted value. The raw residual is $y - \hat y = 1 - 0.9716 = 0.0284$.

For the second observation the predictors have the same values but y = 0, so the raw residual is 0 − 0.9716 = −0.9716. The third and fourth observations have the same predicted values, obtained through $\operatorname{logit}(\mu) = 2.1440 - 0.8311 + 3.4991 = 4.812$, which gives the predicted value $\hat y = \frac{e^{4.812}}{1+e^{4.812}} = 0.9919$ and raw residuals 1 − 0.9919 = 0.0081 and 0 − 0.9919 = −0.9919, respectively.
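The predicted values and raw residuals can be verified with a few lines of Python. The sketch uses the estimates quoted in this solution; note that the linear predictor for a cell with Planned = 1 and Risk = 1 has to include the Planned*Risk estimate 2.4394 from part D in order to give 3.5342. Function and variable names are illustrative.

```python
import math

# Estimates quoted in the solution (logit scale).
INTERCEPT = 2.1440
B_PLANNED = -0.8311
B_ANTIBIO = 3.4991
B_RISK = -3.7172
B_PLANNED_RISK = 2.4394


def inv_logit(x):
    """Inverse of the logit link."""
    return math.exp(x) / (1.0 + math.exp(x))


def predict(planned, antibio, risk):
    """Predicted probability for one combination of the binary predictors."""
    eta = (INTERCEPT + B_PLANNED * planned + B_ANTIBIO * antibio
           + B_RISK * risk + B_PLANNED_RISK * planned * risk)
    return inv_logit(eta)


p_risk = predict(planned=1, antibio=1, risk=1)      # observations 1 and 2
p_no_risk = predict(planned=1, antibio=1, risk=0)   # observations 3 and 4

# Raw residuals y - y_hat for y = 1 and y = 0 in each of the two cells.
for label, p_hat in (("Risk=1", p_risk), ("Risk=0", p_no_risk)):
    for y in (1, 0):
        print(f"{label}, y={y}: fitted={p_hat:.4f}, residual={y - p_hat:+.4f}")
```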
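Note that the counts (Wt) are not the values to predict!

Exercise 5.5 A. The model is $y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \gamma z_{ijk} + (\beta\gamma)_j z_{ijk} + e_{ijk}$, $i = 1, 2$; $j = 1, 2, 3$; $k = 1, 2, 3$. This gives the model in matrix terms as $y = XB + e$, where

$$X = \left(\begin{array}{cccccccccccccccc}
1 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & z_{1} & z_{1} & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & z_{2} & z_{2} & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & z_{3} & z_{3} & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & z_{4} & 0 & z_{4} & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & z_{5} & 0 & z_{5} & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & z_{6} & 0 & z_{6} & 0 \\
1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & z_{7} & 0 & 0 & z_{7} \\
1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & z_{8} & 0 & 0 & z_{8} \\
1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & z_{9} & 0 & 0 & z_{9} \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & z_{10} & z_{10} & 0 & 0 \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & z_{11} & z_{11} & 0 & 0 \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & z_{12} & z_{12} & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & z_{13} & 0 & z_{13} & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & z_{14} & 0 & z_{14} & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & z_{15} & 0 & z_{15} & 0 \\
1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & z_{16} & 0 & 0 & z_{16} \\
1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & z_{17} & 0 & 0 & z_{17} \\
1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & z_{18} & 0 & 0 & z_{18}
\end{array}\right)$$

and

$$B = \left(\mu,\ \alpha_1,\ \alpha_2,\ \beta_1,\ \beta_2,\ \beta_3,\ (\alpha\beta)_{11},\ (\alpha\beta)_{12},\ (\alpha\beta)_{13},\ (\alpha\beta)_{21},\ (\alpha\beta)_{22},\ (\alpha\beta)_{23},\ \gamma,\ (\beta\gamma)_1,\ (\beta\gamma)_2,\ (\beta\gamma)_3\right)^{\prime}.$$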
B. The inverse of the logit link $g(p) = \log\frac{p}{1-p}$ is $g^{-1}(x) = \frac{e^{x}}{e^{x}+1}$.
Exercise 6.1 A model with a Poisson distribution and a log link gives the following model information:

Criteria For Assessing Goodness Of Fit

Criterion              DF       Value    Value/DF
Deviance                3      2.2906      0.7635
Scaled Deviance         3      2.2906      0.7635
Pearson Chi-Square      3      2.2453      0.7484
Scaled Pearson X2       3      2.2453      0.7484
Log Likelihood              1911.7443
The fit of the model to the data is good, as judged by deviance/df. The parameter estimates are as follows:

Analysis Of Parameter Estimates

                           Standard       Wald 95%
Parameter   DF   Estimate     Error   Confidence Limits   ChiSquare   Pr > ChiSq
Intercept    1     5.5713    0.0567    5.4602    5.6825     9650.28       <.0001
exposure     1    -0.0513    0.0030   -0.0572   -0.0455      298.25       <.0001
Scale        0     1.0000    0.0000    1.0000    1.0000
A plot of observed counts along with the fitted function indicates a good fit:
Exercise 6.2 Two Poisson models were fitted to the data: one with a log link, another with an identity link. The model with a log link fitted marginally better, as judged by the deviance/df criterion. First the log link results:
Criteria For Assessing Goodness Of Fit

Criterion              DF       Value    Value/DF
Deviance                6      4.0033      0.6672
Scaled Deviance         6      4.0033      0.6672
Pearson Chi-Square      6      3.9505      0.6584
Scaled Pearson X2       6      3.9505      0.6584
Log Likelihood               362.7354

Analysis Of Parameter Estimates

                           Standard       Wald 95%
Parameter   DF   Estimate     Error   Confidence Limits   ChiSquare   Pr > ChiSq
Intercept    1     2.1752    0.2555    1.6745    2.6759       72.50       <.0001
mode1        1     0.0070    0.0024    0.0023    0.0118        8.34       0.0039
mode2        1     0.0025    0.0028   -0.0030    0.0081        0.81       0.3685
Scale        0     1.0000    0.0000    1.0000    1.0000
The model fit for the Identity link model:

Criteria For Assessing Goodness Of Fit

Criterion              DF       Value    Value/DF
Deviance                6      4.1971      0.6995
Scaled Deviance         6      4.1971      0.6995
Pearson Chi-Square      6      4.1567      0.6928
Scaled Pearson X2       6      4.1567      0.6928
Log Likelihood               362.6385
Both models show a good fit; slightly better for the log link. Plots of residuals vs. fitted values are similar for the two models. The normal probability plot is slightly better for the identity link model:
In all, it is difficult to judge which of the two models is “best”, based on statistical criteria.

Exercise 6.3 Models with a Poisson distribution, a log link and using log(mon_serv) as an offset were fitted to the data. Some of the models were:

Model                     deviance    df
1. Main effects only        38.695    25
1. + type*yr_c              14.587    13
1. + type*per_op            33.756    21
1. + yr_c*per_op            36.908    23
It seems that a model with main effects, plus the type*yr_c interaction, would describe the data well. However, this model produces a near-singular Hessian matrix.

Exercise 6.4 The model analyzed in the text is saturated, which means that the fitted values should agree perfectly with the data. The model for males is $\log(\hat\mu) = \log(t) + \hat\beta_0$, which gives $\log(320) = \log(21.4) + \hat\beta_0$, i.e. $\hat\beta_0 = \log(320) - \log(21.4) = 2.7049$. The model for females is $\log(\hat\mu) = \log(t) + \hat\beta_0 + \hat\beta_1$. We get $\hat\beta_1 = \log(175) - \log(17.3) - 2.7049 = -0.39082$. These results agree with the computer outputs in the text.
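Because the model in Exercise 6.4 is saturated, the two estimates follow directly from the observed counts and the offsets. A minimal Python check, using the counts (320, 175) and exposures t (21.4, 17.3) quoted in that solution; the variable names are illustrative:

```python
import math

# (observed count, exposure t) for each sex, as quoted in the solution.
males = (320, 21.4)
females = (175, 17.3)

# Saturated Poisson model with log link and offset log(t):
#   log(mu) = log(t) + beta0 + beta1 * female
beta0 = math.log(males[0]) - math.log(males[1])
beta1 = math.log(females[0]) - math.log(females[1]) - beta0

# exp(beta1) is the female-to-male ratio of the rates (count per unit of t).
rate_ratio = math.exp(beta1)

print(f"beta0 = {beta0:.4f}, beta1 = {beta1:.4f}, rate ratio = {rate_ratio:.4f}")
```

Small differences in the last decimals relative to the hand calculation arise because the hand calculation uses the rounded value 2.7049.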
Exercise 6.5
A. The LR test of the hypothesis $H_0: \mu_A = \mu_B$ is obtained by comparing the deviances of the two models: $(D_1 - D_2)/(df_1 - df_2) = (27.857 - 16.2676)/(19 - 18) = 11.589$, which is used as an asymptotic $\chi^2$ on 1 d.f. The 5% limit is 3.841; the 1% limit is 6.635 and the 0.1% limit is 10.828. Our observed value is even larger than 10.828; the result is clearly significant and $H_0$ can be rejected at the 0.1% level.

The Wald test of the same hypothesis uses the test statistic

$$z = \frac{\hat\beta - 0}{\mathrm{s.e.}(\hat\beta)} = \frac{0.5878}{0.1764} = 3.3322.$$

This is compared to appropriate limits of a standard normal variate z. Limits: 5%: 1.96; 1%: 2.576; 0.1%: 3.291. Our observed test statistic is (numerically) larger than the 0.1% limit: we reject the null hypothesis.

B. The model is $g(\mu) = \beta_0 + \beta_1 x$, where x is a dummy variable (x = 1 for treatment A). For treatment A the model is $g(\mu_A) = \beta_0 + \beta_1$, and for treatment B the model is $g(\mu_B) = \beta_0$. The link function g is a log function. Thus, it holds that $g(\mu_B) - g(\mu_A) = \beta_0 - (\beta_0 + \beta_1) = -\beta_1$. Therefore $\log(\mu_B) - \log(\mu_A) = -\beta_1$, which means that $\log\left(\frac{\mu_B}{\mu_A}\right) = -\beta_1$. A 95% Wald confidence interval for $\beta_1$ is 0.5878 ± 1.96 · 0.1764, i.e. 0.5878 ± 0.3457; the limits are [0.2421, 0.9335]. Taking antilogs of minus these limits, approximate 95% limits for $\mu_B/\mu_A$ are obtained as $e^{-0.2421} = 0.785$ and $e^{-0.9335} = 0.393$.

Exercise 6.6 A model with a Poisson distribution, a log link, and no offset variable gives a deviance of 15.9371 on 13 df, deviance/df = 1.2259. Inclusion of log(miles) as an offset gives the deviance 16.0602 on 13 df, deviance/df = 1.2354. Inclusion of the offset does not affect the fit very much, possibly because the values for miles are rather similar for the different years.

Exercise 6.7 A. In the full model, the three-way interaction is not significant (p = 0.2639). The model with all main effects and two-way interactions gives deviance 11.1746 on 9 df, deviance/df = 1.2416. All main effects and interactions are highly significant (p < 0.0001).

B. The model with scores 0, 1, 2, 3 for coffee and 0, 1, 2, 3 for cigarettes is not a good model: deviance = 301.08 on 25 df.

C. Of the two models, the model in A. is to be preferred because of a much better fit. Residual plots for this model are as follows:
The Residual vs. Fits plot shows some tendency towards an “inverse trumpet” shape, with a decreasing variance for increasing $\hat y$. The Normal plot is rather straight, with a couple of deviating observations at each end.

Exercise 6.8 A. The test of the hypothesis of no relation between temperature and probability of failure is obtained by calculating the difference in deviance between the null model and the estimated model. These differences can be interpreted as $\chi^2$ variates on 1 d.f., for which the 5% limit is 3.841 and the 1% limit is 6.635.

i) Poisson model: $\chi^2 = 22.434 - 16.8337 = 5.600$; this is significant at the 5% level (p = 0.018).

ii) Binomial model: $\chi^2 = 24.2304 - 18.0863 = 6.1441$; again significant at the 5% level (p = 0.0132).

Both models indicate a significant relationship between failure risk and temperature. Note, however, that the number of observations and, in particular, the number of failures, is so small that the asymptotics may not work.

B. Predicted values at 31°F are

i) Poisson model: $g(\hat\mu) = 5.9691 - 0.1034 \cdot 31 = 2.7637$. Since $g(\cdot)$ is a log link, this gives $\hat\mu = \exp(2.7637) = 15.858$.

ii) Binomial model: $g(\hat p) = 5.0850 - 0.1156 \cdot 31 = 1.5014$. $g(\cdot)$ is a logit link, which has inverse $\frac{e^{x}}{1+e^{x}}$, so $\hat p = \frac{e^{1.5014}}{1+e^{1.5014}} = 0.81778$. With n = 6 O-rings on board, we would expect $n\hat p = 6 \cdot 0.81778 = 4.9$ of them to fail!
C. The Poisson model has the disadvantage that the expected number of failing O-rings is actually larger than the total number on board: we predict 16 failures among 6 O-rings. The Binomial model is more reasonable.
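The two predictions at 31°F, and the reason why the Poisson prediction is implausible, can be reproduced as follows. This is a small Python sketch using the intercepts and slopes quoted in part B; the variable names are illustrative.

```python
import math

temp_f = 31.0
n_rings = 6

# Poisson model with log link: predicted number of failures.
eta_poisson = 5.9691 - 0.1034 * temp_f
mu_poisson = math.exp(eta_poisson)

# Binomial model with logit link: predicted failure probability per O-ring.
eta_binomial = 5.0850 - 0.1156 * temp_f
p_binomial = math.exp(eta_binomial) / (1.0 + math.exp(eta_binomial))
expected_failures = n_rings * p_binomial

print(f"Poisson:  expected failures = {mu_poisson:.3f} (out of {n_rings})")
print(f"Binomial: p = {p_binomial:.5f}, expected failures = {expected_failures:.2f}")
```

The binomial prediction can never exceed the number of O-rings, because the inverse logit is bounded by 1, whereas the log link places no upper bound on the Poisson mean.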
D. Using the Binomial model with n = 6 and p = 0.8178, P(x ≥ 3) = 1 − P(x ≤ 2) = 1 − 0.0121 = 0.9879.

Exercise 6.9
The odds ratios can be calculated as $\exp(\hat\beta_i)$. In the presence of interactions the main effect odds ratios are not very illuminating, so we only consider the interactions. In the table we abbreviate Gender=G; Location=L; Injury=I and Belt use=B. We interpret the parameters by the ordered values in the SAS printout. Since N is (alphabetically) before Y, the odds ratio for, for example, B*I means that persons with “B=No” have “I=No” less often. This, of course, could be stated as “Users of seat belts are injured less often”. For the different interaction effects in the model we get:
Term   OR                      Comment
G*L    exp(−0.2099) = 0.811    Females traveled in rural areas less often than males.
G*B    exp(−0.4599) = 0.631    Females avoided belt use less often than males.
G*I    exp(−0.5405) = 0.582    Females were uninjured less often than males.
L*B    exp(−0.0849) = 0.919    Belts are avoided less often in rural areas.
L*I    exp(−0.7550) = 0.470    Passengers are uninjured less often in rural areas.
B*I    exp(−0.8140) = 0.443    Non-users of belts are uninjured less often.
Exercise 7.1 A generalized linear model with a multinomial distribution and a cumulative logit link gave the following result:

Analysis Of Parameter Estimates

                                 Standard       Wald 95%
Parameter         DF   Estimate     Error   Confidence Limits   ChiSquare   Pr > ChiSq
Intercept1         1    -1.1607    0.1814   -1.5163   -0.8051       40.93       <.0001
Intercept2         1    -0.6222    0.1705   -0.9564   -0.2881       13.32       0.0003
Intercept3         1     1.1782    0.1817    0.8220    1.5344       42.03       <.0001
treatment   BP     1     0.3219    0.2216   -0.1124    0.7563        2.11       0.1462
treatment   CP     0     0.0000    0.0000    0.0000    0.0000         .          .
Scale              0     1.0000    0.0000    1.0000    1.0000
According to this analysis, there is no significant association between treatment and response (p = 0.1462). A standard Chi-square test of independence gives $\chi^2 = 4.6$ on 3 df, p = 0.20.

Exercise 7.2 The standard $\chi^2$ test of independence gives $\chi^2 = 24.1481$ on 4 df, p < 0.0001. An ordinal model for prediction of the attitude towards detection of cancer gives a Type 3 $\chi^2 = 25.86$ on 2 df, p < 0.0001. The parameter estimates for this model are as follows:

Analysis Of Parameter Estimates

                                    Standard       Wald 95%
Parameter            DF   Estimate     Error   Confidence Limits   ChiSquare   Pr > ChiSq
Intercept1            1    -2.7703    0.2500   -3.2602   -2.2803      122.80       <.0001
Intercept2            1    -0.4759    0.1337   -0.7380   -0.2137       12.66       0.0004
mmammo  < 1 year      1    -1.4753    0.3247   -2.1117   -0.8388       20.64       <.0001
mammo   > 1 year      1    -0.4926    0.2928   -1.0664    0.0812        2.83       0.0924
mammo   Never         0     0.0000    0.0000    0.0000    0.0000         .          .
Scale                 0     1.0000    0.0000    1.0000    1.0000
An alternative model is the linear by linear association model. For these data, this model gives:

Analysis Of Parameter Estimates

                                    Standard       Wald 95%
Parameter            DF   Estimate     Error   Confidence Limits   ChiSquare   Pr > ChiSq
Intercept             1     3.6821    0.0707    3.5435    3.8206     2711.93       <.0001
cancer  0             1    -1.0564    0.1036   -1.2595   -0.8533      103.95       <.0001
cancer  1             1     0.0171    0.0411   -0.0633    0.0976        0.17       0.6763
cancer  2             0     0.0000    0.0000    0.0000    0.0000         .          .
mammo   < 1 year      1    -3.0357    0.1375   -3.3052   -2.7662      487.41       <.0001
mammo   > 1 year      1    -2.2606    0.0688   -2.3954   -2.1258     1079.83       <.0001
mammo   Never         0     0.0000    0.0000    0.0000    0.0000         .          .
c*m                   1     0.6437    0.0348    0.5756    0.7119      343.06       <.0001
Scale                 0     1.0000    0.0000    1.0000    1.0000
The c*m association is highly significant. All three analyses suggest a strong relationship between mammography experience and attitude towards cancer detection.

Exercise 8.1 A gamma model with log-transformed wbc values was tried. The wbc*ag interaction was far from significant so it was excluded. The suggested model produces a deviance of 38.2342 on 30 df; deviance/df = 1.2745. The output is:

Analysis Of Parameter Estimates

                           Standard       Wald 95%
Parameter   DF   Estimate     Error   Confidence Limits   ChiSquare   Pr > ChiSq
Intercept    1     0.0057    0.0036   -0.0014    0.0128        2.44       0.1180
ag     0     1     0.0431    0.0174    0.0089    0.0773        6.09       0.0136
ag     1     0     0.0000    0.0000    0.0000    0.0000         .          .
lwbc         1     0.0061    0.0024    0.0014    0.0109        6.37       0.0116
Scale        1     0.9968    0.2160    0.6518    1.5242
A plot of observed survival times for the two groups, along with the survival times predicted by the model, is as follows:
Exercise 8.2 The data were analysed using the macro for variance heterogeneity listed in the Genmod manual. The results for the mean value model were as follows:

Mean model

Obs  Parameter  Level1  DF   Estimate   StdErr    LowerCL   UpperCL   ChiSq   ProbChiSq
  1  Intercept           1    19.7200   7.6955     4.6370   34.8030    6.57      0.0104
  2  group      a        1   -16.7533   7.7032   -31.8514   -1.6553    4.73      0.0296
  3  group      b        1   -11.5400   7.7790   -26.7866    3.7066    2.20      0.1379
  4  group      c        0     0.0000   0.0000     0.0000    0.0000     .         .
  5  Scale               0     1.0000   0.0000     1.0000    1.0000     _         _
There is a significant difference (p = 0.0296) between groups a and c. The results for the variance model are:

Variance model

Obs  Parameter  Level1  DF   Estimate   StdErr    LowerCL   UpperCL   ChiSq   ProbChiSq
  1  Intercept           1     5.6907   0.6325     4.4511    6.9303   80.96      <.0001
  2  group      a        1    -6.0301   0.8563    -7.7085   -4.3517   49.58      <.0001
  3  group      b        1    -3.1318   0.7746    -4.6500   -1.6136   16.35      <.0001
  4  group      c        0     0.0000   0.0000     0.0000    0.0000     .         .
  5  Scale               0     0.5000   0.0000     0.5000    0.5000     _         _
The results indicate significant differences in variance between the groups.

Exercise 8.3 A SAS program for analysis of these data using the Glimmix macro is as follows. Note that the macro itself must be run before this program is submitted.

    %glimmix(
       data=labexp,
       stmts=%str(
          class pot var1 var2;
          model x2/n2 = var1 var2 var1*var2;
          random pot*var1*var2;
       ),
       error=binomial,
       link=logit
    );
    run;
Some of the output is as follows:

Solution for Fixed Effects

                                       Standard
Effect      Var1   Var2    Estimate       Error    DF   t Value   Pr > |t|
Intercept                    1.6849      0.2299    64      7.33     <.0001
Var1        A               -0.1987      0.3172    64     -0.63     0.5332
Var1        F                0.4159      0.3447    64      1.21     0.2320
Var1        H               -0.1333      0.3194    64     -0.42     0.6779
Var1        K                0            .         .       .        .
Var2               A        -0.6850      0.3047    64     -2.25     0.0280
Var2               F         0.1817      0.3324    64      0.55     0.5865
Var2               H        -0.4186      0.3107    64     -1.35     0.1826
Var2               K         0            .         .       .        .
Var1*Var2   A      A         0.9072      0.4402    64      2.06     0.0434
Var1*Var2   A      F        -0.9360      0.4423    64     -2.12     0.0382
Var1*Var2   A      H        -0.3074      0.4266    64     -0.72     0.4737
Var1*Var2   A      K         0            .         .       .        .
Var1*Var2   F      A         0.2112      0.4581    64      0.46     0.6464
Var1*Var2   F      F        -0.5441      0.4800    64     -1.13     0.2612
Var1*Var2   F      H        -0.6812      0.4501    64     -1.51     0.1351
Var1*Var2   F      K         0            .         .       .        .
Var1*Var2   H      A         0.4641      0.4323    64      1.07     0.2870
Var1*Var2   H      F        -0.07013     0.4600    64     -0.15     0.8793
Var1*Var2   H      H         0.4596      0.4426    64      1.04     0.3030
Var1*Var2   H      K         0            .         .       .        .
Var1*Var2   K      A         0            .         .       .        .
Var1*Var2   K      F         0            .         .       .        .
Var1*Var2   K      H         0            .         .       .        .
Var1*Var2   K      K         0            .         .       .        .

Type 3 Tests of Fixed Effects

Effect       Num DF   Den DF   F Value   Pr > F
Var1              3       64      3.22   0.0285
Var2              3       64      4.37   0.0074
Var1*Var2         9       64      3.05   0.0042
There is a significant interaction between varieties, i.e. some combinations of varieties are more palatable than others to the lice. This conclusion may be followed up by comparing least squares mean values for the different combinations.
Index

adjusted deviance residual, 57
adjusted Pearson residual, 57
adjusted R-square, 6
Akaike’s information criterion, 46
analysis of covariance, ix, 21
analysis of variance, ix, 13
analysis of variance table, 5
ANCOVA, 21
ANOVA, 13
ANOVA as GLIM, 71
Anscombe residual, 58
AR(1) structure, 166
arbitrary scores, 146
arcsine transformation, 92
assumptions in general linear models, 24
autoregressive correlation structure, 166

Bernoulli distribution, 87
binomial distribution, 37, 88, 113
Bonferroni adjustment, 16
boxplot, 17

canonical link, 42
canonical parameter, 37
capture-recapture data, 122
censoring, 158
chi-square distribution, 73
chi-square test, 117
class variables, 26
classification variables, 12
coefficient of determination, 5
comparisonwise error rate, 15
complementary log-log link, 40, 86
compound distribution, 101, 134
computer software, 24
conditional independence, 119
conditional odds ratio, 120
confidence interval, 7
constraints, 4
contingency table, 111
contrast, 15
Cook’s distance, 60
correlation structure, 166
count data, 111
covariance analysis, 21
Cramér-Rao inequality, 188
cross-over design, 169
cumulative logits, 148

dependent variable, 2
design matrix, 2, 42
deterministic model, 1
deviance, 45
deviance residual, 57
Dfbeta, 60
dilution assay, 86
dispersion parameter, 39
dummy variable, 12, 14

ED50, 92
empirical estimator (robust estimator, sandwich estimator), 162
estimable functions, 23
exchangeable correlation structure, 166
expected frequencies, 112
experimentwise error rate, 15
exponential dispersion family, 37
exponential distribution, 53, 75
exponential family, 31, 36, 37
extreme value distribution, 87, 151

F test, 6
factorial experiments, 18
Fisher information, 188
Fisher’s scoring, 190
fitted value, 4
fixed effects, 169
frequency table, 111
full model, 45

gamma distribution, 73
gamma function, 73
GEE, 165
general linear model, ix, 1, 2
Generalized estimating equations, 165
generalized inverse, 4
generalized linear model, ix, 36
geometric distribution, 134
Glimmix, 169
Gumbel distribution, 87

hat matrix, 56
hat notation, 3
hazard function, 159
Hessian matrix, 189
homogenous association, 119
homoscedasticity, 24

identity link, 40
independent variable, 2
influential observations, 55, 59
interaction, 18, 112
intercept, 3
intrinsically nonlinear models, 23
iteratively reweighted least squares, 44, 190

Kaplan-Meier estimates, 159

latent variable, 151
latin square design, 129
least squares, 3
leverage, 59
likelihood function, 187
likelihood ratio test, 48
likelihood residual, 58
linear by linear association model, 146
linear predictor, 36, 42
linear regression, ix
linear regression as GLIM, 69
link function, 36, 40
LL model, 146
log likelihood, 187
log link, 40
log-linear model, ix, 112
logistic distribution, 86, 151
logistic regression, 91, 121
logit link, 40, 86
logit regression, 91

m-dependent correlation structure, 166
marginal odds ratio, 120
marginal probability, 111
mass significance, 15
Maximum Likelihood, 3, 31, 42
Minitab, 8
mixed generalized linear models, 168
mixed models, 168
model, 1
model building, 25
model-based estimator, 162
multinomial distribution, 113, 115
multiple logistic regression, 92
multiple regression, 10
multivariate quasi-likelihood, 165
mutual independence, 119

negative binomial distribution, 115, 134
nested models, 45
Newton-Raphson’s method, 189
nominal logistic regression, 122
nominal response, 145
nominal variable, 113
non-linear regression, 23
normal distribution, 38
normal equations, 3
normal probability plot, 64
null model, 45

observed residual, 4
odds, 98
odds ratio, 98, 116, 120, 122
offset, 131
ordinal logit regression, 152
ordinal probit regression, 152
ordinal regression, 152
ordinal response, 32, 35, 145
outlier, 59
overdispersion, 55, 61, 115, 133
overdispersion parameter, 62

pairwise comparison, 14
parameter, 3
partial leverage, 60
partial odds ratio, 120
partial sum of squares, 8
Pascal distribution, 134
Pearson chi-square, 46, 116
Pearson residuals, 56
Poisson distribution, 37, 114, 117
Poisson regression, 126
power link, 40
predicted value, 4
probit analysis, 89
probit link, 40, 85
PROC GLM, 16
Proc GLM, 26
Proc Mixed, 169
product multinomial distribution, 114
proportional hazards, 151
proportional odds, 148
proportional odds model, 149

quantal response, 32
quasi-likelihood, 162

R-square, 5
random effects, 169
rate data, 131
RC model, 148
relative risk, 98
repeated measures data, 165
residual, 1, 3, 4, 56
residual plots, 55
residual sum of squares, 5
response variable, 31
response variables, binary, 32
response variables, binomial, 32, 33
response variables, continuous, 32
response variables, counts, 32, 34
response variables, rates, 32, 35

SAS, 8, 15, 16, 26
saturated model, 45, 113
scale parameter, 133
scaled deviance, 45
score equation, 188
score residual, 57
score test, 48
sequential sum of squares, 8
simple linear regression, 8
statistical independence, 112
statistical model, 1
sum of squares, 4
survival data, 158
survival function, 158

t test, 12
tests on subsets of parameters, 7
tolerance distribution, 85
total sum of squares, 4
truncated Poisson distribution, 53
type 1 SS, 8
Type 1 test, 49
type 2 SS, 8
type 3 SS, 8
Type 3 test, 49
type 4 SS, 8

underdispersion, 61

variance function, 39
variance heterogeneity, 157

Wald test, 47
Wilcoxon-Mann-Whitney test, 52