FRONTIERS OF STATISTICS
The book series provides an overview of the new developments on the frontiers of statistics. It aims at promoting statistical research that has high societal impact and offers a concise overview of recent developments on a topic at the frontiers of statistics. The books in the series are intended to give graduate students and new researchers an idea of where the frontiers of statistics are and to present the common techniques in use, so that they can advance the field by developing new techniques and new results. They are also intended to provide extensive references so that researchers can follow the threads to learn the literature more comprehensively, to conduct their own research, and to cruise and lead the tidal waves on the frontiers of statistics.
SERIES EDITORS

Jianqing Fan
Frederick L. Moore '18 Professor of Finance,
Director of Committee of Statistical Studies,
Department of Operations Research and Financial Engineering,
Princeton University, NJ 08544, USA.

Zhiming Ma
Academy of Math and Systems Science,
Institute of Applied Mathematics,
Chinese Academy of Science,
No. 55, Zhong-guan-cun East Road,
Beijing 100190, China.
EDITORIAL BOARD
Tony Cai, University of Pennsylvania
Min Chen, Chinese Academy of Science
Zhi Geng, Peking University
Xuming He, University of Illinois at Urbana-Champaign
Xihong Lin, Harvard University
Jun Liu, Harvard University
Xiao-Li Meng, Harvard University
Jeff Wu, Georgia Institute of Technology
Heping Zhang, Yale University
New Developments in Biostatistics and Bioinformatics
Editors
Jianqing Fan Princeton University, USA
Xihong Lin Harvard University, USA
Jun S. Liu Harvard University, USA
Volume 1
Higher Education Press
World Scientific
NEW JERSEY · LONDON · SINGAPORE · BEIJING · SHANGHAI · HONG KONG · TAIPEI · CHENNAI
Jianqing Fan
Department of Operations Research and Financial Engineering
Princeton University

Xihong Lin
Department of Biostatistics, School of Public Health
Harvard University

Jun Liu
Department of Statistics
Harvard University
Copyright © 2009 by
Higher Education Press, 4 Dewai Dajie, 100011, Beijing, P.R. China
and
World Scientific Publishing Co. Pte. Ltd., 5 Toh Tuck Link, Singapore 596224
All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without permission in writing from the Publisher.
ISBN 13: 978-981-283-743-1 ISBN 10: 981-283-743-4 ISSN 1793-8155
Printed in P. R. China
Preface
The first eight years of the twenty-first century have witnessed an explosion of data collection at relatively low cost. Data with curves, images, and movies are frequently collected in molecular biology, health science, engineering, geology, climatology, economics, finance, and the humanities. For example, in biomedical research, MRI, fMRI, microarray, and proteomics data are frequently collected for each subject, involving hundreds of subjects; in molecular biology, massive sequencing data are rapidly becoming available; in natural resource discovery and agriculture, thousands of high-resolution images are collected; in business and finance, millions of transactions are recorded every day. The frontiers of science, engineering, and the humanities differ in the problems of their concern, but they share a common theme: massive or complex data have been collected and new knowledge needs to be discovered. Massive data collection and new scientific research have a strong impact on statistical thinking, methodological development, and theoretical studies. They have also challenged traditional statistical theory, methods, and computation. Many new insights and phenomena need to be discovered and new statistical tools need to be developed.

With this background, the Center for Statistical Research at the Chinese Academy of Science initiated the conference series "International Conference on the Frontiers of Statistics" in 2005. The aim is to provide a focal venue for researchers to gather, interact, and present their new research findings, to discuss and outline emerging problems in their fields, to lay the groundwork for future collaborations, and to engage more statistical scientists in China in research on the frontiers of statistics. After the general conference in 2005, the 2006 International Conference on the Frontiers of Statistics, held in Changchun, focused on the topic "Biostatistics and Bioinformatics". The conference attracted many top researchers in the area and was a great success. However, there were still many Chinese scholars, particularly young researchers and graduate students, who were not able to attend the conference, which hampered one of the purposes of the conference series. From this, an alternative idea was born: inviting active researchers to provide a bird's-eye view of the new developments on the frontiers of statistics, on the theme topics of the conference series. This will broaden significantly the benefits of statistical research, both in China and worldwide. The edited books in this series aim at promoting statistical research that has high societal impact and provide not only a concise overview of recent developments on the frontiers of statistics, but also useful references to the literature at large, leading readers truly to the frontiers of statistics.

This book gives an overview of recent developments in biostatistics and bioinformatics. It is written by active researchers in these emerging areas. It is intended
to give graduate students and new researchers an idea of where the frontiers of biostatistics and bioinformatics are and to present the common techniques in use, so that they can advance the fields by developing new techniques and new results. It is also intended to provide extensive references so that researchers can follow the threads to learn the literature more comprehensively and to conduct their own research. It covers three important topics where statistics is still advancing rapidly today: Analysis of Survival and Longitudinal Data, Statistical Methods for Epidemiology, and Bioinformatics.

Ever since the invention of nonparametric and semiparametric techniques in statistics, they have been widely applied to the analysis of survival data and longitudinal data. In Chapter 1, Jianqing Fan and Jiancheng Jiang give a concise overview of this subject under the framework of the proportional hazards model. Nonparametric and semiparametric modeling and inference are stressed. Donglin Zeng and Jianwen Cai introduce an additive-accelerated rate regression model for analyzing recurrent events in Chapter 2. This is a flexible class of models that includes both additive rate models and accelerated rate models, and it allows simple statistical inference. Longitudinal data arise frequently from biomedical studies, and quadratic inference functions provide important approaches to their analysis. An overview of this topic is given in Chapter 3 by John Dziak, Runze Li, and Annie Qu. In Chapter 4, Yi Li gives an overview of modeling and analysis of spatially correlated data with emphasis on mixed models.

The next two chapters are on statistical methods for epidemiology. Amy Laird and Xiao-Hua Zhou address the issues of study designs for biomarker-based treatment selection in Chapter 5. Several trial designs are introduced and evaluated. In Chapter 6, Jinbo Chen reviews recent statistical models for analyzing two-phase epidemiology studies, with emphasis on approaches based on estimating equations, pseudo-likelihood, and maximum likelihood.

The last four chapters are devoted to the analysis of genomic data. Chapter 7 features protein interaction predictions using diverse data sources, contributed by Yin Liu, Inyoung Kim, and Hongyu Zhao. The diverse sources of information on protein-protein interactions are elucidated, and computational methods are introduced for aggregating these data sources to better predict protein interactions. Regulatory motif discovery is handled by Qing Zhou and Mayetri Gupta using Bayesian approaches in Chapter 8. The chapter begins with a basic statistical framework for motif finding, extends it to the identification of cis-regulatory modules, and then introduces methods that combine motif finding with phylogenetic footprinting, gene expression or ChIP-chip data, and nucleosome positioning information. Cheng Li and Samir Amin use single nucleotide polymorphism (SNP) microarrays to analyze cancer genome alterations in Chapter 9. Various methods are introduced, including paired and non-paired loss of heterozygosity analysis, copy number analysis, finding significantly altered regions across multiple samples, and hierarchical clustering methods. In Chapter 10, Evan Johnson, Jun Liu, and Shirley Liu give a comprehensive overview of the design and analysis of ChIP-chip data on genome tiling microarrays, spanning from the biological background and ChIP-chip experiments to statistical methods and computing.
The frontiers of statistics are always dynamic and vibrant. Young researchers
are encouraged to jump onto the research wagons and cruise with the tidal waves of the frontiers. It is never too late to get into the frontiers of scientific research. As long as your mind evolves with the frontiers, you always have a chance to catch and to lead the next tidal waves. We hope this volume helps you get into the frontiers of statistical endeavors and cruise on them throughout your career.

Jianqing Fan, Princeton
Xihong Lin, Cambridge
Jun Liu, Cambridge
August 8, 2008
Contents
Preface ... v

Part I  Analysis of Survival and Longitudinal Data

Chapter 1  Non- and Semi-Parametric Modeling in Survival Analysis
Jianqing Fan, Jiancheng Jiang ... 3
  1 Introduction ... 3
  2 Cox's type of models ... 4
  3 Multivariate Cox's type of models ... 14
  4 Model selection on Cox's models ... 24
  5 Validating Cox's type of models ... 27
  6 Transformation models ... 28
  7 Concluding remarks ... 30
  References ... 30

Chapter 2  Additive-Accelerated Rate Model for Recurrent Event
Donglin Zeng, Jianwen Cai ... 35
  1 Introduction ... 35
  2 Inference procedure and asymptotic properties ... 37
  3 Assessing additive and accelerated covariates ... 40
  4 Simulation studies ... 41
  5 Application ... 42
  6 Remarks ... 43
  Acknowledgements ... 44
  Appendix ... 44
  References ... 48

Chapter 3  An Overview on Quadratic Inference Function Approaches for Longitudinal Data
John J. Dziak, Runze Li, Annie Qu ... 49
  1 Introduction ... 49
  2 The quadratic inference function approach ... 51
  3 Penalized quadratic inference function ... 56
  4 Some applications of QIF ... 60
  5 Further research and concluding remarks ... 65
  Acknowledgements ... 68
  References ... 68

Chapter 4  Modeling and Analysis of Spatially Correlated Data
Yi Li ... 73
  1 Introduction ... 73
  2 Basic concepts of spatial process ... 76
  3 Spatial models for non-normal/discrete data ... 82
  4 Spatial models for censored outcome data ... 88
  5 Concluding remarks ... 96
  References ... 96

Part II  Statistical Methods for Epidemiology

Chapter 5  Study Designs for Biomarker-Based Treatment Selection
Amy Laird, Xiao-Hua Zhou ... 103
  1 Introduction ... 103
  2 Definition of study designs ... 104
  3 Test of hypotheses and sample size calculation ... 108
  4 Sample size calculation ... 111
  5 Numerical comparisons of efficiency ... 116
  6 Conclusions ... 118
  Acknowledgements ... 121
  Appendix ... 122
  References ... 126

Chapter 6  Statistical Methods for Analyzing Two-Phase Studies
Jinbo Chen ... 127
  1 Introduction ... 127
  2 Two-phase case-control or cross-sectional studies ... 130
  3 Two-phase designs in cohort studies ... 136
  4 Conclusions ... 149
  References ... 151

Part III  Bioinformatics

Chapter 7  Protein Interaction Predictions from Diverse Sources
Yin Liu, Inyoung Kim, Hongyu Zhao ... 159
  1 Introduction ... 159
  2 Data sources useful for protein interaction predictions ... 161
  3 Domain-based methods ... 163
  4 Classification methods ... 169
  5 Complex detection methods ... 172
  6 Conclusions ... 175
  Acknowledgements ... 175
  References ... 175

Chapter 8  Regulatory Motif Discovery: From Decoding to Meta-Analysis
Qing Zhou, Mayetri Gupta ... 179
  1 Introduction ... 179
  2 A Bayesian approach to motif discovery ... 181
  3 Discovery of regulatory modules ... 184
  4 Motif discovery in multiple species ... 189
  5 Motif learning on ChIP-chip data ... 195
  6 Using nucleosome positioning information in motif discovery ... 201
  7 Conclusion ... 204
  References ... 205

Chapter 9  Analysis of Cancer Genome Alterations Using Single Nucleotide Polymorphism (SNP) Microarrays
Cheng Li, Samir Amin ... 209
  1 Background ... 209
  2 Loss of heterozygosity analysis using SNP arrays ... 212
  3 Copy number analysis using SNP arrays ... 216
  4 High-level analysis using LOH and copy number data ... 224
  5 Software for cancer alteration analysis using SNP arrays ... 229
  6 Prospects ... 231
  Acknowledgements ... 231
  References ... 231

Chapter 10  Analysis of ChIP-chip Data on Genome Tiling Microarrays
W. Evan Johnson, Jun S. Liu, X. Shirley Liu ... 239
  1 Background molecular biology ... 239
  2 A ChIP-chip experiment ... 241
  3 Data description and analysis ... 244
  4 Follow-up analysis ... 249
  5 Conclusion ... 254
  References ... 254

Subject Index ... 259
Author Index ... 261
Part I
Analysis of Survival and Longitudinal Data
Chapter 1
Non- and Semi-Parametric Modeling in Survival Analysis*

Jianqing Fan† and Jiancheng Jiang‡

Abstract  In this chapter, we give a selective review of the nonparametric modeling methods using Cox's type of models in survival analysis. We first introduce Cox's model (Cox 1972) and then study its variants in the direction of smoothing. The model fitting, variable selection, and hypothesis testing problems are addressed. A number of topics worthy of further study are given throughout this chapter.

Keywords: Censoring; Cox's model; failure time; likelihood; modeling; nonparametric smoothing.

*The authors are partly supported by NSF grants DMS-0532370, DMS-0704337 and NIH R01-GM072611.
†Department of ORFE, Princeton University, Princeton, NJ 08544, USA. E-mail: jqfan@princeton.edu
‡Department of Mathematics and Statistics, University of North Carolina, Charlotte, NC 28223, USA. E-mail: [email protected]

1 Introduction

Survival analysis is concerned with studying the time between entry to a study and a subsequent event and has become one of the most important fields in statistics. The techniques developed in survival analysis are now applied in many fields, such as biology (survival time), engineering (failure time), medicine (treatment effects or the efficacy of drugs), quality control (lifetime of a component), and credit risk modeling in finance (default time of a firm).

An important problem in survival analysis is how to model well the conditional hazard rate of failure times given certain covariates, because it involves frequently asked questions about whether or not certain independent variables are correlated with the survival or failure times. These problems have presented a significant challenge to statisticians in the last five decades, and their importance has motivated many statisticians to work in this area. Among the most important contributions is the proportional hazards model (Cox's model) and its associated partial likelihood estimation method (Cox, 1972), which stimulated
a large body of work in this field. In this chapter we review related work along this direction using Cox's type of models and open an avenue of academic research for interested readers. Various estimation methods are considered, a variable selection approach is studied, and a useful inference method, the generalized likelihood ratio (GLR) test, is employed to address hypothesis testing problems for the models. Several topics worthy of further study are laid out in the discussion section.

The remainder of this chapter is organized as follows. We consider univariate Cox's type of models in Section 2 and study multivariate Cox's type of models using the marginal modeling strategy in Section 3. Section 4 focuses on model selection rules, Section 5 is devoted to validating Cox's type of models, and Section 6 discusses transformation models (extensions of Cox's models). Finally, we conclude this chapter in the discussion section.
2 Cox's type of models
Model Specification. The celebrated Cox model has provided a tremendously successful tool for exploring the association of covariates with failure time and survival distributions and for studying the effect of a primary covariate while adjusting for other variables. This model assumes that, given a $q$-dimensional vector of covariates $Z$, the underlying conditional hazard rate (rather than expected survival time $T$),
$$\lambda(t|z) = \lim_{\Delta t \to 0^+} \frac{1}{\Delta t}\, P\{t \le T < t + \Delta t \mid T \ge t,\ Z = z\},$$
is a function of the independent variables (covariates):
$$\lambda(t|z) = \lambda_0(t)\Psi(z), \qquad (2.1)$$
where $\Psi(z) = \exp(\psi(z))$ with the form of the function $\psi(z)$ known, such as $\psi(z) = \beta^T z$, and $\lambda_0(t)$ is an unknown baseline hazard function. Once the conditional hazard rate is given, the conditional survival function $S(t|z)$ and the conditional density $f(t|z)$ are also determined. In general, they have the following relationship:
$$S(t|z) = \exp(-\Lambda(t|z)), \qquad f(t|z) = \lambda(t|z)S(t|z), \qquad (2.2)$$
where $\Lambda(t|z) = \int_0^t \lambda(u|z)\,du$ is the cumulative hazard function. Since no assumptions are made about the nature or shape of the baseline hazard function, the Cox regression model may be considered to be a semiparametric model.

The Cox model is very useful for tackling censored data, which often arise in practice. For example, due to termination of the study or early withdrawal from a study, not all of the survival times $T_1, \ldots, T_n$ may be fully observable. Instead one observes, for the $i$-th subject, an event time $X_i = \min(T_i, C_i)$, a censoring indicator $\delta_i = I(T_i \le C_i)$, as well as an associated vector of covariates $Z_i$. Denote the observed data by $\{(Z_i, X_i, \delta_i): i = 1, \cdots, n\}$, which is an i.i.d. sample from the population $(Z, X, \delta)$ with $X = \min(T, C)$ and $\delta = I(T \le C)$. Suppose that
the random variables $T$ and $C$ are positive and continuous. Then, by Fan, Gijbels, and King (1997), under the Cox model (2.1),
$$\Psi(z) = \frac{E\{\delta \mid Z = z\}}{E\{\Lambda_0(X) \mid Z = z\}}, \qquad (2.3)$$
where $\Lambda_0(t) = \int_0^t \lambda_0(u)\,du$ is the cumulative baseline hazard function. Equation (2.3) allows one to estimate the function $\Psi$ using regression techniques if $\lambda_0(t)$ is known.

The likelihood function can also be derived. When $\delta_i = 0$, all we know is that the survival time $T_i \ge C_i$, and the probability of this is $S(C_i|Z_i) = S(X_i|Z_i)$; whereas when $\delta_i = 1$, the likelihood of getting $T_i$ is $f(T_i|Z_i) = f(X_i|Z_i)$. Therefore the conditional (given covariates) likelihood for getting the data is
$$L = \prod_{\delta_i = 1} f(X_i|Z_i) \prod_{\delta_i = 0} S(X_i|Z_i), \qquad (2.4)$$
and using (2.2), we have
$$\log L = \sum_{\delta_i = 1} \log\big(\lambda(X_i|Z_i)\big) - \sum_i \Lambda(X_i|Z_i). \qquad (2.5)$$
For the proportional hazards model (2.1), we have specifically
$$\log L = \sum_{\delta_i = 1} \big\{\log \lambda_0(X_i) + \psi(Z_i)\big\} - \sum_i \Lambda_0(X_i)\exp\{\psi(Z_i)\}. \qquad (2.6)$$
Therefore, when both $\psi(\cdot)$ and $\lambda_0(\cdot)$ are parameterized, the parameters can be estimated by maximizing the likelihood (2.6).
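To make (2.6) concrete, the sketch below maximizes it numerically in R for one simple parameterization. The constant baseline $\lambda_0(t;\theta) = \theta$, the linear risk $\psi(z;\beta) = \beta z$, and the simulated data are illustrative assumptions, not part of the original development.

```r
## Maximum likelihood for (2.6) with a parameterized baseline:
## lambda_0(t; theta) = theta (constant) and psi(z; beta) = beta * z.
set.seed(1)
n     <- 200
z     <- rnorm(n)
T0    <- rexp(n, rate = 0.5 * exp(1.2 * z))   # true theta = 0.5, beta = 1.2
C0    <- rexp(n, rate = 0.3)                  # independent censoring
x     <- pmin(T0, C0)
delta <- as.numeric(T0 <= C0)

## log L in (2.6): sum_{delta_i = 1} {log lambda_0(X_i) + psi(Z_i)}
##                 - sum_i Lambda_0(X_i) exp{psi(Z_i)}, with Lambda_0(t) = theta * t
negloglik <- function(par) {
  theta <- exp(par[1])                        # keep the baseline positive
  beta  <- par[2]
  -(sum(delta * (log(theta) + beta * z)) - sum(theta * x * exp(beta * z)))
}

fit <- optim(c(0, 0), negloglik, hessian = TRUE)
c(theta = exp(fit$par[1]), beta = fit$par[2]) # should be close to (0.5, 1.2)
```

Any other parametric baseline (a Weibull, say) can be handled in the same way by changing the cumulative hazard term in `negloglik`.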
Estimation. Likelihood inference can be made about the parameters in model (2.1) if the baseline $\lambda_0(\cdot)$ and the risk function $\psi(\cdot)$ are known up to a vector of unknown parameters $\beta$ (Aitkin and Clayton, 1980), i.e.,
$$\lambda_0(\cdot) = \lambda_0(\cdot;\beta) \quad\text{and}\quad \psi(\cdot) = \psi(\cdot;\beta).$$
When the baseline is completely unknown and the form of the function $\psi(\cdot)$ is given, inference can be based on the partial likelihood (Cox, 1975). Since the full likelihood involves both $\beta$ and $\lambda_0(t)$, Cox decomposed the full likelihood into a product of a term corresponding to the identities of successive failures and a term corresponding to the gap times between any two successive failures. The first term inherits the usual large-sample properties of the full likelihood and is called the partial likelihood.
The partial likelihood can also be derived from counting process theory (see, for example, Andersen, Borgan, Gill, and Keiding 1993) or from a profile likelihood as in Johansen (1983). In the following we introduce the latter.

Example 1 [The partial likelihood as a profile likelihood; Fan, Gijbels, and King (1997)]. Consider the case that $\psi(z) = \psi(z;\beta)$. Let $t_1 < \cdots < t_N$ denote the ordered failure times and let $(i)$ denote the label of the item failing at $t_i$. Denote by $R_i$ the risk set at time $t_i-$, that is, $R_i = \{j: X_j \ge t_i\}$. Consider the least informative nonparametric modeling for $\Lambda_0(\cdot)$, in which $\Lambda_0(t)$ puts point mass $\theta_j$ at time $t_j$ in the same way as constructing the empirical distribution:
$$\Lambda_0(t;\theta) = \sum_{j=1}^N \theta_j I(t_j \le t). \qquad (2.7)$$
Then
$$\Lambda_0(X_i;\theta) = \sum_{j=1}^N \theta_j I(i \in R_j). \qquad (2.8)$$
Under the proportional hazards model (2.1), using (2.6), the log-likelihood is
$$\log L = \sum_{i=1}^n \big[\delta_i\{\log \lambda_0(X_i;\theta) + \psi(Z_i;\beta)\} - \Lambda_0(X_i;\theta)\exp\{\psi(Z_i;\beta)\}\big]. \qquad (2.9)$$
Substituting (2.7) and (2.8) into (2.9), one establishes that
$$\log L = \sum_{j=1}^N \big[\log\theta_j + \psi(Z_{(j)};\beta)\big] - \sum_{i=1}^n \sum_{j=1}^N \theta_j I(i \in R_j)\exp\{\psi(Z_i;\beta)\}. \qquad (2.10)$$
Maximizing $\log L$ with respect to $\theta_j$ leads to the following Breslow estimator of the baseline hazard [Breslow (1972, 1974)]:
$$\hat\theta_j = \Big[\sum_{i\in R_j}\exp\{\psi(Z_i;\beta)\}\Big]^{-1}. \qquad (2.11)$$
Substituting (2.11) into (2.10), we obtain
$$\max_{\theta}\log L = \sum_{i=1}^N\Big(\psi(Z_{(i)};\beta) - \log\Big[\sum_{j\in R_i}\exp\{\psi(Z_j;\beta)\}\Big]\Big) - N.$$
This leads to the log partial likelihood function (Cox 1975)
$$\ell(\beta) = \sum_{i=1}^N\Big(\psi(Z_{(i)};\beta) - \log\Big[\sum_{j\in R_i}\exp\{\psi(Z_j;\beta)\}\Big]\Big). \qquad (2.12)$$
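As a quick numerical sanity check on (2.12), the sketch below evaluates the log partial likelihood directly for a linear risk $\psi(z;\beta) = \beta z$ and compares it with the value reported by R's `coxph`; the simulated data and the Breslow handling of ties are assumptions made for illustration.

```r
## Direct evaluation of the log partial likelihood (2.12) versus survival::coxph.
library(survival)
set.seed(3)
n      <- 100
z      <- rnorm(n)
time   <- rexp(n, rate = exp(0.8 * z))
status <- rbinom(n, 1, 0.7)

log_pl <- function(beta) {
  s <- 0
  for (i in which(status == 1)) {
    risk <- which(time >= time[i])                 # risk set R_i
    s <- s + beta * z[i] - log(sum(exp(beta * z[risk])))
  }
  s
}

fit <- coxph(Surv(time, status) ~ z, ties = "breslow")
log_pl(coef(fit))   # should agree with the fitted log partial likelihood
fit$loglik[2]
```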
An alternative expression is
$$\ell(\beta) = \sum_{i=1}^n \delta_i\Big(\psi(Z_i;\beta) - \log\Big[\sum_{j=1}^n Y_j(X_i)\exp\{\psi(Z_j;\beta)\}\Big]\Big),$$
where $Y_j(t) = I(X_j \ge t)$ is the survival indicator of whether the $j$-th subject is still at risk at time $t$. The above partial likelihood function is a profile likelihood and is derived from the full likelihood using the least informative nonparametric modeling for $\Lambda_0(\cdot)$, that is, $\Lambda_0(t)$ has a jump $\theta_i$ at $t_i$.

Let $\hat\beta$ be the partial likelihood estimator of $\beta$ maximizing (2.12) with respect to $\beta$. By standard likelihood theory, it can be shown (see, for example, Tsiatis 1981) that the asymptotic distribution of $\sqrt{n}(\hat\beta - \beta)$ is multivariate normal with mean zero and a covariance matrix which may be estimated consistently by $(n^{-1}I(\hat\beta))^{-1}$, where
$$I(\beta) = \int_0^\tau \Big[\frac{S_2(\beta,t)}{S_0(\beta,t)} - \Big(\frac{S_1(\beta,t)}{S_0(\beta,t)}\Big)^{\otimes 2}\Big]\,d\bar N(t)$$
and, for $k = 0, 1, 2$,
$$S_k(\beta,t) = \sum_{i=1}^n Y_i(t)\,\psi'(Z_i;\beta)^{\otimes k}\exp\{\psi(Z_i;\beta)\},$$
where $\psi'(\cdot;\beta)$ denotes the gradient of $\psi$ with respect to $\beta$, $\bar N(t) = \sum_{i=1}^n N_i(t)$ with $N_i(t) = I(X_i \le t, \delta_i = 1)$, and $x^{\otimes k} = 1, x, xx^T$, respectively, for $k = 0, 1$, and $2$.

Since the baseline hazard $\lambda_0$ does not appear in the partial likelihood, it is not estimable from that likelihood. There are several methods for estimating parameters related to $\lambda_0$. One appealing estimate among them is the Breslow estimator (Breslow 1972, 1974)
$$\hat\Lambda_0(t) = \int_0^t \Big[\sum_{i=1}^n Y_i(u)\exp\{\psi(Z_i;\hat\beta)\}\Big]^{-1}\,d\bar N(u). \qquad (2.13)$$
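In practice the maximizer of (2.12) and the Breslow estimator (2.13) are available in standard software. The following R sketch, with simulated data as an illustrative assumption, fits the linear-risk Cox model and extracts $\hat\Lambda_0$.

```r
## Partial likelihood fit and Breslow baseline estimate, assuming psi(z; beta) = beta^T z.
library(survival)

set.seed(2)
n      <- 300
z1     <- rnorm(n); z2 <- rbinom(n, 1, 0.5)
T0     <- rexp(n, rate = exp(0.7 * z1 - 0.5 * z2))
C0     <- rexp(n, rate = 0.25)
time   <- pmin(T0, C0)
status <- as.numeric(T0 <= C0)

## Maximize the log partial likelihood (2.12)
fit <- coxph(Surv(time, status) ~ z1 + z2)
coef(fit); sqrt(diag(vcov(fit)))     # beta-hat and its standard errors

## Breslow estimator (2.13) of the cumulative baseline hazard
H0 <- basehaz(fit, centered = FALSE)
head(H0)                             # columns: cumulative hazard and time
```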
Hypothesis testing. After fitting the Cox model, one might be interested in checking whether covariates really contribute to the risk function, for example, checking whether the coefficient vector $\beta$ is zero. More generally, one considers the hypothesis testing problem $H_0: \beta = \beta_0$. From the asymptotic normality of the estimator $\hat\beta$, it follows that the asymptotic null distribution of the Wald test statistic
$$(\hat\beta - \beta_0)^T I(\hat\beta)(\hat\beta - \beta_0)$$
is the chi-squared distribution with $q$ degrees of freedom. Standard likelihood theory also suggests that the partial likelihood ratio test statistic
$$2\{\ell(\hat\beta) - \ell(\beta_0)\} \qquad (2.14)$$
and the score test statistic
$$T_n = U(\beta_0)^T I^{-1}(\beta_0)U(\beta_0)$$
have the same asymptotic null distribution as the Wald statistic, where $U(\beta_0) = \ell'(\beta_0)$ is the score function (see, for example, Andersen et al., 1993).

Cox's models with time-varying covariates. The Cox model (2.1) assumes that the hazard function for a subject depends on the values of the covariates and the baseline. Since the covariates are independent of time, the ratio of the hazard rate functions of two subjects is constant over time. Is this assumption reasonable? Consider, for example, the case with age included in the study. Suppose we study survival time after heart transplantation. Then it is possible that age is a more critical risk factor right after transplantation than at a later time. Another example is given in Lawless (1982, page 393) with the amount of voltage as a covariate, which slowly increases over time until the electrical insulation fails. In this case, the impact of the covariate clearly depends on time. Therefore, the above assumption does not hold, and we have to analyze survival data with time-varying covariates. Although the partial likelihood in (2.12) was derived in the setting of the Cox model with non-time-varying covariates, it can also be derived for the Cox model with time-varying covariates if one uses the counting process notation. For details, see the marginal modeling of multivariate data using Cox's type of models in Section 3.1.

More about Cox's models. Owing to the computational simplicity of the partial likelihood estimator, Cox's model has been a useful case study for formal semiparametric estimation theory (Begun, Hall, Huang, and Wellner 1982; Bickel, Klaassen, Ritov, and Wellner 1993; Oakes 2002). Moreover, due to the derivation of the partial likelihood from a profile likelihood (see Example 1), Cox's model has been considered as an approach to statistical science in the sense that it "formulates scientific questions or quantities in terms of parameters $\gamma$ in a model $f(y;\gamma)$ representing the underlying scientific mechanisms (Cox, 1997); partition the parameters $\gamma = (\theta, \eta)$ into a subset of interest $\theta$ and other nuisance parameters $\eta$ necessary to complete the probability distribution (Cox and Hinkley, 1974); develops methods of inference about the scientific quantities that depend as little as possible upon the nuisance parameters (Barndorff-Nielsen and Cox, 1989); and thinks critically about the appropriate conditional distribution on which to base inference" (Zeger, Diggle, and Liang 2004). Although Cox's models have driven a lot of statistical innovations in the past four decades, scientific fruit will continue to be borne in the future. This motivates us to explore some recent developments for Cox's models using the nonparametric idea and, we hope, to open an avenue of academic research for interested readers.
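All three test statistics are routinely reported by standard software. The sketch below, using simulated data as an illustrative assumption, shows them for $H_0: \beta = 0$ with R's survival package.

```r
## Wald, partial likelihood ratio, and score tests of H0: beta = 0 for a Cox model.
library(survival)

set.seed(4)
n      <- 250
z      <- rnorm(n)
w      <- rbinom(n, 1, 0.4)
T0     <- rexp(n, rate = exp(0.6 * z))        # w has no effect under the truth
C0     <- rexp(n, rate = 0.2)
time   <- pmin(T0, C0)
status <- as.numeric(T0 <= C0)

fit <- coxph(Surv(time, status) ~ z + w)
summary(fit)   # reports the likelihood ratio, Wald, and score (log-rank) tests
               # of H0: beta = 0, each referred to a chi-squared distribution
               # with q = 2 degrees of freedom
```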
2.1 Cox's models with unknown nonlinear risk functions
Misspecification of the risk function $\psi$ may occur in the previous parametric form $\psi(\cdot;\beta)$, which could create a large modeling bias. To reduce the modeling bias, one considers nonparametric forms of $\psi$. Here we introduce such an attempt from Fan, Gijbels, and King (1997). For ease of exposition, we consider only the case with $q = 1$:
$$\lambda(t|z) = \lambda_0(t)\exp\{\psi(z)\}, \qquad (2.15)$$
where $z$ is one-dimensional. Suppose the form of $\psi(z)$ in model (2.15) is not specified and the $p$-th order derivative of $\psi(z)$ at the point $z$ exists. Then, by the Taylor expansion,
$$\psi(Z) \approx \psi(z) + \psi'(z)(Z - z) + \cdots + \frac{\psi^{(p)}(z)}{p!}(Z - z)^p,$$
for $Z$ in a neighborhood of $z$. Put
$$\tilde Z = \{1, Z - z, \cdots, (Z - z)^p\}^T \quad\text{and}\quad \tilde Z_i = \{1, Z_i - z, \cdots, (Z_i - z)^p\}^T,$$
where $T$ denotes the transpose of a vector throughout this chapter. Let $h$ be the bandwidth controlling the size of the neighborhood of $z$ and let $K$ be a kernel function with compact support $[-1,1]$ for weighting down the contributions of remote data points. Then, for $|Z - z| \le h$, as $h \to 0$,
$$\psi(Z) \approx \tilde Z^T\alpha, \quad\text{where } \alpha = (\alpha_0, \alpha_1, \cdots, \alpha_p)^T = \{\psi(z), \psi'(z), \cdots, \psi^{(p)}(z)/p!\}^T.$$
By using the above approximation and incorporating the localizing weights, the local (log-)likelihood is obtained from (2.9) as
$$\ell_n(\alpha,\theta) = n^{-1}\sum_{i=1}^n\big[\delta_i\{\log\lambda_0(X_i;\theta) + \tilde Z_i^T\alpha\} - \Lambda_0(X_i;\theta)\exp(\tilde Z_i^T\alpha)\big]K_h(Z_i - z), \qquad (2.16)$$
where $K_h(t) = h^{-1}K(t/h)$. Then, using the least-informative nonparametric model (2.7) for the baseline hazard and the same argument as for (2.12), we obtain the local log partial likelihood
$$\sum_{i=1}^N K_h(Z_{(i)} - z)\Big(\tilde Z_{(i)}^T\alpha - \log\Big[\sum_{j\in R_i}\exp\{\tilde Z_j^T\alpha\}K_h(Z_j - z)\Big]\Big). \qquad (2.17)$$
Maximizing the above function with respect to $\alpha$ leads to an estimate $\hat\alpha$ of $\alpha$. Note that the function value $\psi(z)$ is not directly estimable; (2.17) does not involve the intercept $\alpha_0$, since it cancels out. The first component $\hat\alpha_1 = \hat\psi'(z)$ estimates $\psi'(z)$.
It is evident from model (2.15) that $\psi(z)$ is only identifiable up to a constant. By imposing the condition $\psi(0) = 0$, the function $\psi(z)$ can be estimated by
$$\hat\psi(z) = \int_0^z \hat\psi'(t)\,dt.$$
According to Fan, Gijbels, and King (1997), under certain conditions, the following asymptotic normality holds for $\hat\psi'(z)$:
$$\sqrt{nh^3}\,\big\{\hat\psi'(z) - \psi'(z) - b_n(z)\big\} \xrightarrow{\ D\ } N\big(0, v^2_{\psi'}(z)\big),$$
where $b_n(z)$ is an asymptotic bias term whose explicit form, together with that of $J(z)$, is given in Fan, Gijbels, and King (1997), and
$$v^2_{\psi'}(z) = \sigma^2(z)J^{-1}(z)\int K_1^*(t)^2\,dt,$$
with $K_1^*(t) = tK(t)/\int t^2K(t)\,dt$ and $\sigma^2(z) = E\{\delta|Z = z\}^{-1}$. With the estimator of $\psi(\cdot)$, using the same argument as for (2.13), one can estimate the baseline hazard by
$$\hat\Lambda_0(t) = \int_0^t\Big[\sum_{i=1}^n Y_i(u)\exp\{\hat\psi(Z_i)\}\Big]^{-1}\,d\bar N(u). \qquad (2.18)$$
Inference problems associated with the resulting estimator include constructing confidence intervals and hypothesis tests, which can be approached via standard nonparametric techniques, but to our knowledge no rigorous mathematical theory exists in the literature. A possible test method can be developed along the lines of the generalized likelihood ratio (GLR) tests in Section 5, and the theoretical properties of the resulting tests remain to be developed.

For the case of multiple covariates, the above modeling method is applicable without any difficulty if one employs a multivariate kernel as in common nonparametric regression. See Section 2.2 for further details. However, a fully nonparametric specification of $\psi(\cdot)$ with large dimensionality $q$ may suffer from the "curse of dimensionality" problem. This naturally leads us to consider some dimension reduction techniques.
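The local partial likelihood (2.17) is easy to program directly. Below is a minimal R sketch for the local linear case ($p = 1$); the simulated data, the Epanechnikov kernel, the evaluation point $z_0 = 0$, and the bandwidth $h = 0.3$ are all illustrative assumptions.

```r
## Local linear (p = 1) partial likelihood (2.17) at a single point z0.
set.seed(5)
n  <- 400
z  <- runif(n, -1, 1)
t0 <- rexp(n, rate = exp(sin(2 * z)))       # true psi(z) = sin(2z), lambda_0 = 1
c0 <- rexp(n, rate = 0.3)                   # censoring times
x  <- pmin(t0, c0)
d  <- as.numeric(t0 <= c0)

K <- function(u) 0.75 * pmax(1 - u^2, 0)    # Epanechnikov kernel

## Negative local log partial likelihood; a1 parametrizes psi(Z) ~ a1 * (Z - z0)
## locally (the intercept cancels out, as noted in the text).
neg_local_pl <- function(a1, z0, h) {
  w   <- K((z - z0) / h) / h                # kernel weights K_h(Z_i - z0)
  eta <- a1 * (z - z0)
  val <- 0
  for (i in which(d == 1 & w > 0)) {
    risk <- which(x >= x[i])                # risk set R_i
    val  <- val + w[i] * (eta[i] - log(sum(exp(eta[risk]) * w[risk])))
  }
  -val
}

fit <- optimize(neg_local_pl, interval = c(-10, 10), z0 = 0, h = 0.3)
fit$minimum                                 # estimate of psi'(0); true value is 2*cos(0) = 2
```

Repeating the optimization over a grid of points and integrating the resulting slopes, as described above, yields an estimate of $\psi(\cdot)$ up to the constant fixed by $\psi(0) = 0$.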
2.2 Partly linear Cox's models

The partly linear Cox model is proposed to alleviate the difficulty with a saturated specification of the risk function and takes the form
$$\lambda(t|z_1, z_2) = \lambda_0(t)\exp\{\psi_1(z_1;\beta) + \psi_2(z_2)\}, \qquad (2.19)$$
where $\lambda_0(\cdot)$ is an unspecified baseline hazard function,
the form of the function $\psi_1(z_1;\beta)$ is known up to an unknown vector of finite parameters $\beta$, and $\psi_2(\cdot)$ is an unknown function. This model inherits the nice interpretation of the finite parameter $\beta$ in model (2.1) while modeling possible nonlinear effects of the $d\times 1$ vector of covariates $Z_2$. In particular, when there is no parametric component, the model reduces to the aforementioned fully nonparametric model in Section 2.1. Hence, in practice, the number of components in $Z_2$ is small.

The parameter $\beta$ and the function $\psi_2(z_2)$ can be estimated using the profile partial likelihood method. Specifically, as argued in the previous section, the function $\psi_2$ admits the linear approximation
$$\psi_2(Z_2) \approx \psi_2(z_2) + \psi_2'(z_2)^T(Z_2 - z_2) \equiv \alpha^T\tilde Z_2$$
when $Z_2$ is close to $z_2$, where $\alpha = \{\psi_2(z_2), \psi_2'(z_2)^T\}^T$ and $\tilde Z_2 = \{1, (Z_2 - z_2)^T\}^T$. Given $\beta$, we can estimate the function $\psi_2(\cdot)$ by maximizing the local partial likelihood
$$\ell_n(\alpha) = \sum_{i=1}^N K_H(Z_{2(i)} - z_2)\Big(\psi_1(Z_{1(i)};\beta) + \tilde Z_{2(i)}^T\alpha - \log\Big[\sum_{j\in R_i}\exp\{\psi_1(Z_{1j};\beta) + \tilde Z_{2j}^T\alpha\}K_H(Z_{2j} - z_2)\Big]\Big), \qquad (2.20)$$
where $K_H(z_2) = |H|^{-1}K(H^{-1}z_2)$, with $K(\cdot)$ being a $d$-variate probability density (the kernel) with unique mode $0$ and $\int uK(u)\,du = 0$, and $H$ a nonsingular $d\times d$ matrix called the bandwidth matrix (see, for example, Jiang and Doksum 2003). To express the dependence of the resulting solution on $\beta$, we denote it by $\hat\alpha(z_2;\beta) = \{\hat\psi_2(z_2;\beta), \hat\psi_2'(z_2;\beta)^T\}^T$. Substituting $\hat\psi_2(\cdot;\beta)$ into the partial likelihood, we obtain the profile partial likelihood of $\beta$:
$$\ell_n(\beta) = \sum_{i=1}^n\Big(\psi_1(Z_{1(i)};\beta) + \hat\psi_2(Z_{2(i)};\beta) - \log\Big[\sum_{j\in R_i}\exp\{\psi_1(Z_{1j};\beta) + \hat\psi_2(Z_{2j};\beta)\}\Big]\Big). \qquad (2.21)$$
Maximizing (2.21) with respect to $\beta$ will lead to an estimate of $\beta$. We denote by $\hat\beta$ the resulting estimate. The estimate of the function $\psi_2(\cdot)$ is simply $\hat\psi_2(\cdot;\hat\beta)$. By an argument similar to that in Cai, Fan, Jiang, and Zhou (2007), it can be shown that the profile partial likelihood estimation provides a root-$n$ consistent estimator of $\beta$ (see also Section 3). This allows us to estimate the nonparametric component $\psi_2$ as well as if the parameter $\beta$ were known.
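The profile partial likelihood estimator described above is not an off-the-shelf routine. As a rough practical surrogate for the same partly linear structure, one can fit the smooth component with a penalized spline inside R's `coxph`; the simulated data, the `pspline` term, and its degrees of freedom below are assumptions made for illustration and are not the authors' procedure.

```r
## Partly linear Cox fit: linear effect of z1, smooth effect of z2 via a
## penalized spline (a substitute for the profile kernel estimator in the text).
library(survival)

set.seed(6)
n      <- 400
z1     <- rbinom(n, 1, 0.5)
z2     <- runif(n, -2, 2)
T0     <- rexp(n, rate = exp(0.8 * z1 + sin(z2)))   # psi_1 = 0.8 * z1, psi_2 = sin(z2)
C0     <- rexp(n, rate = 0.2)
time   <- pmin(T0, C0)
status <- as.numeric(T0 <= C0)

fit <- coxph(Surv(time, status) ~ z1 + pspline(z2, df = 4))
summary(fit)                          # beta-hat for z1 plus the spline terms
termplot(fit, term = 2, se = TRUE)    # visualize the estimated psi_2 (up to a constant)
```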
2.3 Partly linear additive Cox's models
The partly linear model (2.19) is useful for modeling failure time data with multiple covariates, but for a high-dimensional covariate $Z_2$ it still suffers from the so-called "curse of dimensionality" in high-dimensional function estimation. One of the methods for attenuating this difficulty is to use an additive structure for the function $\psi_2(\cdot)$, as in Huang (1999), which leads to the partly linear additive Cox model. It specifies the conditional hazard of the failure time $T$ given the covariate value $(z, w)$ as
$$\lambda\{t|z, w\} = \lambda_0(t)\exp\{\psi(z;\beta) + \phi(w)\}, \qquad (2.22)$$
where $\phi(w) = \phi_1(w_1) + \cdots + \phi_J(w_J)$. The parameters of interest are the finite parameter vector $\beta$ and the unknown functions $\phi_j$. The former measures the effect of the treatment variable vector $z$, and the latter may be used to suggest a parametric structure of the risk. This model allows one to explore the nonlinearity of certain covariates, avoids the "curse of dimensionality" problem inherent in the saturated multivariate semiparametric hazard regression model (2.19), and retains the nice interpretability of the traditional linear structure in Cox's model (Cox 1972). See the discussions in Hastie and Tibshirani (1990).

Suppose that the observed data for the $i$-th subject are $\{X_i, \delta_i, W_i, Z_i\}$, where $X_i$ is the observed event time, which is the minimum of the potential failure time $T_i$ and the censoring time $C_i$, $\delta_i$ is the indicator of failure, and $\{Z_i, W_i\}$ is the vector of covariates. Then the log partial likelihood function for model (2.22) is
$$\ell(\beta, \phi) = \sum_{i=1}^n \delta_i\Big\{\psi(Z_i;\beta) + \phi(W_i) - \log\sum_{j\in R_i} r_j(\beta, \phi)\Big\}, \qquad (2.23)$$
where
$$r_j(\beta,\phi) = \exp\{\psi(Z_j;\beta) + \phi(W_j)\}.$$
Since the partial likelihood has no finite maximum over all parameters $(\beta, \phi)$, it is impossible to use maximum partial likelihood estimation for $(\beta, \phi)$ without any restrictions on the function $\phi$. Let us now introduce the polynomial-spline based estimation method in Huang (1999). Assume that $W$ takes values in $\mathcal{W} = [0,1]^J$. Let
$$\xi = \{0 = \xi_0 < \xi_1 < \cdots < \xi_K < \xi_{K+1} = 1\}$$
be a partition of $[0,1]$ into subintervals $I_{Ki} = [\xi_i, \xi_{i+1})$, $i = 0, \ldots, K-1$, and $I_{KK} = [\xi_K, \xi_{K+1}]$, where $K \equiv K_n = O(n^v)$, with $0 < v < 0.5$, is a positive integer such that
$$h \equiv \max_{1\le k\le K+1}|\xi_k - \xi_{k-1}| = O(n^{-v}).$$
Let $S(\ell, \xi)$ be the space of polynomial splines of degree $\ell \ge 1$ consisting of functions $s(\cdot)$ satisfying: (i) the restriction of $s(\cdot)$ to $I_{Ki}$ is a polynomial of order $\ell - 1$ for $1 \le i \le K$; (ii) for $\ell \ge 2$, $s$ is $\ell - 2$ times continuously differentiable on $[0,1]$. According to Schumaker (1981, page 124), there exists a local basis $B_i(\cdot)$, $1 \le i \le q_n$, for $S(\ell, \xi)$ with $q_n = K_n + \ell$, such that for any $\phi_{nj}(\cdot) \in S(\ell, \xi)$,
$$\phi_{nj}(w_j) = \sum_{i=1}^{q_n} b_{ji}B_i(w_j), \qquad 1 \le j \le J.$$
13
Put
B(w) = (Bl(w),· .. ,Bqn(w))T, B(w) = (BT(Wr), ... ,BT(wJ)f,
b j = (bjl , ... ,bjqn)T, b = (bi, ... ,b'})T.
bf
Then cPnj(Wj) = B(wj) and cPn(w) == 2:,1=1 cPnj(Wj) = bTB(w). Under regular smoothness assumptions, cPj's can be well approximated by functions in S(C, ~). Therefore, by (2.23), we have the logarithm of an approximated partial likelihood n
C({J, b)
L b"i{ ¢(Zi;}3) + cPn(Wd -log L exp[¢(Zj;}3) + cPn(Wj )]},
=
.=1
(2.24)
JERi
where J
L cPnj(Wji )
cPn(Wi) =
j=l
with Wji being the j-th component of Wi, for i = 1,··· ,n. Let (/3, b) maximize the above partial likelihood (2.24). Then an estimator of cP(-) at point w is simply ,
the cP(w)
J'
= 2:,j=l cPj(Wj)
.'
wIth cPj(Wj)
=
,T
b j B(wj).
As shown in Huang (1999), when ¢(z; f3) = zT f3, the estimator vn-consistency. That is, under certain conditions,
vn(/3 -
f3) = n- l / 2 I- l (f3)
/3 achieves
n
L l~(Xi' b"i, Zi, Wi) + Op(l) i=l
where I(f3)
= E[l~(X,
b., Z, W)]02 is the information bound and
l~(X, 8, Z, W) = IT (Z -
a*(t) - h*(W)) dM(t)
is the efficient score for estimation of f3 in model (2.22), where h*(w) = hiCwr) + ... + h j (w J) and (a * , hi, . .. ,h j) is the unique L2 functions that minimize
where
M(t)
=
M{X
~ t} - I
t
I{X
~ u} exp[Z'f3 + cP(W)] dAo(u)
is the usual counting process martingale. Since the estimator, achieves the semiparametric information lower bound and is asymptotically linear, it is asymptotically efficient among all the regular estimators (see Bickel, Klaassen, Ritov, and Wellner 1993). However, the information lower bound cannot be consistently estimated, which makes inference for f3 difficult in practice. Further, the asymptotic distribution of the resulting estimator
/3,
Jianqing Fan, Jiancheng Jiang
14
¢ is hard to derive.
This makes it difficult to test if ¢ admits a certain parametric
form. The resulting estimates are easy to implement. Computationally, the maximization problem in (2.24) can be solved via the existing Cox regression program, for example coxph and bs in Splus software [for details, see Huang (1999)]. However, the number of parameters is large and numerical stability in implementation arises in computing the partial likelihood function. An alternative approach is to use the profile partial likelihood method as in Cai et al. (2007) (see also Section 3.2). The latter solves many much smaller local maximum likelihood estimation problems. With the estimators of (3 and ¢('), one can estimate the cumulative baseline hazard function Ao(t) = J~ Ao(u)du by a Breslow's type of estimators:
1(8 t
Ao(t)
=
n
Yi (u)eX P{¢(Zi;,B)
+ ¢(Wi
where Yi(u) = I(Xi ~ u) is the at-risk indicator and Ni(u) is the associated counting process.
3
n
nr 8 dNi(u), 1
= I(Xi <
U,
~i
=
1)
Multivariate Cox's type of models
The above Cox type of models are useful for modeling univariate survival data. However, multivariate survival data often arise from case-control family studies and other investigations where either two or more events occur for the same subject, or from identical events occurring to related subjects such as family members or classmates. Since failure times are correlated within cluster (subject or group), the independence of failure times assumption in univariate survival analysis is violated. Developing Cox's type of models to tackle with such kind of data is in need. Three types of models are commonly used in the multivariate failure time literature: overall intensity process models, frailty models, and marginal hazard models. In general, the overall hazard models deal with the overall intensity, which is defined as the hazard rate given the history of the entire cluster (Andersen and Gill 1982). Interpretation of the parameters in an overall hazard model is conditioned on the failure and censoring information of every individual in the cluster. Consequently, most attention over the past two decades has been confined to marginal hazard models and frailty models. The frailty model considers the conditional hazard given the unobservable frailty random variables, which is particularly useful when the association of failure types within a subject is of interest (see Hougaard 2000). However, such models tend to be restrictive with respect to the types of dependence that can be modeled and model fitting is usually cumbersome. When the correlation among the observations is unknown or not of interest, the marginal hazard model approach which models the "population-averaged" covariate effects has been widely used (see Wei, Lin and Weissfeld 1989, Lee, Wei and Amato 1992, Liang, Self and Chang 1993, Lin 1994, Cai and Prentice 1995, Prentice and Hsu 1997, Spiekerman and Lin 1998, and Cai, Fan, Jiang, and Zhou 2007 among others).
Chapter 1 Non- and Semi- Parametric Modeling in Survival Analysis
15
Suppose that there are n subjects and for each subject there are J failure types. Let Tij denote the potential failure time, C ij the potential censoring time, Xij = min(Tij , Cij ) the observed time, and Zij the covariate vector for the jth failure type of the i-th subject (i = 1,,,, ,n;j = 1,'" ,J). Let Aij be the indicator which equals 1 if Xij is a failure time and 0 otherwise. Let Ft,ij represent the failure, censoring and covariate information for the j-th failure type as well as the covariate information for the other failure types of the ith subject up to time t. The marginal hazard function is defined as
The censoring time is assumed to be independent of the failure time conditioning on the covariates. There are various methods to model the marginal hazard rates of multivariate failure times. In general, different methods employ different marginal models. We here introduce the methods leading to nonparametric smoothing in our research papers.
3.1
Marginal modeling using Cox's models with linear risks
Failure rates differ in both baseline and coefficients. Wei, Lin and Weissfeld (1989) proposed a marginal modeling approach for multivariate data. Specifically, for the j-th type of failure of the i-th subject, they assume that the hazard function Aij (t) takes the form (3.1) where AOj (t) is an unspecified baseline hazard function and (3 j = ((31j, ... ,(3pj f is the failure-specific regression parameter. Now, let Rj(t) = {l : Xl j ~ t}, that is, the set of subjects at risk just prior to time t with respect to the j-th type of failure. Then the j-th failure-specific partial likelihood (Cox 1972; Cox 1975) is L j ((3)
=
II[ i=l
exp{(3TZij L,IERj (Xij)
(.;i
j )}
t
ij
;
(3.2)
exp{(3 Zlj (Xij )}
see also (2.12). Note that only the terms Aij = 1 contribute to the product of (3.2). The maximum partial likelihood estimator /3j for (3j is defined as the solution to the score equation 8 log L j ((3)/8(3
=
o.
(3.3)
Using the counting process notation and the martingale theory, We~, Lin and Weissfeld (1989) established the asymptotic properties of the estimates (3/s, which show that the estimator /3j is consistent for (3j and the estimators /3/s are generally correlated. For readers' convenience, we summarize their argument in the following two examples. The employed approach to proving normality of the estimates is typical and can be used in other situations. Throughout the remainder
Jianqing Fan, Jiancheng Jiang
16
of this chapter, for a column vector a, we use aaT , respectively for k = 0, 1, and 2.
a®k
to denote 1, a, and the matrix
Example 2 (Score Equation in Counting Process Notation). Let Nij(t) = [{Xij ~ t'~ij = I}, Yij(t) = [{Xij ? t}, and Mij(t) = Nij(t) - J~Yij(u) ..ij(u)du. Then the log partial likelihood for the j-th type of failure evaluated at time t is
£j((3,t)
=
t lot
where Nj(u)
= 2:~=1 Nij(u).
Uj ({3, t) =
-lot 109[tYij(u)exP((3TZij(U))] dNj(u),
(3 TZ ij(u)dNij (u)
It is easy to see that the score equation (3.3) is
~ lot Zij(U) dNij(u) -lot SY)((3, U)/S;O) ((3, u) dNj(u) =
0, (3.4)
where and thereafter for k = 0, 1,2
sy) ((3, u)
=
n- 1
n
L Yij( U)Zij (u)®k exp{{3TZ ij (u)). i=l
Example 3 (Asymptotic Normality of the Estimators). By (3.4),
Uj ((3j,t) = where Mj(u)
t; Jro Zij(u)dMij(U) - Jo(t S?)((3j,u)/sjO)((3j,u)dMj (u), (3.5) n
= 2:~=1 Mij(U).
For k
= 0, 1, let
s;k)({3, t) = E[Y1j(t)Zlj(t)®k eXP{{3TZ 1j (t)}]. Using the Taylor expansion of Uj (/3j' (0) around (3, one obtains that
n- 1/2 Uj ((3j,oo)
=
' * ' Aj((3 )../ii((3j - (3j),
where {3* is on the line segment between
A(~) = J fJ
n
-1
/3j and (3j' and
~ ~ .. [8;Z)({3,Xij) ~ i=l
'J
8j
(0)
_ (8 j 1)({3,Xij))®2] (0) . ({3, X ij ) 8 j ((3, X ij )
Note that for any (3, 00
n -1/210
{
8j1) ((3, U) /8jO\(3, U) - S]1) ((3, u) / S]O\(3, U) } dMj (U)
-->
°
in probability. It follows from (3.5) that
(3.6)
Chapter 1 Non- and Semi- Parametric Modeling in Survival Analysis
17
which is asymptotically normal with mean zero. By the consistency of Aj ({3) to a matrix Aj ({3) and by the asymptotic normality of n -1/2 Uj ({3 j' 00), one obtains that
(3.7)
Then by the multivariate martingale central limit theorem, for large n, ({3~, ... ,
{3~?
is approximately normal with mean ({3f, ... ,{3})T and covariance matrix D = (D jz ), j, l = 1,··· ,J, say. The asymptotic covariance matrix between -Jri({3j - (3j) and -Jri({3z - (3z) is given by
D jz ({3j' (3z) = Aj1 ({3j )E{ Wj1 ({3j )Wll ({3Z)T} A 11({3z), where
1
00
Wj1 ({3j)
=
{Zlj (t) - sY) ({3j' t) / 8)°) ({3j' tn dM1j (t).
Wei, Lin and Weissfeld (1989) also gave a consistent empirical estimate of the covariance matrix D. This allows for simultaneous inference about the {3/s. Failure rates differ only in the baseline. Lin (1994) proposed to model the j-th failure time using marginal Cox's model: (3.8) For model (3.1), if the coefficients {3j are all equal to {3, then it reduces to model (3.8), and each {3j is a consistent estimate of {3. Naturally, one can use a linear combination of the estimates, J
{3(w) = LWj{3j
(3.9)
j=l
to estimate {3, where 'LJ=l Wj = 1. Using the above joint asymptotic normality of {3/s, Wei, Lin and Weissfeld (1989) computed the variance of {3(w) and employed the weight w = (W1' ... ,W J ) T minimizing the variance. Specifically, let E be the covariance matrix of ({31' . .. ,{3 J ) T. Then
Using Langrange's multiplication method, one can find the optimal weight:
Jianqing Fan, Jiancheng Jiang
18
If all of the observations for each failure type are independent, the partial likelihood for model (3.8) is (see Cox 1975) J
L(f3)
=
II L (f3) j
j=l
(3.10)
where L j (f3) is given by (3.2) and Yij(t) = I(Xl j ~ t). Since the observations within a cluster are not necessarily independent, we refer to (3.10) as pseudopartial likelihood. Note that J
log L(f3)
= l:)og L j (f3), j=l
and
8 log L(f3) 8f3
= "
J
L.-
)=1
8 log L j (f3) 8f3 .
Therefore, the pseudo-partial likelihood merely aggregates J consistent estimation equations to yield a more powerful estimation equation without using any dependent structure. Maximizing (3.10) leads to an estimator '/3 of f3. We call this estimation method "pseudo-partial likelihood estimation". Following the argument in Example 3, it is easy to derive the asymptotic normality of fo('/3 - (3). For large nand small J, Lin (1994) gave the covariance matrix estimation formula for '/3. It is interesting to compare the efficiency of '/3 with respect to '/3(w), which is left as an exercise for interested readers.
3.2
Marginal modeling using Cox's models with nonlinear risks
The marginal Cox's models with linear risks provide a convenient tool for modeling the effects of covariates on the failure rate, but as we stressed in Section 2.1, they may yield large modeling bias if the underlying risk function is not linear. This motivated Cai, Fan, Zhou, and Zhou (2007) to study the following Cox model with a nonlinear risk: (3.11)
where f3(.) is the regression coefficient vector that may be a function of the covariate Vij , g(.) is an unknown nonlinear effect of Vij. Model (3.11) is useful for modeling the nonlinear effect of Vij and possible interaction between covariates Vij and Zij. A related work has been done in Cai and Sun (2003) using the time-varying coefficient Cox model for univariate data with J = 1.
Chapter 1 Non- and Semi- Parametric Modeling in Survival Analysis
19
Similar to (3.10), the pseudo partial likelihood for model (3.11) is
L({3(.),g(.)) =
II II{ J
n
j=l i=l
{(
j
T
exp (3 Vij) Zij ~ g(Vij)} }t;,.i . L1ERj(Xij ) exp{{3(Vij) Zlj + g(Vij)}
(3.12)
The pseudo-partial likelihood (3.10) can be regarded as parametric counterpart of (3.12). The log-pseudo partial likelihood is given by
10gL({30,g(·))
=
J
n
j=l
i=l
LL~ij{{3(VijfZij + g(Vij) -log
L exp{{3(Vij fZlj lERj(Xij)
+ g(Vij)} }.
(3.13)
Assume that all functions in the components of (3(.) and gO are smooth so that they admit Taylor's expansions: for each given v and u, where u is close to v,
(3(u) :::::: (3(v) + (3'(v)(u - v) == ~ + ".,(u - v), g(u) :::::: g(v) + g'(v)(u - v) == a + 'Y(u - v).
(3.14)
Substituting these local models into (3.12), we obtain a similar local pseudo-partial likelihood to (2.17) in Section 2.1: J
C(e)
=
n
L L Kh(Vij - V)~ij j=li=l x{eXij-log(
L exp(eX;j)Kh(Vij-v))}, lERj(Xij)
(3.15)
e
where = (~T,,,.,T,'Y)T and Xij = (Z'I;,Z'I;(Vij - v),(Vij - v))T. The kernel function is introduced to confine the fact that the local model (3.14) is only applied to the data around v. It gives a larger weight to the data closer to the point v. Let e(v) = (~(V)T,,,.,(v)T,i(v))T be the maximizer of (3.15). Then (:J(v) = 8(v) is a local linear estimator for the coefficient function (30 at the point v. Similarly, an estimator of g'(.) at the point v is simply the local slope i(v), that is, the curve gO can be estimated by integration of the function g'(v). Using the counting process theory incorporated with non parametric regression techniques and the argument in Examples 2 and 3, Cai, Fan, Zhou, and Zhou (2007) derived asymptotic normality of the resulting pseudo-likelihood estimates. An alternative estimation approach is to fit a varying coefficient model for each failure type, that is, for event type j, to fit the model (3.16) resulting in ~)v) for estimating ej(v) = ({3J(v),{3~(v)T,gj(v))T. Under model (3.11), we have = = ... = eJ. Thus, as in (3.9), we can estimate e(v) by a linear combination
e1 e2
J
e(v; w)
=
LWjej(v) j=l
Jianqing Fan, Jiancheng Jiang
20
with L;=l Wj 1. The weights can be chosen in a similar way to (3.10). For details, see the reference above.
3.3
Marginal modeling using partly linear Cox's models
The fully nonparametric modeling of the risk function in the previous section is useful for building nonlinear effects of covariates on the failure rate, but it could lose efficiency if some covariates' effects are linear. To gain efficiency and to retain nice interpretation of the linear Cox models, Cai, Fan, Jiang, and Zhou (2007) studied the following marginal partly linear Cox model: (3.17) where Zij (-) is a main exposure variable of interest whose effect on the logarithm of the hazard might be non-linear; W ij (-) = (Wij1 (·),··· ,Wijq(·)f is a vector of covariates that have linear effects; AOj (.) is an unspecified baseline hazard function; and g(.) is an unspecified smooth function. For d-dimensional variable Zij, one can use an additive version g(Z) = gl(Zl) + ... + g(Zd) to replace the above function g(.) for alleviating the difficulty with curse of dimensionality. Like model (3.8), model (3.17) allows a different set of covariates for different failure types of the subject. It also allows for a different baseline hazard function for different failure types of the subject. It is useful when the failure types in a subject have different susceptibilities to failures. Compared with model (3.8), model (3.17) has an additional nonlinear term in the risk function. A related class of marginal models is given by restricting the baseline hazard functions in (3.17) to be common for all the failure types within a subject, i.e., (3.18) While this model is more restrictive, the common baseline hazard model (3.18) leads to more efficient estimation when the baseline hazards are indeed the same for all the failure types within a subject. Model (3.18) is very useful for modeling clustered failure time data where subjects within clusters are exchangeable. Denote by Rj(t) = {i : Xij ~ t} the set of subjects at risk just prior to time t for failure type j. If failure times from the same subject were independent, then the logarithm of the pseudo partial likelihood for (3.17) is (see Cox 1975) J
£(/3, g(.» =
n
2:: 2:: Llij {/3TWij (Xij ) + g(Zij (Xij » -
Rij (/3, g)},
(3.19)
j=li=l
where R ij (/3,g) = log (L1ERjCXij) exp[/3TW1j (Xij ) + g(Zlj(Xij»l). The pseudo partial likelihood estimation is robust against the mis-specification of correlations among failure times, since we neither require that the failure times are independent nor specify a dependence structure among failure times. Assume that g(.) is smooth so that it can be approximated locally by a polynomial of order p. For any given point Zo, by Taylor's expansion,
g(z) ~ g(zo)
f.. gCk)(zo) k! (z -
+6
k=l
zo)k == 0: + "(TZ,
(3.20)
Chapter 1 Non- and Semi- Parametric Modeling in Survival Analysis T
-
21
.
where, = bl,'" ,"(p) and Z = {z - Zo,'" , (z - zo)pV. Usmg the local model (3.20) for the data around Zo and noting that the local intercept a cancels in (3.19), we obtain a similar version of the logarithm of the local pseudo-partial likelihood in (2.17): J
e({3,,)
=
n
LLKh(Zij(Xij ) - ZO)~ij j=1 i=1 (3.21)
where
Rtj ({3,,) = log(
L
exp[{3TWlj(Xij)
+ ,TZlj(Xij)]Kh(Zlj(Xij) -
zo)),
lERj(Xij)
and Zij(U) = {Zij(U) - zo,'" , (Zij(U) - zo)pV. Let (,6(zo),--y(zo)) maximize the local pseudo-partial likelihood (3.21). Then, ~n estimator of g' (.) at the point Zo is simply the first component of i( zo), namely g'(zo) = ,oil (zo)· The curve ?J can be estimated by integration on the function g'(zo) using the trapezoidal rule by Hastie and Tibshirani (1990). To assure the identifiability of g(.), one can set g(O) :t:: 0 without loss of generality. Since only the local data are used in the estimation of {3, the resulting estimator for (3 cannot be root-n consistent. Cai, Fan, Jiang, and Zhou (2007) referred to (,6(zo), i(zo)) as the naive estimator and proposed a profile likelihood based estimation method to fix the drawbacks of the naive estimator. Now let us introduce this method. For a given (3, we obtain an estimator g(k)(.,{3) of g(k)(.), and hence g(.,{3), by maximizing (3.21) with respect to,. Denote by i(zo,{3) the maximizer. Substituting the estimator g(.,{3) into (3.19), one can obtain the logarithm of the profile pseudo-partial likelihood: J
fp({3)
=
n
LL~ij{,8TWij +g(Zij,{3) j=li=1 -lOg(
L
eXP [{3TW1j
+?J(Zlj,{3)])}.
(3.22)
lERj (X ij )
Let ,6 maximize (3.22) and i = i(zo, ,6). Then the proposed estimator for the parametric component is simply ,6 and for the nonparametric component is gO =
g(., ,6). Maximizing (3.22) is challenging since the function form ?J(., (3) is implicit. The objective function ep (') is non-concave. One possible way is to use the backfitting algorithm, which iteratively optimizes (3.21) and (3.22). More precisely, given (3o, optimize (3.21) to obtain ?J(., (3o). Now, given g(., (3o), optimize (3.22) with respect to {3 by fixing the value of (3 in ?J(-' (3) as (3o, and iterate this until convergence. An alternative approach is to optimize (3.22) by using the NewtonRaphson method, but ignore the computation of 8~2?J(·,{3), i.e. setting it to zero in computing the Newton-Raphson updating step.
22
Jianqing Fan, Jiancheng Jiang
As shown in Cai, Fan, Jiang, and Zhou (2007), the resulting estimator /3 is root-n consistent and its asymptotic variance admits a sandwich formula, which leads to a consistent variance estimation for /3. This furnishes a practical inference tool for the parameter {3. Since /3 is root-n consistent, it does not affect the estimator of the nonparametric component g. If the covariates (WG' Zlj)T for different j are identically distributed, then the resulting estimate 9 has the same distribution as the estimate in Section 2.1. That is, even though the failure types within subjects are correlated, the profile likelihood estimator of g(.) performs as well as if they were independent. Similar phenomena were also discovered in nonparametric regression models (see Masry and Fan 1997; Jiang and Mack 2001). With the estimators of (3 and g(.), one can estimate the cumulative baseline hazard function AOj(t) = J~ AOj(u)du under mild conditions by a consistent estimator:
where Yij(u) = J(Xij ;?: u) is the at-risk indicator and Nij(u) 1) is the associated counting process.
3.4
=
J(Xij ~
U,
f}.ij =
Marginal modeling using partly linear Cox's models with varying coefficients
The model (3.17) is useful for modeling nonlinear covariate effects, but it cannot deal with possible interaction between covariates. This motivated Cai, Fan, Jiang, and Zhou (2008) to consider the following partly linear Cox model with varying coefficients: (3.24) where W ij (-) = (Wij1 (·),··· ,Wijq(·))T is a vector of covariates that has linear effects on the logarithm of the hazard, Zij(-) = (Zijl (.), ... ,Zijp(· ))T is a vector of covariates that may interact with some exposure covariate Vij (.); AOj (.) is an unspecified baseline hazard function; and Q{) is a vector of unspecified coefficient functions. Model (3.24) is useful for capturing nonlinear interaction between covariates V and Z. This kind of phenomenon often happens in practice. For example, in the aforementioned FRS study, V would represent the calendar year of birthdate, W would consist of confounding variables such as gender, blood pressure, cholesterol level and smoking status, etc, and Z would contain covariates possibly interacting with V such as the body mass index (BMI). In this example, one needs to model possible complex interaction between the BMI and the birth cohort. As before we use R j (t) = {i : X ij ;?: t} to denote the set of the individuals at risk just prior to time t for failure type j. If failure times from the same subject
Chapter 1 Non- and Semi- Parametric Modeling in Survival Analysis
23
were independent, then the partial likelihood for (3.24) is
For the case with J = 1, if the coefficient functions are constant, the partial likelihood above is just the one in Cox's model (Cox 1972). Since failure times from the same subject are dependent, the above partial likelihood is actually again a pseudo-partial likelihood. Assume that o{) is smooth so that it can be approximated locally by a linear function. Denote by fj (.) the density of V1j . For any given point Vo E U]=ISUPP(fj), where supp(fj) denotes the support of fj(') , by Taylor's expansion,
a(v) ~ a(vo) + a'(vo)(v - vo) == 0 + ",(v - vo).
(3.26)
Using the local model (3.26) for the data around Vo, we obtain the logarithm of the local pseudo-partial likelihood [see also (2.17)J: J
£((3,,)
=
n
LLKh("Vij(Xij ) - vo)ll ij j=li=1 (3.27)
R:j ({3,,) = log(
L
exp[{3TWlj(Xij) + ,TUlj(Xij, VO)JKh(Vzj(Xij ) - vo)).
lERj(Xij)
Let (i3(vo), i(vo)) maximize the local pseudo-partial likelihood (3.27). Then, an estimator of a(·) at the point Vo is simply the local intercept 6(vo), namely &(vo) = 6(vo), When Vo varies over a grid of prescribed points, the estimates of the functions are obtained. Since only the local data are used in the estimation of {3, the resulting estimator for {3 cannot be y'n-consistent. Let us refer to (13 (vo), &(vo)) as a naive estimator. To enhance efficiency of estimation, Cai, Fan, Jiang and Zhou (2008) studied a profile likelihood similar to (3.22). Specifically, for a given {3, they obtained an estimator of &(-, (3) by maximizing (3.27) with respect to,. Substituting the estimator &(-,{3) into (3.25), they obtained the logarithm of the profile pseudopartial likelihood: J
n
£p({3) = L L ll ij {{3TWij j=1 i=1 -log(
L lERj(Xij)
+ &("Vij, (3)TZij
eXP[{3TWlj +&(Vzj,{3f Z ljJ)}.
(3.28)
Jianqing Fan, Jiancheng Jiang
24
Let 13 maximize (3.28). The final estimator for the parametric component is simply 13 and for the coefficient function is o{) = o{,j3). The idea in Section 3.3 can be used to compute the profile pseudo-partial likelihood estimator. The resulting estimator 13 is root-n consistent and its asymptotic variance admits a sandwich formula, which leads to a consistent variance estimation for 13. Since 13 is yin-consistent, it does not affect the estimator of the non parametric component a. If the covariates (W'G, Zlj)Y for different j are identically distributed, then even though the failure types within subjects are correlated, the profile likelihood estimator of a(·) performs as well as if they were independent [see Cai, Fan, Jiang, and Zhou (2008)]. With the estimators of (3 and a(·), one can estimate the cumulative baseline hazard function AOj (t) = J~ AOj (u )du by a consistent estimator:
1 t
AOj(t) =
n
[2:)'ij(u) exp{j3TWij(u)
o
+ a(Vij (U))yZij (u)}
i=l
n
r L dNij(U), 1
i=l
where YijO and Nij(u) are the same in Section 3.3.
4
Model selection on Cox's models
For Cox's type of models, different estimation methods have introduced for estimating the unknown parameters/functions. However, when there are many covariates, one has to face up to the variable selection problems. Different variable selection techniques in linear regression models have been extended to the Cox model. Examples include the LASSO variable selector in Tibshirani (1997), the Bayesian variable selection method in Faraggi and Simon (1998), the nonconcave penalised likelihood approach in Fan and Li (2002), the penalised partial likelihood with a quadratic penalty in Huang and Harrington (2002), and the extended BIC-type variable selection criteria in Bunea and McKeague (2005). In the following we introduce a model selection approach from Cai, Fan, Li, and Zhou (2005). It is a penalised pseudo-partial likelihood method for variable selection with multivariate failure time data with a growing number of regression coefficients. Any model selection method should ideally achieve two targets: to efficiently estimate the parameters and to correctly select the variables. The penalised pseudo-partial likelihood method integrates them together. This kind of idea appears in Fan & Li (2001, 2002). Suppose that there are n independent clusters and that each cluster has Ki subjects. For each subject, J types of failure may occur. Let Tijk denote the potential failure time, Cijk the potential censoring time, X ijk = min(Tijk' Cijk) the observed time, and Zijk the covariate vector for the j-th failure type of the k-th subject in the i-th cluster. Let 6. ijk be the indicator which equals 1 if Xijk is a failure time and 0 otherwise. For the failure time in the case of the j-th type of failure on subject k in cluster i, the marginal hazards model is taken as
(4.1)
Chapter 1 Non- and Semi- Parametric Modeling in Survival Analysis
25
where {3 = ((31, ... ,(3dn ) T is a vector of unknown regression coefficients, dn is the dimension of (3, Zijk(t) is a possibly external time-dependent covariate vector, and AOj(t) are unspecified baseline hazard functions. Similar to (3.10), the logarithm of a pseudo-partial likelihood function for model (4.1) is
R({3) =
J
n
Ki
j=1
i=1
k=1
L L L ~ijk ((3TZijk(Xijk) -
R({3)) ,
(4.2)
where R({3) = log [2:7=1 2::~1 Yij9(Xijk)exP{{3TZlj9(Xijk)}] and Yijg(t) = I (X1jg ;;?: t) is the survival indicator on whether the g-th subject in the l-th cluster surviving at time t. To balance modelling bias and estimation variance, many traditional variable selection criteria have resorted to the use of penalised likelihood, including the AIC (Akaike, 1973) and BIC (Schwarz, 1978). The penalised pseudo-partial likelihood for model (4.1) is defined as dn
L({3) = R({3) - n LP>'j (l(3j I), j=1
(4.3)
where P>'j (l(3j I) is a given nonnegative function called a penalty function with Aj as a regularisation or tuning parameter. The tuning parameters can be chosen subjectively by data analysts or objectively by data themselves. In general, large values of Aj result in simpler models with fewer selected variables. When Ki = 1, J = 1, dn = d, and Aj = A, it reduces to the penalized partial likelihood in Fan and Li (2002). Many classical variable selection criteria are special cases of (4.3). An example is the Lo penalty (or entropy penalty)
p>.(IBI) = 0.5A 2 I(IBI =I- 0). In this case, the penalty term in (4.3) is merely 0.5nA 2 k, with k being the number of variables that are selected. Given k, the best fit to (4.3) is the subset of k variables having the largest likelihood R({3) among all subsets of k variables. In other words, the method corresponds to the best subset selection. The number of variables depends on the choice of A. The AIC (Akaike, 1973), BIC (Schwarz, 1978), qr.criterion (Shibata, 1984), and RIC (Foster & George, 1994) correspond to A = (2/n)1/2, {10g(n)/n}1/2, [log{log(n)}] 1/2, and {10g(dn )/n}1/2, respectively. Since the entropy penalty function is discontinuous, one requires to search over all possible subsets to maximise (4.3). Hence it is very expensive computationally. Furthermore, as analysed by Breiman (1996), best-subset variable selection suffers from several drawbacks, including its lack of stability. There are several choices for continuous penalty functions. The L1 penalty, defined by p>.(IBI) = AIBI, results in the LASSO variable selector (Tibshirani, 1996). The smoothly clipped absolute deviation (SCAD) penalty, defined by
p~(B)
= Al(IOI ::;; A)
+ (aA -
0)+ l(O > A),
a-I
(4.4)
Jianqing Fan, Jiancheng Jiang
26
for some a > 2 and A > 0, with PA(O) = O. Fan and Li (2001) recommended a = 3.7 based on a risk optimization consideration. This penalty improves the entropy penalty function by saving computational cost and resulting in a continuous solution to avoid unnecessary modelling variation. Furthermore, it improves the L1 penalty by avoiding excessive estimation bias. The penalised pseudo-partial likelihood estimator, denoted by maximises (4.3). For certain penalty functions, such as the L1 penalty and the SCAD penalty, maximising L(f3) will result in some vanishing estimates of coefficients and make their associated variables be deleted. Hence, by maximising L(f3), one selects a model and estimates its parameters simultaneously. Denote by f30 the true value of f3 with the nonzero and zero components f310 and f320. To emphasize the dependence of Aj on the sample size n, Aj is written as Ajn. Let Sn be the dimension of f31O,
/3,
As shown in Cai, Fan, Li, and Zhou (2005), under certain conditions, if 0, bn ----f 0 and d~jn ----f 0, as n ----f 00, then with probability tending to one, there exists a local maximizer /3 of L(f3), such that an
----f
Furthermore, if Ajn ----f 0, JnjdnAjn ----f 00, and an = O(n- 1 / 2 ), then with probability tending to 1, the above consistent local maximizer /3 = (/3f, /3'f)T must be such that (i) /32 = 0 and (ii) for any nonzero constant Sn x 1 vector Cn with c;cn
r::: Tr-1/2(A 11 11
ynC n
+ ~ ){ f31 A
- f310
+
(
All
= 1,
+ ~ )-1 b}
D N(O, 1),
-----7
where All and r ll consist of the first Sn columns and rows of A(f3lO, 0) and r(f3lO, 0), respectively (see the aforementioned paper for details of notation here). The above result demonstrates that the resulting estimators have the oracle property. For example, with the SCAD penalty, we have an = 0, b = 0 and ~ = 0 for sufficiently large n. Hence, by the above result,
VnC;:r~11/2 All (/31
-
f31O) ~ N(O, 1).
The estimator /31 shares the same sampling property as the oracle estimator. Furthermore, /32 = 0 is the same as the oracle estimator that knows in advance that f32 = O. In other words, the resulting estimator can correctly identify the true model, as if it were known in advance. Further study in this area includes extending the above model selection method to other Cox's type of models, such as the partly linear models in Sections 2.3, 3.3 and 3.4.
Chapter 1 Non- and Semi- Parametric Modeling in Survival Analysis
5
27
Validating Cox's type of models
Even though different Cox's type of models are useful for exploring the complicate association of covariates with failure rates, there is a risk that misspecification of a working Cox model can create large modeling bias and lead to wrong conclusions and erroneous forecasting. It is important to check whether certain Cox's models fit well a given data set. In parametric hypothesis testing, the most frequently used method is the likelihood ratio inference. It compares the likelihoods under the null and alternative models. See for example the likelihood ratio statistic in (2.14). The likelihood ratio tests are widely used in the theory and practice of statistics. An important fundamental property of the likelihood ratio tests is that their asymptotic null distributions are independent of nuisance parameters in the null model. It is natural to extend the likelihood ratio tests to see if some nonparametric components in Cox's type of models are of certain parametric forms. This allows us to validate some nested Cox's models. In nonparametric regression, a number of authors constructed the generalized likelihood ratio (GLR) tests to test if certain parametric/nonparametric null models hold and showed that the resulting tests share a common phenomenon, the Wilks phenomenon called in Fan, Zhang, and Zhang (2001). For details, see the reviewing paper of Fan and Jiang (2007). In the following, we introduce an idea of the GLR tests for Cox's type of models. Consider, for example, the partly linear additive Cox model in (2.22):
.A{ tlz, w}
=
.Ao(t) exp{ zT,B + (h (WI) + ... +
(5.1)
where ,B is a vector of unknown parameters and
Ha: A,B = 0 versus HI: A,B
-I- 0
(5.2)
and
Hb:
(5.3)
for d = 1,'" ,J. The former (5.2) tests the linear hypothesis on the parametric components, including the significance of a subset of variables, and the latter (5.3) tests the significance of the nonparametric components. Under model (5.1), the maximum partial likelihood is
£(HI) =
t
c5i { Z[ j3
i=1
+ ¢(Wi )
-log
L exp[ZJ j3 + ¢(W
j )]},
JERi
where j3 and ¢(Wi ) = Ef=1 ¢j(Wji) are estimators in Section 2.3. For the null model (5.2), the maximum partial likelihood is
£(Ha)
=
t
i=1
c5i { Z[ j3a
+ ¢a(Wj ) -log L JERi
exp[ZJ j3a
+ ¢a(W j )]},
Jianqing Fan, Jiancheng Jiang
28
where ¢a(W j ) = L.~=l ¢k(Wkj) is the estimate based on polynomial splines under the null model. For the null model (5.3), the maximum partial likelihood is n
£(Hb)
=
L
2::>5i { Zf,l\ + ¢b(Wj ) -log exp[ZJ,Bb i=l JERi
+ ¢b(Wj )]},
where ¢b(W j ) = L.~=d+l ¢k(Wkj) is again the polynomial-spline based estimate under Hb. The GLR statistics can be defined as
and
An,b
=
£(Hd - £(Hb),
respectively for the testing problems (5.2) and (5.3). Since the estimation method for ¢ is efficient, we conjecture that the Wilks phenomenon holds (see Bickel 2007). That is, asymptotic null distribution of An,a is expected to be the chi-squared distribution with l degrees of freedom (see Fan and Huang 2005; Fan and Jiang 2007). Hence, the critical value can be computed by either the asymptotic distribution or simulations with nuisance parameters' values taken to be reasonable estimates under Ho. It is also demonstrated that one can proceed to the likelihood ratio test as if the model were parametric. For the test statistic An,b, we conjecture that Wilks' phenomenon still exists. However, it is challenging to derive the asymptotic null distribution of the test statistic. Similar test problems also exist in other Cox's type of models. More investigations along this direction are needed.
6
Transformation models
Although Cox's type of models are very useful for analyzing survival data, the proportionality hazard assumption may not hold in applications. As an alternative to Cox's model (2.1), the following model
A(tIZ(t)) = AO(t) + Z(t)'/3 postulates an additive structure on the baseline and the covariates' effects. This model is called an additive hazards model and has received much attention in statistics. See, for example, Lin and Ying (1994), Kulich and Lin (2000), and Jiang and Zhou (2007), among others. A combination of the multiplicative and additive hazards structures was proposed by Lin and Ying (1995), which takes the form
A(tlZl (t), Z2(t))
=
AO(t) exp(/3~ Zl (t))
+ /3;Z2(t),
where Zl(t) and Z2(t) are different covariates of Z(t). It may happen in practice that the true hazard risks are neither multiplicative nor additive. This motivated Zeng, Yin, and Ibrahim (2005) to study a class of transformed hazards models
Chapter 1 Non- and Semi- Parametric Modeling in Survival Analysis
29
by imposing both an additive structure and a known transformation C(·) on the hazard function, that is, G(A(tIZ(t))
=
Ao(t)
+ (3' Z(t),
(6.1 )
where C(·) is a known and increasing transformation function. Essentially, model (6.1) is a partial linear regression model for the transformed hazard function. In particular, within the family of the Box-Cox transformations S
C(x) = {(x -l)/s, ifs > 0, log(s),
ifs=O,
model (6.1) is the additive hazards model when s = 1 and the Cox model when s = O. Since the model (6.1) allows a much broader class of hazard patterns than those of the Cox proportional hazards model and the additive hazards model, it provides us more flexibility in modeling survival data. The sieve maximum likelihood method can be used to estimate the model parameters, and the resulting estimators of parameters are efficient in the sense that their variances achieve the semiparametric efficiency bounds. For details, see Zeng, Yin, and Ibrahim (2005). Further work along this topic includes variable selection using the SCAD introduced before, hypothesis testing for the model parameters, and extensions to multivariate data analysis, among others, to which interested readers are encouraged to contribute. Let S (·1 Z) be the survival function of T conditioning on a vector of covariates Z. Cox's model can be rewritten as log[-log{S(tIZ)}] = H(t)
+ Z' (3,
(6.2)
where H is an unspecified strictly increasing function. An alternative is the proportional odds model (Pettitt 1982; Bennett 1983): -logit{S(tIZ)} = H(t)
+ Z'(3.
(6.3)
Thus, a natural generalisation of (6.2) and (6.3) is C{S(tIZ)} = H(t)
+ Z'(3,
(6.4)
where C(·) is a known decreasing function. It is easy to see that model (6.4) is equivalent to C(T)
=
-(3' Z
+ e,
(6.5)
where e is a random error with distribution function F = 1 - G- 1 . For the noncensored case, the above model was studied by Cuzick (1988) and Bickel and Ritov (1997). For model (6.5) with possibly right censored observations, Cheng, Wei and Ying (1995) studied a class of estimating functions for the regression parameter (3. A recent extension to model (6.5) is considered by Ma and Kosorok (2005), which takes the form H(T) = (3' Z
+ feW) + e,
Jianqing Fan, Jiancheng Jiang
30
where f is an unknown smooth function. This model obviously extends the partly linear Cox model (2.22) and model (6.5). Penalized maximum likelihood estimation has been investigated by Ma and Kosorok (2005) for the current status data, which shows that the resulting estimator of f3 is semiparametrically efficient while the estimators of Hand fare n 1/ 3 -consistent. Since the estimation method is likelihood based, the variable selection method and the GLR test introduced before are applicable to this model. Rigor theoretical results in this direction are to be developed.
7
Concluding remarks
Survival analysis is an important field in the theory and practice of statistics. The techniques developed in survival analysis have penetrated many disciplines such as the credit risk modeling in finance. Various methods are available in the literature for studying the survival data. Due to the limitation of space and time, we touch only the partial likelihood ratio inference for Cox's type of models. It is demonstrated that the non- and semi- parametric models provide various flexibility in modeling survival data. For analysis of asymptotic properties of the nonparametric components in Cox's type of models, counting processes and their associated martingales play an important role. For details, interested readers can consult with Fan, Gijbels, and King (2007) and Cai, Fan, Jiang, and Zhou (2007). There are many other approaches to modeling survival data. Parametric methods for censored data are covered in detail by Kalbfleisch and Prentice (1980, Chapters 2 and 3) and by Lawless (1982, Chapter 6). Semiparametric models with unspecified baseline hazard function are studied in Cox and Oakes (1984). Martingale methods are also used to study the parametric models (Borgan 1984) and the semi parametric models (Fleming and Harrington 2005; Andersen et aI, 1993).
References [1 J H. Akaike. Maximum likelihood identification of Gaussian autoregressive moving average models. Biometrika 60 (1973), 255-265. [2J P. K. Andersen, O. Borgan, R. D. Gill, and N. Keiding. Statistical Models Based on Counting Processes, Springer-Verlag, New York 1993. [3] M. Aitkin and D. G. Clayton. The fitting of exponential, Weibull and extreme value distributions to complex censored survival data using GLIM. Appl. Statist. 29 (1980), 156-163. [4J O. E. Barndorff-Nielsen and D. R. Cox. Asymptotic Techniques for Use in Statistics. Chapman & Hall, 1989, page 252. [5] J. M. Begun, W. J. Hall, W. -M. Huang, and J. A. Wellner. Information and asymptotic efficiency in parametric-nonparametric models. Ann. Statist. 11 (1982), 432-452. [6] S. Bennetts. Analysis of survival data by the proportional odds model. Statist. Med. 2 (1983), 273-277.
Chapter 1 Non- and Semi- Parametric Modeling in Survival Analysis
31
[7] P. J. Bickel. Contribution to the discussion on the paper by Fan and Jiang, "Nonparametric inference with generalized likelihood ratio tests". Test 16 (2007), 445-447. [8] P. J. Bickel, C. A. Klaassen, Y. Ritov, and J. A. Wellner. Efficient and Adaptive Estimation in Semiparametric Models, Johns Hopkins University Press, Baltimore 1993. [9] P. J. Bickel and Y. Ritov. Local asymptotic normality ofranks and covariates in transformation models. In "Festschrift for Lucien Le Cam" (eds. D. Pollard, E. Torgersen and G. L. Yang) 43-54. Springer-Verlag, New York 1997. [10] L. Breiman. Heuristics of instability and stabilization in model selection. Ann. Statist. 24 (1996), 2350-2383. [11] N. E. Breslow. Contribution to the discussion on the paper by D.R. Cox, "Regression and life tables" . J. Royal Statist. Soc. B 34 (1972), 216-217. [12] N. E. Breslow. Covariance analysis of censored survival data. Biometrics 30 (1974) 89-99. [13] F. Bunea, and 1. W. McKeague. Covariate selection for semiparametric hazard function regression models. J. Mult. Anal. 92 (2005), 186-204. [14] J. Cai, J. Fan, J. Jiang, and H. Zhou. Partially Linear Hazard Regression for Multivariate Survival Data. Jour. Am. Statist. Assoc. 102 (2007), 538-55l. [15] J. Cai, J. Fan, J. Jiang, and H. Zhou. Partially Linear Hazard Regression with Varying-coefficients for Multivariate Survival Data. J. Roy. Statist. Soc. B 70 (2008), 141-158. [16] J. Cai, J. Fan, R. Li, and H. Zhou. Variable selection for multivariate failure time data. Biometrika 92 (2005), 303-316. [17] J. Cai, J. Fan, H. Zhou, and Y. Zhou. Marginal hazard models with varyingcoefficients for multivariate failure time data. Ann. Statist. 35 (2007), 324354. [18] J. Cai and R. L. Prentice. Estimating equations for hazard ratio parameters based on correlated failure time data, Biometrika 82 (1995), 151-164. [19] Z. Cai and Y. Sun. Local linear estimation for time-dependent coefficients in Cox's regression models. Scandinavian Journal of Statistics 30 (2003), 93-11l. [20] S. C. Cheng, L. J. Wei, and Z. Ying. Analysis of transformation models with censored cata. Biometrika 82 (1995), 835-845. [21] D. R. Cox. Regression models and life-tables (with discussion). J. Roy. Statist. Soc. B 34 (1972), 187-220. [22] D. R. Cox. Partial likelihood. Biometrika 62 (1975), 269-276. [23] D. R. Cox. The current position of statistics: a personal view, (with discussion). International Statistical Review 65 (1997), 261-276. [24] D. R. Cox and D.V. Hinkley. Theoretical Statistics. Chapman & Hall, London 1974. [25] J. Cuzick. Rank regression. Aniz. Statist. 16 (1988), 1369-1389. [26] J. Fan, 1. Gijbels, and M. King. Local likelihood and local partial likelihood in hazard regression. Ann Statist. 25 (1997), 1661-1690. [27] J. Fan and J. Jiang. Nonparametric inference with generalized likelihood ratio tests (with discussions). Test 16 (2007), 409-478. [28] J. Fan and T. Huang. Profile Likelihood Inferences on semiparametric varying-
32
Jianqing Fan, Jiancheng Jiang
coefficient partially linear models. Bernoulli 11 (2005), 1031-1057. [29] J. Fan and R. Li. Variable selection via penalized likelihood. J. Am. Statist. Assoc. 96 (2001), 1348-1360. [30] J. Fan and R. Li. Variable selection for Cox's proportional hazards model and frailty model. Ann. Statist. 30 (2002), 74-99. [31] D. Faraggi and R. Simon. Bayesian variable selection method for censored survival data. Biometrics 54 (1998), 1475-1485. [32] T. R. Fleming and D. P. Harrington, Counting Processes and Survival Analysis, John Wiley & Sons, New Jersey 2005. [33] D. P. Foster and E. 1. George. The risk inflation criterion for multiple regression. Ann. Statist. 22 (1994), 1947-1975. [34] T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman & Hall, London 1990. [35] P. Hougaard. Analysis of Multivariate Survival Data. Springer-Verlag, New York 2000. [36] J. Huang. Efficient estimation of the partly linear additive Cox model. Ann. Statist. 27 (1999), 1536-1563. [37] J. Huang and D. Harrington. Penalised partial likelihood regression for rightcensored data with bootstrap selection of the penalty parameter. Biometrics 58 (2002), 781-791. [38] J. Jiang and K. A. Doksum. Empirical Plug-in Curve and Surface Estimates, In "Mathematical and Statistical Methods in Reliability", eds. B. H. Lindqvist and K. A. Doksum. Ser. Qual. Reliab. Eng. Stat. 7,433-453. World Scientific Publishing Co., River Edge, New Jersey 2003. [39] J. Jiang and Y.P. Mack. Robust local polynomial regression for dependent data. Statistic a Sinica 11 (2001), 705-722. [40] J. Jiang and H. Zhou. Additive Hazards Regression with Auxiliary Covariates. Biometrika 94 (2007), 359-369. [41] S. Johansen. An extension of Cox's regression model. International Statistical Review 51 (1983), 258-262. [42] J. D. Kalbfleisch and R. L. Prentice. The Statistical Analysis of Failure Time Data. Wiley, New York 2002. [43] M. Kulich and D. Y. Lin. Additive hazards regression with co- variate measurement error. J. Am. Statist. Assoc. 95 (2000), 238-248. [44] J. F. Lawless. Statistical Models and Methods for Lifetime Data. Wiley, New York 1982. [45] E. W. Lee, L. J. Wei, and D. A. Amato. Cox-type regression analysis for large numbers of small groups of correlated failure time observations, Survival Analysis: State of the Art. J. P. Klein and P. K. Goel (eds.), Kluwer Academic Publishers 1992, 237-247. [46] K. Y. Liang, S. G. Self, and Y. Chang. Modeling marginal hazards in multivariate failure time data, J. Roy. Statist. Soc. B 55 (1993),441-453. [47] D. Y. Lin. Cox regression analysis of multivariate failure time data: The marginal approach. Statistics in Medicine 13 (1994), 2233-2247. [48] D. Y. Lin and Z. Ying. Semiparametric analysis of the additive risk model. Biometrika 81 (1994),61-71.
Chapter 1 Non- and Semi- Parametric Modeling in Survival Analysis
33
[49] D. Y. Lin and Z. Ying. Semiparametric analysis of general additivemultiplicative hazard models for counting processes, Ann. Statis. 23 (1995), 1712-1734. [50] S. Ma and M. R. Kosorok. Penalized log-likelihood estimation for partly linear transformation models with current status data. Ann. Statist. 33 (2005), 2256-2290. [51] E. Masry and J. Fan. Local polynomial estimation of regression functions for mixing processes. Scandinavian Journal of Statistics 24 (1997), 165-179. [52] D. Oakes. Survival analysis, In "Statistics in the 21st Century", eds. A. E. Raftery, M. A. Tanner, and M. T. Wells. Monographs on Statistics and Applied Probability 93,4-11. Chapman & Hall, London 2002. [53] A. N. Pettitt. Inference for the linear model using a likelihood based on ranks. J. R. Statist. Soc. B 44 (1982), 234-243. [54] R. L. Prentice and L. Hsu. Regression on hazard ratios and cross ratios in multivariate failure time analysis. Biometrka 84 (1997), 349-363. [55] G. Schwarz. Estimating the dimension of a modeL Ann. Statist. 6 (1978), 461-464. [56] 1. Schumaker. Spline Functions: Basic Theory. Wiley, New York 1981. [57] R. Shibata. Approximation efficiency of a selection procedure for the number of regression variables. Biometrika 71(1984), 43-49. [58] C. F. Spiekerman and D. Y. Lin. Marginal regression models for multivariate failure time data, Jour. Am. Statist. Assoc. 93 (1998), 1164-1175. [59] R. Tibshirani. The lasso method for variable selection in the Cox modeL Statist. Med. 16 (1997), 385-395. [60] A. A. Tsiatis. A large sample study of Cox's regression modeL Ann. Statist. 9 (1981),93-108. [61] L. J. Wei, D. Y. Lin, and L. Weissfeld. Regression analysis of multivariate incomplete failure time data by modelling marginal distributions. J. Am. Statist. Assoc. 84 (1989), 1065-1073. [62] Zeger, Diggle, and Liang (2004). A Cox model for biostatistics of the future. Johns Hopkins University, Dept. of Biostatistics Working Papers. [63] D. Zeng, G. Yin, and J. G. Ibrahim. Inference for a Class of Transformed Hazards Models. J. Am. Statist. Assoc. 100 (2005), 1000-1008.
This page intentionally left blank
Chapter 2 Additive-Accelerated Rate Model for Recurrent Event Donglin Zeng * Jianwen Cai t
Abstract We propose an additive-accelerated rate regression model for analyzing recurrent event. Covariates are split into two classes with either additive effect or accelerated effect. The proposed model includes both additive rate model and accelerated rate model as special cases. We propose a simple inference procedure for estimating parameters and derive the asymptotic properties. Furthermore, we discuss and justify a method for assessing additive covariates and accelerated covariates. Simulation studies are conducted to examine small-sample performance. Real data are analyzed to illustrate our approach.
Keywords: Recurrent event; additive rate model; accelerated rate model; estimating equation.
1
Introduction
Recurrent events are common in many medical applications, when the same subject experiences the same event sequentially. Examples of recurrent events include cancer relapse, multiple infection episodes, and repeated use of drugs etc. A large number of literature have been contributed to develop statistical models and associated inference for estimating the effects of risk covariates on recurrent events. Among them, the most commonly used models include proportional intensity model (Andersen and Gill, 1982) where the effects of risk covariates are assumed to be multiplicative. Later, the intensity model has been generalized to proportional rate model (Pepe and Cai, 1993; Lawless and Nadeau, 1995), where instead of modeling intensity function of recurrent event, one aims to model the rate function defined as the mean event count over time. In the common proportional rate model, the covariate effect is still assumed to be multiplicative. In many practical applications, the effects of risk covariates are observed to be non-multiplicative which makes the use of either proportional intensity model or *Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7420, USA, E-mail:
[email protected] tDepartment of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7420, USA, E-mail:
[email protected]
35
36
Donglin Zeng, Jianwen Cai
proportional rate model questionable. Recently, some alternative models have been proposed. One class of these models (Ghosh, 2004) is time-transformation model, where all the subjects have the similar rate function and the effect of covariates is to accelerate or decelerate the development of rate function. Particularly, if we let N*(t) denote recurrent event process and X be covariates, then such an accelerated rate model assumes
This model mimics the accelerate failure time model in univariate survival data (Cox and Oakes, 1984, Chapter 5). The estimation and inference are provided in Ghosh (2004). Different from the accelerated rate model, Schaubel, Zeng and Cai (2006) proposes an additive model for modelling the rate function of recurrent events. If we let Z be covariates with additive effect, then an additive rate model assumes E[dN*(t)JZ] = dJt(t)
+ ZT "fdt.
An estimating equation can be constructed to estimate the effect "f (Schaubel et aI, 2006). Neither accelerated rate model nor additive rate model may be satisfactory in some applications. For example, in the data from the Vitamin A Community Trial study which will be analyzed in Section 5, there are two covariates, one is a dichotomous treatment covariate and the other is an age variable. Figure 1 plots the behavior of the empirical cumulative rate function for these covariates. Especially, the first plot is the cumulative rate functions in the two treatment groups and the second plot is the cumulative rate functions in the four quarter age groups. Interestingly, it appears that the treatment effect may not be additive due to the nonlinear trend of difference between two cumulative rates; however, for some age groups, there is a clear additive effect structure over time. Therefore, a good model to accommodate these different behaviors of different covariates should be useful. In this paper, we propose a more flexible rate model which we name as additive-accelerated rate model. In notation, as before, we let N* (t) denote recurrent event process and (X, Z) be covariates. The additive-accelerated rate model assumes where Jt is an unknown and increasing function. That is, we treat part of covariates to have additive effect while the others as accelerated. Clearly, when f3 = 0, we obtain the additive rate model; when "f = 0, we obtain the accelerated rate model. Thus, the proposed model includes both the accelerated rate model and the additive rate model as special cases. In Section 2, we give the details of inference procedure for estimating the parameters in the additive-accelerated rate model. We also provide the asymptotic properties of the proposed estimators. In Section 3, we discuss some issues on practical implementation of the proposed model, especially focusing on choosing covariates with additive effect and covariates with accelerated effect. Section 4 . reports the results from a number of simulations with moderate sample sizes. Our
Chapter 2
E
"a;
15
to
.c
E
::>
z
Additive-Accelerated Rate Model for Recurrent Event
;~ 0
I
20
40
60
80
100
37
I
120
Days
i~ i o
··········1
~~--,-------r------.------,------,------~.
20
40
60
80
100
120
Days
Figure 1: The upper plot is the empirical cumulative rate functions in placebo group and treatment group: the solid line is the treatment group and the dashed line is the placebo group. The lower plot is the empirical cumulative rate functions among four age groups: the sold line, the dashed, the dotted line and the dot-dashed line in turn correspond to age~1.18, 1.18< age ~ 1.83, 1.83
2.70
methods are applied to analyzing real data in Section 5. The paper concludes with a few remarks in Section 6.
2
Inference procedure and asymptotic properties
Suppose we observe data from n i.i.d subjects and denote them as
where T is the study duration and C i is the censoring time for subject i. Assume that the censoring time is independent of recurrent event conditional on covariates, we propose the following inference procedure for estimating f3 and f. Fix f3 and ,. We first estimate /.L(t). Since
E[dN(t)IX, z, C 1\ T > where yet)
tJ = Y(t)E[dN*(t)IX, ZJ = Y(t)dJ.L(e XT {3t) + Y(t)dtZT"
= I(C 1\ T > t),
we obtain
E[dN(te- XT (3) _ Y(te- XT (3)dte- XT {3 ZT ,IX, Z, C 1\ T > te- XT {3J
Donglin Zeng, Jianwen Cai
38
Therefore, for fixed (3 and " we can estimate /-l(t) by rescaling N(t) and CAT with a factor exTf3 ; thus, an estimator for /-l(t) is the Breslow-type estimator given by A
/-l(t;{3,,) =
lt 2:;=1 o
x x {dNj (se- XJ f3) - Yj(se- Jf3)dse- Jf3 ZJ,} Y( -XT(3) . uj=1 j se J
"n
To estimate (3 and" we consider the following estimating equations
After substituting il(t;{3,,) into the left-hand side, we have
_ 2:;=1 Yj(te- XJf3)(dNj (te- XJf3) -
ZJ,e- XJ f3 dt )} = O.
(2.1)
2:;=1 Yj(te- XJf3 )
The re-transformation of the numerator on the left-hand side yields
Equivalently,
n
2:jYi(t)
i=1
Z
{ (x.)-
"n )} uJ=1 y. (tex,[ f3-xJ (3) (Zj X· J
XTf3-XTf3 ' 2 : j =1 Yj(te' J) n
J
(dNi(t)-zT,dt) =0.
(2.2) We wish to solve equation (2.2), or equivalently equation (2.1). However, the estimating function in (2.2) is not continuous in {3. Therefore, we suggest minimizing the norm of the left-hand side of (2.2) to obtain the estimators. Particularly, we implement the Neider-Meader simplex method to find the minimum in our numerical studies. At the end of this section, we will show y'n(S - (30,.:y - ,0) converges in distribution to a multivariate normal distribution with mean zeros and covariance in a sandwiched form A -1 B (A -1 f, where
Chapter 2
A
~ V."
Additive-Accelerated Rate Model for Recurrent Event
J+,(t){(~)
En
Y.(teX'{f3-xJf3) ( Zj )
- J;n~ y.(tex,{f3-xJf3~j J
and B
=
39
1
} (dNi(t) - zT ,dt)
J
II
f3=f30 ,'Y=,o
E [SiST] with Si equal to
x (dNi (te- X'{f30) - zT ,e-X'{f3odt) -
J
Yo (te •
_XT (Zi) E [Y(te,f30 ) Xi
+
XTf30
) {dN(te-
XTf30
) - ZT ,oeXTf3o E [Y(te)]
J
x XTf3O }i(te- '{f3O)E [Y(te)
XTf30
dt}]
(i)]
E [Y(te- XTf30 ) {dN(te- XTf30 ) - ZT ,oe- XTf30 dt}] E [Y(te- XTf30 )] 2
x
(2.3) .
Therefore, to estimate the asymptotic covariance, we can consistently estimate Si by Hi , simply replacing (30, ,0 and f..Lo by their corresponding estimators and replacing expectation with empirical mean in the expression of Si. To estimate A, we let () denote ((3,,) and define
Un ((})
= n- 1
tJ
_ E7=1
X [}i(te- '{f3)
(i:) {
dNi(te- X'{f3) - ZT,dte- x '{f3
[}j(te-XJf3)(dNj(te-XJf3) - Z!,e- XJ f3 dt)]
E7=1
}l.
[}j(te- XJ f3)]
We also let A be the numerical derivative of A with appropriate choice of perturbation size hn, i.e., the k-th column of A is given by (Un(B + hnek) - Un(B))/h n where ek is the k-th canonical vector. In Theorem 2 given later, we will show A is a consistent estimator for A. Thus, the asymptotic covariance is estimated by
40
Donglin Zeng, Jianwen Cai
We start to state the asymptotic properties for (/3/'1). We assume the following conditions. (AI) Assume X and Z are bounded and with positive probability, [1, X, Z) are linearly independent. (A2) Matrix A is assumed to be non-singular. (A3) P(C > TIX, Z) > 0 and C given X, Z has bounded density in [0, T). (A4) !-lo(t) is strictly increasing and has bounded second derivative in [0, T]. Condition (AI) is necessary since otherwise, f3 and "( assumed in the model may not be identifiable. Both conditions (A2) and (A3) are standard. Under these conditions, the following theorems hold. These conditions are similar to conditions 1~4 in Ying (1993). Theorem 1. Under conditions (Al)~(A4), there exists (/3, i) locally minimizing the norm of the estimating function in (2) such that
Theorem 2. Under conditions (Al)~(A4) and assuming h n ---t 0 and fohn E (co, Cl) for two positive constants Co and Cl, A-I B(A -1) T is a consistent estimator for A- 1B(A-1 )T.
Thus, Theorem 1 gives the asymptotic normality of the proposed estimators and Theorem 2 indicates that the proposed variance estimator is valid. Using the estimators for f3 and ,,(, we then estimate the unknown function p,O with fl(t; /3, i), i.e.,
The following theorem gives the asymptotic properties for fl(t). Theorem 3. Let Dx be the support of X. Then under conditions (Al)~(A4), xT for any to < TSUPXED x e- {3o, fo(fl(t) - P,o(t)) converges weakly to a Gaussian process in loo [0, to].
The upper bound for to in Theorem 3 is determined by the constraint that the study duration time is T. The proofs of all the theorems utilize the empirical process theory and are given in the appendix.
3
Assessing additive and accelerated covariates
One important assumption in the proposed model is that covariates X have accelerated effect while covariates Z have additive effect. However, it is unknown in practice which covariates should be included as part of X or Z. In this section, we propose some ways for assessing these covariates. From the proposed model, if X and Z are determined correctly, we then expect that the mean of Y(t)(N(t)* - p,(te XT {3) - tZT "() is zero. As the result,
Chapter 2
Additive-Accelerated Rate Model for Recurrent Event
41
correct choices of X and Z should minimize the asymptotic limit of n-
1
i
T
o
n
2
L"Yi(t) {Nt(t) - f-t(te XTf3 ) - tz[ 'Y} dt. i=l
The following result also shows that if we use wrong choice of X or Z, this limit cannot be minimized.
Proposition 1. Suppose Xc and Zc are correct covariates with non-zero accelerated effect f3c and non-zero additive effect 'Yc respectively. Further assume (C1) the domain of all the covariates is offull rank; (C2) f-tc(t) is continuous and satisfies f-tc(O) = 0 and f-t~(0) > 0; (C3) Xc contains more than 2 covariates; or, Xc contains one covariate taking at least 3 different values; or, Xc is a single binary variable but f-tc is not a linear function. Then if Xw and Zw are the wrong choices of the accelerated effect and the additive effect respectively, i.e., Xw #- Xc or Zw #- Zc, then for any non-zero effect f3w and 'Yw and function f-tw,
1T E [Y(t) {N*(t) - f-tw(te X;f3w) - tZ;;''Yw } 2] dt > 1T E[Y(tHN*(t)-f-tc(te X;f3C)-tZ';'Ycf]dt. The proof of Proposition is given in the appendix. Condition (C1) says that the covariates can vary independently; Condition (C2) is trivial; Condition (C3) says that the only case we exclude here is the f-t(t) = At and Xc is binary. The excluded case is unwanted since f-t(te Xcf3c ) = A{{ef3c - l)Xc + l}t resulting that Xc can also be treated as additive covariate. From Proposition 1, it is concluded that whenever we misspecify the accelerated covariates or additive covariates which has non-zero effect, we cannot minimize the limit of the square residuals for all observed t. Hence, if we find X and Z minimizing the criterion, it implies that the differences between X and Xc and Z and Zc are only those covariates without effect.
4
Simulation studies
We conduct two simulation studies to examine the performance of the proposed estimators with moderate sample sizes. In the first simulation study, we generate recurrent event using the following intensity model
E[dN(t)IN(t-), X, z, ~J = ~(df-t(tef3x)
+ 'YZdt),
where Z is a Bernoulli random variable and X = Z + N(O, 1), f-t(t) = O.Se t , and ~ is a gamma-frailty with mean 1. It is clear this model implies the marginal model as given in Section 2. Furthermore, we generate censoring time from a uniform distribution in [0,2J so that the average number of events per subject is about
Donglin Zeng, Jianwen Cai
42
1.5. In the second simulation study, we consider the same setting except that /-L(t) = 0.8t2 and the censoring distribution is uniform in [0,3]. To estimate (3 and " we minimize the Euclidean norm of the left-hand side in equation (2). Our starting values are chosen to be close to the true values in order to avoid local minima. To estimate the asymptotic covariance of the estimators, we use the numerical derivative method as suggested in Theorem 2. Particularly, we choose h n to be n- 1 / 2 , 3n- 1 / 2 and 5n- 1 / 2 but find the estimates robust to these choices. Therefore, our table only reports the ones associated with h n = 3n- 1 / 2 . Our numerical experience shows that when sample size is as small as 200, the minimization may give some extreme value for (3 in a fraction of 4% in the simulations; the fraction of bad cases decreases to about 1% with n = 600. Hence, our summary reports the summary statistics after excluding outliers which are 1.5 inter-quartile range above the third quartile or below the first quartile. Table 1 gives the summary of the simulation studies from n = 200, 400, and 600 based on 1000 repetitions. Particularly, "Est" is the median of the estimates; "SEE" is the estimate for the standard deviation from all the estimates; "ESE" is the mean of the estimated standard errors; "CP" is the coverage probability of the 95% confidence intervals. From the table, we conclude that the proposed estimators perform reasonably well with the sample size considered in terms of small bias and accurate inference.
Table 1: Summary of Simulation Studies /-L(t) = 0.8e t
n 200
parameter
f3 'Y
400
(3
600
(3
'Y 'Y
true -0.5 1 -0.5 1 -0.5 1
Est -0.501 0.991 -0.496 0.979 -0.490 1.000
SEE
ESE
CP
0.366 0.414 0.246 0.296 0.202 0.240
0.381 0.437 0.252 0.311 0.207 0.255
0.94 0.96 0.93 0.96 0.94 0.96
/-L(t) = 0.8t2
n 200
parameter (3
400
(3
600
(3
'Y 'Y 'Y
5
true -0.5 1 -0.5 1 -0.5 1
Est -0.478 0.970 -0.498 0.992 -0.493 0.998
SEE
ESE
CP
0.331 0.307 0.240 0.227 0.192 0.185
0.344 0.328 0.237 0.235 0.195 0.195
0.94 0.95 0.92 0.95 0.93 0.96
Application
We apply the proposed method to analyze two real data sets. The first data arise from a chronic granulomatous disease (CGD) study which was previously analyzed by Therneau and Hamilton (1997). The study contained a total 128 patients with 65 patients receiving interferon-gamma and 63 receiving placebo. The number of recurrent infections was 20 in the treatment arm and was 55 in the placebo arm.
Chapter 2
Additive-Accelerated Rate Model for Recurrent Event
43
Table 2: Results from Analyzing Vitamin A Trial Data Covariate vitamin A vs placebo age in years
Estimate -0.0006 -0.0040
Std Error 0.0411 0.0008
p-value 0.98 < 0.001
We also included the age of patients as one covariate. The proposed additiveaccelerated rate model was used to analyze this data. To assess whether the treatment variable and the age variable have either additive or accelerated effect, we utilized the criterion in Section 3 and considered the four different models: (a) both treatment and age were additive effects; (b) both treatment and age were accelerated effects; (c) treatment was additive and age was accelerated; (d) treatment was accelerated and age was additive. The result shows that the model with minimal error is model (b), i.e., the accelerated rate model. Such a model was already fitted in Ghosh (2004) where it showed that the treatment had a significant benefit in decreasing the number of infection occurrences. A second example is from the Vitamin A Community Trial conducted in Brazil (Barreto et aI, 1994). The study was a randomized community trial to examine the effect of the supplementation of vitamin A on diarrhea morbidity and diarrhea severity in children living in areas where its intake was inadequate. Some previous information showed that the vitamin A supplement could reduce the child mortality by 23% to 34% in populations where vitamin deficiency was endemic. However, the effect on the morbidity was little known before the study. The study consisted of young children who were assigned to receive either vitamin A or placebo every 4 months for 1 year in a small city in the Northeast of Brazil. Since it was indicated before that treatment effect might be variable over time, for illustration, we restrict our analysis to the first quarter of the follow-up among boys, which consists of 486 boys with an average number of 3 events. Two covariates of interest are the treatment indicator (vitamin A vs placebo) and the age. As before, we use the proposed method to select among the aforementioned models (a)-(d). The final result shows that model (c) yields the smallest prediction error. That is, our finding shows that the treatment effect is accelerative while the age effect is additive. Particularly, Table 2 gives the results from model (c). It shows that the treatment effect is not significant; however, younger boys tended to experience significantly more diarrhea episodes than older boys. Figure 2 gives the estimated function for J.L(t).
6
Remarks
We have proposed a flexible additive-accelerated model for modelling recurrent event. The proposed model includes the accelerated rate model and additive rate model as special cases. The proposed method performs well in simulation studies. We also discuss the method for assessing whether the covariates are additive covariates or accelerated covariates. When the model is particularly used for prediction for future observations, overfitting can be a problem. Instead of using the criterion function in assessing
Donglin Zeng, lianwen Cai
44
o
20
40
80
60
100
120
Days
Figure 2: Estimated function for J-l(t) in the Vitamin A trial
additive covariates or accelerated covariates, one may consider a generalized crossvalidation residual square error to assess the model fit. Theoretical justification of the latter requires more work. Our model can be generalized to incorporate time-dependent covariates by assuming
E[N(t)\X, Zl = {L(te X (t)Tf3)
+ Z(tf 'Y-
The same inference procedure can be used to estimate covariate effects. However, the nice result on assessing additive covariates or accelerated covariates as Proposition 1 may not exist due to the complex relationship among time-dependent covariates. Finally, our model can also be generalized to incorporate time-varying coefficient.
Acknowledgements We thank Professor Mauricio Barreto at the Federal University of Bahia and the Vitamin A Community Trial for providing the data. This work was partially supported by the National Institutes of Health grant ROI-HL57444.
Appendix Proof of Theorems 1-3. We prove Theorem 1 and Theorem 2. For convenience, we let 0 denote ({3, ,) and use the definition of Un(O). For 0 in any compact set, it is easy to see the class of functions {Y(te-
XT
(3)}, {ZT ,e-xT f3}, {N(te- XT (3) }
Chapter 2
Additive-Accelerated Rate Model for Recurrent Event
45
are P-Donsker so P-Glivenko-Cantelli. Therefore, in the neighborhood of 80 , Un (8) uniformly converges to
U(8) == E [I Yi(te- x [i3)
(i:) {dN i(te- X[i3) - Z'[-r dte - x [i3
_ E [Yj (te- XJi3) (dNj (te-xJ.B) - ZJ"{e-XJ.Bdt)] }] E [Yj(te-XJ.B)]
.
Furthermore, we note that uniformly in 8 in a compact set,
v'n(Un(8) - U(8))
~ Go
[f Y(te-XT~)
(;.) {dN(e-XT't) -
zr7 e- XT 'dt
P n [Y(te-XTi3)(dN(e-XTi3t) - ZT"{e-XT.Bdt )] }] P n [Y(te-XT.B)]
(i) J]
-G n
[IYCte-XT.B)CdN(e-XT.Bt) _ ZT"{e-XT.Bdt) P [Y(te-XT.B) P n [YCtcXT.B)]
+G n [I Y(te-
XTi3
)p [Y Cte-
XTi3
)
(i)]
x P [Y(te-XTi3)CdN(e-XTi3t) - ZT"{e- XTi3 dt)] P n [Y(te-XT.B)] P [Y(te- XTi3 )]
1 ,
where P nand P refer to the empirical measure and the expectation respectively, and G n = y'n(P n - P). Hence, from the Donsker theorem and the definition of 8 i , we obtain n
sup Iv'nUn(8) - v'nU(8) \lJ-Oo\:;;;M/v'n
n- 1 / 2
L 8 l = op(l). i
i=1
On the other hand, conditions (A2)-(A4) implies that U(8) is continuously differentiable around 80 and U(8) = U(80 ) + A(8 - 80 ) + 0(18 - 80 1). Additionally, it is straightforward to verify U(80) = O. As the result, it holds n
sup Iv'nUn (8) - v'nA(8 - 80 ) \O-Oo\:;;;M/v'n
-
1 2
n- /
L 8 l = op(l). i
i=1
We obtain a similar result to Theorem 1 in Ying (1993). Following the same proof as given in Corollary 1 in Ying (1993), we prove Theorem 1.
Donglin Zeng, Jianwen Cai
46
To prove Theorem 2, from the previous argument, we obtain uniformly in a
y'n- neighbor hood of 00 , n
:L S
..;nun(e) = ..;nA(O -eo) +n- 1/ 2
i
+op(1).
i=l
Thus, for any canonical vector e and h n
-+
0,
As the result,
Un(O + hne) - Un(O) _ A (_1_) h - e + op y'nh . n n Since y'nh n is bounded away from zero,
We have shown that the estimator based on the numerical derivatives, i.e., A, converges in probability to A. The consistency of Si to Si is obvious. Therefore,
A-1B(A-l)T
A-1B(A-1)T.
-+
To prove Theorem 3, from fl(t)'s expression, we can obtain
Y(se-xTi3)E[dN(se-xTi3) - Y(se-xTi3)dse-xTi3ZTil o E[Y(se- XT ,8)J2 E[dN(se- xTi3 ) - Y(se-xTi3)dse-xTi3 ZTil +..;n io E[Y(se-XTt3)] - J.L(t) + op(l).
-G n
[
i
t
A
1
r
Clearly, N(te- XT ,8), Y(te- XT ,8), ZT'Y all belong to some Donsker class. Additionally, when t ~ to, E[Y(te- XTt3 )] > O. Hence, after the Taylor expansion of the third term and applying the Donsker theorem, we obtain A
..;n(J.L(t)
_
r{dN(se-XT,80)-Y(se-XT,80)dSCXT,8oZT'Yo} n J.L(t)) - G io E[Y(se-XT,8o)] _
Chapter 2 Additive-Accelerated Rate Model for Recurrent Event
47
Therefore, using the asymptotic expansions for (J and l' as given in proving Theorem 1, we can show vfn(P,(t) - Ito(t)) converges weakly to a Gaussian process in
lOO[O, to]. Proof of Proposition 1. We prove by contradiction. Assume
faT E
[Y(t) {N*(t) - Itw(te x J;f3w) _ tZ'{;/yw}
~ faT E
[Y(t) {N*(t) -
lte(teX~f3c) -
tZ'[ "Ie}
2] dt 2] dt.
Notice E [Y(t) {N*(t) - gt(X, Z)}2] is minimized for gt(X, Z) = E[N*(t)IY(t), X,
Z] = lte(teX~ f3c) + tZ'[ "Ie and such gt(X, Z) is unique almost surely. Therefore, for any t E [0, rJ, we conclude Itw(te XJ;f3 w) + tZJ;"Iw = lte(teX~f3c) with probability one. We differentiate (A.l) at t =
1t~(O)exJ;f3w
+ tZ'[ "Ie
(A.l)
°and obtain
+ ZJ;"Iw = 1t~(O)eX~f3c + Z'[ "Ie.
(A.2)
We show Xc must be included in Xw. Otherwise, there exists some covariate Xl in Xc but in Zw. We will show this is impossible by considering the following two cases. Case I. Suppose Xl has at least three different values and f3el is the coefficient of Xl. Fix the values of all the other covariates. Then equation (A.2) gives
aef3cIXl - bXl
=
d
for some constants a > 0, b, and d. However, since f3el i- 0, the function on the left-hand side of the above equation is strictly convex in Xl so the equation has at most two solutions. Since Xl has at least three different values, we obtain the contradiction. Case II. Suppose Xl has only two different values. Without loss of generality, we assume Xl = or 1. Then we obtain
°
for some function g. Here, Xc,-l means covariates in Xc except Xl and the same is defined for f3e,-l. Consequently, X'[_lf3e,-1 =constant, which implies that Xe,-l has to be empty. That is, Xc only contains a single binary variable. Thus, equation (A.l) gives If ef3c < 1, we replace t by ef3c k t and sum over k = 0,1,2, .... This gives Ite(t) = Act for some constant Ae. If ef3c > 1, we replace t by e-f3c k t and sum over k = 1,2, ... and obtain the same conclusion. Since we assumed that J-le(t) is not a
48
Donglin Zeng, Jianwen Cai
linear function of t when Xc is a binary covariate, P,c(t) = Act introduces the contradiction. As the result, Xc must be included in Xw. Thus, it is easy to see p,:"(0) =1= o. We repeat the same arguments but switch Xw and Xc. We conclude Xw is also included in Xc. Therefore, Xw = Xc. Clearly, f3w = f3c and p,:"(0) = p,~(0) by the fact that Zw and Zc are different from Xw and Xc and Condition (A.1). This further gives z'E,w = z'[ Ic. Since the covariates are linearly independent and IW =1= 0 and IC =1= 0, we obtain Zw = ZC. This contradicts with the assumption Xw =1= Xc or Zw =1= ZC·
References [1] Andersen, P. K. and Gill, R. D. (1982). Cox's regression model for counting processes: a large sample study. Annals of Statistics, 10, 1100-1120. [2] Cox, D. R. and Oakes, D. (1984). Analysis of Survival Data. Chapman and Hall: London. [3] Ghosh, D. (2004). Accelerated rates rgression models for recurrent failure time data. Lifetime Data Analysis, 10, 247-26l. [4] Kalbfleisch, J. D. and Prentice, R. L. (1980). The Statistical Analysis of Failure Time Data. Wiley: New York. [5] Lawless, J. F. and Nadeau, C. (1995). Some simple and robust methods for the analysis of recurrent events. Technometrics, 37, 158-168. [6] Pepe, M. S. and Cai, J. (1993). Some graphical displays and marginal regression analysis for recurrent failure times and time dependent covariates. Journal of the American Statistical Association, 88, 811-820. [7] Schaubel, D. E., Zeng, D., and Cai, J. (2006). A Semiparametric Additive Rates Model for Recurrent Event Data. Lifetime Data Analysis, 12,386-406. [8] Therneau, T. M. and Hamilton, S. A. (1997). RhDNase as an example of recurrent event analysis. Statistics in Medicine, 16, 2029-2047. [9] Ying, Z. (1993). A large sample study of rank estimation for censored regression data. Annals of Statistics, 21, 76-99.
Chapter 3 An Overview on Quadratic Inference Function Approaches for Longitudinal Data John J. Dziak * Runze Li t
Annie Qu
:j:
Abstract Correlated data, mainly including longitudinal data, panel data, functional data and repeated measured data, are common in the fields of biomedical research, environmental studies, econometrics and the social sciences. Various statistical procedures have been proposed for analysis of correlated data in the literature. This chapter intends to provide a an overview of quadratic inference function method, which proposed by Qu, Lindsay and Li (2000). We introduce the motivation of both generalized estimating equations method and the quadratic inference method. We further review the quadratic inference method for time-varying coefficient models with longitudinal data and variable selection via penalized quadratic inference function method. We further outline some applications of the quadratic inference function method on missing data and robust modeling.
Keywords: Mixed linear models; hierarchical linear models; hierarchical generalized linear models; generalized estimating equations; quadratic inference function; dispersion parameter; GEE estimator; QIF estimator; modified Cholesky decomposition; time-varying coefficient model; penalized QIF; smoothly clipped absolute; LASSO.
1
Introduction
Correlated data occurs almost everywhere, and is especially common in the fields of biomedical research, environmental studies, econometrics and the social sciences (Diggle et al. 2002, Davis 2002, Hedeker & Gibbons 2006). For example, to achieve sufficient statistical power using a limited number of experiment units in clinical trials, the outcome measurements are often repeatedly obtained from the *The Methodology Center, The Pennsylvania State University, 204 E. Calder Way, Suite 400 State College, PA 16801, USA. E-mail: [email protected] tDepartment of Statistics and The Methodology Center, The Pennsylvania State University, University Park, PA 16802-2111, USA. E-mail: [email protected] tDepartment of Statistics, Oregon State University, Corvallis, OR 97331-4606, USA. E-mail: [email protected]
49
50
John J. Dziak, Runze Li, Annie Qu
same subject at different time points; in education, students' achievements are more likely to be similar if they are from the same class, and the class, school or community are treated as natural clusters; in spatial environmental studies, researchers often have no control over spatially correlated field samples such as streams or species abundance. There are two major existing approaches to modeling and analyzing correlated data. One is the subject-specific approach including mixed linear models (Laird & Ware, 1982), hierarchical linear models (Bryk & Raudenbush, 1992), and hierarchical generalized linear models (Lee & NeIder, 1996). The other is a population average approach including generalized estimating equations (Liang & Zeger, 1986). The former approach emphasizes modeling heterogeneity among clusters which induces a correlation structure among observations. The latter models the correlation structure directly. These two approaches yield regression estimators with different interpretations, and different values in practice (Zeger, Liang & Albert, 1988; Neuhaus, Kalbfleisch & Hauck, 1991; Lee & NeIder 2004). The major drawback to subject-specific approaches is that random effects are assumed to follow an explicit distribution, and typically a normal random effects distribution is assumed. Lee & NeIder (1996) allow a broader class of parametric models for random effects. However, in general there is not enough information for goodness-of-fit tests for random effects distributions. Neuhaus, Hauck & Kalbfleish (1992) show that when the distributions of the random effects are misspecified, the estimator of fixed effects could be inconsistent for logistic models. On the other hand, the generalized estimating equations approach has an advantage in that it requires only the first two moments of the data, and a misspecified working correlation does not affect the root n consistency of the regression parameter estimation; though the misspecification of working correlation does affect the efficiency of the regression parameter estimation (see Liang & Zeger 1986, Fitzmaurice 1995, Qu, Lindsay & Li, 2000). In general, the generalized estimating equation approach lacks a probabilistic interpretation since the estimating functions are not uniquely determined. Therefore, it is not obvious how to do model checking or goodness-of-fit test based on likelihood functions such as the likelihood ratio test. For these reasons, Heagerty & Zeger (2000) developed marginalized multilevel models, a likelihood based method which is less sensitive to misspecified random effects distributions. However, this might be computational complex and difficult since the marginal approach requires integrations, and, if a high dimension of random effects is involved, the estimates may not be analytically tractable. In order to overcome the limitations of the above approaches, Qu, Lindsay & Li (2000) proposed the quadratic inference function (QIF) method to analyze longitudinal data in a semiparametric framework defined by a set of mean zero estimating functions. This approach has the advantages of the estimating function approach, as it does not require the specification of the likelihood function and does not involve intractable computations. It also overcomes the limitation of estimating function approach such as lacking of inference functions. The difference in QIF between two nested models is analogous to the difference in minus twice the log likelihood, so it may provide a semiparametric analog to the likelihood ratio
Chapter 3
Quadratic Inference Function Approaches
51
test. Specifically, it provides an asymptotic chi-squared distribution for goodnessof-fit tests and hypothesis testing for nested regression parameters. Since the QIF method was proposed, this method has been further developed to cope with various difficulties in the analysis of longitudinal data. Its applications have been published in the various statistical journals. This chapter aims to provide a partial review on this topic. This chapter is organized as follows. Section 2 gives background and motivation of the quadratic inference function approach, and presents some theoretic properties of the quadratic inference function estimator. In Section 3, we introduce penalized quadratic inference functions for time-varying coefficient models and variable selection for longitudinal data. Section 4 presents some main ideas about how to apply the quadratic inference function approach for testing whether missing data is ignorable for an estimating equation approach CQu & Song, 2002). Section 4 also demonstrates that quadratic inference function estimation is potentially more robust to outliers than the ordinary generalized estimating equations estimation (Qu & Song, 2004). Some research topics that need to further study are presented in Section 5.
2
The quadratic inference function approach
Suppose that we collect a covariate vector Xij and a response Yij for individual = 1,···,J and i = 1,··· ,no Denote Yi = CYil,··· ,YiJ)T and Xi = (XiI, ... , XiJ ) T. Start with a simple continuous response case. The following linear regression model is useful to explore the relationship between the covariates and the continuous response.
i at time tj, j
(2.1) where ei is assumed to be an independent and identically random error with mean o. It is well known that when the within subject random errors are correlated, the ordinary least squares estimator for j3 is not efficient. To improve efficiency of the ordinary least squares estimator, consider the weighted least squares estimator (WLSE)
i=1
i=1
If the covariance matrix of ei is of the form (7"2~ with a known ~, but unknown (7"2, then WLSE with Wi = ~-1 is the best linear unbiased estimator (BLUE). In practice, the covariance matrix of ei is typically unknown, and therefore the WLSE requires us to specify a covariance structure. In practice, the true covariance structure is generally complicated and unknown, except in very simple situations. How to construct a good estimator for f3 when the covariance structure is misspecified? This poses a challenge in the analysis of longitudinal data.
52
2.1
John J. Dziak, Runze Li, Annie Qu
Generalized estimating equations
The outcome variable could be discrete in many longitudinal studies. Naturally, we consider a generalized linear model for a discrete response. For iid data, the generalized linear model assumes that given the covariates, the conditional distribution of the response variable belongs to the exponential family (McMullagh & NeIder, 1989). In some situations, specification of a full likelihood function might be difficulty. It might be more desirable to assume the first two moments instead. The quasi-likelihood approach can be used to develop inference procedures for the generalized linear model (Wedderburn, 1974). Unfortunately, it is still a challenge for correlated discrete responses, such as binary or count response. As done for weighted least squares approach, the generalized estimating equation (GEE) approach (Liang & Zeger, 1986) assumes only the mean structure and variance structure along with the working correlation for the discrete longitudinal/repeated measurement data. In what follows, we briefly introduce the GEE approach. As in the generalized linear models, it is assumed that
E(YijIXij)
g-l(x'f;{3),
=
Var(Yijlxij)
=
1>V(P,ij),
where P,ij = E(Yijlxij), g(.) is a known link function, 1> is called a dispersion parameter, and V (.) is called a variance function. Let us present a few examples. Example 1. For continuous response variable, the normal error linear regression model assumes that the random error in (2.1) follows a normal distribution. In other words, given Xij, the conditional distribution of Yij is N(xI;i3,u 2 ). Then g(p,) = p" the identity link, and 1> = u 2 and V(p,) = 1. Example 2. For binary outcome variable, the logistic regression model assumes that given Xij, Yij follows a Bernoulli distribution with success probability
p(xd = J
Then g(p)
= p/(l - p),
exp(xfi3) J. 1 + exp(xI;{3)
the logit link, 1>
=
1 and V(p)
= p(l - p).
Example 3. For count outcome variable, the log-linear Poisson regression model assumes that given Xij, Yij follows a Poisson distribution with mean
>'(Xij) = exp(x'f;{3). Then g(>.)
= log(>.),
the log link, 1> = 1 and V(>.)
= >..
For these examples, it is easy to specify the first two moments for Yij marginal ly. Thus, we may further construct quasi-likelihood when the conditional distribution of Yij is not available. It is quite difficult to specify the joint distribution of binary or count responses Yij, j = 1, ... ,J here.
Chapter 3 Let Pi
=
Quadratic Inference Function Approaches
(Pil,··· ,PiJ
f,
53
be a J x d matrix with the j-th row being
Di
8g- l (xT;(3)/8(3, Ai be a J x J diagonal matrix with the j-th diagonal element
V(pij). For a given working correlation matrix R i , the GEE estimator is the solution of the following estimation equation: (2.2) i==l
It can be verified that for model (2.1), the WLSE with weight matrix Wi = A;/2RiA;/2 is the solution of (2.2). As demonstrated in Liang & Zeger (1986), the choice of the working correlation matrix Ri does not affect the consistency of the GEE estimator, but could affect its efficiency; and if Ri is correctly specified, then the resulting estimation is the most efficient.
2.2
Quadratic inference functions
The quadratic inference function (QIF) approach (Qu, Lindsay & Li, 2000) shares the same asymptotic efficiency of the estimator as that of GEE when the correlation matrix equals the true one, and is the most efficient in asymptotic sense for a given class of correlation matrices. Thus, the QIF estimator is at least as asymptotic efficient as the corresponding GEE for a given working correlation. Before we describe the ideas of QIF, we first briefly discuss the common choices of the working correlation matrix Ri in the GEE method. There are two commonly used working correlation matrices: equicorrelated matrix and AR working correlation structures. The equicorrelated (also known as exchangeably correlated or compound symmetric) matrix is defined as
1 p ... P] pl··· p
R=
. . .. ..
.. [ ..
'.
p p ... 1
In the implementation of the GEE method, the inverse of R is used here. For equicorrelated working correlation structure, we have
where 1 is the J x 1 vector with all elements equaling 1, and al = 1/(1 - p) and a2 = - p/ { (1 - p) (1 - P + J p)}. This indicates that the inverse of equicorrelated matrix can be represented as a linear combination of nonnegative definite matrices. For the AR correlation structure, assume that Zt, t = 1,··· ,J is an AR sequence with order q. In other words, we can represent Zj as min{t-l,q} Zt
=
L
j==l
CPjZt-j
+ et·
(2.3)
54
John J. Dziak, Runze Li, Annie Qu
where et's are independent white noise with mean zero and variance (J2. Denote z = (Zl,'" ,zJ)Y, e = (el,'" ,eJ)Y and L to be a lower triangular matrix having ones on its diagonal and (i, i - j)-element -CPj, for i = 1"" ,J and j = 1,'" ,min{i - 1, q}. Then (2.3) can be rewritten as Lz = e. Thus, Lcov(z)L T = cov(e) = (J2 I. This indeed is the modified Cholesky decomposition of cov(z) (see (2.7) below). Therefore, cov-l(z)
= (J-2L T L
Denote U j is a J x J matrix with (i, i - j)-element being 1 and all other elements being O. Note that UfUk = 0 for j -1= k. Then
j
j
j
j
Thus, the inverse of the covariance matrix of an AR sequence can be also represented as a linear combination of symmetric matrices. Based on these observations, the QIF approach assumes that the inverse of a within-subject working correlation matrix R can be expressed as a linear combination 2::~=1 bkMk, where the bk's are unknown constants, and bkMk'S are known, symmetric matrices. Reexpressing (2.2), the GEE estimate is then the value of {3 which sets the following quasi-score to zero: M s = n -1 DiAi-1/2 ( blb l 1
+ ... + br M r ) A-i l / 2(yi
Notice that s is a linear combination of the "extended score" where mi
=
[Dr
A~1/2hMl~~1/2(Yi -
-
J-Li ) .
mn =
(2.4)
~ 2::~=1 bimi,
J-Li)] . (2.5)
Dr A~1/2brMrA~1/2(Yi - J-Li) Since mn contains more estimating equations than the dimension of {3, it could be impossible to set all equations to be zero. Hansen (1982) proposed the generalized method of moment (GMM) which attempts to combine these estimating equations optimally. The GMM could be traced back from the minimum X2 method introduced by Neyman (1949) which is further developed by Ferguson (1958,1996). The GMM estimator 7J is obtained by minimizing m~C-l mn, where C is a weight matrix, instead of solving s = O. Hansen (1982) has shown that the best weight matrix C is the covariance matrix of mn . In practice the covariance matrix of mn is often unknown. Qu, Lindsay & Li (2000) suggested taking the weight matrix C to be its sample counterpart, i.e. n- 2 2::~1 MiMr, and defined the quadratic inference function (2.6)
Chapter 3 and the QIF estimator
Quadratic Inference Function Approaches
fj is
55
defined to be
13 =
argmin(3Q(f3).
The idea of QIF approach is to use data-driven weights, which gives less weight to the estimating equations with larger variances, rather than setting the weights via ad hoc estimators of the parameters of the working correlation structure as in (2.4). There is no explicit form for fj. A numerical algorithm, such as NewtonRaphson algorithm or Fisher scoring algorithm, should be used to minimization Q(f3). See Qu, Lindsay & Li (2000) for details. One of the main advantage of the QIF approach is that if the working correlation is correctly specified, the QIF estimator has an asymptotic variance as low as the GEE. If the working structure is incorrect, the QIF estimator is still optimal among the same linear class of estimating equations, while the GEE estimator with the same working correlation is not. See Qu, Lindsay & Li (2000) for some numerical comparisons. The asymptotic property of the QIF estimator fj has been studied under the framework of the GMM. It has been shown that no matter whether C is consistent for C or not, fj is root n consistent and asymptotic normal, provided that C is positive definite. As shown in Qu & Lindsay (2003), if the true score function is included in M, then the QrF estimator in the context of a parametric model, is asymptotically equivalent to the MLE and thus shares its first-order asymptotic optimality. Unlike GEE, the QIF estimation minimizes a clearly defined objective function. The quadratic form Qn itself has useful asymptotic properties, and is directly related to the classic quadratic-form test statistics (see Hansen 1982, Greene 2000, Lindsay & Qu 2003), as well as somewhat analogous to the quadratic GEE test statistics of Rotnitzsky & Jewell (1990) and Boos (1992). When p < q, Qn can be used as an asymptotic X2 goodness-of-fit statistic or a test statistic for hypotheses about the parameters 13k, it behaves much like a minus twice of log-likelihood. ~
L
• Q(f3o) - Q(f3) ---t X~ under the null hypothesis Ho : 13 = 130· • More generally, if 13 = ['I,b, ~lT where 'I,b is an r-dimensional parameter of interest and ~ is a (p - r)-dimensional nuisance parameter, then the profile test statistic Qn('l,bo,[O) - Qn(;j),[) is asymptotically X; for testing Ho : 'I,b = 'l,bo. This could be used for testing the significance of a block of predictors. Thus, the QIF plays a similar role to the log-likelihood function of the parametric models. For the two primary class of working correlation matrices, the equicorrelated matrices and AR-1 working correlation structure, their inverse can be expressed as a linear combination of several basis matrices. One can construct various other working correlation structure through the linear combination 2::~=1 bkMk. For example, we can combine equi-correlation and AR-1 correlation structures together by pooling their bases together. One may find such linear combination of given correlation structure through the Cholesky decomposition. It is known that a positive definite matrix I; has a modified Cholesky decomposition: I;-l
= LTDL,
(2.7)
56
John J. Dziak, Runze Li, Annie Qu
where L is a lower triangular matrix having ones on its diagonal and typical element -¢ij in the (i, j) position for 1 :::; j < i :::; m, and D is a diagonal matrix with positive elements. As demonstrated for the AR working correlation structure, the modified Cholesky decomposition may be useful to find such a linear combination. In the recent literature, various estimation procedures have been suggested for covariance matrices using the Cholesky decomposition. Partial references on this topic are Barnard et al. (2000), Cai & Dunson (2006), Dai & Guo (2004), Daniels & Pourahmadi (2002), Houseman, et al. (2004), Huang, et al (2006), Li & Ryan (2002), Pan & Mackenzie (2003), Pourahmadi (1999, 2000), Roverato (2000), Smith & Kohn (2002), Wang & Carey (2004), Wu & Pourahmadi (2003) and Ye & Pan (2006). We limit ourselves in this section with the setting in which the observation times tj'S are the same for all subjects. In practice, this assumption may not always be valid. For subject-specific observation times tij'S, we may bin the observation times first, and then apply the QIF approach to improve efficiency based on the binned observation times. From our experience, this technique works well for functional longitudinal data as discussed in the next section.
3
Penalized quadratic inference function
We now introduce the penalized QIF method to deal with high-dimensionality of parameter space. We first show how to apply the QIF for time-varying coefficient models.
3.1
Time-varying coefficient models
Varying coefficient models have become popular since the work by Hastie & Tibshirani (1993). For functional longitudinal data, suppose that there are n subjects, and for the i-th subject, data {Yi(t), XiI (t), ... ,Xid(t)} were collected at times t = tij, j = 1, ... ,ni. To explore possible time-dependent effects, it is natural to consider d
Yi(t)
=
!3o(t)
+L
Xik (t)!3k (t)
+ Ei(t).
(3.1)
k=I
This is called a time-varying coefficient model. In these models, the effects of predictors at any fixed time are treated as linear, but the coefficients themselves are smooth functions of time. These models enable researchers to investigate possible time-varying effects of risk factors or covariates, and have been popular in the literature of longitudinal data analysis. (See, for example, Hoover, et al., (1998), Wu, Chiang & Hoover (1998), Fan & Zhang (2000), Martinussen & Scheike (2001), Chiang, Rice & Wu (2001), Huang, Wu & Zhou (2002,2004) and references therein.) For discrete response, a natural extension of model (3.1) is d
E{Yi(t)lxi(t)} = g-I{!30(t)
+L k=l
Xik(t)!3k(t)}.
(3.2)
Chapter 3
Quadratic Inference Function Approaches
57
With slight abuse of terminology, we shall still refer this model to as a timevarying coefficient model. Researchers have studied how to incorporate the correlation structure to improve efficiency of estimator of the functional coefficients. For a nonparametric regression model, which can be viewed as a special case of model (3.2), Lin & Carroll (2000) demonstrated that direct extension of GEE from parametric models to nonparametric models fails to properly incorporate the correlation structure. Under the setting of Lin & Carroll (2000), Wang (2003) proposed marginal kernel GEE method to utilize the correlation structure to improve efficiency. Qu & Li (2006) proposed an estimation procedure using penalized QIF with L2 penalty. The resulting penalized QIF estimator may be applied to improve efficiency of kernel GEE type estimators (Lin & Carroll, 2000; Wang, 2003) when the correlation structure is misspecified. The penalized QIF method originates from the penalized splines (Ruppert, 2002). The main idea of the penalized QIF is to approximate each (3k(t) with a truncated spline basis with a number of knots, thus creating an approximating parametric model. Let /'O,Z, l = 1"" ,K be chosen knots. For instance, if we parametrize (3k with a power spline of degree q = 3 with knot /'O,z's we have (3k(t)
= 'YkO + 'Yk1t + 'Yk2 t2 + 'Yk3 t3 +
2: 'Yk,l+3(t -
/'O,z)t·
z
Thus, we can reexpress the problem parametrically in terms of the new spline regression parameters 'Y. To avoid large model approximation error (higher bias), it might be necessary to take K large. This could lead to an overfitting model higher variance). To achieve a bias-variance tradeoff by reducing overfitting, Qu & Li (2006) proposed a penalized quadratic inference function with the L2 penalty: K
QC'Y) + n)..
L L 'Y~,l+3' k
(3.3)
Z=l
where).. is a tuning parameter. In Qu & Li (2006), the tuning parameter).. is chosen by minimizing a modified GCV statistic: GCV
=
(1 _ n-1dfQ)2
(3.4)
with dfQ = tr( (Q+n).. Nf1Q), where N is a diagonal matrix such that Dii = 1 for knot terms and 0 otherwise. The GCV statistic is motivated by replacing RSS in the classic GCV statistic (Craven & Wahba, 1979) with Q. For model (3.2), it is of interest to test whether some coefficients are timevarying or time-invariant. Furthermore, it is also of interest to delete unnecessary knots in the penalized splines because it is always desirable to have parsimonious models. These issues can be formulated as statistical hypothesis tests of whether the corresponding 'Y's are equal to zero or not. Under the QIF approach this can be done by comparing Q for the constrained and unconstrained models. See Qu & Li (2006) for a detailed implementation of penalized spline QIF's.
58
3.2
John J. Dziak, Runze Li, Annie Qu
Variable selection for longitudinal data
Many variables are often measured in longitudinal studies. The number of potential predictor variables may be large, especially when nonlinear terms and interactions terms between covariates are introduced to reduce possible modeling biases. To enhance predictability and model parsimony, we have to select a subset of important variables in the final analysis. Thus, variable selection is an important research topic in the longitudinal data analysis. Dziak & Li (2007) gives an overview on this topic. Traditional variable selection criteria, such as AIC and BIC, for linear regression models and generalized linear models sometimes are used to select significant variables in the analysis of longitudinal data (see Zucchini, 2000, Burnham & Anderson 2004, Kuha 2004, Gurka 2006 for general comparisons of these criteria). These criteria are not immediately relevant to marginal modeling because the likelihood function is not fully specified. In the setting of GEE, some workers have recently begun proposing analogues of AIC (Pan, 2001, Cantoni et al., 2005) and BIC (Jiang & Liu, 2004), and research here is ongoing. Fu (2003) proposed penalized GEE with bridge penalty (Frank & Friedman, 1993) for longitudinal data, Dziak (2006) carefully studied the sampling properties of the penalized GEE with a class of general penalties, including the SCAD penalty (Fan & Li, 2001). Comparisons between penalized GEE and the proposals of Pan (2001) and Cantoni, et al (2005) are given in Dziak & Li (2007). This section is focus on summarizing the recent development of variable selection for longitudinal data by penalized QIF. A natural extension of the AIC and BIC is to replace the corresponding negative twice log-likelihood function by the QIF. For a given subset M of {1, ... ,d}, denote Me to be its complement, and i3M to be the minimizer of Qf((3), the QIF for the full model, viewed as a function of (3M by constraining (3Mc = O. Define a penalized QIF (3.5) where #(M) is the cardinality of M, Qf(i3M) is the Qf((3) evaluated at (3M = i3 and (3Mc = O. The AIC and BIC correspond to A = 2 and log(n), respectively. Wang & Qu (2007) showed that the penalized QIF with BIC penalty enjoys the well known model selection consistency property. That is, suppose that the true model exists, and it is among a fixed set of given candidate models, then with probability approaching one, the QIF with BIC penalty selects the true model as the sample size goes to infinity. The best subset variable selection with the traditional variable selection criteria becomes computationally infeasible for high dimensional data. Thus, instead of the best subset selection, stepwise subset selection procedures are implemented for high dimensional data. Stepwise regression ignores the stochastic errors inherited in the course of selections, and therefore the sampling property of the resulting estimates is difficult to understand. Furthermore, the subset selection approaches lack of stability in the sense that a small changes on data may lead to a very different selected model (Breiman, 1996). To handle issues related with high dimensionality, variable selection procedures have been developed to select significant variables and estimate their coefficients simultaneously (Frank & Friedman,
Chapter 3
Quadratic Inference Function Approaches
59
1993, Breiman, 1995, Tibshirani, 1996, Fan & Li, 2001 and Zou, 2006). These procedures have been extended for longitudinal data via penalized QrF in Dziak (2006). Define penalized QrF d
QJ(3)
+n LPAj(l,6jl),
(3.6)
j=l
where PAj (-) is a penalty function with a regularization parameter Aj. Note that different coefficients are allowed to have different penalties and regularization parameters. For example, we do not penalize coefficients 1'0, ... ,1'3 in (3.3); and in the context of variable selection, data analysts may not want to penalize the coefficients of certain variables because in their professional experience they believe that those variables are especially interesting or important and should be kept in the model. Minimizing the penalized QrF (3.6) yields an estimator of,6. With proper choice of penalty function, the resulting estimate will contain some exact zeros. This achieves the purpose of variable selection. Both (3.3) and (3.5) can be re-expressed in the form of (3.6) by taking the penalty function to be the L2 penalty, namely, PAj (l,6j I) = Aj l,6j 12 and the Lo penalty, namely, PAj (I,6j I) = AjI(I,6j I =I- 0), respectively. Frank & Friedman (2003) suggested using Lq penalty, PAj (l,6j I) = Aj l,6j Iq (0 < q < 2). The L1 penalty was used in Tibshirani (1996), and corresponds to the LASSO for the linear regression models. Fan & Li (2001) provides deep insights into how to select the penalty function and advocated using a nonconvex penalty, such as the smoothly clipped absolute deviation (SCAD) penalty, defined by
AI,6I, 2
(.II) _
PA (I f-/
-
(a -1)A2-(Ii3I_aA)2
{
2(a-1) (a+1)A2 2
'
,
if 0 :::; 1,61 < A; if A :::; 1,61 < aA; if 1,61 ? aA.
Fan & Li (2001) suggested fixing a = 3.7 from a Bayesian argument. Zou (2006) proposed the adaptive LASSO using weighted L1 penalty, PAj (l,6j I) = AjWj l,6j I with adaptive weights Wj. The L 1 , weighted L 1 , L2 and SCAD penalty functions are depicted in Figure 1. The SCAD estimator is similar to the LASSO estimator since it gives a sparse and continuous solution, but the SCAD estimator has lower bias than LASSO. The adaptive LASSO uses the adaptive weights to reduce possible bias of LASSO. Dziak (2006) showed that, under certain regularity conditions, nonconvex penalized QIF can provide a parsimonious fit and enjoys something analogous to the asymptotic "oracle property" described by Fan & Li (2001) in the least-squares context. Comparisons of penalized QIF estimator with various alternatives can be found in Dziak (2006).
John J. Dziak, Runze Li, Annie Qu
60
Penalty Functions 3.5 ~--------.------,------,------., , - ,L 1 - - Weighted L 1 3 III'~
2.5 ...
C
2
]
p...1.5
"
. •~
','., '. , , , ... ... ,
, ~: ,
"
~
'", "
~ ~
......
~
,,
",
';""
' OL-----________ .-,.,~~
~~~~
-5
o
...
,
.'
~
"
0.5
, ~ ,"
... ,
,
,
... ... ,
______________
~
5
b
Figure 1: Penalty Functions. The values for>. are 0.5,0.5,0.125 and 1, respectively. The adaptive weight for the weighted L1 penalty is 3
4
Some applications of QIF
It is common for longitudinal data to contain missing data and outlying observations. The QIF method has been proposed for testing whether missing data are ignorable or not in the estimating equation setting (Qu & Song, 2002). Qu & Song (2004) also demonstrated that the QIF estimator is more robust than the ordinary GEE estimator in the presence of outliers. In this section, we outline the main ideas of Qu & Song (2002, 2004).
4.1
Missing data
Qu & Song (2002) define whether missing data is ignorable in the context of the estimating equation setting based on whether the estimating equations satisfy the mean-zero assumption. This definition differs somewhat from Rubin's (1976) definition which is based on the likelihood function. Qu & Song's (2002) approach shares the same basis as Chen & Little (1999) on decomposing data based on missing-data patterns. However, it avoids exhaustive parameter estimation for each missing pattern as in Chen & Little (1999). The key idea of Qu & Song's approach is that if different sets of estimating equations created by data sets with different missing patterns are compatible, then the missing mechanism is ignorable. This is equivalent to testing whether different sets of estimating equations satisfy the zero-mean assumption under common parameters, i.e., whether E(s) = O. This can be carried out by applying an over-identifying
Chapter 3
Quadratic Inference Function Approaches
61
restriction test, which follows a chi-squared distribution asymptotically. For example, suppose each subject has three visits or measurements, then there are four possible patterns of missingness. The first visit is mandatory; then subjects might show up for all appointments, miss the second, miss the third, or miss the second and third. We construct the QIF as
(4.1)
where 8j and 6 j are the estimating functions for the j-th missing pattern group and its empirical variance respectively. If 81, ... ,84 share the same mean structure, then the test statistic based on the QIF above will be relatively small compared to the cut-off chi-squared value under the null. Otherwise the QIF will be relatively larger. Clearly, if 81, ... ,84 do not hold mean-zero conditions under common parameters, the missingness might not be ignorable, since estimating functions formulated by different missing patterns do not lead to similar estimators. It might be recommended to use working independence here, or combine several similar missing patterns together, to keep the dimension to a reasonable size if there are too many different missing patterns. This approach is fairly simple to apply compared to that of Chen & Little (1999), since there is no need to estimate different sets of parameters for different missing patterns. Another advantage of this approach can be seen in the example of Rotnitzky & Wypij (1994), where the dimension of parameters for different missing patterns are different. The dichotomous response variables record asthma status for children at ages 9 and 13. The marginal probability is modeled as a logistic regression (Rotnitzky & Wypij, 1994) with gender and age as covariates: logit{Pr(Yit = I)} = /30
+ /31I(male) + /32I(age =
13),
where Yit = 1 if the i-th child had asthma at time t = 1,2 and I(E) is the indicator function for event E. About 20% of the children had asthma status missing at age 13, although every child had his or her asthma status recorded at age 9. Note that there are three parameters /30, /31 and /32 in the model when subjects have no missing data, but only two identifiable parameters, /30 and /31, for the incomplete case. Since the dimension of parameters is different for different missing patterns, Chen & Little's (1999) approach requires a maximum identifiable parameter transformation in order to perform the Wald test. However, the transformation might not be unique. Qu & Song (2002) do not require such a transformation, but show that the QIF goodness-of-fit test and the Wald test are asymptotically equivalent.
4.2
Outliers and contamination
The QIF estimator is more robust against outlying observations than the ordinary GEE estimator. Both GEE and QIF asymptotically solve equations which lead to
62
John J. Dziak, Runze Li, Annie Qu
an M-estimator. A robust estimator has a bounded influence function (Hampel et al., 1986). The influence function of the GEE is not bounded, while the influence function of the QIF is bounded (Qu & Song, 2004). This could explain why the QIF is more robust than the GEE for contaminated data. Hampel (1974) defines the influence function of an estimator as IF(z, Pj3)
=
inf
~((1- c)Pj3 + c~j3) - ~(Pj3)
E:~O
C
(4.2)
where Pj3 is the probability measure of the true model and ~z is the probability measure with mass 1 at contaminated data point z. If (4.2) is not bounded as a function of z, then the asymptotic bias in ~ introduced by contaminated point z could be unbounded, that is, one could move ~ to infinite value by allowing the contaminated point z to be infinite. Hampel et al. (1986) show that for an M-estimator, solving the estimating equation L~=l Si(Zi, (3), the influence function is proportional to Si(Z, (3). Thus the influence function is bounded if and only if the contribution of an individual observation to the score function is bounded. If the influence function is not bounded, then the asymptotic "breakdown point" is zero and the corresponding estimator could be severely biased even with a singlegross outlier. Consider a simple case of GEE with a linear model using working independence. An individual observation's contribution to the score function is Xit (Yit x~(3), which diverges if Yit is an outlier relative to the linear model. Qu & Song (2004) showed that the QIF does not have this kind of problem. In fact, the QIF has a "redescending" property whereby the contribution of a single anomalous observation to the score function goes to zero as that outlying observation goes to infinity. This is because the weighting matrix C is an empirical variance estimator of the extended score, and the inverse of C plays a major role in the estimation as it assigns smaller weights for dimensions with larger variance. Thus, the QIF automatically downweights grossly unusual observations. This result, however, would not hold for the working-independence structure since in that case the QIF is equivalent to the GEE estimator. Qu & Song (2004) show in their simulation how sufficiently large changes in a few observations can cause drastic effects on the GEE estimator but have only minor effects on the QIF estimator. The ordinary GEE can be made robust by downweighting unusual clusters and/or unusual observations (Preisser & Qaqish, 1996, 1999; He et al. 2002; Cantoni 2004; Cantoni et al. 2005). However, it could be difficult to identify outliers, since if some data do not fit the model well, it is not necessary that they are outliers. In addition, the choice of weighting scheme might not be obvious, and therefore difficult to determine.
4.3
A real data example
In this section we demonstrate how to use the QIF approach in real data analysis. We consider the CD4 data set, described in Kaslow et al. (1987) and a frequentlyused data set in the literature of varying-coefficient modeling (Wu, Chiang & Hoover, 1998; Fan & Zhang 2000; Huang, Wu & Zhou 2002, 2004; Qu & Li
Chapter 3
Quadratic Inference Function Approaches
63
2006). The response of interest is CD4 cell level, a measure of immune system strength, for a sample of HIV-positive men. Covariates include time in years from start of study (TIME), age at baseline (AGE, in years), smoking status at baseline (SMOKE; binary-coded with 1 indicating yes), and CD4 status at baseline (PRE). Measurements were made up to about twice a year for six years, but with some data missing due to skipped appointments or mortality. There were 284 participants, each measured at from 1 to 14 occasions over up to 6 years (the median number of observations was 6, over a median time of about 3.4 years). Some previous work has modeled this dataset with four time-varying linear coefficients:
y(t)
=
(3o(t)
+ (3s(t)SMOKE + (3A(t)AGE + (3p(t)PRE + c:(t).
(4.3)
We chose to extend this model to check for possible second-order terms:
y(t)
=
(3o(t) + (3s(t)SMOKE + (3A(t)AGE + (3p(t)PRE +(3SA(t)SMOKE' AGE + (3sp(t)SMOKE' PRE +(3AP(t)AGE' PRE + (3AA(t)AGE 2 + (3pp(t)PRE2
(4.4)
+ c:(t),
and to center the time variable at 3 to reduce correlation among the resulting terms. In fitting the model, AGE and PRE were also centered at their means to reduce collinearity. We initially modeled each (3 coefficient using a quadratic spline with two evenly spaced knots:
We did not apply a penalty. Combining (4.4) with (4.5) we effectively have a linear model with 45 parameters. Fitting this model under, say, AR-1 working QIF structure, would be very challenging because the nuisance C matrix would be of dimension 90 x 90 and would have to be empirically estimated and inverted, resulting in high sampling instability and perhaps numerical instability. Therefore, we instead start with a working independence structure. The estimated coefficients are depicted in Figure 2. As one might expect, the wobbly curves suggest that this large model overfits, so we will proceed to delete some of the 45 ,),'s to make the fit simpler. If the last four ,),'s for a particular (3 are not significantly different (as a block) from zero we will consider the (3 to be essentially constant in time (a time-invariant effect). If the last three are not significantly different from zero, we will consider the coefficient to be at most a linear function in time (this is equivalent to simply having a linear interaction term between the predictor and time). Otherwise, we will retain the spline form. To test whether a set of ')"s may be zero, we fit the model with and without the added constraint that they be zero, and compare the difference in Q between the models to its null-hypothesis distribution, a chi-squared with degrees of freedom equal to the number of parameters constrained to zero. Because we are using working-independence QIF (for which the number of equations exactly equals the number of coefficients), the full-model QIF will be zero, so the constrained-model QIF serves as the test statistic. It was concluded after some exploratory testing
John J. Dziak, Runze Li, Annie Qu
64
~A
~S
~o
I' /
Vl
J \ ,. - -
0
co.
M
co.
0
\
/
, Vl-
o
2 4 Time
o
6
co.
/,
_
-
"\
\ I
I
I
2
~p ~_~ -
,
\
0 N
4
6
024
6
Time
Time
~SA
~SP
:3-~1 -
I
,"'-"
/
\
/
/
co.<=!_ o
,
\ \
- ,. /
-
Vl-
"\
/
\
<=!-
o
I
I
I
I
~
I
024
6
I
I
I
I
I
024
Time
6
\
o
Time
2 4 Time
~AA
~pp
6
~....r-------,
o-
-
/
'
Vl
o
co. 0
....
'-_
Vl
~ '....
-,---
/
o
o
- - /'/
o
I
I
I
I
2 4 Time
.....
"" \
g -;--'-T""1,'-I rI- /
I
6
o 0
2 4 Time
6
I
I
I
I
024
I
6
Time
Figure 2: Estimated Regression Coefficients for Model (4.4). The solid lines are the estimates, and the dotted lines indicate pointwise 95% confidence interval
that a model constraining f3s and f3A to be constants, and f3SA, f3sp, f3pp and f3AA to be only linear in time, did not fit significantly more poorly than the full model (reduced Q=24.46 with df = 20, p > .2). Thus only f30(t) , f3p(t) and f3AP(t) had to be treated as nonparametric, nonlinear functions. In fact, f3s and f3A were not significantly different from zero at all, but we chose to conservatively leave them in the model as constants, because otherwise the model at any fixed time would contain interactions for which some of the first-order terms were absent. The resulting model had 25 parameters. We now refit the model using QIF with
Chapter 3
Quadratic Inference Function Approaches
55
a working AR-l structure; the C matrix will still be large but not nearly as large as if we had not streamlined the model. The resulting AR-l fit is simpler than the previous model Neither (3s, (3A, (3sp or (3AA appear to be significantly different from a constant zero. This seems reasonable; a model taking covariance into account ought to be less likely to include spurious or non-informative features than one which does not, if only because an unusual subject only acts as a single outlier instead of several outliers. We deleted (3sp and (3AA and refit the model. Our final model is
fj(t)
=
+ fjp(t)PRE + fjAP(t)AGE' PRE + 0.3486SMOKE + 0.08451AGE -{0.3533 + 0.09924(t - 3)}SMOKE' AGE
fjo(t)
-{0.01148 + 0.007018(t - 3)}PRE2 ,
(4.6)
where fjo(t), fjp(t) and fjAP(t) are depicted in Figure 3.
~o
~AP 00
o
,
g-
\
o
o
/ ' -\ ....J /1
c:o.~-
o
,
-
g _. . ,
N
0 N
0
0
2 4 Time
6
I
0
0
2 4 Time
6
....
f"
0
-
/
I,
4 2 Time
6
Figure 3: Estimated coefficient functions in model (4.6). The caption is similar to that in Figure 2
5
Further research and concluding remarks
In this section, we will point out some limitations of the current QIF approach. Further researches is needed to address such limitations.
High dimensionality. Note that the dimension of the weighted matrix C can be very large if the dimension of (3 is large. The weighted matrix suggested in Qu et al. (2000) indeed is the unstructured covariance matrix for M n , and may not work well for large dimensional (3, although it is theoretically optimal. Qu & Lindsay (2003) suggested a possible alternative QIF method, for the case where accurately modeling and inverting the rd x rd matrix C is too difficult or is hard to specify a realistic working covariance structure. Instead of using fully rd dimensions, this reduced-dimension QIF adds only one or a few extra dimensions individually to the initial d, using a modified conjugate gradient algorithm to select the most
John J. Dziak, Runze Li, Annie Qu
66
informative additional moment functions available. It is interest to have further research along this direction. Moment selection. Construction of QIF is based on the moment equations/ conditions (2.5). In practice, some of them are correct, while some are incorrect. It is of interest to select the correct ones. Andrews (1999) proposed several procedures for consistently selecting the correct moment conditions. To deal with issues related to high dimensionality, it would be an interesting topic to identify which moment conditions are least informative and should be omitted to reduce dimensionality. Time-dependent covariates. It is important to know that the classic GEE estimation has limitations in that jj is not necessarily consistent when covariates vary over time (Hu, 1993, Pepe & Anderson, 1994, Davis, 2002). Specifically, (2.2) does not necessarily have expectation zero (i.e., is not necessarily a valid unbiased estimating function) unless either 1. we have
(5.1) for t = 1,··· ,T; i.e., the response now is independent of the predictors in the past or future, conditionally upon the predictor now, or, 2. we use working independence (see Hu, 1993, Pepe & Anderson, 1994, Lai & Small, 2007). This is because the k-th entry of the left-hand side of (2.2) can be rewritten as
n
n
n
L L L Diuk vtu(Yt -
JLt) = 0
(5.2)
i=l t=l u=l
for covariate k. Under the marginal mean and variance assumptions, and independence among subjects, which form the basis of GEE, we are assured that E(Ditk vtt(Yt - JLt)) = 0 and so working-independence GEE (vtu = 0 for t =I- u) will be consistent. However, we are not necessarily assured that E(Diuk vtu(Yt - JLt)) = o for t =I- u. Lai & Small point out that condition (5.1) is met if the predictors are timeinvariant (e.g., gender), if they are constants at any time u given their value at time t (e.g., age), or if they are randomly assigned at each measurement time and affect only the response at that measurement time (e.g., current treatment in a withinsubjects crossover experiment, assuming no carryover effects). It might hold in other cases (e.g., if we are predicting smoking behavior from mood, it might be that smoking today depends only on mood today, not mood on any other day) but then again it might not (e.g., smoking too much yesterday could cause depressed mood today in a person who is trying to quit), so it is not a safe assumption in general. Using working independence, on the other hand, guarantees consistency but entails a serious loss of efficiency in many cases (Fitzmaurice, 1995, Lai & Small, 2007).
Chapter 3
Quadratic Inference Function Approaches
67
To deal with this problem, Lai & Small divided covariates into three possible types. Type I is compatible with (5.1); values Xitk ofthis covariate are independent of each Yit conditional upon the associated Xit. Type II does not satisfy this independence condition, but does satisfy that the response variable is independent of future values of the covariate, conditionally upon current values. That is, for Type II covariates, there may be delayed influences of past covariates upon future responses, but not vice versa. Lastly, Type III covariates may have feedback loops in which the response influences future values of the covariate, so only the moment conditions associated with working independence hold. For an Xk which is Type III, only working-independence estimating equations are valid. If it is Type I, all entries in the sum in (5.1) validly have expectation zero. For a Type II covariate, some of the independence conditions are valid and some are not. Lai & Small proposed evaluating the type of each covariate and then using GMM, i.e., incorporating all possible valid equations, but no invalid ones, into a customized QIF. They also discuss issues in determining the type of a covariate and a possible test which could be used. A major practical difficulty here is uncertainty in classifying covariates into the three types. The proposal of Lai & Small (2007) is potentially important but has not yet received much attention. It is not the same as the QIF of Qu, Lindsay & Li (2000). It is of interest to study whether or not the Qu, Lindsay & Li (2000) method has the same problem as classical GEE with non-Type-I covariates. Final conclusion remark. In this chapter, we introduce the QIF approach, the penalized QIF approach and their applications. It is known that the QIF approach and the GMM method are closely related. The GMM has become very popular in econometrics. Since it proposed by Hansen (1982), there has been much work on addressing various issues in the implementation of the GMM (see Hansen et al., 1996, and Imbens, 2002). Techniques in these work are certainly useful for implementation of the QIF approach. GMM is closely related to the idea of empirical likelihood (see Owen, 1990, Qin & Lawless, 1994). For the connections between them see Imbens (2002), Lindsay & Qu (2003). Further reviews of GMM in the econometrics context is provided by Johnston & DiNardo (1997), Matyas (1999), Greene (2000), or Hansen & West (2002). The major limitation of QIF is that its good properties are asymptotic and may be "likely to be achieved only in very large samples", as analyzed in Johnston & DiNardo (1997) for the GMM method. Indeed, if q is not sufficiently small relative to n then GMM may perform no better, or even worse, than simple alternatives: "much simulation evidence indicates that the first-order asymptotic approximations for fj, and for t tests and J tests, work poorly in samples of typical size" (Hansen & West, 2002, p. 464; see Hansen et al., 1996, Burnside & Eichenbaum, 1996, and Small, et al. 2006). In some cases the poor testing behavior may be improved using bootstrapping (see, e.g., Hall & Horowitz, 1996, Hansen & West, 2002, Lindsay & Qu, 2003). More research here is needed.
68
John J. Dziak, Runze Li, Annie Qu
Acknowledgements The authors would like to thank Professor Jianqing Fan for his invitation. Dziak's research is supported by a National Institute on Drug Abuse (NIDA) grant P50 DAI0075. Li's research was supported by a National Institute on Drug Abuse (NIDA) grant 1 R21 DA024260. Qu's research was supported by an NSF grant DMS-03048764.
References [1] D. W. K. Andrews (1999). Consistent moment selection procedures for generalized method of moments estimation. Econometrica, 67, 543-564. [2] J. Barnard, R. McCulloch and X.-L. Meng (2000). Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Statistica Sinica, 10, 1281-1311. [3] D. D. Boos (1992). On generalized score tests. Amer. Stat., 46, 327-333. [4] L. Breiman (1995). Better subset regression using the nonnegative garotte. Technometrics, 37, 1995, 373-384. [5] L. Breiman (1996). Heuristics of instability and stabilization in model selection. Annals of Statistics, 24, 2350-2383. [6] A. S. Bryk and S. W. Raudenbush. (1992). Hierarchical linear models: Applications and data analysis methods. Sage, Newbury Park, CA. [7] K. P. Burnham and D. R. Anderson (2004). Multimodel inference: Understanding AIC and BIC in model selection. Soc. Methods and Research, 33, 261-304. [8] C. Burnside and M. Eichenbaum. (1996). Small-Sample Properties of GMMBased Wald Tests. J. Bus. and Econ. Statistics, 14, 294-308. [9] B. Cai, and D. B. Dunson (2006). Bayesian covariance selection in generalized linear mixed models. Biometrics, 62, 446-457. [10] E. Cantoni (2004). A robust approach to longitudinal data analysis. Canadian J. of Stat., 32, 169-180. [11] E. Cantoni, J. M. Flemming, and E. Ronchetti (2005). Variable selection for marginal longitudinal generalized linear models. Biometrics, 61, 507-514. [12] H. Y. Chen and R. J. A. Little (1999). Proportional Hazards Regression with Missing Covariates. J. Amer. Stat. Assoc., 94, 896-908. [13] C. -T. Chiang, J. A. Rice, and C. O. Wu (2001). Smoothing Spline Estimation for Varying Coefficient Models With Repeatedly Measured Dependent Variables. J. Amer. Stat. Assoc., 96, 605-619. [14] P. Craven and G. Wahba(1979). Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized crossvalidation. Numerical Mathematics, 31, 377-403. [15] M. Dai and W. Guo (2004). Multivariate spectral analysis using Cholesky decomposition. Biometrika, 91, 629-643. [16] M. J. Daniels and M. Pourahmadi (2002). Bayesian analysis of covariance matrices and dynamic models for longitudinal data. Biometrika, 89, 553-566.
Chapter 3
Quadratic Inference Function Approaches
69
[17] C. S. Davis(2002). Statistical methods for the analysis of repeated measurements. Springer-Verlag, New York. [18] P. J. Diggle, P. Heagerty, K. Y. Liang, and S. L. Zeger (2002). Analysis of longitudinal data (2nd ed). Oxford Univ. Press, New York. [19] J. J. Dziak (2006). Penalized quadratic inference functions for variable selection in longitudinal research. Ph.D. dissertation, Pennsylvania State University, State College, PA. [20] J. J. Dziak and R. Li (2007). An overview on variable selection for longitudinal data. Quantitative Medical Data Analysis using Mathematical Tools and Statistical Techniques, (D. Hong and Y. Shyr, eds). World Scientific Publisher, Singapore. [21] J. Fan and R. Li. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Stat. Assoc., 96, 1348-1360. [22) J. Fan and J. Zhang (2000). Two-Step Estimation of Functional Linear Models With Applications to Longitudinal Data. J. Royal Stat. Soc., B, 62, 303-322. [23) T. S. Ferguson, (1958). A method of generating best asymptotically normal estimates with application to the estimation of bacterial densities. Ann. Math. Statist. 29, 1046-1062. [24) T. S. Ferguson, (1996). A course in large sample theory. Chapman and Hall, Brookhaven, NY. [25) G. M. Fitzmaurice (1995). A caveat concerning independence estimating equations with multiple multivariate binary data. Biometrics, 51, 309-317. [26) 1. E. Frank and J. H. Friedman. (1993). A statistical view of some chemometrics regression tools (with discussion). Technometrics, 35, 109-148. [27J W. J. Fu. (2003). Penalized estimating equations. Biometrics, 59,126-132. [28] W. H. Greene (2000). Econometric analysis, 4th ed. Prentice Hall, New Jersey. [29J M. J. Gurka (2006). Selecting the best linear mixed model under REML. American Stat., 60, 19-26. [30] P. Hall, and J. L. Horowitz. (1996). Bootstrap critical values for tests based on generalized method-of-moments estimators. Econometrica, 64, 891-916. [31] F. R. Hampel. (1974). The influence curve and its role in robust estimation. J. Amer. Stat. Assoc., 69, 383-393. [32] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, W. A. Stahel (1986). Robust statistics: The approach based on influence functions. Wiley, New York. [33] L. P. Hansen. (1982). Large sample properties of generalized method of moments estimators. Econometrics, 59, 1029-1054. [34] L. P. Hansen, J. Heaton, and A. Yaron. (1996). Finite-sample properties of some alternative GMM estimators. J. Bus. and Econ. Stat., 14, 262-280. [35J B. E. Hansen and K. D. West. (2002). Generalized Method of Moments and Macroeconomics. J. Bus. and Econ. Stat., 20, 460-469. [36] T.J. Hastie and R. Tibshirani (1993). Varying-coefficient models. J. Royal Stat. Soc., B, 55, 757-796. [37) X. He, Z. Zhu, and W. Fung (2002). Estimation in a semiparametric model for longitudinal data with unspecified dependence structure. Biometrika, 90, 579-590. [38] P. J. Heagerty and S. L. Zeger (2000). Marginalized multilevel models and
70
John J. Dziak, Runze Li, Annie Qu
likelihood inference. Statist. Sci., 15, 1-26. [39] D. Hedeker and R. D. Gibbons. (2006). Longitudinal data analysis. Wiley, Hoboken, NJ. [40] D. R. Hoover and J. A. Rice and C. O. Wu and L.-P. Yang (1998). Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika, 85, 809-822. [41] F. Hu. (1993). A statistical methodology for analyzing the causal health effect of a time-dependent exposure from longitudinal data. Sc.D. dissert., Dept. of Biostatistics, Harvard School of Public Health, Boston. [42] J. Z. Huang, N. P. Liu, M. Pourahmadi and L. X. Liu (2006). Covariance matrix selection and estimation via penalised normal likelihood. Biometrika, 93, 85-98. [43] J. Z. Huang and C. O. Wu and L. Zhou (2002). Varying-coefficient models and basis function approximations for the analysis of repeated measurements. Biometrika, 89, 111-128. [44] J. Z. Huang and C. O. Wu and L. Zhou. (2004). Polynomial spline estimation and inference for varying coefficient models with longitudinal data. Statistica Sinica, 14, 763-788. [45] G. W. Imbens (2002). Generalized method of moments and empiricallikelihood. J. of Bus. and Econ. Stat., 20, 493-506. [46] W. Jiang and X. Liu. (2004). Consistent model selection based on parameter estimates. J. of Stat. Planning and Inference, 121, 265-283. [47] J. Johnston and J. DiNardo (1997). Econometric methods, 4th edition. McGraw Hill, New York. [48] J. Kuha (2004). AIC and BIC: Comparisons of assumptions and performance. Sociol. Meth. and Research, 33, 188-229. [49] T. Lai, and D. Small (2007). Marginal regression analysis of longitudinal data with time-dependent covariates: a generalized method of moments approach. J. Royal Stat. Soc., B, 69, 79-99. [50] N. M. Laird and J. H. Ware (1982). Random-effects models for longitudinal data. Biometrics, 38, 963-974. [51] Y. Lee and J. A. NeIder (1996). Hierarchical generalized linear models (with discussion). J. Royal Stat. Soc., B, 58, 619-678. [52] Y. Lee and J. A. NeIder (2004). Conditional and marginal models: Another view. Statist. Sci., 19, 219-238. [53] Y. Li and L. Ryan (2002). Modeling spatial survival data using semiparametric frailty models. Biometrics, 58, 287-297. [54] K. Y. Liang and S. L. Zeger (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13-22. [55] X. Lin and R. J. Carroll (2000). Nonparametric function estimation for clustered data when the predictor is measured without/with error. J. Amer. Stat. Assoc., 95, 520-534. [56] B. G. Lindsay, and A. Qu (2003). Inference functions and quadratic score tests. Statist. Sci., 18, 394-410. [57] T. Martinussen and T. H. Scheike (2001). Sampling-Adjusted Analysis of Dynamic Additive Regression Models for Longitudinal Data, Scand. J. of
Chapter 3
Quadratic Inference Function Approaches
71
Stat., 28, 303-323. [58] L. Matyas (1999). Generalized Method of Moments Estimation. Cambridge, New York. [59] P. McMullagh and J. A. NeIder (1989). Generalized Linear Models, 2nd ed. Chapman and Hall, New York. [60] J. M. Neuhaus, J. D. Kalbfleisch, and W. W. Hauck (1991). A comparison of cluster-specific and population-averaged approaches for analyzing correlated binary data. Int. Stat. Review, 59, 25-35. [61] J. M. Neuhaus, W. W. Hauck, and J. D. Kalbfleisch (1992). The effect of mixture distribution misspecification when fitting mixed-effects logistic models. Biometrika, 79, 755-762. [62] J. Neyman (1949) Contribution to the theory of the X2 test. Proc. Berkeley Symp. Math. Statist. Probab. 239-273. Univ. California Press. [63] A. B. Owen (1990). Empirical likelihood confidence regions, Annals of Stat., 18,90-120. [64] J. Pan and G. Mackenzie (2003). On modelling mean-covariance structures in longitudinal studies. Biometrika, 90, 239-244. [65] W. Pan (2001). Akaike's Information Criterion in generalized estimating equations. Biometrics, 57, 120-125. [66] M. S. Pepe and G. L. Anderson (1994). A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Comm. in Stat., B - Simul. and Computation, 23 (1994), 93995l. [67] M. Pourahmadi (1999) Joint mean-covariance models with applications to longitudinal data: Unconstrained parameterisation. Biometrika, 86, 677-690. [68] M. Pourahmadi (2000). Maximum likelihood estimation of generalised linear models for multivariate normal covariance matrix. Biometrika, 87, 425-435. [69] J. S. Preisser and B. F. Qaqish (1996). Deletion diagnostics for generalised estimating equations. Biometrika, 83, 551-562. [70] J. S. Preisser and B. F. Qaqish (1999). Robust regression for clustered data with application to binary responses. Biometrics, 55, 574-579. [71] J. Qin and J. Lawless (1994). Empirical likelihood and general estimating equations. Ann. Stat., 22, 300-325. [72] A. Qu and R. Li (2006). Quadratic inference functions for varying coefficient models with longitudinal data. Biometrics, 62, 379-39l. [73] A. Qu and B. G. Lindsay (2003). Building adaptive estimating equations when inverse of covariance estimation is difficult. J. Royal Stat. Soc., B, 65, 127-142. [74] A. Qu, B. G. Lindsay, and B. Li (2000). Improving generalized estimating equations using quadratic inference functions. Biometrika, 87, 823-836. [75] A. Qu, and P. X. Song (2002). Testing ignorable missingness in estimating equation approaches for longitudinal data. Biometrika, 89, 841-850. [76] A. Qu, and P. X. Song (2004). Assessing robustness of generalised estimating equations and quadratic inference functions. Biometrika, 91, 447-459. [77] A. Rotnitzky and N. P. Jewell (1990). Hypothesis testing of regression parameters in semi parametric generalized linear models for cluster correlated data.
72
John J. Dziak, Runze Li, Annie Qu
Biometrika, 77, 485-497. [78] A. Rotnitzky and D. Wypij (1994). A note on the bias of estimators with missing data. Biometrics, 50, 1163-1170. [79] A. Roverato (2000). Cholesky decomposition of a hyper inverse Wishart matrix. Biometrika, 87, 99-112. [80] D. B. Rubin (1976). Inference and missing data. Biometrika, 63, 581-592. [81] D. Ruppert (2002). Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics, 11, 735-757. [82] D. Small and J. Gastwirth and A. Krieger and P. Rosenbaum (2006). R-estimates vs. GMM: A theoretical case study of validity and efficiency. Statistical Science, 21, 363-375. [83] M. Smith and R. Kohn (2002). Parsimonious covariance matrix estimation for longitudinal data. J. Amer. Stat. Assoc., 97, 1141-1153. [84] R. Tibshirani. (1996). Regression shrinkage and selection via the Lasso. J. Royal Stat. Soc., B, 58, 267-288. [85] L. Wang and A. Qu (2007). Consistent model selection and data-driven smooth tests for clustered data. Manuscript. [86] N. Wang. (2003). Marginal nonparametric kernel regression accounting for within-subject correlation. Biometrika, 90, 43-52. [87] Y. Wang and V.J. Carey (2004). Unbiased estimating equations from working correlation models for irregularly timed repeated measures. J. Amer. Stat. Assoc., 99, 845-853. [88] R. W. M. Wedderburn (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika, 61, 439-447. [89] C. O. Wu and T. Chiang and D. R. Hoover (1998). Asymptotic Confidence Regions for Kernel Smoothing of a Time-Varying Coefficient Model With Longitudinal Data. J. Amer. Stat. Assoc., 88, 1388-1402. [90] W. Wu and M. Pourahmadi (2003). Nonparametric estimation of large covariance matrices of longitudinal data. Biometrika, 90, 831-844. [91] H. Ye and J. Pan (2006). Modelling of covariance structures in generalised estimating equations for longitudinal data. Biometrika, 93, 927-941. [92] S. L. Zeger and K.- Y. Liang and P. S. Albert (1988). Models for longitudinal data: A generalized estimating equation approach. Biometrics, 44, 1049-1060. [93] H. Zou (2006). The Adaptive Lasso and Its Oracle Properties. J. Amer. Stat. Assoc., 101, 1418-1429. [94] W. Zucchini (2000). An introduction to model selection. J. Mathematical Psych., 44, 41-61.
Chapter 4 Modeling and Analysis of Spatially Correlated Data * Yi Li t Abstract The last decade has seen a resurgence of statistical methods for the analysis of spatial data arising from various scientific fields. This chapter reviews these methodologies, mainly within the geostatistical framework. We consider data measured at a finite set of locations and draw inference about the underlying spatial process, based on the partial realization over this subset of locations. Methodologically, we employ linear mixed models and generalized linear mixed models that enable likelihood inference for fully observable spatial data. For spatial data subject to censoring, we review a class of semiparametric normal transformation models for spatial survival data. A key feature of this model is that it provides a rich class of models, where regression coefficients have a population-level interpretation and the spatial dependence of survival times is conveniently modeled using flexible normal random fields. This would be appealing to practitioners, especially given that there are virtually no spatial failure time distributions that are convenient to work with.
Keywords: Geostatistical data; areal data; semivariogram; kriging; spatial (generalized) linear mixed models; EM algorithm; PQL approximation; spatial survival data; semiparametric normal transformation; conditional Martingale covariance rate function.
1
Introd uction
Spatial data are commonly acquired to serve diverse scientific purposes. For example, meterologists are interested in the amount of precipitation over adjacent tropical forests. Mining engineers are keen to predict the conserve of a new oil field based on the product of nearing fields. Political geologists link election results to spatial regions so as to map political power on geographical space. Economists study economic activities on macro spatial scales, as factors like access to the sea and to the raw materials tend to impact economic activities at a country level. Epidemiologists investigate the incidence of a disease and its variations across regions for a better understanding of the etiology. Environmentalists monitor the -The author is partly supported by an NIH ROl grant. tHarvard University, Boston, MA, USA. E-mail: [email protected]
73
74
Yi Li
level of PM2.5 (a particulate matter that can travel into a person's lungs) in airpollution monitoring sites and characterize its spatial distributions. The shared features of data arising from these studies fall into the following two categories: (a) geostatistical (or point-referenced) type, where the outcomes are random variables indexed by locations that vary continuously over a subset of a Euclidean space. (b) areal type, where the locations are finite number of areal units with well-defined boundaries and the data are typically summary statistics over these units. Some spatial data sets even feature both geostatistical and areal types. For example, in the NCI Surveillance Epidemiology and End Results (SEER) data, the outcome for each individual was measured, while the location information was only available at the county (areal unit) level. As individual-level or point-referenced data have become increasingly common in public health studies with the use of the Geographic Information System (GIS) technology and geocoding of individua addresses, this chapter focuses on the statistical analysis of geostatistical data, while necessary modifications for analysis of the areal data will be discussed. Spatial data analysis is challenged by the presence of spatial dependence among observations. We give a simple example to illustrate the effect of correlation on analysis. First consider independent samples Y1 , ... ,Yn from a normal distribution with mean Il and known variance (J'2. The most efficient unbiased estimator of Il is the sample average y- = L:i Y;,/n, which follows a normal distribution with mean Il and variance (J'2/n, yielding a two-sided 95% confidence interval for Il (Y - 1.96(J'/vn, y- + 1.96(J'/vn). Now instead of independent data, suppose that the data exhibit a spatial correlation in Rl that decreases exponentially as the separation between data points increases cov(Yi, Yj) = (J'p1i- jl . (1.1) Under (1.1), y- will still follow a normal distribution with mean Il, but with variance var(Y)
= (J'2/n [1 + 2{p/(1 - p)}(l - l/n) - 2{p/(1 - p)}2(1 - pn-l )/n]. (1.2)
For n = 10 and p = 0.26, var(Y) confidence interval for Il
=
(J'2/1O x 1.608, resulting in a two-sided 95%
(Y - 2.485(J'/JiO, y- + 2.485(J'/JiO); see Cressie (1993). Thus, failure to account for the underlying correlations among data tends to narrow the confidence intervals. More intuitive explanation of the impact of spatial correlation can be obtained from (1.2), based on which the effective sample size can be computed as
n*
=
n[l + 2{p/(1- p)}(l-l/n) - 2{p/(1- p)}2(1_ pn-l)/nrl.
For large n, it follows that n* /n = (1 - p)/(l correlation is palpable even for large samples.
+ p),
hinting that the effect of
Chapter
4
Modeling and Analysis of Spatially Correlated Data
75
Hence, it is important to take into account the spatial correlation for correct inference. Mixed effects models have provided a convenient means of modeling spatial correlations by using random effects, with common spatial correlation structures including, for example, random fields (for geostatistical data) and autoregressive (CAR) structure (for areal data) (Waller, et al., 1997). Over the past two decades, spatial statistical methods have been well established for normally distributed data (Cressie, 1993; Haining, et al., 1989) and discrete data (Journel, 1983; Cressie, 1993; Carlin and Louis, 1996; Diggle et al., 1998). Statistical models for such data are often fully parameterized, and inference procedures are based on maximum likelihood (Clayton and Kaldor, 1987; Cressie, 1993), penalized maximum likelihood (Breslow and Clayton, 1993) and Markov chain Monte Carlo (Besag, York, Mollie, 1991; Waller et al., 1997). Further complicate the analysis of spatial data is the presence of censoring for outcomes as reflected in many epidemiological and social behavioral studies. For example, in the East Boston Asthma Study on childhood asthma, subjects were enrolled at community health clinics in the east Boston area, and questionnaire data, documenting ages at onset of childhood asthma and other environmental factors, were collected during regularly scheduled visits. Apart from the basic demographic data, residential addresses were geocoded for each study subject so that the latitudes and longitudes were available. Residents of East Boston are mainly relatively low income working families. Children residing in this area have similar social economical backgrounds and are often exposed to similar physical and social environments. These environmental factors are important triggers of asthma but are often difficult to measure in practice. Ages at onset of asthma of the children in this study were hence likely to be subject to spatial correlation. The statistical challenge is to identify significant risk factors associated with age at onset of childhood asthma while taking the possible spatial correlation into account. Li and Ryan (2002) proposed semi-parametric frailty models for spatially correlated survival data, where the spatial units are prespecified geographic regions (e.g. census tract). Their approach is to extend the ordinary frailty models to accommodate spatial correlations and exploit a robust rank likelihood-based inferential procedure. However, frailty survival models typically do not possess regression coefficients with population-level interpretations, less appealing to population scientists. In this chapter, we review the methodologies for the analysis of spatial data, mainly within the geostatistical framework. That is, the data consist of the measurements at a finite set of locations and the statistical problem is to draw inference about the spatial process, based on the partial realization over this subset of locations. The rest of this chapter is organized as follows. We introduce the building blocks of geostatistical modeling, including stationarity, isotropy and (semi-) variograms. We next discuss statistical models, including linear mixed models and generalized linear models, that enable likelihood inference for fully observable spatial data. For spatial data that are subject to censoring, we review a semiparametric normal transformation model that was recently developed by Li and Lin (2006). 
A key feature of this model is that it provides a rich class of models where regression coefficients have a population-level interpretation and the spatial
76
Yi Li
dependence of survival times is conveniently modeled using flexible normal random fields. We conclude this chapter with further topics and some open questions.
2
Basic concepts of spatial process
We present the essential elements of geostatistical spatial models, starting with the fundamental underlying concept of a stochastic spatial process {Y(s), s E V}. For example, Y(s) represents the level of PM2.5 at monitor site s, and V is a fixed subset of Euclidean space RT, containing all the pollution monitoring sites. In the spatial context, The dimension of the Euclidean space, r, is often 2 (latitude and longitude) or 3 (latitude, longitude and altitude above sea level). Assuming the existence of the first two moments of Y (s) for every s E V, the first moment E{Y(s)} = f.t(s) is often termed trend or drift, while the existence of the second moment allows the definition of (weak) stationarity. To be more specific, a spatial process is weakly stationary if f.t(s) == f.t (i.e. the process has a constant mean) and
cov{Y(s) , Y(s
+ h)} =
C(h)
for all h E RT such that s,s + h E V, where C(h) is termed the covariance function. In contrast, a spatial process is termed strongly stationary if, for any given n ~ 1 and any given SI,··· ,Sn and any h E RT (as long as Si + h E V), (Y(sI),··· ,Y(Sn)) has the same joint distribution as (Y(SI + h),··· ,Y(sn + h)). Of course, strong stationarity implies weak stationarity, but not vice versa. Apart from these two stationary types, there indeed exists a third type of stationarity called intrinsic stationarity. Here, we assume E(Y(s)) == f.t and define intrinsic stationarity if E(Y(s + h) - Y(s))2 depends only on h. If that is the case, we write 2'Y(h) = E(Y(s + h) - Y(s))2 and call 2'Y(h) the variogram and 'Y(h) the semivariogram. In the following, we use I . I to denote the Euclidean norm for a vector. The behavior of 'Y(h) near Ihl = 0 is informative about the continuity properties of Y(·). Specifically, (i) if 'Y(h) ----t 0 as Ihl----t 0, then Y(.) is L2 continuous, namely, IY(s + h) - Y(S)IL2 ----t 0 for any s as Ihl ----t O. For example, a Brownian motion in Rl is L2 continuous; (ii) if 'Y(h) does not approach 0 as Ihl ----t 0, then Y(·) is not L2 continuous and termed irregular. The discontinuity of 'Y(h) at 0 is called the nugget effect reflecting microscale variation, which will be discussed later; (iii) if 'Y(h) is a positive constant, then Y(sd and Y(S2) are uncorrelated for any SI -=I=- S2, regardless of their proximity. If the semivariogram 'Y(h) depends on vector h only though its length, we call the underlying process isotropic, reflecting that the pairwise correlations among subjects depend only on their distances; otherwise, it is called anisotropic. Isotropic processes are popular in spatial data analysis for their interpretability and availability of parametric functions for the semivariogram, which can be simply written as 'Y(lhj). The Matern model has recently emerged as a powerful model for 'Y(lhl)
Chapter
4
Modeling and Analysis of Spatially Correlated Data
77
in practice, which is given by (2.1) where
2
m«(]"2,(,I/,d) =
2V-~r(I/) (2(ylvdt Kv(2(ylvd)
(2.2)
is the Matern function. Here, ( measures the correlation decay with the distance and 1/ is a smoothness parameter, f(·) is the conventional Gamma function, KvO is the modified Bessel function of the second kind of order 1/ (see, e.g. Abramowitz and Stegun, 1965). While 1'(0) = 0 by definition, limlhl---+o+ = 7 2 , termed nugget, which characterizes local variations; in addition, limlhl---+oo = 7 2 + (]"2 is called sill; finally, sill minus nugget is termed partial sill, which is (]"2 in this case. Model (2.2) is rather general, special cases including the exponential function (]"2 exp( -d) when the smoothness parameter v = 0.5 and the "decay parameter" ( = 1, and the "Gaussian" correlation function (]"2 exp( -d2 ) corresponding to v --+ 00 and ( = 1. In cases like these, ( characterizes the effective range, the distance at which there is practically no lingering spatial correlation. Other common choices of semivariogram functions can be found in Cressie (1993) and Banerjee et al. (2003). Given a variety of choices of semivariogram function, a natural question, of course, is to decide which best fits a given data or whether the data can distinguish them. It is customary to empirically estimate the semivariogram, with the goal of comparing it to the theoretical shapes. Under the constant mean assumption, a straight-forward estimator, due to Matheron (1962), would be 2i'(lhl)
=
IN~h)1
L
(Y(Si) - Y(Sj))2,
(2.3)
NCh)
where N(h) = {(Si' Sj) : lSi - Sj I = Ihl} denotes the collection of pairs that are distanced by Ihl and IN(h)1 is the number of distinct pairs in N(h). In practice, however, Matheron's estimator is of limited value unless the observations fall on a regular grid. Instead, we would partition the half-line into distance bins h = (0, hd,I2 = [h2' h3) and up to IK = [hK-I, h K ) for some prespecified 0< hI < ... < hK. Then we can alter the definition of N(h) by N(hk) = {(Si' Sj) : lSi - sjl E
h}, k = 1,··· ,K.
For the choice of hk and K, see Journel and Huijbregts (1979). Moreover, as (2.3) is sensitive to outliers, a more robust fourth-root-of-squared-difference estimator can be constructed; see Cressie and Hawkins (1980). We close this section by noting the relationship between the semivariogram I'(h) and the covariance function C(h). Apparently, I'(h)
= C(o) - C(h).
(2.4)
Hence, given C, we can recover 1'. But how about recovering C from I'? It turns out that we need to add more condition, i.e. C(h) --+ 0 when Ihl --+ o. This
Yi Li
78
condition is sensible as it regulates that the covariance between two observations diminishes as they become farther apart. Taking the limit of both sides of (2.4), we have C(o) = limlhl--->oo ,(h). Thus, we have that C(h)
=
lim ,(h) -,(h). Ihl--->oo
(2.5)
In general, such limit may not exist; but if it does, the process is weakly stationary with covariance function C(h).
2.1
Spatial regression models for normal data
In many spatial studies, also observed along with the outcomes are exploratory variables, measuring for example the characteristics of each individual at given locations. Within the framework of spatial process, we denote such covariate process by (X(s), S E V). The spatial linear mixed model of Y(s) given X(s) can be written as Y(s) = X(s)',6 + Z(s), (2.6) where X( s) is a q x 1 covariate vector, ,6 is a vector of regression coefficients in some open subset of Rq, say, 5, and Z(s) is a mean zero (weakly stationary) Gaussian process with spatial covariance function
cov{Z(s),Z(s')}
=
C(s,s';O)
(2.7)
where 0 is a k x 1 vector of spatial dependence parameters in some open subset of Rk, say, A. The parameter ,6 characterizes the deterministic part of the spatial data and is sometimes called the trend parameter, while 0 characterizes the variability of the underlying spatial field through the spatial covariance function C(s, s'; 0). A common practice is to further assume Z(s) is isotropic so that the covariance function C(·) depends on sand s' only through their distance Is - s'l. In this case, we write C(s, s'; 0) as C(ls - s'l; 0), which can be easily specified from the semivariogram" e.g. the Matern model (2.1), via relationship (2.5). We are in a position to draw inference based on model (2.6). Given a finite number of locations Sl,··· ,Sn, then the n x n matrix :En = [C(lsi - sjl;O)] is positive-definite. Let Yn = (Y(Sl),··· ,Y(sn)) denote the outcome data and Xn = (X(Sl),··· ,X(sn))' denote the design matrix. Let 8 = (,6',0'), denote the (q + k) x 1 parameter vector. Then the log likelihood is
Ln(8)
=
-n/2log(27r) -1/2log(l:E n l) -1/2(Yn - Xn,6)':E;;-l(y n - Xn,6).
A
The maximum likelihood estimator 8
=
(2.8)
A' A'
(,6 ,0 )' satisfies
Computational details are given in Cressie (1993). However, the MLE of especially when the sample size is small. A common
o can be seriously biased,
Chapter
4 Modeling and Analysis of Spatially Correlated Data
79
remedy is the restricted maximum likelihood estimation (REML) approach that filters the data so that the joint distribution of the filtered data is free of (3. To proceed with the REML, we need to introduce a new concept, error contrast. Specifically, we call a linear combination of outcome, say, a'Y n, where a E Rn, an error contrast if E(a'Y n) = 0 for all (3 and 8; hence, a'Y n is an error contrast if and only if a'X n = o~, where Oq is a q x 1 zero vector. We assume that the design matrix Xn is full rank, i.e. rank(Xn) = q. Hence, the kernel space of Xn has dimension n - q. That is, we can find an n x (n - q) full rank matrix A (i.e. rank(A) = n - q)such that A'Xn = 0, yielding a vector of n - q linearly independent error contrast A'y n ~f W n. Under the normality assumption, W n rv MV N(on' A'~n(8)A), which does not depend on {3. Here, MV N (IL,~) denotes a multivariate normal distribution with mean vector IL and variance-covariance matrix ~. The choice of A is not unique; however, Harville (1974) showed that the log likelihood function would differ each other only by an additive constant for various A's that form the n - q linearly independent contrasts. Indeed, for the A that satisfies AA' = I - Xn (X~ Xn) -1 X~ and AA' = I, the log likelihood function for Wn = A'Y n is
LR(8) = -(n - q)/210g(27f) -1/210g IX~Xnl + 1/2 log l~n(8)1 +1/210gIX~~~1(8)Xnl + 1/2Y~II(8)Yn
(2.9)
where 11(8) = ~~1(8) - ~~1(8)Xn(X~~~1(8)Xn)-lX~~~1(8). An REML estimator {j REM L would satisfy
A Basyesian justification has been provided by Harville (1974), which showed the marginal posterior density for 8 is proportional to (2.9) with a noninformative prior on (3. Estimation is also feasible based on (2.6) without imposing normality assumption on the spatial component Z(s). Instead, we assume the existence of the first two components of Z(s) such that E(Z(s)) = 0 and Z(s) has spatial covariance function C(·; 8) as defined in (2.7). Assuming for now that 8 is known, we can obtain the best linear unbiased estimator of (3 by minimizing the quadratic "loss" function" where ~n(8)
=
[Cij (8)] and C ij (8) = C(Si - Sj; 8), i,j = 1,···
,n, leading to (2.10)
In reality, 8 is most likely unknown, necessitating an iterative reweighted least squares estimation:
Yi Li
80 where
O(k)
is a variogram-based estimate (see Cressie, 1993) based on the current A(k)
-
A(k)
-
(k)
estimate of {3 . Write {3 = limk---+oo {3 and 0 = limk---+oo () . It can be shown that this procedure yields an asymptotically efficient and consistent estimate for {3, whose variance matrix is subsequently
FUrther details of statistical properties can be found in del Pino (1989). Though our models, along with the inferential procedures, are framed within the geostatistical context, they can easily accommodate areal data with a proper :En. A popular choice of :En for the areal data is the conditional auto-regressive (CAR) structure, possessing both appealing theoretical properties and attractive interpretation (Cressie, 1993). The CAR structure assumes that the full conditional distribution of the Z(Si) [or Y(Si)] on the rest of data depends only on its "neighbors". In particular, for the normal data, the CAR model states that
where qii = 0, qi+ = Lj qij and the nonnegative qij controls the strength of connection between areas i and j, and often takes value 0 when areas i,j are not neighbors, -1 :::::; () :::::; 1 is the spatial dependence parameter controlling the amount of information in an area provided by its neighbors, and T2 is the nugget, measuring local variations. When areas i and j are neighbors, a common choice of qij is qij = 1, reflecting equal weights from neighboring areas. Brook's lemma (Brook, 1964) suggests a multivariate normal distribution for Z(Sl),··· ,Z(sn) with mean 0 and variance matrix
(2.11) where Q = {qij} is an n x n symmetric matrix; M is an n x n diagonal matrix with diagonal elements l/qi+. It is worth noting that the flexibility of the CAR structure allows for a more general neighborhood concept than a mere geographical proximity (Cressie, 1993).
2.2
Spatial prediction (Kriging)
Having estimated the trend parameter (3 and the spatial correlation parameter 0, we are ready to discuss geostatistical techniques to predict the value of a random field at a given location from nearby observations. Such techniques are termed kriging, developed by a French mathematician Georges Matheron and named after Daniel Gerhardus Krige, a mining engineer who developed a distance-weighted average method for determining gold grades based on nearby samples. Specifically, kriging is about interpolating the value Y(so) of a random field Y(s) at a prespecified location So from observations Yi = Y(Si), i = 1,··· ,n. In essence, it is a minimum-mean-squared-error method of prediction that depends on the second-order properties of Y (.), via computing the best linear unbiased
Chapter 4 Modeling and Analysis of Spatially Correlated Data
81
estimator Y(so) for Y(so), given by n
YK(so) = Co +
L WiY(Si) = CO + w'y n· i=1
The weights w ~f (WI, . .. ,Wn )' are chosen in such a way that the prediction error variance (also called kriging variance or kriging error)
O";(xo)
=
var (Yk(so) - Y(so))
=
var(Y(so)) - 2w'k(so) + w'~n w,
where k(so) = (C(so - SI; 8),··· ,C(so - Sn; 8))' and ~n = [C(Si minimized subject to the unbiasedness condition:
(2.12) Sj;
8)]nxn, is (2.13)
This optimization problem can be solved by using the Lagrange multipliers, yielding a closed-form kriging estimator
(2.14) where
j3
is obtained from (2.10), with prediction error variance given by
O";(so) = [var(Y(so)) - k(SO)'~~lk(so)] +(X(so) - X~~~lk(so))'(X~~~IXn)-I(X(so) - X~~~lk(so)). Refer to Ripley (1981) for detailed derivations. Several issues merit attention. First, ~n is an n x n matrix, presenting much difficulty for inverting it, especially when n is large. An attractive solution is to use a subset of (SI,··· ,sn), say, (Kl,··· ,KK), where K« n, as representative knots and perform low rank kriging as proposed by Kammann and Wand (2003). This subset can be obtained while an efficient space filling algorithm (e.g. Nychka and Saltzman, 1998). Secondly, the development so far needs the covariance function C(·; 8) or the spatial variance component (J is known. When (J is unknown (which is often the case), one needs to replace (J that involves in (2.14) by its consistent estimate 0 (discussed in the previous section), but it remains an open problem to derive the prediction error variance that also accounts for the variability of 0. A kriging method that blends prediction and nonparametric estimation of the covariance function C(·) was given by Opsomer et al. (1999), though the estimate of the prediction error that fully accounts variations from all sources, e.g. owing to estimation of f3 and variance function, is still elusive. Thirdly, equation (2.14) reveals that the classical kriging uses the linear combination of the observed values to approximate that of a new location, with larger weights assigned to more nearby locations. However, in many situations, e.g. for
Yi Li
82
non-normal data, such linearity assumption is too strict and may not be plausible. We therefore consider an optimal predictor that minimizes the following conditional-mean-squared-error function (2.15) where p(Y n; so) is a predictor at So based on the observed Y n. In view of
it is obvious that the optimal predictor is
When Y(·) is a Gaussian process, the optimal predictor Yo(so) coincides with the classical kriging Yk(SO) in (2.14). For non-normal data, we will consider a generalized linear mixed model, based on which the optimal predictor will be nonlinear and often requires numerical approximations. Finally, as we have been confined to use a fully parametric form to model the trend surface, i.e. the deterministic part of the spatial process, as in (2.6), one possible alternative is to estimate the trend surface by non parametric regression. This would produce enough flexibility to absorb the signal almost completely into the trend, effectively diminishing spatial dependence. Mueller (2000) considered the following model in the spirit of Hastie and Tibshirani (1993),
Y(s)
=
rJ(X(s), ,8(s))
+ Z(s)
where rJ is assumed to be a smooth function and Z(s) is a white noise with cov(Z(s), Z(s')) = 0 if s -=I s' and var(Z(s)) = (72(s). By allowing,8 to change smoothly over the location s, one would be able to recover the trend surface of the underlying spatial process. A unified approach that encompasses both regression and kriging based on this model is given by Host (1999).
3
Spatial models for non-normal/discrete data
While by far we have focused on the normal outcome data, non-normal data do frequently arise from spatial studies. Sometimes it is possible to transform the data so that they feature more as realizations from the Gaussian process, but in most cases, especially for discrete data, such transformation is not possible. Examples include the forest defoliation study reported in Heagerty and Lele (1998), with binary outcomes indicating the presence or absence of Gypsy moth egg masses at given locations, and the sudden infant death (SID) study (Cressie and Read, 1989), which studied the counts of SIDs cross 100 counties of North Carolina. Statistical models for independent non-normal data, can be traced back as early as 1934, when Bliss (1934) proposed the first probit regression model for binary data. It was not, however, until four decades later did Neider and Wedderburn (1972) and McCullagh and Neider (1983 1st ed., 1989 2nd ed.) propose Generalized Linear
Chapter 4 Modeling and Analysis oj Spatially Correlated Data
83
Models (GLMs) to unify the models and modeling techniques for analyzing more general data (e.g. count data and polytomous data). Several authors (Laird and Ware, 1982; Stiratelli et al., 1984; Schall, 1991, among others) considered a natural generalization of the GLMs to accommodate correlated non-normal data by incorporating random terms into the linear predictor parts. The resulting models are termed generalized linear mixed models (GLMMs), providing a convenient and flexible way to model multivariate non-normal data. In particular, GLMMs constitute a unified framework for modeling geostatistical non-normal data, using mixed terms to model the underlying spatial process. We call the special application of GLMMs to the geostatistical data as spatial generalized linear mixed models (see e.g. Diggle et al., 1998; Zhang, 2002), which is to be discussed in the next section.
3.1
Spatial generalized linear mixed models (SGLMMs)
Consider a simple illustration of a spatial logistic regression for binary data: Y(s)jX(s), Z(s)
i'!!:.,d
Bernoulli(/-Ls);
logit(/-Ls) = X(s)'j3
+ Z(s)
(3.1)
where Y(s) denotes the binary outcome (e.g. 1 corresponds to the presence of Gypsy moth egg masses and 0 to the absence) at location s, X(s) is a vector of additional individual-level covariates of interest and Z(s) are unobserved spatially correlated random effects indexed by s. In practice, we posit a random field structure on Z(s), with covariance structure specified as in Section 2. This class of logistic regression model was originally designed for prospective studies, but is also applicable to case-control studies. Inference based on (3.1) has been detailed in Paciorek (2007). It is straightforward to generalize (3.1) to accommodate more general data beyond binary outcomes. Specifically, conditional on unobserved spatial random variables Z(s), the Y(s) are assumed to be independent and follow a distribution of the exponential family: Y(s)jX(s), Z(s)
j(Y(s)IX(s), Z(s)), j(Y(s)IX(s), Z(s)) = exp{[Y(s)as - h(a s )]jr2 - c(Y(s), rH. i'!!:.,d
(3.2) (3.3)
The conditional mean of Y(s)IX(s), Z(s), denoted by /-Ls, is related to as through the identity /-Ls = 8h(a s )/8as . It is to be modeled, after a proper transformation, as a linear model in both the fixed and spatial random effects:
g(/-Ls)
=
X(s)'j3
+ Z(s).
(3.4)
Here, g(.) is coined a link junction, often chosen as an invertible and continuous function, and Z(s) is assumed to have a random field structure, whose covariance function is characterized by a finite dimensional parameter 0, termed the spatial
variance components.
Yi Li
84
Model (3.4) is comprehensive and encompasses a variety of models, including the aforementioned spatial logistic regression model as a special case. Specifically, for binary outcome data, let
h(o:) = log{l + exp(o:)},
7
== 1, C(y,7 2 ) == O.
Choosing g(f.1) = logit(f.1) yields spatial logistic regression model (3.1), while choosing g(f.1) = iI>-1(f.1), where iI>(.) is the CDF for a standard normal, gives a probit random effects model. On the other hand, for continuous outcome data, by setting
and g(.) to be an identity function, model (3.4) reduces to a linear mixed model. For count data, putting
h(o:) = e'''',
7
== 1, c(y, 7 2 ) = logy
and choosing g(f.1) = log(f.1) results in a Poisson regression model. Given data (Y(Si), i = 1"" ,n), (3.2) and (3.3) induce the following log likelihood that the inference will be based on:
£ = log
JIT
f(Y(Si)IX(Si), Z(Si); ,B)f(ZIXn; 9)dZ,
i=l
where the integration is over the n-dimensional random effect Z = (Z(sd,··· ,
Z(sn))' . We can further reformulate model (3.4) in a compact vectorial form. With Y n, Xn defined as in the previous section, we write
g{E(Y nlXn, Z)}
=
Xn(3 + AZ,
(3.5)
where A is a non-random design matrix, compatible with the random effects Z. The associated log likelihood function can be rewritten as
£(Y nlXn; (3, 9) = log L(Y nlXn; (3, 9) = log
J
f(Y nlXn, Z; (3)f(ZIX n ; 9)dZ,
(3.6) where f(Y nlXn, Z; (3) is the conditional likelihood for Y nand f(ZIX n ; 9) is the density function for Z, given the observed covariates X n . Model (3.5) is not a simple reformat - it accommodates more complex data structure beyond spatial data. For example, with properly defined A and random effects Z it encompasses non-normal clustered data and crossed factor data (Breslow and Clayton, 1993). When A is defined as matrix indicating membership of spatial regions (e.g. counties or census tracts), (3.5) models areal data as well. Model (3.5) accommodates a low-rank kriging spatial model, where the spatial random effects Z will have a dimension that does not increase with the sample size n and, in practice, is often far less than n. Specifically, consider a subset of
Chapter
4
Modeling and Analysis of Spatially Correlated Data
locations (Sl' ... ,Sn), say, (1\;1,'" ,I\;K), where K Let
85
< < n, as representative knots.
and Z be a K x 1 vector with covariance 0- 1 . Then (3.5) represents a low-kriging model by taking a linear combination of radial basis functions C(s - I\;k; 0)), 1 ~ k ~ K, centered at the knots (1\;1,'" ,I\;K), and can be viewed a generalization of Kammann and Wand's (2003) linear geoadditive model to accommodate nonnormal spatial data. Because of the generality of (3.5), the ensuing inferential procedures in Section 3.2 will be based on (3.5) and (3.6), facilitating the prediction of spatial random effects and, hence, each individual's profile. Two routes can be taken. The best predictor of random effects minimizing the conditional-mean-squared-error (2.15) is E(ZIY n), not necessarily linear in Y n' But if we confine our interest to an unbiased linear predictors of the form
for some conformable vector c and matrix Q, minimizing the mean squared error (2.12) subject to constraint (2.13) leads to the best linear unbiased predictor (BLUP)
Z = E(Z) + cov(Z, Y n){var(Y n)} -l{y n -
(3.7)
E(Y n)}.
Equation (3.7) holds true without any normality assumptions (McCulloch and Searle, 2001). For illustration, consider a Dirichlet model for binary spatial outcomes such that Y(s)IZ(s)
rv
Bernoulli(Z(s))
and the random effect Z = (Z(st)"" ,Z(sn)) rv Dir(a1,'" ,an), where Using (3.7), we obtain the best linear predictor for Z(Si),
where ao = L:~1 ai, 1£0 = (at/ao,'" ,an/ao)', ~o (ao + 1) for i =f j, Cii = ai(ao - ai)/a5(ao + 1) and As a simple example, when n = 2,
=
ei
ai
> O.
[Cij]nxn, Cij = -ai a j/a5 is the i-th column of ~o·
86
3.2
Yi Li
Computing MLEs for SGLMMs
A common theme in fitting a SGLMM has been the difficulty of computation of likelihood-based inference. Computing the likelihood itself actually is often challenging for SGLMMs, largely due to high dimensional intractable integrals. We present below several useful likelihood-based approaches to estimating the coefficients and variance components, including iterative maximization procedures, such as the Expectation and Maximization (EM) algorithm, and approximation procedures, such as the Penalized Quazi-likelihood method and the Laplace method. The EM algorithm (Dempster et al., 1977) was originally designed for likelihood-based inference in the presence of missing observations, and involves an iterative procedure that increases likelihood at each step. The utility of the EM algorithm in a spatial setting lies in treating the unobserved spatial random terms as 'missing' data, and imputing the missing information based on the observed data, with the goal of maximizing the marginal likelihood of the observed data. Specifically, if the random effects Z were observed, we would be able to write the 'complete' data as (Y n, Z) with a joint log likelihood
As Z is unobservable, directly computing (3.8) is not feasible. Rather the EM algorithm adopts a two-step iterative process. The Expectation step ('E' step) computes the expectation of (3.8) conditional on the observed data. That is, calculate i = E{£(Y n, ZIX n ;,l3, 8)IY n, Xn;,l3o, 8 0 }, where ,l30, 8 0 are the current values, followed by a Maximization step (M step), which maximizes i with respect to ,l3 and 8. The E step and M step are iterated until convergence is achieved; however, the former is much costly, as the conditional distribution of ZIX n , Y n involves the distribution f(Y nIXn), a high dimensional intractable integral. A useful remedy is the Metropolis-Hastings algorithm that approximates the conditional distribution of ZIX n , Y n by making random draws from ZIXn, Y n without calculating the density f(Y nlXn) (McCulloch, 1997). Apart from the common EM algorithm that requires a full likelihood analysis, several less costly techniques have proved useful for approximate inference in the SGLMMs and other nonlinear variance component models, among which the Penalized Quasi-likelihood (PQL) method Penalized Quasi-likelihood method is most widely used. The PQL method was initially exploited as an approximate Bayes procedure to estimate regression coefficients for semiparametric models; see Green (1987). Since then, several authors have explored the PQL to draw approximate inferences based on random effects models: Schall (1991) and Breslow and Clayton(1993) developed iterative PQL algorithms, Lee and NeIder (1996) applied the PQL directly to hierarchical models. We consider the application of the PQL for the SGLMM (3.5). For notational simplicity we write the integrand of the likelihood function
f(Y nlXn, Z; J3)f(ZIX n ; 8) = exp{ -K(Yn, Z)},
(3.9)
Chapter
4
Modeling and Analysis of Spatially Correlated Data
87
where, for notational simplicity, we do not list Xn as an argument in function K. Next evaluate the marginal likelihood. Temporarily we assume that (J is known. For any fixed (3, expanding K(Y n, Z) around its mode Z up to the second order term, we have
L(Yn!Xn ; (3, (J)
=
J
exp{ -K(Yn, Z)}dZ
= IJ21T{ K(2)(y n, Z)} -1 W/ 2 exp{ -K(Yn, Z)}, wher: K(2)(y n, Z) denotes the second derivative of K(Y n, Z) with respect to Z, and Z lies in the segment joining 0 and Z. If K(2)(y n, Z) does not vary too much as Z changes (for instance, K(2)(y n, Z) = constant for normal data), maximizing the marginal likelihood (3.6) is equivalent to maximizing
This step is also equal to jointly maximizing fey n !X n , Z; (3)f(Z!X n ; (J) w.r.t (3 and Z with (J being held constant. Finally, only (J is left to be estimated, but it can be estimated by maximizing the approximate profile likelihood of (J,
refer to Breslow and Clayton (1993). As no close-form solution is available, the PQL is often performed through an iterative process. In particular, Schall (1991) derived an iterative algorithm when the random effects follow normal distributions. Specifically, with the current estimated values of (3, (J and Z, a working 'response' Yn is constructed by the first order Taylor expansion of g(Y) around p,z, or explicitly, (3.10) where g(1)(-) denotes the first derivative and g(.) is defined in (3.4). When viewing the last term in (3.10) as a random error, (3.10) suggests fitting a linear mixed model on Y n to obtain the updated values of (3, Z and (J, followed by a recalculation of the working 'responses'. The iteration shall continue until convergence. Computationally, the PQL is easy to implement, only requiring repeatedly invoking existing macros, for example, SAS 'PROC MIXED'. The PQL procedure yields exact MLEs for normally distributed data and for some cases when the conditional distribution of Y n and the distribution of Z are conjugate. Several variations of the PQL are worth mentioning. First, the PQL is actually applicable in a broader context where only the first two conditional moments of Y n given Z are needed, in lieu of a full likelihood specification. Specifically, fey n!X n , Z; (3) in (3.9) can be replaced by the quasi-likehood function exp{ ql(Y n!X n , Z; (3)}, where m
ql(Y n!X n , Z; (3)
=
rf)li - t Vet) dt.
~ }Yi
Here J-li = E(Y(Si)!X n , Z; (3) and V(J-lt) = var(Y(si)!X n , Z; (3).
88
Yi Li
Secondly, the PQL is tightly related to other approximation approaches, such as the Laplace method and the Solomon-Cox method, which have also received much attention. The Laplace method (see, e.g. Liu and Pierce (1993)) differs from the PQL only in that the former obtains Z((3,8) by maximizing the integrand e-K(Y n,Z) with (3 and 8 being held fixed, and subsequently estimates (13,0) by jointly maximizing
On the other hand, with the assumption of E(ZIX n ) = 0, the Solomon-Cox technique approximates the integral J f(Y nlXn, Z)f(ZIXn)dZ by expanding the integrand f(Y nlXn, Z) around Z = 0; see Solomon and Cox (1992). In summary, none of these approximate methods produce consistent estimates, with exception in some special cases, e.g. normal data. Moreover, as these methods are essentially normal approximation-based, they typically do not perform well for sparse data, e.g. for binary data, and when the cluster size is relatively small (Lin and Breslow, 1996). Nevertheless, they provide a much needed alternative, especially given that full likelihood approaches are not always feasible for spatial data.
4
Spatial models for censored outcome data
Biomedical and epidemiological studies have spawned an increasing interest in and practical need for developing statistical methods for modeling time-to-event data that are subject to spatial dependence. Little work has been done in this area. Li and Ryan (2002) proposed a class of spatial frailty survival models. A further extension accommodating time-varying and nonparametric covariate effects, namely geoadditive survival model, was proposed by Hennerfeind et al. (2006). However, the regression coefficients of these frailty models do not have an easy population-level interpretation, less appealing to practitioners. In this section, we focus on a new class of semiparametric likelihood models recently developed by Li and Lin (2006). A key advantage of this model is that observations marginally follow the Cox proportional hazard model and regression coefficients have a population level interpretation and their joint distribution can be specified using a likelihood function that allows for flexible spatial correlation structures. Consider in a geostatistical setting a total of n subjects, who are followed up to event (e.g. death or onset of asthma) or being censored, whichever comes first. For each individual, we observe a q x 1 vector of covariates X, and an observed event time T = min(T, U) and a non-censoring indicator b = I(T :::;; U), where T and U are underlying true survival time and censoring time respectively, and 1(·) is an indicator function. We assume noninformative censoring, i.e., the censoring time U is independent of the survival time T given the observed covariates, and the distribution of U does not involve parameters of the true survival mode1. The covariates X are assumed to be a predictable time-dependent (and spacedependent) process. Also documented is each individual's geographic location Si.
Chapter
4
Modeling and Analysis of Spatially Correlated Data
89
Denote by X(t) = (X(s) : 0 ~ s ~ t) the X-covariate path up to time t. We specify that the survival time T marginally follows the Cox model
A{tIX(t)} = Ao(t)'IjJ{X(t),,8}
(4.1)
where 'IjJ{., .} is a positive function, ,8 is a regression coefficient vector and Ao (t) is an unspecified baseline hazard function. A common choice of 'IjJ is the exponential function, in which case, 'IjJ{X(t) , ,8} = exp{,8'X(t)}, corresponding to the Cox proportional hazards model discussed in Li and Lin (2006). This marginal model refers to the assumption that the hazard function (4.1) is with respect to each individual's own filtration, F t = O"{ICT ~ s,8 = l),I(T ~ s), Xes), 0 ~ s ~ t}, the sigma field generated by the survival and covariate paths up to time t. The regression coefficients ,8 hence have a population-level interpretation. Use subscript i to flag each individual. A spatial joint likelihood model for T I ,··· ,Tn is to be developed, which allows Ti to marginally follow the Cox model (4.1) and allows for a flexible spatial correlation structure among the T/s. Denote by Ai(t) = Ai(sIXi)ds the cumulative hazard and Ao(t) = Ao(s)ds the cumulative baseline hazard. Then Ai(Ti) marginally follows a unit exponential distribution, and its pro bit-type transformation
J;
J;
Tt
=
{
1 - e-Ai(Ti) }
(4.2)
follows the standard normal distribution marginally, where
= {Tt,i
= 1,··· ,n}
rv
MVN(on,r),
(4.3)
where r is a positive definite matrix with diagonal elements being 1. Denote by ()ij the (i,j)th element of r. We assume that the correlation ()ij between a pair of normalized survival times, say Tt and T], depends on their geographic locations Si and Sj, i.e. corr(Tt, Tn = ()ij = ()ij(Si, Sj) (4.4) for i f= j (i,j = 1,··· ,n), where ()ij E (-1,1). Generally a parametric model is assumed for ()ij, which depends on a parameter vector a as ()ij (a). Since the transformed times T* are normally distributed, a rich class of models can be used to model the spatial dependence by specifying a parametric model for ()ij. For instance, ()ij (a) may be parameterized as p( dij , a), an isotropic
Yi Li
90
correlation function which decays as the Euclidean distance dij between two individuals increases. A widely adopted choice for the correlation function is the Matern function m((J"2,(,v,d) as defined in (2.2). Recall that (J"2 is a scale parameter and corresponds to the 'partial sill', ( measures the correlation decay with the distance and v is a smoothness parameter, characterizing the behavior of the correlation function near the origin, but its estimation is difficult as it requires dense space data and may even run into identifiability problems. Stein (1999) has argued that data can not distinguish between v = 2 and v> 2. Li and Lin (2006) fixing v to estimate the other parameters and performing a sensitivity analysis by varying v for data analysis, in which case the unknown a = ((J"2, Of.
4.1
A class of semiparametric estimation equations
As a full likelihood-based inferential procedure, which involves a large dimensional integral, is difficult, we opt for a class of spatial semiparametric estimating equations constructed using the first two moments of individual survival times and the covariance functions of all pairs of survival times. First derive the Martingale covariance rate function under the semiparametric normal transformation model (4.2)-(4.3). We denote the counting process Ni(t) = J(Ti ~ t, bi = 1) and the at-risk process Yi(t) = J(Ti ;;:::: t). Next define a Martingale, adapted to the filtration Yi,t = (J"(Ni(s), Yi(s), Xi(s), 0 ~ s < t), as
To relate the correlation parameters a to the counting processes, one needs to consider the joint counting process of two individuals. Define the conditional Martingale covariance rate function for the joint counting process of two individuals, a multi-dimensional generalization of the conditional hazard function, as (Prentice and Cai, 1992)
Then we have
Denote by Sij(Vl,V2) the joint survival function of Ai(Ti) and Aj(Tj), the exponential transformations of the original survival times. Then
(4.5) Following Prentice and Cai (1992), one can show that the covariance rate can be written as
Chapter
4
Modeling and Analysis of Spatially Correlated Data
91
where
As a special case, AO(Vi, V2; () = 0) == O. Li and Lin (2006) showed that as () ----+ 0+, AO(Vi, V2; e) converges to 0 uniformly at the same rate as that when (Vi, V2) lies in a compact set. We simultaneously estimate the regression coefficients f3 (a q x 1 vector) and the correlation parameters a (a k x 1 vector) by considering the first two moments of the Martingale vector (Mi' ... ,Mn). In particular, for a pre-determined constant T > 0 such that it is within the support of the observed failure time, i.e P(T < Ui 1\ Ti ) > 0 (in practice T is usually the study duration), we consider the following unbiased estimating functions for 8 = {f3, a} for an arbitrary pair of two individuals, indexed by u and v: • if u
= V,
where WCu,u)(s) (a scalar) and Vuu (a length-q vector) are non-random weights . • if u -=1= v,
Uu,v(8)
=
[J;vuv{Mu(T)Mv(T) Xu,v(s)WCu,v) (s)dMu,v(s) ] - Auv}
where Xu,v(s) = {Xu(s),Xv(s)}, dMu,v(s) = {dMu(s),dMv(s)}', and WCu,v) (s) = {w};",v)hX2 and Vuv (a length-q vector) are non-random weights and
It can be easily shown that U u,v is an unbiased estimating function, since E{U u ,v(8 0 )} = 0, where the expectation is taken under the true 8 0 = (f3 0 , ao) and the true cumulative hazard function AoO. In fact, the first component of U u ,v, which is the estimating equation for f3, is unbiased even when the spatial correlation structure is misspecified. Hence the regression coefficient estimator f3 is robust to misspecification of the spatial correlation structure. As Ao(t) in the estimating equations is unknown, a natural alternative is to substitute it with the Breslow-type estimator ~
92
Yi Li
As a result, the parameters of interest 8 = ((3,0:) are estimated by solving the following estimating equations, constructed by weightedly pooling individual Martingale residuals and weightedly pooling all pairs of Martingale residuals respectively G n = n- 1 LUu,v(8) = 0,
(4.6)
u;3v
where UO arises from U(·) by substituting Ao(t) by Ao(t). With the matrix notation, (4.6) can be expressed conveniently as
(4.7)
where j = 1,··· , k, Wand Vj are weight matrices, M = (Ah,··· , £In)', X(s) = {X1(S),··· ,Xn(S)}', A is an n x n matrix whose uv-th (u =I- v) entry is Auv obtained from Auv with Ao(t) replaced by Ao(t), and Auu = Yu(s)dAu(s). The weight matrices Wand V 1, . .. , V k are meant to improve efficiency and convergence of the estimator of {3 and 0:. In particular, following Cai and Prentice (1992) W can be specified as (D- 1/ 2AD- 1/ 2)-1, the inverse of the correlation matrix of the Martingale vector M(r), where D = diag(An,'" , Ann). In the absence of spatial dependence, W is an identity matrix and hence the first set of equations of (4.7) is reduced to the ordinary partial likelihood score equation for regression coefficients {3. To specify Vj (j = 1,··· , q), one could assume Vj = A -1(8A/8aj)A -1. Under this specification, the second set of estimating equations in (4.7) resembles the score equations of the variance components 0: if the 'response' M followed a multivariate normal distribution MVN(on, A) (Cressie, 1993, p483). To ensure numerical stability, we consider a modification of the spatial estimating equation (4.7) by adding a penalty term,
I;
where 0 is a (q + k) x (q + k) positive definite matrix, acting like a penalty term. This penalized version of the spatial estimating equation (4.7) can be motivated from the perspective of ridge regression or from Bayesian perspectives by putting a Gaussian prior MVN(Oq+k, 0- 1 ) on 8, and results in stabilized variance component estimates of 0: for example, for moderate sample sizes, and is likely to force the resulting estimates to lie in the interior of the parameter space (Heagerty and Lele, 1998). Therefore in practice, especially when the sample size is not large, we consider using a small penalty, {} = wI, where 0 < w < 1, for numerical stability. As the sample size n goes to 00, we have ~08 - O. Therefore G n (8) and G~(e) are asymptotically equivalent, and therefore the large sample results of the original and penalized estimating equations are equivalent.
Chapter
4.2
4
Modeling and Analysis of Spatially Correlated Data
93
Asymptotic Properties and Variance Estimation
The large sample properties of the estimators can be established, facilitating drawing inference based on the semiparametric normal transformation model. Under the regularity conditions listed in Li and Lin (2006), the estimators obtained by solving G m (9) = 0 exist and are consistent for the true values of 9 0 = ({30, 0::0) and that nl/2{E> - 9 0 } is asymptotic normal with mean zero and a covariance matrix that can be easily estimated using a sandwich estimator. The results are formally stated in the following Proposition and can be proved along the line of Li and Lin (2006), which focused on the proportional hazards models.
Proposition Assume the true 9 0 is an interior point of an compact set, say, 13 x A E Rq+k, where q is the dimension of (3 and k is the dimension of 0::. When n is sufficiently large, the estimating equation G n (9) = 0 has a unique solution in a neighborhood of 9 0 with probability tending to 1 and the resulting estimator is consistent for 9 0 . Furthermore, v'np:;(2)}-1/2:E{(,B,a)'- ({3o,O::o)'}!!:.; MV N { 0q+k, I}, where I is an identity matrix whose dimension is equal to that of 9 0 , and
e
It follows that the covariance of
e can be estimated in finite samples by
1 ~-1~(2) {~-l}' I;;: =:E:E :E
(4.8)
where ~ and ~(2) are estimated by replacing U uv (-) by U uv (-) and evaluated at
80.
Although each E { UUl,Vl (90)U~2,v2 (9 0)} could be evaluated numerically, the total number of these calculations would be prohibitive, especially when the ~(2)
sample size m is large. To numerically approximate:E ,one can explore the resampling techniques of Carlstein (1986) and Sherman (1996). Specifically, under the assumption of n x E {GnG~} -> :Eoo ,
:Eoo can be estimated by averaging K randomly chosen subsets of size 1,··· ,K) from the n subjects as
nj
(j =
where Gnj is obtained by substituting 9 with 8 in G nj • The nj is often chosen to be proportional to n so as to capture the spatial covariance structure. For
Yi Li
94
practical utility, Li and Lin recommended to choose nj to be roughly 1/5 of the total population. Given the estimates iS oo and is, the covariance of can be ~-1
~
~-1
e
estimated by ~ [lin x ~ool(~ )'. To estimate the covariance matrix of the estimates arising from the penalized estimator obtained by solving G~(e) = 0, is is replaced by is - ~n. A similar procedure was adopted by Heagerty and Lele (1998) for spatial logistic regression.
4.3
A data example: east boston asthma study
Li and Lin reported the application of the proposed method to analyze the East Boston Asthma study, focusing on assessing how the familial history of asthma may have attributed to disparity in disease burden. In particular, this study was to establish the relationship between the Low Respiratory Index (LRI) in the first year of life, ranging from 0 to 16, with high values indicating worse respiratory functioning, and age at onset of childhood asthma, controlling for maternal asthma status (MEVAST), coded as l=ever had asthma and O=never had asthma, and log-transformed maternal cotinine levels (LOGMCOT). This investigation would help better understand the natural history of asthma and its associated risk factors and to develop future intervention programs. Subjects were enrolled at community health clinics throughout the east Bost on area, with questionnaire data collected during regularly scheduled well-baby visits. The ages at onset of asthma were identified through the questionnaires. Residential addresses were recorded and geocoded, with geographic distance measured in the unit of kilometer. A total of 606 subjects with complete information on latitude and longitude were included in the analysis, with 74 events observed at the end of the study. The median followup was 5 years. East Boston is a residential area of relatively low income working families. Participants in this study were largely white and hispanic children, aging from infancy to 6 years old. Asthma is a disease strongly affected environmental triggers. Since the children living in adjacent locations might have had similar backgrounds and living environments and, therefore, were exposed with similar unmeasured similar physical and social environments, their ages at onset of asthma were likely to be subject to spatial correlation. The age at onset of asthma was assumed to marginally follow a Cox model
A(t) = Ao(t) exp{,BL x LRI +,BM x MEVAST +,Be x LOGMCOT},
(4.9)
while the Matern model (2.1) was assumed for the spatial dependence. Evidently, betaL,,BM and ,Be measured the impact of main covariates and have populationlevel interpretations. The regression coefficients and the correlation parameters were estimated using the spatial semi parametric estimating equation approach, and the associated standard error estimates were computed using (4.8). To check the robustness of the method, Li and Lin varied the smoothness parameter 1I in (2.1) to be 0.5, 1 and 1.5. As the East Boston Asthma Study was conducted in a fixed region, to examine the performance of the variance estimator in (4.8), which was developed
Chapter
4
Modeling and Analysis of Spatially Correlated Data
95
under the increasing-domain-asymptotic, Li and Lin calculated the variance using a 'delete-a-block' jackknife method (see, e.g. Kott (1998)). Specifically, they divided the samples into B nonoverlapping blocks based on their geographic proximity and then formed B jackknife replicates, where each replicate was formed by deleting one of the blocks from the entire sample. For each replicate, the estimates based on the semiparametric estimating equations were computed, and the jackknife variance was formulated as B-1
Varjackknife
=
B
A
~ 2)8 j
A
-
A
8)(8 j
A
-
8)'
(4.10)
j=l
where 8 j was the estimate produced from the jackknife replicate with the jth 'group' deleted and 8 was the estimate based on the entire population. In their calculation, B was chosen to be 40, which appeared large enough to render a reasonably good measure of variability. This jackknife scheme, in a similar spirit of Carlstein (1986, 1988), treated each block approximately independent and seemed plausible for this data set, especially in the presence of weak spatial dependence. Loh and Stein (2004) termed this scheme as the splitting method and found it work even better than more complicated block-bootstrapping methods (e.g. Kunsch, 1989; Liu and Singh, 1992; Politis and Romano, 1992; Bulhmann and Kunsch, 1995). Other advanced resampling schemes for spatial data are also available, e.g double-subsampling method (Lahiri et al., 1999; Zhu and Morgan, 2004) and linear estimating equation Jackknifing (Lele, 1991), but are subject to much more computational burden compared with the simple jackknife scheme we used. Their results are summarized in the following table, with the large sample standard errors (SEa) computed using the method described in Section 4.3 and the Jackknife standard errors (SEj) computed using (4.10). 1/ = 0.5 Parameters Estimate SEa 0.3121 0.0440 (h 13M 0.2662 0.3314 f3c 0.0294 0.1394 (Y2 1.68E-3 9.8&3 4.974 2.2977 (
1/=1 1/ = 1.5 SEj Estimate SEa SEj SEJ. Estimate SEa 0.0357 0.3118 0.0430 0.0369 0.3124 0.0432 0.0349 0.3222 0.2644 0.3289 0.3309 0.2676 0.3283 0.3340 0.1235 0.02521 0.1270 0.1063 0.0277 0.1288 0.1083 0.0127 0.74E-3 5.0E-3 7.1&3 0.72E-3 5.5E-3 4.8&3 3.708 2.1917 4.7945 4.1988 1.8886 6.5005 5.01617
The estimates of the regression coefficients and their standard errors were almost constant with various choices of the smoothness parameter 1/ and indicated that the regression coefficient estimates were not sensitive to the choice of 1/ in this data set. The standard errors obtained from the large sample approximation and the Jackknife method were reasonably similar. Low respiratorl index was highly significantly associated with the age at }>nset of asthma, e.g. f3L = 0.3121 (SEa = 0.0440, SEj = 0.0357) when 1/ = 0.5; f3L = 0.3118 (SEa = 0.0430, SEj = 0.0369) when 1/ = 1.0; fJL = 0.3124 (SEa = 0.0432, SEj = 0.0349) when 1/ = 1.5, indicating that a child with a poor respiratory functioning was more likely to
Yi Li
96
develop asthma, after controlling for maternal asthma, maternal cotinine levels and accounting for the spatial variation. No significant association was found between ages at onset of asthma and maternal asthma and cotinine levels. The estimates of the spatial dependence parameters, (J'2 and ( varied slightly with the choices of 1/. The scale parameter (J'2 corresponds to the partial sill and measures the correlation between subjects in close geographic proximity. Thi analysis showed that such a correlation is relatively small. The parameter ( measures global spatial decay of dependence with the spatial distance (measured in kilometers). For example, when 1/ = 0.5, i.e., under the exponential model, ( = 2.2977 means the correlation decays by 1- exp( - 2.2977 xl) ~ 90% for everyone kilometer increase in distance.
5
Concluding remarks
This chapter has reviewed the methodologies for the analysis of spatial data within the geostatistical framework. We have dealt with data that consist of the measurements at a finite set of locations, where the statistical problem is to draw inference about the spatial process, based on the partial realization over this subset of locations. Specifically, we have considered using linear mixed models and generalized linear models that enable likelihood inference for fully observable spatial data. The fitting of such models by using maximum likelihood continues to be complicated owing to intractable integrals in the likelihood. In addition to the methods discussed in this chapter, there has been much research on the topic since the last decade, including Wolfinger and O'Connell (1993), Zeger and Karim (1993), Diggle et al. (1998), Booth and Hobert (1999). We have also reviewed a new class of semi parametric normal transformation model for spatial survival data that was recently developed by Li and Lin (2006). A key feature of this model is that it provides a rich class of models where regression coefficients have a population-level interpretation and the spatial dependence of survival times is conveniently modeled using flexible normal random fields, which is advantageous given that there are virtually none spatial failure time distributions that are convenient to work with. Several open problems, however, remain to be investigated for this new model, including model diagnostics (e.g. examine the spatial correlation structure for censored data), prediction (e.g. predict survival outcome for new locations) and computation (e.g. develop fast convergent algorithms for inference). Lastly, as this chapter tackles geostatistical data mainly from the frequentist points of view, we have by-passed the Bayesian treatments, which have been, indeed, much active in the past 20 years. Interested readers can refer to the book of Banerjee et al. (2004) for a comprehensive review of Bayesian methods.
References [1] Abramowitz M., and Stegun 1. A. (Editors)(1965), Handbook of Mathematical Functions, Dover Publications, New York.
Chapter
4
Modeling and Analysis of Spatially Correlated Data
97
[2] Banerjee, S. and Carlin, B. P (2003), Semiparametric Spatio-temporal Frailty Modeling, Environmetrics, 14 (5), 523-535. [3] Banerjee, S., Carlin, B. P. and Gelfand, A. E (2004), Hierarchical Modeling and Analysis for Spatial Data, Chapman and Hall/eRC Press, Boca Raton. [4] Besag, J., York, J. and Mollie, A. (1991), Bayesian Image Restoration, With Two Applications in Spatial Statistics, Annals of the Institute of Statistical Mathematics, 43, 1-20. [5] Breslow, N. E. and Clayton, D. G. (1993), Approximate Inference in Generalized Linear Mixed Models, Journal of the American Statistical Association, 88, 9-25. [6] Brook, D. (1964), On the Distinction Between the Conditional Probability and the Joint Probability Approaches in the Specification of NearestNeighbour Systems, Biometrika, 51, 481-483. [7] Bulhmann, P. and Kunsch, H. (1995), The Blockwise Bootstrap for General Parameters of a Stationary Time Series, Scandinavian Journal of Statistics, 22, 35-54. [8] Carlin, B. P. and Louis, T. A. (1996), Bayes and empirical Bayes methods for data analysis, Chapman and Hall Ltd, London. [9] Carlstein, E. (1986), The Use of Subseries Values for Estimating the Variance of a General Statistic from a Stationary Sequence, The Annals of Statistics, 14, 1171-1179. [10] Carlstein, E. (1988), Law of Large Numbers for the Subseries Values of a Statistic from a Stationary sequence, Statistics, 19, 295-299. [11] Clayton, D. and Kaldor, J. (1987), Empirical Bayes Estimates of Agestandardized Relative Risks for Use in Disease Mapping, Biometrics, 43, 671681. [12] Cressie, N. (1993), Statistics for Spatial Data, Wiley, New York. [13] del Pino, G. (1989), The unifying role of iterative generalized least squares in statistical algorithms (C/R: p403-408) Statistical Science, 4, 394-403. [14] A. P. Dempster, N. M. Laird, D. B. Rubin(1977), Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society. B, 39 1-22. [15] Kammann, E. E. and Wand, M. P. (2003), Geoadditive models. Journal of the Royal Statistical Society, Series C (Applied Statistics), 52, 1-18. [16] P. J. Green(1987), Penalized likelihood for general semi-parametric regression models, International Statistical Review, 55 (1987) 245-259. [17] Haining, R, Griffith, D. and Bennett, R (1989), Maximum likelihood estimation with missing spatial data and with an application to remotely sensed data, Communications in Statistics: Theory and Methods, 1875-1894. [18] Harville, D. A. (1974), Bayesian inference for variance components using only error contrasts, Biometrika, 61, 383-385. [19] Heagerty, P. J. and Lele, S. R (1998), A Composite Likelihood Approach to Binary Spatial Data, Journal of the American Statistical Association, 93, 1099-1111. [20] Hennerfeind, A., Brezger, A. and Fahrmeir, L. (2006), Geoadditive Survival Models, Journal of the American Statistical Association, Vol. 101, 1065-1075.
98
Yi Li
[21] Host, G. (1999), Kriging by local polynomials, Computational Statistics & Data Analysis, 29, 295-312. [22] Journel, A. G. (1983), Geostatistics, Encyclopedia of Statistical Sciences (9 vols. plus Supplement), 3,424-431. [23] Kott, P. S. (1998), Using the Delete-a-group Jackknife Variance Estimator in Practice, ASA Proceedings of the Section on Survey Research Methods, 763-768. [24] Lahiri, S. N., Kaiser, M. S., Cressie, N. and Hsu, N. (1999), Prediction of Spatial Cumulative Distribution Functions Using Subsampling (C/R: p97110), Journal of the American Statistical Association, 94, 86-97. [25] N. M. Laird, J. H. Ware(1982), Random-effects models for longitudinal data, Biometrics, 38, 963-974. [26] Y. Lee, J. A. Nelder(1996), Hierarchical Generalized Linear Models, J.R. Statist. Soc. B, 58, 619-678. [27] Le1e, S. (1991), Jackknifing Linear Estimating Equations: Asymptotic Theory and Applications in Stochastic Processes, Journal of the Royal Statistical Society, Series B, 53, 253-267. [28] Li, Y. and Ryan, L. (2002), Modeling spatial survival data using semiparametric frailty models, Biometrics, 58, 287-297. [29] Li, Y. and Lin, X. (2006), Semiparametric Normal Transformation Models for Spatially Correlated Survival Data, Journal of the American Statistical Association, 101, 591-603. [30] X. Lin, N. E. Breslow(1995), Bias Correction in generalized linear mixed models with multiple components of dispersion, Journal of the American Statistical Association, 91, 1007-1016. [31] Q. Liu, D. A. Pierce(1993), Heterogeneity in Mantel-Haenszel-type models Biometrika. 80, 543-556. [32] Liu, R. and Singh, K. (1992), Moving Blocks Jackknife and Bootstrap Capture Weak Dependence, Exploring the Limits of Bootstrap. LePage, Raoul (ed.) and Billard, Lynne (ed.) 225-248. [33] Loh, J. M. and Stein, M. L. (2004), Bootstrapping a Spatial Point Process, Statistica Sinica, 14, 69-101. [34] P. McCullagh, J. A. Nelder(1989), Generalized Linear Models, 2nd deition. Chapman and Hall, London. [35] J. A. NeIder, R. W. Wedderburn(1972), Generalized linear models, J. R. Statist. Soc. A, 135, 370-384. [36] D. Nychka and N. Saltzman (1998), Design of Air-Quality Monitoring Networks. Case Studies in Environmental Statistics, Lecture Notes in Statistics ed. Nychka, D., Cox, L. and Piegorsch, W. , Springer Verlag, New York. [37] Opsomer, J. D., Hossjer, 0., Hoessjer, 0., Ruppert, D., Wand, M. P., Holst, U. and Hvssjer, O. (1999), Kriging with nonparametric variance function estimation, Biometrics, 55, 704-710. [38] Paciorek, C. J. (2007), Computational techniques for spatial logistic regression with large datasets, Computational Statistics and Data Analysis, 51, 36313653. [39] Politis, D. N. and Romano, J. P. (1992), A General Resampling Scheme for
Chapter
[40]
[41] [42] [43] [44] [45]
[46] [47] [48]
4
Modeling and Analysis of Spatially Correlated Data
99
Triangular Arrays of a-mixing Random Variables with Application to the Problem of Spectral Density Estimation, The Annals of Statistics, 20, 19852007. Prentice, R. L. and Cai, J. (1992), Covariance and Survivor Function Estimation Using Censored Multivariate Failure Time Data, Biometrika, 79, 495-512. Sherman, M. (1996), Variance estimation for statistics computed from spatial lattice data, Journal of the Royal statistical Society,Series B, 58, 509-523. Stein, M. L. (1999), Interpolation of Spatial Data: Some Theory of Kriging, Springer, New York. R. Schall(1991), Estimation in generalized linear models with random effects, Biometrika, 78, 719-727. P. J. Solomon, D. R. Cox(1992), Nonlinear component of variance models, Biometrika, 79, 1-1l. Waller, L. A., Carlin, B. P., Xia, H. and Gelfand, A. E. (1997), Hierarchical Spatio-temporal Mapping of Disease Rates, Journal of the American Statistical Association, 92, 607-617. P. J. Diggle, R. A. Moyeed and J. A. Tawn(1998), Model-based Geostatistics. Applied Statistics, 47, 299-350. zhanghao H. Zhang (2002), On Estimation and Prediction for Spatial Generalized Linear Mixed Models Biometrics, 58, 129-136. Zhu, J. and Morgan, G. D. (2004), Comparison of Spatial Variables Over Subregions Using a Block Bootstrap, Journal of Agricultural, Biological, and Environmental Statistics, 9, 91-104.
This page intentionally left blank
Part II
Statistical Methods for Epidemiology
This page intentionally left blank
Chapter 5 Study Designs for Biomarker-Based Treatment Selection Amy Laird * Xiao- H ua Zhou t Abstract Among patients with the same clinical disease diagnosis, response to a treatment is often quite heterogeneous. For many diseases this may be due to molecular heterogeneity of the disease itself, which may be measured via a biomarker. In this paper, we consider the problem of evaluating clinical trial designs for drug response based on an assay of a predictive biomarker. In the planning stages of a clinical trial, one important consideration is the number of patients needed for the study to be able to detect some clinically meaningful difference between two treatments. We outline several trial designs in terms of the scientific questions each one is able to address, and compute the number of patients required for each one. We exhibit efficiency graphs for several special cases to summarize our results.
Keywords: Biomarker; predictive biomarker; treatment selection; clinical trial design; validation; predictive value.
1
Introduction
A group of patients with the same clinical disease and receiving the same therapy may exhibit an enormous variety of response. Genomic technologies are providing evidence of the high degree of molecular heterogeneity of many diseases. A molecularly targeted treatment may be effective for only a subset of patients, and ineffective or even detrimental to another subset, due to this molecular heterogeneity. Potential side effects, as well as time wasted with ineffective drugs, can create a substantial cost to the patient. The ability to predict which treatment would give the best result to a patient would be universally beneficial. We are interested in evaluating clinical trial designs for drug response based on an assay of a biomarker. We consider the design of a phase III randomized clinical trial to compare the performance of a standard treatment, A, and an *Department of Biostatistics, University of Washington, Seattle, WA, USA. E-mail: [email protected] tDepartment of Biostatistics, University of Washington, Seattle, WA, USA. HSR&D Center of Excellence, VA Puget Sound Health Care System, Seattle, WA, USA. E-mail: azhou@u. washington.edu
103
104
Amy Laird, Xiao-Hua Zhou
alternative and possibly new treatment, B. In the case of stage II colorectal cancer, which will be used as an example in our discussion of the study designs [1], treatment A consists of resection surgery alone, while treatment B is resection surgery plus chemotherapy. We assume that patients can be divided into two groups based on an assay of a biomarker. This biomarker could be a composite of hundreds of molecular and genetic factors, for example, but in this case we suppose that a cutoff value has been determined that dichotomizes these values. In our example the biomarker is the expression of guanylyl cyclase C (GCC) in the lymph nodes of patients. We assume that we have an estimate of the sensitivity and specificity of the biomarker assay. The variable of patient response is taken to be continuous-valued; it could represent a measure of toxicity to the patient, quality of life, uncensored survival time, or a composite of several measures. In our example we take the endpoint to be three-year disease recurrence. We consider five study designs, each addressing its own set of scientific questions, to study how patients in each marker group fare with each treatment. Although consideration of which scientific questions are to be addressed by the study should supersede consideration of necessary sample size, we give efficiency comparisons here for those cases in which more than one design would be appropriate. One potential goal is to investigate how treatment assignment and patient marker status affect outcome, both separately and interactively. The marker under consideration is supposedly predictive: it modifies the treatment effect. We may want to verify its predictive value and to assess its prognostic value, that is, how well it divides patients receiving the same treatment into different risk groups. Each study design addresses different aspects of these over arching goals. This paper is organized as follows: 1. Definition of study designs 2. Test of hypotheses 3. Sample size calculation 4. Numerical comparison of efficiency 5. Conclusions
2
Definition of study designs
The individual study designs are as follows.
2.1
Traditional design
To assess the safety and efficacy of the novel treatment, the standard design (Fig.l) is to register patients, then randomize them with ratio K, to receive treatment A or B. We compare the response variable across the two arms of the trial without regard for the marker status of the patients. In our example, we would utilize this design if we wanted only to compare the recurrence rates of colorectal cancer in the two treatment groups independent of each patient's biomarker status. Without the marker information on each patient we can address just one scientific question:
Chapter 5 Study Designs for Biomarker-Based Treatment Selection
105
; / treatment A register - - - - - - - - -.... randomize(ratio~)
~
treatment B
Figure 1: The traditional design
• How does the outcome variable compare across treatment groups A and B?
2.2
Marker by treatment interaction design
The next three designs presented have the goal of validating the predictive value of the biomarker. In the marker by treatment interaction design (Fig. 2) we register patients and test marker status, then randomize patients with ratio Ii in each marker subgroup to receive treatment A or B. By comparing the outcomes for different marker subgroups in a given treatment group, the prognostic value of the marker in each treatment group can be measured. By examining the treatment effect in a given marker group, we can measure the predictive value of the marker with respect to the treatments.
V +_
randomize (ratio~)
~ treatment B
1
/ register ----.test marker
~
treatment A
~
treatment A
- _randomize (ratio~) 1
~treatmentB
Figure 2: The marker by treatment interaction design
In our example, suppose we want to be able to look at the treatment effect within each marker group, as well as the effect of marker status on recurrence in each treatment group. In this example this design yields exactly the four outcomes we want to know: patient response in each treatment group in each of the two marker subgroups. In the case of just two marker subgroups, this design gives the clearest and simplest output. In the case of many marker subgroups or many treatments, it may not be feasible or even ethical to administer every treatment to every marker group, and this design may give an unnecessarily large amount of output that is difficult to interpret. In general this design can address the scientific questions: • Does the treatment effect differ across marker groups? • Does marker-specific treatment improve outcome? That is, how does treatment affect outcome within each marker group?
106
Amy Laird, Xiao-Hua Zhou
2.3
Marker-based strategy design
To address the issue of feasibility, we may wish to compare the marker-based and non-marker-based approaches directly. In the marker-based strategy design (Fig. 3) we register patients and randomize them with ratio A to either have a treatment assignment based on marker status, or to receive treatment A regardless of marker status. By comparing the overall results in each arm of the trial, the effectiveness of the marker-based treatment strategy relative to the standard treatment can be assessed. This comparison will yield the predictive value of the marker. Note that this design could not possibly illuminate the case in which treatment B were better for marker-positive and marker-negative patients alike, as marker-negative patients never receive treatment B.
/+ B
Y
\-__
marker-based strategy-test marker
A
register -------_~ randomize (ratio>.. )
~
non-marker-based strategy _ _ _ A
Figure 3: The marker-based strategy design
Returning once again to our example, one study has demonstrated a strong tendency for patients who relapsed within three years to be marker-positive, and for those without relapse for at least six years to be marker-negative [2]. Given the extreme toxicity of chemotherapy treatment, it may be unethical in some cases to administer treatment B to marker-negative patients. In this case, and in similar cases, it may be wise to use the marker-based strategy design. This design can address the scientific questions: • How does the marker-based approach compare to the standard of care? (validation) • Secondarily, do marker-positive patients do better with novel treatment or standard of care?
2.4
Modified marker-based strategy design
Although the direct design presents a reasonable way to compare the marker-based approach with the standard of care, it does not allow us to assess the markertreatment interaction. We present a modification to this design that allows us to calculate this interaction. In the modified marker-based strategy design (FigA) we register patients and test marker status, then randomize the patients with ratio A (in the "first" randomization) to either have a treatment assignment based on marker status, or to undergo a second randomization to receive one of the two treatments. By comparing the overall results in each arm of the trial, the
Chapter 5
Study Designs for Biomarker-Based Treatment Selection
107
effectiveness of the targeted (marker-based) treatment strategy relative to the untargeted (non marker-based) strategy can be assessed. If we test marker status in all patients, this comparison can yield both the prognostic and predictive value of the marker. /
7 ,
/
marker-based strategy
treatment A
~ treatment B
register_test marker------.randomize (ratio.x)
~ 1
~
+ -randomizeY
/
non-marker-based strategy
'" "'" - _
treatment A
(ratio K) 1 ~ treatment B
. y
treatment A randomize (ratio K) l~ treatment B
Figure 4: The modified marker-based strategy design
This design can address the following scientific questions: • How does the marker-based approach compare to the non-marker-based approach? (validation) • What is the prognostic value of the marker? • Secondarily, we can assess the marker-treatment interaction.
2.5
Targeted design
The goal of the traditional design is to assess the safety and efficacy of the novel treatment, but with a marker assay available, there are other possibilities for a trial of this type. For reasons of ethics or interest, we may wish to exclude marker-negative patients from the study. In the targeted design (Fig.5) we register patients and test marker status, then randomize the marker-positive patients to treatments B and A with ratio K,. By comparing the results in each arm of this trial, the performance of treatment B relative to treatment A can be assessed in marker-positive patients. We must therefore screen sufficiently many patients to attain the desired number of marker-positive patients. If the marker prevalence is low, the number of patients we need to screen will be very high. This trial design if very useful when the mechanism of action of treatment B is well-understood. In our example, suppose there were a treatment that had been demonstrated to have a significant positive effect on response for marker-positive patients, and no significant effect for marker-negative patients. We could employ the targeted design as a confirmatory study among marker-positive patients to better characterize the treatment effect in these patients. This design can address the scientific question: • What is the treatment effect among marker-positive patients?
108
Amy Laird, Xiao-Hua Zhou
/+
yB ""A
~ randomize (ratio ")"
I
register ~test marker
~ --_~ _
not included in trial
Figure 5: The targeted design
3
Test of hypotheses and sample size calculation
There are various types of hypothesis tests that we can use in a trial, and the choice of hypothesis test depends on the question in which we are interested. Using a hypothesis test we can find the number of patients that must be included in each arm of the study to detect a given clinically meaningful difference 8 in mean outcome between arms 1 and 2 of the trial at the significance level a we require and at a given level of power 1- (3. Calculation of this number, n, depends on the type of hypothesis that we are testing. We present four main types of hypothesis tests here. Let Xj and Wj be the response variables of the jth of n subjects each in part 1 or 2 of a trial, respectively, relative to the hypothesis test under consideration. We assume now that X and Ware independent normal random variables with respective means /-Lx and /-Lw, and respective variances O'r and O'~. Suppose that there are nl patients in part 1 of the trial and n2 in part 2. We let X and W denote the sample means of the response variable in each part of the trial:
We let E = /-Lw - /-Lx be the true mean difference in response between the two groups. So if a higher response value represents a better outcome, then E > 0 indicates that arm 2 of the trial has had a better mean outcome than arm 1. For ethical reasons we may have unequal numbers of patients in each group, so we call r;, = ndn2' We consider four hypothesis tests to compare the means of X and W.
Chapter 5
3.1
Study Designs for Biomarker-Based Treatment Selection
109
Test for equality
Suppose we want to know whether there is any difference in means between the two arms of the trial. We might have this question if we were interested in proving that there was a significant difference in the outcome variable between the two arms of the trial. We consider the hypotheses
Ho : E = 0
versus
Ha : E =1=
o.
If we know or can estimate the values of the variances (J"f and (J"~ from pilot studies, then we reject Ho at significance level 0: if
w-x denotes the 1 - 0:/2-quantile of the standard normal distribution. If 0, then the power of the above test for rejecting the false null hypothesis is given approximately by
where E
= Ea
Zl-0/2
=1=
We set the above expression in parentheses equal to Zl-/3 and solve for n2 in terms of n1. Therefore the sample sizes necessary in each part of the trial to reject a false null hypothesis are given asymptotically by and
3.2
Test for non-inferiority or superiority
Now suppose we want to know whether the mean of W is less than the mean of X by a significant amount O. We might have this question if we wanted to find out if a new, less toxic treatment were not inferior in efficacy to the current standard. We invoke the test for non-inferiority and consider the hypotheses
Ho:
E
~
0
versus
Ha:
E
> 0,
where 8 < O. Rejection of the null hypothesis signifies non-inferiority of the mean of W as compared to the mean of X. In the test for superiority, we want to know if the mean of W is greater than the mean of X by a significant amount, so we can use these same hypotheses with 0 > O. In this case, rejection of the null hypothesis indicates superiority of the mean of W as compared to the mean of X. If we know the variances (J"f and (J"~, then we reject Ho at significance level 0: if
x-w-o
Amy Laird, Xiao-Hua Zhou
110 If E
= Ea > 8, then the power of the above test for rejecting a false
null hypothesis
is given by
We set the above expression in parentheses equal to Zl-{3 and solve for n2 in terms of nl. Therefore the sample sizes necessary in each part of the trial to reject a false null hypothesis are given asymptotically by and
3.3
Test for equivalence
Suppose we want to know whether the means of X and W differ in either direction by some significant amount 8. We might have this question if we were interested in proving that there is no significant difference in the outcome variable between the two arms of the trial. We use the test for equivalence and consider the hypotheses versus
Ha:
lEI < o.
Note that rejection of the null hypothesis indicates equivalence of the means of X and W. If we know the variances O"f and O"~, then we reject Ho at significance level a if or If lEI = lEal < 0, then the power ofthe above test for rejecting a false null hypothesis is given by
So to find the sample size needed to achieve power 1- f3 at solve
lEI = lEal < 8 we must
and hence the sample size necessary in each part of the trial to reject a false null hypothesis are given asymptotically by and
Chapter 5 Study Designs for Biomarker-Based Treatment Selection
111
By comparison with the formula for n2 in the test for equality, we see that the roles of a and (3 have been reversed. Note that in the usual case, in which 8 > 0 and 1 - (3 > a, this hypothesis test requires a greater sample size than do the other three.
4
Sample size calculation
We use the test for equality to compute the number of patients necessary to include in each arm of the trial to reject a false null hypothesis of zero difference between the means of response in the two arms of the trial, with type I error a and type II error 1 - (3. Here we derive this total number for each of the five designs under consideration, similar to [5], except equal variances in each marker-treatment subgroup and 1:1 randomization ratios are not assumed. Similarly to the notation in the last section, let X be the continuous-valued response of a patient in the trial. Let Z be the treatment assignment (A or B) of a patient. Let D be the true (unmeasurable) binary-valued marker status of a patient, and let R be the result of the marker assay. We denote by X ZR the response of a subject with marker assay R who is receiving treatment Z. Let /LZR and a~R denote the mean and variance of response, respectively, among patients with marker assay R and receiving treatment Z. For ease of computation we let p denote a combination of Z- and R-values, which we use for indexing. Finally we let 'Y denote the proportion of people in the general population who are truly marker-negative, that is, 'Y = P(D = 0). To relate Rand D, we let Asens = peR = liD = 1) denote the sensitivity of the assay in diagnosing marker-positive patients and Aspec = peR = OlD = 0) denote the specificity of the assay in diagnosing marker-negative patients. On the other hand, we let w+ = P(D = 11R = 1) denote the positive predictive value of the assay, and B_ = P(D = OIR = 0) denote the negative predictive value. By Bayes' rule,
w+=P(D=lIR=l) P(D = 1,R= 1) peR = 1) peR = liD = l)P(D = 1) - peR = liD = l)P(D = 1) + peR = liD = O)P(D = 0) Asens (1 - 'Y) - Asens(l - 'Y) + (1 - Aspech and similarly we have
B_ = P(D = OIR = 0)
=
Aspec'Y (1 - Asens)(l - 'Y) + Aspec'Y
Amy Laird, Xiao-Hua Zhou
112
General formula for variance To calculate the variance of patient response in one arm of a trial, we use the general formula
where Varp(E(Xlp)) = Ep(E(Xlp) - E(X))2 = ~all
kP(p = k)(E(Xlp) - E(X)?
and Ep(Var(Xlp))
4.1
=
~all kP(p
= k)Var(Xlp = k).
Traditional design
Note that since the traditional design does not involve a marker assay, the sample size calculation is independent of the assay properties. In the test for equality we test the null hypothesis of equality of means of patient response, Ho : /-LA = /-LB. The expected response in arm 1 of the trial, in which all patients receive treatment A is given by /-LA
= E(XIZ = A) = /-LAO"! + /-LAl(l- "!)
/-LB
= E(XIZ = B) = /-LBO"! + /-LBl(l- ,,!).
and similarly
We calculate the components of the variance of response using the general formula:
We obtain
and
See the Appendix for details of the calculations throughout this section. Therefore the number of patients needed in each arm of the trial to have a test of equality with type I error Q and type II error 1 - f3 are given by and so the total number of patients necessary is n
= nl + n2 = ('" + 1)n2.
Chapter 5
4.2
Study Designs for Biomarker-Based Treatment Selection
113
Marker by treatment interaction design
Let 1/1 denote the difference in means of response between treatments B and A within assay-positive patients, and I/o denote the difference in means of response between the treatments within assay-negative patients. As the primary question that this design addresses is whether there is a marker-treatment interaction, the null hypothesis under consideration is Ho : 1/1 = I/o. Since the quality of the assay determines in which group a given patient will be, the imperfection of the assay will have an impact on the sample size calculation in this design. The expected difference in response among assay-positive and assay-negative patients are given respectively by
1/1
= E(XIZ = B, R = 1) - E(XIZ = A, R = 1) = (/.LBI - /.LAdw+
+ (/.LBO -
/.LAo)(I - w+)
and
Hence we have
We calculate the variance of response among assay-positive patients as such:
Tf = Var(XIZ = B,R = 1) + Var(XIZ = A, R = 1), by independence, and using the general formula
Var(XIZ
= B,R = 1) = VarD(E(XIZ = B,R = I,D)) +ED(Var(XIZ = B,R = I,D)).
Hence the variances of patient response in each marker-treatment group are given by
+ (/.LBo - 1/81)2(1 - w+) + O"~lW+ + O"~o(I - w+) + (/.LBO - 1/81)20_ + O"~l (1 - 0_) + 0"~00I/Al?W+ + (/.LAO - I/Al)2(1 - w+) + O"~lW+ + O"~o(I - w+) I/Ad(I- 0-) + (/.LAO - I/Ado- + O"~l(I- 0_) + 0"~00-.
T11
=
(/.LBI - 1/81)2W+
T10
=
(/.L81 - I/Bd 2 (1 - 0_)
Til = (/.LAI Tio = (/.LAI -
Therefore we have 2 2 2 Tl = TB1 + TAl = [(/.L81 - 1/81)2
+ (/.LAl - I/Ad 2]w+ + [(/.LBO +[0"~1 + 0"~11w+ + [O"~o + O"~ol(I - w+),
1/81)2
+ (/.LAO
- I/Ar)2](1 - W+)
and in the same way, 2
TO
2 2 = TBO + TAO = [(/.LBI - 1/81? + (/.LAl
+[O"~l
+ O"~ll (1 -
0_)
- I/Adl(I - 0_)
+ [(/.LBO -
+ [O"~o + O"~olO-.
1/81)2
+ (/.LAO -
I/Al)2]e_
Amy Laird, Xiao-Hua Zhou
114
In a large trial, if each patient screened for the trial is randomized, we may expect that the proportion of assay-positive patients in the trial reflects the proportion of assay-positive people we would see in the general population. Alternatively, an investigator may wish to have balanced marker status groups, in which case the prevalence of the marker in the trial is 0.50. In the second case the number of patients needed to be randomized in the trial is, as in [5],
so that the total number needed is n
4.3
= n1 + n2 = 2n2.
Marker-based strategy design
Let lim denote the mean of response in the marker-based arm (M = 1), and lin denote the mean ofresponse in the non-marker-based arm (M = 2). In the test for equality we test the null hypothesis Ho : lim = lin. Note that the imperfection of the assay will affect the assignment of patients to treatments in the marker-based arm, and hence the sample size. The mean of response in the marker-based arm is given by
lim
= E(XIM = 1) = [/tBlW+ + /tBo(l - W+)][Asens(1- "Y) + (1 - Aspech] +[/tA1(1- 0_)
+ /tA oO_][l- Asens(1- "Y) -
(1 - Aspech].
We observe that lin is ItA from the traditional design:
We calculate the variance of response in the marker-based arm using the general formula:
7~
= Var(XIM = 1) =
+ ER[Var(XIM =
VarR[E(XIM = 1, R)]
1, R)].
Then using properties of the study design, we have
VarR[E(XIM
= 1, R)]
= [/tBlW+ + /tBo(l - w+)]2[Asens(1- "Y + (1 - Aspech] +[/tAl(l - 0_)
+ /tA oO_]2[(1- Asens)(l - "Y) + Aspec"Y]
and
ER[Var(XIM =
=
[(/tBl - lIBl)2W+
1, R)]
+ (/tBO -
lIBl)2(1 - w+)
+ (1 - Aspech] +[(/tBl - lIB1)2(1 - 0_) + (/tBO *[(1 - Asens)(l - "Y) + Aspec"Y].
+ 0'11W+ + 0'10(1- W+)]
*f.~sens(1- "Y)
- lIBl)20_
+ 0'11 (1 -
0_)
+ 0'10 0-1
Chapter 5
Study Designs for Biomarker-Based Treatment Selection
115
Now, we observe that the variance of response in the non-marker-based arm is T1 from the traditional design: T;
= T1 = ,(1-,)(/LAO - /LAd 2 + ,O"~o + (1-')0"~1'
Hence, the number of patients needed in each of the trial is and so the total number of patients needed is n
4.4
= nl
+ n2 = ('x + l)n2'
Modified marker-based strategy design
We again let Vm denote the mean of response in the marker-based arm (M = 1) and we now let Vnr denote the mean of response in the non-marker-based arm (M = 2). In the test for equality we test the null hypothesis Ho : Vm = Vnr . Note that the imperfection of the assay will affect the assignment of patients to treatments in each of the arms, and hence the total sample size. Note that Vm is the same as in the marker-based strategy design. We calculate the mean Vnr of response in the non-marker-based arm to be Vnr
= E(XIM = 2) 1
= '" + 1 [(1 -,)(/LBI
+ "'/LAd + ,(/LBO + "'/LAO)],
and we calculate the variance of response in this arm using the general formula: T;r
= Var(XIM = 2) = Var[E(XIM = 2, Z)] + E[Var(XIM = 2, Z)],
which gives T;r
=
_1_[[/LBl(1-,) ",+1
+ /LBo,J 2 + "'[/LAl(1 -I) + /LAo,f + T~ + M1J·
Hence, the number of patients needed in each arm of the trial is and
n2
=
(Zl-<>/2
so the total number of patients needed is n
4.5
2
2 + Zl-,B) 2 (T + Tnr) 7
(Vm
= nl
- Vnr )2
'
+ n2 = (,x + l)n2'
Targeted design
The means and variances of response in the targeted design are exactly VBl, VAl, T~l' and T1l from the marker by treatment interaction design:
+ (/LBO - /LAo)(1 - W+) /LAd(1 - (L) + (/LBO - /LAO)eVBl)2w+ + (/LBO - vBl)2(1 - W+) + O"~lW+ + O"~o(1- W+) VAd 2w+ + (/LAO - vAd 2(1- W+) + O"~lW+ + O"~o(1 - w+).
VBl = (/LBI - /LAl)W+ VAl = (/LBI T~l = (/LBI T1l = (/LAI -
Amy Laird, Xiao-Hua Zhou
116
To reject the null hypothesis Ho : VEl = of patients needed in each arm is
V Al
using the test for equality, the number 2
and
n2
=
(ZI-a/2
+ Zl-,B?(~ + Tin) (VEl - VAr)2
and the total number necessary in the trial is n = nI + n2 = (r;, + 1)n2. Our sample size computations pertain only to the test of equality of means, and we would have attained different formulae had we used a different type of hypothesis test.
5
Numerical comparisons of efficiency
We calculated the efficiency of each "alternative" design relative to the traditional design with regard to the number of patients needed to be randomized for each design to have a test of equality with type I error rate DO = 0.05 and type II error rate 1-,8 = 0.80. We evaluated this quantity, the ratio of the number of patients in the traditional design to the number in the alternative design, as a function of the true prevalence (1- ')') of the marker among patients in the population of interest (on the x-axis). In this calculation we considered various values of the sensitivity and specificity of the assay, and the size of the treatment effect for marker-negative patients relative to marker-positive patients. Specifically, we evaluated this quantity for each combination of sensitivity and specificity equal to 0.6, 0.8, and 1.0, and for the cases in which there is no treatment effect in marker-negative patients, and that in which the treatment effect for marker-negative patients is half that of the marker-positive patients. As in [3] and [5], there is no loss of generality in choosing specific values for the means of response. We present results with the following values: Scenario 1:
/-LBI =
2
1, /-LBO
0, /-LAI
=
2
2
= 2
aBI = aBO = aAI = aAO
Scenario 2:
/-LBI =
1, /-LBO
=
0.5, /-LAI
2 2 2 2 a BI = aBO = a Al = a AD
0, /-LAO 1 = ; =
=
0, /-LAO
0, =
0,
1
= .
We assumed here the variance of patient response was constant across all marker and treatment subgroups. Results shown in Figures 6-9 are alterations of those in
[5].
5.1
Marker by treatment interaction design
When there is no treatment effect among marker-negative patients, relative efficiency depends heavily on marker prevalence: for low prevalence, the interaction design is more efficient unless the assay is very poor, while for high prevalence, the traditional design is more efficient. When the treatment effect among markernegative patients is half that of marker-positive patients, the interaction design
Chapter 5 Study Designs for Biomarker-Based Treatment Selection
117
requires a very large number of patients, and the traditional design is much more efficient. Recall that in this calculation we have assumed balanced marker subgroups. Results are very similar if the proportion of marker-positive patients included in the trial reflects the proportion of marker-positive people we would find in the general population, as seen in [5].
5.2
Marker-based strategy design
When there is no treatment effect among marker-negative patients, we see that the traditional design is at least as efficient as the marker-based strategy design, and that the efficiency has very little dependence on the marker prevalence. When the assay is perfectly sensitive, the two designs require the same number of patients, regardless of the specificity. When the assay has imperfect sensitivity however, the traditional design requires fewer patients. On the other hand, when the treatment effect among marker-negative patients is half that of marker-positive patients, the traditional design requires fewer patients regardless of the properties of the assay, and the efficiency depends heavily on the marker prevalence. These results are not surprising since the treatment effect is diluted in the marker-based strategy design.
5.3
Modified marker-based strategy design
The modified marker-based strategy design is much less efficient than the traditional design in each of the situations in the simulation. When there is no treatment effect among marker-negative patients, marker prevalence has almost no bearing on the relative efficiency, while prevalence and efficiency have a more complex relationship in the case where the treatment effect among marker-negative patients is half that of marker-positive patients. As in the marker-based strategy design, the treatment effect is diluted in the modified marker-based strategy design relative to the traditional design.
5.4
Targeted design
The targeted design requires fewer patients to be randomized than the traditional design for every combination of prevalence, sensitivity, and specificity in each of the two scenarios. This result is what we might expect since the targeted design includes only those patients for whom we expect to see a large treatment effect. When there is no treatment effect in marker-negative patients, the relative efficiency gain for the targeted design is especially pronounced, particularly when the sensitivity and specificity of the assay are close to one. The efficiency gain for the targeted design is also greater when the true marker prevalence is low; when the prevalence is 100%, the two designs are identical, and very little efficiency is gained from the targeted design for a marker with a high prevalence in the population. When the treatment effect among marker-negative patients is half that of marker-positive patients, these effects are subdued due to the decreased
Amy Laird, Xiao-Hua Zhou
118
ability of the marker to divide patients into groups of sharply-differing treatment effect; the marker has smaller predictive value. Not surprisingly, there is very little efficiency gain for the targeted design when the assay is poor.
5 4 3
2
5
0
+
4
'0
+0
3
+0
2
'+0
·+°0
';<0"000 0
-.tt..++
0
5
+ 0
4
+0
,
+0
.,
0
+0 +0 0 + 0
.
+ 00• .....+++-\1, '. ••• ++ttt+++
00 0
00
0 0.0 0.2 0.4 0.6 0.8 1.0 proportion of D+
Specificity = 0.6
Specificity = 0.8
Specificity = 1.0
0 +
proportion ofD+
proportion of D+
0
2
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0
5
5
4
4
4
3
3
2
2
2
0
Specificity = 0.6
Specificity = 0.8
Specificity = 1.0
~illm 0.0 0.2 0.4 0.6 0.8 1.0 proportion ofD+
0
-",,"liMIiliiilii~
0.0 0.2 0.4 0.6 0.8 1.0 proportion of D+
0
•••• "IiIIilMOOOWOQW
0.0 0.2 0.4 0.6 0.8 1.0 proportion of D+
Figure 6: Ratio of number randomized for Traditional versus Marker by Treatment Interaction Design in the case of balanced marker status groups. Upper panels: no treatment effect in marker-negative group. Lower panels: treatment effect in markernegative group half that of marker-positive group. 0 Sensitivity = 1.0; + Sensitivity = 0.8; * Sensitivity = 0.6
6
Conclusions
In designing a trial for marker validation or for assessment of safety and efficacy of a new treatment, the scientific questions in which the investigators are interested should drive the choice of design and the type of hypothesis test. If more than one design would be appropriate to address these questions, then candidate designs may be compared on the basis of the number of patients necessary to carry out the trial. This chapter gave estimates of the number of patients needed for a trial with each of five designs assuming that the marker prevalence and properties of the biomarker assay can be estimated. If one candidate design gives a much lower estimate of the number of patients needed than the others, then this design may be preferred.
Chapter 5
Study Designs for Biomarker-Based Treatment Selection
Specificity = 1.0
0.8
0.4
•••••••••••••••••••
0.2
Specificity = 1.8
Specificity = 1.6
0.8
0.8
0.6
0.6
0.4
....................
0.2 0.0
119
0.4
I I I I II I I I I
.++++++++
....... "' .......... .
0.2
r-.,----.-.--..--,-'
0.0 'T--.--r--,,--..---...,J 0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0
proportion of D+
proportion of D+
proportion of D+
Specificity = 1.0
Specificity = 1.8
Specificity = 1.6
1.0-F=====! 0.8 0.6 0.4 0.2 0.0 0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0
proportion of D+
proportion of D+
proportion of D+
Figure 7: Ratio of number randomized for Traditional versus Marker-Based Strategy Design. Upper panels: no treatment effect in marker-negative group. Lower panels: treatment effect in marker-negative group half that of marker-positive group. 0 Sensitivity = 1.0; + Sensitivity = 0.8; * Sensitivity = 0.6
In this chapter we calculated the number of patients needed in a phase III randomized clinical trial involving a biomarker using the test for equality, assuming that the outcome of interest is on a continuous scale and that the biomarker is binary-valued. Calculations were given as a function of (1) the proportion of truly marker-positive patients included in the trial, and (2) the sensitivity and specificity of the biomarker assay. Calculations were based on those in [5], but equal variances of patient response across marker-treatment subgroups and 1:1 randomization ratios were not assumed. Ratios of the number of patients needed for each of the four alternative designs versus the traditional design, for specific values of the means and variances of response, were presented in graphical form. In the marker by treatment interaction design, if (1) the marker is relatively uncommon, (2) pilot studies indicate that the marker has good predictive value (that is, the treatment effect for marker-positive patients is much greater than that for marker-negative patients), and (3) the assay has high sensitivity and specificity, then this design will be more efficient than the traditional design. The marker-based strategies were less efficient than the traditional design in almost every set of conditions. These designs can be useful if (1) the marker
Amy Laird, Xiao-Hua Zhou
120
Specificity = 1. 6
Specificity = 1.8
Specificity = 1.0 1.0
1.0
1.0
0.8
0.8
0.8
0.6
0.6
0.6
0.4
0.4
0.2
OOOOOOOOOOOOooo
....................
0.2
+++++++t+++++++++++
0.0
0.4 '000000000000000
0.2
...................
0.0
0.0
Specificity = 1.6
Specificity = 1.8
Specificity = 1.0 1.0
1.0
1.0
0.8
0.8
0.8
0.6
0.6
0.6
0.4
0.4
0.4
0.2 0.0
6
b.,
00
0.2
0000
~~ttt!!!
0.0 0.2 0.4 0.6 0.8 1.0 proportion of D+
0.0
000
000
.~+++++ ..... 0.0 0.2 0.4 0.6 0.8 1.0 proportion ofD+
.....................
0.0 0.2 0.4 0.6 0.8 1.0 proportion of D+
0.0 0.2 0.4 0.6 0.8 1.0 proportion of D+
0.0 0.2 0.4 0.6 0.8 1.0 proportion of D+
0000000000000000000 111111111111+++++++
'1'1'11111'111111++
0.2
00000000
0.0
~!~ 0.0 0.2 0.4 0.6 0.8 1.0 proportion ofD+
Figure 8: Ratio of number randomized for Traditional versus Modified Marker-Based Strategy Design. Upper panels: no treatment effect in marker-negative group. Lower panels: treatment effect in marker-negative group half that of marker-positive group. 0 Sensitivity = 1.0; + Sensitivity = 0.8; * Sensitivity = 0.6
has many levels, (2) there are more than two treatments in the trial, or (3) it is unethical to administer certain treatments to patients with a certain marker status. The targeted design is more efficient than the traditional design in almost every set of circumstances. The efficiency gain is especially pronounced when the marker has good predictive value or when the marker is relatively uncommon. In each of these alternative designs we must also consider the number of patients needed to be screened for the trial to randomize the desired number. In the case of the targeted design, if the assay is quite expensive or invasive, the benefit of increased efficiency may be compromised. The efficiency graphs are presented only as rough guidelines, as each design addresses a different set of scientific questions. The traditional and targeted designs are for assessment of safety and efficacy of the new treatment, while the marker-based designs and the marker-by-treatment interaction design are for marker validation; among each set, the hypothesis tests a different question. It is of primary concern to the investigator to choose a design that addresses the proper scientific question.
Chapter 5
Study Designs for Biomarker-Based Treatment Selection
Specificity = 1.0 5
Specificity = 0.8 • +0
4
•
•• •
2
• .!/~
• 9,
'-.
0 0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0
proportion of D+
•• ••e•
••••
......
•••
proportion ofD+
proportion of D+
Specificity = 0.8
Speciflcity = 0.6
4
4
3
3
2
2
0'r--.--r---r--r---r 0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0
proportion of D+
Specificity = 1.0
2
0
.+~
2
•••
0
o
4
• +0
• +0 • + .0 .+0
•e
3
Specificity = 0.6
5
•
4
4
121
0'r--.--r---r--r---r
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0
proportion ofD+
proportion of D+
Figure 9: Ratio of number randomized for Traditional versus Targeted Design. Upper panels: no treatment effect in marker-negative group. Lower panels: treatment effect in marker-negative group half that of marker-positive group. 0 Sensitivity = 1.0; + Sensitivity = 0.8; * Sensitivity = 0.6
Future studies could generalize this model to accommodate a binary-valued response or right-censored survival time, categorical or continuous marker values, or the presence of covariates. Additionally, other trial designs can be considered as they become available.
Acknowledgements Zhou's work was supported in part by U.S. Department of Veterans Affairs, Veterans Affairs Health Administration, Clinical Trial Development Award for Evaluating the Effectiveness of Treatment-Diagnosis Combinations. This research was also partially supported by the National Natural Science Foundation of China Overseas Young Scholar Cooperative Research Grant 30728019. Dr. Zhou is Director and Research Career Scientist in the Biostatistics Unit at the HSR&D Center of Excellence, VA Puget Sound Health Care System and Professor in the Department of Biostatistics at University of Washington. Amy Laird is a PhD graduate stu-
122
Amy Laird, Xiao-Hua Zhou
dent in the Department of Biostatistics at University of Washington. This paper presents findings and conclusions of the authors. It does not necessarily represent those of U.S. Department of Veterans Affairs.
Appendix A.I
Traditional design
The expected response in the arm of the trial in which all patients receive treatment A is given by f1,A = E(XIZ = A) = E(XID
= 0, Z
= A)P(D
= 0IZ =
A)
+E(XID = 1, Z = A)P(D = liZ = A)
= E(XID =
0, Z = A)P(D = 0)
+ E(XID = 1, Z = A)P(D = 1) = f1,Ao,
+ f1,Al(l -,)
and similarly the expected response for the treatment B arm is given by f1,B = E(XIZ = B) = f1,BO,
+ f1,81(l-,).
We calculate variances of response using the general formula: VarD(E(XIZ = A,D))
= ED(E(XIZ = A, D) - ED(E(XIZ = A, D)))2 = ED(E(XIZ = A, D) - E(XIZ = A))2 = ED(E(XIZ = A, D = 0) - E(XIZ = A))2 P(D = 0IZ = A) + ED(E(XIZ = A, D = 1) - E(XIZ = A))2 P(D = liZ = A) = E(E(XIZ = A, D = 0) - E(XIZ = A))2 P(D = 0) + E(E(XIZ = A, D = 1) - E(XIZ = A))2 P(D = 1)
= (f1,AO =
- f1,A?'
+ (f1,Al -
f1,A)2(1 -,)
(f1,Al - f1,AO)2,(1- ,),
and ED(Var(XIZ
= A,D)) = Var(XIZ = A,D = O)P(D = 0IZ = A) + Var(XIZ = A,D = l)P(D = liZ = A) = Var(XIZ = A, D = O)P(D = 0) + Var(XIZ = A,D = l)P(D = 1) = aio, + ail (1 - ,).
Hence the variance of response in the arm receiving treatments A and B are given respectively by
71 =
Var(XIZ = A) = 1'(1 -1')(f1,AO - f1,Ad 2 + l'aio
+ (1 -1')ail
Chapter 5 Study Designs for Biomarker-Based Treatment Selection
123
and
A.2
Marker by treatment interaction design
We calculate the expected difference in response among assay-positive patients: VI
= E(XIZ = B, R = 1) - E(XIZ = A, R = 1) = E(XIZ = B,R = 1,D = l)P(D = liZ = B,R = 1) +E(XIZ = B,R = 1,D = O)P(D = 0IZ = B,R = 1) -E(XIZ = A,R = 1,D = l)P(D = liZ -E(XIZ
=
A,R
= 1,D =
O)P(D = 0IZ
= A,R = 1) = A,R = 1)
= B, R = 1, D = l)P(D = 11R = 1) +E(XIZ = B, R = 1, D = O)P(D = OIR = 1) -E(XIZ = A, R = 1, D = l)P(D = 11R = 1) -E(XIZ = A, R = 1, D = O)P(D = OIR = 1)
= E(XIZ
= /-tBlW+
+ /-tBo(l
= (/-tBI - /-tAdw+
- w+) - /-tAIW+ - /-tAo(l - w+)
+ (/-tBO -
/-tAo)(l- w+),
and similarly
We calculate components of the variance of response as such: VarD(E(XIZ
= B,R =
1,D))
= B, R = 1, D))J2 = [E(XIZ = B,R = 1,D = 1) - VBl]2p(D = liZ = B,R = 1) +[E(XIZ = B,R = 1,D = 0) - vBlfp(D = 0IZ = B,R = 1) = [E(XIZ = B, D = 1) - VBl]2 P(D = 11R = 1) +[E(XIZ = B, D = 0) - VBl]2 P(D = 0IR = 1) =
E[E(XIZ = B, R = 1, D) - ED(E(XIZ
=
(/-tBl - VBl)2W+
+ (/-tBO -
vBl)2(1 - w+),
and
= B, R = 1, D)) = Var(XIZ = B,R = 1,D = l)P(D = liZ = B,R = 1) +Var(XIZ = B,R = 1,D = O)P(D = 0IZ = B,R = 1) = Var(XIZ = B,D = l)P(D = 11R = 1) +Var(XIZ = B,D = O)P(D = 0IR = 1) = 0-11W+ + 0-10(1- W+). ED (V ar(XIZ
Amy Laird, Xiao-Hua Zhou
124
We calculate variance of response among assay-positive patients: 2 Tl =
2
2
TBI + TAl = (/-lBl - VBl)2W+
+ (/-lBO - vBl)2(1 - W+) + a~lW+ + a~o(1 - W+) +(/-lAl - VAI)2w+ + (/-lAO - VAl?(I- W+) + a~lW+ + a~o(l- W+) = [(/-lBl - VBl)2 + (/-lAl - VAl)2)w+ + [(/-lBO - VBl)2 + (/-lAO - vAI)2)(I- W+) +[a~l + a~l)W+ + [a~o + a~o)(I- W+).
In the same way, the variance of response among assay-negative patients is: 2
2
2
TO = TBO + TAO = (/-lBl - vBl)2(1 - (L) =
A.3
+ (/-lBO - VBl)2()_ + a~l (1 - ()-) + a~o() +(/-lAl - vAI)2(1 - ()-) + (/-lAO - vAI)2()_ + a~l (1 - ()-) + a~o [(/-lBl - VBl)2 + (/-lAl - VAl)2)(1 - ()-) + [(J.lBO - VBl)2 + (/-lAO - vAI)2)()_ +[a~l + a~d(1 - ()-) + [a~o + a~o]B_· Marker-based strategy design
We calculate the mean of response on Rand D: Vm
= E(XIM = = E(XIM =
Vm
in the marker-based arm by conditioning
1)
= 11M = 1) +E(XIM = 1, R = O)P(R = OIM = 1) = [E(XIM = I,R = I,D = I)P(D = llR = 1) +E(XIM = I,R = I,D = O)P(D = OIR = 1))P(R = 1) +[E(XIM = I,R = O,D = I)P(D = llR = 0) +E(XIM = 1, R = 0, D = O)P(D = OIR = O))P(R = 0) = [E(XIR = 1, D = I)P(D = llR = 1) +E(XIR = 1, D = O)P(D = OIR = 1))P(R = 1) 1, R
=
I)P(R
= 0, D = I)P(D = llR = 0) +E(XIR = 0, D = O)P(D = OIR = O))P(R = 0) = [/-lBlW+ + /-lBo(1 - W+)][Asens(1- ,) + (1 - Aspech) +[/-lAl(l- ()-) + /-lAo()_][I- Asens(1- ,) - (1 - Aspech). +[E(XIR
Using properties of the study design, we calculate the components of the variance of response in the marker-based arm as
VarR[E(XIM =
ER[E(XIM
= [E(XIM
=
= 1, R))
= 1, R) I,R
= 1) -
E(XIM 2
= 1))2
vm J P(R
= 11M = 1)
Chapter 5
Study Designs for Biomarker-Based Treatment Selection
125
+[E(XIM = 1, R = a) - Zlm]2 peR = aiM = 1) = [E(XIM = 1, R = 1, Z = B) - Zlmf peR = 1) +[E(XIM = 1, R = a, Z = A) - Zlm]2 peR = OIM = 1) = [I1BlW+ + I1BO(l - w+)]2[Asens(1- ,) + (1 - Aspech)] +[I1Al(l- B_) + I1AoB_f[l- Asens(l- ,) - (1- Aspech] and
ER[Var(XIM = 1, R)] = Var(XIM = 1, R = l)P(R = 11M = 1) +Var(XIM = 1, R = a)p(R = aiM = 1) = Var(XIM = 1,R= l)P(R= 11M = 1) +Var(XIM = 1, R = a)p(R = aiM = 1) = Var(XIM = 1,R = 1,Z = B)P(R = 11M = 1) +Var(XIM = 1,R = O,Z = A)P(R = OIM = 1) = Var(XIM = 1,R = 1,Z = B)P(R = 1) +Var(XIM = 1,R = O,Z = A)P(R = 0) = [(I1Bl - ZlBl)2W+ + (I1BO - lIBl)2(1- w+) + (J"~lW+ + (J"~0(1- W+)] *[Asens(1- ,) + (1 - Aspech] +[(I1B1 - ZlBl)2(1 - B_) + (I1BO - lIBl )2B_ + (J"~l (1 - B_) + (J"~oB_] *[(1- Asens)(l- ,) + Aspec,]'
A.4
Modified marker-based strategy design
We calculate the mean of response ZlnT in the non-marker-based arm by conditioning on Z and D and using properties of the design:
ZlnT = E(XIM = 2) = Ez[E(XIM = 2, Z)] = E(XIM = 2, Z = B)P(Z = BIM = 2) +E(XIM = 2, Z = A)P(Z = AIM = 2) = En[E(XIM = 2, Z = B, D)]P(Z = BIM = 2) +ED[E(XIM = 2, Z = A, D)]P(Z = AIM = 2) = [E(XIM = 2,Z = B,D = l)P(D = 11M = 2,Z = B) +[E(XIM = 2, Z = B, D = a)p(D = aiM = 2, Z = B)]P(Z = B) +[E(XIM = 2,Z = A,D = l)P(D = 11M = 2,Z = A) +[E(XIM = 2, Z = A, D = a)p(D = OIM = 2, Z = A)]P(Z = A)
Amy Laird, Xiao-Hua Zhou
126
= B,D = l)P(D = 1) + [E(XIZ = B,D = O)P(D = O)]P(Z = B) +[E(XIZ = A, D = l)P(D = 1) + [E(XIZ = A,D = O)P(D = O)]P(Z = A)
= [E(XIZ
=
1
[/lB1(1-')')
~
+ /lBO')'] ~ + 1 + [/lAI(l -')')/lAO')'] ~ + 1
1
= --1 [(1 -')')(/lBI ~+
+ ~/lAI) + ')'(/lBO + ~/lAO)].
The components of the variance of response in the non-marker-based arm are given by Var[E(XIM
= 2, Z)] = Ez[E(XIM = 2, Z) - E(XIM = 2)]2 2 = [E(XIM = 2, Z = B) - lInr ] P(Z = BIM = 2) +[E(XIM = 2, Z = A) - lInrf P(Z = AIM = 2) = [E(XIM = 2, Z = B) - lInr ]2 P(Z = B) +[E(XIM = 2, Z = A) - lInr ]2 P(Z = A) =
[/lBI (1 -')')
1 () ]2 ~ + /lBO')'] 2 ~ + 1 + [/lAI 1 - ')' + /lAO')' ~ + 1
= ~ ~ 1 [[/lB1(1- 'Y) + /lBO,),]2 + ~[/lAI(l
- 'Y)
+ /lAO'Y]2],
and E[Var(XIM
= 2, Z)] = Var(XIM = 2, Z = B)P(Z = BIM = 2) +Var(XIM = 2, Z = A)P(Z = AIM = 2) = Var(XIZ = B)P(Z = BIM = 2) +Var(XIZ = A)P(Z = AIM = 2) 2
1
2
~
=rB~+l +rA~+l =
r~
+ ~r1
~+1
References [1] Cagir, B. et al. Guanylyl Cyclase C Messenger RNA is a Biomarker for Recurrent Stage II Colorectal Cancer. Annals of Internal Medicine 1999; 11 131: 805-812. [2] Chow, S. C., Shao, J., Wang, H. Sample Size Calculations in Clinical Research. Taylor & Francis 2003. [3] Maitournam, A. and Simon, R. On the Efficiency of Targeted Clinical Trials. Statist. Med. 2005; 24:329-339. [4] Mandrekar, S., Grothey, A., Goetz, M., Sargent, D. Clinical Trial Designs for Prospective Validation of Biomarkers. Am J Pharmacogenomics 2005; 5 5:317-325. [5] Young, K. Efficiency of Clinical Trial Designs with the Use of Biomarkers. UW MS thesis, 2007.
Chapter 6 Statistical Methods for Analyzing Two-Phase Studies Jinbo Chen *
Abstract In this paper, we review recent statistical method developments for analyzing two-phase epidemiology studies. For the identification of risk factors related to a human disorder, a two-phase design is an attractive option to reduce the cost for measuring regression variables. In this design, the complete set of variables is only collected on a subset of judiciously selected subjects, while only less costly variables are available on the remaining subjects in a case-control or cohort sample. Widely used two-phase designs include twophase case-control, case-cohort or stratified case-cohort, nested case-control or counter-matching designs. Two-phase studies have two important features: the selection of subjects for measuring expensive variables depends on the regression outcome and/or covariates, and the selection probability is determined a priori by the investigator. Development of statistically and computationally efficient methods for analyzing two-phase data has been a very active area of research in the past decade. These methods differ in statistical efficiency for incorporating incomplete data incurred by study design, in required statistical assumptions, and in practical considerations. We review the analysis methods for analyzing two-phased studies, including estimating-equation based approaches, pseudo-likelihood approaches, and maximum likelihood approaches. We conclude the paper with discussions on extensions of two-phase studies and future directions.
1
Introduction
The two-phase epidemiology study design was initially proposed as a strategy to reduce the cost for measuring confounding variables in a cross-sectional sample when the disease status and exposure of interest are available on a large number of subjects (White, 1982). When the proportion of diseased or exposed subjects is low, one can choose to measure confounding variables on all cases and all who were exposed but only a small fraction of non-exposed controls. The study thus is *Department of Biostatistics and Epidemiology, University of Pennsylvania, School of Medicine, Philadelphia, PA 19104, USA. E-mail: [email protected]
127
128
Jinbo Chen
conceptually conducted in two "phases": in phase I, the disease outcome and exposure variable are collected on all subjects, and in phase II, confounding variables are measured on a selected subset. Although the crude odds ratio (OR) estimates using only fully measured subjects are usually biased, appropriate analysis of the combined data from both phases allows one to obtain an unbiased estimate by appropriately adjusting for the sampling in the analysis. An appropriately designed two-phase study could yield parameter estimates nearly as efficient as a single phase study with the same number of subjects as phase I but at a much lower cost. Here "cost" may refer to monetary expenses, to resources that are not able to replenish easily (such as blood sample), and etc. Consequently, the twophase design has been recognized as a general strategy for reducing study cost. Inexpensive measurements are obtained for all study subjects in the first phase, and more costly measurements are obtained for a subsample of phase I subjects (phase II subjects). We refer to measurements obtained for all study subjects as phase I variables and those only for phase II subjects as phase II variables. Besides as a cost-effective strategy to collect expensive confounding variables for refining the analysis of outcome-exposure association, two-phase studies have also been applied to assess the effect of expensive exposures that are difficult to collect on all study subjects. In the former, phase I data consist of exposures of interest, outcome measurement, and possibly some auxiliary information, and phase II data consist of expensive confounding variables. In the latter, phase I data consist of confounder variables, outcome measurement, and auxiliary information, and phase II data consist of expensive exposures of interest. The data from two-phase studies thus is incomplete by design. Consequently, statistical methods for analyzing two-phase data can often be applied to analyze incomplete data that occur by happenstance. Phase I data then consist of completely collected variables, and phase II data consist of incomplete variables. The central theme for the methodology research on analyzing two-phase data is how to efficiently use information from both phases to obtain consistent estimates of regression parameters. The motivation for improving the estimation efficiency is perhaps the keenest when the exposure of interest is incomplete or when one is interested in interactions between phase I and phase II variables. Efficient design and analysis of two-phase studies has been a very active area of research in recent years. Strategies for the design and analysis of two-phase studies depend on phase I study design, the outcome variable of interest, and the sampling method for the phase II subset. For the study of binary outcomes, the two-phase case-control and cross-sectional designs have been widely studied (Breslow and Holubkov, 1997; Scott and Wild, 1997). For the study of time-toevent outcomes, case-cohort and nested case-control designs, and their stratified version, exposure-stratified case-cohort and counter-matching designs, have been widely applied (Prentice, 1986; Thomas et aI., 1977; Borgan et aI., 2000). In all these designs, phase II subjects can either be obtained by Bernoulli sampling or by finite population sampling. In Bernoulli sampling, an indicator variable for being sampled is generated for each phase I subject based on a known sampling function defined by phase I variables. 
The sampling across subjects is jointly independent. In finite population sampling, a subset of fixed size are selected
Chapter 6
Statistical Methods for Analyzing Two-Phase Studies
129
without replacement. In stratified two-phase studies, the sampling is performed within strata defined by phase I variables. The method development for analyzing the two-phase data has focused on the integrated analysis of phase I and phase II data to improve the precision of consistent estimates of parameters that quantify the association between the outcome variable and covariates. These methods are largely similar irrespective of methods for phase II sampling (e.g., Breslow and Wellner, 2007). We will not distinguish Bernoulli and finite population sampling in this paper. The analysis of two-phase studies is a special missing-data problem, where the missingness is by design and thus is noninformative (missing completely at random or missing at random; Little and Rubin, 1982). Many methods for dealing with missing covariates can be applied to the analysis of two-phase studies. However, due to the fact that the missingness is by design, that the missingness is monotone, and that the missing probability is known a priori, many methods have been developed to specifically cater the two-phase design. We will review these methods. Below we list several example studies. A two-phase case-control study of childhood cancer Engels et al. (2004) examined whether simian virus 40 (SV 40), an animal carcinogen that contaminated poliovirus vaccine before year 1963, could cause human cancer. They studied whether the SV40 infection during pregnancy is associated with higher childhood cancer risk using data from the US-based Collaborative Perinatal Project (CPP) between years 1959 and 1966. They identified 52 eligible cancer cases from the total 54,796 children and selected 200 controls from the CPP according to their vaccination status and time of receiving the poliovirus vaccine in a way to increase the number of controls who acquired SV40 infection during pregnancy. A case-cohort study of type 2 diabetes Thorand et al. (2007) conducted a case-cohort study of type 2 diabetes within two large cohort studies conducted between 1984 and 2002, the population-based Monitoring of Trends and Determinants in Cardiovascular Disease and Cooperative Research in the Region of Augsburg. There were 7,396 eligible cohort members whose ages were between 35 to 74 years. A case-cohort study of type 2 diabetes included 527 incident cases and 1,698 noncases. The investigators adopted a case-cohort design because they were also interested in studying other outcomes such as cardiovascular diseases. A nested case-control study of Colorectal Cancer Wu et al. (2007) studied the association between Plasma 25-Hydroxyvitamin D Concentrations and the risk of colorectal cancer using data from the Health Professions Follow-up study between the date of blood collection (1993-1995) and January 31, 2002. The medical data, life style factors, and nutrition information from food frequency questionnaire, were collected at baseline and updated every two years. Each of the 179 incident colorectal cancer case patients was matched to two control subjects on age (within 2 years), year (same year) and month (within 1 month) of blood donation. A counting-matching study of breast cancer Largent et al. (2007) used the countermatching design to study the risk of contralateral breast cancer in relation to the reproductive history. From five population-based tumor registries, they identified
linbo Chen
130
694 complete counter-matched triplets: two women who had unilateral breast cancer (controls) and one women with bilateral breast cancer (case). The two controls and the case were matched on year of birth, year of diagnosis, registry region, and race, and were counter-matched on radiation exposure in a way that two were exposed to radiation therapy and the other was not with each matched triplets. The rationale for the study design was detailed in Bernstein et al. (2004).
2
Two-phase case-control or cross-sectional studies
For the study of a binary outcome, phase I data can be collected either from a retrospective case-control study or from a cross-sectional/prospective study. Statistical methods for the analysis are largely similar for the two types of studies, in the same sense as the well-known result that a prospective logistic regression analysis of the case-control data yields unbiased and the most efficient odds ratio estimates (Prentice and Pyke, 1979). We will focus on two-phase case-control studies. They adopt complex sampling schemes to improve cost efficiency by intentionally increasing the variation in exposure variables in the collected case-control sample (Breslow, 1996). Phase I data usually include disease status D, some covariates X and/or a collection of variables W that are correlated with phase II variables E. Denote S = (X, W). Then based on D and S, a subset of phase I subjects are selected for the measurement of E. Data can be summarized as the total number of phase I and phase II subjects with outcome D = i in each stratum defined by S, and phase II covariates E sampled from the conditional distribution
p(EID,S). We describe the data from a general two-phase study by a simple hypothetical example illustrated in Table 1, where E denotes exposure of interest. In phase I, a joint classification table of D and confounder X is available, but exposure E is unobserved. The joint classification of D and X then results in four phaseI strata. In the second phase, a subset of phase I subjects are selected within each of the four cells defined by D and X. In this example, a balanced design (Breslow and Cain, 1988) was adopted, and 200 subjects were selected within each cell for the measurement of exposure E. The bold-faced letters in the table are the observed counts for phase I and phase II subjects, and the normal letters represent unobserved counts. If the full data were observed (normal letters), a standard logistic regression analysis would yield an OR estimate 1.75 for X and 2.00 for E. If only phase II data is analyzed in the same way but with two-phase sampling naively ignored, the two OR estimates would be 0.71 and 2.19. The OR estimate for confounder X is obviously biased because the distribution of X in phase II cases and controls, due to selection, is distorted compared to that in the full sample. This simple example highlights two central challenges for analyzing data from two-phase studies: to account for the biased sampling design and to combine the information available at both phases efficiently for the estimation of parameters of interest. In the example above, exploiting all phase I data on D and X will not only be helpful for the efficient and unbiased estimate of the X effect
Chapter 6
Statistical Methods for Analyzing Two-Phase Studies
131
and interaction between X and E, but more importantly can also help improve the efficiency for estimating the OR for E because X is correlated with and thus informative of unmeasured E. Of course the precision of odds ratio estimate for X is not of interest in this example. Table 1: Data from an exemplary two-phase case-control study X=1
X=O
E - 1
E - 0
Phase II numbers
D=1 D=O
444 a /149 b 213/112
Phase I numbers
126/51 144/88
570 357
200 200
D= 1
130/58 130/33
300/142 513/167
430 643
200 200
D=O
a: unobserved phase I counts; b: observed phase II counts.
For two-phase studies of binary outcomes, the analysis model of choice is usually the logistic regression model. The methods of analysis include maximum likelihood estimation (MLE) when phase I data S are discrete (Breslow and Holubkov, 1997; Scott and Wild, 1997), pseudo-likelihood approach (Breslow and Cain, 1988; Schill et al., 1993), and estimating equation approaches (Flanders and Greenland, 1994; Robins et al., 1994). We review these methods in separate subsections below. In the rest of this section, we define a binary variable R with R = 1 indicating that a subject is selected into phase II and thus has complete data for X and E. Suppose nl cases and no controls are sampled in phase I. The total number of phase II subjects is L:~ino rio Let (X, W) denote the variables observed for all phase I subjects (phase I variables, not including D) and E denote the variables only observed for phase II subjects (phase II variables). Let Z = (X, E) be the regression covariates. The logistic regression model is specified as
logit p(D
= liZ) == logit P/3(Z) = f30 + f3~X + f3~E.
(2.1)
Thus, we have assumed that phase I variables W, which can include auxiliary information on E, are independent of the outcome variables given X and E. We will frequently use Pi(Z) as p(D = iIZ). Let f3 indicate all odds ratio parameters (f3x, f3e) except for the intercept f30. Let 7f(D, X, W) == p(R = liD, X, W) be the probability that a subject is selected into phase II. This probability is controlled by the investigator and thus known a priori. We will see later that imposing a model for 7f could be useful for improving the precision of estimation, and the sampling by design ensures that we can specify a "correct" model for 7f. Most two-phase methods require 7f(D, X, W) > 0 and that the number of selected subjects in each sampling strata is not too small in order for the asymptotic properties to apply well in a finite sample. Efficient analysis methods have also been developed when only a stratified sample of cases or controls, but not both, are selected into phase II (Chatterjee, Chen, and Breslow, 2003).
linbo Chen
132
2.1
Estimating-equation approaches for analyzing two-pha se case-control studies
We illustrate this class of methods assuming a case-control design for phase I. If E were observed for all phase I subjects, then the data can be analyzed by fitting model (2.1) as if the data were prospectively collected, except that a different intercept parameter other than (30 is used (Prentice and Pyke, 1979). Let p*(DIZ) be a function that is the same as p(DIZ) except that (30 is replaced by a different value, (30, and let pi(Z) denote p*(D = liZ). If all subjects had observations on Z, the MLE of ((30,(3) can be obtained by maximizing the loglikelihood function Li logp*(dilzi ), and the contribution of subject i to the score function is (1, zd {d i - pi( Zi)}. When only phase II subjects have complete data, the summation of their score contributions, Li ri(l, Zi)' {d i - pi(zi)} does not have an expectation zero due to the fact that R depends on (D, X, W) and thus can not be used directly for estimation. The inverse probability weighted estimator, often referred to as Horvitz-Thompson (HT) estimator (Horvitz and Thompson, 1952), is a method for correcting the complete data score bias by weighting each individual summand by the inverse sampling probability. The resultant weighted estimating equation is as follows: (2.2) It is straightforward to verify that the expectation of the weighted estimating function on the left-hand side of the equation is equal to zero. Consequently, the solution to this equation leads to an unbiased estimate of ((30' (3). Denote it as
(/Jaw, /JW).
The above weighted estimator has been widely used, partly because it can be fitted using standard software. More importantly, if the true relationship between D and Z deviates from that specified by p(DIZ), the estimates would converge to the same large-sample limit as those obtained if one fitted the model (2.2) to all phase I subjects for whom Z were completely observed (Scott and Wild, 1986). Nevertheless, the robustness of this estimator is largely compromised by its unsatisfactory efficiency, which can be improved in two related ways described below. One way is that one replaces the known 7r in (2.2) by a maximum likelihood estimate. Suppose that one models 7r by a logistic regression model logit p(R
= liD, X, W; "() = g(Y, X, W; "(),
where g(Y, X, W; "() is a general relative risk function. Let i denote the MLE of "(, 7ri denote p(r = 1ldi , Xi, Wi; "(), and 7r7 denote p(ri = l!di , Xi, Wi; i). If X and Ware discrete and the sample size allows, one can fit a saturated model on (D, X, W). The new estimator obtained by solving the same equation as (2.2) but with 7r replaced by 7r-y in general has improved efficiency. Denote the resultant estimates as (/J~', /J'). Let Mi be the score function of "( for individual i, Mi = (1, di , Xi, Wi)'{ri - 7r7}, and let Ui = (1, zi)'{di - pi(zi)}. It can be ,
A
Chapter 6 Statistical Methods for Analyzing Two-Phase Studies
133
shown that that the influence function of (~~'Y, ~'Y) for individual i is equal to E{Z8pi(Z)/8(,80,jJ),}-1 multiplied by (2.3) This quantity is actually the projection of riUi/7ri into the space spanned by Mi and thus has smaller variance than riUi/7ri. This observation is best explained by comparing its variance formula var (:U) - cov (:U,
M) var-1(M)cov (~U, M) .
with the asymptotic variance of (~oW, ~W), 8pi(Z) E { Z 8(,80' jJ)'
}-l (R) { 8(,80' jJ)' }-l 8pi(Z)
var
7r
U
E
Z
Thus, using 7r1 instead of true 7r in equation (2.2) improved statistical efficiency for estimating (fJo,,8). In general, the richer the model p(R = I\D,X, W) is, the more efficient (~~'Y, ~'Y) is. The upper efficiency bound is then achieved when one fits a saturated model for 7r(D, X, W) when (X, W) are discrete or one could fit a nonparametric smoothing function for 7r(D, X, W) when (X, W) has continuous components. Interestingly, in these situations, the formula (2.3) is equivalent to r·
r· - 7r'
7ri
7ri
--.:pz - -'--'E(P\d·t, X·'t, w·) t
(2.4)
~
Mathematical details of these results can be found in Robins et al. (1994). An alternative approach to increasing the efficiency of (~OW, ~W) is to augment the estimating equation (2.2) by an additional term contributed by all phase I subjects (Robins et al., 1994): n
L
ri h(zi){ di -
~l~
pr (Zi)} + ri -
~
7ri E
[h(Zi){ di -
pr (Zi)} \di , Xi, Wi] =
0,
(2.5)
where h(z) is a general function of Z and indexes this class of augmented estimating equations. The full class of Robins et al. (1994) that includes the influence functions of all regular and asymptotically linear estimates used a general function of (D,X, W) instead of E[h(Zi){di - pHZi)}ldi,Xi,Wi] in (2.5), and the class (2.5) is an optimal subset for a given function h(Z){D - pi(Z)}. The member in class (2.5) corresponding to the one with h(Z) = Z turns out to be exactly the same as (2.4). Furthermore, the member in this class that has the smallest variance actually achieves the semiparametric efficiency bound (Bickel et aI, 1993). This efficient estimator is asymptotically equivalent to the MLE method (Breslow et al., 2000) described below when both are applicable, although the MLE can be easily obtained largely using any software for fitting standard logistic regression models.
134
Jinbo Chen
To actually perform estimation using a member in class (2.5), one needs to know the conditional distribution of p(ZIX, W). The integration over Z is involved, which could be complex if Z is continuous. An attractive feature of this class of estimating functions is that they always lead to consistent estimators even if p(ZIX, W) is mis-specified because 7r can always be correctly modeled in the twophase study. This observation has been termed as "double robustness" (Robins et al., 1994): one only needs to specify either 7r correctly or p(ZIX, W) correctly in order to obtain consistent estimates. In other words, estimates of (f3 f3) are still consistent even if one of them is mis-specified. As a result, one way to get around the specification of p(ZIX, W) is to specify a working model. The efficiency of estimating of (f3o, f3) is improved over the simple weighted estimator (~oW, ~W) if the working model is correct.
o,
2.2
Nonparametric maximum likelihood analysis of two-ph a se case-control studies
If phase I data is from a cross-sectional sample, the intercept parameter in the
model (2.1) can be consistently estimated. The simplest two-phase cross-sectional study is a case-control study supplemented by the total numbers of cases and and controls in the cohort within which the case-control study is conducted. It is pedagogical to outline the MLE method for this simple design via the profile likelihood method (Scott and Wild, 1989, 1991, 1997). Suppose that the first nl cases and no controls in phase I, the study cohort are selected into phase II, and let Nl and No be the total number of cases and controls in phase 1. The log likelihood function is a function of OR parameters f3 and covariate distribution g(z): nl+nO
L
1
{logp(dilzi ) + logg(zi)} +
i=O
L(Nj -
nj)
J
p(d = jlz)g(z) dz.
(2.6)
j=O
With a discrete Z with a small number of categories, one would naturally apply an EM algorithm (Dempster et al., 1987) to jointly estimate {f30, f3, g(Z)}. When Z is continuous, the standard EM algorithm would require a parametric model for Z, which is usually difficult to specify. In the nonparametric maximum likelihood estimation (NPMLE) approach, one does not specify a model for g(Z). Instead, a point mass /jk is placed to each distinct observed value of Z, Zk, then the nonparametric MLE of (f30, f3) can be jointly obtained by maximizing the likelihood with respect to (f30, f3, /jk) under the constraint l:k /jk = 1. The MLE of (f30, f3) can be solved via the profile likelihood approach: one first maximizes (2.6) only with respect to /jk to obtain 8k(f30, f3). The profile likelihood for (f30, f3) is obtained by replacing /jk with 8k(f30 , f3) in the likelihood function (2.6). With the logistic model (2.1), the profile likelihood of (f30, f3) actually has a closed form, so that the problem of estimating high dimensional parameters (f30, f3, /jk) reduces to that of estimating a Euclidean parameter (f30, f3). It turns out that the MLE solution can be obtained as if one fitted the following modified logistic regression model to a prospective sample: logit pm(D
= liZ) = f30 + logndNl -logno/No + f3Z.
Chapter 6 Statistical Methods for Analyzing Two-Phase Studies
135
The estimates can be obtained by simply fitting a standard logistic regression model to Z with an offset term 10gndNl -log no/No. The asymptotic variance for (3 can be obtained as the inverse of the Hessian matrix, except that the variance for the intercept needs to be subtracted by nIl + n 1 - NIl - N 0 1 . The result is readily generalizable to the situation where the first phase cases and controls are partitioned into S strata and the effect of each stratum is modeled separately in (2.1). There ni and Ni in pm(D = liZ) are simply replaced by the stratumspecific quantities, nis and N is . The correction needed for the asymptotic variance for each stratum-specific effect is obtained by subtracting that from the standard logistic fit of pm(D = liZ) by n Is1 + nos1 - N I / - N These results showed that in simple or stratified case-control studies, if one is only interested in estimates of the OR parameters for Z, {3, and their variances, one could safely ignore the offset. This conclusion is consistent with Prentice and Pyke (1979): the cohort case-control totals are only needed for the estimation of {30 and can otherwise be ignored in simple case-control studies. The MLE solution (Scott and Wild, 1997) for the more general situation, where covariates that are correlated with phase I variables (X, W) are modeled as continuous effects in (2.1), can be obtained similarly but with the offset above modified as
o
o/.
log{(nl - Ad/(N1 - Ad} -log{(no - Ao)/(No - Ao)}, where 1
Ai
=
ni
ni - LLpm(dij i=O j=l
=
ilzij).
Because Ai depends on ({30, (3), the estimation requires iteration, and NewtonRaphson algorithm was found to have better convergence property than the algorithm based on updating the offset. The variance estimates can still be obtained by a simple correction to the Hessian matrix from fitting the model pm(D = liZ). We refer the readers to the original article of Scott and Wild (1997) for the detailed formula. Lee (2008) formally proved that the above NPMLE solution achieves the semiparametric efficiency bound and thus is fully efficient. When phase I data is obtained from a case-control sample, the NPMLE solution to the estimation of {3 can be obtained via a constrained maximum likelihood approach (Breslow and Holubkov, 1997). The theoretical result, in comparison to those for prospective two-phase studies, parallels those for single-phase retrospective versus prospective studies by Prentice and Pyke (1979). For the logistic regression model, one could still fit the pseudo-model pm(D = liZ), except that the intercept parameter becomes {30, the same intercept parameter as for the single phase design in Prentice and Pyke (1979) results. The parameter estimates and their variances can be obtained in exactly the same way as those for the prospective two-phase studies.
136
2.3
Jinbo Chen
Weighted method, pseudo-likelihood method, and NPM LE method
Besides the weighted method and the NPMLE method, an easily obtainable pseud o-likelihood estimator has also been extensively studied in the literature (Breslow and Cain, 1988; Schill et al., 1993). A consistent estimate of (3 can be obtained by fitting the model pm(D = liZ) even if phase I variables used for selection are modeled as continuous effects instead of saturated sampling stratum effects in p(DIZ). Thus, when saturated stratum effects are indeed included in p(DIZ), the pseudo-likelihood estimator is fully efficient. Otherwise, it is less efficient than the NPMLE. But because it uses a fixed offset 10gndNl -log no/No in pm(D = liZ), this estimator does not require iteration for computation. Thus, the estimates from this approach can be used as initial values for the NPMLE estimates. A logistic regression model was assumed in the above discussion of all methods. The NPMLE method and the pseudo-likelihood method actually apply to a wider class of regression models called multiplicative intercept models (Scott and Wild, 1997). Of course the weighted method can apply to any regression model suitable for a binary outcome. The weighted method does not require phase I variables to be discrete, and it yields parameter estimates that would converge to the same quantity as those in a single phase study. The NPMLE, on the other hand, could provide much more precise estimates than the weighted method when both are applicable. Breslow and Holubkov (1997) and Breslow and Chatterjee (1999) provided a detailed comparison of the three methods.
3
Two-phase designs in cohort studies
Case-cohort and nested case-control designs are two commonly used two-phase designs for the study of censored outcome variables. Let T and C denote the failure and censoring times, and let Z denote the collection of explanatory variables for T. In the full cohort, an independent realization of {Y = min(T, C), D = J(T ~ C), Z} is observed for each subject. We make the standard assumption that T and C are independent given Z. We assume that the Cox proportional hazards model (CPH; Cox, 1972) is adopted for the analysis:
A(t; Z)
=
Ao(t) exp({3' Z),
(3.1)
where Ao(t) is a nonparametric baseline hazard function, and {3 is a vector of hazard ratio parameters of the same dimension as Z, say p. We assume that covariates Z do not vary with follow-up time, and we will discuss the relative merits of different methods for handling time-dependent covariates. When Z is observed for the full cohort, the semiparametric efficient estimation of hazard ratio parameters {3 is obtained by maximizing the partial likelihood function (Cox, 1972), the logarithm of which is written as
(3.2)
Chapter 6
Statistical Methods for Analyzing Two-Phase Studies
137
Subjects who survive at least by time Yi, {j : I(Yi ~ Yi), j = 1, ... ,n}, constitute the risk set at the observation time Yi. The partial likelihood function actually takes a familiar conditional likelihood format for matched case-control studies: each case is compared with all subjects in the risk set with respect to the exposure. This very fact motivates the case-cohort and nested case-control designs: the estimate of (3 would be unbiased if one compares the case only with a representative subset of the risk set. Equivalently, this fact also intuitively justifies that both case-cohort and nested case-control studies can yield estimates that converge to the same quantity as those obtained from the full cohort study. The case-cohort design takes a random subsample of the cohort at the beginning the study, and the risk set within this random subset at time Yi can then serve as the comparison group for a case that occurs at time Yi. The nested case-control design takes a random subset within the risk set for each case, resulting in a time-matched study. The stratified variants of the two designs are based on the same principle, but the subsample is selected stratified on variables observed for all cohort members (phase I variables). The choice of the two study designs mostly depends on practical concerns, as will be discussed later in the section. Methods for the estimation of parameters (3 using case-cohort or nested casecontrol data are largely based on modifying likelihood (3.2), or modifying the partial likelihood score function for (3, which is the summation of "observed" minus "expected" values of covariate Z at observed failure times Yi: n
U((3) =
L di {Zi - Zi((3)} ,
(3.3)
i=1
where Zi((3) is an estimate of E(ZIYi, di = 1): (3.4) Methods for the estimation of cumulative baseline hazard function Ao(t) = J AO(t) dt are largely obtained by extending the Breslow estimator (Breslow, 1972):
- ( .) _ ~ Ao y. - ~ 1=1
d1I(YI ~ Yi) .. 2:.;=1 I(Yi ~ Yl)e(3Zj
In the following, we discuss the estimation of both (3 and Ao(t), although less details are given for Ao (t). Similar to the two-phase case-control design, we define a binary variable R, with R = 1 indicating that a subject in the cohort is included in the case-cohort or nested case-control sample. Let S denote variables observed for the full cohort other than Y and D, where S could include a subset of Z. We make the standard assumption that A(tIZ, S) = A(tIZ), that is, phase I variables S are independent of the failure time T given Z. The probability of selection, p( R = 1), may depend only on fully observed information: on D alone (case-cohort design), on (D, S) (stratified case-cohort design), on (D, Y) (nested case-control design), or on (D, Y, S) (stratified or counter-matched nested case-control design).
138
3.1
Jinbo Chen
Case-Cohort and stratified case-cohort design
Analysis methods for the case-cohort data can also be largely classified into three classes as for two-phase case-control studies: weighted likelihood or weighted estimating equation approaches, the nonparametric maximum likelihood estimation (NPMLE) approach, and pseudo-likelihood approaches. The weighted estimators encompass methods proposed by Kalbfleisch and Lawless (1988), Chen and Lo (1999), Borgan et al. (estimator II, 2000), Chen (2001), Kulich and Lin (2004), Qi, Wang, and Prentice (2006), and they were called "D-estimators" in Kulich and Lin (2004). Each of these estimators corresponds to a member in a class of augmented weighted estimators proposed by Robins, Rotnitzky, and Zhao (1994). It is required that each subject in the cohort have a positive probability of being sampled. The pseudo-likelihood methods, including Prentice (1986), Self and Prentice (1988), Barlow (1994), and Borgan et al. (estimators I and III, 2000), were referred to as "N-estimators" in Kulich and Lin (2004). These two classes of methods differ in the way how cases are used in the analysis: the former includes all at-risk subjects among sampled cases and sub-cohort members at each failure time when forming weighted partial likelihood function, thus each case is included in all risk sets at and before its failure. The latter includes a case only in the risk set at the time this specific failure occurs. While theories for these two types of estimators can be developed along similar lines, it is instructive to keep them separate since the intuitive justifications of the unbiasedness are somewhat different. The weighted methods closely resembles to the corresponding ones for the two-phase case-control study. 3.1.1
Weighted likelihood/estimating equation approaches
Similar to those for the two-phase case-control design, the weighted likelihood/estimating equation approaches are based on modifying the full-data partial likelihood score function (3.3) by weighting the contributions from cases and sub cohort members with the inverse of their respective sampling probabilities. The proposed approaches differ in whether true or estimated weights were used and in whether and how one managed to incorporate phase-I data when computing weights. To make easy connection with the NPMLE approach, we first illustrate the method for obtaining the simple weighted estimator by maximizing a weighted likelihood (Kalbfleisch and lawless, 1988). Let Wi denote p(ri = 1ldi , Yi, Si)-l. Then usually Wi = 1 for cases. Under the standard assumption of independent censoring between T and C conditional on covariates Z, the part of the weighted log likelihood function for the observed data (Y, D, RZ, S) useful for the estimation of {,8, Ao(t)} can be written as n
L
riWi {
_e{3'zi Ao(t) + di ,8' Zi
+ di log AO(t)} .
i=l
If the cumulative hazard function Ao(t) is treated as a right-continuous step func-
tion with jumps only at the observed failure times, joint maximization of this
Chapter 6
Statistical Methods for Analyzing Two-Phase Studies
139
weighted likelihood over {,8, Ao(t)} leads to a weighted estimate of AO(Yi): (3.5) and a weighted estimating equation for ,8: (3.6) The second term in the bracket in (3.6) is a consistent estimate of E(ZIYi, di = 1), so that L(,8) is an asymptotically unbiased estimating function for,8. Existing methods differ mainly in ways of constructing the weight variable w. Let No be the total number of non-failures in the full cohort and no be that in the sampled subcohort. The method of Chen and Lo (1999) is equivalent to using fixed sampling probabilities 'ITi = di + (1- di)no/No. Borgan et al. (estimator II, 2000) extended this estimator to the exposure-stratified case-cohort design by replacing no/No by a stratum-specific value, with cases occupying a single stratum and control strata defined by different values of phase I variables S, thereby allowing the sampling of sub cohort to depend on the failure indicator D and S. Thus, when phase I variables S are available, the Borgan estimator is more efficient than Chen and Lo (1999) since phase I information S was effectively incorporated in the former through the sampling weight. Of course the Chen-Lo method is biased if the sampling of casecohort subsample was stratified on S. The estimator proposed by Chen (2001) for general case-cohort design is based on the idea of imputing unobserved Z values with local averages in the score function (3.3), when applied to the case-cohort design, although more efficient, turns out to be also similar to Chen and Lo (1999) except that no/No is replaced by a time-stratified version as follows. The follow-up time Y is divided into J subintervals (tj-I, tj), j = 1"" ,J. For a non-failure i in the sub cohort with Yi E (tj-I, tj), 'ITi is obtained as the ratio between the number of subjects who are censored in the sub- and full cohorts within (tj-I, tj). The number of time intervals could be chosen in a way that the total number of censored subjects within each interval goes to zero when divided by y'ri. The estimator for Ao(t) can be obtained by plugging these suitable weights into (3.5) above, but the large and final sample properties of the resultant estimators have not been thoroughly studied. The large sample covariance matrix for weighted estimators of ,8 outlined above can be split into two components: the covariance matrix if all cohort members had complete data, plus a non-negative definite matrix representing informa{Z - E(ZIY, D = tion loss due to case-cohort sampling. Define Mz(Y, Z) = l)}I(Y ~ t)ef3'z>..o(t) dt to be the partial likelihood score function (3.3) but with z(,B) replaced by E(ZIY, D = 1). In general, let 0 denote the set of all variables that are predictive of control sampling. It can include only D (Chen and Lo, 1999), or discrete phase I variables Sand D (Borgan et al. 2000), stratified time intervals and D (Chen, 2001), or (Y, D) and both discrete and continuous phase I variables (Kulich and Lin, 2004; Qi, Wang, and Prentice, 2005). The covariance
J;
140
Jinbo Chen
matrix can be written as ~-l [~+ E{(l-
D)( 1fOl - 1) x var(MzIOn] ~-l,
(3.7)
where ~ is the standard full cohort partial likelihood information matrix based on (3.2), and 1fo is the large sample limit of no/No. Thus, the variance of i3 depends on the covariance matrix of score influence terms within strata defined by o (Breslow and Wellner, 2007). The smaller var(MzIO) is, the more efficient an estimator is. This variance form thus immediately hints effective approaches for improving the asymptotic efficiency for the estimation of (3: one can enrich 0 in a way that the enriched 0 are more strongly correlated with M z . Of course one needs to be careful that the dimensionality of the enriched "0" is restricted by the sample size when analyzing a real study data, because the estimated sampling probability based on model p(R = 110) needs to be stable in order for the resultant i3 perform well in a finite sample. When 0 has continuous components such as Y, Qi, Wang, and Prentice (2005) proposed to use a nonparametric smoothing function for p(R = 110). When 0 only contains a small number of discrete variables, the actual calculation of asymptotic variance can be obtained largely by using standard software for fitting CPR models, as described in Samuelsen et al. (2007). Similar as the weighted approach for two-phase case-control studies, all above weighted estimating equations for (3 could be seen as a member in a class of augmented weighted estimating equations similar as that for two-phase case-control design (2.5) proposed by Robins et al. (1994). Define
where E(ZIY, D = 1) is a suitable weighted estimator as in the second term in the bracket in equation (3.6). The following subset of this class of estimating equations actually encompass all weighted estimators considered above:
(3.8) See section 2.1 for motivation to study this sub-class. The most efficient member corresponds to an "0" that includes all variables observed in phase I, (Y, D, S). When 0 has continuous components, the actual computation of the (3 estimate and the variance would require modeling of p(ZIO). One can specify parametric models for these nuisance functions. If these models are mis-specified, the estimate of {3 would still be unbiased due to the "double robustness" feature of this class of estimating functions as discussed in section 2.1, although the efficiency will be penalized. Nonparametric modeling is another option but will impose relatively heavy work load for data analysts. The complete class of augmented weighted estimating equations that includes all regular and asymptotically linear estimators for {3 is by replacing M z (Y, Z)
Chapter 6
Statistical Methods for Analyzing Two-Phase Studies
141
in (3.8) (Robins et al., 1994; Nan, Emond, and Wellner, 2004) by M(z,y)(Y, Z), which is defined as M(z,y)
=
faT
[h(Z, t) - E{h(Z, t)IY, D
=
1}] I(Y ~ t)ef3'Z Ao(t) dt.
Here h(Z, T) is a general function of Z and T satisfying certain regularity conditions. The estimator with the smallest variance in this class achieves the semiparametric efficiency bound. Nan (2004) studied this optimal estimator when 0 is discrete with a small number of possible values, the computation of which involved modeling of the censoring time distribution p(CIZ). The NPMLE estimator (see section 3.1.3), when applicable, should be semiparametric efficient.
3.1.2
Pseudo-likelihood estimators
Representative methods in this class included estimators proposed by Prentice (1986), Self and Prentice (1988), Barlowet al. (1994), Borgan et al. (estimators I and III, 2000). Let V denote the sampled subcohort. The Self-Prentice estimator is obtained from the following pseudo-score function:
L({3) =
t i=l
di {Zi _ LjEV I(Yj ~ Yihe~~Zj } , f3 LjEV I(Yj ~ Yi)e J
where the second term in the curly bracket is obviously an unbiased estimate of E(ZIYi, di = 1) because V is a random subset of the full cohort. This estimator is asymptotically equivalent to the original Prentice estimator (1986), which supplemented the sub cohort risk set at Yi, {j : j E V, Yj ~ Yi}, with the case that occur at Yj outside of the subcohort V. Borgan et al. (estimators I and III, 2000) extended the Self and Prentice estimator to the stratified case-cohort design. The essential difference between these estimators and the weighted estimators is that the latter included the case j in any risk set at time Yi with Yj ~ Yi for the estimation of E(ZIYi, di = 1), while the Self-Prentice estimator ignored cases that occur outside of V. In other words, for the estimation of E(ZIX, D = 1), the SelfPrentice estimator requires that the probability that cases that are not in V have a "zero probability" of being sampled. This is an intuitive explanation for why the pseudo-likelihood estimators do not belong to the class of D-estimators. For stratified case-cohort design, the asymptotic variance of ~ has the same form as (3.7), except that 1 - D is replaced by 1 and IT by the limit of the size of sub cohort V divided by the total cohort size n. This variance is straightly larger than that of the weighted estimator (estimator II; Borgan et al., 2000; Chen and Lo, 1999). The asymptotic variance can also be conveniently obtained by utilizing standard statistical software for fitting Cox's proportional hazards models (Therneau and Li, 1999). An estimate of the cumulative baseline hazard function Ao(t) corresponding to pseudo-score estimators for {3 can be obtained as
Jinbo Chen
142
where nv is the size of V. This estimator is expected also to be less efficient than that of the weighted estimator. 3.1.3
Nonparametric maximum likelihood estimation
If the censoring variable C does not depend on Z conditional on S, the NPMLE leads to efficient estimators for the estimation of both {3 and Ao (t) (Chen and Little, 1999; Chen, 2002; Scheike and Martinussen, 2004). This method is based on maximizing the likelihood for the observed data, (Y, D, R, RZ, S), which is a function of parameters {A(t), {3} and the nuisance distribution p(ZIS). With a parametric or nonparametric model imposed for p(ZIS), an EM-type algorithm can be applied for the joint estimation of all parameters. In the maximization, the cumulative baseline hazard function Ao(t) is treated as a step function with jumps only at observed failure time points. The asymptotic variance of estimates for parameter {3 can be obtained as the inverse of the information matrix calculated as the second derivative of the profile likelihood function of {3. The general NPMLE methods have been rigorously and thoroughly studied in Zeng and Lin (2007). We outline the basics of this approach following Scheike and Martinussen (2004), assuming that no variables S are observed in phase I to simplify the notation, and the extension to the case where S is observed is straightforward. Let H(tIZ) be the survival function p(T ;? tlZ) = exp{ -Ao(t) exp ({3' Z)}, and g(Z) is the marginal distribution of Z. The informative part of the likelihood function can be written as
II
A(Yilzi)d; H(Yilzi)g(Zi)
II 1H(Yilz)g(z) dz,
i:ri=O
i:Ti=l
Z
where the likelihood for the sampling indicator R is omitted because the casecohort sampling produces missingness in Z that is at random (Little and Rubin, 1987). When Z is discrete with a small number of levels, one can impose a saturated model for g(z) with the number of parameters equal to the total number of distinct values of Z minus quantity one. In general, when Z include both discrete and continuous components, one can put a point mass Pk at every distinct observed value for Z, Zk, with Pk 's satisfying Lk Pk = 1. The maximization ofthe above likelihood, with g(Z) replaced by Pk so that the integration over Z becomes summation over Zk, can be carried out using the EM algorithm. The expectation is taken over the conditional probability p(Z = zklY ;? t) = H(t; zk)Pkl Lk H(t; ZdPk' Define a{k as TJz;=Zk + (1 - Ti)pI(Z = zklT ;? Yi) at the Ith step of iteration. At the 1+1 iteration, p~I+1) = Li a{kln, /3 is the solution to the score equation
~
~di
[
Zi-
Lj:Yi;;;'Y; {Tje,6I Zi Zj Lj:Yi;;;'Y; {Tje,6'Zi
i=1
+ (1 -
Tj) Lk e,6IZkZka]k}]
+ (1- Tj) Lk e,6I Zka ]k}
and A~I+l)(Yi) is obtained as
AO(Yi) =
t
I(YI ~ Yi)d 1 . Zka 1=1 Lj:Yi;;;'Y; {Tje,6'Zi + (1- Tj) Lk e,6I }k}
,
Chapter 6
Statistical Methods for Analyzing Two-Phase Studies
143
The NPMLE method simultaneously provides estimates for all parameters including (3, Ao(t), and p(Zk). The simultaneous estimation of two essentially infinite dimensional parameters, Ao(t) and p(Zk), using an iterative algorithm makes the computation much more challenging than the weighted and pseudolikelihood estimators. The asymptotic variance of (3 can be obtained as the inverse Hessian matrix of the profile l~kelihood for (3 as follows. At the convergence of the EM algorithm, one obtains and p~ using the above formulas with (3 fixed at Then one can plug and p~ back into the likelihood function, which is written as L((3) = L(/3, Af3, p(3). Then one perturbs by a small quantity E as = + E and recalculate the likelihood function L((3.) = L(/3', Af3·, pf3.) and calculate L((32.) similarly. The variance-covariance matrix of (3 can then be obtained as {L((32.)+L((3)-2L((3.)}/E 2. Scheike and Martinussen (2004) proposed to perform the computation based on the partial likelihood score function, which can largely take advantage of many quantities computed in the EM algorithm. Of course one could also adopt a parametric model for g(z) or g(zls) (Chen and Little, 1999). However, mis-specifying these nuisance functions may result in intolerable bias in parameter estimation. In addition, it is not trivial to compute the integration over Z in the likelihood function for the observed data.
Ag Ag
/3.
/3'
3.1.4
/3
/3
Selection of a method for analysis
The decision to choose a method for analysis depends on considerations of scientific hypotheses of interest, data generating processes, statistical efficiency, amount of data available from the full cohort, and availability of computing resources. The NPMLE with the missing covariate distribution modeled nonparametrically is consistent and the most efficient when applicable. However, this method requires a relatively strong assumption that the censoring only depends on the fully observed covariates, the violation of which may result in serious bias (Chen and little, 1999). While one may subjectively speculate on whether the data indeed satisfies this constraint, the observed data does not allow a formal statistical assessment. Of course one could perform suitable sensitivity analysis in light of this concern. The weighted methods, for the consistency of estimators, only require knowledge of a true sub cohort selection model, which is always in the hands of the investigator. These methods guarantee that estimates are consistent to the same quantities as those from the full cohort analysis if complete data were available. To choose one from various weighted methods, one can first estimate the subcohort selection probability within strata defined by discretized Y, phase I variables S, and D. The analysis may involve a pre-investigation process on what variables to use in this post-stratification so that there are a reasonable number of subjects in each stratum to ensure finite-sample consistency and the efficiency gain. The fully augmented weighted estimator (Qi, Wang, and Prentice, 2005) is relatively computationally intensive since it requires non-parametric smoothing. The pseudo-likelihood estimators may be less efficient than the NPMLE and some weighted estimators, but they require the least effort in data collection: they do not require the record of follow-up time Y for non-selected subjects and require the ascertainment of covariate values for a case outside of sub cohort V only at the
case's failure time. Both the weighted and NPMLE estimators require covariate observations at all failure times in the full cohort. The NPMLE method requires the record of Y for the full cohort, which may not be precise or readily available for subjects not in the case-cohort sample. For example, the base cohort may consist of several subcohorts that were assembled at different study sites and involved different study investigators (for example, a cohort consortium). The weighted method may also need to use Y to improve efficiency. While the weighted and pseudo-likelihood estimators can be computed largely by taking advantage of quantities provided by standard software for fitting Cox's proportional hazards models, the programming requirement for the NPMLE is much higher. Sometimes the base cohort may be so large that it may not be feasible to run the EM algorithm. For example, the Breast Cancer Detection and Demonstration Project (BCDDP) at the National Cancer Institute consisted of around 280,000 women (Chen et al., 2008). When the outcome incidence is low and the hazard ratio parameters are not very large, the efficiency advantage of the NPMLE is very modest (Scheike and Martinussen, 2004). In particular, if the interest is only in the estimation of hazard ratio parameters for covariates collected in the case-cohort sample but not for phase-I variables, the incorporation of data (Y, D) in the NPMLE analysis for subjects outside of the case-cohort sample probably could only improve the efficiency of estimating the parameters of interest marginally. But one could conjecture that the estimation of Λ₀(t) would be much improved. Limited data are available in the literature documenting the relative performance of the NPMLE approach and weighted estimators. It is of practical interest to compare their efficiency when phase I variables S are available and at least moderately correlated with phase II covariates Z. In this case, the NPMLE involves the estimation of the nuisance distribution p(Z|S), which could be challenging when S involves continuous components or many discrete variables. One solution is to simply ignore components in S that are not used for case-cohort sampling. Since the weighted and pseudo-likelihood approaches are based on modifications of full-data partial likelihood functions, they can conveniently incorporate fully-observed time-varying covariates. For the NPMLE method, no details have been given in the literature on how well it can handle time-varying covariates. The likelihood function requires modeling of covariates Z conditioned on the fully observed ones, but such modeling becomes difficult when phase I variables are time-dependent. Additional assumptions may be imposed. For example, one may assume that only the baseline measurements of phase I variables are predictive of the distribution of Z (Chen et al., 2008). The analysis can then proceed largely using the EM algorithm, except that the update for estimates of Λ₀(t) is more complex due to the involvement of time-varying covariates in the survivor function. It has also been pointed out that the weighted estimators always converge to the same quantities as the full data analysis when the Cox proportional hazards model is mis-specified (Scott and Wild, 2002; Breslow and Wellner, 2007). Insufficient data exist in the literature documenting the bias in the NPMLE and pseudo-score estimates compared to the full cohort analysis when the model (3.1)
is mis-specified.
3.2 Nested case-control and counter-matching design
The meaning of "nested case-control" design in the epidemiology literature is frequently different from that in the biostatistics literature. In the biostatistics literature, it refers to an individually-matched design where controls are sampled from the risk set at the failure time of the corresponding case. Suppose K out of n subjects in the cohort experienced the event of interest during the study period, and let t_1 < t_2 < ··· < t_K denote the K observed event times. Let R_k be the risk set at t_k, {i : Y_i ≥ t_k, i = 1, …, n}, which includes the kth case and all subjects in the full cohort who are event-free at t_k, and let n_k denote the number of subjects in the set R_k. For the kth case, typically, a small number of event-free subjects are sampled from R_k. We assume that a fixed number, m − 1, of controls is sampled for each case. Let V_k denote the sampled subset at t_k together with the kth case; the size of V_k is then equal to m. Covariates Z are available only for subjects in the sampled risk set V_k. Some auxiliary variables S may be available for the full cohort. In the epidemiology literature, subjects who experienced events and a group of event-free subjects at the end of study follow-up are sampled. The sampling is frequently stratified on factors such as the length of follow-up, ethnic group, study center, etc. The resultant case-control data are often analyzed using standard logistic regression as an unmatched case-control sample, with the stratification factors adjusted for as covariates. The same data, with the observation of (Y, D) available for the full cohort, can also be analyzed with the Cox proportional hazards model. The analytical method is closely related to that for the case-cohort study. We focus on methods for analyzing the individually-matched case-control design. The classical method of estimating hazard ratio parameters using nested case-control data is by maximizing the log partial likelihood (Thomas, 1977)
\[
L(\beta) \;=\; \sum_{k} \Big\{ \beta' Z_k \;-\; \log \sum_{l \in V_k} e^{\beta' Z_l} \Big\},
\]
where, with slight abuse of notation, Z_k refers to the covariate values of case k at the kth failure time. This likelihood differs from that for the full cohort study only in that the sampled risk set is used instead of the full risk set. This partial likelihood has exactly the same form as the usual conditional likelihood for matched case-control data assuming a logistic regression model for analysis. Consequently, the analysis is frequently referred to as conditional logistic regression analysis. The intuitive explanation for the consistency of the resultant estimate of β, β̂, is readily available by examining the score function for β:
\[
U(\beta) \;=\; \sum_{k} \bigg\{ Z_k \;-\; \frac{\sum_{l \in V_k} Z_l\, e^{\beta' Z_l}}{\sum_{l \in V_k} e^{\beta' Z_l}} \bigg\},
\]
where the second term is an estimate of E(Z | R_k, d_k = 1) calculated using the sampled risk set only. This estimator is unbiased because the sampling within each risk set is random, implying that the score function is unbiased. It turns out
that the estimate converges to the same value as the full cohort analysis (Langholz and Goldstein, 2001) if Z were measured on the full cohort. An estimate of the baseline hazard function Λ₀(t) can also be obtained from the sampled risk sets.
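A commonly used form is a weighted Breslow-type estimator in which each sampled risk-set sum is inflated by the factor n_k/m to approximate the corresponding full risk-set sum (the display below is a reconstruction along the lines of Borgan and Langholz's work, stated here for concreteness rather than reproduced verbatim from the chapter):

\[
\widehat{\Lambda}_0(t) \;=\; \sum_{k:\, t_k \leq t} \frac{1}{(n_k/m) \sum_{l \in V_k} e^{\widehat{\beta}' Z_l}}.
\]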
When one is only interested in the hazard ratio parameters β, the partial likelihood analysis uses data only from the selected subjects, and the only additional information required for the estimation of Λ₀(t) is n_k, which can be conveniently recorded at the time of sampling the risk set. Similarly to stratified case-cohort designs, an exposure-stratified version of the individually-matched nested case-control design, called the counter-matching design, can lead to improved efficiency (Langholz and Borgan, 1995). It is easily seen that the NPMLE method for the analysis of case-cohort data is readily applicable to that of nested case-control or counter-matched data, so we will omit the discussion of this approach. However, it is worth pointing out that extra caution needs to be taken when applying the NPMLE to nested case-control data when it is important to adequately adjust for matching variables. For example, when measuring biologic quantities from blood samples, investigators often match on the date that the blood sample was collected. When using the NPMLE method, one may need to adjust for the matching on the date, which could be rather difficult if the matching is very fine. In this case, the partial likelihood approach would be most suitable. Many of the methods we review for analyzing nested case-control or counter-matched data are analogous to those for the case-cohort design. This is not surprising. "Although studies of nested case-control and case-cohort sampling using Cox's model have mostly been conducted through separate efforts, the focus on searching for more efficient estimators is the same - namely how sampled individuals should be properly reused in constructing estimating equations" (Chen, 2001).
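To make the partial likelihood and score above concrete, the following sketch evaluates them from a list of sampled risk sets. The data layout (a list of (case index, sampled risk set) pairs together with a covariate matrix) is illustrative rather than taken from the chapter.

```python
import numpy as np

def nested_cc_loglik_and_score(beta, sampled_risk_sets, Z):
    """Thomas (1977) log partial likelihood and score for nested
    case-control data.

    beta: parameter vector of length p.
    sampled_risk_sets: list of (case_idx, members) pairs, one per failure
        time t_k; `members` are the row indices of the sampled risk set
        V_k (the case plus its m-1 sampled controls).
    Z: (n, p) covariate matrix indexed consistently with `members`.
    """
    beta = np.asarray(beta, dtype=float)
    loglik = 0.0
    score = np.zeros_like(beta)
    for case_idx, members in sampled_risk_sets:
        eta = Z[members] @ beta                      # linear predictors within V_k
        w = np.exp(eta)
        denom = w.sum()
        loglik += Z[case_idx] @ beta - np.log(denom)
        # weighted mean of Z over the sampled risk set (second term of the score)
        zbar = (w[:, None] * Z[members]).sum(axis=0) / denom
        score += Z[case_idx] - zbar
    return loglik, score
```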
3.2.1 Methods for analyzing the nested case-control data
The general class of weighted estimating equations (3.6) is also applicable to the analysis of nested case-control or counter-matched data. While the probability of including cases is obviously one, the appropriate weight to use for controls, however, is not straightforward to obtain due to the matched sampling. In particular, a control for a former case may experience an event at a later time, and a subject could be sampled as a control for multiple cases. Thus, the probability that a subject is ever selected as a control for some case depends on the at-risk history for the full cohort. When cases and controls are matched only on time, the selection probability for a control subject i was derived by Samuelsen (1997):

\[
\pi_i \;=\; 1 - \prod_{k:\, Y_k \leq Y_i} \Big\{ 1 - \frac{m-1}{n_k - 1} \Big\}.
\]
This is the "true probability" for selection. The estimation can then proceed by solving equation (3.6) with this π_i. The theoretical study of this estimator is slightly more involved than that for the case-cohort design, since the probability π_i is not predictable. When the selection of controls involves matching also on factors other than time, such as ethnicity or the date of processing the biologic assay, the extension of this weight to incorporate these finer strata does not seem obvious. Similar to the case-cohort design, one can also use "estimated weights" instead of the "true weight" to improve statistical efficiency. In this regard, the local-covariate averaging method of Chen (2001) and the general weighted method (3.8), which replaces π with smoothed estimates in strata defined by (Y, D) or (Y, D, S), are still applicable. Using estimated weights makes it feasible to incorporate stratified risk-set sampling. All these weighted methods allow the exposure of case k to be compared to all subjects in the nested case-control sample who are at risk at time t_k, rather than only to those matched to case k, thereby leading to increased power. The power gain is more important when the number of matched controls is small and/or the hazard ratio parameters are large, and becomes less so otherwise (Samuelsen, 1997; Chen, 2001). More empirical studies are needed to evaluate the performance of these estimators in realistic scenarios. As a tradeoff for the efficiency advantage, the weighted methods require more data than the standard partial likelihood method. The former require observations of time-dependent covariates for a subject at all failure times, whereas the latter requires them only at the one failure time at which the subject was sampled. New estimators that use only the time-restricted nested case-control data, which are reasonably accurate but more efficient than the partial likelihood estimator, have been proposed (Chen, 2001).
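As an illustration of the inclusion probabilities above, the sketch below computes π_i for each cohort member from the ordered failure times, the risk-set sizes, and the number m sampled at each failure time; in practice the weight for a case is set to one. The data layout is hypothetical.

```python
import numpy as np

def samuelsen_weights(Y, failure_times, n_at_risk, m):
    """Inclusion probabilities pi_i = 1 - prod_{k: t_k <= Y_i} {1 - (m-1)/(n_k-1)}.

    Y: (n,) follow-up times for all cohort members.
    failure_times: (K,) ordered failure times t_k.
    n_at_risk: (K,) risk-set sizes n_k at each failure time.
    m: size of each sampled risk set (one case plus m-1 controls).
    """
    Y = np.asarray(Y, dtype=float)
    failure_times = np.asarray(failure_times, dtype=float)
    n_at_risk = np.asarray(n_at_risk, dtype=float)
    pi = np.empty_like(Y)
    for i, y in enumerate(Y):
        in_risk_set = failure_times <= y   # failure times at which subject i is at risk
        pi[i] = 1.0 - np.prod(1.0 - (m - 1) / (n_at_risk[in_risk_set] - 1.0))
    return pi
```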
3.2.2 Methods for analyzing counter-matched data
Counter-matching is a stratified nested case-control design that can lead to improved statistical efficiency for assessing the exposure effect. The intuition is best explained by a one-to-one matched design. In the standard nested case-control design, when the exposure is rare, many case-control pairs would be unexposed and thus non-informative for the exposure hazard ratio estimation. By sampling a control who has an exposure value different from that of the case, the counter-matching design offers a sampling strategy that increases the number of informative pairs (Langholz and Borgan, 1995). For example, if a case is exposed, the control should be randomly sampled from the unexposed event-free subjects in the risk set. The sampling principle also applies when an auxiliary variable is available in the full cohort but the exposure of interest can only be measured in the counter-matched subset. Similar to Thomas' partial likelihood approach for analyzing nested case-control data, counter-matching studies can be analyzed using a weighted partial likelihood approach (Langholz and Borgan, 1993).
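The weighted partial likelihood takes a form along the following lines (a reconstruction in commonly used notation; the exact display in the original may differ slightly):

\[
L(\beta) \;=\; \prod_{k} \frac{ e^{\beta' Z_k}\, w_k(t_k) }{ \sum_{l \in V_k} e^{\beta' Z_l}\, w_l(t_k) },
\]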
where w_j(t_k) is the inverse sampling probability for subject j at failure time t_k, calculated as the total number of subjects in the full cohort risk set R_k in the sampling stratum where case k is located, divided by the number in V_k from the same stratum. Interestingly, even with these sampling weights, this partial likelihood still has the same properties as the regular partial likelihood: the expectation of the score is zero and the expected information equals the covariance matrix of the score. Consequently, standard software for performing conditional logistic regression analysis can be used for fitting the counter-matched sample. The weighted methods for the nested case-control data can be applied to the analysis of counter-matched data with only minor modifications in the construction of weights. While the principles of analyzing counter-matched data are not much different from those for the standard nested case-control data, the actual implementation of the sampling could be much less straightforward, except for counter-matching one case with one control on a binary exposure. With continuous exposures or auxiliary phase-I variables, one would need to categorize them in order to perform counter-matching. Then cases and controls could be sampled from relatively distant intervals. One would need to decide a priori the number and endpoints of intervals, and some general guidelines are given in Langholz (2003). In general, the number of categories should be balanced against the number of at-risk subjects within each category. It is also important to keep in mind that counter-matched sampling does not necessarily lead to an efficiency gain for every parameter estimate (Langholz and Borgan, 1995). Thus, one needs to be cautious when applying a counter-matching design to study joint effects of multiple covariates or absolute risk. Nevertheless, efficiency improvement may always be expected for the counter-matched variable or its correlates. In particular, counter-matching on both rare genetic and environmental exposures has been shown to be cost-effective for studying interactions between rare genes and rare environmental exposures (Cologne et al., 2004). Empirical studies showed that counter-matching on an auxiliary measure that is even moderately correlated with the exposure of interest could result in a noteworthy efficiency gain (Langholz and Borgan, 1995). Such gain increases with respect to both the sensitivity and specificity of the auxiliary variable, but more with specificity, and decreases with the number of matched controls.
3.2.3 Unmatched case-control studies
Classical case-control studies, which include the failures together with a subset of non-failures at the end of the study period, are often conducted within an assembled study cohort. The sampling of non-failures (controls) could be stratified on factors such as ethnicity, age, and follow-up time, so that cases and controls are "frequency-matched" on these variables. Epidemiologists often refer to this type of case-control study as a nested case-control study. These studies are often analyzed using an unconditional logistic regression model for the prevalence of the failure, especially when the censoring is administrative. In the presence of competing risks or a complex relationship between the risk of failure and time, it may be desirable to perform a time-to-event analysis (Chen and Lo, 1999; Chen,
2001), which also conveniently allows absolute risk estimation. For the purpose of estimating hazard ratio parameters, it turns out that an unmatched case-control study with k_1 cases and k_0 controls is equivalent to a case-cohort study if k_1 is equal to the total number of cases in the cohort and k_0 = n{1 − P(δ = 1)}, where n is the subcohort sample size (Chen and Lo, 1999). The estimation methods of Chen and Lo (1999) and Borgan et al. (2000) can then be applied to the analysis of the case-control data, except that the asymptotic variance needs to be slightly modified. We include the unmatched case-control design here mainly to distinguish it from the nested case-control design, but it is more closely related to the case-cohort design.
3.2.4 Comparing study designs and analysis approaches
The relative merits of case-cohort and nested case-control designs have been discussed extensively in the literature (e.g., Langholz, 2001; Samuelsen et al., 2007). A unique advantage of the case-cohort design is that it is cost-effective for studying multiple outcome variables, in that only a single random subcohort is needed as a common control sample for different outcomes. A matched design, on the other hand, must select a new set of controls for a new outcome variable. But a case-cohort design does require that exposure assessments, such as molecular measurements from stored biological materials, are not affected by the passage of time. The time-matched case-control design is free of such concern. The implementation of a case-cohort design is somewhat easier than that of a nested case-control design. One should carefully gauge these and other practical considerations when choosing a design. The NPMLE or weighted likelihood methods make it possible to compare the efficiency of different study designs. For example, Langholz and Thomas (1991) showed that the nested case-control design may have greater efficiency than the case-cohort design when there is moderate random censoring or staggered entry into the cohort. Methods of "refreshing" the subcohort so as to avoid such efficiency loss are available (Prentice, 1986; Lin and Ying, 1993; Barlow, 1994). We have focused on methods for fitting the Cox proportional hazards model to case-cohort or nested case-control data. Analytical methods assuming other regression models, such as semiparametric transformation models (Lu and Tsiatis, 2006; Kong, Cai, and Sen, 2004) or the additive hazards regression model (Kulich and Lin, 2000), have also been studied in the literature. The weighted approaches and the NPMLE approach could be adapted to these settings without conceptual difficulty.
4 Conclusions
This chapter is intended to provide a useful highlight of some recent developments in statistical methods for the analysis of two-phase case-control and cohort studies. It is in no way a comprehensive review of the whole field. Weighted likelihood methods, pseudo-likelihood methods, and nonparametric maximum likelihood methods were outlined for two-phase case-control, (stratified) case-cohort, and nested case-control or counter-matching designs. Each method has its own
advantages and limitations, and it is probably difficult, if not impossible, to conclude that any method is uniformly preferable. For a particular study, one should carefully gauge whether the assumptions required by a method are satisfied, whether a potential efficiency gain could justify any additional computational effort, and so on. We have focused on two-phase methods to reduce the costs of measuring covariates. Similar sampling strategies and statistical methods are applicable for the cost-effective measurement of regression outcomes (Pepe, Reilly, and Fleming, 1994; Chen and Breslow, 2004). The case-cohort design has been widely adopted for the study of many human disorders related to expensive exposures. A search of "case-cohort" by title and abstract in PubMed shows 455 records at the time when this paper was written. The original Prentice (1986) estimator appears still to be the method of choice, partially due to the convenience of computation using existing software. The time-matched nested case-control design has long been used as a standard epidemiologic study design, but frequency-matching is probably applied more often than individual-matching for control selection, and conditional and unconditional logistic regression analyses are used more often than time-to-event analysis. When individual matching is administered, Thomas' partial likelihood estimator (1977), usually referred to as a conditional logistic regression method, is still the standard method of analysis. A limited literature search shows that many of the reviewed methods, although statistically elegant, have not really reached the epidemiology community. This highlights a great need for work that disseminates the mathematically sophisticated two-phase methods to the scientific community, best accompanied by the development and distribution of user-friendly software. For two-phase case-control studies, some effort has already been devoted to this and appears to have been fruitful (Breslow and Chatterjee, 1999). We did not review the optimal design of two-phase studies. Within an established cohort where the phase-I sample size is fixed, an optimal design refers to the sampling strategy that selects a pre-specified number of phase-II subjects in such a way that the resultant phase I and phase II data allow the most precise parameter estimates. An optimal sampling strategy depends on the sampling proportions within each stratum, the cost of sampling a case or control, and unknown parameters such as the joint distribution of regression variables and auxiliary variables. It is thus usually infeasible to determine the optimal sampling strategy. For the two-phase case-control study with a binary exposure, it was shown that a design that balances the number of subjects in all strata appears to perform well in many practical situations (Breslow and Cain, 1988). For the unstratified design of case-cohort or nested case-control studies, the design is simply a matter of cost: if all cases are sampled, how many controls does one need to sample to achieve a certain power? Cai and Zeng (2004) proposed two approaches for calculating sample size for case-cohort studies based on log-rank type test statistics. For the stratified case-cohort or counter-matching design, one would need to decide what strata to use and how many subjects to sample within each stratum. Many useful extensions of two-phase methods have been proposed in the literature.
For example, for the examination of two rare exposures, Pfeiffer and Chatterjee (2005) considered a supplemented case-control design where they extended
methods for two-phase case-control studies by supplementing a case-control sample with a group of non-diseased subjects who are exposed to one of the exposures. Lee, Scott and Wild (2007) proposed maximum likelihood methods for when a case sample is augmented either with population-level covariate information or with a random sample that consists of both cases and controls. Their case-augmented design allows the estimation not only of odds ratio parameters, but also of the probability of the outcome variable. Reilly et al. (2005) applied two-phase methods to study new outcome variables using existing case-control data, where the new outcome was an exposure in the original study. Their method could largely eradicate possible bias in the analysis of only controls and improve study power by including not only controls but also cases. Chen et al. (2008) extended the NPMLE method for two-phase case-control studies to a situation where phase I is a stratified case-control sample but the sampling of phase II subjects was within strata defined by auxiliary variables other than the stratification variable for the phase I case-control sampling. Lu and Shih (2006) proposed an extended case-cohort design for studies of clustered failure time data. Besides the two-phase design, multiple-phase designs have also been considered in the literature (Whittemore, 1997). In summary, the application and extension of two-phase methods is no doubt still a very promising area of research. In the current exciting era of genetic and molecular epidemiology, the cost of measuring many bioassays is far from trivial. Two-phase and multi-phase methods are thus expected to find important and wider applications.
References

[1] W. E. Barlow, Robust variance estimation for the case-cohort design. Biometrics 50 (1994), 1064-1072.
[2] J. L. Bernstein, B. Langholz, R. W. Haile, L. Bernstein, D. C. Thomas, M. Stovall, K. E. Malone, C. F. Lynch, J. H. Olsen, H. Anton-Culver, R. E. Shore, J. D. Boice Jr, G. S. Berkowitz, R. A. Gatti, S. L. Teitelbaum, S. A. Smith, B. S. Rosenstein, A. L. Børresen-Dale, P. Concannon, and W. D. Thompson, Study design: evaluating gene-environment interactions in the etiology of breast cancer - the WECARE study. Breast Cancer Research 6 (2004), 199-214.
[3] P. J. Bickel, C. A. Klaassen, Y. Ritov, and J. A. Wellner, Efficient and adaptive estimation for semiparametric models. Johns Hopkins Series in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, Md, USA, 1993.
[4] O. Borgan, B. Langholz, S. O. Samuelsen, L. Goldstein, and J. Pogoda, Exposure stratified case-cohort designs. Lifetime Data Analysis 6 (2000), 39-58.
[5] N. E. Breslow, Discussion of the paper by D. R. Cox. Journal of the Royal Statistical Society, Ser. B 34 (1972), 216-217.
[6] N. E. Breslow, Statistics in epidemiology: the case-control study. Journal of the American Statistical Association 91 (1991), 14-28.
[7] N. E. Breslow and K. C. Cain, Logistic regression for two-stage case-control data. Biometrika 75 (1988), 11-20.
[8] N. E. Breslow and N. Chatterjee, Design and analysis of two-phase studies with binary outcome applied to Wilms tumor prognosis. Applied Statistics 48 (1999), 457-468.
[9] N. E. Breslow and R. Holubkov, Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. Journal of the Royal Statistical Society, Ser. B 59 (1997), 447-461.
[10] N. E. Breslow, J. M. Robins, and J. A. Wellner, On the semiparametric efficiency of logistic regression under case-control sampling. Bernoulli 6 (2000), 447-455.
[11] N. E. Breslow and J. A. Wellner, Weighted likelihood for semiparametric models and two-phase stratified samples, with application to Cox regression. Scandinavian Journal of Statistics (2007), 86-102.
[12] N. Chatterjee, Y. Chen, and N. E. Breslow, A pseudoscore estimator for regression problems with two-phase sampling. Journal of the American Statistical Association 98 (2003), 158-168.
[13] J. B. Chen and N. E. Breslow, Semiparametric efficient estimation for the auxiliary outcome problem with the conditional mean model. Canadian Journal of Statistics 32 (2004), 359-372.
[14] J. B. Chen, R. Ayyagari, N. Chatterjee, D. Y. Pee, C. Schairer, C. Byrne, J. Benichou, and M. H. Gail, Breast cancer relative hazard estimates from case-control and cohort designs with missing data on mammographic density. Journal of the American Statistical Association (2008; in press).
[15] H. Y. Chen and R. J. A. Little, Proportional hazards regression with missing covariates. Journal of the American Statistical Association 94 (1999), 896-908.
[16] H. Y. Chen, Double-semiparametric method for missing covariates in Cox regression models. Journal of the American Statistical Association 97 (2002), 565-576.
[17] K. Chen, Generalized case-cohort sampling. Journal of the Royal Statistical Society, Ser. B 63 (2001), 791-809.
[18] K. Chen, Statistical estimation in the proportional hazards model with risk set sampling. The Annals of Statistics 32 (2004), 1513-1532.
[19] K. Chen and S. Lo, Case-cohort and case-control analysis with Cox's model. Biometrika 86 (1999), 755-764.
[20] J. B. Cologne, G. B. Sharp, K. Neriishi, P. K. Verkasalo, C. E. Land, and K. Nakachi, Improving the efficiency of nested case-control studies of interaction by selecting controls using counter matching on exposure. International Journal of Epidemiology 33 (2004), 485-492.
[21] D. R. Cox, Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Ser. B 34 (1972), 187-220.
[22] A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Ser. B 39 (1977), 1-38.
[23] E. A. Engels, J. Chen, R. P. Viscidi, K. V. Shah, R. W. Daniel, N. Chatterjee, and M. A. Klebanoff, Poliovirus vaccination during pregnancy, maternal seroconversion to simian virus 40, and risk of childhood cancer. American Journal of Epidemiology 160 (2004), 306-316.
[24] T. R. Flanders and S. Greenland, Analytic methods for two-stage case-control
studies and other stratified designs. Statistics in Medicine 10 (1991), 739-747.
[25] D. G. Horvitz and D. J. Thompson, A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47 (1952), 663-685.
[26] J. D. Kalbfleisch and J. F. Lawless, Likelihood analysis of multi-state models for disease incidence and mortality. Statistics in Medicine 7 (1988), 146-160.
[27] L. Kong, J. W. Cai, and P. K. Sen, Weighted estimating equations for semiparametric transformation models with censored data from a case-cohort design. Biometrika 91 (2004), 305-319.
[28] M. Kulich and D. Y. Lin, Improving the efficiency of relative-risk estimation in case-cohort studies. Journal of the American Statistical Association 99 (2004), 832-844.
[29] M. Kulich and D. Y. Lin, Additive hazards regression for case-cohort studies. Biometrika 87 (2000), 73-87.
[30] J. A. Largent, M. Capanu, L. Bernstein, B. Langholz, L. Mellemkaer, K. E. Malone, C. B. Begg, R. W. Haile, C. F. Lynch, H. Anton-Culver, A. Wolitzer, and J. L. Bernstein, Reproductive history and risk of second primary breast cancer: the WECARE study. Cancer Epidemiology, Biomarkers, and Prevention 16 (2007), 906-911.
[31] B. Langholz and O. Borgan, Counter-matching: a stratified nested case-control sampling method. Biometrika 82 (1995), 69-79.
[32] B. Langholz and L. Goldstein, Conditional logistic analysis of case-control studies with complex sampling. Biostatistics 2 (2001), 63-84.
[33] B. Langholz, Use of cohort information in the design and analysis of case-control studies. Scandinavian Journal of Statistics 34 (2006), 120-136.
[34] B. Langholz and D. C. Thomas, Efficiency of cohort sampling designs: some surprising results. Biometrics 47 (1991), 1563-1571.
[35] J. F. Lawless, J. D. Kalbfleisch, and C. J. Wild, Semiparametric methods for response-selective and missing data problems in regression. Journal of the Royal Statistical Society, Ser. B 61 (1999), 413-438.
[36] A. J. Lee, A. J. Scott, and C. J. Wild, On the Breslow-Holubkov estimator. Lifetime Data Analysis 13 (2007), 545-563.
[37] A. J. Lee, A. J. Scott, and C. J. Wild, Fitting binary regression models with case-augmented samples. Biometrika 93 (2007), 385-397.
[38] A. J. Lee, On the semi-parametric efficiency of the Scott-Wild estimator under choice-based and two-phase sampling. Journal of Applied Mathematics and Decision Sciences (2007, in press).
[39] D. Y. Lin and Z. Ying, A simple nonparametric estimator of the bivariate survival function under univariate censoring. Biometrika 80 (1993), 573-581.
[40] W. B. Lu and A. A. Tsiatis, Semiparametric transformation models for the case-cohort study. Biometrika 93 (2006), 207-214.
[41] S. Lu and J. H. Shih, Case-cohort designs and analysis of clustered failure time data. Biometrics 62 (2006), 1138-1148.
[42] B. Nan, Efficient estimation for case-cohort studies. Canadian Journal of Statistics 32 (2004), 403-419.
[43] B. Nan, M. J. Emond, and J. A. Wellner, Information bounds for Cox
regression models with missing data. The Annals of Statistics 32 (2004), 723-753.
[44] M. S. Pepe, M. Reilly, and T. R. Fleming, Auxiliary outcome data and the mean score method. Journal of Statistical Planning and Inference 42 (1994), 137-160.
[45] R. L. Prentice, A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika 73 (1986), 1-11.
[46] R. L. Prentice and R. Pyke, Logistic disease incidence models and case-control studies. Biometrika 66 (1979), 403-411.
[47] M. Reilly, A. Torrang, and A. Klint, Re-use of case-control data for analysis of new outcome variables. Statistics in Medicine 24 (2005), 4009-4019.
[48] J. M. Robins, A. Rotnitzky, and L. P. Zhao, Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89 (1994), 846-866.
[49] S. O. Samuelsen, H. Aanestad, and A. Skrondal, Stratified case-cohort analysis of general cohort sampling designs. Scandinavian Journal of Statistics 34 (2007), 103-119.
[50] T. H. Scheike and T. Martinussen, Maximum likelihood estimation for Cox's regression model under case-cohort sampling. Scandinavian Journal of Statistics 31 (2004), 283-293.
[51] W. Schill, K. H. Jockel, K. Drescher, and J. Timm, Logistic analysis in case-control studies under validation sampling. Biometrika 80 (1993), 339-352.
[52] A. J. Scott and C. J. Wild, Fitting logistic models under case-control or choice-based sampling. Journal of the Royal Statistical Society, Ser. B 48 (1986), 170-182.
[53] A. J. Scott and C. J. Wild, Fitting regression models to case-control data by maximum likelihood. Biometrika 84 (1997), 57-71.
[54] A. J. Scott and C. J. Wild, The robustness of the weighted methods for fitting models to case-control data. Journal of the Royal Statistical Society, Ser. B 64 (2002), 207-219.
[55] S. G. Self and R. L. Prentice, Asymptotic distribution theory and efficiency results for case-cohort studies. The Annals of Statistics 16 (1988), 64-81.
[56] T. M. Therneau and H. Li, Computing the Cox model for case-cohort designs. Lifetime Data Analysis 5 (1999), 99-112.
[57] D. C. Thomas, In appendix to F. D. K. Liddell, J. C. McDonald, and D. C. Thomas, Methods of cohort analysis: appraisal by application to asbestos mining. Journal of the Royal Statistical Society, Series A 140 (1977), 460-491.
[58] B. Thorand, J. Baumer, H. Kolb, C. Meisinger, L. Chambless, W. Koenig, and C. Herder, Sex differences in the prediction of type 2 diabetes by inflammatory markers: results from the MONICA/KORA Augsburg case-cohort study, 1984-2002. Diabetes Care 30 (2007), 854-860.
[59] J. W. Cai and D. Zeng, Sample size/power calculation for case-cohort studies. Biometrics 60 (2004), 1015-1024.
[60] E. White, A two stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology 115 (1982), 119-128.
[61] K. Wu, D. Feskanich, C. S. Fuchs, W. C. Willett, B. W. Hollis, and E. L.
Giovannucci, A nested case-control study of plasma 25-hydroxyvitamin D concentrations and risk of colorectal cancer. Journal of the National Cancer Institute 99 (2007), 1120-1129.
[62] D. Zeng and D. Y. Lin, Maximum likelihood estimation in semiparametric regression models with censored data (with discussion). Journal of the Royal Statistical Society, Ser. B 69 (2007), 507-564.
Part III
Bioinformatics
Chapter 7
Protein Interaction Predictions from Diverse Sources
Yin Liu*, Inyoung Kim†, Hongyu Zhao‡
Abstract Protein-protein interactions play an important role in many cellular processes. Recent advances in high-throughput experimental technologies have generated enormous amounts of data and provided valuable resources for studying protein interactions. However, these technologies suffer from high error rates due to their inherent limitations. Therefore, statistical and computational approaches capable of incorporating multiple data sources are needed to fully take advantage of the rapid accumulation of data. In this chapter, we describe diverse data sources informative for protein interaction inference, as well as the computational methods that integrate these data sources to predict protein interactions. These methods either combine direct measurements on protein interactions from diverse organisms, or integrate different types of direct and indirect information on protein interactions from various genomic and proteomic approaches. Keywords: Protein-protein interactions, protein complex, domain, genomic feature, data integration
1 Introduction
Protein-protein interactions (PPI) play a critical role in the control of most cellular processes, such as signal transduction, gene regulation, cell cycle control, and metabolism. Within the past decade, genome-wide data on protein interactions in humans and many model species have become available (Giot et al., 2003; Ito et al., 2001; LaCount et al., 2005; Li et al., 2004; Rual et al., 2005; Uetz et al., 2000). In the meantime, a large amount of indirect biological information on protein interactions, including sequence and functional annotation, protein localization information, and gene expression measurements, has also become available.

*Department of Neurobiology and Anatomy, University of Texas Health Science Center at Houston, 6431 Fannin Street, Houston, TX 77030, USA. E-mail: [email protected]
†Department of Statistics, Virginia Tech, 410-A Hutcheson Hall, Blacksburg, VA 24061, USA.
‡Department of Genetics, Department of Epidemiology and Public Health, Yale University School of Medicine, 60 College Street, New Haven, CT 06520, USA. E-mail: hong [email protected]
These data, however, are far from complete and contain many false negatives and false positives. Current protein interaction information obtained from experimental methods covers only a fraction of the complete PPI networks (Hart et al., 2006; Huang et al., 2007; Scholtens et al., 2007); therefore, there is a great need to develop robust statistical methods capable of identifying and verifying interactions between proteins. In the past several years, a number of methods have been proposed to predict protein-protein interactions based on various data types. For example, with genomic information available, the Rosetta stone method predicts interacting proteins based on the observation that some single-domain proteins in one organism can be fused into a multiple-domain protein in other organisms (Marcotte et al., 1999; Enright et al., 1999). The phylogenetic profile method is based on the hypothesis that interacting proteins tend to co-evolve, so that their respective phylogenetic trees are similar (Pazos et al., 2003). The concept of "interolog", which refers to homologous interacting protein pairs among different organisms, has also been used to identify protein interactions (Matthews et al., 2001). The gene neighborhood method is based on the observation that functionally related genes encoding potentially interacting proteins are often transcribed as an operon (Overbeek et al., 1999). With genome-wide gene expression measurements available, some methods find gene co-expression patterns in interacting proteins (Jansen et al., 2002). Based on protein structural information, the protein docking methods use geometric and steric considerations to fit multiple proteins of known structure into a bound protein complex, to study interacting proteins at the atomic level (Comeau et al., 2004). Moreover, some methods analyze protein interactions at the domain level, considering protein domains as structural and functional units of proteins (Sprinzak and Margalit, 2001; Deng et al., 2002; Gomez et al., 2003). Each of these methods focuses on a single dataset, either a direct measurement of protein interaction or an indirect genomic dataset that contains information on protein interaction, so it is not surprising that these methods have certain limitations; detailed reviews of the individual methods can be found elsewhere (Shi et al., 2005; Shoemaker and Panchenko, 2007; Bonvin et al., 2006). In this chapter, we first introduce the data sources useful for protein interaction predictions, and then survey different approaches that integrate these data sources. We focus on three types of approaches: (1) domain-based methods, which predict protein physical interactions based on domain interactions by integrating direct protein interaction measurements from multiple organisms; (2) classification methods, which predict both protein physical interactions and protein co-complex relationships by integrating different types of data sources; and (3) complex-detection methods, which predict only protein co-complex membership by identifying protein complexes from integrated protein interaction data and other genomic data.
2 Data sources useful for protein interaction predictions
Different genomic data, such as DNA sequence, functional annotation, and protein localization information, have been used for predicting protein interactions. Each piece of information provides insight into a different aspect of protein interaction, and thus covers a different subset of the whole interactome. Depending on the data sources, these genomic data can be divided into four categories. First, high-throughput protein interaction data obtained from yeast two-hybrid (Y2H) and mass spectrometry of affinity-purified protein complexes (AP/MS) techniques provide direct protein interaction information and protein co-complex membership information, respectively. Second, functional genomic data such as gene expression and Gene Ontology (GO) annotation provide information on functional relationships between proteins. Third, sequence- and structure-based data reveal the sequence/structure homology and chromosome location of genes. Finally, network topological parameters calculated from Y2H and AP/MS data characterize the topological properties of the currently available protein interaction network. For example, the small-world clustering coefficient of a protein pair, calculated as the p-value from a hypergeometric distribution, measures the similarity between the neighbors of the protein pair in a network (Goldberg et al., 2003). A list of these data sources, along with their proteome coverage, is summarized in Table 1.

The effects of these data sources on prediction performance depend on how the data are encoded. With many datasets for a single data type collected under different experimental conditions, there are two ways to encode the datasets: the "detailed" encoding strategy treats every experiment separately, while the "summary" encoding strategy groups all experiments belonging to the same feature together and provides a single value (Qi et al., 2006). For example, when all the gene expression datasets are used as one data source, the "summary" encoding strategy generates a single similarity score for each pair of proteins, whereas the "detailed" encoding strategy yields multiple similarity scores for each protein pair, with one score computed from each gene expression dataset. Another challenge of integrated studies is the quality of the data to be integrated. It is well known that prediction power and accuracy would be decreased by including irrelevant and "noisy" data.

Among all the data sources, whole-genome gene expression data are currently the largest source of high-throughput genomic information and are considered the most important data source according to the variable importance criterion in a random forest-based method (Qi et al., 2006). Following gene expression, functional annotation from Gene Ontology, which covers about 80% of the yeast proteome, is the second most important data source according to the importance measure obtained from the random forest method. Although some other studies indicate that MIPS and GO annotation are more important than gene expression in prediction (Jansen et al., 2003; Lin et al., 2004; Lu et al., 2005), this may be due to the different numbers of gene expression datasets used and the different encoding styles of the features ("summary" vs. "detailed"). Nonetheless, gene expression, interaction data from Co-IP
Table 1. Useful data sources for predicting protein interactions

Category | Abbreviation | Data Type | Proteome Coverage (%) | Data Source
Protein Interaction Data | Y2H | Yeast two-hybrid data | 60 | Uetz et al. (2000); Ito et al. (2000)
Protein Interaction Data | MS | Protein complex data | 64 | Gavin et al. (2006); Krogan et al. (2006)
Functional Genomics Data | GE | Gene expression | 100 | Demeter et al. (2007)
Functional Genomics Data | FUN | GO molecular function | 62 | GO consortium (2006)
Functional Genomics Data | PRO | GO biological process | 70 | GO consortium (2006)
Functional Genomics Data | COM | GO cellular component | 72 | GO consortium (2006)
Functional Genomics Data | CLA | MIPS protein class | 80 | Mewes et al. (2006)
Functional Genomics Data | PHE | Mutant phenotype | 24 | Mewes et al. (2006)
Functional Genomics Data | PE | Protein expression | 65 | Ghaemmaghami et al. (2003)
Functional Genomics Data | ESS | Co-essentiality | 67 | Mewes et al. (2006)
Functional Genomics Data | MAE | Marginal essentiality | 99 | Lu et al. (2005)
Functional Genomics Data | GI | Genetic interaction | 24 | Tong et al. (2004)
Functional Genomics Data | TR | Transcription regulation | 98 | Harbison et al. (2004)
Sequence/Structure Information | GF | Gene fusion | 19 | Lu et al. (2005)
Sequence/Structure Information | GN | Gene neighborhood | 22 | Lu et al. (2005)
Sequence/Structure Information | PP | Phylogenetic profile | 29 | Lu et al. (2005)
Sequence/Structure Information | SEQ | Sequence similarity | 100 | Qi et al. (2006)
Sequence/Structure Information | INT | Interolog | 100 | Qi et al. (2006)
Sequence/Structure Information | DD | Domain-domain interaction | 65 | Qi et al. (2006)
Sequence/Structure Information | COE | Co-evolution scores | 22 | Goh et al. (2002)
Sequence/Structure Information | THR | Threading scores | 21 | Lu et al. (2005)
Sequence/Structure Information | PF | Protein fold | 26 | Sprinzak et al. (2005)
Network Topological Parameters | CLU | Small-world clustering coefficients (a) | NA | Bader et al. (2004); Sharan et al. (2005)

Proteome Coverage: percentage of proteins in S. cerevisiae that are annotated by this data type. Data Source: datasets used for each data type; only the references using the datasets with the highest coverage are listed. NA, not applicable. (a) The clustering coefficients are calculated from protein interaction data.
experiment, and functional annotation from MIPS and GO are considered the most important data sources in several studies according to the feature importance analysis (Jansen et al., 2003; Lin et al., 2004; Lu et al., 2005; Qi et al., 2006). This also suggests that a small number of important data sources may be sufficient to predict protein interaction, and that the prediction performance may not be improved
from adding additional weak data sources. When the datasets are noisy or have low proteome coverage, they may make little contribution to the prediction and lead to biased prediction results. For example, the structural information of proteins is potentially useful for protein interaction prediction, as demonstrated by the "protein-protein docking" approach, which assembles protein complexes based on the three-dimensional structural information of individual proteins. However, the availability of protein structural information is very limited, and the structures of individual proteins in an unbound state differ from those in a bound complex; therefore, prediction of protein interactions based on structural information alone is not reliable (Bonvin et al., 2006). Furthermore, some genomic datasets may not have a strong association with protein interactions. For example, the feature "transcription regulation" lists the genes co-regulated by the same set of transcription factors. Although it has been shown that co-regulated genes often function together through protein interactions, the proteins encoded by co-regulated genes do not necessarily interact (Yu et al., 2003).
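Returning to the "summary" versus "detailed" encoding strategies discussed earlier in this section, the sketch below builds both kinds of gene-expression features for a single protein pair; the correlation-based similarity score is one simple, hypothetical choice, not necessarily the one used in the studies cited.

```python
import numpy as np

def expression_features(expr_datasets, gene_i, gene_j, detailed=True):
    """Encode gene-expression similarity for one protein pair.

    expr_datasets: list of dicts mapping gene name -> expression profile
        (1-d numpy arrays), one dict per microarray experiment.
    detailed=True  -> one Pearson-correlation feature per dataset.
    detailed=False -> a single "summary" feature averaging over datasets.
    """
    scores = []
    for data in expr_datasets:
        if gene_i in data and gene_j in data:
            r = np.corrcoef(data[gene_i], data[gene_j])[0, 1]
        else:
            r = np.nan                     # pair not measured in this dataset
        scores.append(r)
    scores = np.array(scores)
    if detailed:
        return scores                      # "detailed" encoding: one value per dataset
    return np.nanmean(scores)              # "summary" encoding: one value overall
```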
3 Domain-based methods
Note that only the Y2H experimental technique provides direct evidence of physical protein-protein interactions in high-throughput analysis. As a result, we start with the methods that use data generated from the Y2H technique only. The Y2H experimental approach suffers from high error rates; for example, the false negative rate is estimated to be above 0.5 (Huang et al., 2007; Scholtens et al., 2007). Due to the large number of possible non-interacting protein pairs, the false positive rate, defined as the ratio of the number of incorrect interactions observed over the total number of non-interacting protein pairs, is small and is estimated as 1×10⁻³ or less (Chiang et al., 2007). But the false discovery rate, defined as the ratio of the number of incorrect interactions observed over the total number of observed interactions, is much greater and is estimated to be 0.2 to 0.5 (Huang et al., 2007), indicating that a large portion of the observations from the Y2H technique are incorrect. As Y2H data have become available in many model organisms, such as yeast, worm, fruit fly, and humans, several computational methods have been developed to borrow information from diverse organisms to improve the accuracy of protein interaction prediction. Noting that domains are structural and functional units of proteins and are conserved during evolution, these methods aim to identify specific domain pairs that mediate protein interactions by utilizing domain information as the evolutionary connection among these organisms.
3.1 Maximum likelihood-based methods (MLE)
The Maximum Likelihood Estimation (MLE) method, coupled with the Expectation-Maximization (EM) algorithm, has been developed to estimate the probabilities of domain-domain interactions given a set of experimental protein interaction data (Deng et al., 2002). It was originally used to analyze Y2H data from a single organism, S. cerevisiae, only. More recently, the MLE method was extended beyond the scope of a single organism by incorporating protein interaction data
from three different organisms, S. cerevisiae, C. elegans, and D. melanogaster, assuming that the probability that two domains interact is the same among all organisms (Liu et al., 2005). It was shown that the integrated analysis provides more reliable inference of protein-protein interactions than the analysis of a single organism (Liu et al., 2005). The details of the method are briefly described as follows. Let λ_mn represent the probability that domain m interacts with domain n, and let the notation (D_mn ∈ P_ijk) denote all pairs of domains from protein pair i and j in organism k, where k = 1, 2, …, K and K is the number of organisms. Let P_ijk represent the protein pair i and j in organism k, with P_ijk = 1 if protein i and protein j in organism k interact with each other, and P_ijk = 0 otherwise. Further, let O_ijk = 1 if an interaction between proteins i and j is observed in organism k, and O_ijk = 0 otherwise. The false negative rate (fn) and false positive rate (fp) of the protein interaction data are defined as

\[
fn = \Pr(O_{ijk} = 0 \mid P_{ijk} = 1) \quad \text{and} \quad fp = \Pr(O_{ijk} = 1 \mid P_{ijk} = 0).
\]

We further define O = {O_ijk = o_ijk, ∀ i ≤ j} and Λ = {λ_mn; D_mn ∈ P_ijk, ∀ m ≤ n, ∀ i ≤ j}. With the above assumptions and notation, we have

\[
\Pr(P_{ijk} = 1) \;=\; 1 - \prod_{D_{mn} \in P_{ijk}} (1 - \lambda_{mn}).
\]

The probability that proteins i and j in species k are observed to be interacting is given by

\[
\Pr(O_{ijk} = 1) \;=\; \Pr(P_{ijk} = 1)(1 - fn) + \{1 - \Pr(P_{ijk} = 1)\}\, fp. \qquad (3.1)
\]

The likelihood for the observed PPI data across all K organisms is then

\[
L(fn, fp, \Lambda \mid O) \;=\; \prod_{ijk} \Pr(O_{ijk} = 1)^{o_{ijk}} \{1 - \Pr(O_{ijk} = 1)\}^{1 - o_{ijk}}, \qquad (3.2)
\]

which is a function of (λ_mn, fn, fp). Deng et al. (2002) and Liu et al. (2005) specified values for fn and fp based on prior biological knowledge, and the λ_mn were then estimated using the EM algorithm by treating all DDIs and PPIs as missing data, as follows. For a given λ_mn^(t−1) obtained from the (t − 1)-th EM iteration, the next E-step can be computed as

\[
E\big(D_{mn}^{(ijk)} \mid O_{ijk} = o_{ijk}, \lambda^{(t-1)}\big)
\;=\; \frac{\lambda_{mn}^{(t-1)} (1 - fn)^{o_{ijk}}\, fn^{\,1 - o_{ijk}}}{\Pr\big(O_{ijk} = o_{ijk} \mid \lambda^{(t-1)}\big)}. \qquad (3.3)
\]

With the expectations of the complete data, in the M-step we update λ_mn by

\[
\lambda_{mn}^{(t)} \;=\; \frac{\lambda_{mn}^{(t-1)}}{N_{mn}} \sum_{ijk} \frac{(1 - fn)^{o_{ijk}}\, fn^{\,1 - o_{ijk}}}{\Pr\big(O_{ijk} = o_{ijk} \mid \lambda^{(t-1)}\big)}, \qquad (3.4)
\]
where N_mn is the total number of protein pairs containing the domain pair (m, n) across the three organisms, and the summation is over all these protein pairs.
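A compact sketch of this EM scheme is given below. The data structures (a list of protein pairs, each carrying its observation indicator and the indices of its candidate domain pairs) and the stopping rule are illustrative, and fn and fp are treated as fixed, as in Deng et al. (2002) and Liu et al. (2005).

```python
import numpy as np

def em_domain_interactions(pairs, n_domain_pairs, fn, fp, n_iter=100):
    """EM estimation of domain-interaction probabilities lambda_mn.

    pairs: list of (o_ijk, domain_pair_indices) tuples, one per protein
        pair (i, j) in organism k; o_ijk is the 0/1 observed interaction
        and domain_pair_indices lists the candidate domain pairs in P_ijk.
    n_domain_pairs: total number of distinct domain pairs (m, n).
    fn, fp: assumed false negative and false positive rates.
    """
    lam = np.full(n_domain_pairs, 0.1)           # initial lambda_mn values
    # N_mn: number of protein pairs containing each domain pair
    N = np.zeros(n_domain_pairs)
    for _, dps in pairs:
        N[list(dps)] += 1
    for _ in range(n_iter):
        new_lam = np.zeros(n_domain_pairs)
        for o, dps in pairs:
            dps = list(dps)
            p_int = 1.0 - np.prod(1.0 - lam[dps])           # Pr(P_ijk = 1)
            p_obs1 = p_int * (1 - fn) + (1 - p_int) * fp    # Pr(O_ijk = 1)
            p_obs = p_obs1 if o == 1 else 1.0 - p_obs1      # Pr(O_ijk = o_ijk)
            # E-step contribution (3.3), accumulated over protein pairs
            new_lam[dps] += lam[dps] * (1 - fn) ** o * fn ** (1 - o) / p_obs
        # M-step (3.4): divide the accumulated expectations by N_mn
        lam = np.where(N > 0, new_lam / np.maximum(N, 1), lam)
    return lam
```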
3.2 Bayesian methods (BAY)
Unlike the likelihood-based approaches described above, the false negative and false positive rates of the observed protein interaction data are treated as unknown in the Bayesian methods, so that the domain interaction probabilities and the false positive and false negative rates of the observed data can be estimated simultaneously (Kim et al., 2007). The error rates are treated as organism-dependent to allow different organisms to have different false negative and false positive rates. In this case, equation (3.1) is replaced by

\[
\Pr(O_{ijk} = 1) \;=\; \Pr(P_{ijk} = 1)(1 - fn_k) + \{1 - \Pr(P_{ijk} = 1)\}\, fp_k. \qquad (3.5)
\]
It was assumed that fn_k ~ Unif[u_{n_k}, v_{n_k}] and fp_k ~ Unif[u_{p_k}, v_{p_k}]. Further, λ_mn was assumed to have the prior distribution λ_mn ~ π δ₀(λ_mn) + (1 − π) Beta(α, β), where δ₀(·) is a point mass at zero. The full conditional distribution of Λ is proportional to

\[
[\Lambda \mid \text{rest}] \;\propto\; L(O \mid fn_k, fp_k, \Lambda)\, f(\Lambda \mid fn_k, fp_k)
\;\propto\; \prod_{ijk} \big[ h_{ij}(\Lambda)(1 - fn_k) + \{1 - h_{ij}(\Lambda)\}\, fp_k \big]^{o_{ijk}}
\big[ 1 - \big\{ h_{ij}(\Lambda)(1 - fn_k) + (1 - h_{ij}(\Lambda))\, fp_k \big\} \big]^{1 - o_{ijk}} f(\Lambda \mid fn_k, fp_k).
\]

The full conditional distributions of fn_k and fp_k are proportional to

\[
[fn_k \mid \text{rest}] \;\propto\; L(O \mid fn_k, fp_k, \Lambda)\, f(fn_k \mid \Lambda, fp_k)\, f(fp_k \mid \Lambda)
\;\propto\; \prod_{ijk} \big[ h_{ij}(\Lambda)(1 - fn_k) + \{1 - h_{ij}(\Lambda)\}\, fp_k \big]^{o_{ijk}}
\big[ 1 - \big\{ h_{ij}(\Lambda)(1 - fn_k) + (1 - h_{ij}(\Lambda))\, fp_k \big\} \big]^{1 - o_{ijk}} f(fn_k \mid \Lambda, fp_k);
\]

\[
[fp_k \mid \text{rest}] \;\propto\; L(O \mid fn_k, fp_k, \Lambda)\, f(fp_k \mid \Lambda, fn_k)\, f(fn_k \mid \Lambda)
\;\propto\; \prod_{ijk} \big[ h_{ij}(\Lambda)(1 - fn_k) + \{1 - h_{ij}(\Lambda)\}\, fp_k \big]^{o_{ijk}}
\big[ 1 - \big\{ h_{ij}(\Lambda)(1 - fn_k) + (1 - h_{ij}(\Lambda))\, fp_k \big\} \big]^{1 - o_{ijk}} f(fp_k \mid \Lambda, fn_k).
\]

The function h_ij(Λ) relates the probability that proteins i and j interact to the domain interaction probabilities; in the studies of Deng et al. (2002) and Liu et al. (2005), h_ij(Λ) = h¹_ij(Λ), the form defined above for Pr(P_ijk = 1). Under h¹_ij(Λ), the full conditional distributions of λ_mn, fn_k, and fp_k are log-concave functions. This function can also be formulated in an alternative way to incorporate varying numbers of domains across different proteins (Kim et al., 2007). The full conditional distributions of fn_k and fp_k are then still log-concave functions, whereas the full conditional distribution of λ_mn is no longer log-concave. In this case, adaptive rejection Metropolis sampling can be used to generate the posterior samples for λ_mn. The performance of the MLE method and the BAY method can be assessed by comparing their sensitivities and specificities in predicting interacting domain
pairs, illustrated by their Receiver Operating Characteristic (ROC) curves (Figure 1a). The protein interaction dataset consists of 12,849 interactions from 3 organisms (Giot et al., 2003; Ito et al., 2001; Li et al., 2004; Uetz et al., 2000). Compared to the likelihood-based methods, the Bayesian-based methods may be more efficient in dealing with a large number of parameters and more effective in allowing for different error rates across different datasets (Kim et al., 2007).
Figure 1: ROC curves of domain interaction prediction results from different methods.
Comparison of sensitivities and false positive rates (1 − specificity) obtained by different methods. MLE, maximum likelihood-based method (Liu et al., 2005); BAY, Bayesian method (Kim et al., 2007); DPEA, domain pair exclusion analysis (Riley et al., 2005); PE, parsimony explanation method (Guimaraes et al., 2006). a) MLE vs. BAY methods; b) DPEA vs. PE methods. The set of protein domain pairs in iPfam is used as the gold standard set (http://www.sanger.ac.uk/Software/Pfam/iPfam/). Here, the sensitivity is calculated as the number of predicted interacting domain pairs that are included in the gold standard set divided by the total number of domain pairs in the gold standard set. The false positive rate is calculated as the number of predicted interacting domain pairs that are not included in the gold standard set divided by the total number of possible domain pairs not included in the gold standard set. The area under the ROC curve is a measurement of prediction accuracy for each method.
3.3 Domain pair exclusion analysis (DPEA)
The maximum likelihood methods may preferentially detect domain pairs with a high frequency of co-occurrence in interacting protein pairs. To detect more specific domain interactions, Riley et al. (2005) modified the likelihood-based
Chapter '( Protein Interaction Predictions from Diverse Sources
167
proach and developed the domain pair exclusion analysis. In this approach, two different measures, the () values and the E score, were estimated for each pair of domains. The () value was obtained through the EM method, similarly as the one described above, corresponding to the probability that two domains interact with each other. This value was used as a starting point to compute the E-score for each domain pair (m, n), which was defined as the change in likelihood of the observed protein interactions, when the interaction between domain m and n was excluded from the model. The E-score of a domain pair (m, n) is given by: =
E mn
2..:)0 Ok 'J 0
Pr(Oijk = 1 Idomain pair m, n can interact) g Pr (Oijk = 1 Idomain pair m, n do not interace)
1-
= """' 10
L...J g 1 ijk
II
(1 -
()kl)
(Dkl EPijk )
II
(DklEPijk)
(1 _
()mn) .
kl
Here, the notation (Dkl E Pijk ) denotes all pairs of domains from protein pair i and j in organism k, ()kl was the maximum likelihood estimate of domain interaction probability for the set of domain pairs, and ()f1n represents the same set of domain interaction probabilities. However, the probability of domain pair m and n interacting was set to zero, and then the EM algorithm was rerun to get the maximum likelihood estimates of other domain interactions. In this way, the new values of ()kl were obtained under the condition that domain pair m and n do not interact. The E-score may help to identify the specific interacting domain pairs where the MLE and BAY methods may fail to detect. This approach has been applied to all the protein interactions in the Database of Interacting Proteins (DIP) (Salwinski et at., 2004), assuming no false positive and false negatives.
3.4
Parsimony explanation method (PE)
Using the same dataset constructed by Riley et al. (2005) from the DIP database, the parsimony approach formulates the problem of domain interaction prediction as a Linear Programming (LP) optimization problem (Guimaraes et al. 2006). In this approach, the LP score Xmn associated with each domain pair was inferred by minimizing the object function 2:mn Xmn subject to a set of constraints describing all interacting protein pairs {Pi, Pj }, which require that 2:mE P i,nEPj Xmn :? 1. In addition to the LP score, another measure, the pw-score was defined. Similar to the E value in the DPEA method, the pw-score is used as an indicator to remove the promiscuous domain pairs that occur frequently and have few witnesses. Here, the witness of a domain pair is defined as the interacting protein pair only containing this domain pair. We also compare the performance of DPEA method and PE method using their ROC curves (Figure 1b). The reason contributing to the improved performance of the PE method compared to the DPEA method could be that the DPEA method tends to assign higher probabilities to the infrequent domain pairs in multi-domain proteins, while it is avoided in the PE method (Guimaraes et al. 2006). All the methods described in this section focus on estimating domain interaction by pooling protein interaction information from
168
Yin Liu, Inyoung Kim, Hongyu Zhao
diverse organisms. Box 1 illustrates the difference of these methods in identifying interacting domain pairs. The estimated domain interactions can then be used for the protein interaction prediction by correlating proteins with their associated domains. We expect that the results from these domain interaction prediction methods can be further improved when the domain information is more reliably annotated in the future. The current information on domain annotation is incomplete. For example, only about two-thirds of the proteins and 73% of the sequences in yeast proteome are annotated with domain information in the latest release of PFAM database (Finn et az', 2006). As a result, prediction based on domain interaction will only be able to cover a portion of the whole interactome. To overcome this limitation, a Support Vector Machine learning method was developed (Martin et aZ., 2005). Instead of using the incomplete domain annotation information, this method uses the signature products, which represent the protein primary sequence only, for protein interactions prediction (Martin et al., 2005). In addition to the incomplete domain annotation information, we note that there is another limitation of these domain interaction prediction methods: the accuracy and reliability of these methods highly depend on the protein interaction data. Although the prediction accuracy can be improved by integrating data from mUltiple organisms, the protein interaction data itself may be far from complete. Therefore, efforts have been made to integrate other types of biological information and these are briefly reviewed next. Box 1. Identification of domain interactions under different scenarios Note: PI represents protein 1, DI represents domain 1, PI that protein 1 only contains domain 1. Scenario 1: Protein-domain relationship:
= {DI} represents
Observed protein interaction data: PI
f-+
P 2 , PI
f-+
P 4, PI
f-+
P6 , P2
f-+
P3, P3
Ps
f-+
f-+
P 4, P 3
f-+
P6, P2
f-+
Ps,P4
f-+
P s,
P6 .
Under this scenario, each protein containing domain 1 interacts with each protein containing domain 2. All four methods MLE, BAY, DPEA, and PE can identify the interaction between domain 1 and domain 2. Scenario 2: Protein-domain relationship: Pl
= {DI}, P 2 = {D 2 }, P 3
= {DIl, P 4 = {D 2 }, P s
= {D I }, P 6 = {D 2 }.
Observed protein interaction data: PI
f-+
P 2, P 3
f-+
P 4, P 5
f-+
P6.
Under this scenario, only a small fraction (3/9) of protein pairs containing domain pair 1 and 2 interact. Both MLE and BAY methods may not be able to identify
Chapter 7 Protein Interaction Predictions from Diverse Sources
169
the interaction between domain 1 and 2. But, if the interaction of domain 1 and 2 is excluded, the likelihood of the observed protein interaction data is lower that under the condition domain 1 and 2 interact, which leads to the high Escore of this domain pair in the DPEA method, therefore, DPEA can detect the interaction. For the PE method, the interaction between domain 1 and 2 is the only explanation for the observed data, so it can be detected by the PE method as well. Scenario 3: Protein-domain relationship:
Observed protein interaction data: PI ....... P 2 ,P 3
.......
P 4 ,P 5
.......
P6.
Under this scenario, as the only protein pair (PI, P 2 ) containing domains 3 and 4 interact, both MLE and BAY methods can detect the interaction between domains 3 and 4. While only a small fraction (3/9) of protein pairs containing domain pair 1 and 2 interact, the interaction between domains 1 and 2 may not be detected by MLE and BAY methods. The DPEA method may detect interaction of domain pair (1, 2), and domain pair (3, 4) as well because excluding both domain pairs decreases the likelihood of the observed data. For the PE method, because the interaction between domain pair (1, 2) represents the smallest number of domain pairs to explain the observed data, this interaction is preferred than the interaction between domain pair (3, 4).
4 4.1
Classification methods Integrating different types of genomic information
Many computational approaches, ranging from simple union or intersection of features to more sophisticated machine learning methods have been applied to integrate different types of genomic information for protein interaction predictions. These approaches aim at performing two major tasks: predicting direct physical interactions between two proteins, and predicting whether a pair of proteins are in the same complex, which is a more general definition of protein interaction. Many classification methods have been applied to perform these two tasks by integrating different types of data sources. These methods selected some "gold standard" datasets as training data to obtain the predictive model, and then applied the model on test datasets for protein interaction predictions. For instance, Jansen et al. (2003) was among the first to apply a NaIve Bayes classifier using genomic information including m RNA expression data, localization, essentiality and functional annotation for predicting protein co-complex relationship. Based on the same set of data, Lin et al. (2004) applied two other classifiers, logistic regression and random forest, and demonstrated that random forest outperforms the other two and logistic regression performs similarly with the NaIve Bayesian method. The logistic regression approach, based on functional and network topological properties,
Yin Liu, Inyoung Kim, Hongyu Zhao
170
has also been presented to evaluate the confidence of the protein-protein interaction data previously obtained from both Y2H and AP IMS experiments (Bader et al., 2004). More recently, Sharan et al. (2005) also implemented a similar logistic regression model, incorporating different features, to estimate the probability that a pair of proteins interact. Kernel methods incorporating multiple sources of data including protein sequences, local properties of the network and homologous interactions in other species have been developed for predicting direct physical interaction between proteins (Ben-Hur and Noble, 2005). A summary of some published methods along with their specific prediction tasks are listed in Table 2. Table 2. Summary of previous methods integrating multiple data sources for protein interactions prediction Task
Methods
Data Sources Used
References
Y2H, MS, GE, FUN, PRO, Random Forest Protein physical interaction k-Nearest Neighbor
Logistic Regression Random Forest
k-Nearest Neighbor Co-complex membership (of a protein pair)
Support Vector Machin
Logistic Regression Naive Bayes
COM, CLA, PHE, PE, ESS, GI, TR, GF, GN, PP, SEQ, INT, DD Y2H, MS, GE, FUN, PRO, COM, CLA, PHE, PE, ESS, GI, TR, GF, GN, PP, SEQ, INT, DD Y2H, MS, GE, CLU Y2H, MS, GE, PRO, CLA, ESS Y2H, MS, GE, FUN, PRO, COM,CLA, PHE, PE, ESS, GI, TR,GF, GN, PP, SEQ, INT, DD Y2H, MS, GE, FUN, PRO, COM,CLA, PHE, PE, ESS, GI,TR,GF, GN, PP, SEQ, INT, DD GE, PRO, COM, TR,GF, GN, PP, DD, PF GE, PRO, CLA, ESS
Decision Tree
Y2H, MS, GE, COM,PHE, TR,SEQ, GF, GN,PP
Protein complex detection
BH-subgraph
Y2H, MS, COM
SAMBA
Y2H,MS,GE,PHE,TR
Domain interactions a
Naive Bayes
Y2H, FUN, PRO, GF
Evidence Counting
Y2H, FUN, PRO, GF
Qi et al. (2006)
Qi et al. (2006)
Bader et al. (2004) Sharan et al. (2005) Lin et al. (2004)
Lee et al. (2006)
Qi et al. (2006)
Sprinzak et al. (2005 Jansen et al. (2003) Zhang et al. (2004) Scholtens et al. (2004, 2005) Tanay et al. (2004) Lee et al. (2006) Lee et al (2006)
Data sources used, the list of abbreviated data sources used by each method. There may be several different implementations using different sets of data sources by each method. When using the same set of data sources, these methods may be sorted according to their performance,
Chapter 7 Protein Interaction Predictions from Diverse Sources
171
as shown in this table, with the best-performing methods listed at the top. For predicting protein physical interaction and protein pairwise co-complex membership, the comparison is based on the precision-recall curves and the Receiver Operator Characteristic (ROC) curves of these methods with the "summary" encoding (Qi et at., 2006). For predicting domain interactions, the comparison is based on the ROC curves of these methods (Lee et at., 2006). There is no systematic comparison of the two protein complex detection methods that integrate heterogeneous datasets. aThis task focuses on predicting domain-domain interactions instead of protein interactions. The Y2H data here are obtained from multiple organisms, and the data source GF represents the domain fusion event instead of gene fusion event as used for other tasks.
4.2
Gold standard datasets
The gold standard datasets defined for training purposes have effects on the relative performance of different classification methods, as discussed in the first DREAM (Dialogue on Reverse Engineering Assessments and Methods) conference (Stolovitzky et al., 2007). Protein pairs obtained from DIP database (Salwinski et al., 2004) and protein complex catalog obtained from MIPS database (Mews et al., 2006) are two most widely used sets of gold standard positives. While DIP focuses on physically interacting protein pairs, MIPS complexes catalog captures the protein complex membership, which lists proteins pairs that are in the same complex, but not necessarily have physical interactions. Therefore, when predicting physical protein interactions, the set of co-complex protein pairs from MIPS is not a good means of assessing the method accuracy. Two groups of protein pairs have been treated as the gold standard negatives most popularly in the literature: random/all protein pairs not included in the gold standard positives and the protein pairs annotated to be in different subcellular localization. Neither of them is perfect: the first strategy may include true interacting protein pairs, leading to increased false positives, while the second group may lead to increased false negatives when a multifunctional protein is active in multiple subcellular compartments but only has limited localization annotation. Protein pairs whose shortest path lengths exceed the median shortest path for random protein pairs in a protein network constructed from experimental data are treated as negative samples as well (Bader et al., 2004). However, this strategy may not be reliable as the experimental data are in general noisy and incomplete.
4.3
Prediction performance comparison
The availability of such a wide range of classification methods requires a comprehensive comparison among them. One strategy is to validate prediction results according to the similarity of interacting proteins in terms of function, expression, sequence conservation, and so on, as they are all shown to be associated with true protein interactions. The degree of the similarities can be measured to compare the performance of prediction methods (Stolovitzky et al., 2007; Suthram et al., 2006). However, as the similarity measures are usually used as the input features in the integrated analysis, another most widely applied strategy for comparison uses Receiver Operator Characteristic (ROC) or Precision-Recall (PR) curves. ROC curves plot true positive rate vs. false positive rate, while PR curves plot precision
172
Yin Liu, Inyoung Kim, Hongyu Zhao
(fraction of prediction results that are true positives) vs. recall (true positive rate). When dealing with highly skewed datasets (e.g. the size of positive examples is much smaller than that of negative examples), PR curves can demonstrate differences between prediction methods that may not be apparent in ROC curves (Davis and Goadrich, 2007). However, because the precision does not necessarily change linearly, interpolating between points in a PR curve is more complicated than that in a ROC curve, where a straight line can be used to connect points (Davis and Goadrich, 2007). A recent study evaluated the predictive power and accuracies of different classifiers including random forest (RF), RF-based k-nearest neighbor (k RF), NaIve Bayes (NB), Decision Tree (DT), Logistic Regression (LR) and Support Vector Machines and demonstrated in both PR and ROC curves that RF performs the best for both physical interaction prediction and co-complex protein pair prediction problem (Qi et al., 2006). Due to its randomization strategy, an RF-based approach can maintain prediction accuracy when data is noisy and contains many missing values. Moreover, the variable importance measures obtained from RF method help determine the most relevant features used for the integrated analysis. One point we need to pay attention to, though, is that RF variable importance measures may be biased in situations where potential variables vary in their scale of measurement or their number of categories (Strobl et al., 2007). Although NB classifier allows the derivation of a probabilistic structure for protein interactions and is flexible for combining heterogeneous features, it was the worst performer among the six classifiers. The relatively poor performance could be due to its assumption of conditional independence between features, which may not be the case, especially when many correlated features are used. A boosted version of simple NB, which is resistant to feature dependence, significantly outperforms the simple NB, as demonstrated in a control experiment with highly-dependent features, indicating the limitation of simple NB. Although a recent study showed no significant correlations between a subset of features using Pearson correlation coefficients and mutual information as measures of correlation, there was no statistical significance level measured in this study (Lu et al., 2005). Therefore, we cannot exclude the possibility that genomic features are often correlated with each other, especially when the "detailed" encoding strategy is used. Logistic regression generally predicts a binary outcome or estimates the probability of certain event, but its relative poor performance with "detailed" features could be due to the relative small size of gold standard positives currently available for training to the number of features, known as the problem of over-fitting (Qi et al., 2006). It was shown that when the training size increases, the logistic regression method became more precise and led to improved prediction. However, the RF method still outperforms LR even when a larger training set is used, indicating the superiority of RF method compared to other methods (Qi et al., 2006).
5
Complex detection methods
Unlike the methods described focusing on predicting the pairwise relationship between two proteins, the complex detection methods aim to identifying multi-
Chapter '/ Protein Interaction Predictions from Diverse Sources
173
protein complexes consisting of a group of proteins. Since the Affinity purification coupled with Mass Spectrometry (AP IMS) technique provides direct information on protein co-complex membership in a large-scale, this type of data has been used extensively for protein complex identification. However, this technique is subject to many experimental errors such as missing weak and transient interactions among proteins. Therefore, many methods have been proposed to integrate other types of data for protein complex identification.
5.1
Graph theoretic based methods
Protein interaction data can be modeled as an interaction graph, with set of nodes representing proteins, and edges representing interactions between proteins. Graph-theoretic algorithms define protein complexes as highly connected proteins that have more interactions within themselves and fewer interactions with the rest of the graph. For example, Scholtens et al. (2004, 2005) developed the BHcomplete subgraph identification algorithm coupled with a maximum likelihood estimation based approach to give an initial estimate of protein complex membership. Based on the interaction graph from AP IMS technique, a BH-complete subgraph is defined as a collection of proteins for which all reciprocated edges between bait proteins exist, and all unreciprocated edges exist between bait proteins and hit-only proteins. In the results from AP IMS experiment, the probability Pij of detecting protein j as a hit using protein i as a bait is given by a logistic regression model: log
(
Pij ) -1-= f-L - Pij
+ aYij + (3Sij
where Yij = 1 if the edge between i and j exists in the true AP IMS graph and 0 otherwise. The similarity measure Sij can be used to integrate data sources other than AP IMS experiments. As done in the study of Scholtens and Gentleman (2004), 8ij is calculated according to GO cellular component annotation. After estimating Yij by maximizing the likelihood of observed AP IMS data, the initial protein complexes can be estimated by identifying the maximal BH-complete subgraphs. Then the initial estimated complexes can be merged to yield more accurate complexes considering the missing edges due to false negative observations.
5.2
Graph clustering methods
Based on the maximum likelihood scoring, Tanay et al. (2004) proposed a Statisti cal-Algorithmic Method for Bicluster Analysis (SAMBA) to integrate heterogeneous data to detect molecular complexes. SAMBA models the genomic data as a bipartite graph, with nodes on one side representing proteins, and those on the other side representing properties from different types of genomic data including gene expression information, protein interaction data, transcription factor binding and phenotypic sensitivity information. The edge between a protein node u and a property v indicates that protein u has the property v. In this case, a bicluster is a subset of proteins with a set of common properties. A likelihood ratio is used
Yin Liu, Inyoung Kim, Hongyu Zhao
174
to score a potential biocluster (A, B) as follows: Pc Io g - +
L= (u,l/)E(A,B)
PU,l/
L
(u,l/)~(A,B)
I og
1- Pc 1- PU,l/
Here PU,l/ and pc represent the probability that the edge between u and v exists under the null model and the alternative model, respectively. PU,l/ is estimated by generating a set of random networks preserving the degree of every protein and calculating the fraction of networks in which the edge between u and v exists. Pc is a fixed probability that satisfies Pc > max PU,l/. Then graph-theoretic based algorithms can be used to search for densely connected subgraphs having high log-likelihood scores. With SAMBA algorithm, not only the protein complexes can be identified, the biological properties such as gene expression regulation under different conditions can also be associated with each protein complex, which provides insight into the function of the complex. Some unsupervised clustering methods based on special similarity measurements of protein pairs have been applied to the protein complex identification problem. Several examples of these methods are Super Paramagnetic Clustering (SPC), Restricted Neighborhood Search Clustering (RNSC), Molecular Complex Detection (MCODE) and Markov Clustering (MCL). Although the development of these clustering methods represents an important research area in protein complex detection, they are solely applied to direct protein interaction data, either from Y2H or AP /MS techniques, as we can see from the detailed description of these methods in their original publications (Bader and Hogue, 2003; Blatt et al., 1996; Enright et al., 2002; King et al., 2004). Since all these methods perform the clustering task based on the similarity measurements of protein pairs, other types of indirect information, such as gene expression and function annotation should be easily integrated into the analysis when computing the similarities between protein pairs.
5.3
Performance comparison
A recent study evaluated four clustering methods (SPC, RNSC, MCODE and MCL) based on the comparison of the sensitivities, positive predictive values and accuracies of the protein complexes obtained from these methods (Brohee and van HeIden, 2006). A test graph was built on the basis of 220 known complexes in the MIPS database and 41 altered graphs were generated by randomly adding edges to or removing edges from the test graph in various proportions. The four methods were also applied to six graphs obtained from high-throughput experiments and the identified complexes were compared with the known annotated complexes. It was found that the MCL method was most robust to graph alterations and performed best in complex detection on real datasets, while RNSC was more sensitive to edge deletion and relatively less sensitive to suboptimal parameters. SPC and MCODE performed poorly under most conditions.
Chapter 7 Protein Interaction Predictions from Diverse Sources
6
175
Conclusions
The incomplete and noisy PPI data from high-throughput techniques require robust mathematical models and efficient computational methods to have the capability of integrating various types of features for protein interaction prediction problem. Although many distinct types of data are useful for protein interactions predictions, some data types may make little contribution to the prediction or may even decrease the predictive power, depending on the prediction tasks to be performed, the ways the data sets are encoded, the proteome coverage of these data sets and their reliabilities. Therefore, feature (data set) selection is one of the challenges faced in the area of integrating multiple data sources for protein interaction inference. In addition to these challenges, comparison and evaluation of these integration methods are in great need. Each method used for integration has its own limitations and captures different aspects of protein interaction information under different conditions. Moreover, as the gold standard set used for evaluating prediction performance of different methods is incomplete, the prediction results should be validated by small scale experiments as the ultimate test of these methods. With these issues in mind, we expect there is much room for current data integration methods and their evaluation to be improved.
Acknowledgements This work was supported in part by NIH grants R01 GM59507, N01 HV28286, P30 DA018343, and NSF grant DMS 0714817.
References [1] Bader, G.D. and Hogue C.W. (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4,2 [2] Bader, J.S. et al. (2004) Gaining confidence in high-throughput protein interaction networks. Nat Biotechnol 22, 78-85 [3] Ben-Hur, A. and Noble, W.S. (2005) Kernel methods for predicting proteinprotein interactions. Bioinformatics 20, 3346-3352 [4] Blatt, M. et al. (1996) Superparamagnetic clustering of data. Phys Rev Lett. 76, 3251-3254 [5] Bonvin, A.M. (2006) Flexible protein-protein docking. Curr Opin Struct Biol.16, 194-200 [6] Brohee, S. and van HeIden, J. (2006) Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics 7, 488 [7] Chiang, T. et al. (2007) Coverage and error models of protein-protein interaction data by directed graph analysis. Genome Biol. 8, R186 [8] Comeau, S. et al. (2004) ClusPro: an automated docking and discrimination method for the prediction of protein complexes. Bioinformatics 20, 45-50 [9] Davis, J. and Goadrich, M. (2007) The relationship between precision-recall
176
[10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30]
[31]
Yin Liu, Inyoung Kim, Hongyu Zhao and ROC curves. Proceedings of the 23"d International Conference on Machine Learning, (pp 233-240), Pittsburgh, PA. Deng, M. et al. (2002) Inferring domain-domain interactions from proteinprotein interactions. Genome Res. 12, 1540-1548 Enright, A.J. et al. (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature 402,86-90 Enright, A.J. et al. (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research 30, 1575-1584 Finn, RD. et al. (2006) Pfam: clans, web tools and services. Nucleic Acids Res 34, D247-D251 Gene Ontology Consortim (2006) The Gene Ontology (GO) project in 2006. Nucleic Acids Res. 34, D322-326 Giot, L. et al. (2003) A protein interaction map of drosophila melanogaster. Science 302, 1727-1736 Gomez, S.M. et al. (2003) Learning to predict protein-protein interactions from protein sequences. Bioinformatics 19, 1875-1881 Goldberg, D.S. and Roth, F.P. (2003) Assessing experimentally derived interactions in a small world. Proc Natl Acad Sci USA. 100,4372-4376 Guimaraes, K. S. et al. (2006) Predicting domain-domain interactions using a parsimony approach. Genome Biol7, RI04 Hart, G.T. et al. (2006) How complete are current yeast and human proteininteraction networks. Genome Biol. 7, 120 Huang, H. et al. (2007) Where have all the interactions gone. Estimating the coverage of two-hybrid protein interaction maps. PLoS Comput Biol. 3, e214 Ito, T. et al. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 98, 4569-4574 Jansen, R et al. (2002) Relating whole-genome expression data with proteinprotein interaction. Genome Res. 12, 37-46 Jansen, R et al. (2003) A Bayesian networks approach for predicting proteinprotein interactions from genomic data. Science 302, 449-453 Kim, 1. et al. (2007) Bayesian methods for predicting interacting protein pairs using domain information. Biometrics 63, 824-833 King, A.D. et al. (2004) Protein complex prediction via cost-based clustering. Bioinformatics. 2004 Nov 22;20(17):3013-3020 LaCount, D.J. et al. (2005) A protein interaction network of the malaria parasite plasmodium falciparum. Nature 438, 103-107 Lee, H. et al. (2006) An integrated approach to the prediction of domaindomain interactions. BMC Bioinformatics 7, 269 Li, S. et al. (2004) A map of the interactome network of the metazoan c. elegans. Science 303, 540-543 Lin, N. et al. (2004) Information assessment on prediction protein-protein interactions. BMC Bioinformatics 5, 154 Liu, Y. et al. (2005) Inferring protein-protein interactions through highthroughput interaction data from diverse organisms. Bioinformatics 21, 32793285 Lu, L.J. et al. (2005) Assessing the limits of genomic data integration for
Chapter 7 Protein Interaction Predictions from Diverse Sources
177
predicting protein networks. Genome Res. 15, 945-953 [32] Marcotte, E.M. et al. (1999) Detecting protein function and protein-protein interactions from genome sequence. Science 285, 751-753 [33] Martin, S. et al. (2005) Predicting protein-protein interactions using signature products. Bioinformatics 21, 218-226 [34] Matthews, L.R. et al. (2001) Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs". Genome Res. 11, 2120-2126 [35] Overbeek, R. et al. (1999) The use of gene clusters to infer functional coupling. Proc Natl Acad USA 96, 2896-2901 [36] Pazos, F. et al. (2003) Assessing protein co-evolution in the context of the tree of life assists in the prediction of the interactome. J Mol BioI. 352, 1002-1015 [37] Qi, Y. et al. (2006) Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins: structure, function and bioinformatics 63, 490-500 [38] Riley, R. et al. (2005) Inferring protein domain interactions from databases of interacting proteins. Genome BioI. 6, R89 [39] Rual, J.F. et al. (2005) Towards a proteome-scale map of the human proteinprotein interaction network. Nature 437, 1173-1178 [40] Salwinski, 1. et al. (2004) The database of interacting proteins: 2004 update. Nucleic Acids Res 32, D449-D451 [41] Scholtens, D. et al. (2007) Estimating node degree in bait-prey graphs. Bioinformatics 24, 218-224 [42] Sharan, R. et al. (2005) Conserved patterns of protein interaction in multiple species. Proc Natl Acad Sci USA 102, 1974-1979 [43] Shi, T. et al. (2005) Computational methods for protein-protein interaction and their application. CUrT Protein Pept Sci 6, 443-449 [44] Shoemaker, B.A. and Panchenko, A.R. (2007) Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PloS Comput BioI 3(4), e43 [45] Sprinzak, E. and Margalit, H. (2001) Corrlated sequence-signatures as markers of protein-protein interaction. J Mol BioI. 311, 681-692 [46] Sprinzak, E. et al. (2005) Characterization and prediction of protein-protein interactions within and between complexes. Proc Natl Acad Sci USA 103, 14718-14723 [47] Stolovitzky, G. et al. (2007) Dialogue on Reverse-Engineering Assessment and Methods: The DREAM of High-Throughput Pathway Inference. Ann N Y Acad Sci. 1115:1-22. Strobl, C. et al. (2007) Bias in random forest variable importance measures: [48] Illustrations, sources and a solution. BMC Bioinformatics 8, 25 [49] Suthram, S. et al. (2006) A direct comparison of protein interaction confidence assignment schemes. BMC Bioinformatics 7, 360 Tanay, A. et al. (2004) Revealing modularity and organization in the yeast [50] molecular network by integrated analysis of highly heterogeneous genomewide data. Proc Natl Acad Sci USA 101, 2981-2986 [51] Uetz, P. et al. (2000) A comprehensive analysis of protein-protein interactions
178
Yin Liu, Inyoung Kim, Hongyu Zhao
in Saccharomyces cerevisiae. Nature 403, 623-627 [52] Yu, H. et al. (2003) Genomic analysis of gene expression relationships in transcriptional regulatory networks. Trends Genet. 19, 422-427
Chapter 8 Regulatory Motif Discovery: From Decoding to Meta-Analysis Qing Zhou*
Mayetri Guptat
Abstract Gene transcription is regulated by interactions between transcription factors and their target binding sites in the genome. A motif is the sequence pattern recognized by a transcription factor to mediate such interactions. With the availability of high-throughput genomic data, computational identification of transcription factor binding motifs has become a major research problem in computational biology and bioinformatics. In this chapter, we present a series of Bayesian approaches to motif discovery. We start from a basic statistical framework for motif finding, extend it to the identification of cis-regulatory modules, and then discuss methods that combine motif finding with phylogenetic footprinting, gene expression or ChIP-chip data, and nucleosome positioning information. Simulation studies and applications to biological data sets are presented to illustrate the utility of these methods. Keywords: Transcriptional regulation; motif discover; cis-regulatory; Gene expression; DNA sequence; ChIP-chip; Bayesian model; Markov Chain Monte Carlo.
1
Introduction
The goal of motif discovery is to locate short repetitive patterns ("words") in DNA that are involved in the regulation of genes of interest. In transcriptional regulation, sequence signals upstream of each gene provide a target (the promoter region) for an enzyme complex called RNA polymerase (RNAP) to bind and initiate the transcription of the gene into messenger RNA (mRNA). Certain proteins called transcription factors (TFs) can bind to the promoter regions, either interfering with the action of RNAP and inhibiting gene expression, or enhancing gene expression. TFs recognize sequence sites that give a favorable binding energy, which often translates into a sequence-specific pattern (rv 8-20 base pairs long). Binding sites thus tend to be relatively well-conserved in composition - such a conserved *Department of Statistics, University of California at Los Angeles, Los Angeles, CA 90095, USA, E-mail: [email protected] tDepartment of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27516-7420, USA, E-mail:[email protected]
179
180
Qing Zhou, Mayetri Gupta
pattern is termed as a "motif". Experimental detection of TF-binding sites (TFBSs) on a gene-by-gene and site-by-site basis is possible but remains an extremely difficult and expensive task at a genomic level, hence computational methods that assume no prior knowledge of the motif become necessary. With the availability of complete genome sequences, biologists can now use techniques such as DNA gene expression microarrays to measure the expression level of each gene in an organism under various conditions. A collection of expressions of each gene measured under various conditions is called the gene expression profile. Genes can be divided into clusters according to similarities in their expression profiles-genes in the same cluster respond similarly to environmental and developmental changes and thus may be co-regulated by the same TF or the same group of TFs. Therefore, computational analysis is focused on the search for TFBSs in the upstream of genes in a particular cluster. Another powerful experimental procedure called Chromatin ImmunoPrecipitation followed by microarray (ChIP-chip) can measure where a particular TF binds to DNA in the whole genome under a given experimental condition at a coarse resolution of 500 to 2000 bases. Again, computational analysis is required to pinpoint the short binding sites of the transcription factor from all potential TF binding regions. With these high throughput gene expression and ChIP-chip binding data, de novo methods for motif finding have become a major research topic in computational biology. The main constituents a statistical motif discovery procedure requires are: (i) a probabilistic structure for generating the observed text (Le. in what context a word is "significantly enriched") and (ii) an efficient computational strategy to find all enriched words. In the genomic context, the problem is more difficult because the "words" used by the nature are never "exact", i.e., certain "mis-spellings" can be tolerated. Thus, one also needs a probabilistic model to describe a fuzzy word. An early motif-finding approach was CONSENSUS, an information theorybased progressive alignment procedure [42]. Other methods included an EMalgorithm [11] based on a missing-data formulation [24], and a Gibbs sampling algorithm [23]. Later generalizations that allowed for a variable number of motif sites per sequence were a Gibbs sampler [28, 33] and an EM algorithm for finite mixture models [2]. Another class of methods approach the motif discovery problem from a "segmentation" perspective. MobyDick [6] treats the motifs as "words" used by nature to construct the "sentences" of DNA and estimates word frequencies using a Newton-Raphson optimization procedure. The dictionary model was later extended to include "stochastic" words in order to account for variations in the motif sites [16, 36] and a data augmentation (DA) [43] procedure introduced for finding such words. Recent approaches to motif discovery have improved upon the previous methods in at least two primary ways: (i) improving and sensitizing the basic model to reflect realistic biological phenomena, such as multiple motif types in the same sequence, "gapped" motifs, and clustering of motif sites (cis-regulatory modules) [30, 51, 17], and (ii) using auxiliary data sources, such as gene expression microarrays, ChIP-chip data, phylogenetic information and the physical structure of DNA
Chapter 8
Regulatory Motif Discovery: From Decoding to Meta-Analysis
181
[9, 21, 52, 18]. In the following section we will discuss the general framework of de-novo methods for discovering uncharacterized motifs in biological sequences,
focusing especially on the Bayesian approach.
2
A Bayesian approach to motif discovery
In this section, unless otherwise specified, we assume that the data set is a set of N unaligned DNA fragments. Let S = (81 , ... ,8N ) denote the N sequences of the data set, where sequence 8i is oflength Li (i = 1" .. ,N). Multiple instances of the same pattern in the data are referred to as motif sites or elements while different patterns are termed motifs. Motif type k (of, say, width Wk) is characterized by a Position-Specific Weight matrix (PWM) 8 k = (Ok1,'" ,OkWk)' where the J-dimensional (J = 4 for DNA) vector Oki = ((Jki1,'" ,(JkiJ)T represents the probabilities of occurrence of the J letters in column i, (i = 1,'" ,Wk). The corresponding letter occurrence probabilities in the background are denoted by 0 0 = ((J01, ... ,(JOJ). Let E> = {8 1 , ... ,8 K }. We assume for now that the motif widths, Wk (k = 1"" ,K) are known (this assumption will be relaxed later). The locations of the motif sites are unknown, and are denoted by an array of missing indicator variables A = (A ijk ), where Aijk = 1 if position j (j = 1"" ,Li ) in sequence i (i = 1"" ,N) is the starting point of a motif of type k (k = 1,," ,K). For motif type k, we let Ak = {Aijk : i = 1"" ,N; j = 1"" ,Li }, i.e., the indicator matrix for the site locations corresponding to this motif type, and define the alignment: 1''; - , 1 '" , N', J' - , 1 ... , 1L·} S1(Ak) -- {S·· 1,)'. A"1,J k -,(J " - , 1 ... ,L·} S 2CAk) -- {Si,j+1·. A·· tJk -- 1''; ,. -1 - " .. , N', J' t,
CAk) -- {s·t , J. + W k1- '. A"'LJ k SWk
--
1''; - , 1 ... , N·, J' - , 1 '" ,L·} ,(J t·
si
In words, Ak ) is the set of letters occurring at position i of all the instances of the type-k motif. In a similar fashion, we use S(A to denote the set of all letters occurring in the background, where S(K) = S \ U~=1 U;:1 SI(Ak) (for two sets A, B, A c B, B \ A == B n AC). Further, let C : S -+ Z4 denote a "counting" function that gives the frequencies of the J letters in a specified subset of S. For example, if after taking the set of all instances of motif k, in the first column, we observe a C
)
C(si
Ak total occurrence of 10 'A's, 50 'T's and no 'C' or 'G's, )) Assuming that the motif columns are independent, we have
[C(Si
Ak
)), ... ,C(Si,;!k))]
rv
Product-Multinomial[8k
=
(Ok1,'"
=
(10,0,0,50).
,OkWk)]'
i.e., the i-th vector of column frequencies for motif k follows a multinomial distribution parametrized by (Jki' We next introduce some general mathematical notation. For vectors v = (Vl,'" ,vp)T, let us define Ivl = IVll + ... + Ivpl, and f(v) = r(vt) .. · r(vp).
182
Qing Zhou, Mayetri Gupta
Then the normalizing constant for a p-dimensional Dirichlet distribution with parameters a = (al,'" ,ap)T can be denoted as f(lal)/f(a). For notational convenience, we will denote the inverse of the Dirichlet normalizing constant as ID(a) = f(a)/f(lal). Finally, for vectors v and U = (Ul,'" ,up), we use the shorthand u v = I1f=l UYi. The probability of observing S conditional on the indicator matrix A can then be written as
For a Bayesian analysis, we assume a conjugate Dirichlet prior distribution for (}o, (}o '" Dirichlet (,80) , ,80 = (/101,'" ,/10D), and a corresponding product-Dirichlet prior (i.e., independent priors over the columns) PD(B) for 8k (k = 1"" ,K), where B=(,8kl,,8k2"" ,,8 kW k) is a J x Wk matrix with ,8ki = (/1kil, ... ,/1kiJ)T. Then the conditional posterior distribution of the parameters given A is:
For the complete joint posterior of all unknowns (8, (}, A), we further need to prescribe a prior distribution for A. In the original model [23], a single motif site per sequence with equal probability to occur anywhere was assumed. However, in a later model [28] that can allow multiple sites, a Bernoulli(7r) model is proposed for motif site occurrence. More precisely, assuming that a motif site of width W can occur at any of the sequence positions, 1, 2, ... ,L * - w + 1 in a sequence of length L *, with probability 7r, the joint posterior distribution is:
where L = 2:~1 (Li - w) is the adjusted total length of all sequences and IAI = 2:f:=1 2:~1 2:~~1 Ajk. If we have reason to believe that motif occurrences are not independent, but occur as clusters (as in regulatory modules), we can instead adopt a prior Markovian model for motif occurrence [17, 44] which is discussed further in Section 3.
2.1
Markov chain Monte Carlo computation
Under the model described in (2.1), it is straightforward to implement a Gibbs sampling (GS) scheme to iteratively update the parameters, i.e., sampling from [8, eo I C, A], and impute the missing data, i.e., sampling from [A I C, 8, eo]. However, drawing 8 from its posterior at every iteration can be computationally inefficient. Liu et al. [28] demonstrated that marginalizing out (8, eo) from the posterior distribution can lead to much faster convergence of the algorithm [29]. In
Chapter 8
Regulatory Motif Discovery: From Decoding to Meta-Analysis
183
other words, one can use the Gibbs sampler to draw from the marginal distribution p(A I 8,1f) =
JJ
p(8,00 I 8,A,1f)p(A)p(8,00)d8dO o,
(2.2)
which can be easily evaluated analytically. If 1f is unknown, one can assume a beta prior distribution Beta (a1' (2) and marginalize out 1f from the posterior, in which case p(A I 8) can be derived from (2.2) by altering the last term in (2.2) to the ratio of normalizing constants for the Beta distribution, B(IAI + a1, L - IAI + (2)/ B(a1' (2). Based on (2.2), Liu et al. [28] derived a predictive updating algorithm for A, which is to iteratively sample each component of A according to the predictive distribution P(Aijk = 1 I 8) = _1f_ Wk P(Aijk
= 0 18) A
where the posterior means are Okl =
1-1f
g
(ihl)
c(s(A k )+(3
C(Si,j+l,k)
00
(2.3)
' C(sW)
(3
~A) (3kl and 0 0 = +(30 . IC(Sl k)+ kll IC(SW)+ 01 A
Under the model specified above, it is also possible to implement a "partitionbased" data augmentation (DA) approach [16] that is motivated by the recursive algorithm used in Auger and Lawrence [1]. The DA approach samples A jointly according to the conditional distribution P(A I 8,8) =
N
L i -1
i=l
j=l
II P(A iLi 18,8) II P(A ij IA i ,j+1,'"
,AiLi,8,8).
At a position j, the current knowledge of motif positions is updated using the conditional probability P(Aij I A i ,j+1,'" ,AiLi,8) (backward sampling), with A i ,j-1,'" ,Ail marginalized out using a forward summation procedure (an example will be given in Section 3.1). In contrast, at each iteration, GS iteratively draws from the conditional distribution: P( A ijk IA \ A ijk , 8), iteratively visiting each sequence position i, updating its motif indicator conditional on the indicators for other positions. The Gibbs approach tends to be "sticky" when the motif sites are abundant. For example, once we have set A ijk = 1 (for some k), we will not be able to allow segment S[i,j+1:j+ Wk] to be a motif site. The DA method corresponds to a grouping scheme (with A sampled together), whereas the GMS corresponds to a collapsing approach (with 8 integrated out). Both have been shown to improve upon the original scheme [29].
2.2
Some extensions of the product-multinomial model
The product-multinomial model used for e is a first approximation to a realistic model for transcription factor binding sites. In empirical observations, it has been reported that certain specific features often characterize functional binding sites. We mention here a few extensions of the primary motif model that have been recently implemented to improve the performance of motif discovery algorithms.
Qing Zhou, Mayetri Gupta
184
In the previous discussion, the width w of a motif G was assumed to be known and fixed; we may instead view w as an additional unknown model parameter. Jointly sampling from the posterior distribution of (A, G, w) is difficult as the dimensionality of G changes with w. One way to update (w, G) jointly would be through a reversible jump procedure [15]. However, note that we can integrate out G from the posterior distribution to avoid a dimensionality change during the updating. By placing an appropriate prior distribution p( w) on w (a possible choice is a Poisson(A)), we can update w using a Metropolis step. Using a Beta(a1' (2) prior on 7f, the marginalized posterior distribution is P(A, wlS) IX ID(C(S(Acl) + (3 )
rr
w
ID(C(S;Al) + (3i) B(IAI ID({3i)
+ aI, L - IAI + (2)
(w).
B(a1' (2) p Another assumption in the product multinomial model is that all columns of a weight matrix are independent- however, it has been observed that about 25% of experimentally validated motifs show statistically significant positional correlations. Zhou and Liu [49] extend the independent weight matrix model to including one or more correlated column pairs, under the restriction that no two pairs of correlated columns can share a column in common. A MetropolisHastings step is added in the Gibbs sampler [28] that deletes or adds a pair of correlated column at each iteration. Other proposed models are a Bayesian treelike network modeling the possible correlation structure among all the positions within a motif model [4], and a permuted Markov model in which the assumption is that an unobserved permutation has acted on the positions of all the motif sites and that the original ordered positions can be described by a Markov chain [48]. Mathematically, the model [49] is a sub-case of [48], which is, in turn, a sub-case of [4J.
o
3
2=1
Discovery of regulatory modules
Motif predictions for higher eukaryotic genomes are more challenging than that for simpler organisms such as bacteria or yeast, for reasons such as (i) large sections of low-complexity regions (repeat sequences), (ii) weak motif signals, (iii) sparseness of signals compared to entire region under study-binding sites may occur as far as 2000-3000 bases away from the transcription start site, either upstream or downstream. In addition, in complex eukaryotes, regulatory proteins often work in combination to regulate target genes, and their binding sites have often been observed to occur in spatial clusters, or cis-regulatory modules (Figure 1). One approach to locating cis-regulatory modules (CRMs) is by predicting novel motifs and looking for co-occurrences [41]. However, since individual motifs in the cluster may not be well-conserved, such an approach often leads to a large number of false negatives. Here, we describe a strategy to first use existing de novo motif finding algorithms and motif databases to compose a list of putative binding motifs, 'D = {G l ,··· ,G D }, where D is in the range of 50 to 100, and then simultaneously update these motifs and estimate the posterior probability for each of them to be included in the CRM [17]. Let S denote the set of n sequences with lengths L 1 , L 2 , ... ,L n , respectively,
Chapter' 8
Regulatory Motif Discovery: From Decoding to Meta-Analysis
(
...
)
u
185
u
Figure 1: Graphical illustration of a CRM
corresponding to the upstream regions of n co-regulated genes. We a.'3sume that the CRM consists of K different kinds of motifs with distinctive PWMs. Both the PWMs and K are unknown and need to be inferred from the data. In addition to the indicator variable A defined in Section 2, we define a new variable ai,j, that denotes the location of the jth site (irrespective of motif type) in the ith sequence. {aij;i = 1"" ,n; j = 1,,,, ,Li}' Associated with each site is its type Let a indicator Ti,j, with Ti,j taking one of the K values (Let T (Tij)). Note that the specification (a, T) is essentially equivalent to A. Next, we model the dependence between Ti,j and Ti,j+! by a K x K probability transition matrix T. The distance between neighboring TFBSs in a CRM, dij ~,i+l ai,j, is assumed to follow Q( ; A, w), a geometric distribution truncated at w, i.e. Q(d; A, w) = (1- A)d-w A (d = w, w + 1,·.·). The distribution of nucleotides in the background sequence is a multinomial distribution with unknown parameter P (PA,'" ,PT)' Next, we let u be a binary vector indicating which motifs are included in the module, Le. u = (UI,'" ,UD)T, where Uj = 1(0) if the j th motif type is present (absent) in the module. By construction, lui = K. Thus, the information regarding K is completely encoded by u. In light of this notation, the set of PWMs for the CRM is defined as e {E>j :11,j = 1}. Since now we restrict our inference of CRM to a subset of D, the probability model for the observed sequence data can be written as:
P(SID,T,u,A,p)= LLP(Sla,T,D,T,U,A,p)P(aIA)P(Tla,T). a T From the above likelihood formulation, we need to simultaneously estimate the optimal u and the parameters (D, T, A, p). To achieve this, we first prescribe a prior distribution on the parameters and missing data:
Here the fiO's are (product) Dirichlet distributions. Assuming each Ui takes the value 1 with a prior probability of 1f (i.e. 1f is the prior probability of including a motif in the module), gl (u) represents a product of D Bernoulli (1f) distributions; and 92(A), a generally fiat Beta distribution. More precisely, we assume
186
Qing Zhou, Mayetri Gupta
a priori that 8i "-' ITj=l Dirichlet(,i3ij) (for i = 1"" ,D); P "-' Dirichlet(,i3o); = K), each row of T is assumed to follow an independent Dirichlet. Let the i-th row vilu "-' Dirichlet(O:i), where i = 1"" ,K. Let n = (V,T,>',p) denote the full parameter set. Then the posterior distribution of n has the form
>. "-' Beta(a, b). Given u (with lui
F(n, u IS) exF(S Iu,n)h(V IU)h(T Iu)h(p)gl(U)g2(>')'
(3.1)
Gibbs sampling approaches were developed to infer the CRM from a special case of the posterior distribution (3.1) with fixed u [44, 51]. Given the flexibility of the model and the size of the parameter space for an unknown u, it is unlikely that a standard MCMC approach can converge to a good solution in a reasonable amount of time. If we ignore the ordering of sites T and assume components of a to be independent, this model is reduced to the original motif model in Section 2 which can be updated through the previous Gibbs or DA procedure.
3.1
A hybrid EMC-DA approach: EMCmodule
With a starting set of putative binding motifs V, an alternative approach was proposed by Gupta and Liu [17], which involves simultaneously modifying the motifs and estimating the posterior probability for each of them to be included in the CRM. This was acheived through iterations of the following Monte Carlo sampling steps: (i) Given the current collection of motif PWMs (or sites), sample motifs into the CRM by evolutionary Monte Carlo (EMC); (ii) Given the CRM configuration and the PWMs, update the motif site locations through DA; and (iii) Given motif site locations, update all parameters including PWMs. 3.1.1
Evolutionary Monte Carlo for module selection
It has been demonstrated that the EMC method is effective for sampling and op-
timization with functions of binary variables [26]. Conceptually, we should be able to apply EMC directly to select motifs comprising the CRM, but a complication here is that there are many continuous parameters such as the 81's, >., and T that vary in dimensionality when a putative motif in V is included or excluded from the CRM. We therefore integrate out the continuous parameters analytically and condition on variables a and T when updating the CRM composition. Let n(u) = (8, p, T, >.) denote the set of all parameters in the model, for a fixed u. Then, the marginalized conditional posterior probability for a module configuration u is:
where only 8 and T are dependent on u; and a and T are the sets of locations and types, respectively, of all putative motif sites (for all the D motifs in V). Thus, only when the indicator Ui for the weight matrix 8 i is 1, do its site locations and types contribute to the computation of (3.2). When we modify the current u by
Chapter 8
Regulatory Motif Discovery: Prom Decoding to Meta-Analysis
187
excluding a motif type, its site locations and corresponding motif type indicators are removed from the computation of (3.2). For EMC, we need to prescribe a set of temperatures, t1 > t2 > ... > tM = 1, one for each member in the population. Then, we define ¢i(Ui) ex: exp[logP(ui I a,T,S)/td, and ¢(U) ex: rr~l¢i(Ui). The "population" U = (U1,··· ,UM) is then updated iteratively using two types of moves: mutation and crossover. In the mutation operation, a unit Uk is randomly selected from the current population and mutated to a new vector Vk by changing the values of some of its bits chosen at random. The new member Vk is accepted to replace Uk with probability min(l, r m ), where rm = ¢k(Vk)/¢k(Uk). In the crossover step, two individuals, Uj and Uk, are chosen at random from the population. A crossover point x is chosen randomly over the positions 1 to D, and two new units Vj and Vk are formed by switching between the two individuals the segments on the right side of the crossover point. The two "children" are accepted into the population to replace their parents Uj and Uk with probability . (1 ,rc) , were h CMV;)
3.1.2
UM
(for temperature
tM
=
1) follow the target
Sampling motif sites A through recursive DA
The second part of the algorithm consists of updating the motif sites conditional on a CRM configuration (Le., with U fixed). For simplicity, we describe the method for a single sequence S = (Sl,··· ,s L)- the same procedure is repeated for all sequences in the data set. For simplicity of notation, we assume that all motifs are of width w. For fixed u, let F(i,j,k,u) = P(S[i,j,kj I n(u),u) denote the probability of observing the part of the sequence S from position i to j, with a motif of type k {k E V : Uk = I} occupying positions from j - w + 1 to j (k = 0 denotes the background). Let K = L:~=1 Uk denote the number of motif types in the module. For notational simplicity, let us assume that U represents the set of the first K motifs, indexed 1 through K. Since the motif site updating step is conditional given u, we drop the subscript U from F( i, j, k, u) in the remaining part of the section. In the forward summation step, we recursively calculate the probability of different motif types ending at a position j of the sequence:
FCI,;, k) ~ [~t,FCI' i, l)'l,k QU -i-w; A, w) + PC'I',j-""ol IP)] x F(j-w+1,j,k). By convention, the initial conditions are: F(O, 0, k) = 1, (k = 0,1,··· ,K), and F(i,j, k) = 0 for j < i and k > O. In the backward sampling step, we use Bayes theorem to calculate the probability of motif occurrence at each position, starting from the end of the sequence. If a motif of type k ends at position i in the sequence,
the probability that the next motif further ahead in the sequence spans positions (i' - w + 1) to i' (i' ≤ i - w) and is of type k' is:

$$P(A_{i'-w+1,\, k'} = 1 \mid S, \Omega, A_{i-w+1,\, k} = 1) = \frac{F(1, i', k')\, P(s_{[i'+1,\, i-w,\, 0]} \mid \rho)\, Q(i - i' - w; \lambda, w)\, \tau_{k', k}\, F(i - w + 1, i, k)}{F(1, i, k)}.$$
The required expressions have all been calculated in the forward sum. Finally, given the motif type indicator u and the motif position and type vectors A and T, we update the parameters Ω = (Θ, ρ, τ, λ) by a random sample from their joint conditional distribution. Since conjugate priors have been assumed for all parameters, their conditional posterior distributions are of the same form and are straightforward to simulate from. For example, the posterior of Θ_i is ∏_{j=1}^{w} Dirichlet(β_ij + n_ij), where n_ij is a vector containing the counts of the 4 nucleotides at the jth position of all the sites corresponding to motif type i. For those motifs that have not been selected by the module (i.e., with u_i = 0), the corresponding Θ's still follow their prior distributions. The posterior distributions of the other parameters can be calculated similarly using their conjugate priors.
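As an illustration of this conjugate update, the sketch below draws a new weight matrix column by column from its Dirichlet posterior given the currently imputed sites of one motif; the example sites and the symmetric pseudocount value are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
BASES = "ACGT"

def sample_pwm(sites, beta=0.5):
    """Draw Theta_i from its conditional posterior prod_j Dirichlet(beta_ij + n_ij).

    sites : list of equal-length strings, the currently imputed sites of motif i
    beta  : symmetric Dirichlet pseudocount (prior hyperparameter)
    """
    w = len(sites[0])
    counts = np.zeros((w, 4))
    for s in sites:
        for j, b in enumerate(s):
            counts[j, BASES.index(b)] += 1
    # One independent Dirichlet draw per motif position (column).
    return np.vstack([rng.dirichlet(beta + counts[j]) for j in range(w)])

theta = sample_pwm(["TATAAA", "TATAAT", "TATATA"])
print(theta.shape)  # (6, 4): one probability vector per motif column
```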
3.2 A case study
We compared the performance of EMCmodule with EM- and Gibbs sampling-based methods in an analysis of mammalian skeletal muscle regulatory sequences [44]. The raw data consist of upstream sequences of lengths up to 5000 bp each, corresponding to 24 orthologous pairs of genes in the human and mouse genomes; each sequence is known to contain at least one experimentally reported transcription-factor binding site corresponding to one of 5 motif types: MEF2, MYF, SRF, SP1, and TEF. Following the procedure of Thompson et al. [44], we aligned the sequences for each orthologous pair (human and mouse) and retained only the parts sharing a percent identity greater than 65%, cutting down the sequence search space to about 40% of the original sequences. Using BioProspector and EM (MEME), we obtained initial sets of 100 motifs, including redundant ones. The top-scoring 10 motifs from BioProspector and MEME contained 2 and 3 matches, respectively, to the true motif set (of 5). The Gibbs sampler under a module model [44] found 2 matches, but could find 2 others with a more detailed and precise prior input (the number of sites per motif and motif abundance per sequence), which may not be available in real applications. The best-scoring module configuration from EMCmodule contained 3 of the true 5 motifs, MYF, MEF2, and SP1, along with two uncharacterized motifs. There are few TEF sites matching the reported consensus in these sequences, which may explain why they were not found. The relative error rates of the algorithms were compared using knowledge of the 154 experimentally determined TFBSs [44]. Table 1 shows that EMCmodule significantly cuts down the percentage of false positives in the output, compared to the methods that do not adjust for positional clustering of motifs.
4 Motif discovery in multiple species
Modeling CRMs enhances the performance of de novo motif discovery because it allows the use of information encoded by the spatial correlation among TFBS's in the same module. Likewise, the use of multiple genomes enhances motif prediction because it allows the use of information from the evolutionary conservation of TFBS's in related species. Several recent methods employ such information to enhance the power of cis-regulatory analysis. PhyloCon [45] builds multiple alignments among orthologs and extends these alignments to identify motif profiles. CompareProspector [31] biases the motif search toward more conserved regions based on conservation scores. With a given alignment of orthologs and a phylogenetic tree, EMnEM [34], PhyME [40], and PhyloGibbs [39] detect motifs based on more comprehensive evolutionary models for TFBS's. When evolutionary distances among the genomes are too large for the orthologous sequences to be reliably aligned, Li and Wong [25] proposed an ortholog sampler that finds motifs in multiple species independently of ortholog alignments. Jensen et al. [19] used a Bayesian clustering approach to combine TF binding motifs from promoters of multiple orthologs.
Table 1: Error rates for module prediction methods.

Method          MEF   MYF   SP1   SRF   Total   SENS   SPEC   TSpec
EM                0     1    21     0     161   0.14   0.14    0.20
BioProspector     6     1     8     1     155   0.10   0.10    0.36
GS                6     6     2     1      84   0.10   0.25    0.44
GSP*             14    14     4     6     162   0.25   0.23    0.60
EMCmodule        12    12     5     7     180   0.23   0.20    0.67
True             32    50    44    28     154

SENS (sensitivity) = (# predicted true positives)/(# true positives); SPEC (specificity) = (# predicted true positives)/(# predicted sites). TSpec: total specificity, the fraction of the predicted motif types that "correspond" to known motifs (match in at least 80% of all positions). The Gibbs sampler (GS) requires the total number of motif types to be specified (here 5). GSP* denotes the GS using a strong informative prior.
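For instance, the SENS and SPEC entries of the EMCmodule row follow directly from the site counts in the table under these definitions:

```python
# Sensitivity and specificity for the EMCmodule row of Table 1.
pred_true_pos = 12 + 12 + 5 + 7   # correctly predicted MEF, MYF, SP1, SRF sites
total_true = 154                  # experimentally determined TFBSs
total_pred = 180                  # all sites predicted by EMCmodule

sens = pred_true_pos / total_true   # 36/154, about 0.23
spec = pred_true_pos / total_pred   # 36/180 = 0.20
print(round(sens, 2), round(spec, 2))
```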
In this section, we review in detail the coupled hidden Markov model (c-HMM) developed by Zhou and Wong [52] (ZW hereafter) as an example of motif discovery that utilizes information from cis-regulatory modules and multiple genomes. The authors use a hidden Markov model (HMM) to capture the co-localization tendency of multiple TFBS's within each species, and then couple the hidden states (which indicate the locations of modules and TFBS's within the modules) of these HMMs through multiple-species alignment. They developed evolutionary models separately for background nucleotides and for motif binding sites, in order to capture the different degrees of conservation among the background and among the binding sites. A Markov chain Monte Carlo algorithm is devised to sample CRMs and their component motifs simultaneously from their joint posterior distribution.
4.1 The coupled hidden Markov model
The input data consist of upstream or regulatory sequences of n (co-regulated) genes from N species, i.e., a total of n × N sequences. Assuming these genes are regulated by CRMs composed of binding sites of K TFs, one wants to find these TFBS's and their motifs (PWMs). Assume that the N species are closely related in the sense that their orthologous TFs share the same binding motif, which applies to groups of species within mammals, or within Drosophila, etc. Let us first focus on the module structure in one sequence. Assume that the sequence is composed of two types of regions, modules and background. A module contains multiple TFBS's separated by background nucleotides, while background regions contain only background nucleotides. Accordingly, we assume that the sequence is generated from a hidden Markov model with two states, a module state (M) and a background state (B). In a module state, the HMM either emits a nucleotide from the background model (of nucleotide preference) θ_0, or it emits a binding site of one of the K motifs (PWMs) Θ_1, Θ_2, ..., Θ_K. The probabilities of emission from θ_0 and Θ_k (k = 1, 2, ..., K) are denoted by q_0 and q_k, respectively (Σ_{k=0}^{K} q_k = 1) (Figure 2A). Note that a module state can be further decomposed into K + 1 states, corresponding to within-module background (M_0) and the K motif binding sites (M_1 to M_K), i.e., M = {M_0, M_1, ..., M_K}. Assuming that the width of motif k is w_k, a binding site of this motif, a piece of sequence of length w_k, is treated as one state of M_k as a whole (k = 1, 2, ..., K). The transition probability from a background to a module state is r, i.e., the chance of initiating a new module is r. The transition probability from a module state to a background state is t, i.e., the expected length of a module is 1/t. Denote the transition matrix by
$$T = \begin{bmatrix} T(B, B) & T(B, M) \\ T(M, B) & T(M, M) \end{bmatrix} = \begin{bmatrix} 1 - r & r \\ t & 1 - t \end{bmatrix}. \qquad (4.1)$$
This model can be viewed as a stochastic version of the hierarchical mixture model defined in [51]. The HMMs in different orthologs are coupled through multiple alignment, so that the hidden states of aligned bases in different species are collapsed into a common state (Figure 2B). For instance, the nucleotides of state 4 in the three orthologs are aligned in Figure 2B. Thus these three states are collapsed into one state, which determines whether these aligned nucleotides are background or binding sites of a motif. (Note that these aligned nucleotides in different orthologs are not necessarily identical.) Here hidden states refer to the decomposed states, i.e., B and M_0 to M_K, which specify the locations of modules and motif sites. This coupled hidden Markov model (c-HMM hereafter) has a natural graphical model representation (lower panel of Figure 2B), in which each state is represented by a node in the graph and the arrows specify the dependence among them. The transition (conditional) probabilities for nodes with a single parental node are defined by the same T in equation (4.1). We define the conditional probability for a node with multiple parents as follows: if node Y has m parents, each in state Y_i (i = 1, 2, ..., m), then we have

$$P(Y \mid Y_1, \ldots, Y_m) = \frac{C_B}{m}\, T(B, Y) + \frac{C_M}{m}\, T(M, Y), \qquad (4.2)$$

where C_B and C_M are the numbers of the parents in states B and M, respectively (m = C_B + C_M). This equation shows that the transition probability to a node with multiple parents is defined as the weighted average from the parental nodes in background states and module states. The same emission model described in the previous paragraph is used for unaligned states. For aligned (coupled) states, ZW assume star-topology evolutionary models with one common ancestor. The c-HMM first emits (hidden) ancestral nucleotides by the emission model defined in Figure 2A given the coupled hidden states. Then, different models are used for the evolution from the ancestral to the descendant nucleotides, depending on whether they are background or TFBS's.
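Equation (4.2) amounts to a weighted average of the two rows of T; a direct transcription is sketched below, with the transition matrix passed as a small dictionary purely for readability (an implementation convenience, not part of the model).

```python
def coupled_transition(parent_states, child_state, T):
    """Transition probability to a child node with multiple parents, equation (4.2).

    parent_states : list of 'B'/'M' labels for the m parental nodes
    child_state   : 'B' or 'M'
    T             : dict with entries T[('B','B')], T[('B','M')], T[('M','B')], T[('M','M')]
    """
    m = len(parent_states)
    c_b = parent_states.count('B')
    c_m = m - c_b
    return (c_b / m) * T[('B', child_state)] + (c_m / m) * T[('M', child_state)]

# Example with r = 0.05 (module initiation) and t = 0.01 (module exit).
T = {('B', 'B'): 0.95, ('B', 'M'): 0.05, ('M', 'B'): 0.01, ('M', 'M'): 0.99}
print(coupled_transition(['B', 'M', 'M'], 'M', T))
```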
Figure 2: The coupled hidden Markov model (c-HMM). (A) The HMM for module structure in one sequence. (B) Multiple alignment of three orthologous sequences (upper panel) and the corresponding graphical model representation of the c-HMM (lower panel). The nodes represent the hidden states. The vertical bars in the upper panel indicate that the nucleotides emitted from these states are aligned and thus collapsed in the lower panel. Note that a node will emit w_k nucleotides if the corresponding state is M_k (k = 1, ..., K). (C) The evolutionary model for motifs, using one base of a motif as an illustration. The hidden ancestral base is Z, which evolves to three descendant bases X(1), X(2), and X(3). Here the evolutionary bond between X(1) and Z is broken, implying that X(1) is independent of Z. The bonds between X(2) and Z and between X(3) and Z are connected, which means that X(2) = X(3) = Z.
A neutral substitution matrix is used for the evolution of aligned background nucleotides, both within and outside of modules, with a transition rate of α and a transversion rate of β:
$$\Phi = \begin{bmatrix} 1-\mu_b & \beta & \alpha & \beta \\ \beta & 1-\mu_b & \beta & \alpha \\ \alpha & \beta & 1-\mu_b & \beta \\ \beta & \alpha & \beta & 1-\mu_b \end{bmatrix}, \qquad (4.3)$$

where the rows and columns are ordered as A, C, G, and T, and μ_b = α + 2β is defined as the background mutation rate. ZW assume an independent evolution
for each position (column) of a motif under the nucleotide substitution model of Felsenstein [13]. Suppose the weight vector of a particular position in the motif is θ. The ancestral nucleotide, denoted by Z, is assumed to follow a discrete distribution with probability vector θ on {A, C, G, T}. If X is a corresponding nucleotide in a descendant species, then either X inherits Z directly (with probability 1 - μ_f) or it is generated independently from the same weight vector (with probability μ_f). The parameter μ_f, which is identical for all positions within a motif, reflects the mutation rate of the TFBS's. This model takes the PWM into account in the binding site evolution, which agrees with the non-neutral constraint that TFBS's are recognized by the same protein (TF). It is obvious that under this model, the marginal distribution of any motif column is identical in all the species. This evolutionary model introduces another hidden variable, which indicates whether X is identical to or independent of Z for each base of an aligned TFBS. These indicators are called evolutionary bonds between ancestral and descendant bases (Figure 2C). If X = Z, we say that the bond is connected; if X is independent of Z, we say that the bond is broken.
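The star-topology motif model is easy to simulate: an ancestral base is drawn from the weight vector θ, and each descendant base either keeps it (connected bond) or is redrawn independently from θ (broken bond). The sketch below follows that description; the particular θ and μ_f values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
BASES = np.array(list("ACGT"))

def evolve_motif_column(theta, n_species, mu_f):
    """Simulate one motif column under the star-topology evolutionary model.

    theta     : length-4 weight vector of this motif position (sums to 1)
    n_species : number of descendant species
    mu_f      : motif mutation rate (probability that a bond is broken)
    """
    ancestor = rng.choice(4, p=theta)
    # A broken bond means the descendant base is an independent draw from theta.
    broken = rng.uniform(size=n_species) < mu_f
    independent = rng.choice(4, p=theta, size=n_species)
    descendants = np.where(broken, independent, ancestor)
    return BASES[ancestor], BASES[descendants], broken

print(evolve_motif_column(np.array([0.7, 0.1, 0.1, 0.1]), n_species=3, mu_f=0.08))
```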
4.2 Gibbs sampling and Bayesian inference
The full model involves the following parameters: the transition matrix T defined in equation (4.1), the mixture emission probabilities q_0, q_1, ..., q_K, the motif widths w_1, ..., w_K, the PWMs Θ_1, ..., Θ_K, the background models for ancestral nucleotides and all current species, and the evolutionary parameters α, β, and μ_f. The number of TFs, K, and the expected module length, L, are taken as input, and the transition probability t is fixed to t = 1/L in T. Compared to the HMx model in [51], this model has three extra free parameters, α, β, and μ_f, related to the evolutionary models. Independent Poisson priors are put on the motif widths, and flat Dirichlet distributions are used as priors for all the other parameters. With a given alignment for each ortholog group, one may treat as missing data the locations of modules and motifs (i.e., the hidden states), the ancestral sequences, and the evolutionary bonds. ZW develop a Gibbs sampler (called MultiModule hereafter) to sample from the joint posterior distribution of all the unknown parameters and missing data. To account for the uncertainty in multiple alignment, they adopt an HMM-based multiple alignment [3, 22] conditional on the current parameter values. This is achieved by adding a Metropolis-Hastings step in the Gibbs sampler to update these alignments dynamically according to the currently sampled parameters, especially the background substitution matrix (equation (4.3)). In summary, the input data of MultiModule are groups of orthologous sequences, and the program builds an initial alignment of each ortholog group by a standard HMM-based multiple alignment algorithm. Each iteration of MultiModule is then composed of three steps: (1) given alignments and all the other missing data, update the motif widths and other parameters from their conditional posterior distributions; (2) given current parameters, with probability u, update the alignment of each ortholog group; (3) given alignments and parameters, use a dynamic programming approach to sample module and motif locations, ancestral sequences, and evolutionary bonds. The probability u is typically chosen in the
range [0.1, 0.3]. (See [52] for the details of the Gibbs sampling in MultiModule.) Motif and module predictions are based on their marginal posterior distributions constructed from the samples generated by MultiModule after a burn-in period (usually the first 50% of iterations). The width of each motif is estimated by its rounded posterior mean. MultiModule records the following posterior probabilities for each sequence position in all the species: (1) P_k, the probability that the position is within a site for motif k, i.e., the hidden state is M_k (k = 1, 2, ..., K); (2) P_m, the probability that the position is within a module, i.e., the hidden state is M; (3) P_a, the probability that the position is aligned. All the contiguous segments with P_k > 0.5 are aligned (and extended if necessary) to generate predicted sites of motif k given the estimated width w_k. The corresponding average P_a over the bases of a predicted site is reported as a measure of its conservation. All the contiguous regions with P_m > 0.5 are collected as candidate modules, and a module is predicted if the region contains at least two predicted motif binding sites. The boundary of a predicted module is defined by the first and last predicted binding sites it contains. Under the c-HMM, if one fixes r = 1 - t = 1 in the transition matrix T (equation (4.1)), then MultiModule reduces to a motif discovery method, assuming the existence of K motifs in the sequences. This setting is useful when the motifs do not form modules, and it is defined as the motif mode of MultiModule in [52].
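These prediction rules reduce to simple post-processing of the per-position posterior tracks. A sketch under the stated 0.5 cutoffs and the two-sites-per-module rule is given below; the array and function names are hypothetical, and site_calls is assumed to be a list of (start, end) intervals sorted by position.

```python
import numpy as np

def runs_above(p, cutoff=0.5):
    """Return (start, end) index pairs of maximal contiguous runs with p > cutoff."""
    above = p > cutoff
    edges = np.diff(above.astype(int))
    starts = np.where(edges == 1)[0] + 1
    ends = np.where(edges == -1)[0] + 1
    if above[0]:
        starts = np.r_[0, starts]
    if above[-1]:
        ends = np.r_[ends, len(p)]
    return list(zip(starts, ends))

def predict_modules(p_module, site_calls):
    """Keep a candidate module only if it contains at least two predicted motif sites."""
    modules = []
    for a, b in runs_above(p_module):
        inside = [(s, e) for (s, e) in site_calls if s >= a and e <= b]
        if len(inside) >= 2:
            # Module boundary is defined by the first and last predicted sites it contains.
            modules.append((inside[0][0], inside[-1][1]))
    return modules
```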
4.3 Simulation studies
Here we present the simulation studies conducted in [52] to illustrate the use of MultiModule. The authors used the following model to simulate data sets: they generated 20 hypothetical ancestral sequences, each of length 1000 bps. Twenty modules, each of 100 bps and containing one binding site of each of three TFs, were randomly placed in these sequences. TFBS's were simulated from their known weight matrices, with logo plots [38] shown in Figure 3. Then, based on the choices of the background mutation rate μ_b (with α = 3β in equation (4.3)) and the motif mutation rate μ_f, they generated sequences of three descendant species according to the evolutionary models in Section 4.1. The indel (insertion-deletion) rate was fixed to 0.1 μ_b. After the ancestral sequences were removed, each data set finally contains 60 sequences from three species. The simulation study was composed of two groups of data sets, and in both groups they set μ_f = 0.2 μ_b but varied the value of μ_b. In the first group, they set μ_b = 0.1 to mimic the case where species are evolutionarily close. In the second group, they set μ_b = 0.4 to study the situation for remotely related species. For each group, 10 data sets were generated independently. MultiModule was applied to these data sets under three different sets of program parameters: (A) module mode, L = 100, u = 0.2; (B) motif mode, u = 0.2; (C) motif mode, u = 0. For each set of parameters, they ran MultiModule for 2,000 iterations with K = 3, searching both strands of the sequences. Initial alignments were built by ordinary HMM-based multiple alignment methods. If u = 0, these initial alignments were effectively fixed along the iterations.
Figure 3: Logo plots for the motifs in the simulated studies: (A) Oct4, (B) Sox2, and (C) Nanog.
The results are summarized in Table 2, which includes the sensitivity, the specificity, and an overall measurement score of the performance, defined as the geometric average of the sensitivity and specificity. One sees that updating alignments improves the performance for both μ_b = 0.1 and 0.4, and the improvement is more significant for the latter setting (compare the results of B and C in Table 2). The reason is that the uncertainty in alignments for the cases with μ_b = 0.4 is higher than that for μ_b = 0.1, and thus updating alignments, which aims to average over different possible alignments, has a greater positive effect. Considering module structure shows an obvious improvement for μ_b = 0.1, but it is only slightly better than running the motif mode for μ_b = 0.4 (compare A and B in Table 2). For μ_b = 0.4, MultiModule found all three motifs under both parameter settings (A and B) for five data sets, and the predictions in A, with an overall score of 70%, clearly outperformed those in B, with an overall score of 58%. For the other five data sets, no motifs were identified in setting A, but in setting B (motif mode) subsets of the motifs were still identified for some of the data sets. This may be caused by the slower convergence of MultiModule in setting A, because of its higher model complexity, especially when the species are farther apart. One possible quick remedy is to use the output from setting B as initial values for setting A, which provides a much better starting point for the posterior sampling.

Table 2: Results for the simulation study
                   Oct4 (60)   Sox2 (60)   Nanog (60)   Three motifs in total
                   N2/N1       N2/N1       N2/N1        Sen    Spe    Overall
μ_b = 0.1   (A)    38.7/57.4   51.6/66.0   40.8/46.3    73%    77%    75%
            (B)    27.7/45.3   52.2/91.3   27.6/37.8    60%    62%    61%
            (C)    22.2/36.7   42.4/89.6   23.6/39.0    49%    53%    51%
μ_b = 0.4   (A)    18.8/24.8   22.7/30.6   21.0/33.9    35%    70%    49%
            (B)     9.3/29.4   34.0/51.1   21.7/31.6    36%    58%    46%
            (C)     5.1/8.1    14.2/18.2    8.3/14.2    15%    68%    32%

N2 and N1 refer to the numbers of correct and total predictions for each motif, respectively. TF names are followed by the numbers of true sites in parentheses. The upper and lower halves refer to the average results over 10 independently generated data sets with μ_b = 0.1 and 0.4, respectively. "Overall" is the geometric average of sensitivity ("Sen") and specificity ("Spe"). For each data set, the optimal results (in terms of overall score) among three independent runs under the same parameters were used for the calculation of averages. Parameter sets (A), (B), (C) are defined as: (A) module mode, L = 100, u = 0.2; (B) motif mode, u = 0.2; (C) motif mode, u = 0.
MultiModule has also been tested on two well-annotated data sets from the human and the Drosophila genomes, and the results were compared to experimental validations. Please see [52] for more details.
5 Motif learning on ChIP-chip data
In recent years, a number of computational approaches have been developed to combine motif discovery with gene expression or ChIP-chip data, e.g., [7, 9, 10, 20]. These approaches identify a group of motifs and then correlate expression values (or ChIP-intensity) with the identified motifs via linear or other regression techniques. The use of ChIP-chip data has great advantages for understanding TF-DNA binding: such data not only provide hundreds or even thousands of high-resolution TF binding regions, but also give quantitative measures of the binding activity (ChIP-enrichment) for such regions. In this section, we introduce a new approach developed by Zhou and Liu [50] (ZL hereafter) for motif learning from ChIP-chip data, to illustrate the general framework of this type of method. In contrast to many approaches that directly build generative statistical models in the sequence space, such as those discussed in the previous sections, ZL map each ChIP-chip binding region into a feature space composed of generic features, background frequencies, and a set of motif scores derived both from known motifs documented in biological databases and from motifs discovered de novo. Then, they apply Bayesian additive regression trees (BART) [8] to learn the relationship between ChIP-intensity and these sequence features. As a sum of a set of trees, the BART model is flexible enough to approximate almost any complex relationship between responses and covariates. With carefully designed priors, each tree is constrained to be a weak learner, contributing only a small amount to the full model, which effectively prevents the BART model from overfitting. The learning of the model is carried out by Markov chain Monte Carlo sampling of the posterior BART distribution, which lives in the additive tree space and serves to average over different BART models. These posterior draws of BARTs also provide a natural way to rank the importance of each sequence feature in explaining ChIP-intensity. Compared to other motif learning approaches with auxiliary data, there are at least two unique features of the Zhou-Liu method [50]. First, the features (or covariates) used in the method contain not only the discovered motifs, but also known motifs, background word frequencies, and other features such as GC content and cross-species conservation. Second, the additive regression tree model is more flexible and robust than the regression methods used in the other approaches. These advantages will be illustrated in the application of this method to a recently published genome-wide human ChIP-chip data set. Consider a set of n DNA sequences {S_1, S_2, ..., S_n}, each with a ChIP-chip intensity y_i that measures the level of enrichment of that segment (fold change) compared to the normal genomic background. In principle, y_i serves as a surrogate of the binding affinity of the TF to the corresponding DNA segment in
the genome. We write the data as {(y_i, S_i), i = 1, 2, ..., n}. For each S_i, we extract p numerical features x_i = [x_{i1}, ..., x_{ip}] and transform the data set to {(y_i, x_i)}_{i=1}^{n}, on which we "learn" a relationship between y_i and x_i using the BART model. The details on how to extract features from each sequence S_i are described in Section 5.1. In comparison to the standard statistical learning problem, two novel features of the problem described here are that (a) the response variable y_i is continuous instead of categorical; and (b) the features are not given a priori, but need to be produced from the sequences by the researcher.
5.1 Feature extraction
Zhou and Liu [50] extract three categories of sequence features from a repeat-masked DNA sequence: generic, background, and motif features. The generic features include the length, the GC content, and the average conservation of the sequence. For background features, they compute the number of occurrences of each k-mer (only for k = 2 and 3 in [50]) in the sequence. They count both the forward and backward strands of the DNA sequence and merge the counts of each k-mer and its reverse complement. For each value of k, if there are C_k distinct words after merging reverse complements, only the frequencies of (C_k - 1) of them are included in the feature vector, since the last one is uniquely determined by the others. Note that the zeroth-order frequency (k = 1) is equivalent to the GC content. The motif features are extracted from a compiled set of motifs, each parametrized by a PWM. The compiled set includes known motifs from TF databases such as TRANSFAC [46] or JASPAR [37], and new motifs found from the positive ChIP sequences in the data set of interest using a de novo motif search tool. ZL fit a segment-wise homogeneous first-order Markov chain as the background sequence model [27], which helps to account for the heterogeneous nature of genomic sequences, such as regions of low complexity (e.g., GC/AT-rich). Intuitively, this model assumes that the sequence under consideration can be segmented into an unknown number of pieces, and within each piece the nucleotides follow a homogeneous first-order Markov chain. Using a Bayesian formulation and Markov chain Monte Carlo, one estimates the background transition probability of each nucleotide. Suppose the current sequence is S = R_1 R_2 ... R_L, the PWM of a motif of width w is Θ = [Θ_i(j)] (i = 1, ..., w; j = A, C, G, T), and the background transition probability of R_n given R_{n-1} is θ_0(R_n | R_{n-1}) (1 ≤ n ≤ L). For each w-mer in S, say R_n ... R_{n+w-1}, we calculate a probability ratio
$$r_n = \frac{\prod_{i=1}^{w} \Theta_i(R_{n+i-1})}{\prod_{i=1}^{w} \theta_0(R_{n+i-1} \mid R_{n+i-2})}. \qquad (5.1)$$

Considering both strands of the sequence, we have 2 × (L - w + 1) such ratios. The motif score for this sequence is then defined as log(Σ_{k=1}^{m} r_{(k)}/L), where r_{(k)} is the kth ratio in descending order. In [50], the authors take m = 25, i.e., the top 25 ratios are included.
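The motif-score feature can be computed as sketched below, given a PWM and per-position background probabilities for one strand of a sequence; m = 25 matches the choice in [50], while the integer encoding of the sequence and the handling of the first position's background term are implementation assumptions.

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def motif_score(seq, pwm, bg_prob, m=25):
    """log( sum of top-m likelihood ratios / L ) for one strand of one sequence.

    seq     : DNA string of length L
    pwm     : (w, 4) array, Theta_i(j) for motif position i and base j
    bg_prob : length-L array, background probability of base n given the previous base
              (e.g., from a segment-wise first-order Markov chain; the first entry is
              taken as a marginal background probability)
    """
    L, w = len(seq), pwm.shape[0]
    x = np.array([BASE_INDEX[b] for b in seq])
    ratios = []
    for n in range(L - w + 1):
        log_motif = np.sum(np.log(pwm[np.arange(w), x[n:n + w]]))
        log_bg = np.sum(np.log(bg_prob[n:n + w]))
        ratios.append(np.exp(log_motif - log_bg))
    # In practice the reverse strand is scored the same way and the ratios are pooled.
    top = np.sort(ratios)[::-1][:m]
    return np.log(np.sum(top) / L)
```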
5.2 Bayesian additive regression trees
Here we give a brief review of the Bayesian additive regression tree (BART) model developed in [8]. Let Y be the response variable and X = [X_1, ..., X_p] the feature vector. Let T denote a binary tree with a set of interior and terminal nodes. Each interior node is associated with a binary decision rule based on a single feature, typically of the form {X_j ≤ a} or {X_j > a}, for 1 ≤ j ≤ p. Suppose the number of terminal nodes of T is B. Then the tree partitions the feature space into B disjoint regions, each associated with a parameter μ_b (b = 1, ..., B) (see Figure 4 for an illustration). Consequently, the relationship between Y and X is approximated by a piecewise constant function with B distinct pieces. Let M = [μ_1, ..., μ_B], and denote this tree-based piecewise constant function by g(X, T, M). The additive regression tree model is simply a sum of N such piecewise constant functions:

$$Y = \sum_{m=1}^{N} g(X, T_m, M_m) + \epsilon, \qquad \epsilon \sim N(0, \sigma^2), \qquad (5.2)$$

in which each tree T_m is associated with a parameter vector M_m (m = 1, ..., N). The number of trees N is usually large (100 to 200), which makes the model flexible enough to approximate a complex relationship between Y and X. We assume that each observation (y_i, x_i), i = 1, ..., n, follows Eq. (5.2) and that the observations are mutually independent.
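The building block g(X, T, M) of equation (5.2) is just a piecewise-constant function indexed by a binary tree. The sketch below encodes a tree as nested splitting rules, sums N such trees for one prediction, and averages over posterior draws; the data structures are illustrative and are not those of the BayesTree package.

```python
import numpy as np

class Node:
    """Interior node: split on one feature; leaf node: carry a mean parameter mu."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, mu=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.mu = left, right, mu

def g(x, tree):
    """Evaluate the piecewise-constant function g(x; T, M) for one feature vector x."""
    node = tree
    while node.mu is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.mu

def predict_sum_of_trees(x, trees):
    """One additive-trees prediction: the sum of N weak tree learners."""
    return sum(g(x, t) for t in trees)

def bart_predict(x, posterior_draws):
    """Average the sum-of-trees prediction over J posterior draws (model averaging)."""
    return np.mean([predict_sum_of_trees(x, trees) for trees in posterior_draws])

# The three-region tree of Figure 4: split on X1 at c, then on X2 at d.
c, d = 0.5, 0.3
tree = Node(feature=0, threshold=c,
            left=Node(feature=1, threshold=d, left=Node(mu=1.0), right=Node(mu=2.0)),
            right=Node(mu=3.0))
print(g(np.array([0.2, 0.8]), tree))  # falls in {X1 <= c, X2 > d}, returns 2.0
```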
Figure 4: A regression tree with two interior and three terminal nodes. The decision rules partition the feature space into three disjoint regions: {X_1 ≤ c, X_2 ≤ d}, {X_1 ≤ c, X_2 > d}, and {X_1 > c}. The mean parameters attached to these regions are μ_1, μ_2, and μ_3, respectively.
To complete a Bayesian inference based on model (5.2), one needs to prescribe prior distributions for both the tree structures and the associated parameters, M_m and σ². The prior distribution for the tree structure is specified conservatively in [8] so that the size of each tree is kept small, which forces it to be a weak learner. The priors on M_m and σ² also help prevent overfitting. In particular, the prior probability of a tree with 1, 2, 3, 4, and ≥ 5 terminal nodes is 0.05, 0.55, 0.28, 0.09, and 0.03, respectively. Chipman et al. [8] developed a Markov chain Monte Carlo approach (BART
MCMC) to sample from the posterior distribution

$$P\big[(T_1, M_1), \ldots, (T_N, M_N), \sigma^2 \mid Y\big]. \qquad (5.3)$$

Note that the tree structures are also updated along the MCMC iterations. Thus, the BART MCMC generates a large number of samples of additive trees, which form an ensemble of models. Now, given a new feature vector x*, instead of predicting its response y* based on the "best" model, BART predicts y* by the average response of all sampled additive trees. More specifically, suppose one runs the BART MCMC for J iterations after the burn-in period, which generates J sets of additive trees. For each of them, BART has one prediction: ŷ*(j) = Σ_{m=1}^{N} g(x*, T_m^(j), M_m^(j)) (j = 1, ..., J). These J predicted responses may be used to construct a point estimate of y* by the plain average, as used in the following applications, or an interval estimate by the quantiles. Thus, BART has the nature of Bayesian model averaging.
5.3 Application to human ChIP-chip data
Zhou and Liu [50] applied BART to two recently published ChIP-chip data sets of the TFs Oct4 and Sox2 in human embryonic stem (ES) cells [5]. The performance of BART was compared with those of linear regression [9], MARS [14, 10], and neural networks, based on ten-fold cross validations. The DNA microarray used in [5] covers -8 kb to +2 kb of ~17,000 annotated human genes. A Sox-Oct composite motif (Figure 5) was identified consistently in both sets of positive ChIP regions using de novo motif discovery tools (e.g., [23]). This motif is known to be recognized by the protein complex of Oct4 and Sox2, the target TFs of the ChIP-chip experiments. Combined with all 219 known high-quality PWMs from TRANSFAC and the PWMs of 4 TFs with known functions in ES cells from the literature, a final list of 224 PWMs was compiled for motif feature extraction. Here we present their cross-validation results on the Oct4 ChIP-chip data as a comparative study of several competing motif learning approaches.
Figure 5: The Sox-Oct composite motif discovered in the Oct4-positive ChIP regions.
Boyer et al. [5] reported 603 Oct4-ChIP enriched regions (positives) in human ES cells. ZL randomly selected another 603 regions with the same length distribution from the genomic regions targeted by the DNA microarray (negatives). Note that each such region usually contains two or more neighboring probes on the array. A ChIP-intensity measure, which is defined as the average array-intensity
ratio of ChIP samples over control samples, is attached to each of the 1206 ChIP regions. We treat the logarithm of the ChIP-intensity measure as the response variable, and the features extracted from the genomic sequences as explanatory variables. There are a total of 1206 observations with 224 + 45 = 269 features (explanatory variables) for this Oct4 data set.

ZL used the following methods to perform statistical learning on this data set: (1) LR-SO, linear regression using the Sox-Oct composite motif only; (2) LR-Full, linear regression using all 269 features; (3) Step-SO, stepwise linear regression starting from LR-SO; (4) Step-Full, stepwise linear regression starting from LR-Full; (5) NN-SO, neural networks with the Sox-Oct composite motif feature as input; (6) NN-Full, neural networks with all the features as input; (7) MARS, multivariate adaptive regression splines using all the features with different tuning parameters; (8) BART with the number of trees N ranging from 20 to 200. In Step-SO, one started from the LR-SO model and used the stepwise method (with both forward and backward steps) to add or delete features in the linear regression model based on the AIC criterion (see the R function "step"). Step-Full was performed similarly, but starting from the LR-Full model. For neural networks, ZL used the R package "nnet" with different combinations of the number of hidden nodes (2, 5, 10, 20, 30) and weight decay (0, 0.5, 1.0, 2.0). For MARS, they used the function "mars" in the R package "mda" by Hastie and Tibshirani, with up to two-way interactions and a wide range of penalty terms. For BART, they ran 20,000 iterations after a burn-in period of 2,000 iterations and used the default settings in the R package "BayesTree" for all other parameters.

The ten-fold cross validation procedure in [50] was conducted as follows. The observations were first divided into ten subgroups of equal size at random. Each time, one subgroup (the "test sample") was left out and the remaining nine subgroups (the "training sample") were used to train a model with the stated method. The responses for the test sample were then predicted from the trained model and compared with the observed responses. This process was continued until every subgroup had served as the test sample once. In [50], the authors used the correlation coefficient between the predicted and observed responses as a measure of the goodness of a model's performance. This measure is invariant under linear transformations and can be intuitively understood as the fraction of variation in the response variable that can be explained by the features (covariates). We call this measure the CV-correlation (or CV-cor) henceforth.

The cross validation results are given in Table 3. The average CV-correlation (over 10 cross validations) of LR-SO is 0.446, which is the lowest among all the linear regression methods. Since all the other methods use more features, this shows that sequence features other than the target motif itself indeed contribute to the prediction of ChIP-intensity. Among all the linear regression methods, Step-SO achieves the highest CV-cor of 0.535. Only the optimal performance among all the combinations of parameters is reported for the neural networks. However, even these optimal results are not satisfactory. NN-SO showed a slight improvement in CV-cor over LR-SO.
For different parameters (the number of hidden nodes and weight decay), NN-SO showed roughly the same performance except for those with 20 or more hidden nodes and weight decay =
0, which overfitted the training data. The neural network with all the features as input encountered a severe overfitting problem, resulting in CV-cor's < 0.38, even worse than that of LR-SO. In order to relieve the overfitting problem for NNs, ZL reduced the input independent variables to those selected by the stepwise regression (about 45), and employed a weight decay of 1.0 with 2, 5, 10, 20, or 30 hidden nodes. More specifically, for each training data set, they performed Step-SO followed by NNs with the features selected by Step-SO as input. Then they calculated the CV-cor's for the test data. We call this approach Step+NN; it reached an optimal CV-cor of 0.463 with 2 hidden nodes. ZL applied MARS to this data set under two settings: one with no interaction terms (d = 1) and one considering two-way interactions (d = 2). For each setting, they chose different values of the penalty λ, which specifies the cost per degree of freedom. In the first setting (d = 1), they set the penalty λ = 1, 2, ..., 10 and observed that the CV-cor reaches its maximum of 0.580 when λ = 6. Although this optimal result is only slightly worse than that of BART (Table 3), we note that the performance of MARS was very sensitive to the choice of λ. With λ = 2 or 1, MARS greatly overfitted the training data, and the CV-cor's dropped to 0.459 and 0.283, respectively, which are almost the same as or even worse than that of LR-SO. MARS with two-way interactions (d = 2) showed unsatisfactory performance for λ ≤ 5 (i.e., CV-cor < 0.360). They then tested λ in the range [10, 50] and found the optimal CV-cor of 0.561 when λ = 20.

Table 3: Ten-fold cross validations for log-ChIP-intensity prediction on the Oct4 ChIP-chip data
Method       Cor     Imprv      Method        Cor     Imprv
LR-SO        0.446     0%       LR-Full       0.491    10%
Step-SO      0.535    20%       Step-Full     0.513    15%
NN-SO        0.468     5%       Step+NN       0.463     4%
MARS1,6      0.580    30%       MARS1,1       0.283   -37%
MARS2,20     0.561    26%       MARS2,4       0.337   -24%
BART20       0.592    33%       BART40        0.599    34%
BART60       0.596    34%       BART80        0.597    34%
BART100      0.600    35%       BART120       0.599    34%
BART140      0.599    34%       BART160       0.594    33%
BART180      0.595    33%       BART200       0.593    33%
Step-M       0.456     2%       BART-M        0.510    14%
MARS1,6-M    0.511    15%       MARS2,20-M    0.478     7%

Reported here are the average CV-correlations (Cor). LR-SO, LR-Full, Step-SO, Step-Full, NN-SO, and Step+NN are defined in the text. MARSa,b refers to MARS with d = a and λ = b. BARTm is BART with m trees. Step-M, MARSa,b-M, and BART-M are Step-SO, the optimal MARS, and BART100 with only motif features as input. The improvement ("Imprv") is calculated by Cor/Cor(LR-SO) - 1.
Notably, BART with different numbers of trees outperformed all the other methods uniformly. BART reached a CV-cor of about 0.6, a greater than 30% improvement over LR-SO and the optimal NN, and more than a 10% improvement over the best performance of the stepwise regression
method. In addition, the performance of BART was very robust to the choice of the number of trees. This is a great advantage over MARS, whose performance depended strongly on the choice of the penalty parameter λ, which is typically difficult for the user to set a priori. Compared to NNs, BART is much less prone to overfitting, which may be attributable to its Bayesian model averaging nature with various conservative prior specifications. To further illustrate the effect of non-motif features, ZL did the following comparison. They excluded non-motif features from the input and applied BART with 100 trees, MARS (d = 1, λ = 6), MARS (d = 2, λ = 20), and Step-SO to the resulting data set to perform ten-fold cross validations. In other words, the feature vectors contained only the 224 motif features. The CV-correlations of these approaches are given in Table 3, denoted by BART-M, MARS1,6-M, MARS2,20-M, and Step-M, respectively. It is observed that the CV-correlations decreased substantially (about 12% to 15%) compared to the corresponding methods with all the features. One obtains almost no improvement (2%) in predictive power by including more motif features in the linear regression. However, if the background and other generic features are incorporated, the stepwise regression improves dramatically (20%) in its prediction. This does not mean that the motif features are not useful, but their effects need to be considered in conjunction with background frequencies. Step-M is equivalent to MotifRegressor [9] and MARS-M is equivalent to MARS Motif [10] with all the known and discovered (Sox-Oct) motifs as input. Thus, this study implies that BART with all three categories of features outperformed MotifRegressor and MARS Motif by 32% and 17% in CV-correlation, respectively.
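The CV-correlation measure is straightforward to reproduce for any regression method; the sketch below shows the generic ten-fold loop with an ordinary linear model standing in for the learner (scikit-learn is assumed to be available, and the simulated feature matrix is purely illustrative).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def cv_correlation(X, y, n_splits=10, seed=0):
    """Average over folds of corr(held-out predictions, held-out observations)."""
    cors = []
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model = LinearRegression().fit(X[train], y[train])
        cors.append(np.corrcoef(model.predict(X[test]), y[test])[0, 1])
    return float(np.mean(cors))

# Hypothetical feature matrix (motif scores, k-mer frequencies, etc.) and log-ChIP intensities.
rng = np.random.default_rng(3)
X = rng.normal(size=(1206, 269))
y = 0.5 * X[:, 0] + rng.normal(scale=1.0, size=1206)
print(round(cv_correlation(X, y), 2))
```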
6 Using nucleosome positioning information in motif discovery
Generally, TF-DNA binding is represented as a one-dimensional process; in reality, however, binding occurs in three-dimensional space. Biological evidence [32] shows that much of DNA consists of repeats of regions of about 147 bp wrapped around nucleosomes, separated by stretches of DNA called linkers. Recent techniques [47] based on high-density genome tiling arrays have been used to experimentally measure the genomic positions of nucleosomes, in which the measurement "intensities" indicate how likely that locus is to be nucleosome-bound. These studies suggest that nucleosome-free regions correlate highly with the locations of functional TFBSs, and hence, if considered, can lead to significant improvement in motif prediction. Genome tiling arrays pose considerable challenges for data analysis. These arrays involve short overlapping probes covering the genome, which induces a spatial data structure. Although hidden Markov models (HMMs) [35] may be used to accommodate such spatial structure, they induce an exponentially decaying distribution of state lengths and are not directly appropriate for assessing structural features, such as nucleosomes, that have restrictions on their physical dimension.
For instance, in Yuan et al. [47], the tiling array consisted of overlapping 50-mer oligonucleotide probes tiled every 20 base pairs. The nucleosomal state can thus be assumed to be represented by about 6 to 8 probes, while the linker states have no physical restriction. Since the experiment did not achieve a perfect synchronization of cells, a third "delocalized nucleosomal" state was additionally modeled, which had intensities more variable in length and measurement magnitude than expected for nucleosomal states. Here, we describe a general framework for determining chromatin features from tiling array data and using this information to improve de novo motif prediction in eukaryotes [18].
6.1 A hierarchical generalized HMM (HGHMM)
Assume that the model consists of K (= 3) states. The possible length duration in state k (k = 1, ..., K) is given by the set D_k = {r_k, ..., s_k} ⊂ N (where N denotes the set of positive integers). The generative model for the data is now described. The initial distribution of states is characterized by the probability vector π = (π_1, ..., π_K). The probability of spending time d in state k is given by the distribution p_k(d | φ), d ∈ D_k (1 ≤ k ≤ K), characterized by the parameter φ = (φ_1, ..., φ_K). For the motivating application, p_k(d) is chosen to be a truncated negative binomial distribution on the range specified by each D_k. The latent state for probe i is denoted by the variable Z_i (i = 1, ..., N). Logarithms of spot measurement ratios are denoted by y_ij (1 ≤ i ≤ N; 1 ≤ j ≤ r) for N spots and r replicates each. Assume that given the (unobservable) state Z_i, the y_ij's are independent, with y_ij | Z_i = k ~ g_k(·; ξ_ik, σ²_ik). For specifying g_k, a hierarchical model is developed that allows robust estimation of the parameters. Let μ = (μ_1, ..., μ_K) and Σ = {σ²_ik; 1 ≤ i ≤ N, 1 ≤ k ≤ K}. Assume y_ij | Z_i = k, ξ_ik, σ²_ik ~ N(ξ_ik, σ²_ik), ξ_ik | μ_k, σ²_ik ~ N(μ_k, τ_0 σ²_ik), and σ²_ik ~ Inv-Gamma(ρ_k, a_k), where at the top level μ_k ∝ constant, and ρ_k, a_k, and τ_0 are hyperparameters. Finally, the transition probabilities between the states are given by the matrix T = (τ_jk) (1 ≤ j, k ≤ K). Assume a Dirichlet prior for the state transition probabilities, i.e., (τ_k1, ..., τ_k,k-1, τ_k,k+1, ..., τ_kK) ~ Dirichlet(η), where η = (η_1, ..., η_{k-1}, η_{k+1}, ..., η_K). Since the duration in a state is modeled explicitly, no transition back to the same state can occur, i.e., there is the restriction τ_kk = 0 for all 1 ≤ k ≤ K.
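One concrete ingredient of the HGHMM is the truncated duration distribution p_k(d | φ) on D_k. A sketch is given below; the negative binomial parametrization by size and success probability is an assumption, since the chapter does not spell it out, and the D_3 = {6, 7, 8} example follows the application described later.

```python
import numpy as np
from scipy.stats import nbinom

def truncated_nbinom_pmf(d_values, size, prob):
    """p_k(d) proportional to a negative binomial pmf, renormalized on D_k = {r_k, ..., s_k}."""
    d_values = np.asarray(d_values)
    pmf = nbinom.pmf(d_values, size, prob)
    return pmf / pmf.sum()

# Example: well-positioned nucleosome state with D_3 = {6, 7, 8} (6 to 8 probes).
D3 = np.arange(6, 9)
print(truncated_nbinom_pmf(D3, size=10, prob=0.6))
```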
6.2 Model fitting and parameter estimation
For notational simplicity, assume Y = {y_1, ..., y_N} is a single long sequence of length N, with r replicate observations for each y_i = (y_{i1}, ..., y_{ir})′. Let the set of all parameters be denoted by θ = (μ, T, φ, π, Σ), and let Z = (Z_1, ..., Z_N) and L = (L_1, ..., L_N) be latent variables denoting the state identities and state lengths. L_i = l (a non-zero number denoting the state length) if a run of state k ends at i, i.e., if Z_j = k for (i - l + 1) ≤ j ≤ i and Z_{i+1}, Z_{i-l} ≠ k (1 ≤ k ≤ K); L_i = 0 otherwise.
The observed data likelihood then may be written as:
$$L(\theta; Y) = \sum_{Z}\sum_{L} P(Y \mid Z, L, \theta)\, P(L \mid Z, \theta)\, P(Z \mid \theta). \qquad (6.1)$$
The likelihood computation (6.1) is analytically intractable, involving a sum over all possible partitions of the sequence Y with different state conformations and different state lengths (under the state restrictions). However, one can formulate a data augmentation algorithm that utilizes a recursive technique to efficiently sample from the posterior distributions of interest, as shown below. The key is to update the states and state length durations in a recursive manner, after calculating the required probability expressions through a forward summation step. Let an indicator variable I_t take the value 1 if a segment boundary is present at position t of the sequence, meaning that a state run ends at t (I_t = 1 if and only if L_t ≠ 0). In the following, the notation Y_[1:t] is used to denote the vector {y_1, y_2, ..., y_t}. Define the partial likelihood of the first t probes, with the state Z_t = k ending at t after a state run length of L_t = l, by the "forward" probability:
$$\alpha_t(k, l) = P(Z_t = k,\, L_t = l,\, I_t = 1,\, Y_{[1:t]}).$$

Also, let the state probability marginalized over all state lengths be given by β_t(k) = Σ_{l ∈ D_k} α_t(k, l). Let d_(1) = min{D_1, ..., D_K} and d_(K) = max{D_1, ..., D_K}. Then, assuming that the length spent in a state and the transition to that state are independent, i.e., P(l, k | l', k') = P(L_t = l | Z_t = k) τ_{k'k} = p_k(l) τ_{k'k}, it can be shown that

$$\alpha_t(k, l) = P(Y_{[t-l+1:t]} \mid Z_t = k)\, p_k(l) \sum_{k' \neq k} \tau_{k'k}\, \beta_{t-l}(k'), \qquad (6.2)$$

for 2 ≤ t ≤ N, 1 ≤ k ≤ K, and l ∈ {d_(1), d_(1)+1, ..., min[d_(K), t]}. The boundary conditions are: α_t(k, l) = 0 for t < l < d_(1), and α_l(k, l) = π_k P(Y_[1:l] | Z_l = k) p_k(l) for d_(1) ≤ l ≤ d_(K), k = 1, ..., K. Here p_k(·) denotes the kth truncated negative binomial distribution. The states and state duration lengths (Z_t, L_t) (1 ≤ t ≤ N) can now be updated, for current values of the parameters θ = (μ, T, φ, π, Σ), using a backward sampling-based imputation step:
1. Set i = N. Update Z_N | y, θ using P(Z_N = k | y, θ) = β_N(k) / Σ_{k'=1}^{K} β_N(k').

2. Next, update L_N | Z_N = k, y, θ using

   P(L_N = l | Z_N = k, y, θ) = P(L_N = l, Z_N = k | y, θ) / P(Z_N = k | y, θ) = α_N(k, l) / β_N(k).

3. Next, set i = i - L_N, and let LS(i) = L_N. Let D_(2) be the second smallest value in the set {D_1, ..., D_K}. While i > D_(2), repeat the following steps:

   • Draw Z_i | y, θ, Z_{i+LS(i)}, L_{i+LS(i)} from its conditional distribution, where k ∈ {1, ..., K} \ Z_{i+LS(i)}.
   • Draw L_i | Z_i, y, θ using P(L_i = l | Z_i, y, θ) = α_i(Z_i, l) / β_i(Z_i).

   • Set LS(i - L_i) = L_i and i = i - L_i.
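Given precomputed forward probabilities α_t(k, l), this backward imputation can be coded as a short loop. The sketch below weights each candidate state by its transition into the already-sampled next segment, which is one reading of the update for Z_i described above; boundary handling near the start of the sequence (the i > D_(2) condition) is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(4)

def backward_sample(alpha, trans):
    """Sample (state, length) segments from the end of the sequence to the front.

    alpha : (N+1, K, Lmax+1) array with alpha[t, k, l] = P(Z_t=k, L_t=l, I_t=1, y_{1:t});
            entries outside the allowed duration sets D_k are assumed to be zero
    trans : (K, K) state transition matrix with zero diagonal (no self-transitions)
    """
    N = alpha.shape[0] - 1
    # Last segment: draw (Z_N, L_N) jointly, proportional to alpha_N(k, l).
    probs = alpha[N] / alpha[N].sum()
    k, l = np.unravel_index(rng.choice(probs.size, p=probs.ravel()), probs.shape)
    segments = [(N, int(k), int(l))]
    t = N - l
    while t > 0:
        next_k = segments[-1][1]
        # Weight each candidate (k, l) by alpha_t(k, l) and the transition into the next segment.
        w = alpha[t] * trans[:, next_k][:, None]
        probs = w / w.sum()
        k, l = np.unravel_index(rng.choice(probs.size, p=probs.ravel()), probs.shape)
        segments.append((t, int(k), int(l)))
        t -= l
    return segments[::-1]  # (end position, state, length) for each segment, front to back
```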
6.3 Application to a yeast data set
The HGHMM algorithm was applied to the normalized data from the longest contiguous mapped region, corresponding to about 61 kbp (chromosomal coordinates 12921 to 73970) of yeast chromosome III [47]. The length ranges for the three states were: (1) linker: D_1 = {1, 2, 3, ...}; (2) delocalized nucleosome: D_2 = {9, ..., 30}; and (3) well-positioned nucleosome: D_3 = {6, 7, 8}. It is of interest to examine whether nucleosome-free state predictions correlate with the locations of TFBSs. Harbison et al. (2004) used genome-wide location analysis (ChIP-chip) to determine the occupancy of DNA-binding transcription regulators under a variety of conditions. The ChIP-chip data give locations of binding sites only to a 1 kb resolution, making further analysis necessary to determine the location of binding sites at the single-nucleotide level. For the HGHMM algorithm, the probabilities of state membership for each probe were estimated from the posterior frequencies of visiting each state in M iterations (excluding burn-in). Each region was assigned to the occupancy state k for which the estimated posterior state probability P(Z_i = k | Y) = Σ_{j=1}^{M} I(Z_i^(j) = k)/M was maximum. For all probes, this probability ranged from 0.5 to 0.9. Two motif discovery methods, SDDA [16] and BioProspector [30], were used to analyze the sequences for motif lengths of 8 to 10 and a maximum of 20 motifs per set. Motif searches were run separately on the linker (L), nucleosomal (N), and delocalized nucleosomal (D) regions predicted by the HGHMM procedure. The highest specificity (proportion of regions containing motif sites corresponding to high binding propensities in the Harbison et al. (2004) data) was for the linker regions predicted by HGHMM: 61% by SDDA and 40% by BP (Table 4). Sensitivity is defined as the proportion of highly TF-bound regions found when regions were classified according to specific state predictions. The highest overall specificity and sensitivity was observed for the linker regions predicted with HGHMM, indicating that nucleosome positioning information may aid significantly in motif discovery when other information is not known.

Table 4: Specificity (Spec) and sensitivity (Sens) of motif predictions compared to data from Harbison et al.
                      SDDA            BP
                   Spec    Sens    Spec    Sens
Linker             0.61    0.7     0.40    0.87
Deloc Nucl         0.19    0.8     0.15    0.63
Nucleosomal        0.16    0.5     0.09    0.43

7 Conclusion
In this article we have tried to present an overview of statistical methods related to the computational discovery of transcription factor binding sites in genomic DNA
sequences, ranging from the initial simple probabilistic models to more recently developed tools that attempt to use auxiliary information from experiments, evolutionary conservation, and chromatin structure for more accurate motif prediction. The field of motif discovery is a very active and rapidly expanding area, and our aim was to provide the reader a snapshot of some of the major challenges and possibilities that exist in the field, rather than give an exhaustive listing of work that has been published (which would in any case be almost an impossible task in the available space). With the advent of new genomic technologies and rapid increases in the volume, diversity, and resolution of available data, it seems that in spite of the considerable challenges that lie ahead, there is strong promise that many exciting discoveries in this field will continue to be made in the near future.
References

[1] Auger, I. E. and Lawrence, C. E. (1989). Algorithms for the optimal identification of segment neighborhoods. Bull. Math. Biol., 51(1), 39-54.
[2] Bailey, T. and Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 28-36.
[3] Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M. A. (1994). Hidden Markov models of biological primary sequence information. Proc. Natl. Acad. Sci. USA, 91, 1059-1063.
[4] Barash, Y., Elidan, G., Friedman, N., and Kaplan, T. (2003). Modeling dependencies in protein-DNA binding sites. In RECOMB Proceedings, 28-37.
[5] Boyer, L. A., Lee, T. I., Cole, M. F., Johnstone, S. E., Levine, S. S., Zucker, J. P., et al. (2005). Core transcriptional regulatory circuitry in human embryonic stem cells. Cell, 122, 947-956.
[6] Bussemaker, H. J., Li, H., and Siggia, E. D. (2000). Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis. Proc. Natl. Acad. Sci. USA, 97(18), 10096-10100.
[7] Bussemaker, H. J., Li, H., and Siggia, E. D. (2001). Regulatory element detection using correlation with expression. Nat. Genet., 27, 167-171.
[8] Chipman, H. A., George, E. I., and McCulloch, R. E. (2006). BART: Bayesian additive regression trees. Technical Report, University of Chicago.
[9] Conlon, E. M., Liu, X. S., Lieb, J. D., and Liu, J. S. (2003). Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl. Acad. Sci. USA, 100, 3339-3344.
[10] Das, D., Banerjee, N., and Zhang, M. Q. (2004). Interacting models of cooperative gene regulation. Proc. Natl. Acad. Sci. USA, 101, 16234-16239.
[11] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B, 39(1), 1-38.
[12] Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998). Biological Sequence Analysis. Cambridge University Press.
[13] Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 17, 368-376.
[14] Friedman, J. H. (1991). Multivariate adaptive regression splines. Ann. Statist., 19, 1-67.
[15] Green, P. J. (1995). Reversible jump MCMC and Bayesian model determination. Biometrika, 82, 711-732.
[16] Gupta, M. and Liu, J. S. (2003). Discovery of conserved sequence patterns using a stochastic dictionary model. J. Am. Stat. Assoc., 98(461), 55-66.
[17] Gupta, M. and Liu, J. S. (2005). De novo cis-regulatory module elicitation for eukaryotic genomes. Proc. Natl. Acad. Sci. USA, 102(20), 7079-7084.
[18] Gupta, M. (2007). Generalized hierarchical Markov models for discovery of length-constrained sequence features from genome tiling arrays. Biometrics, in press.
[19] Jensen, S. T., Shen, L., and Liu, J. S. (2006). Combining phylogenetic motif discovery and motif clustering to predict co-regulated genes. Bioinformatics, 21, 3832-3839.
[20] Keles, S., van der Laan, M., and Eisen, M. B. (2002). Identification of regulatory elements using a feature selection method. Bioinformatics, 18, 1167-1175.
[21] Kellis, M., Patterson, N., Endrizzi, M., Birren, B., and Lander, E. (2003). Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423, 241-254.
[22] Krogh, A., Brown, M., Mian, I. S., Sjölander, K., and Haussler, D. (1994). Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol., 235, 1501-1531.
[23] Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F., and Wootton, J. C. (1993). Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208-214.
[24] Lawrence, C. E. and Reilly, A. A. (1990). An expectation-maximization (EM) algorithm for the identification and characterization of common sites in biopolymer sequences. Proteins, 7, 41-51.
[25] Li, X. and Wong, W. H. (2005). Sampling motifs on phylogenetic trees. Proc. Natl. Acad. Sci. USA, 102, 9481-9486.
[26] Liang, F. and Wong, W. H. (2000). Evolutionary Monte Carlo: applications to Cp model sampling and change point problem. Statistica Sinica, 10, 317-342.
[27] Liu, J. S. and Lawrence, C. E. (1999). Bayesian inference on biopolymer models. Bioinformatics, 15, 38-52.
[28] Liu, J. S., Neuwald, A. F., and Lawrence, C. E. (1995). Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Am. Stat. Assoc., 90, 1156-1170.
[29] Liu, J. S., Wong, W. H., and Kong, A. (1994). Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes. Biometrika, 81, 27-40.
[30] Liu, X., Brutlag, D. L., and Liu, J. S. (2001). BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pacific Symposium on Biocomputing, 127-138.
[31] Liu, Y., Liu, X. S., Wei, L., Altman, R. B., and Batzoglou, S. (2004). Eukaryotic regulatory element conservation analysis and identification using
comparative genomics. Genome Res., 14, 451-458.
[32] Luger, K. (2006). Dynamic nucleosomes. Chromosome Res., 14, 5-16.
[33] Neuwald, A. F., Liu, J. S., and Lawrence, C. E. (1995). Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Protein Science, 4, 1618-1632.
[34] Moses, A. M., Chiang, D. Y., and Eisen, M. B. (2004). Phylogenetic motif detection by expectation-maximization on evolutionary mixtures. Pac. Symp. Biocomput., 9, 324-335.
[35] Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257-286.
[36] Sabatti, C. and Lange, K. (2002). Genomewide motif identification using a dictionary model. IEEE Proceedings, 90, 1803-1810.
[37] Sandelin, A., Alkema, W., Engstrom, P., Wasserman, W. W., and Lenhard, B. (2004). JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res., 32, D91-D94.
[38] Schneider, T. D. and Stephens, R. M. (1990). Sequence logos: a new way to display consensus sequences. Nucleic Acids Res., 18, 6097-6100.
[39] Siddharthan, R., Siggia, E. D., and van Nimwegen, E. (2005). PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny. PLoS Comput. Biol., 1, e67.
[40] Sinha, S., Blanchette, M., and Tompa, M. (2004). PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics, 5, 170.
[41] Sinha, S. and Tompa, M. (2002). Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Research, 30, 5549-5560.
[42] Stormo, G. D. and Hartzell, G. W. (1989). Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl. Acad. Sci. USA, 86, 1183-1187.
[43] Tanner, M. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. J. Am. Stat. Assoc., 82, 528-550.
[44] Thompson, W., Palumbo, M. J., Wasserman, W. W., Liu, J. S., and Lawrence, C. E. (2004). Decoding human regulatory circuits. Genome Research, 14, 1967-1974.
[45] Wang, T. and Stormo, G. D. (2003). Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics, 19, 2369-2380.
[46] Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Pruss, M., Reuter, I., and Schacherer, F. (2000). TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res., 28, 316-319.
[47] Yuan, G.-C., Liu, Y.-J., Dion, M. F., Slack, M. D., Wu, L. F., Altschuler, S. J., and Rando, O. J. (2005). Genome-scale identification of nucleosome positions in S. cerevisiae. Science, 309, 626-630.
[48] Zhao, X., Huang, H., and Speed, T. P. (2004). Finding short DNA motifs using permuted Markov models. In RECOMB Proceedings, 68-75.
[49] Zhou, Q. and Liu, J. S. (2004). Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics, 20(6), 909-916.
208
Qing Zhou, Mayetri Gupta
[50] Zhou, Q. and Liu, J.S. (2008). Extracting sequence features to predict proteinDNA interactions: a comparative study. Nucleic Acids Research, in press. [51] Zhou, Q. and Wong, W.H. (2004). CisModule: De novo discovery of cisregulatory modules by hierarchical mixture modeling. Proc. Natl. Acad. Sci. USA, 101, 12114-12119. [52] Zhou, Q. and Wong, W.H. (2007). Coupling hidden Markov models for the discovery of cis-regulatory modules in multiple species. Ann. Appl. Statist. to appear.
Chapter 9

Analysis of Cancer Genome Alterations Using Single Nucleotide Polymorphism (SNP) Microarrays

Cheng Li*
Samir Amin†

*Department of Biostatistics and Computational Biology, Harvard School of Public Health, Dana-Farber Cancer Institute, 44 Binney St., Boston, MA 02115, USA. Email: eli@hsph.harvard.edu
†Department of Medical Oncology, Dana-Farber Cancer Institute, 44 Binney St., Boston, MA 02115, USA.
Abstract
Loss of heterozygosity (LOH) and copy number changes of chromosomal regions bearing tumor suppressor genes or oncogenes are key events in the evolution of cancer cells. Identifying such regions in each sample accurately and summarizing across multiple samples can suggest the locations of cancer-related genes for further confirmatory experiments. Oligonucleotide SNP microarrays now have up to one million SNP markers that can provide genotypes and copy number signals simultaneously. In this chapter we introduce SNP-array based genome alteration analysis methods for cancer samples, including paired and non-paired LOH analysis, copy number analysis, finding significantly altered regions across multiple samples, and hierarchical clustering methods. We also provide references and summaries for additional analysis methods and software packages. Many visualization and analysis functions introduced in this chapter are implemented in the dChip software (www.dchip.org), which is freely available to the research community.
1 Background

1.1 Cancer genomic alterations
A normal human cell has 23 pairs of chromosomes. For the autosomal chromosomes (1 to 22), there are two copies of homologous chromosomes inherited respectively from the father and mother of an individual. Therefore, the copy number of all the autosomal chromosomes is two in a normal cell. However, in a cancer cell the copy number can be smaller or larger than two at some chromosomal regions due to chromosomal deletions, amplifications and rearrangements. These alterations start
to happen randomly in a single cell but are subsequently selected and inherited in a clone of cells if they confer a growth advantage to the cells. Such chromosomal alterations and growth selections have been associated with the conversion of ancestral normal cells into malignant cancer cells. The most common alterations are loss of heterozygosity (LOH), point mutations, chromosomal amplifications, deletions, and translocations. LOH (the loss of one parental allele at a chromosomal segment) and homozygous deletion (the deletion of both parental alleles) can disable tumor suppressor genes (TSGs) [1]. In contrast, chromosomal amplifications may increase the dosage of oncogenes that promote cell proliferation and inhibit apoptosis. The detection of these alterations may help identify TSGs and oncogenes and consequently provide clues about cancer initiation or growth [2, 3]. Analyzing genomic alteration data across multiple tumor samples may distinguish chromosomal regions harboring cancer genes from regions with random alterations due to the instability of the cancer genome, leading to the identification of novel cancer genes [4, 5]. Defining cancer subtypes based on genomic alterations may also provide insights into the new classification and treatment of cancer [6, 7].
1.2 Identifying cancer genomic alterations using oligonucleotide SNP microarrays
Several experimental techniques are currently used to identify genomic alterations at various resolutions and aspects. The cytogenetic methods range from fluorescence in situ hybridization (FISH) to spectral karyotyping (SKY) [8] and provide a global view of chromosomal organization at the single-cell level. At individual loci, low-throughput methods measure allelic polymorphisms or DNA copy numbers using locus-specific primers, while high-throughput methods, such as digital karyotyping [9] and comparative genomic hybridization (CGH) using chromosomes or microarrays [10-12], measure DNA copy number changes at many loci simultaneously. Single nucleotide polymorphisms (SNPs) are single base pair variations that occur in the genome of a species. They are the most common genetic variations in the human genome and occur on average once in several hundred basepairs. The NCBI dbSNP database (http://www.ncbi.nlm.nih.gov/projects/SNP/) stores about 10 million human SNPs identified by comparing the DNA of different individuals. High-density oligonucleotide SNP microarrays have also been developed by Affymetrix for high-throughput genotyping of human SNPs. The SNP marker density has increased rapidly during recent years, from the Mapping 10K array (10,000 markers) and Mapping 100K array set (116,204 markers spaced at 23.6 Kb) to the latest SNP 6.0 array with nearly one million SNPs [13-16]. The experimental assays require only 250 ng of starting DNA sample and one primer to amplify the portions of the genome containing the SNPs interrogated by the array. This complexity-reduced sample is then fragmented, labeled and hybridized to an oligonucleotide SNP array containing probes complementary to the sequences surrounding the SNP loci. The Affymetrix analysis software is commonly used to analyze the scanned array image and compute genotype calls of all the SNPs in a
Figure 1: (A) The design scheme of a probe set on the HuSNP array, which contains 1400 SNPs. Five quartets (columns) of oligonucleotide probes interrogate the genotype of a SNP. In the central quartet, the perfect match (PM) probes are complementary to the reference DNA sequences surrounding the SNP with two alleles (denoted as allele A and B). The mismatch (MM) probes have a substituted central base pair compared to the corresponding PM probes, and they control for cross-hybridization signals. Shifting the central quartet by -1, 1 and 4 basepairs forms four additional quartets. The newer generations of SNP arrays have selected quartets from both forward and reverse strands (see Figure 6). (B) A genotyping algorithm makes genotype calls for this SNP in three different samples based on hybridization patterns. For a diploid genome, three genotypes are possible in different individuals: AA, AB and BB
sample with high accuracy (Figure 1) [17, 79]. Several groups have pioneered the use of SNP arrays for LOH analysis of cancer [6, 13, 18]. These studies compared the SNP genotype calls or probe signals of a cancer sample to those of a paired normal sample from the same patient (Figure 2), and identified chromosomal regions with shared LOH across multiple samples of a cancer type. They provided a proof of principle that SNP arrays can identify genomic alterations at comparable or superior resolution to microsatellite markers. Additional studies have applied SNP arrays to identify LOH in various cancer tissue types [7, 19]. In addition, SNP arrays have been utilized for copy number analysis of cancer samples [20, 21]. Analysis of probe-level signals on the SNP arrays identified genomic amplifications and homozygous deletions, which were confirmed by Q-PCR and array CGH on the same samples. These analysis methods have been implemented in several software packages to be used by the research community (see Section 5). In this chapter, we will review the analysis methods and software for SNP array applications in cancer genomic studies. We will focus on the methods and software developed by our group since they are representative, but we will also discuss related methods developed by others and future directions.
Figure 2: Comparing the genotypes of a SNP in paired normal and tumor samples gives LOH calls. Only SNP markers that are heterozygous in normal samples are informative to provide LOH or retention calls
2 Loss of heterozygosity analysis using SNP arrays

2.1 LOH analysis of paired normal and tumor samples
For each SNP, there are two alleles in the population, usually represented by A and B (corresponding to two of the four different nucleotides: A/T/G/C). Each person has two copies of every chromosome, one inherited from each parent. Therefore, the genotype of an SNP in one sample can be homozygous (AA or BB) if the SNP alleles in the two chromosomes are the same, or heterozygous (AB) if they are different. Loss of heterozygosity (LOH) refers to the loss of the contribution of one parent in selected chromosome regions, due to hemizygous deletion or mitotic gene conversion [23]. LOH events frequently occur when somatic normal cells undergo transformations to become cancerous cells. By comparing the normal and tumor cells from the same patient, the SNPs can be classified as LOH or retention of heterozygosity (or simply "retention") (Figure 2). Afterwards, the LOH data can be visualized along a chromosome (Figure 3), and non-informative calls may be inferred to have LOH or retention status through the "Nearest neighbor" or "The same boundary" method [22].

Figure 3: Observed LOH calls obtained by comparing normal (N) and tumor (T) genotypes at the same SNP. The sample pairs are displayed on the columns (column names only show tumor samples) and the SNPs are ordered on the rows by their p-arm to q-arm positions in one chromosome. Yellow: retention (AB in both N and T); Blue: LOH (AB in N, AA or BB in T); Red: conflict (AA/BB in N, BB/AA or AB in T); Gray: non-informative (AA/BB in both N and T); White: no call (no call in N or T)
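To make the paired-sample calling rule concrete, the short sketch below maps a (normal, tumor) genotype pair to the call categories listed in the Figure 3 legend. It is an illustrative restatement of the rule, not the dChip implementation; the genotype coding ('AA', 'AB', 'BB', 'NoCall') is an assumption.

```python
def paired_loh_call(normal: str, tumor: str) -> str:
    """Classify one SNP from paired normal/tumor genotype calls.

    Genotypes are assumed to be coded 'AA', 'AB', 'BB', or 'NoCall'.
    Only SNPs that are heterozygous in the normal sample are informative.
    """
    if normal == "NoCall" or tumor == "NoCall":
        return "NoCall"
    if normal == "AB":                                  # informative marker
        return "LOH" if tumor in ("AA", "BB") else "Retention"
    # Normal sample is homozygous: loss of one parental allele is undetectable.
    return "Non-informative" if tumor == normal else "Conflict"

# The three cases illustrated in Figure 2.
print(paired_loh_call("AB", "AA"))   # LOH
print(paired_loh_call("AB", "AB"))   # Retention
print(paired_loh_call("AA", "AA"))   # Non-informative
```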
2.2 Tumor-only LOH inference
Often paired normal samples for primary tumors or cell lines are not available, and for hematologic malignancies, blood-derived normal DNA is difficult to obtain. As higher-density SNP arrays become available, it is possible to use only the tumor sample to identify LOH regions. For example, when we see a long stretch of homozygous calls in a chromosome region of a tumor sample, it is highly likely that LOH has happened in this region. If we assume that the average heterozygosity is 0.3 and the SNPs are independent, the probability of observing N consecutive homozygous SNP calls when no LOH happens is (1 - 0.3)^N, which can be very small when N is large, favoring the hypothesis that the region has undergone LOH. Previously, the homozygosity mapping of deletion (HOMOD) method identified regions of hemizygous deletion in unmatched tumor cell lines by defining a region of LOH as more than five consecutive homozygous microsatellite markers [24]. However, applying this approach to SNP array data is not trivial since SNP markers are less polymorphic, the inter-marker distances are variable, and the haplotype structure may lead to interdependence in the genotype calls of neighboring SNPs. In addition, as with other methods there is a small rate of genotyping error. These multiple sources of information are not comprehensively accounted for in a
simple probability or counting method. We utilized a Hidden Markov Model (HMM; [25]) to formalize this inference and use these multiple sources of information [26]. HMMs have been used for modeling biological data in diverse applications such as protein and DNA sequence analysis [27-29], linkage studies [30, 31], and array CGH analysis [32]. SNP genotype or signal data along a chromosome are chain-like, have SNP-specific genotype or signal distributions, and have variable inter-marker distances. For tumor-only LOH analysis, we conceptualize the observed SNP calls (homozygous call A or B, heterozygous call AB) as being generated by the underlying unobserved LOH states (either Loss or Retention) of the SNP markers (Figure 4). Under the Retention state, we observe homozygous or heterozygous SNP calls with probabilities determined by the allele frequency of the SNP. Under the Loss state, we will almost surely observe a homozygous SNP call. However, we need to account for possible SNP genotyping errors at a rate of less than 0.01 [14] and also for the wrong mapping of a few markers. These distributions give the "emission probabilities" of the HMM.
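A minimal sketch of such emission probabilities is given below, with assumed parameter values (SNP heterozygosity around 0.3 and an error allowance of 0.01); the exact parameterization used in [26] may differ.

```python
def emission_probs(het_freq: float, error: float = 0.01) -> dict:
    """Return P(observed call | hidden LOH state) for one SNP.

    het_freq : population heterozygosity of the SNP (e.g., ~0.3).
    error    : allowance for genotyping error and mis-mapped markers.
    Illustrative only; the published model may parameterize this differently.
    """
    return {
        "RET":  {"Het": het_freq, "Hom": 1.0 - het_freq},   # Retention state
        "LOSS": {"Het": error,    "Hom": 1.0 - error},      # Loss state
    }

probs = emission_probs(0.3)
print(probs["LOSS"]["Het"])   # a heterozygous call under Loss is very unlikely
```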
Figure 4: Graphical depiction of the elements comprising the Hidden Markov Model (HMM) for LOH inference. The underlying unobserved LOH states (Loss or Retention (RET)) of the SNP markers generate the observed SNP calls via emission probabilities. The solid arrows indicate the transition probabilities between the LOH states. The broken arrows indicate the dependencies between consecutive SNP genotypes within a block of linkage disequilibrium, which are handled in an expanded HMM model (not shown)
In order to determine the HMM initial probabilities, we next estimated the proportion of the genome that is retained by assuming that no heterozygous markers should be observed in a region of Loss except in the case of genotyping error. To this end, the proportion of the retained genome in a tumor sample was estimated by dividing the proportion of heterozygous markers in the tumor sample by the average rate of heterozygosity of the SNPs in the population (0.35 for the 10K arrays and 0.27 for the 100K arrays [14, 15]). This retention marker proportion was used as the HMM initial probability, specifying the probability of observing the Retention state at the beginning of the p-arm, and was also used as the background LOH state distribution in a sample. Lastly, we specified the transition probabilities that describe the correlation between the LOH states of adjacent markers. In a particular sample, the larger the
distance between the two markers is, the more likely genetic alteration breakpoints will happen within the interval and make the LOH states of the two markers independent. We use Haldane's map function θ = (1 − e^(−2d))/2 [31] to convert the physical distance d (in units of 100 megabases ≈ 1 Morgan) between two SNP markers to the probability (2θ) that the LOH state of the second marker will return to the background LOH state distribution in this sample and thus be independent of the LOH state of the first marker. Although Haldane's map function is traditionally used in linkage analysis to describe meiotic crossover events, the motivations for applying it in the context of LOH events are the following: (1) LOH can be caused by mitotic recombination events [33], and mitotic recombination events may share similar initiation mechanisms and hot spots with meiotic crossover events [34]; (2) the empirical transition probabilities estimated from observed LOH calls based on paired normal and tumor samples agree well with Haldane's map function (see Figure 2 in [26]). We used this function to determine the transition probabilities as follows. If one marker is Retention, the only way for the next adjacent marker to be Loss is that the two markers are independent and the second marker has the background LOH state distribution. This happens with probability 2θ · P0(Loss), where P0(Loss) is the background LOH Loss probability described previously. If the first marker is Loss, the second could be Loss either due to the dependence between the two markers (occurring with probability 1 − 2θ), or due to the second marker having the background LOH state distribution (occurring with probability 2θ · P0(Loss)). Therefore, the probabilities of the second marker being Loss given the LOH status of the current marker are:

P(Loss | Loss) = 2θ · P0(Loss) + (1 − 2θ),
P(Loss | Retention) = 2θ · P0(Loss).

These equations provided the marker-specific transition probabilities of the HMM, determining how the LOH state of one marker provides information about the LOH state of its adjacent marker. The HMM with these emission, initial and transition probabilities specifies the joint probability of the observed SNP marker calls and the unobserved LOH states in one chromosome of a tumor sample. The Forward-Backward algorithm [29] was then applied separately to each chromosome of each sample to obtain the probability of the LOH state being Loss for every SNP marker, and the inferred LOH probabilities can be displayed in dChip and analyzed in downstream analyses. To evaluate the performance of the HMM and tune its parameters, we applied it to the 10K array data of 14 pairs of matched cancer (lung and breast) and normal cell lines [21]. Only data from autosomes were used. The average observed sample LOH proportion of these paired samples is 0.52 ± 0.15 (standard deviation). We assessed the performance of the HMM by comparing the LOH events inferred using only the tumor data to the LOH events observed by comparing tumor data to their normal counterparts. Since the tumor-only HMM provides only the probability of each SNP being Loss or Retention, rather than making a specific call, we used the least stringent threshold to make LOH calls, in which an SNP is called Loss if it has a probability of Loss greater than 0.5, and Retention otherwise. Using
this threshold, very similar results were achieved when comparing the inferred LOH from unmatched tumor samples to the observed LOH from tumor/normal pairs (Figure 5). Specifically, we found that 17,105 of 17,595 markers that were observed as Loss in tumor/normal pairs were also called Loss in unmatched tumors (a sensitivity of 97.2%), and 15,845 of 16,069 markers that were observed as Retention in tumor/normal pairs were also called Retention in unmatched tumors (a specificity of 98.6%). However, comparison of the tumor-only inferences (Figure 5B) to the observed LOH calls (Figure 5A) identifies occasional regions that are falsely inferred as LOH in the tumor-only analysis (indicated by arrows). We found that these regions are due to haplotype dependence of SNP genotypes, which can be addressed by extending the HMM model to consider the genotype dependence between neighboring SNPs (Figure 4; see [26] for details).
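The sketch below assembles the ingredients described above — Haldane-based transition probabilities, simple emission probabilities, and a scaled forward-backward pass — to compute the posterior probability of Loss at each marker. It is a simplified illustration with assumed uniform emission parameters and no haplotype extension, not the dChip code.

```python
import numpy as np

def loh_forward_backward(calls, pos_mb, p0_loss, het=0.3, err=0.01):
    """Posterior P(Loss) per SNP from tumor-only genotype calls.

    calls   : list of 'Het'/'Hom' calls ordered along one chromosome arm.
    pos_mb  : marker positions in megabases (same order).
    p0_loss : background probability of the Loss state in this sample.
    """
    emit = np.array([[het, 1 - het],          # Retention: P(Het), P(Hom)
                     [err, 1 - err]])         # Loss
    obs = np.array([0 if c == "Het" else 1 for c in calls])
    n = len(obs)
    bg = np.array([1 - p0_loss, p0_loss])     # background state distribution

    def trans(d_mb):
        theta = (1 - np.exp(-2 * d_mb / 100.0)) / 2    # 100 Mb ~ 1 Morgan
        p = 2 * theta
        return p * np.tile(bg, (2, 1)) + (1 - p) * np.eye(2)

    # Forward and backward recursions, rescaled at each step to avoid underflow.
    fwd = np.zeros((n, 2)); bwd = np.ones((n, 2))
    fwd[0] = bg * emit[:, obs[0]]; fwd[0] /= fwd[0].sum()
    for i in range(1, n):
        A = trans(pos_mb[i] - pos_mb[i - 1])
        fwd[i] = (fwd[i - 1] @ A) * emit[:, obs[i]]
        fwd[i] /= fwd[i].sum()
    for i in range(n - 2, -1, -1):
        A = trans(pos_mb[i + 1] - pos_mb[i])
        bwd[i] = A @ (emit[:, obs[i + 1]] * bwd[i + 1])
        bwd[i] /= bwd[i].sum()
    post = fwd * bwd
    return (post / post.sum(axis=1, keepdims=True))[:, 1]   # P(Loss) per marker

# A long homozygous run should receive a high posterior probability of Loss.
calls = ["Het", "Hom", "Hom", "Hom", "Hom", "Hom", "Hom", "Het"]
print(loh_forward_backward(calls, np.arange(8) * 0.5, p0_loss=0.4).round(2))
```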
Figure 5: Comparison of HMM-based LOH inferred from tumor-only samples to the observed LOH based on paired normal and tumor samples, using chromosome 10 data of four pairs of 10K SNP arrays. (A) Observed LOH from normal/tumor pairs. LOH is indicated in blue/dark color. (B) LOH inferred from the HMM using only tumor samples. Blue/dark color indicates a probability of LOH near 1 and yellow/light color indicates a probability of LOH near 0
3 Copy number analysis using SNP arrays

3.1 Obtaining raw copy numbers from SNP array data
Initially, SNP arrays were used mainly through genotypes in linkage analysis and LOH analysis. As higher-density arrays provide more reliable probe signal values, SNP arrays have been adapted to analyze copy number changes in tumor samples. We and others developed analysis methods to perform copy number analysis of tumor samples using SNP arrays [20, 21]. Our procedures described below were tested on 14 pairs of lung and breast normal and cancer cell lines hybridized to
the 10K SNP array (the 10K dataset used in [21]).

Normalizing arrays. The median PM and MM probe intensities of the arrays in an experiment usually differ by severalfold (110 to 329 in the 10K dataset), indicating the need for normalization in order to compare the signals across different arrays. We used the Invariant Set Normalization method [35] to normalize all the arrays at the probe intensity level to a baseline array with moderate overall intensity. This method adaptively selects probes that have similar ranks between one array and the baseline array to determine the normalization function.

Computing model-based signals. Since the probe response patterns of the three genotypes (AA, BB and AB) are similar only within each genotype, we summed up the PM intensities of the allele A and allele B probes within a quartet, and did the same for the MM probes (Figure 6). This transformation makes the probe intensity pattern and magnitude of a probe set comparable across arrays regardless of the genotypes. Then the PM/MM difference model [36] was applied to the transformed probe-level data to compute the model-based signal values for each SNP and sample. The model-based method weights probes by their sensitivity and consistency when computing signal values, and image artifacts are also identified and eliminated by the outlier detection algorithm in this step.

Computing raw copy numbers. For each SNP, the signal values of all normal samples were averaged to obtain the mean signal of 2 copies (for SNPs on chromosome X of male samples, the signals are multiplied by 2 before averaging), and the observed copy number in all samples is defined as (observed signal / mean signal of 2 copies) × 2, and visualized in a white (0 copies) to red color scale (Figure 7A). Since we hybridize the same amount of amplified DNA sample to each array and the normalization also makes arrays have similar overall brightness, absolute copy number information is not available from the SNP array data. We can only obtain copy numbers relative to the sample ploidy (average copy number), which is scaled to be the normal copy number of two in SNP array data. This does not affect the utility of SNP arrays in genomic alteration studies, since copy number changes relative to ploidy are more informative for finding cancer genes than absolute copy numbers.
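As a small illustration of the raw copy number step, with a hypothetical signal matrix (male chromosome X handling omitted):

```python
import numpy as np

# signals: SNPs x samples matrix of model-based signal values (hypothetical data);
# the first two columns are assumed to be the normal reference samples.
signals = np.array([[1200.0, 1180.0, 2500.0],
                    [ 800.0,  820.0,  390.0]])
normal_idx = [0, 1]

mean_two_copies = signals[:, normal_idx].mean(axis=1)   # mean signal of 2 copies
raw_copy = signals / mean_two_copies[:, None] * 2       # (signal / mean of 2 copies) * 2
print(raw_copy.round(2))   # tumor column: ~4.2 copies for SNP 1, ~1 copy for SNP 2
```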
3.2 Inferring integer copy numbers
Similar to the tumor-only LOH analysis (Section 2.2), we used an HMM to model multiple sources of information and the copy number dependency among neighboring markers on a chromosome. After the components of the HMM are set up as below, the HMM gives the probability of the data (the noisy raw copy numbers) as a function of the underlying unobserved real copy numbers. This probability function is maximized by the Viterbi algorithm to infer the most probable real copy numbers.

Figure 6: (A) The dChip PM/MM data view displays the probe-level data of an SNP in several samples from top to bottom. The 20 probe pairs are ordered and connected from left to right. Within a pair, the blue/dark color represents the PM probe intensity and the gray/light color represents the MM probe intensity. This probe set has 5 quartets of probes complementary to the DNA forward strand and 5 quartets complementary to the reverse strand. (B) The same probe set is displayed as raw intensity values in the format of quartets as in Figure 1. (C) Making the probe set patterns and signals of genotype calls AA, AB and BB comparable by adding up the allele A probe and allele B probe signals. After this addition, the PM/MM difference signals are comparable for genotypes AB (1st row) and AA (2nd row). The probe signals on the 3rd row are about half of the signals on the 2nd row, presumably corresponding to a hemizygous deletion with genotype A in this sample.

Emission probabilities. They specify the SNP-specific distribution of the raw copy numbers (RawCopy) given the underlying real copy number C (integer valued, or stepped by an interval such as 0.2 copy to account for relative copy numbers). We assume that for each SNP the standardized observed raw copy numbers are random values drawn from a Student t distribution. The standardization uses the underlying real copy number (C) and the sample standard deviation (Std) of the raw copy numbers of the normal samples, scaled by the factor C/2:

(RawCopy − C) / (Std · C/2) ∼ t(40).

The degree of freedom 40 of the t-distribution is used to obtain smooth inferred copy numbers while allowing for outlier data points.

Transition and initial probabilities. They specify how the real copy number of one marker (denoted by C1) provides information about the real copy number of the adjacent marker (denoted by C2). We assume that the copy number transitions are caused by genetic recombination events and that close SNP markers are more likely to have the same copy number. Haldane's map function θ = (1 − e^(−2d))/2 [31] is used to convert the genetic distance d (in units of 100 megabases ≈ 1 Morgan) between two SNP markers to the probability (2θ, denoted by Pθ) that the copy number of the second marker will return to the background distribution of copy numbers in this sample and thus be independent of the copy number of the first marker. These probabilities are used together with the background probabilities (Pb) estimated below to form the HMM's transition probabilities:

P(C2 | C1) = Pθ · Pb(C2)                  if C1 ≠ C2,
P(C2 | C1) = Pθ · Pb(C2) + (1 − Pθ)       if C1 = C2.

The initial or background probabilities of copy numbers in each sample are first set to be non-informative: 0.9 for copy 2 and 0.1/(N − 1) for copies 0 to N except 2, where N is the maximal allowed copy number in the inference. The HMM is run as described below and then the inferred copy numbers are used to re-estimate the sample-specific background probabilities of the copy numbers. After this, the HMM model is re-run to obtain the final inference results. These background distributions are used as the HMM's initial probabilities, specifying the probability of observing a particular copy number at the beginning of the p-arm, and are also used to form the transition probabilities as described above.

Figure 7: (A) The raw copy number view in dChip. The SNPs on chromosome 8 are displayed on the rows with the p-arm to q-arm ordering. The samples are displayed on the columns with the normal sample first and the paired tumor sample next. The 0 copy is displayed in white, 5 or more copies in red (black), and intermediate copy numbers in light red (gray) color. The curve on the right in the shaded box represents the raw copy numbers of a selected tumor sample (H2171), and the vertical line represents 2 copies. Note that larger raw copy numbers are associated with larger variation, as seen from the two peaks in the curve. (B) The HMM-inferred integer copy numbers.

Viterbi algorithm to infer copy number. An HMM model with these probabilities specifies the joint probability of the unobserved copy numbers and the observed raw copy numbers, and the Viterbi algorithm [25] is then used to obtain the most probable underlying copy number path of the SNPs in a chromosome (in the p-arm to q-arm ordering), given the observed raw copy numbers. The HMM is applied to all chromosomes and all samples separately, and the most probable copy number path is defined as the inferred copy number values and visualized in the same way as the raw copy numbers (Figure 7B). In applying the HMM, we set the initial, transition, and emission probabilities as parameters estimated from the data before applying the HMM, instead of using the Baum-Welch algorithm to estimate them together with the unobserved copy number states [29]. The main reason is that there are many parameters in the model, including the SNP- and copy-number-dependent transition probabilities, but there are only a few samples to estimate them together with the unobserved copy numbers. This could lead the Baum-Welch algorithm to quickly converge to local maxima depending on the starting values.
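A compressed sketch of this copy-number HMM — t-distributed emissions, Haldane-based transitions, and Viterbi decoding — is shown below. Parameter values and the non-informative background are illustrative assumptions; the dChip implementation differs in detail.

```python
import numpy as np
from scipy.stats import t as t_dist

def infer_copy_numbers(raw_cn, pos_mb, std_normal, max_copy=6, df=40):
    """Viterbi decoding of integer copy numbers from raw copy numbers.

    raw_cn     : raw copy numbers of one sample along one chromosome.
    pos_mb     : marker positions in megabases.
    std_normal : per-SNP standard deviation of raw copy number in normal samples.
    """
    states = np.arange(max_copy + 1)                      # copy numbers 0..max_copy
    n = len(raw_cn)
    bg = np.full(len(states), 0.1 / (len(states) - 1))    # non-informative background
    bg[2] = 0.9

    def log_emit(i):
        scale = np.maximum(std_normal[i] * states / 2.0, 1e-3)   # Std * C/2
        return t_dist.logpdf((raw_cn[i] - states) / scale, df) - np.log(scale)

    def log_trans(d_mb):
        theta = (1 - np.exp(-2 * d_mb / 100.0)) / 2
        p = 2 * theta
        return np.log(p * np.tile(bg, (len(states), 1)) + (1 - p) * np.eye(len(states)))

    # Viterbi recursion with backtracking.
    delta = np.log(bg) + log_emit(0)
    back = np.zeros((n, len(states)), dtype=int)
    for i in range(1, n):
        scores = delta[:, None] + log_trans(pos_mb[i] - pos_mb[i - 1])
        back[i] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit(i)
    path = [int(delta.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return np.array(path[::-1])

raw = np.array([2.1, 1.9, 0.9, 1.1, 1.0, 2.2, 2.0])
# The middle markers should be inferred as one copy (hemizygous deletion).
print(infer_copy_numbers(raw, np.arange(7) * 0.2, np.full(7, 0.3)))
```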
3.3 Copy number analysis results of the 10K dataset
The copy number analysis of the 10K array cell line samples found the following [21]: (1) Most high-copy amplifications (7-14 copies), homozygous deletions (0 copies) and hemizygous deletions (1 copy) predicted by the SNP-inferred copy numbers were confirmed by Q-PCR experiments. These regions harbor many known oncogenes and TSGs. The agreement between the SNP-inferred copy numbers and the Q-PCR measured copy numbers is good (R² = 0.98) for low copy numbers (Figure 8A). However, there is a saturation effect of the SNP-based copy numbers at high copy numbers, which may be due to the PCR amplification step in the SNP array assay. (2) Combining LOH and copy number analysis of SNP arrays can distinguish LOH due to copy-neutral events from hemizygous deletion (Figures 8B and 8C). Note that the copy number and LOH information was obtained through separate analysis of the same data.
Figure 8: (A) The agreement between the inferred copy number from the SNP array (X-axis) and the Q-PCR based copy number (Y-axis) on the same samples. The error bars represent the standard deviation of the Q-PCR based copy number of several loci having the same inferred copy number. The figure is courtesy of Xiaojun Zhao. (B) The LOH events on chromosome 8 of six tumor samples are highlighted in blue/dark color. (C) The corresponding raw copy numbers of both normal and tumor samples are shown for the same samples. In the sample 1648 (the 1st column), the whole-chromosome LOH is due to hemizygous deletion in the p-arm but gene conversion in the q-arm
In the five lung primary tumor samples profiled by SNP arrays, one predicted
amplification was confirmed but two homozygous deletions appeared to be false positives. This calls for purification of tumor samples or improved data analysis methods. The mixture experiments of normal and tumor cell lines from the same individual show that amplifications suffer less than homozygous deletions when they are detected by SNP arrays in normal-contaminated tumor samples. With 40% normal sample contamination we can still identify amplifications of 7 to 9 copies, but 10%-20% normal sample contamination will render homozygous deletions undetectable. In addition, acceptable performance of the LOH analysis by comparing paired normal and mixture normal-tumor samples was achieved only when normal sample contamination is less than 20%. The Affymetrix genotyping algorithm for SNP arrays was developed to genotype normal samples accurately. This algorithm makes conservative "No Calls" or incorrect genotype calls in tumor samples where deletion, amplification and normal sample contamination are common. These alterations create complex genotypes such as AAB at chromosome regions of three-copy amplifications, or hemizygous deletion in normal-contaminated tumor samples. Calling AAB incorrectly as AB in a tumor sample will miss a real LOH event. In the section below, we will discuss improved analysis methods that address these issues.
3.4 Allele-specific copy numbers and major copy proportion
As seen in the previous sections, the LOH and copy number analyses have been performed separately based on the same raw probe signal data. Since high-density SNP arrays have separate probes to detect signals from the two different SNP alleles (Figure 1 and Figure 6A), they can be used to estimate allele-specific copy numbers (ASCN) and parent-specific copy numbers from SNP array data [37, 38]. The parent-specific copy number (PSCN) at an SNP locus in a sample is defined as the copy number pair (C1, C2), where C1 is the smaller parental copy number and C2 is the larger parental copy number, unless C1 is equal to C2 (e.g., in normal samples, they are both 1). At heterozygous SNPs with genotype AB, PSCNs are the same as allele-specific copy numbers. To obtain allele-specific raw copy numbers from the probe-level data, the arrays are first normalized as in Section 3.1. We can then compute allele-specific signals by applying the PM/MM difference model [36] separately to the probe-level data of the A alleles and B alleles of all samples at an SNP probe set (Figure 6A). Next, the allele-specific signal values of all normal samples (usually ≥ 10, e.g., [39]) in a dataset and their genotypes are used to estimate the allele-specific signal distribution. Specifically, for each SNP, the A allele signal of a genotype AB or half of the A allele signal of a genotype AA is regarded as a sampling data point from the signal distribution of one copy of the A allele, and these data points are used to estimate the mean and standard deviation of this distribution. The same is done for the B allele's signal distribution. When there are fewer than six observed data points to estimate an allele-specific distribution, the total signal (sum of A and B allele signals) of all normal samples will be used to construct an allele-independent distribution of one copy [21] and used in place of the allele-specific distributions. Finally, the allele-
specific signals and standard deviations are divided by the allele-specific means to obtain the allele-specific raw copy numbers (RA and RB) and the standard deviation of copy number one (StdA and StdB) for each SNP. These allele-specific raw copy numbers can be smoothed [37] or modeled via an HMM to obtain allele-specific and parent-specific copy numbers. However, PSCN is a pair of copy numbers, which can lead to many underlying copy number states and inefficient computation when used in an HMM. Thus we have also proposed to estimate the major copy proportion (MCP). The MCP of an SNP is defined as C2/(C1 + C2), where C1 and C2 are the parental copy numbers at this SNP in a sample and C1 ≤ C2. The value of MCP is between 0.5 and 1 by definition, with various values corresponding to different relative proportions of parental copy numbers. The MCP is 0.5 for normal loci or allelic-balanced copy alterations, 1 for LOH, and a value between 0.5 and 1 for allelic-imbalanced copy number alterations. MCP therefore quantifies allelic imbalances and is a natural extension of LOH analysis. MCP and total copy number (C1 + C2) together provide the same amount of information as allele-specific copy numbers, while each of them is a scalar quantity that can be more efficiently estimated and conveniently used in downstream analysis. We describe here two examples that can benefit from estimating MCP values. First, normal sample contamination in tumors often leads to conservative "No Call" genotypes and intervening LOH or retention calls. MCP can better quantify the proportion of normal sample contamination while still identifying allelic-imbalanced regions due to LOH. Second, tumors with hyperploidy often contain allelic-imbalanced regions with both parental alleles kept. If the total copy number in such regions is close to the cell ploidy, copy number analysis will reveal a normal relative copy number and LOH analysis will show retention. However, ASCN or MCP analysis can discover allelic imbalance as a genomic alteration in such regions. The details of the HMM for inferring MCP and its comparison with LOH and copy number analysis can be found in [40].
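For example, given allele-specific raw copy number estimates, the total copy number and MCP can be computed directly; the sketch below uses hypothetical values.

```python
import numpy as np

# Hypothetical allele-specific raw copy numbers for a few SNPs in one sample.
cn_a = np.array([1.0, 2.1, 0.1, 1.9])
cn_b = np.array([1.1, 0.9, 1.8, 2.0])

total = cn_a + cn_b                            # total copy number C1 + C2
major = np.maximum(cn_a, cn_b)                 # larger parental copy number C2
mcp = major / np.clip(total, 1e-6, None)       # C2 / (C1 + C2), in [0.5, 1]

print(total.round(2))
print(mcp.round(2))   # ~0.5 balanced, ~1 LOH, in between: allelic imbalance
```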
3.5 Copy number variations in normal samples and other diseases
Recent studies have discovered large-scale structural and copy number variations (CNVs) in normal individuals [41, 42]. These structural variations and CNVs have intriguing implications for human evolution history and susceptibility to diseases [43, 44]. Subsequently, larger scale studies identified many more CNVs in the human genome and correlated them with gene expression and haplotypes [45-47]. These CNVs were found to often occur in multiple populations and in the loci of frequent segmental duplications, suggesting a common mechanism of CNVs in normal individuals and cancer patients. The CNVs in normal individuals could hinder the copy number analysis of cancer samples because: (1) we use normal samples to obtain reference signal distributions to identify CNVs in tumor samples, but we cannot always assume that the reference normal samples all have a copy number of 2 at an SNP locus; (2) some copy number alterations found in tumor samples could be germline inherited rather than cancer-related somatic
alterations, confounding the aim of identifying cancer-related loci from SNP array data. To address issue (1) above, and the fact that researchers often cannot obtain enough (> 10) normal reference samples to analyze together with tumor samples, we propose to use a trimming method to obtain the reference signal distribution. Specifically, we assume that, in a set of reference normal samples (or a set of tumor samples when normal samples are not available or too few), for any SNP, at most a certain percent of the samples have abnormal copy numbers at the SNP locus (the abnormal percentage P is user defined, such as 10%). Then for each SNP, (P/2)% of samples with extreme signals are trimmed from the high and low ends, and the remaining samples are used to estimate the mean and standard deviation of the signal distribution of the normal copy number 2 at this SNP. This trimming method is designed to accommodate the small amount of CNVs in reference normal samples. To address issue (2) above and also test the feasibility of the method proposed in (1), we used a public 100K SNP array dataset of 90 normal individuals of European ethnicity (https://www.affymetrix.com/support/technical/sample_data/hapmap_trio_data.affx). We set the abnormal percentage as 5% to search for CNVs in these samples, and this method is able to identify 39 CNVs in the 90 normal samples with sizes ranging from 120 Kb to 16 Mb (4 to 411 SNPs) (Figure 9). Some of the CNVs are recurrent in related family members. These identified CNVs in normal samples can be excluded when analyzing 100K SNP data of tumor samples of Caucasian origin, so that the same CNVs found in tumor samples are not considered in downstream analysis. If some studies contain paired normal samples for tumors, one can screen the normal samples for CNVs, and the CNVs found in both normal and tumor samples will be excluded from cancer locus analysis. In addition, public databases such as the Database of Genomic Variants (http://projects.tcag.ca/variation/) and the Human Structural Variation Database [48] contain the known CNVs from various studies. The information on the genome position, size, and individual ethnicity of the CNVs in these databases can be used to exclude the known CNVs in normal individuals when analyzing cancer samples of the corresponding ethnic groups.
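A sketch of the trimming step follows, with an assumed abnormal percentage P and simulated signal values.

```python
import numpy as np

def trimmed_reference(signals, p=0.10):
    """Per-SNP reference mean/std after trimming the most extreme samples.

    signals : SNPs x samples matrix of signal values.
    p       : assumed fraction of samples allowed to carry a CNV at any SNP;
              p/2 of the samples are trimmed from each end before estimating
              the two-copy reference distribution.
    """
    n_snps, n_samples = signals.shape
    k = int(np.floor(n_samples * p / 2))        # samples trimmed per side
    ordered = np.sort(signals, axis=1)
    kept = ordered[:, k:n_samples - k] if k > 0 else ordered
    return kept.mean(axis=1), kept.std(axis=1, ddof=1)

rng = np.random.default_rng(0)
sig = rng.normal(1000, 50, size=(3, 40))        # simulated two-copy signals
sig[1, :2] *= 1.6                               # a CNV in two samples at SNP 2
mean_ref, std_ref = trimmed_reference(sig, p=0.10)
print(mean_ref.round(1), std_ref.round(1))      # SNP 2 reference is largely unaffected
```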
Figure 9: The CNVs of chromosome 22 identified in 90 normal individuals using the 100K Xba SNP array. The rows are ordered SNPs and the columns are samples. The white or light red colors correspond to lower than 2 copies (homozygous or hemizygous deletions). The CNVs are between the region 20.7 - 22 Mb (22q11.22 - q11.23). The CNVs in this region have been reported in other studies [42, 48]. Samples F4 and F6 are from the same family and share a CNV region
Furthermore, the SNP-array based methods have also been used to find genomic alterations in other diseases involving chromosomal changes, such as Down syndrome (chromosome 21 has 3 copies), autism spectrum disorder [49, 50] and mental retardation [51]. A similar trimming method as above may be used to identify copy number changes in the samples of these diseases. The DECIPHER database (https://decipher.sanger.ac.uk/) can be queried to check if a CNV is related to a disease phenotype.
3.6 Other copy number analysis methods for SNP array data
Another major research topic in copy number analysis using SNP or CGH arrays is how to infer the underlying real copy numbers in cells based on the raw copy numbers or log ratios observed from arrays. In addition to the HMM method introduced above, median smoothing is a simple but intuitive method. One can set an SNP marker window size (e.g., 10) so that a SNP's smoothed copy number is the median of the raw copy numbers of the SNPs in the surrounding window. Compared to the HMM-inferred copy numbers, this method performs faster and gives results closer to the raw copy numbers. It is also more robust to outliers in raw copy numbers, and does not need parameter specifications as in the HMM method. However, median-smoothed copy numbers are not as smooth as HMM-inferred copy numbers, and copy changes smaller than half of the window size will be smoothed out. The circular binary segmentation method has been introduced to analyze array CGH data and identify copy number change points in a recursive manner [52]. This algorithm utilizes permutation of probe orders to test the significance of candidate change points, based on which the mean and variance of chromosomal segments are estimated piecewise. Furthermore, a regression model has been used to formalize the detection of DNA copy number alterations as a penalized least squares regression problem [53]. One study compared 11 different algorithms for smoothing array CGH data, including methods using segmentation detection, smoothing, HMMs, regression, and wavelets [54]. These authors also provided a website allowing analysis of array CGH data using multiple algorithms (http://compbio.med.harvard.edu/CGHweb/).
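Median smoothing itself takes only a few lines; a sketch:

```python
import numpy as np

def median_smooth(raw_cn, window=10):
    """Smooth raw copy numbers with a running median over `window` SNPs."""
    half = window // 2
    padded = np.pad(raw_cn, half, mode="edge")   # extend ends so every SNP has a full window
    return np.array([np.median(padded[i:i + window]) for i in range(len(raw_cn))])

raw = np.array([2.1, 1.8, 2.3, 6.0, 2.0, 1.9, 2.2, 2.1, 1.7, 2.4])
print(median_smooth(raw, window=5))   # the single-SNP spike at 6.0 is smoothed away
```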
4 High-level analysis using LOH and copy number data

4.1 Finding significantly altered chromosome regions across multiple samples
After inferring LOH and copy number altered chromosome regions in individual samples, we are interested in defining the regions of alteration shared by multiple tumors more often than expected by random chance. Under the clonal selection process that generates the tumor cells, alterations at cancer gene loci are more likely to
be selected in the tumor samples than the alterations elsewhere in the genome. Therefore, the non-randomly shared alteration regions are likely to harbor cancer-causing genes or to be more susceptible to chromosomal damage in cancer samples, while the rest of the genome contains random alterations due to the genetic instability of tumors. There are existing methods to infer cancer-related chromosome regions from LOH data based on microsatellite markers or array CGH data of multiple samples. The instability-selection modeling of allelic loss and copy number alteration data has been used to locate cancer loci and discover combinations of genetic events associated with cancer sample clusters [57, 58]. This method has been applied to a meta-analysis of 151 published studies of LOH in breast cancer [59]. A tree-based multivariate approach focuses on a set of loci with marginally most frequent alterations [60]. The pairwise frequencies among these loci are then transformed into distances and analyzed by tree-building algorithms. The "Cluster along chromosomes" (CLAC) algorithm constructs hierarchical clustering trees for a chromosome and selects the interesting clusters by controlling the False Discovery Rate (FDR) [61]. It also provides a consensus summary statistic and its FDR across a set of arrays. We have developed a scoring method to identify significantly shared LOH regions unlikely to be due to random chance [22]. For paired normal and tumor samples, the first step is to define an LOH scoring statistic across samples. For a particular chromosomal region containing one or more SNPs, we defined a score in each individual to quantify the region's likelihood of being real LOH: the proportion of LOH markers among all the informative markers in this region, with some penalty given to conflict LOH calls. We used the proportion of LOH markers rather than the actual counts of LOH markers due to different marker densities at different chromosomal regions. The scores of all individuals are then summed up to give a summary LOH score for this chromosomal region. The second step is to use permutation to assess the significance of the score. Under the null hypothesis that there are no real LOH regions for the entire chromosome (all the observed LOH markers come from measurement error), one can generate the null distribution of the LOH statistic by permuting the paired normal and tumor samples and then obtaining LOH scoring statistics based on the permuted datasets. If all the observed LOH events are due to call errors and thus are not cancer-related (the null hypothesis), then the paired normal and tumor samples are conceptually indistinguishable, and the observed differences between them represent the background noise from which we would like to distinguish potentially real LOH events. Therefore, we can create the background noise by permuting the "normal" and "tumor" labels on each pair. We then compare the LOH events in the original dataset with the LOH events in a large number of such permuted datasets to assess the statistical significance of the former. The MaxT procedure [62] is used to adjust the p-values for multiple testing. For each permuted dataset, we obtain the maximal score among all the regions in the genome or a chromosome. The adjusted p-value of a specific region is the proportion of the permuted maximal scores that are greater than the observed LOH score. The significantly shared LOH regions in an SNP dataset of 52 pairs of normal and prostate cancer
samples harbor the known PTEN tumor suppressor gene [22]. We also tested the method on a breast cancer dataset where LOH events occur much more frequently [7]. The p-value curve for all chromosomes is able to capture the regions where the LOH events occur frequently across multiple tumors, and the LOH patterns cluster samples into two meaningful subtypes. The permutation approach above tests the null hypothesis that the observed shared LOH is due to SNP genotyping or mapping errors and that the normal and tumor labeling is non-informative for producing genomic alterations. This is reflected in the permutation scheme used to randomly switch the normal and tumor samples within a sample pair to generate permutation datasets under the null hypothesis. A more direct null hypothesis is that there is no region in the tumor genome that is selected to have more shared LOH events than the rest of the genome regions. In addition, when the paired normal sample is not available for LOH or copy number analysis, we can still use a simple scoring method such as the proportion of samples having LOH or copy alterations to quantify the sharing of LOH (Figure 10A), but permuting paired samples is not an option anymore. In such situations, we propose to permute SNP loci in each sample while preserving most of the dependence structure of neighboring SNPs. Specifically, for each sample with SNPs ordered first by chromosomes and then by positions within chromosomes, we randomly partition the whole genome into K (≥ 2) blocks, and randomly switch the order of these blocks while preserving the order of SNPs in each block. In this way the SNPs associated with LOH or copy number alterations in a sample are randomly relocated in blocks to new positions in the genome, while only minimally perturbing the dependence of the LOH or copy number data of neighboring markers. The same permutation scheme applies to all samples, using a different random partition for each sample. The LOH or copy number alteration score at
each SNP locus can then be computed for the permuted dataset, and the MaxT method can be used to assess the significance of the original scores. An LOH scoring curve can be computed and displayed next to the LOH data along with a significance threshold to help researchers locate and explore significantly shared LOH regions (Figure 10). Another cancer alteration scoring and permutation method, GISTIC [63], has been successfully applied to SNP array data of glioma and lung cancer samples to identify novel cancer-related genes that were subsequently experimentally confirmed [5].

Figure 10: (A) In the dChip LOH view, the tumor-only LOH inference calls LOH (blue color) in multiple prostate samples. The curve and the vertical line on the right represent LOH scores and a significance threshold. (B) The 3 Mb region of high LOH scores in (A) contains the known TSG retinoblastoma 1. The cytoband and gene information are displayed on the left. Data courtesy of Rameen Beroukhim
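The block-permutation and MaxT adjustment described above can be sketched as follows, using a simplified score (the proportion of samples with LOH at each SNP) and randomly chosen block boundaries per sample; this is an illustration, not the dChip implementation.

```python
import numpy as np

def maxT_pvalues(loh, n_perm=1000, n_blocks=20, seed=0):
    """MaxT-adjusted p-values for per-SNP shared-LOH scores.

    loh : SNPs x samples binary matrix (1 = LOH call) with SNPs in genome order.
    The score at each SNP is the proportion of samples with LOH; the null is
    generated by randomly reordering blocks of SNPs within each sample.
    """
    rng = np.random.default_rng(seed)
    n_snps, n_samples = loh.shape
    obs = loh.mean(axis=1)
    max_null = np.empty(n_perm)
    for b in range(n_perm):
        perm = np.empty_like(loh)
        for s in range(n_samples):
            cuts = np.sort(rng.choice(np.arange(1, n_snps), n_blocks - 1, replace=False))
            blocks = np.split(loh[:, s], cuts)
            order = rng.permutation(len(blocks))
            perm[:, s] = np.concatenate([blocks[i] for i in order])
        max_null[b] = perm.mean(axis=1).max()     # genome-wide maximal score
    return np.array([(max_null >= o).mean() for o in obs])

rng = np.random.default_rng(1)
loh = (rng.random((200, 20)) < 0.1).astype(int)
loh[80:90, :] = (rng.random((10, 20)) < 0.6).astype(int)   # a shared LOH region
p_adj = maxT_pvalues(loh, n_perm=200)
print(p_adj[80:90].round(3))    # small adjusted p-values in the shared region
```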
4.2 Hierarchical clustering analysis
Clustering and classification methods have been very popular and widely used in microarray data analysis since the introduction of gene expression microarrays [64, 65]. Increasingly, SNP and array CGH data have been analyzed by these methods since they offer a global view of the genome alterations in cancer samples. It is often interesting to look for subsets of tumor samples harboring similar alteration events across the genome and to correlate such tumor subclusters with clinical outcomes. Clustering chromosome loci may also reveal genes on different chromosomes that undergo genomic alterations simultaneously and therefore possibly belong to the same oncogenic pathways [66, 67]. We have implemented a sample clustering algorithm using the LOH data of one or all chromosomes [22] (Figure 11A). The sample distances are defined as the proportion of markers that have discordant Loss or Retention status. This method can discover subtypes of breast cancer and lung cancer based on LOH profiles [6, 7]. However, we have noticed that when using the LOH data of all SNPs for clustering, the result tends to be driven by the SNP markers that are in the "Retention" status across all samples. Therefore, a filtering procedure similar to that of gene expression clustering can be applied to only use those SNPs that have significant alteration scores (Section 4.1). These SNPs have excessive alterations and may harbor known or potential cancer genes, so their LOH status may be more informative for sample clustering. Following similar reasoning, we have implemented a function to filter the SNPs that are located within a certain distance of a list of specified genes (such as all known cancer genes [68] or kinase genes [69]) and use their LOH or copy number status for clustering. Similarly, chromosome regions can be clustered based on LOH or copy number data. A chromosome region can be a whole chromosome, a cytoband or the transcription and promoter region of a gene. For the 500K SNP array, a single gene may contain multiple SNPs. The inferred LOH or copy number data for SNPs in a region can be averaged to obtain the genomic alteration data of the region, and the data for regions in turn will be used for region filtering to obtain interesting regions to cluster regions and samples, similar to the SNP-wise filtering above. The advantage of this region-based clustering is that we will cluster hundreds of regions instead of 500K to one million SNPs, and chromosome regions can be defined by selected or filtered genes or cytobands to interrogate the genome at various resolutions or according to a specific set of genes. For LOH data, we can define the distance between two regions as the average absolute value of
(LOH probability of region 1 − LOH probability of region 2) across samples; for copy number data, this distance is defined as the average absolute value of ((copy number of region 1 − copy number of region 2) / the larger of the two) across samples. Average linkage can be used to merge chromosome region and sample clusters. Such combined sample and region clustering may reveal different alteration events characteristic of tumor subtypes, as well as co-occurring or mutually exclusive chromosome regions that harbor genes in the same or different oncogenic pathways. For example, clustering methods have been applied to array-based copy number data of multiple myeloma to correlate genome alterations with clinical features [70].

Figure 11: Clustering the same set of breast cancer samples using SNP LOH data (A) and expression data (B). A similar sample cluster emerges in both clustering figures (note the left-most sample branches highlighted in blue). The blue/yellow colors indicate LOH/retention events (A), and the red/blue colors indicate high or low expression levels (B). The labels below the sample names are lymph node status (p: positive, n: negative). Data courtesy of Zhigang Wang and Andrea Richardson
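The two region distances defined above translate directly into code; the sketch below uses a hypothetical region-by-sample matrix and feeds the LOH distance into standard average-linkage clustering (the copy number distance can be substituted in the same way).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Hypothetical region x sample matrices.
loh_prob = np.array([[0.9, 0.8, 0.1], [0.2, 0.1, 0.0], [0.85, 0.9, 0.2]])
copy_num = np.array([[1.0, 1.2, 2.0], [2.0, 2.1, 1.9], [3.8, 4.0, 2.1]])

def loh_region_dist(a, b):
    # Average |P_LOH(region 1) - P_LOH(region 2)| across samples.
    return np.mean(np.abs(a - b))

def cn_region_dist(a, b):
    # Average |CN1 - CN2| / max(CN1, CN2) across samples (alternative distance).
    return np.mean(np.abs(a - b) / np.maximum(a, b))

n = loh_prob.shape[0]
d = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d[i, j] = d[j, i] = loh_region_dist(loh_prob[i], loh_prob[j])

tree = linkage(squareform(d), method="average")   # average-linkage region clustering
print(tree)
```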
4.3 Integrating SNP array data with gene expression data
In recent years, high-throughput technologies (gene expression microarrays, SNP arrays, array CGH, etc.) have generated abundant high-resolution genomic data across different platforms. This has created a critical need for an integrative approach to cross-platform comparison of data for a comprehensive understanding of the genome. Although there is limited availability of software packages for such comparison (Section 5), several studies have been reported so far showing significant correlation between SNP array or array CGH inferred DNA copy number alterations and gene expression profiling data [71-73]. Such integrated analyses clearly indicated the effect of chromosomal structural defects (DNA copy gain or loss) on functional patterns (mRNA transcription) and thus helped identify novel candidate target genes, classify tumors, and predict treatment response, survival rate and metastatic risk [74, 75]. From these results, it is important to note the complexity of the relationship between copy number alterations and gene expression.
This relationship can be direct, meaning copy gain/loss is associated with up/down-regulation of the genes in the same region [76], or indirect, where gene expression changes are noted at a different location or even on different chromosomes [72, 77]. Below we outline two exemplary studies. Garraway et al. [75] used 100K SNP arrays to characterize the genomes of the NCI-60 set of cell lines, which contain nine different cancer types. Hierarchical clustering identified several sample groups with distinct copy number alterations, including a subgroup of melanoma cell lines with amplification within the chromosome arm 3p. Supervised analysis of gene expression data from the same cell lines revealed a set of genes whose expression was significantly upregulated in cell lines with 3p amplification, compared to the cell lines without this amplification. There was only one gene, MITF, that both had high relative expression in the cells with 3p amplification and was located within the amplified region. Subsequently, with the use of polymerase chain reaction (PCR) on DNA samples, MITF amplification was observed in 3 of 30 (10%) primary tumors and 7 of 32 (22%) metastatic tumors, but not in the ten benign nevi tested. Together these data suggested MITF, a transcription factor that is a key regulator of melanocyte differentiation and survival, as the probable oncogene targeted by this genetic alteration. As another example, clustering the expression array data of a set of breast cancer samples revealed two major clusters (Figure 11B), one of which contains six lymph node negative (LNN) samples. Interestingly, the LOH-based sample clustering of the same samples also has a subcluster of LNN samples (Figure 11A) (data from [7]). The two LNN clusters are both significant in their respective clustering analyses by hyper-geometric test, and they have four samples in common. This similar clustering by both expression and LOH data may suggest a mechanism through which the LOH events in chromosome 5q lead to the expression changes of multiple genes, which in turn result in the clinical feature of being LNN. Such a hypothesis from integrative analysis of microarray data can then be tested by more data or experiments. One can also check the intersection of chromosome 5q genes and the genes differentially expressed between LNN and the remaining samples. This may lead to finding the driver cancer genes that are targeted by the 5q LOH events.
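As a toy illustration of such integrative analysis, one can correlate each gene's copy number with its expression across samples; note that in the MITF example above the comparison was instead a supervised analysis between samples with and without the 3p amplification. The data below are simulated.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n_genes, n_samples = 5, 30
copy = rng.normal(2, 0.6, size=(n_genes, n_samples))                # per-gene copy numbers
expr = 0.8 * copy + rng.normal(0, 0.5, size=(n_genes, n_samples))   # simulated dosage effect

# Rank genes by the copy number / expression correlation across samples.
for g in range(n_genes):
    r, p = pearsonr(copy[g], expr[g])
    print(f"gene {g}: r = {r:.2f}, p = {p:.1e}")
```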
5 Software for cancer alteration analysis using SNP arrays

5.1 The dChip software for analyzing SNP array data
We have developed the DNA-Chip Analyzer (dChip) software package [22, 36, 78] for the analysis of oligonucleotide gene expression and SNP microarray data. It contains low-level analysis methods such as invariant-set normalization and model-based signal computation, as well as high-level analysis functions such as hierarchical clustering, Gene Ontology analysis and time course data analysis. It has a user-friendly interface and many data views for visualizing array images, probe signal values, clustering and chromosome data. The software is freely available to the research community (www.dchip.org). It has established itself as one of the standard analysis software packages for analyzing gene expression and SNP microarray data. Many visualization and analysis functions introduced in this chapter are implemented in dChip, such as automated reading of SNP array probe data and genotype calls, visualizing probe data (Figure 6), visualizing SNP, LOH and copy number data along chromosomes in the context of genes and cytobands (Figures 3 and 7), inferring LOH without using paired normal samples (Figure 5), inferring copy numbers from probe-level data (Figure 7), making statistical inference to identify significantly shared LOH regions (Figure 10), and clustering samples based on LOH and copy number data and correlating the clustering results to clinical variables (Figure 11).
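As a rough illustration of the invariant-set idea behind the normalization mentioned above (a simplification for exposition, not dChip's exact algorithm), one can select probes whose ranks barely differ between a baseline array and the array being normalized and map intensities through a curve fitted on that set; the rank tolerance and binning below are arbitrary choices.

```python
import numpy as np

def invariant_set_normalize(target, baseline, rank_tol=0.01, n_bins=100):
    """Normalize `target` toward `baseline` using a rank-invariant probe set."""
    target = np.asarray(target, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    n = len(target)
    rank_t = np.argsort(np.argsort(target)) / n      # ranks scaled to [0, 1)
    rank_b = np.argsort(np.argsort(baseline)) / n
    invariant = np.abs(rank_t - rank_b) < rank_tol   # nearly rank-invariant probes
    x, y = target[invariant], baseline[invariant]
    order = np.argsort(x)
    # Crude normalization curve: bin medians along the invariant set,
    # then interpolate the mapping for every probe on the target array.
    n_bins = max(1, min(n_bins, int(invariant.sum())))
    xb = [np.median(c) for c in np.array_split(x[order], n_bins)]
    yb = [np.median(c) for c in np.array_split(y[order], n_bins)]
    return np.interp(target, xb, yb)
```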
5.2 Other software packages
In this section we list several other software packages for SNP array and integrative microarray analysis.
• CNAG (Copy Number Analyzer for GeneChip, http://www.genome.umin.jp/CNAGtop2.html). Another SNP-array copy number analysis software package. It also provides non-paired allelic composition analysis.
• SNPscan and SNPtrio (http://pevsnerlab.kennedykrieger.org/). SNPscan is a web-accessible tool that allows users to upload SNP datasets and perform data analysis and data visualization. SNPtrio can analyze family trio data to locate regions of uniparental inheritance (UPI) and Mendelian inconsistency (MI), identify the parental types (paternal vs. maternal), and assess the associated statistical probability of occurrence by chance.
• SIGMA (System for Integrative Genomic Microarray Analysis, http://sigma.bccrc.ca). A web-based Java application which provides easy and interactive online viewing of high-resolution array CGH data. The main functions include interrogation of a single sample, visualization and analysis of a single group of samples, comparative analysis of two groups of samples, and integration of data from multiple platforms.
• CNAT (the Affymetrix Chromosome Copy Number Analysis Tool, http://www.affymetrix.com/products/software/specific/cnat.affx). It implements algorithms to identify chromosomal gains and losses and LOH. CNAT is also integrated into the Affymetrix GeneChip Genotyping Analysis Software (GTYPE) to allow genotype calling and copy number calculations to be performed within one software application.
• ACE-it (Array CGH Expression integration tool, http://zeus.cs.vu.nl/programs/acewww/). ACE-it links the chromosomal position of the gene copy number measured by array CGH to the genes measured by the expression array. ACE-it uses this link to statistically test whether gene copy number affects RNA expression.
6 Prospects
We have presented the major analysis methods by which SNP arrays can be used to discover cancer genome alterations. SNP arrays have also been widely used in linkage and association studies. The discovery of frequent copy number variations (CNVs) in the human genome has further stimulated the use of these arrays to identify CNVs and correlate them with disease phenotypes, haplotypes and gene expression. Although high-throughput sequencing technologies are likely to displace array-based genomics methods, we envision that SNP arrays will be around for some time as their density increases while their price drops. Although many analysis methods have been developed, powerful and user-friendly analysis software packages encompassing the new methods need to be developed and maintained for daily use by biologists and data analysts. Web-based software tools have emerged as convenient solutions, but their processing speed may be slow. Software solutions for integrating copy number, gene expression and other genomics data types are also urgently needed and under active development. Nowadays many large-scale datasets are freely available in public databases such as Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) and ArrayExpress (www.ebi.ac.uk/arrayexpress). This greatly facilitates reanalysis or meta-analysis of published studies for developing analysis methods or software as well as generating new biological insights. We welcome and encourage the readers of this chapter to explore these research frontiers in this exciting genomics era.
Acknowledgements This chapter contains the work of several published papers [21, 22, 26, 40, 75]. We thank our collaborators in these projects for their valuable insights and contributions: Ming Lin, Rameen Beroukhim, Xiaojun Zhao, Barbara A. Weir, Levi A. Garraway, Edward A. Fox, Wing Hung Wong, Nikhil Munshi, William R. Sellers, and Matthew Meyerson.
References
[1] A. G. Knudson: Two genetic hits (more or less) to cancer. Nat Rev Cancer 2001, 1:157-162.
[2] J. Li, C. Yen, D. Liaw, K. Podsypanina, S. Bose, S. I. Wang, J. Puc, C. Miliaresis, L. Rodgers, R. McCombie, et al: PTEN, a putative protein tyrosine phosphatase gene mutated in human brain, breast, and prostate cancer. Science 1997, 275:1943-1947.
[3] P. P. Di Fiore, J. H. Pierce, M. H. Kraus, O. Segatto, C. R. King, S. A. Aaronson: erbB-2 is a potent oncogene when overexpressed in NIH/3T3 cells. Science 1987, 237:178-182.
[4] C. G. Mullighan, S. Goorha, I. Radtke, C. B. Miller, E. Coustan-Smith, J. D. Dalton, K. Girtman, S. Mathew, J. Ma, S. B. Pounds, et al: Genome-wide analysis of genetic alterations in acute lymphoblastic leukaemia. Nature 2007, 446:758-764.
[5] B. A. Weir, M. S. Woo, G. Getz, S. Perner, L. Ding, R. Beroukhim, W. M. Lin, M. A. Province, A. Kraja, L. A. Johnson, et al: Characterizing the cancer genome in lung adenocarcinoma. Nature 2007, 450:893-898.
[6] P. A. Janne, C. Li, X. Zhao, L. Girard, T. H. Chen, J. Minna, D. C. Christiani, B. E. Johnson, M. Meyerson: High-resolution single-nucleotide polymorphism array and clustering analysis of loss of heterozygosity in human lung cancer cell lines. Oncogene 2004, 23:2716-2726.
[7] Z. C. Wang, M. Lin, L. J. Wei, C. Li, A. Miron, G. Lodeiro, L. Harris, S. Ramaswamy, D. M. Tanenbaum, M. Meyerson, et al: Loss of heterozygosity and its correlation with expression profiles in subclasses of invasive breast cancers. Cancer Res 2004, 64:64-71.
[8] E. Schrock, S. du Manoir, T. Veldman, B. Schoell, J. Wienberg, M. A. Ferguson-Smith, Y. Ning, D. H. Ledbetter, I. Bar-Am, D. Soenksen, et al: Multicolor spectral karyotyping of human chromosomes. Science 1996, 273:494-497.
[9] T. L. Wang, C. Maierhofer, M. R. Speicher, C. Lengauer, B. Vogelstein, K. W. Kinzler, V. E. Velculescu: Digital karyotyping. Proc Natl Acad Sci U S A 2002, 99:16156-16161.
[10] D. Pinkel, R. Segraves, D. Sudar, S. Clark, I. Poole, D. Kowbel, C. Collins, W. L. Kuo, C. Chen, Y. Zhai, et al: High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet 1998, 20:207-211.
[11] A. Kallioniemi, O. P. Kallioniemi, D. Sudar, D. Rutovitz, J. W. Gray, F. Waldman, D. Pinkel: Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science 1992, 258:818-821.
[12] J. R. Pollack, C. M. Perou, A. A. Alizadeh, M. B. Eisen, A. Pergamenschikov, C. F. Williams, S. S. Jeffrey, D. Botstein, P. O. Brown: Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nat Genet 1999, 23:41-46.
[13] K. Lindblad-Toh, D. M. Tanenbaum, M. J. Daly, E. Winchester, W. O. Lui, A. Villapakkam, S. E. Stanton, C. Larsson, T. J. Hudson, B. E. Johnson, et al: Loss-of-heterozygosity analysis of small-cell lung carcinomas using single-nucleotide polymorphism arrays. Nat Biotechnol 2000, 18:1001-1005.
[14] G. C. Kennedy, H. Matsuzaki, S. Dong, W. M. Liu, J. Huang, G. Liu, X. Su, M. Cao, W. Chen, J. Zhang, et al: Large-scale genotyping of complex DNA. Nat Biotechnol 2003, 21:1233-1237.
[15] H. Matsuzaki, S. Dong, H. Loi, X. Di, G. Liu, E. Hubbell, J. Law, T. Berntsen, M. Chadha, H. Hui, et al: Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays. Nat Methods 2004, 1:109-111.
[16] K. W. Broman, E. Feingold: SNPs made routine. Nat Methods 2004, 1:104-105.
[17] W. M. Liu, X. Di, G. Yang, H. Matsuzaki, J. Huang, R. Mei, T. B. Ryder, T. A. Webster, S. Dong, G. Liu, et al: Algorithms for large-scale genotyping microarrays. Bioinformatics 2003, 19:2397-2403.
[18] R. Mei, P. C. Galipeau, C. Prass, A. Berno, G. Ghandour, N. Patil, R. K. Wolff, M. S. Chee, B. J. Reid, D. J. Lockhart: Genome-wide detection of allelic imbalance using human SNPs and high-density DNA arrays. Genome Res 2000, 10:1126-1137.
[19] K. K. Wong, Y. T. Tsang, J. Shen, R. S. Cheng, Y. M. Chang, T. K. Man, C. C. Lau: Allelic imbalance analysis by high-density single-nucleotide polymorphic allele (SNP) array with whole genome amplified DNA. Nucleic Acids Res 2004, 32:e69.
[20] G. R. Bignell, J. Huang, J. Greshock, S. Watt, A. Butler, S. West, M. Grigorova, K. W. Jones, W. Wei, M. R. Stratton, et al: High-resolution analysis of DNA copy number using oligonucleotide microarrays. Genome Res 2004, 14:287-295.
[21] X. Zhao, C. Li, J. G. Paez, K. Chin, P. A. Janne, T. H. Chen, L. Girard, J. Minna, D. Christiani, C. Leo, et al: An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Res 2004, 64:3060-3071.
[22] M. Lin, L. J. Wei, W. R. Sellers, M. Lieberfarb, W. H. Wong, C. Li: dChipSNP: significance curve and clustering of SNP-array-based loss-of-heterozygosity data. Bioinformatics 2004, 20:1233-1240.
[23] P. Devilee, A. M. Cleton-Jansen, C. J. Cornelisse: Ever since Knudson. Trends Genet 2001, 17:569-573.
[24] E. K. Goldberg, J. M. Glendening, Z. Karanjawala, A. Sridhar, G. J. Walker, N. K. Hayward, A. J. Rice, D. Kurera, Y. Tebha, J. W. Fountain: Localization of multiple melanoma tumor-suppressor genes on chromosome 11 by use of homozygosity mapping-of-deletions analysis. Am J Hum Genet 2000, 67:417-431.
[25] R. Dugad, U. B. Desai: A tutorial on hidden Markov models. Technical Report No. SPANN-96.1, 1996.
[26] R. Beroukhim, M. Lin, K. Hao, X. Zhao, L. A. Garraway, E. A. Fox, E. P. Hochberg, M. D. Hofer, A. Descazeaud, M. A. Rubin, et al: Inferring loss-of-heterozygosity from tumor-only samples using high-density oligonucleotide SNP arrays. PLoS Comput Biol 2006, 2:e41.
[27] C. Burge, S. Karlin: Prediction of complete gene structures in human genomic DNA. J Mol Biol 1997, 268:78-94.
[28] G. A. Churchill: Stochastic models for heterogeneous DNA sequences. Bull Math Biol 1989, 51:79-94.
[29] R. Durbin, S. Eddy, A. Krogh, G. Mitchison: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge: Cambridge University Press; 1999.
[30] E. S. Lander, P. Green: Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci U S A 1987, 84:2363-2367.
[31] K. Lange: Mathematical and Statistical Methods for Genetic Analysis, 2nd edn. New York: Springer-Verlag; 2002.
[32] J. Fridlyand, A. M. Snijders, D. Pinkel, D. G. Albertson, A. N. Jain: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis 2004, 90:132-153.
[33] H. Lodish, A. Berk, S. L. Zipursky, P. Matsudaira, D. Baltimore, J. Darnell: Molecular Cell Biology, 4th edn. New York: W. H. Freeman & Company; 1999.
[34] A. J. Jeffreys, C. A. May: Intense and highly localized gene conversion activity in human meiotic crossover hot spots. Nat Genet 2004, 36:151-156.
[35] C. Li, W. H. Wong: Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol 2001, 2:RESEARCH0032.
[36] C. Li, W. H. Wong: Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci U S A 2001, 98:31-36.
[37] T. LaFramboise, B. Weir, X. Zhao, R. Beroukhim, C. Li, D. Harrington, W. R. Sellers, M. Meyerson: Allele-specific amplification in cancer revealed by SNP array analysis. PLoS Comput Biol 2005, 1:e65.
[38] J. Huang, W. Wei, J. Chen, J. Zhang, G. Liu, X. Di, R. Mei, S. Ishikawa, H. Aburatani, K. W. Jones, et al: CARAT: a novel method for allelic detection of DNA copy number changes using high density oligonucleotide arrays. BMC Bioinformatics 2006, 7:83.
[39] Y. Nannya, M. Sanada, K. Nakazaki, N. Hosoya, L. Wang, A. Hangaishi, M. Kurokawa, S. Chiba, D. K. Bailey, G. C. Kennedy, et al: A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res 2005, 65:6071-6079.
[40] C. Li, R. Beroukhim, B. A. Weir, W. Winckler, L. A. Garraway, W. R. Sellers, M. Meyerson: Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008, 9:204.
[41] J. Sebat, B. Lakshmi, J. Troge, J. Alexander, J. Young, P. Lundin, S. Maner, H. Massa, M. Walker, M. Chi, et al: Large-scale copy number polymorphism in the human genome. Science 2004, 305:525-528.
[42] A. J. Iafrate, L. Feuk, M. N. Rivera, M. L. Listewnik, P. K. Donahoe, Y. Qi, S. W. Scherer, C. Lee: Detection of large-scale variation in the human genome. Nat Genet 2004, 36:949-951.
[43] E. Gonzalez, H. Kulkarni, H. Bolivar, A. Mangano, R. Sanchez, G. Catano, R. J. Nibbs, B. I. Freedman, M. P. Quinones, M. J. Bamshad, et al: The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility. Science 2005, 307:1434-1440.
[44] A. B. Singleton, M. Farrer, J. Johnson, A. Singleton, S. Hague, J. Kachergus, M. Hulihan, T. Peuralinna, A. Dutra, R. Nussbaum, et al: alpha-Synuclein locus triplication causes Parkinson's disease. Science 2003, 302:841.
[45] B. E. Stranger, M. S. Forrest, M. Dunning, C. E. Ingle, C. Beazley, N. Thorne, R. Redon, C. P. Bird, A. de Grassi, C. Lee, et al: Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 2007, 315:848-853.
[46] R. Redon, S. Ishikawa, K. R. Fitch, L. Feuk, G. H. Perry, T. D. Andrews, H. Fiegler, M. H. Shapero, A. R. Carson, W. Chen, et al: Global variation in copy number in the human genome. Nature 2006, 444:444-454.
[47] M. Jakobsson, S. W. Scholz, P. Scheet, J. R. Gibbs, J. M. VanLiere, H. C. Fung, Z. A. Szpiech, J. H. Degnan, K. Wang, R. Guerreiro, et al: Genotype, haplotype and copy-number variation in worldwide human populations. Nature 2008, 451:998-1003.
[48] A. J. Sharp, D. P. Locke, S. D. McGrath, Z. Cheng, J. A. Bailey, R. U. Vallente, L. M. Pertz, R. A. Clark, S. Schwartz, R. Segraves, et al: Segmental duplications and copy-number variation in the human genome. Am J Hum Genet 2005, 77:78-88.
[49] C. R. Marshall, A. Noor, J. B. Vincent, A. C. Lionel, L. Feuk, J. Skaug, M. Shago, R. Moessner, D. Pinto, Y. Ren, et al: Structural variation of chromosomes in autism spectrum disorder. Am J Hum Genet 2008, 82:477-488.
[50] P. Szatmari, A. D. Paterson, L. Zwaigenbaum, W. Roberts, J. Brian, X. Q. Liu, J. B. Vincent, J. L. Skaug, A. P. Thompson, L. Senman, et al: Mapping autism risk loci using genetic linkage and chromosomal rearrangements. Nat Genet 2007, 39:319-328.
[51] J. M. Friedman, A. Baross, A. D. Delaney, A. Ally, L. Arbour, L. Armstrong, J. Asano, D. K. Bailey, S. Barber, P. Birch, et al: Oligonucleotide microarray analysis of genomic imbalance in children with mental retardation. Am J Hum Genet 2006, 79:500-513.
[52] A. B. Olshen, E. S. Venkatraman, R. Lucito, M. Wigler: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004, 5:557-572.
[53] T. Huang, B. Wu, P. Lizardi, H. Zhao: Detection of DNA copy number alterations using penalized least squares regression. Bioinformatics 2005, 21:3811-3817.
[54] W. R. Lai, M. D. Johnson, R. Kucherlapati, P. J. Park: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005, 21:3763-3770.
[55] S. Ishikawa, D. Komura, S. Tsuji, K. Nishimura, S. Yamamoto, B. Panda, J. Huang, M. Fukayama, K. W. Jones, H. Aburatani: Allelic dosage analysis with genotyping microarrays. Biochem Biophys Res Commun 2005, 333:1309-1314.
[56] G. Yamamoto, Y. Nannya, M. Kato, M. Sanada, R. L. Levine, N. Kawamata, A. Hangaishi, M. Kurokawa, S. Chiba, D. G. Gilliland, et al: Highly sensitive method for genomewide detection of allelic composition in nonpaired, primary tumor specimens by use of Affymetrix single-nucleotide-polymorphism genotyping microarrays. Am J Hum Genet 2007, 81:114-126.
[57] M. A. Newton: Discovering combinations of genomic alterations associated with cancer. Journal of the American Statistical Association 2002, 97:931.
[58] M. A. Newton, Y. Lee: Inferring the location and effect of tumor suppressor genes by instability-selection modeling of allelic-loss data. Biometrics 2000, 56:1088-1097.
[59] B. J. Miller, D. Wang, R. Krahe, F. A. Wright: Pooled analysis of loss of heterozygosity in breast cancer: a genome scan provides comparative evidence for multiple tumor suppressors and identifies novel candidate regions. Am J Hum Genet 2003, 73:748-767.
[60] F. Jiang, R. Desper, C. H. Papadimitriou, A. A. Schaffer, O. P. Kallioniemi, J. Richter, P. Schraml, G. Sauter, M. J. Mihatsch, H. Moch: Construction of evolutionary tree models for renal cell carcinoma from comparative genomic hybridization data. Cancer Res 2000, 60:6503-6509.
[61] P. Wang, Y. Kim, J. Pollack, B. Narasimhan, R. Tibshirani: A method for calling gains and losses in array CGH data. Biostatistics 2005, 6:45-58.
[62] P. H. Westfall, S. S. Young: Resampling-based Multiple Testing: Examples and Methods for P-value Adjustment. New York: Wiley; 1993.
[63] R. Beroukhim, G. Getz, L. Nghiemphu, J. Barretina, T. Hsueh, D. Linhart, I. Vivanco, J. C. Lee, J. H. Huang, S. Alexander, et al: Assessing the significance of chromosomal aberrations in cancer: methodology and application to glioma. Proc Natl Acad Sci U S A 2007, 104:20007-20012.
[64] M. B. Eisen, P. T. Spellman, P. O. Brown, D. Botstein: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 1998, 95:14863-14868.
[65] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, et al: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286:531-537.
[66] L. Girard, S. Zochbauer-Muller, A. K. Virmani, A. F. Gazdar, J. D. Minna: Genome-wide allelotyping of lung cancer identifies new regions of allelic loss, differences between small cell lung cancer and non-small cell lung cancer, and loci clustering. Cancer Res 2000, 60:4894-4906.
[67] A. H. Bild, G. Yao, J. T. Chang, Q. Wang, A. Potti, D. Chasse, M. B. Joshi, D. Harpole, J. M. Lancaster, A. Berchuck, et al: Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 2006, 439:353-357.
[68] P. A. Futreal, L. Coin, M. Marshall, T. Down, T. Hubbard, R. Wooster, N. Rahman, M. R. Stratton: A census of human cancer genes. Nat Rev Cancer 2004, 4:177-183.
[69] G. Manning, D. B. Whyte, R. Martinez, T. Hunter, S. Sudarsanam: The protein kinase complement of the human genome. Science 2002, 298:1912-1934.
[70] D. R. Carrasco, G. Tonon, Y. Huang, Y. Zhang, R. Sinha, B. Feng, J. P. Stewart, F. Zhan, D. Khatry, M. Protopopova, et al: High-resolution genomic profiles define distinct clinico-pathogenetic subgroups of multiple myeloma patients. Cancer Cell 2006, 9:313-325.
[71] Y. H. Kim, L. Girard, C. P. Giacomini, P. Wang, T. Hernandez-Boussard, R. Tibshirani, J. D. Minna, J. R. Pollack: Combined microarray analysis of small cell lung cancer reveals altered apoptotic balance and distinct expression signatures of MYC family gene amplification. Oncogene 2006, 25:130-138.
[72] H. Lee, S. W. Kong, P. J. Park: Integrative analysis reveals the direct and indirect interactions between DNA copy number aberrations and gene expression changes. Bioinformatics 2008.
[73] J. Yao, S. Weremowicz, B. Feng, R. C. Gentleman, J. R. Marks, R. Gelman, C. Brennan, K. Polyak: Combined cDNA array comparative genomic hybridization and serial analysis of gene expression analysis of breast tumor progression. Cancer Res 2006, 66:4065-4078.
[74] M. Lastowska, V. Viprey, M. Santibanez-Koref, I. Wappler, H. Peters, C. Cullinane, P. Roberts, A. G. Hall, D. A. Tweddle, A. D. Pearson, et al: Identification of candidate genes involved in neuroblastoma progression by combining genomic and expression microarrays with survival data. Oncogene 2007, 26:7432-7444.
[75] L. A. Garraway, H. R. Widlund, M. A. Rubin, G. Getz, A. J. Berger, S. Ramaswamy, R. Beroukhim, D. A. Milner, S. R. Granter, J. Du, et al: Integrative genomic analyses identify MITF as a lineage survival oncogene amplified in malignant melanoma. Nature 2005, 436:117-122.
[76] D. Tsafrir, M. Bacolod, Z. Selvanayagam, I. Tsafrir, J. Shia, Z. Zeng, H. Liu, C. Krier, R. F. Stengel, F. Barany, et al: Relationship of gene expression and chromosomal abnormalities in colorectal cancer. Cancer Res 2006, 66:2129-2137.
[77] J. M. Nigro, A. Misra, L. Zhang, I. Smirnov, H. Colman, C. Griffin, N. Ozburn, M. Chen, E. Pan, D. Koul, et al: Integrated array-comparative genomic hybridization and expression array profiles identify clinically relevant molecular subtypes of glioblastoma. Cancer Res 2005, 65:1678-1686.
[78] C. Li, W. H. Wong: DNA-Chip Analyzer (dChip). In: The Analysis of Gene Expression Data: Methods and Software. Edited by G. Parmigiani, E. S. Garrett, R. Irizarry, S. L. Zeger. New York: Springer; 2003, pp. 120-141.
[79] N. Rabbee, T. P. Speed: A genotype calling algorithm for Affymetrix SNP arrays. Bioinformatics 2006, 22:7-12.
Chapter 10
Analysis of ChIP-chip Data on Genome Tiling Microarrays
W. Evan Johnson*, Jun S. Liu†, X. Shirley Liu‡
*Department of Statistics, Brigham Young University, 223 TMCB, Provo, UT 84602, USA; Department of Oncological Sciences, University of Utah. E-mail: [email protected]
†Department of Statistics, Harvard University, 1 Oxford St., Cambridge, MA 02138, USA.
‡Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Harvard School of Public Health, CSL-11022, 44 Binney St., Boston, MA 02115, USA.
Abstract Chromatin immunoprecipitation on microarray (ChIP-chip) experiments are a powerful tool for the detection of in vivo protein-DNA binding activity essential to the regulation of gene expression. Coupled with newly introduced tiling microarrays that can interrogate select parts or even an entire genome at high resolution, ChIP-chip technology allows for the unbiased mapping of DNA-binding proteins throughout the genome. However, the increased resolution from tiling microarrays results in large, noisy, and correlated data sets that require powerful and computationally efficient analysis methods. In this chapter we first introduce ChIP-chip technology and tiling microarrays. Then we discuss methods for analyzing ChIP-chip data to obtain a list of putative protein-DNA interaction regions. Furthermore, we consider relevant follow-up analyses such as correlating ChIP-chip data with gene expression and searching for known and novel DNA-binding motifs in the putative binding regions. Keywords: Transcription factor; chromatin immunoprecipitation; tiling microarray; background subtraction; normalization; peak-finding; motif finding.
1 Background molecular biology
The DNA of every living cell contains a blueprint or set of instructions that uniquely defines the organism. This DNA can be further dissected into elements called genes, which often determine specific traits of the organism. Each gene typically contains the unique code for constructing specific proteins, which are composed of long chains of amino acids as determined by the DNA sequence of the gene. Proteins are the primary functioning molecules within a cell, performing
essential roles in biological processes such as metabolism, cell cycle, cell signaling, immune response and cell adhesion, and taking part in many other important structural or mechanical functions within the cell. Therefore, the identity, shape and function of any cell is determined primarily by which genes are expressed, or more precisely, by which proteins are present and active in the cell.

The DNA in eukaryotic cells is contained in a membrane-enclosed nucleus that keeps the DNA separate from the rest of the cell. Proteins are synthesized outside the nucleus in the cytoplasm. Because the DNA never leaves the nucleus, it is necessary to pass genetic information outside the nucleus in order to construct proteins. The protein coding information is passed from the nucleus to the rest of the cell by ribonucleic acid (RNA). RNA is created through a process called transcription (and so the RNA are often called transcripts), in which the two strands of the DNA at the gene are separated and a complementary molecule, the RNA, is constructed from one of the strands. These RNA strands then pass through the nuclear membrane and carry the protein-building blueprint from the gene outside the nucleus to ribosomes, where the RNA are translated into proteins.

One of the primary goals of active research in molecular biology is to understand the mechanisms of gene or transcription regulation. It is well known that a small subset of functioning proteins, called transcription factors (TFs), returns to the nucleus and plays a primary role in regulating the transcription of genes in the cell. TFs function by binding to the DNA in specific locations on or near genes, acting as switches to regulate the production of transcripts. Each individual TF typically prefers to bind to a small DNA sequence, usually 5-20 base pairs long, called a binding motif. Multiple variations of each binding motif can be bound by the TF. Typically researchers try to identify a consensus sequence, which can be defined as the highest-affinity binding motif. If the right combination of TFs is bound to the DNA near the gene, the proper transcription machinery is recruited and the gene is actively transcribed. One of the most well characterized binding motifs is the TATA box, which usually contains the DNA sequence 'TATAA' followed by three or more As. This sequence is found in the upstream sequence, or promoter region, of genes in almost all organisms. It is typically located very close to the transcription start site and is known to assist in directing the transcription machinery to the starting site.

For researchers working to understand transcription regulatory networks, one of the most important objectives is to identify target genes that are directly regulated by the TF of interest. However, due to the complex nature of most transcription regulatory networks, this is not an easy task. Until recently, the most common method of finding target genes was to find the genes whose expression is correlated with the TF of interest and scan the promoter regions of these genes for the known binding motifs of the TF. However, the presence of the binding motif does not imply that the protein is actively binding the DNA. Also, identifying genes correlated with the TF was often difficult when the number of gene expression experiments was limited or when expression could be measured for only a few genes at a time.
Additionally, the relevant binding motifs (and their variations) for the TF of interest may not have been well characterized, which made the process even more difficult. A more direct method to find binding sites is to utilize a ChIP-chip experiment, which is described in detail below. A ChIP-chip experiment can isolate regions of approximately 500-1000 base pairs that contain at least one TF-DNA interaction site. With these regions, researchers can scan for known motifs and also locate and characterize new motifs. These regions can be mapped to nearby genes or be correlated with gene expression data to determine the genes directly regulated by the TF. Additionally, using a ChIP-chip experiment it is possible to identify other DNA-binding proteins, or cofactors, that are directly interacting with the TF of interest. This chapter will first introduce the ChIP-chip experiment: the protocol and the types and platforms of microarrays that can be utilized in the experiment. Then the data will be described and many of the analysis algorithms that have been applied to ChIP-chip experiments will be reviewed. Finally, this work will discuss the follow-up analysis, meaning the steps one needs to take after localizing the TF binding regions, in order to find existing and new motifs, find target genes and detect protein cofactors.
2 A ChIP-chip experiment
To determine the in vivo binding locations of a specific TF or protein of interest, biologists often utilize a recently developed technology, chromatin immunoprecipitation (ChIP) on a DNA microarray (chip), usually denoted as a ChIP-on-chip or ChIP-chip experiment. Such an experiment allows for the unbiased mapping of protein-DNA binding sites across the entire genome (depending on the coverage of the microarray used). ChIP-chip experiments have been used successfully to characterize the binding properties of many DNA-binding proteins, such as proteins involved in activating/repressing transcription [8, 7, 25] and proteins important for determining chromatin structure [41, 34]. An excellent review of ChIP-chip experiments is given in [3]. In a ChIP-chip experiment, a complete DNA sample is enriched with the protein-DNA binding sequences of interest (ChIP) and then hybridized on a whole-genome DNA tiling microarray (chip). The putative protein binding sites can then be inferred from the data produced by the ChIP-chip experiment. Figure 1 contains a diagram of the processes involved in a ChIP-chip experiment and the following subsections give more details about the experiment.
2.1 Chromatin immunoprecipitation
Figure 1: Description of a ChIP-chip experiment: cross-link DNA in vivo and lyse cells; shear and extract DNA, introduce antibody; pull out DNA-protein-antibody complexes; wash away antibodies and proteins, amplify and label the sample; hybridize on a tiling microarray.

To enrich a DNA sample with the TF-bound fragments, experimenters can apply a technique called chromatin immunoprecipitation (ChIP). To conduct a ChIP experiment, a researcher will typically start with a large number of cells, usually on the order of 10^8 cells. The cells are first treated with formaldehyde, which cross-links the protein-DNA interaction sites by creating a covalent bond between the protein and DNA. This cross-linking step also cross-links RNA-protein bonds and protein-protein bonds, which allows researchers to simultaneously extract regions that are bound to the DNA-binding cofactors of the particular TF. After cross-linking, the cells are lysed, and the DNA is sheared into smaller segments through sonication, in which the DNA is cut by introducing high-frequency sound waves. This should slice the DNA into segments around 100-600 bps in size. At this point, a small amount of DNA is retained for the input or control samples. Next, the researchers introduce antibodies that recognize the TF in the sample. These antibodies should have been independently prepared and mounted on magnetic or agarose beads. After the antibodies have had adequate time to bind to the TF, which is still bound to the DNA, the antibody-TF-DNA complexes are pulled out or precipitated from the sample using a magnet (for magnetic beads) or a centrifuge (for the agarose beads). After precipitation, the beads and antibodies are washed away, the DNA-protein bonds are
broken and the DNA is purified. Often this leaves only a small amount of DNA in the sample, which is not enough to hybridize on a microarray, so the fragments must be amplified. The control samples should also be amplified in the same manner, omitting the immunoprecipitation step. If the immunoprecipitation worked perfectly, a ChIP experiment would only extract fragments bound to the TF of interest. Assuming a random shear distribution from the sonication procedure, if one measured the abundance of DNA relative to location in the genome, one would observe a peak at the binding site with an exponential decay on either side. However, the antibodies often non-specifically bind to incorrect molecules, and the pull-out process does not perfectly segregate the antibody-bound fragments from those not bound. A typical ChIP experiment will result in a DNA sample containing a background distribution of fragments from the entire genome, but with more copies of, or enriched with, sequences around the actual binding location.
2.2 Hybridization to a tiling microarray
After the ChIP experiment is completed and the DNA has been amplified, one needs to measure the abundance of each fragment in the sample and map these measurements to their genomic locations. This is usually done by labeling the DNA fragments and hybridizing them to a genome tiling microarray with probes that cover the entire genome or at least the genomic regions of interest. More details concerning tiling microarrays are given in the following sections.
2.3 Early ChIP-chip experiments
The first ChIP-chip experiments [35, 20, 30] were conducted using yeast cells on proximal promoter microarrays. These microarrays were cDNA based and were constructed by spotting PCR products on a slide. The cDNA clones were typically designed so they mapped back to -700 bp to +200 bp from the transcription start site (TSS) of a gene of interest. Using these microarrays, researchers were able to successfully characterize all TFs in yeast [15]. However, these microarrays had limited coverage, and active TFs in eukaryotes can bind very far away from the gene [7]. In addition to the possibility of missing the binding region altogether, the limited number of probes on the microarray provides poor resolution and creates challenges for motif finding. Thus for more complex organisms, such as mammals, these promoter microarrays do not work well.
2.4 Commercial tiling microarrays
Recently, commercial tiling microarrays have been developed that tile entire genomes, or broad targeted regions of a genome, at a very high resolution [23]. Tiling microarrays get their name from microarrays that were constructed using overlapping probes, like tiles on a roof. Since then, many different types of tiling microarrays have been developed. Although not all of these contain overlapping probes, these microarrays are designed to interrogate the genome at a very high resolution. Commercial tiling microarrays are available in a variety of platforms and designs. Platforms have been developed for one- and two-color experiments. Probes on the array range from short (~25 bp) oligonucleotides to long (~50-60 bp) oligonucleotides, and microarrays are designed to cover genomes at a number of different resolutions. A few notable companies, including those mentioned in the following paragraphs, offer several different microarray platforms and designs that are useful for ChIP-chip experiments. Examples include promoter microarrays, which cover several thousand base pairs before and after the TSS, and whole-genome tiling microarrays that attempt coverage of the entire non-repetitive genome. Additionally, custom microarrays, where the probes can be selected to cover any part of the genome at any resolution, can be designed to investigate specific genomic regions of interest.

Affymetrix (Santa Clara, CA) currently offers the densest commercial microarrays available and can put the most probes on a single microarray, up to 6 million probes. These microarrays usually come at a lower cost than the microarrays from the other companies mentioned below. The Affymetrix tiling microarrays are typically single-channel arrays containing 25-mer oligonucleotide probes that tile at different resolutions for different organisms. The Affymetrix human genome tiling microarrays usually contain one probe for every 35 base pairs in the non-repeat regions of the genome. They have developed a set of 7 microarrays that tile the entire non-repetitive human genome, and also currently offer promoter microarrays for human and mouse that tile regions -7.5 kb to +2.5 kb from the TSS. However, the Affymetrix method for microarray synthesis requires expensive masks, which makes these the least flexible microarray designs, and the short oligonucleotide probe design makes Affymetrix data the noisiest of those mentioned here.

Agilent (Palo Alto, CA) has a very flexible and reliable method of synthesizing microarrays. Their microarrays are designed for two-color experiments and usually contain 60-mer probes. Their microarray synthesizing procedure, which consists of using HP InkJet technology, allows for the construction of custom microarrays, so researchers can choose the probes that are spotted on the microarray. The data from these microarrays are often very clean, but they currently can only spot up to 244 thousand probes on a microarray. Agilent offers promoter microarrays for the human and mouse genomes that tile on average -5.5 kb to +2.5 kb from the TSS. These microarrays usually come in two-microarray sets, covering nearly 17 thousand promoters with one 60-mer probe per 200 bps. Agilent also offers a human whole-genome set containing many microarrays at a substantially larger cost than the Affymetrix microarrays.

NimbleGen (Madison, WI) microarrays are also very flexible and allow for custom design. Their microarrays are suited for two-color experiments and contain ~50-mer probes. They offer full-service hybridization, meaning that experimenters can submit a sample after ChIP and amplification, and NimbleGen will label the sample, hybridize it on the microarray, scan the microarray and return the data. They also have a variety of available tiling and promoter microarrays that contain 385-700 thousand probes per microarray, and they have recently developed microarrays that contain over 2 million probes per microarray.
3 Data description and analysis
There are many new statistical and computational challenges facing researchers attempting to analyze data from ChIP-chip experiments on tiling microarrays. When analyzing data from more traditional microarray experiments (e.g. gene expression), researchers often assume the measurements across probes to be independent. However, for tiling microarray data, the probes map to genomic regions that overlap or are very close together, so these measurements should be highly correlated. Additionally, data from experiments on high-resolution tiling microarrays tend to be much noisier than on other microarrays, because the microarray contains only one copy of each probe and, more importantly, because the higher resolution reduces opportunities for careful probe selection. Additionally, a researcher may only be looking for a few hundred genomic binding sites, and a typical ChIP-chip experiment on a tiling microarray covering the whole human genome can produce millions of data points. Therefore ChIP-chip analysis requires highly sensitive statistical methods that filter noise and account for local correlation while balancing the logistical constraints of working with the large data sets that are produced. Microarray analysis is typically conducted in two phases: a data preprocessing step, usually called low-level analysis, followed by high-level analysis that usually answers the particular question that researchers are posing. Although some sophisticated methods have been applied that integrate these low- and high-level analyses, the two-phase approach will be described here. Figure 2 is a workflow diagram for a standard ChIP-chip data analysis.
Figure 2: Workflow diagram for ChIP-chip data analysis: ChIP-chip experiment; low-level analysis (normalization, background correction); high-level analysis (peak-finding); followed by mapping regions to genes, correlating with gene expression, GO analysis, and motif finding (known motifs, de novo searching).
3.1 Low-level analysis
In gene-expression microarray experiments, the preprocessing typically consists of two parts. The first part, called normalization, accounts for sample effects by forcing either the overall brightness of the microarray images or the distribution of intensity values to be similar across microarrays. One of the most common normalization procedures for gene expression array experiments is called quantile normalization [2], which adjusts the distributions of signals from all samples produced by the experiment to be the same, under the assumption that the overall array brightness and distribution should be exactly the same across experiments. However, in ChIP-chip experiments, ChIP enrichment is expected to inflate the upper tail of the distribution of the treatment samples as compared to the control samples, so this method may reduce the significance of many of the highly enriched regions; therefore quantile normalization may not be ideal for normalizing data from ChIP-chip experiments. This method could, however, be applied to ChIP-chip experiments by not adjusting the whole distribution, but by adjusting the median and some of the percentiles (such as the 25th and 75th percentiles) or the minimum-length inter-quartile range so that the upper tail of the distribution is not disturbed.

The second phase of a typical preprocessing step is to account for probe-specific background or probe effects and to summarize the observations from other identical probes on the microarrays. Many have suggested methods for standard microarrays [17, 28, 19], but since most tiling microarrays do not contain multiple copies of each probeset (as probes are continuous on the genome without definition of probesets), these methods are not applicable. Li et al. [29] developed one of the first probe background adjustment methods for ChIP-chip experiments. They used ChIP-chip data from multiple experiments (for different TFs) conducted on tiling arrays with the same array design. They assumed that the log-intensity values from probe $i$ follow an $N(\mu_i, \sigma_i^2)$ distribution, where $\mu_i$ and $\sigma_i^2$ are estimated using the probe $i$ log-intensity values across the multiple experiments. However, this method requires the availability of multiple experiments on the same array design, which are not always available. Huber et al. [18] developed an empirically-based normalization method that only requires a few arrays for background estimation. For each experiment, they recommend hybridizing a few control replicates consisting of labeled, sonicated DNA. They assumed that the intensity of the $i$th probe, $y_i$, follows the relationship $y_i = \beta_i + a_i x_i$, where $x_i$ is proportional to the specific abundance of the target molecule (and is the value of interest) and $\beta_i$ and $a_i$ are probe-specific additive and multiplicative background noise, respectively. The estimate of $a_i$, $\hat{a}_i$, is obtained using the geometric mean of the probe $i$ intensity values from the control arrays. $\beta_i$ is estimated as $\hat{\beta}_i = f(\hat{a}_i)$, where $f$ is obtained as follows. The control probe values are grouped into strata based on the 10, 20, ..., 100% quantiles of $\hat{a}_i$ and a robust estimate of the probe values is obtained for each stratum; $f$ is given by a smoothed version of these robust estimates. The background-adjusted probe values are then given by $\hat{x}_i = (y_i - \hat{\beta}_i)/\hat{a}_i$. Huber et al. also incorporated a data adjustment to normalize across arrays (indexed by $k$) based on the transformation $y'_{ik} = (y_{ik} - d_k)/c_k$, where $d_k$ and $c_k$ are an array-specific offset and relative scaling factor estimated from the data of array $k$.

Johnson et al. [22] developed a model-based method for background subtraction and normalization. Their method is based on the assumption that probes with similar probe sequence will have similar binding behavior and that most of the probes (say > 99%) measure background signal in a ChIP-chip experiment. Because most of the probes on the array are measuring background signal, each array therefore contains its own controls. Under these assumptions, they modeled the log-intensity from probe $i$ as
$$y_i = \alpha\, n_{iT} + \sum_{j=1}^{L}\ \sum_{k\in\{A,C,G\}} \beta_{jk}\, I_{ijk} + \sum_{k\in\{A,C,G,T\}} \gamma_k\, n_{ik}^2 + \delta \log c_i + \varepsilon_i,$$
where
• $n_{ik}$ is the probe $i$ nucleotide $k$ count;
• $\alpha$ is the baseline value based on the number of T nucleotides on the probe (i.e. $25\alpha$ is the baseline when the probe sequence is a run of 25 Ts);
• $L$ is the length (in nucleotides) of probe $i$;
• $I_{ijk}$ is an indicator function such that $I_{ijk} = 1$ if the nucleotide at position $j$ of probe $i$ is $k$, and $I_{ijk} = 0$ otherwise;
• $\beta_{jk}$ is the relative effect of having nucleotide $k$ at probe position $j$ (versus having a T in position $j$);
• $\gamma_k$ is the effect of the squared count of nucleotide $k$;
• $\delta$ is the effect of the log of the number of times the probe appears in the genome, $c_i$; and
• $\varepsilon_i$ is the probe-specific error term, assumed to follow a normal distribution.
Johnson et al. used ordinary least squares to estimate the parameters of the model, but if researchers are worried that the small percentage of probes measuring true signal would bias the parameter estimates, robust regression or an iteratively re-weighted least-squares approach could be used. This model was applied to the probes on each array separately, and normalized and background-adjusted estimates were then obtained as
$$\hat{y}_i = \frac{y_i - \hat{m}_i}{\hat{s}_i},$$
where $\hat{m}_i$ is the model estimate for probe $i$ and $\hat{s}_i$ is the estimated standard error of the probes on the array with model estimates similar to that of probe $i$. Because the model is applied to each array separately, this adjustment simultaneously adjusts for probe background effects and normalizes the arrays. Song et al. [40] applied a similar model-based approach designed specifically for two-color microarray data.
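The following is a minimal sketch of this kind of sequence-based background model under simplifying assumptions (it is not the published implementation): each probe's log-intensity is regressed on position-specific nucleotide indicators (T as the reference base), squared nucleotide counts and log genomic copy number, and probes are then standardized against others with similar fitted values.

```python
import numpy as np

BASES = "ACGT"

def design_row(seq, copy_number):
    """One design-matrix row for a probe sequence (all probes assumed same length)."""
    row = [float(seq.count("T"))]                # alpha * n_T baseline term
    for j in range(len(seq)):                    # position-specific indicators
        for k in "ACG":                          # T is the reference base
            row.append(1.0 if seq[j] == k else 0.0)
    for k in BASES:                              # squared nucleotide counts
        row.append(float(seq.count(k)) ** 2)
    row.append(np.log(copy_number))              # genome copy-number term
    return row

def background_standardize(seqs, copy_numbers, log_intensity, n_bins=20):
    X = np.array([design_row(s, c) for s, c in zip(seqs, copy_numbers)])
    y = np.asarray(log_intensity, dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)         # ordinary least squares
    fitted = X @ beta
    # Standardize each probe by the spread of probes with similar fitted values.
    order = np.argsort(fitted)
    adjusted = np.empty_like(y)
    for chunk in np.array_split(order, n_bins):
        s = y[chunk].std(ddof=1)
        adjusted[chunk] = (y[chunk] - fitted[chunk]) / (s if s > 0 else 1.0)
    return adjusted
```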
3.2 High-level analysis
The second step in the analysis of ChIP-chip data is the localization of interesting genomic features, usually attempting to find the TF-DNA interaction sites. To do this, one will typically look for regions containing many probes with high intensity values, which will be denoted as peak regions. Many different methods have been proposed for localizing peak regions in data from ChIP-chip experiments. One of the first peak finding methods, proposed by Cawley et al. [8], pools the data from all probes and all samples in a fixed genomic region (500-1000 bp) around each probe, and then utilizes a Wilcoxon rank sum test to compare the treatment probe measurements with the controls. More specifically, let window $k$ be defined as the genomic window $[x_k - 500\text{ bps}, x_k + 500\text{ bps}]$, where $x_k$ is the chromosomal position of probe $k$. Now suppose an experiment contained only one ChIP and one control sample. Let $y^k_{\mathrm{ChIP}} = \{y_i : x_i \in [x_k - 500\text{ bps}, x_k + 500\text{ bps}]\}$, and let $y^k_{\mathrm{control}}$ be similarly defined. Assuming that probe $i$ is in window $k$, define $r^k_i$ to be the rank of the ChIP-sample probe $i$ intensity in $y^k = (y^k_{\mathrm{ChIP}}, y^k_{\mathrm{control}})$, and let
$$U^k = \sum_j r^k_j - \frac{n^k(n^k+1)}{2},$$
where $n^k$ is the number of probes from the ChIP samples in window $k$. A p-value for window $k$ can then be calculated using the large-sample distribution of $U^k$, which is approximately $N\!\left(\frac{n^k n^k}{2},\ \frac{n^k n^k (n^k + n^k + 1)}{12}\right)$. A window is called significant if its p-value is below a predefined stringent p-value cutoff. If there are replicate ChIP and control samples, the probes from all arrays are pooled together and then the test is conducted as outlined above.

Johnson et al. [22] also developed a sliding-window-based method, but they used a window scoring method based on a trimmed mean weighted by the square root of the number of probes in the region. In other words, defining window $k$, $y^k_{\mathrm{ChIP}}$ and $y^k_{\mathrm{control}}$ as above, Johnson et al. used the window statistic
$$z^k = \sqrt{n^k}\left[TM(y^k_{\mathrm{ChIP}}) - TM(y^k_{\mathrm{control}})\right],$$
where $TM(y)$ is the mean of $y$ after trimming the top and bottom 10% of the values in $y$. The purpose of the $\sqrt{n^k}$ weight is to account for windows with different numbers of probes. They assumed that $z^k$ follows a $N(\mu, \sigma^2)$ distribution, where $\mu$ is estimated by the median of the $z^k$'s and $\sigma^2$ is estimated using the $z^k$ values below the median (to avoid bias from the significant values in the right tail of the distribution). They used this empirical distribution of window scores to calculate p-values and call significant regions.

Yet another peak-finding method was presented by Ji and Wong [21], who used a hierarchical empirical Bayes model that produces a scoring statistic with a robust variance estimate. Let $y_{ijk}$ be the log-intensity value for probe $i = 1, \ldots, N$, condition $j \in \{\mathrm{ChIP}, \mathrm{control}\}$ and replicate $k = 1, \ldots, K_j$. They made the following hierarchical model assumptions:
$$\mu_{ij} \mid \mu_0, \tau_0^2 \propto 1, \qquad \sigma_i^2 \mid \nu_0, \omega_0^2 \sim \text{Inv-}\chi^2(\nu_0, \omega_0^2).$$
Now defining $\nu = \sum_j (K_j - 1)$, $s_i^2 = \sum_j \sum_k (y_{ijk} - \bar{y}_{ij})^2/\nu$, $\bar{s}^2 = \sum_i s_i^2/N$ and $\delta = \sum_i (s_i^2 - \bar{s}^2)^2$, and
$$B = \frac{2/\nu}{1 + 2/\nu}\cdot\frac{N-1}{N}\cdot\frac{1}{1 + \dfrac{1}{1+2/\nu}\cdot\dfrac{\delta}{(2/\nu)(\bar{s}^2)^2(2N-1)}},$$
$\sigma_i^2$ can be estimated by
$$\hat{\sigma}_i^2 = (1-B)\,s_i^2 + B\,\bar{s}^2.$$
The posterior distribution of $\mu_{ij}$ can be approximated by an $N(\bar{y}_{ij}, \hat{\sigma}_i^2/K_j)$ distribution. Now, for comparing the ChIP and the control samples, Ji and Wong used the statistic
$$t_i = \frac{\bar{y}_{i,\mathrm{ChIP}} - \bar{y}_{i,\mathrm{control}}}{\sqrt{\hat{\sigma}_i^2\,(1/K_{\mathrm{ChIP}} + 1/K_{\mathrm{control}})}}.$$
After obtaining $t_i$ for each probe, they combined neighboring probe statistics using a moving average method or a Hidden Markov Model (HMM). We will describe a similar HMM method here. Consider a two-state Markov process taking on values 0 or 1, where state 1 represents interesting genomic regions (e.g. TF binding sites). The probe statistics are assumed to have a density of the form $f(t) = \pi f_0(t) + (1-\pi) f_1(t)$, where $1-\pi$ and $f_1(t)$ are the proportion and distribution of probe statistics in state 1. For probes close together, the Markov chain transition matrix takes on transition probabilities $P(0 \to 1) = a_1$ and $P(1 \to 0) = a_0$, and when the probes are far apart, the Markov chain has transition probabilities $P(0 \to 1) = 1 - \pi$ and $P(1 \to 0) = \pi$. The densities $f_0$, $f_1$ and $\pi$ can be estimated by assuming a normal or t-distribution on the data, or by applying permutations, the bootstrap or unbalanced mixture subtraction (described in detail by Ji and Wong [21]). The remaining model and transition parameters can then be estimated using the Baum-Welch (or EM) algorithm [13, 12]. Others have also presented methods for identifying peak regions [29, 24, 14, 42], but these will not be discussed in detail here. The interested reader is referred to the references above for more details.
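A bare-bones sketch of the sliding-window, trimmed-mean scoring described above is given below (the window size, trimming fraction and half-normal null-variance estimate are simplifying assumptions, not the published code).

```python
import numpy as np
from scipy.stats import trim_mean, norm

def window_scores(positions, chip, control, half_width=500):
    """Trimmed-mean window statistic z^k for every probe position."""
    positions = np.asarray(positions)
    chip, control = np.asarray(chip, float), np.asarray(control, float)
    scores = []
    for x in positions:
        idx = np.abs(positions - x) <= half_width   # probes within the window
        diff = trim_mean(chip[idx], 0.1) - trim_mean(control[idx], 0.1)
        scores.append(np.sqrt(idx.sum()) * diff)
    return np.array(scores)

def window_pvalues(scores):
    # Null mean from the median; null sd from the scores below the median,
    # to avoid bias from enriched windows in the right tail.
    mu = np.median(scores)
    lower = scores[scores <= mu]
    sigma = np.sqrt(np.mean((lower - mu) ** 2))
    return norm.sf(scores, loc=mu, scale=sigma if sigma > 0 else 1.0)
```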
4 Follow-up analysis
The output from the peak finding algorithms above is a set of genomic regions that are putative protein-DNA interaction sites. These regions are usually 500-1000 base pairs in length, and will likely contain one or more active TF-DNA binding sites. After biological validation of a small number of these sites, there are several different follow-up analyses that need to be conducted to answer the situational questions of interest. One natural follow-up analysis is to find genes in close proximity to the binding site, to attempt to determine if the binding site is actively involved in regulating the gene and to see if the regulated genes are involved in a common biological function. Another important follow-up analysis is to try to better characterize the DNA-binding properties of the TF, such as searching the putative regions for new binding motifs, scanning the regions for occurrences of the known binding motifs for the TF, or looking for potential DNA-binding protein cofactors for the TF. Motif finding is discussed in detail below.
4.1 Correlating ChIP-chip and gene expression data
Correlating gene expression and ChIP-chip data is an important follow-up analysis. It allows researchers to identify which binding sites are actively affecting gene expression and helps researchers better understand the function of the TF. There are currently few mathematically rigorous methods for combining ChIP-chip and gene expression data. For the most part, a researcher looks for significant ChIP-chip regions that contain the TF binding motif and are close to (preferably in the promoter of) differentially expressed genes. For example, Carroll et al. [7] conducted a ChIP-chip experiment to determine the DNA-binding locations of the estrogen receptor (ER) in a breast cancer cell line. In addition, they conducted a gene expression experiment on the cell line under normal conditions (control) and on cells deprived of estrogen (ER will not bind to DNA without estrogen), and changes in gene expression were observed. These expression changes were correlated with the significant ChIP-chip regions, meaning that significant regions that were close to differentially expressed genes and that also contained an ER binding motif were inferred to be actively regulating the gene. The genes that are inferred to be directly influenced by the TF can then be further studied by searching for over-represented biological pathways or gene functions, using databases such as the Gene Ontology (GO) database (www.geneontology.org), to identify whether the TF regulates genes that have a common biological function. Also of interest is whether the TF is involved in repressing or promoting genes from a particular pathway or function.
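The bookkeeping behind such a correlation analysis can be as simple as intersecting peak regions with the promoter windows of differentially expressed genes; the sketch below is a toy version with hypothetical coordinates and an assumed promoter window, not the pipeline used in the study cited above.

```python
PROMOTER_UP, PROMOTER_DOWN = 5000, 2000   # assumed promoter window around the TSS

def peaks_near_de_genes(peaks, genes, de_genes):
    """peaks: list of (chrom, start, end); genes: dict name -> (chrom, tss, strand)."""
    hits = []
    for name in de_genes:
        chrom, tss, strand = genes[name]
        if strand == "+":
            lo, hi = tss - PROMOTER_UP, tss + PROMOTER_DOWN
        else:
            lo, hi = tss - PROMOTER_DOWN, tss + PROMOTER_UP
        for peak_chrom, peak_start, peak_end in peaks:
            if peak_chrom == chrom and peak_start <= hi and peak_end >= lo:
                hits.append((name, (peak_chrom, peak_start, peak_end)))
    return hits

# Example with made-up coordinates:
# peaks_near_de_genes([("chr1", 11800, 12600)],
#                     {"GENE1": ("chr1", 15000, "+")}, ["GENE1"])
```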
4.2 Scanning sequences for known motifs
There are several publicly available literature-based databases of known transcription factor binding motifs [33, 37]. These give researchers the ability to scan a set of DNA segments, such as promoters, or even the entire genome, for occurrences of a particular motif for a given TF. With these database tools, researchers can scan regions of interest for many motifs, making it possible to find genes related to the factor and also to find related DNA-binding cofactors. One way to scan for a known motif is to merely check for the occurrence of the motif or its consensus sequence. This approach is only useful when the TF's binding motif has only a few variations and these variations are all well characterized. In most cases, however, other approaches must be applied. A motif and its variations can be more precisely represented by a motif matrix, or position-specific weight matrix. For a motif of length n, the weight matrix can be defined as a 4 x n matrix, where the columns represent the positions of the motif and the rows represent the nucleotides A, C, G, T. The matrix element p_ij is the probability that the jth position of the motif contains the ith nucleotide. For example, the position weight matrix for the TF HNF4 (Hepatocyte Nuclear Factor 4) from the JASPAR database (http://jaspar.genereg.net/) is
given by

Position: 1 2 3 4 5 6 7 8 9 10 11 12 13
A 0.4 0.0 0.2 0.1 0.0 0.9 0.8 0.8 0.1 0.1 0.0 0.1 0.6
C 0.1 0.0 0.1 0.3 0.8 0.0 0.0 0.0 0.1 0.0 0.3 0.7 0.1
G 0.4 0.8 0.5 0.3 0.1 0.0 0.1 0.1 0.9 0.5 0.2 0.0 0.1
T 0.1 0.1 0.2 0.3 0.1 0.1 0.0 0.0 0.0 0.4 0.5 0.1 0.1
Note that some of the columns do not sum to 1 because the values have been rounded to the nearest tenth. A more visually appealing way of representing a DNA-binding motif is to use a sequence logo. A sequence logo is a graphical representation of a motif matrix that is basically a barplot showing the information content in each position. The bars are replaced by the nucleotides (A, C, G, T) and the relative height of each nucleotide in each position represents its frequency p_ij. See Schneider and Stephens [38] for a more comprehensive description of sequence logos and Crooks et al. [11] for an online tool (http://weblogo.berkeley.edu/) for generating sequence logos. Figure 3 is a sequence logo for HNF4.
Figure 3: Sequence logo for the DNA-binding motif for the HNF transcription factor from the JASPAR database
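For readers who want to reproduce the heights in such a logo, the per-position information content can be computed directly from the matrix above; the sketch below assumes a uniform background and adds a small pseudocount to the rounded probabilities to avoid log(0).

```python
import numpy as np

pwm = np.array([  # rows A, C, G, T; columns are motif positions 1-13 (HNF4 above)
    [0.4, 0.0, 0.2, 0.1, 0.0, 0.9, 0.8, 0.8, 0.1, 0.1, 0.0, 0.1, 0.6],
    [0.1, 0.0, 0.1, 0.3, 0.8, 0.0, 0.0, 0.0, 0.1, 0.0, 0.3, 0.7, 0.1],
    [0.4, 0.8, 0.5, 0.3, 0.1, 0.0, 0.1, 0.1, 0.9, 0.5, 0.2, 0.0, 0.1],
    [0.1, 0.1, 0.2, 0.3, 0.1, 0.1, 0.0, 0.0, 0.0, 0.4, 0.5, 0.1, 0.1],
])
p = (pwm + 1e-6) / (pwm + 1e-6).sum(axis=0)     # renormalize columns
info = 2.0 + (p * np.log2(p)).sum(axis=0)       # bits of information per position
# In a logo, the letter height at position j is info[j] * p[i, j] for nucleotide i.
```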
Assuming that independence of the bases in the motif or assuming that the distribution of the nucleotide occurrence across the motif is a product multinomial distribution with probabilities defined by the motif matrix, each n-mer in the sequence can be scored for the presence of each motif. For example, suppose the motif matrix was length L and let y be a candidate DNA sequence which to be scored by the motif. Let Yj be the jth nucleotide in the candidate sequence. The
likelihood score for the motif can then be represented as
$$\mathrm{MotifScore}(y) \;=\; \prod_{j=1}^{L} \sum_{i \in \{A,C,G,T\}} p_{ij}\, I(y_j = i), \qquad (4.1)$$
where $p_{ij}$ is the $(i,j)$th element of the position weight matrix and $I(y_j = i) = 1$ if $y_j = i$ and $0$ otherwise. Using the motif score, researchers can scan sequences of interest (e.g., ChIP-chip peak regions) for occurrences of known motifs, either by setting a score cutoff or by looking for an enrichment of the motif in the sequences of interest relative to the entire genome. More sophisticated scoring methods can also be derived by incorporating conservation scores, a higher-order Markov background model, multiple copies of the motif, or combinations of multiple motifs.
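As a concrete illustration of Equation (4.1), the following Python sketch stores the HNF4 weight matrix shown above and scores every 13-mer of a candidate sequence. The example sequence, and the idea of keeping L-mers above some score cutoff, are assumptions for illustration only.

# A minimal sketch of PWM scanning with Equation (4.1). The matrix is the HNF4
# example shown above (rounded values); the candidate sequence and any cutoff
# are illustrative assumptions.
HNF4_PWM = {   # rows: nucleotides, columns: motif positions 1..13
    "A": [0.4, 0.0, 0.2, 0.1, 0.0, 0.9, 0.8, 0.8, 0.1, 0.1, 0.0, 0.1, 0.6],
    "C": [0.1, 0.0, 0.1, 0.3, 0.8, 0.0, 0.0, 0.0, 0.1, 0.0, 0.3, 0.7, 0.1],
    "G": [0.4, 0.8, 0.5, 0.3, 0.1, 0.0, 0.1, 0.1, 0.9, 0.5, 0.2, 0.0, 0.1],
    "T": [0.1, 0.1, 0.2, 0.3, 0.1, 0.1, 0.0, 0.0, 0.0, 0.4, 0.5, 0.1, 0.1],
}

def motif_score(y: str, pwm=HNF4_PWM) -> float:
    """Equation (4.1): product over positions j of p_{ij} with i = y_j."""
    L = len(next(iter(pwm.values())))
    assert len(y) == L, "candidate must have the same length as the motif"
    score = 1.0
    for j, base in enumerate(y.upper()):
        score *= pwm[base][j]    # the indicator I(y_j = i) picks out this entry
    return score

def scan(sequence: str, pwm=HNF4_PWM):
    """Score every L-mer in `sequence`; returns (start, L-mer, score) triples."""
    L = len(next(iter(pwm.values())))
    return [(s, sequence[s:s + L], motif_score(sequence[s:s + L], pwm))
            for s in range(len(sequence) - L + 1)]

# Toy usage on an arbitrary sequence; in practice one would scan ChIP-chip peak
# regions and keep L-mers whose score exceeds a chosen cutoff.
for start, kmer, sc in scan("TGGGCAAAGGTCAGTT"):
    print(start, kmer, f"{sc:.2e}")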
4.3 Identifying new binding motifs
Finding new motifs is usually a difficult task, and there are many notable methods, described in the following sections. One of the most difficult challenges in searching for new motifs is that binding motifs are often not well characterized. In addition, the many repetitive elements in DNA add noise to the motif search and make it difficult to determine which repetitive elements are true binding motifs, especially since many motifs look very similar or are GC-rich (e.g., the E-box motif CACGTG). Other factors leading to the poor sensitivity and specificity of de novo motif finding methods are that motifs are often highly variable and that binding site borders are not always clear. Also, for nearly all motif finding methods there is a limit on the total length of sequence that can be searched, and this is complicated by the fact that most motifs appear only a few times in the sequences to be searched.

4.3.1 Regular expression enumeration
One of the simplest ways to look for new motifs is to count the occurrences of each word of size w in the selected sequences and compare these counts with the numbers expected at random. More sophisticated word-enumeration methods have been proposed. For example, Bussemaker et al. [5] build long motifs from short ones: they start with a simple dictionary of small over-represented motifs and then grow the dictionary by adding over-represented longer motifs built from the smaller motifs already in the dictionary. For illustration, consider a genome composed of roughly 30% each of C and G nucleotides and 20% each of A and T. In any set of sequences from such a genome, one would expect di-nucleotide frequencies of about 9% for CC, CG, GC, and GG; 6% for AC, AG, CA, CT, GA, GT, TC, and TG; and 4% for AA, AT, TA, and TT. If, for example, the observed frequency of CC were 15% and the frequencies of TA and AA were 10%, then the initial dictionary would contain {AA, CC, TA}. The next stage would look for over-represented tri-nucleotide combinations that contain AA, CC, or TA. The process continues until no over-represented motifs are found or the words in the dictionary reach a predetermined length.
Note, however, that these methods are limited: they detect only fixed words, and they cannot distinguish or quantify similar motif variants that share the same consensus sequence.
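The following Python sketch illustrates the basic enumeration step (not the dictionary-building algorithm of [5]): it counts w-mers in a set of sequences and flags those whose observed count exceeds the count expected under an assumed independent-base background. The toy sequences, background composition, and enrichment cutoff are all illustrative.

# A rough sketch of word enumeration under an assumed 0-th order background.
from collections import Counter

def overrepresented_words(sequences, w=3, background=None, fold=2.0):
    """Return w-mers whose observed count exceeds `fold` times the expected count."""
    if background is None:                       # assumed background composition
        background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for s in range(len(seq) - w + 1):
            counts[seq[s:s + w]] += 1
    total = sum(counts.values())
    hits = {}
    for word, n in counts.items():
        expected = total
        for base in word:                        # expected count under independence
            expected *= background.get(base, 0.25)
        if expected > 0 and n / expected >= fold:
            hits[word] = (n, expected)
    return hits

# Toy usage: the 3-mers inside a planted CACGTG word (CAC, ACG, CGT, GTG) surface as enriched.
seqs = ["TTCACGTGAA", "GGCACGTGTT", "ACACGTGACA"]
for word, (n, exp) in sorted(overrepresented_words(seqs, fold=8.0).items()):
    print(word, n, round(exp, 2))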
4.3.2 Position weight matrix update
Many researchers have attempted to search for motifs by estimating the position weight matrix of the motif. These methods start with a reasonable guess at the motif matrix and then use the sequence data to update the matrix. Hertz et al. [16] developed a method that adds one sequence at a time and looks for the best motifs obtained with the additional sequence. This algorithm runs very fast, but the sequence order plays a significant role in the algorithm's performance; for example, the first two sequences usually must contain real motif sites, and the first few sequences need to be enriched with motif sites. Others have applied missing-data approaches such as the expectation-maximization (EM) algorithm or Gibbs sampling. Most EM approaches [27, 1, 4] treat the motif positions as missing data and iteratively compute the expected locations of the motif (E-step) and then estimate the motif weight matrix (M-step) based on this expectation. A straightforward way to apply an EM approach to motif searching is as follows. For a sequence of length N, let $y_1, \ldots, y_N$ denote the positions of the sequence, and define the overlapping windows $x_1 = (y_1, \ldots, y_L)$, $x_2 = (y_2, \ldots, y_{L+1})$, $\ldots$, $x_{N-L+1} = (y_{N-L+1}, \ldots, y_N)$. Suppose that each $x_j$, $j = 1, \ldots, N-L+1$, is derived either from background sequence or from the position weight matrix. Let $\{p_{im}\}$, for $i \in \{A,C,G,T\}$ and $m = 1, \ldots, L$, be the probabilities from the motif matrix, and let $\{b_{im}\}$ be a (usually known) position weight matrix for the background sequence. Define the (unobserved) indicator $d_j$ to equal 1 if $x_j$ is derived from the motif and 0 if $x_j$ is background sequence. Assuming that the $d_j$ follow a Bernoulli($\pi$) distribution, the complete-data likelihood can be expressed as
$$L\bigl(\pi, \{p_{im}\}\bigr) \;=\; \prod_{j=1}^{N-L+1} \bigl[\pi f_1(x_j)\bigr]^{d_j} \bigl[(1-\pi) f_0(x_j)\bigr]^{1-d_j}, \qquad (4.2)$$
where $f_1(x)$ and $f_0(x)$ are the motif scores (as in Equation (4.1)) computed from the motif and background matrices, respectively. Using standard EM methodology, the E-step at iteration k consists of imputing estimates $d_j^{(k)}$ for the $d_j$ in Equation (4.2), where
$$d_j^{(k)} \;=\; \frac{\hat{\pi}^{(k-1)} f_1^{(k-1)}(x_j)}{\hat{\pi}^{(k-1)} f_1^{(k-1)}(x_j) + \bigl(1-\hat{\pi}^{(k-1)}\bigr) f_0(x_j)},$$
$\hat{\pi}^{(k-1)}$ is the estimate of $\pi$ at iteration $k-1$, and $f_1^{(k-1)}(x)$ is the motif score for sequence $x$ computed using the estimates of $\{p_{im}\}$, say $\{\hat{p}_{im}^{(k-1)}\}$, at iteration $k-1$. The M-step consists of updating the parameters by maximizing Equation (4.2) after imputing the $d_j$, which results in the estimators
$$\hat{\pi}^{(k)} \;=\; \frac{1}{N-L+1} \sum_{j=1}^{N-L+1} d_j^{(k)}, \qquad \hat{p}_{im}^{(k)} \;=\; \frac{\sum_{j=1}^{N-L+1} d_j^{(k)}\, I(x_{jm} = i)}{\sum_{j=1}^{N-L+1} d_j^{(k)}},$$
where $x_{jm}$ is the mth element of $x_j$. The E-step and M-step are then iterated until convergence. Others have applied stochastic simulation, namely Gibbs sampling, to find motifs [26, 36, 31]. These methods are stochastic in nature, sampling the motif locations and weight matrices from their posterior distributions under a motif model such as the one used in the EM approach.
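A compact Python sketch of the EM scheme just described is given below. It makes several simplifying assumptions beyond the text: a single input sequence, a fixed motif length, a 0-th order background estimated from the overall base frequencies, overlapping windows treated as independent, and arbitrary choices of initialization, iteration count, and pseudocount.

# A compact sketch of the two-component EM motif model described above,
# under the simplifying assumptions stated in the lead-in.
import numpy as np

BASES = "ACGT"

def window_probs(windows_idx, p):
    """f(x_j) = prod_m p[x_{jm}, m] for every window, given a 4 x L matrix p."""
    L = p.shape[1]
    return np.prod(p[windows_idx, np.arange(L)], axis=1)

def em_motif(sequence, L=6, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    idx = np.array([BASES.index(b) for b in sequence.upper()])
    windows_idx = np.array([idx[j:j + L] for j in range(len(idx) - L + 1)])

    # Background: 0-th order model from overall base frequencies (assumed known).
    freq = np.bincount(idx, minlength=4) / len(idx)
    b = np.repeat(freq[:, None], L, axis=1)

    # Random initial motif matrix and prior probability pi.
    p = rng.dirichlet(np.ones(4), size=L).T          # 4 x L
    pi = 0.05

    for _ in range(n_iter):
        f1 = window_probs(windows_idx, p)            # motif likelihood of each window
        f0 = window_probs(windows_idx, b)            # background likelihood
        d = pi * f1 / (pi * f1 + (1 - pi) * f0)      # E-step: E[d_j | data]
        pi = d.mean()                                # M-step: update pi
        for m in range(L):                           # M-step: update p_{im}
            for i in range(4):
                p[i, m] = d[windows_idx[:, m] == i].sum()
        p = (p + 1e-6) / (p + 1e-6).sum(axis=0)      # normalize columns (tiny pseudocount)
    return p, pi

# Toy usage on a short sequence with a planted CACGTG word.
pwm, pi_hat = em_motif("TTCACGTGAATTGGCACGTGCCAACACGTGTT", L=6)
print(np.round(pwm, 2), round(pi_hat, 3))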
4.4 Using microarrays to guide motif search
Others have used microarray data to guide the motif search. Since there are many such methods, we briefly mention only a few here; the interested reader should refer to the references for details. Bussemaker et al. [6] developed a method that models microarray expression values given the number of known motif occurrences. Chiang et al. [9] use expression data for de novo motif search: for each w-mer, they find the genes whose upstream sequences contain the w-mer and compute a weighted average of expression given the w-mer occurrences; if this average is extremely high or low for a particular w-mer, the w-mer is reported as a motif. Liu et al. [32] presented a method designed specifically to find motifs in sequences produced by ChIP-chip experiments, which combines word enumeration with motif weight matrices: it first uses word enumeration to look for w-mers that are over-represented in the sequences and then merges similar motifs to create a weight matrix. Conlon et al. [10] combine this approach with a model similar to that proposed by Bussemaker et al. [6]: motifs are found in the promoter sequences of genes, and a stepwise regression of the motif matrix scores on expression is then used to predict the role of each TF in the expression of each gene. Shim and Keles [39] incorporate quantitative ChIP-chip data to guide motif searching.
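The following stylized Python sketch conveys the common idea behind these expression-guided approaches; it is not any of the cited implementations. It regresses gene expression changes on per-gene motif scores and interprets a large positive fitted slope as evidence that the motif's TF activates those genes (a negative slope suggesting repression). The toy scores, expression values, and the use of ordinary least squares in place of the stepwise regression of [10] are assumptions.

# A stylized sketch: regress expression changes on motif scores.
import numpy as np

def motif_slope(scores, expression):
    """Slope from regressing expression on motif score (ordinary least squares)."""
    x = np.column_stack([np.ones_like(scores), scores])   # intercept + score
    beta, _, _, _ = np.linalg.lstsq(x, expression, rcond=None)
    return beta[1]

# Toy data: 6 genes, their (assumed) promoter motif scores and log-fold changes.
scores = np.array([0.1, 0.8, 0.2, 0.9, 0.05, 0.7])
logfc  = np.array([0.0, 1.1, 0.2, 1.3, -0.1, 0.9])
print(round(motif_slope(scores, logfc), 2))   # a large positive slope suggests activation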
5 Conclusion
ChIP-chip experiments provide researchers with the ability to accurately map in vivo TF binding sites across the entire genome of an organism. This chapter presented statistical and computational methods that efficiently extract information from the extremely large and noisy data sets produced by ChIP-chip experiments on genome tiling arrays. However, much methodological work remains to be done for these data, including methods for using quantitative ChIP-chip data to guide motif searches and straightforward methods for integrating ChIP-chip, sequence, and expression data.
References

[1] T. L. Bailey and C. Elkan. The value of prior knowledge in discovering motifs with MEME. Proc Int Conf Intell Syst Mol Biol, 3:21-29, 1995.
[2] B. M. Bolstad, R. A. Irizarry, M. Astrand, and T. P. Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19:185-193, 2003.
[3] M. J. Buck and J. D. Lieb. ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics, 83:349-360, 2004.
[4] J. Buhler and M. Tompa. Finding motifs using random projections. J Comput Biol, 9:225-242, 2002.
[5] H. J. Bussemaker, H. Li, and E. D. Siggia. Regulatory element detection using a probabilistic segmentation model. Proc Int Conf Intell Syst Mol Biol, 8:67-74, 2000.
[6] H. J. Bussemaker, H. Li, and E. D. Siggia. Regulatory element detection using correlation with expression. Nat Genet, 27:167-171, 2001.
[7] J. S. Carroll, X. S. Liu, A. S. Brodsky, W. Li, C. A. Meyer, A. J. Szary, J. Eeckhoute, W. Shao, E. V. Hestermann, T. R. Geistlinger, E. A. Fox, P. A. Silver, and M. Brown. Chromosome-wide mapping of estrogen receptor binding reveals long-range regulation requiring the forkhead protein FoxA1. Cell, 122:33-43, 2005.
[8] S. Cawley, S. Bekiranov, H. H. Ng, P. Kapranov, E. A. Sekinger, D. Kampa, A. Piccolboni, V. Sementchenko, J. Cheng, A. J. Williams, R. Wheeler, B. Wong, J. Drenkow, M. Yamanaka, S. Patel, S. Brubaker, H. Tammana, G. Helt, K. Struhl, and T. R. Gingeras. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell, 116:499-509, 2004.
[9] D. Y. Chiang, P. O. Brown, and M. B. Eisen. Visualizing associations between genome sequences and gene expression data using genome-mean expression profiles. Bioinformatics, 17 Suppl 1:S49-55, 2001.
[10] E. M. Conlon, X. S. Liu, J. D. Lieb, and J. S. Liu. Integrating regulatory motif discovery and genome-wide expression analysis. Proc Natl Acad Sci U S A, 100:3339-3344, 2003.
[11] G. E. Crooks, G. Hon, J. M. Chandonia, and S. E. Brenner. WebLogo: a sequence logo generator. Genome Research, 14:1188-1190, 2004.
[12] A. P. Dempster, N. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39:1-38, 1977.
[13] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK, 1998.
[14] R. Gottardo, W. Li, W. E. Johnson, and X. S. Liu. A flexible and powerful Bayesian hierarchical model for ChIP-chip experiments. Biometrics, 2007. doi:10.1111/j.1541-0420.2007.00899.x.
[15] C. T. Harbison, D. B. Gordon, T. I. Lee, N. J. Rinaldi, K. D. Macisaac, T. W. Danford, N. M. Hannett, J. B. Tagne, D. B. Reynolds, J. Yoo, E. G. Jennings, J. Zeitlinger, D. K. Pokholok, M. Kellis, P. A. Rolfe, K. T. Takusagawa, E. S. Lander, D. K. Gifford, E. Fraenkel, and R. A. Young. Transcriptional regulatory code of a eukaryotic genome. Nature, 431:99-104, 2004.
[16] G. Z. Hertz, G. W. Hartzell 3rd, and G. D. Stormo. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput Appl Biosci, 6:81-92, 1990.
[17] E. Hubbell, W. M. Liu, and R. Mei. Robust estimators for expression analysis. Bioinformatics, 18:1585-1592, 2002.
[18] W. Huber, J. Toedling, and L. M. Steinmetz. Transcript mapping with high-density oligonucleotide tiling arrays. Bioinformatics, 22:1963-1970, 2006.
[19] R. A. Irizarry, B. M. Bolstad, F. Collin, L. M. Cope, B. Hobbs, and T. P. Speed. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res, 31:e15, 2003.
[20] V. R. Iyer, C. E. Horak, C. S. Scafe, D. Botstein, M. Snyder, and P. O. Brown. Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature, 409:533-538, 2001.
[21] H. Ji and W. H. Wong. TileMap: create chromosomal map of tiling array hybridizations. Bioinformatics, 21:3629-3636, 2005.
[22] W. E. Johnson, W. Li, C. A. Meyer, R. Gottardo, J. S. Carroll, M. Brown, and X. S. Liu. Model-based analysis of tiling-arrays for ChIP-chip. Proc Natl Acad Sci USA, 103:12457-12462, 2006.
[23] P. Kapranov, S. E. Cawley, J. Drenkow, S. Bekiranov, R. L. Strausberg, S. P. Fodor, and T. R. Gingeras. Large-scale transcriptional activity in chromosomes 21 and 22. Science, 296:916-919, 2002.
[24] S. Keles. Mixture modeling for genome-wide localization of transcription factors. Biometrics, 63:10-21, 2007.
[25] T. H. Kim, L. O. Barrera, M. Zheng, C. Qu, M. A. Singer, T. A. Richmond, Y. Wu, R. D. Green, and B. Ren. A high-resolution map of active promoters in the human genome. Nature, 436:876-880, 2005.
[26] C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208-214, 1993.
[27] C. E. Lawrence and A. A. Reilly. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins, 7:41-51, 1990.
[28] C. Li and W. H. Wong. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA, 98:31-36, 2001.
[29] W. Li, C. A. Meyer, and X. S. Liu. A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences. Bioinformatics, 21 Suppl 1:i274-i282, 2005.
[30] J. D. Lieb, X. Liu, D. Botstein, and P. O. Brown. Promoter-specific binding of Rap1 revealed by genome-wide maps of protein-DNA association. Nat Genet, 28:327-334, 2001.
[31] X. Liu, D. L. Brutlag, and J. S. Liu. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput, pages 127-138, 2001.
[32] X. S. Liu, D. L. Brutlag, and J. S. Liu. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray
experiments. Nat Biotechnol, 20:835-839, 2002.
[33] V. Matys, E. Fricke, R. Geffers, E. Gossling, M. Haubrock, R. Hehl, K. Hornischer, D. Karas, A. E. Kel, O. V. Kel-Margoulis, D. U. Kloos, S. Land, B. Lewicki-Potapov, H. Michael, R. Munch, I. Reuter, S. Rotert, H. Saxel, M. Scheer, S. Thiele, and E. Wingender. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res, 31:374-378, 2003.
[34] F. Ozsolak, J. S. Song, X. S. Liu, and D. E. Fisher. High-throughput mapping of the chromatin structure of human promoters. Nature Biotechnology, 25:244-248, 2007.
[35] B. Ren, F. Robert, J. J. Wyrick, O. Aparicio, E. G. Jennings, I. Simon, J. Zeitlinger, J. Schreiber, N. Hannett, E. Kanin, T. L. Volkert, C. J. Wilson, S. P. Bell, and R. A. Young. Genome-wide location and function of DNA binding proteins. Science, 290:2306-2309, 2000.
[36] F. P. Roth, J. D. Hughes, P. W. Estep, and G. M. Church. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol, 16:939-945, 1998.
[37] A. Sandelin, W. Alkema, P. Engstrom, W. W. Wasserman, and B. Lenhard. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res, 32:D91-94, 2004.
[38] T. D. Schneider and R. M. Stephens. Sequence logos: a new way to display consensus sequences. Nucleic Acids Research, 18:6097-6100, 1990.
[39] H. Shim and S. Keles. Integrating quantitative information from ChIP-chip experiments into motif finding. Biostatistics, 9:51-65, 2007.
[40] J. S. Song, W. E. Johnson, X. Zhu, X. Zhang, W. Li, A. K. Manrai, J. S. Liu, R. Chen, and X. S. Liu. Model-based analysis of 2-color arrays (MA2C). Genome Biology, 8:R178, 2007.
[41] G. C. Yuan, Y. J. Liu, M. F. Dion, M. D. Slack, L. F. Wu, S. J. Altschuler, and O. J. Rando. Genome-scale identification of nucleosome positions in S. cerevisiae. Science, 309:626-630, 2005.
[42] M. Zheng, L. O. Barrera, B. Ren, and Y. N. Wu. ChIP-chip: data, model, and analysis. Biometrics, 63:787-796, 2007.
Subject Index
A

additive-accelerated rate model, 36, 43
anisotropic, 76

B

Bayesian model, 198, 201
Bicluster, 173
binding motif, 240
biomarker, 103-105, 118
Breslow-type estimator, 38

C

censoring, 4, 12
ChIP-chip, 180, 195
circular binary segmentation, 224
cis-regulatory, 180, 184, 189
Classification methods, 169
clinical trial designs, 103
comparative genomic hybridization (CGH), 210
complex detection methods, 172
conditional auto-regressive (CAR), 80
conditional Martingale covariance rate function, 90
copy number analysis, 216
copy number variations (CNV), 222
Cox's model, 3, 8, 10, 12, 17
cross-validation, 44

D

dChip software, 209, 229
dispersion parameter, 52
DNA sequences, 195
Domain-based methods, 163
Donsker class, 46

E

effective sample, 74
Expectation and Maximization (EM) algorithm, 86

F

failure time, 3, 4, 11, 14, 17
frailty models, 75

G

Gaussian process, 40, 47
GEE estimator, 53
gene expression, 179, 180, 195
generalized estimating equation, 50, 52
generalized linear mixed models, 73
genomic data, 161
geostatistical, 73
gold standard datasets, 171

H

Haldane's map function, 215, 218
Hidden Markov Model (HMM), 214
Hierarchical clustering analysis, 227
HP InkJet, 244

I

integrate, 160
interaction graph, 173
intrinsic stationarity, 76
Invariant Set Normalization method, 217

K

kriging, 80

L

LASSO, 59
likelihood, 3-5
linear mixed models, 73
link function, 83
loss of heterozygosity (LOH), 210, 212

M

Mapping 100K array, 210
Markov chain Monte Carlo, 182
median smoothing, 224
mismatch (MM) probes, 211
modeling, 3, 4, 6, 9
modified Cholesky decomposition, 55
motif discovery, 179-181, 183

N

Nelder-Mead simplex method, 38
nonparametric smoothing, 15
nugget, 77

O

oncogenes, 210

P

P-Donsker, 45
P-Glivenko-Cantelli, 45
penalized QIF, 59
Penalized Quasi-likelihood (PQL) method, 86
perfect match (PM) probes, 211
PM/MM difference model, 217
predictive biomarker, 103
predictive value, 104-107, 118
proportional intensity model, 35
proportional rate model, 35, 36
Protein-protein interactions, 159

Q

QIF estimator, 55
quadratic inference function, 50

R

Receiver Operator Characteristic, 166
restricted maximum likelihood estimation, 79

S

semiparametric normal transformation model, 73, 75, 93
Single nucleotide polymorphisms (SNP), 210
smoothly clipped absolute deviation, 59
spatial covariance function, 78
spatial data, 73
spatial frailty survival models, 88
Spatial generalized linear mixed models (SGLMMs), 83
Spatial prediction (Kriging), 80
spatial survival data, 73
spatial variance components, 83
strongly stationary, 76

T

time-dependent covariates, 44
time-varying coefficient, 44
time-varying coefficient model, 56
transcriptional regulation, 179
Treatment Selection, 103
tumor-only LOH, 214
tumor-only LOH analysis, 217
tumor suppressor genes (TSG), 210
two-phase epidemiology study design, 127

V

validation, 106, 107, 118
Viterbi algorithm, 219

W

weakly stationary, 76
Author Index

A
Bailey, T., 205 Baldi, P., 205 Baltimore, D., 234 Bamshad, M. J., 234 Banerjee, N., 205 Banerjee, S., 77, 96 Bar-Am, I., 232 Barany, F., 237 Barash, Y., 205 Barber, S., 235 Barlow, W. E., 138, 141, 149 Barnard, J., 56 Barndorff-Nielsen, O. E., 8 Baross, A., 235 Barrera, L. 0., 256, 257 Barretina, J., 236 Barreto, M. L., 43 Batzoglou, S., 206 Baumer, J., 154 Beazley, C., 234 Begg, C. B., 153 Begun, J. M., 7 Bekiranov, S., 255, 256 Bell, S. P., 257 Ben-Hur, A., 170 Benichou, J., 152 Bennetts, S., 28 Berchuck, A., 236 Berger, A. J., 237 Berk, A., 234 Berkowitz, G. S., 151 Berno, A., 233 Bernstein, J. L., 130 Bernstein, L., 130 Berntsen, T., 232 Beroukhim, R., 232, 233, 234, 236, 237 Besag, J., 75 Bickel, P. J., 8, 13, 28, 29, 133 Bignell, G. R., 233
Aanestad, H., 154 Aaronson, S. A., 231 Abramowitz M., 77 Aburatani, H., 234, 235 Aitkin, H., 5 Akaike, H., 25 Albert, P. S., 50 Albertson, D. G., 233 Alexander, J., 234 Alexander, S., 236 Alizadeh, A. A., 232 Alkema, W., 207, 257 Ally, A., 235 Altman, R. B., 206 Altschul, S. F., 206, 256 Altschuler, S. J., 207, 257 Amato, D. A., 14 Andersen, P. K., 6, 7, 14, 30, 35 Anderson, D. R., 58 Anderson, G. L., 66 Andrews, D. W. K., 66 Andrews, T. D., 234 Anton-Culver, H., 151, 152 Aparicio, 0., 257 Arbour, L., 235 Armstrong, L., 235 Asano, J., 235 Astrand, M., 255 Auger, I. E., 205 Ayyagari, R., 152
B Bacolod, M., 237 Bader, G. D., 174 Bader, J. S., 170, 171 Bailey, D. K., 234, 235 Bailey, J. A., 235 Bailey, T. L., 254
Bild, A. H., 236 Birch, P., 235 Bird, C. P., 234 Birren, B., 206 Blanchette, M., 207 Blatt, M., 174 Boguski, M. S., 206, 256 Boice Jr, J. D., 151 Bolivar, H., 234 Bolstad, B. M., 256 Bonvin, A. M., 160 Boos, D. D., 55 Borgan, 0., 6, 30, 128, 138, 139, 141, 146, 147, 148, 149 Bose, S., 231 Botstein, D., 232, 236, 256 Boyer, L. A., 205 Breiman, L., 25, 58, 59 Brennan, C., 237 Brenner, S. E., 255 Breslow, N. E., 6, 7, 75, 84, 86, 87, 88, 91, 128-131, 133, 135-137, 140, 144, 150 Brian, J., 235 Brodsky, A. S., 255 Brohee, S., 174 Broman, K. W., 232 Brook, D., 80 Brown, M., 206, 256 Brown, P.O., 232, 236, 255, 256 Brubaker, S., 255 Brutlag, D. L., 206, 256 Bryk, A. S., 50 Buck, M. J., 255 Buhler, J., 255 Bulhmann, P., 95 Bunea, F., 24 Burge, C., 233 Burnham, K. P., 58 Burnside, C., 67 Bussemaker, H. J., 205, 252, 254, 255 Butler, A., 233 Byrne, C., 152
C Cagir, B., 126 Cai, B., 56 Cai, J., 11, 14, 18-26,30,35,36,90,92 Cai, J. W., 153 Cai, Z., 18
Cain, K. C., 130, 131, 136, 150 Caligiuri, M. A., 236 Cantoni, E., 58, 62 Cao, M., 232 Capanu, M., 153 Carey, V. J., 56 Carlin, B. P., 75 Carlstein, E., 93, 95 Carrasco, D. R., 236 Carroll, J. S., 255, 256 Carroll, R. J., 57 Carson, A. R., 235 Catano, G., 234 Cawley, S. E., 248, 255, 256 Cawley, S., 248, 255, 256 Chadha, M., 232 Chambless, L., 154 Chandonia, J. M., 255 Chang, J. T., 236 Chang, Y. M., 233 Chang, Y., 14 Chasse, D., 236 Chatterjee, N., 131, 136, 150 Chauvin, Y., 205 Chee, M. S., 233 Chen, C., 232 Chen, H. Y., 60, 61, 142, 143 Chen, J. B., 150, 151 Chen, J., 152, 234 Chen, K., 138, 139, 141, 146-149 Chen, M., 237 Chen, Runsheng, 257 Chen, T. H., 232, 233 Chen, W., 232, 235 Chen, X., 207 Chen, Y., 131 Cheng, J., 255 Cheng, R. S., 233 Cheng, S. C., 29 Cheng, Z., 235 Chi, M., 234 Chiang, C.-T., 56 Chiang, D.Y., 207, 254, 255 Chiang, T., 56, 62, 163 Chiba, S., 234, 235 Chin, K., 233 Chipman, H. A., 205 Chow, S. C., 126 Christiani, D. C., 232 Christiani, D., 233
A uthor Index Churchill, G. A., 233 Clark, R. A., 235 Clark, S., 232 Clayton, D. G., 5, 75, 84, 86, 87 Clayton, D., 75 Cleton-Jansen, A. M., 233 Coin, L., 236 Cole, M. F., 205 Coller, H., 236 Collin, F., 257 Collins, C., 232 Colman, H., 237 Cologne, J. B., 148 Comeau, S., 160 Conlon, E. M., 205, 254, 255 Cope, L. M., 256 Cornelisse, C. J., 233 Coustan-Smith, E., 231 Cox, D. R., 1, 5, 6, 8, 12, 15, 18, 20, 23, 30, 36, 88, 89, 136 Craven, P., 57 Cressie, N., 74, 75, 77, 78, 80, 82, 92 Crooks, G. E., 251, 255 Cullinane, C., 237 Cuzick, J., 29
D Dai, M., 56 Dalton, J. D., 231 Daly, M. J., 232 Danford, T. W., 255 Daniel, R.W., 152 Daniels, M. J., 56 Darnell, J., 234 Das, D., 205 Davis, C. S., 49, 66 Davis, J., 172 de Grassi, A., 234 Degnan, J. H., 235 del Pino, G., 80 Delaney, A. D., 235 Dempster, A. P., 134, 205, 255 Deng, M., 160, 163-165 Desai, U. B., 233 Descazeaud, A., 233 Desper, R., 236 Devilee, P., 233 Di Fiore, P. P., 231 Di, X., 232, 234 Diggle, P. J., 8, 49, 75, 83, 96
263 DiNardo, J., 67 Ding, L., 232 Dion, M. F., 205, 257 Doksum, K. A., 11 Donahoe, P. K., 234 Dong, S., 232 Down, T., 236 Downing, J. R., 236 Drenkow, J., 255, 256 Drescher, K., 154 du Manoir, S., 232 Du, J., 237 Dugad, R., 233 Dunning, M., 234 Dunson, D. B., 56 Durbin, R., 205, 233, 255 Dutra, A., 234 Dziak, J. J., 58, 59
E Eddy, S., 205, 233, 255 Eeckhoute, J., 255 Eichenbaum, M., 67 Eisen, M. B., 206, 232, 236, 255 Elidan, N., 205 Elkan, C., 205 , 254 Emond, M. J., 141 Endrizzi, M., 206 Engels, E. A., 129 Engstrom, P., 207, 257 Enright, A. J., 160, 174 Estep, P. W., 257
F Fan, J., 5, 6, 9, 10, 11, 14, 18-28, 30, 56, 58,59, 62, 68 Faraggi, D., 24 Farrer, M., 234 Feingold, E., 232 Felsenstein, J., 192, 205 Feng, B., 236, 237 Ferguson, T. S., 54 Ferguson-Smith, M. A., 232 Feskanich, D., 155 Feuk, L., 234, 235 Fiegler, H., 235 Finn, R. D., 168 Fisher, D. E., 257
264 Fitch, K R., 234 Fitzmaurice, G. M., 50, 66 Flanders,T. R., 131 Fleming, T. R., 30, 150 Fodor, S. P., 256 Forrest, M. S., 234 Foster, D. P., 25 Fountain, J. W., 233 Fox, E. A., 233, 255 Fraenkel, E., 255 Frank, I. E., 58, 59 Freedman, B. I., 234 Fricke, E., 257 Fridlyand, J., 233 Friedman, J. H., 58, 59, 206 Friedman, J. M., 235 Friedman, N., 205 Fu, W. J., 58 Fuchs, C.S., 155 Fukayama, M., 235 Fung, H. C., 235 Fung, W., 62 Futreal, P. A., 236
G Gaasenbeek, M., 236 Gail, M. H., 152 Galipeau, P. C., 233 Garraway, L. A., 233, 234, 237 Gastwirth, J., 67 Gatti, R. A., 151 Gazdar, A. F., 236 Geffers, R., 257 Geistlinger, T. R., 255 Gelman, R., 237 Gentleman, R. C., 237 Gentleman, R., 173 George, E. I., 25, 205 Getz, G., 232, 236, 237 Ghandour, G., 233 Ghosh, D., 36, 43 Giacomini, C. P., 236 Gibbons, R. D., 49 Gibbs, J. R., 235 Gifford, D. K, 255 Gijbels, I., 5, 6, 9, 10, 30 Gill, R. D., 6, 14, 35 Gilliland, D. G., 235 Gingeras, T. R., 255, 256 Giot, L., 159, 166
A uthor Index Giovannucci, E. L., 155 Girard, L., 232, 233, 236 Girtman, K, 231 Glendening, J. M., 233 Goadrich, M., 172 Goetz, M., 126 Goldberg, D. S., 161 Goldberg, E. K, 233 Goldstein, L., 146 Golub, T. R., 236 Gomez, S. M., 160 Gonzalez, E., 234 Goorha, S., 231 Gordon, D. B., 255 Gossling, E., 257 Gottardo, R., 255, 256 Granter, S. R., 237 Gray, J. W., 232 Green, P. J., 86, 206 Green, P., 233 Green, R. D., 256 Greene, W. H., 55, 67 Greenland, S., 131 Greshock, J., 233 Griffin, C., 237 Grigorova, M., 233 Grothey, A., 126 Guerreiro, R., 235 Guimaraes, K S., 166, 167 Guo, W., 56 Gupta, M., 206 Gurka, M. J., 58
H Hague, S., 234 Haile, R.W., 151, 153 Haining, R., 75 Hall, A. G., 237 Hall, P., 67 Hall, W. J., 8 Halpern, J., 155 Hamilton, S. A., 42 Hampel, F. R., 62 Hangaishi, A., 234, 235 Hannett, N. M., 255 Hannett, N., 257 Hansen, B. E., 54, 55, 67 Hansen, L. P., 67 Hao, K., 233
A uthor Index Harbison, C. T., 255 Harpole, D., 236 Harrington, D., 24, 30, 234 Harris, L., 232 Hart, G. T., 160 Hartzell, G. W., 207 Harville, D. A., 79 Hastie, T. J., 56 Hastie,.T., 12, 21 Haubrock, M., 257 Hauck, W. W., 50 Hayward, N. K., 233 He, X., 62 Heagerty, P. J., 50, 82, 92, 94 Heagerty, P., 49 Heaton, J., 67 Hedeker, D., 49 Hehl, R., 207, 257 Helt, G., 255 Hennerfeind, A., 88 Herder, c., 154 Hernandez-Boussard, T., 236 Hertz, G. Z., 253, 256 Hestermann, E. V., 255 Hinkley, D. V., 8 Hobbs, B., 256 Hochberg, E. P., 233 Hofer, M. D., 233 Hogue, C.W., 174 Hollis, B.W., 155 Holubkov, R., 128, 131, 135, 136 Hon, G., 255 Hoover, D. R., 56, 62 Horak, C. E., 256 Hornischer, K., 257 Horowitz, J. L., 67 Horvitz, D. G., 132 Hosoya, N., 234 Host, G., 82 Hougaard, P., 14 Hsu, L., 14 Hsueh, T., 236 Hu, F., 66 Huang, H., 160, 163, 207 Huang, J. H., 236 Huang, J. Z., 56, 62 Huang, J., 8, 12, 13, 14,24, 28, 232, 233, 234, 235 Huang, T., 235
265 Huang, y., 236 Huard, C., 236 Hubbard, T., 236 Hubbell, E., 232, 256 Huber, W., 246, 256 Hudson, T. J., 232 Hughes, J. D., 257 Hui, H., 232 Hulihan, M., 234 Hunkapiller, T., 205 Hunter, T., 236
I Iafrate, A. J., 234 Ibrahim, J. G., 28, 29 Imbens, G. W., 67 Ingle, C. E., 234 Irizarry, R. A., 255, 256 Ishikawa, S., 234, 235 Ito, T., 159, 166 lyer, V. R., 256
J Jain, A. N., 233 J akobsson, M., 235 Janne, P. A., 232, 233 Jansen, R., 160-162, 169 Jeffrey, S. S., 232 Jeffreys, A. J., 234 Jennings, E. G., 255, 257 Jensen, S. T., 206 Jewell, N. P., 55 Ji, H., 248, 249, 256 Jiang, F., 236 Jiang, J., 11, 14, 20, 21, 22, 23, 24, 27, 28,30 Jiang, W., 58 Jackel, K. H., 154 Johnson, B. E., 232 Johnson, J., 234 Johnson, L. A., 232 Johnson, M. D., 235 Johnson, W.E., 247, 248, 255,256,257 Johnston, J., 67 Johnstone, S. E., 205 Jones, K. W., 233, 234, 235 Joshi, M. B., 236 Journel, A. G., 75, 77
K Kachergus, Jo, 234 Kalbfleisch, Jo Do, 30, 63, 138 Kaldor, Jo, 75 Kallioniemi, Ao, 232 Kallioniemi, 0o Po, 232, 236 Kamman, E. Eo, 81 Kampa, Do, 255 Kanin, E., 257 Kaplan, To, 205 Kapranov, Po, 255, 256 Karanjawala, Zo, 233 Karas, Do, 257 Karas, Ho, 207 Karlin, So, 233 Kato, Mo, 235 Kawamata, No, 235 Keiding, No, 6 Kel, Ao Eo, 257 Keles, So, 206 , 254, 256, 257 Kellis, Mo, 206, 255 Kel-Margoulis, 0o Vo, 257 Kennedy, Go Co, 232, 234 Khatry, Do, 236 Kim, I., 165, 166, 168 Kim, To Ho, 256 Kim, Yo Ho, 236 Kim, Yo, 236 King, Ao Do, 174 King, Co Ro, 231 King, Mo, 5, 6, 9, 10, 30 Kinzler, K. W 0' 232 Klaassen, Co Ao, 8, 13, 151 Klebanoff, Mo Ao, 152 Klint, Ao, 154 Kloos, Do Vo, 257 Knudson, Ao Go, 231 Koenig, Wo, 154 Kohn, R., 56 Kolb, Ho, 154 Komura, Do, 235 Kong, Ao, 206 Kong, Lo, 149 Kong, So Wo, 237 Kosorok, Mo Ro, 29, 30 Kott, Po So, 95 Koul, Do, 237 Kowbel, Do, 232 Krahe, R., 236 Kraja, Ao, 232 Kraus, Mo Ho, 231
A uthor Index Krieger, Ao, 67 Krier, Co, 237 Krogh, Ao, 205, 206, 233, 255 Kucherlapati, R., 235 Kuha, Jo, 58 Kulich, Mo, 28, 138, 139, 149 Kulkarni, Ho, 234 Kunsch, Ho, 95 Kuo, Wo Lo, 232, 232 Kurera, Do, 233 Kurokawa, Mo, 234, 235
L LaCount, Do Jo, 159 LaFramboise, To, 234 Lahiri, So No, 95 Lai, To, 66, 67 Lai, Wo R., 235 Laird, No Mo, 50, 83, 152, 205 Laird, No, 255 Lakshmi, B., 234 Lancaster, Jo Mo, 236 Land, Co Eo, 152 Land, So, 257 Lander, E. So, 233, 255 Lander, E., 206 Lange, K., 207 , 233 Langholz, B., 147, 149 Largent, Jo Ao, 129 Larsson, Co, 232 Lastowska, Mo, 237 Lau, Co Co, 233 Law, Jo, 232 Lawless, Jo Fo, 8, 30, 35, 138 Lawless, Jo, 67 Lawrence, Co E., 183, 205, 206, 207, 256 Ledbetter, Do Ho, 232 Lee, Ao Jo, 135, 151 Lee, Co, 234 Lee, Eo Wo, 14 Lee, Ho, 159 , 237 Lee, Jo Co, 236 Lee, To I., 205, 255 Lee, Y., 50, 86, 235 Lele, So R., 82, 92, 94 Lele, So, 95 Lengauer, Co, 232 Lenhard, Bo, 207, 257
A uthor Index Leo, C., 233 Levine, R. L., 235 Levine, S. S., 205 Lewicki-Potapov, B., 257 Li, B., 49, 50, 53, 54, 55 Li, C., 232, 233, 234, 237, 256 Li, H., 141, 205, 255 Li, J., 231 U,R,U,~,W,5~~,OO,~,~
Li, S., 159, 166 Li, W., 255, 256, 257 Li, X., 206 ~Y,5~~,~,~,OO,~,W,%,OO
Liang, P., 206 Liang, KY., 14, 50, 52, 53 Liaw, D., 231 Lieb, J. D., 205 , 255, 256 Lieberfarb, M., 233 Liebich, I., 207 Lin, 232 Lin, D. Y., 14, 15, 17, 18, 28, 138, 139, 142, 149 Lin, M., 232, 233 Lin, N., 161, 162, 169 Lin, W. M., 232 Lin, X., 57, 75, 88, 90, 91, 93-96 Lindblad-Toh, K, 232 Lindsay, B. G., 49, 50, 53, 54, 55, 65, 67 Linhart, D., 236 Lionel, A. C., 235 Listewnik, M. L., 234 Little, R J. A., 60, 61, 129, 142, 143 Liu, G., 232, 234 Liu, H., 237 Liu, J. S., 182, 183, 184, 186, 195, 196, 198, 206, 207, 208, 256, 257 Liu, L. X., 56 Liu, N. P., 56 Liu, Q., 88 Liu, R., 95 Liu, W. M., 232, 256 Liu, X. Q., 235 Liu, X. S., 205, 206, 256, 257 Liu, X. Shirley, 257 Liu, X., 58, 206, 256 Liu, Y., 159, 164, 165, 166, 206 Liu, Y. J., 207, 257 Lizardi, P., 235 Lo, S., 138, 139, 142, 149 Locke, D. P., 235
267 Lockhart, D. J., 233 Lodeiro, G., 232 Lodish, H., 234 Loh, J. M., 95 Loh, M. L., 236 Loi, H., 232 Louis, T. A., 75 Lu, L. J., 161, 162, 172 Lu, S., 149 Lu, W. B., 151 Lucito, R, 235 Lui, W.O., 232 Lundin, P., 234 Lynch, C. P., 151, 153
M Matyas, L., 67 Ma, J., 231 Ma, S., 29, 30 Macisaac, K D., 255 Mackenzie, G., 56 Maierhofer, C., 232 Maitournam, A., 126 Malone, K E., 151, 153 Man, T. K, 233 Mandrekar, S., 126 Maner, S., 234 Mangano, A., 234 Manning, G., 236 Manrai, Arjun K, 257 Marcotte, E. M., 160 Margalit, H., 160 Marginussen, T., 154 Marks, J. R, 237 Marshall, C. R, 235 Marshall, M., 236 Martin, S., 168 Martinez, R, 236 Martinussen, T., 56 Masry, E., 22 Massa, H., 234 Mathew, S., 231 Matsudaira, P., 234 Matsuzaki, H., 232 Matthews, L. R, 160 Matys, V., 207, 257 May, C. A., 234 McClure, M. A., 205 McCulloch, R., 56, 82, 231
268 McCulloch, R. E., 205 McGrath, S. D., 235 McKeague, I. W., 24 McMullagh, P., 56 Mei, R., 232, 233, 234, 256 Meinhardt, T., 207 Meisinger, C., 154 Mellemkaer, L., 153 Meng, X.-L., 56 Mesirov, J. P., 236 Meyer, C. A., 255, 256 Meyerson, M., 232, 234 Mian, L. S., 206 Michael, H., 257 Mihatsch, M. J., 236 Miliaresis, C., 231 Miller, B. J., 236 Miller, C. B., 231 Milner, D. A., 237 Minna, J. D., 236 Minna, J., 232, 233 Miron, A., 232 Misra, A., 237 Mitchinson, G., 255 Mitchison, G., 233 Moch, H., 236 Moessner, R., 235 Mollie, A., 75 Morgan, G. D., 95 Moses, A. M., 207 Mullighan, C. G., 231 Munch, R., 257
N Nadeau, C., 35 Nakachi, K, 152, 234 Nan, B., 141 Nannya, Y., 234, 235 Narasimhan, B., 236 NeIder, J. A., 50, 52, 82, 86 Neriishi, K, 152 Neuhaus, J. M., 50 Neuwald, A. F., 206, 207, 256 Newton, M. A., 235 Neyman, J., 54 Ng, H. H., 255 Nghiemphu, L., 236 Nibbs, R. J., 234 Nigro, J. M., 237 Ning, Y., 232
A uthor Index Nishimura, K, 235 Noble, W. S., 170 Noor, A., 235 Nussbaum, R., 234 Nychka, D., 81
o Oakes, D., 8, 30, 36 Olsen, J. H., 151 Olshen, A. B., 235 Opsomer, J. D., 81 Overbeek, R., 160 Owen, A. B., 67 Ozburn, N., 237 Ozsolak, F., 257
p Paciorek, C. J., 83 Paez, J. G., 233 Palumbo, M. J., 207 Pan, E., 237 Pan, J., 56 Pan, W., 58 Panchenko, A. R., 160 Panda, B., 235 Papadimitriou, C. H., 236 Park, P. J., 235, 237 Patel, S., 255 Paterson, A. D., 235 Patil, N., 233 Patterson, N., 206 Pazos, F., 160 Pearson, A. D., 237 Pee, D.Y., 152 Pepe, M. S., 35, 66, 150 Pergamenschikov, A., 232 Perner, S., 232 Perou, C. M., 232 Perry, G. H., 234 Pertz, L. M., 235 Peters, H., 237 Pettitt, A. N., 29 Peuralinna, T., 234 Phair, J. P., 62 Piccolboni, A., 255 Pierce, D. A., 88 Pierce, J. H., 231 Pinkel, D., 232, 233
Q Qaqish, B. F., 62 Qi, y., 161, 162, 172, 234 Qin, J., 67 Qu, A.,49, 50, 51, 53, 54, 55, 57, 58, 60, 61, 62, 65, 67 Qu, C., 256 Quinones, M. P., 234
R Rabbee, N., 237 Rabiner, L. R., 207 Radtke, I., 231 Rahman, N., 236 Ramaswamy, S., 232, 237 Rando, O. J., 207, 257 Raudenbush, S. W., 50 Redon, R., 234 REID, B. J., 233 Reilly, A. A., 206, 256 Reilly, M., 150, 151 Ren, B., 256, 257 Ren, y., 235 Reuter, I., 207, 257 Reynolds, D. B., 255 Rice, A. J., 233
Rice, J. A., 56 Richmond, T. A., 256 Richter, J., 236 Riley, R., 166, 167 Rinaldi, N. J., 255 Ritov, y., 8, 13, 29, 151 Rivera, M. N., 234 Robert, F., 257 Roberts, P., 237 Roberts, W., 235 Robins, J. M., 131, 133, 134, 138, 140, 141 Rodgers, L., 231 Rolfe, P. A., 255 Romano, J. P., 95 Rosenstein, B. S., 151 Rotert, S., 257 Roth, F. P., 176, 257 Rotnitzky, A., 61, 138 Roverato, A., 56 Rual, J. F., 159 Rubin, D. B., 60., 129, 142, 205 , 255 Rubin, M. A., 233, 237 Ruppert, D., 57 Rutovitz, D., 232 Ryan, L., 56, 75, 88 Ryder, T. B., 232
S Sabatti, C., 207 Saltzman, N., 81 Salwinski, L., 167, 171 Samuelsen, S. 0., 140, 146, 147, 149 Sanada, M., 234, 235 Sanchez, R., 234 Sandelin, A., 207, 257 Santibanez-Koref, M., 237 Sargent, D., 126 Sauter, G., 236 Saxel, H., 257 Scafe, C. S., 256 Schacherer, F., 207 Schaffer, A. A., 236 Schairer, C., 152 Schall, R., 83, 86, 87 Schaubel, D. E., 36 Scheer, M., 257 Scheet, P., 235 Scheike, T. H., 56, 142-144
Scherer, S. W., 234 Schill, W., 131, 136 Schneider, T. D., 207, 251, 257 Schoell, B., 232 Scholz, S. W., 235 Schraml, P., 236 Schreiber, J., 257 Schrock, E., 232 Schumaker, L., 12 Schwartz, S., 235 Schwarz, G., 25 Scott, A. J., 128, 131, 132, 134, 135, 136, 144, 151 Sebat, J., 234 Segatto, 0., 231 Segraves, R., 232, 235 Sekinger, E. A., 255 Self, S. G., 14, 138, 141 Sellers, W. R., 233, 234 Selvanayagam, Z., 237 Sementchenko, V., 255 Sen, P. K., 149 Senman, L., 235 Shago, M., 235 Shah, K. V., 152 Shao, J., 126 Shao, W., 255 Shapero, M. H., 235 Sharan, R., 170 Sharp, A. J., 235 Sharp, G. B., 152 Shen, J., 233, 206 Sherman, M., 93 Shi, T., 160 Shia, J., 237 Shibata, R., 25 Shih, J. H., 151 Shim, H., 254, 257 Shoemaker, B. A. Shore, R. E., 151 Siddharthan, R., 207 Siggia, E. D., 205, 207, 255 Silver, P. A., 255 Simon, I., 257 Simon, R., 24, 126 Singer, M. A., 256 Singh, K., 95 Singleton, A. B., 234 Singleton, A., 234 Sinha, R., 236 Sinha, S., 207
Skaug, J. L., 235 Skaug, J., 235 Skrondal, A., 154 Slack, M. D., 207, 257 Slonim, D. K., 236 Small, D., 58, 61, 66, 67 Smirnov, I., 237 Smith, M., 56 Smith, S. A., 151 Snijders, A. M., 233 Snyder, M., 256 Soenksen, D., 232 Solomon, P. J., 88 Song, J. S., 247, 257 Song, P. X., 51, 60, 61, 62 Speed, T. P., 207, 237, 255, 256 Speicher, M. R., 232 Spellman, P. T., 236 Spiekerman, C. F., 14 Sprinzak, E., 160 Sridhar, A., 233 Stanton, S. E., 232 Stegun, I. A., 77 Stein, M. L., 90, 95 Steinmetz, L. M., 256 Stengel, R. F., 237 Stephens, R. M., 207, 251, 257 Stewart, J. P., 236 Stolovitzky, G., 171 Stormo, G. D., 207 Stovall, M., 151 Stranger, B. E., 234 Stratton, M. R., 233, 236 Strausberg, R. L., 256 Strobl, C., 172 Struhl, K., 255 Su, X., 232 Sudar, D., 232 Sudarsanam, S., 236 Sun, Y., 18 Suthram, S., 171 Szary, A. J., 255 Szatmari, P., 235 Szpiech, Z. A., 235
T Tagne, J. B., 255 Takusagawa, K. T., 255 Tamayo, P., 236
A uthor Index Tammana, H., 255 Tanay, A., 173 Tanenbaum, D. M.,232 Tanner, M., 207 Tebha, Y., 233 Teitelbaum, S. L., 151 Therneau, T. M., 42, 141 Thiele, S., 257 Thomas, D. C., 128, 145, 147, 149, 150 Thompson, A. P., 235 Thompson, D. J., 132 Thompson, W., 188, 207 Thorand, B., 129 Thorne, N., 234 Tibshirani, R., 12, 21, 24, 25, 56, 59, 236 Timm, J., 154 Toedling, J., 256 Tompa, M., 207, 255 Tanon, G., 236 Torrang, A., 154 Troge, J., 234 Tsafrir, D., 237 Tsafrir, 1., 237 Tsang, Y. T., 233 Tsiatis, A. A., 7, 149 Tsuji, S., 235 Tweddle, D. A., 237
U Uetz, P., 159, 166
V Vallente, R. U., 235 van HeIden, J., 174 van Nimwegen, E., 207 VanLiere, J. M., 235 Velculescu, V. E., 232 Veldman, T., 232 Venkatraman, E. S., 235 Verkasalo, P. K, 152 Villapakkam, A., 232 Vincent, J. B., 235 Viprey, V., 237 Virmani, A. K, 236 Viscidi, R. P., 152 Vivanco, 1., 236 Vogelstein, B., 232 Volkert, T. L., 257
W Wahba, G., 57 Waldman, E., 232 Walker, G. J., 233 Walker, M., 234 Waller, 1. A., 75 Wand, M. P., 81 Wang, D., 236 Wang, H., 126 Wang, K, 235 Wang, L., 58, 234 Wang, N., 57 Wang, P., 236 Wang, Q., 236 Wang, S. 1., 231 Wang, T. L., 232 Wang, T., 207 Wang, Y., 56 Wang, Z. C., 232 Wappler, 1., 237 Ware, J. H., 50, 83 Wasserman, W. W., 207, 257 Watt, S., 233 Webster, T. A., 232 Wedderburn, R. W. M., 52 Wedderburn, R. W., 82 Wei,1. J., 14, 15, 17,29,232,233 Wei, L., 206 Wei, W., 233, 234 Weir, B. A., 232, 234 Weir, B., 234 Weissfeld, L., 14, 15, 17 Wellner, J. A., 8, 13, 129, 140, 141, 144 Weremowicz, S., 237 West, K D., 67 West, S., 233 Westfall, P. H., 236 Wheeler, R., 255 White, E., 127 Whyte, D. B., 236 Widlund, H. R., 237 Wienberg, J., 232 Wigler, M., 235 Wild, C. J., 128, 131, 132, 134, 135, 136, 144, 151 Willett, W. C., 155 Williams, A. J., 255 Williams, C. F., 232
Wilson, C. J., 257 Winchester, E., 232 Winckler, W., 234 Wingender, E., 207, 257 Wolitzer, A., 153 Wollff, R. K, 233 Wong, W. H., 248, 249, 256 Wong, B., 255 Wong, K K, 233 Wong, W. H., 189, 206, 208, 233, 234, 237 Woo, M. S., 232 Wooster, R., 236 Wooton, J. C., 206, 256 Wright, F. A., 236 Wu, B., 235 Wu, C. 0., 56, 62 Wu, K, 129 Wu, L. F., 207, 257 Wu, W., 56 Wu, Y. N, 257 Wu, Y., 256 Wypij, D., 61 Wyrick, J. J., 257
y Yamamoto, G., 235 Yamanaka, M., 255 Yang, G., 232, 236 Yao, J., 237 Ye, H., 56 Yen, C., 231 Yin, G., 28, 29 Ying, Z., 28, 29, 40, 45, 149 Yoo, J., 255 York, J., 75 Young, J., 234
Young, K, 126 Young, R. A., 255, 257 Young, S. S., 236 Yu, H., 163 Yuan, G. C., 202, 207, 257
Z Zeger, S. L., 8, 50, 52, 53 Zeitlinger, J., 255, 257 Zeng, D., 28, 29, 36, 142, 150 Zeng, Z., 237 Zhai, Y., 232 Zhan, F., 236 Zhang, H., 83 Zhang, J., 56, 62, 232, 234 Zhang, L., 237 Zhang, M. Q., 205 Zhang, Xinmin, 255, 257 Zhang, Y., 236 Zhao, H., 235 Zhao, L. P., 138 Zhao, X., 207, 232, 233, 234 Zheng, M., 256, 257 Zhou, H., 11, 14, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 30 Zhou, L., 56, 62 Zhou, Q., 189, 195, 196, 198, 20~ 208 Zhou, Y., 18, 19 Zhu, J., 95 Zhu, Xiaopeng, 257 Zipursky, S. L., 234 Zochbauer-Muller, S., 236 Zou, H., 59 Zucchini, W., 58 Zucker, J. P., 205 Zwaigenbaum, L., 235
Figure 1 of Chapter 8: Graphical illustration of a CRM.
Figure 2 of Chapter 8: The coupled hidden Markov model (c-HMM). (A) The HMM for module structure in one sequence. (B) Multiple alignment of three orthologous sequences (upper panel) and the corresponding graphical model representation of the c-HMM (lower panel). The nodes represent the hidden states. The vertical bars in the upper panel indicate that the nucleotides emitted from these states are aligned and thus collapsed in the lower panel. Note that a node emits $w_k$ nucleotides if the corresponding state is $M_k$ ($k = 1, \ldots, K$). (C) The evolutionary model for motifs, using one base of a motif as an illustration. The hidden ancestral base is Z, which evolves into three descendant bases X(1), X(2), and X(3). Here the evolutionary bond between X(1) and Z is broken, implying that X(1) is independent of Z. The bonds between X(2) and Z and between X(3) and Z are intact, which means that X(2) = X(3) = Z.
Figure 3 of Chapter 8: Logo plots for the motifs in the simulation studies: (A) Oct4, (B) Sox2, and (C) Nanog.
Figure 5 of Chapter 8: The Sox-Oct composite motif discovered in the Oct4-positive ChIP regions.
Figure 1 of Chapter 9: (A) The design scheme of a probe set on the HuSNP array, which contains 1400 SNPs. Five quartets (columns) of oligonucleotide probes interrogate the genotype of a SNP. In the central quartet, the perfect match (PM) probes are complementary to the reference DNA sequences surrounding the SNP with two alleles (denoted as allele A and allele B). The mismatch (MM) probes have a substituted central base pair compared to the corresponding PM probes and control for cross-hybridization signals. Shifting the central quartet by -4, -1, +1 and +4 base pairs forms four additional quartets. Newer generations of SNP arrays have selected quartets from both the forward and reverse strands (see Figure 6). (B) A genotyping algorithm makes genotype calls for this SNP in three different samples based on the hybridization patterns. For a diploid genome, three genotypes are possible in different individuals: AA, AB and BB.
Figure 2 of Chapter 9: Comparing the genotypes of a SNP in paired normal and tumor samples gives LOH calls. Only SNP markers that are heterozygous in the normal sample are informative and provide LOH or retention calls. (The panels show examples of LOH (subject #58), retention (subject #57), and a non-informative marker (subject #55).)
Figure 3 of Chapter 9: Observed LOH calls obtained by comparing normal (N) and tumor (T) genotypes at the same SNP. The sample pairs are displayed in the columns (column names show only the tumor samples) and the SNPs are ordered in the rows by their p-arm to q-arm positions in one chromosome. Yellow: retention (AB in both N and T); blue: LOH (AB in N, AA or BB in T); red: conflict (AA/BB in N, BB/AA or AB in T); gray: non-informative (AA/BB in both N and T); white: no call (no call in N or T).
Figure 5 of Chapter 9: Comparison of HMM-based LOH inferred from tumor-only samples with the observed LOH based on paired normal and tumor samples, using chromosome 10 data of four pairs of 10K SNP arrays. (A) Observed LOH from normal/tumor pairs; LOH is indicated in blue/dark color. (B) LOH inferred from the HMM using only tumor samples; blue/dark color indicates a probability of LOH near 1 and yellow/light color indicates a probability of LOH near 0.
Figure 6 of Chapter 9: (A) The dChip PM/MM data view displays the probe-level data of a SNP in three samples from top to bottom. The 20 probe pairs are ordered and connected from left to right. Within a pair, the blue/dark color represents the PM probe intensity and the gray/light color represents the MM probe. This probe set has 5 quartets of probes complementary to the DNA forward strand and 5 quartets complementary to the reverse strand. (B) The same probe set is displayed as raw intensity values in the format of quartets as in Figure 1B. (C) Making the probe set patterns and signals of genotype calls AA, AB and BB comparable by adding up the allele A and allele B probe signals. After this addition, the PM/MM difference signals are comparable for genotypes AB (1st row) and AA (2nd row). The probe signals on the 3rd row are about half of the signals on the 2nd row, presumably corresponding to a hemizygous deletion with genotype A in this sample.
Figure 7 of Chapter 9: (A) The raw copy number view in dChip. The SNPs on chromosome 8 are displayed in the rows with the p-arm to q-arm ordering. The samples are displayed in the columns with the normal sample first and the paired tumor sample next. Zero copies are displayed in white, 5 or more copies in red (black), and intermediate copy numbers in light red (gray). The curve on the right in the shaded box represents the raw copy numbers of a selected tumor sample (H2171), and the vertical line represents 2 copies. Note that larger raw copy numbers are associated with larger variation, as seen from the two peaks in the curve. (B) The HMM-inferred integer copy numbers.
Figure 8 of Chapter 9: (A) The agreement between the copy number inferred from the SNP array (X-axis) and the Q-PCR based copy number (Y-axis) on the same samples. The error bars represent the standard deviation of the Q-PCR based copy numbers of several loci having the same inferred copy number. The figure is courtesy of Xiaojun Zhao. (B) The LOH events on chromosome 8 of six tumor samples are highlighted in blue/dark color. (C) The corresponding raw copy numbers of both normal and tumor samples are shown for the same samples. In sample 1648 (the 1st column), the whole-chromosome LOH is due to hemizygous deletion in the p-arm but gene conversion in the q-arm.
Figure 9 of Chapter 9: The CNVs of chromosome 22 identified in 90 normal individuals using the 100K Xba SNP array. The rows are ordered SNPs and the columns are samples. The white or light red colors correspond to fewer than 2 copies (homozygous or hemizygous deletions). The CNVs lie in the region 20.7-22 Mb (22q11.22-q11.23). CNVs in this region have been reported in other studies [42, 48]. Samples F4 and F6 are from the same family and share a CNV region.
Figure 10 of Chapter 9: (A) In the dChip LOH view, the tumor-only LOH inference calls LOH regions (blue color) in multiple prostate samples. The curve and the vertical line on the right represent LOH scores and a significance threshold. (B) The 3 Mb region of peak LOH scores in (A) contains the known TSG retinoblastoma 1. The cytoband and gene information are displayed on the left. Data courtesy of Rameen Beroukhim.
Figure 1 of Chapter 10: Description of a ChIP-chip experiment. (Steps shown: cross-link DNA in vivo and lyse cells; shear and extract DNA, introduce antibody; pull out DNA-protein-antibody complexes; wash away antibodies and proteins, amplify and label sample; hybridize on tiling microarray.)
Figure 11 of Chapter 9: Clustering the same set of breast cancer samples using SNP LOH data (A) and expression data (B). A similar sample cluster emerges in both clusterings (note the left-most sample branches highlighted in blue). The blue/yellow colors indicate LOH/retention events (A), and the red/blue colors indicate high or low expression levels (B). The labels below the sample names are lymph node status (p: positive, n: negative). Data courtesy of Zhigang Wang and Andrea Richardson.
Figure 3 of Chapter 10: Sequence logo for the DNA-binding motif of the HNF4 transcription factor from the JASPAR database.